Overview
ckanext.importer provides utilities for easily importing metadata from an external data source into CKAN and keeping the CKAN metadata up-to-date when the contents of the data source are modified.
To achieve this, each entity (package, resource, view) in CKAN is linked to its counterpart in the original data source via an external ID (EID), for example the entity’s ID in the data source.
As an example, let’s create a package with a resource:
from ckanext.importer import Importer

imp = Importer('my-importer-id')

with imp.sync_package('my-package-eid') as pkg:
    # If no package with the given EID exists then it is
    # automatically created. Otherwise the existing package
    # is retrieved.

    # The package can be modified like a dict
    pkg['title'] = 'My Package Title'

    # For package extras there's the special `.extras` attribute
    # which provides a dict-interface:
    pkg.extras['my-extra-key'] = 'my-extra-value'

    with pkg.sync_resource('my-resource-eid') as res:
        # Just like packages, resources are automatically created
        # and retrieved based on their EID.
        res['name'] = 'My Resource Name'

    # Once the `sync_resource` context manager exits, the
    # created/updated resource is automatically uploaded to CKAN.

# Once the `sync_package` context manager exits, the created/updated
# package is automatically uploaded to CKAN.
For more details on how to use ckanext.importer please refer to Usage.
Installation
ckanext.importer uses the usual installation routine for CKAN extensions:
Activate your CKAN virtualenv:
cd /usr/lib/ckan/default
source bin/activate
Install ckanext.importer and its dependencies:
pip install -e git+https://github.com/stadt-karlsruhe/ckanext-importer#egg=ckanext-importer
pip install -r src/ckanext-importer/requirements.txt
On a production system you’ll probably want to pin the latest release version of ckanext.importer instead:
pip install -e git+https://github.com/stadt-karlsruhe/ckanext-importer@v0.2.0#egg=ckanext-importer
Restart CKAN. For example, if you’re using Apache:
sudo service apache2 restart
Usage
ckanext.importer provides utilities to write Python code for importing and synchronizing CKAN metadata from an external data source.
Note
At this point in time, ckanext.importer does not provide a web UI or any command line tools.
The starting point for using ckanext.importer is an Importer. Each
Importer instance corresponds to a separate data source and is
identified using an ID that can be freely chosen (but must be unique
among all importers used on the target CKAN instance):
from ckanext.importer import Importer
imp = Importer('my-importer-id')
Once you have created an importer, you use its sync_package() method
to create/update the CKAN metadata for a dataset. The CKAN package is
linked to your external dataset using an external ID (EID).
ckanext.importer automatically stores the EID along with the other
package metadata inside CKAN. Like the importer ID, the package’s EID
can be chosen freely, but must be unique among all packages for this
importer.
with imp.sync_package(eid='my-package-eid') as pkg:
    # ckanext.importer has automatically checked whether a
    # package for this importer ID and package EID already
    # exists and -- if that is the case -- retrieved it.
    # Otherwise, a suitable package has been automatically
    # created for you.

    # Use the package's dict-interface to insert/update the
    # metadata from your data source. For example:
    pkg['title'] = 'My Package Title'

# Once the context manager exits, the modified package is
# automatically uploaded to CKAN.
Typically, you don’t have only one dataset, but an external data source (for example a database) containing many datasets to be imported:
for external_dataset in external_datasource:
    with imp.sync_package(eid=external_dataset.id) as pkg:
        pkg['title'] = external_dataset.name
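As a slightly more concrete sketch of such a loop, the following function reads datasets from CSV text and syncs one package per row. The column names (id, name, description) and the function name import_csv are assumptions made for illustration. The importer is passed in as an argument, so the function works with any object that offers sync_package() as described above.

```python
import csv
import io


def import_csv(imp, csv_text):
    """Sync one CKAN package per CSV row and return the synced EIDs.

    `imp` is expected to behave like a ckanext.importer Importer.
    The columns `id`, `name` and `description` are hypothetical.
    """
    synced = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The row's ID in the data source doubles as the package EID.
        with imp.sync_package(eid=row['id']) as pkg:
            pkg['title'] = row['name']
            # `notes` is the CKAN package field holding the description.
            pkg['notes'] = row['description']
        synced.append(row['id'])
    return synced
```

In a real import script the first argument would typically be created with Importer('my-importer-id').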
Synchronizing a package’s resources works pretty much the same: the
object returned by sync_package() is an instance of Package and
provides a sync_resource() method:
with imp.sync_package(eid='my-package-eid') as pkg:
    pkg['title'] = 'My Package Title'
    with pkg.sync_resource(eid='my-resource-eid') as res:
        res['name'] = 'My Resource Name'
        res['url'] = 'https://some-resource-url'
Resource EIDs need to be unique among all resources of the same package.
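Because two source records that share an EID would map onto the same CKAN entity, it can be worth checking the uniqueness rules on the source data before starting an import. The following is a minimal sketch of such a check. It is not part of ckanext.importer, and the record layout (dicts with an id key and an optional files list) is an assumption for illustration.

```python
from collections import Counter


def duplicate_eids(eids):
    """Return the EIDs that occur more than once, in first-seen order."""
    counts = Counter(eids)
    return [eid for eid in counts if counts[eid] > 1]


def check_eids(datasets):
    """Return a list of (scope, duplicated EIDs) pairs, empty if all is fine.

    Package EIDs must be unique across all datasets of one importer.
    Resource EIDs only need to be unique within their own package.
    """
    problems = []
    dups = duplicate_eids(d['id'] for d in datasets)
    if dups:
        problems.append(('packages', dups))
    for d in datasets:
        dups = duplicate_eids(f['id'] for f in d.get('files', []))
        if dups:
            problems.append((d['id'], dups))
    return problems
```

Note that the same resource EID may appear in two different packages without being reported, which matches the scoping rule stated above.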
Finally, the same mechanism can be used to synchronize resource views
via the Resource.sync_view() method (which returns a View instance):
with pkg.sync_resource(eid='my-resource-eid') as res:
    res['name'] = 'My Resource Name'
    res['url'] = 'https://some-resource-url'
    with res.sync_view(eid='my-view-eid') as view:
        view['view_type'] = 'text_view'
        view['title'] = 'My View Title'
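Putting the three levels together, a whole dataset description can be walked in one function. The sketch below assumes a hypothetical nested dict layout (keys id, title, files, name, url, views, type) and is only meant to show how the context managers nest. The importer is passed in so the function does not depend on how it was created.

```python
def sync_dataset(imp, dataset):
    """Sync one package with its resources and views from a nested dict.

    `imp` is expected to behave like a ckanext.importer Importer.
    The dict layout used here is hypothetical.
    """
    with imp.sync_package(eid=dataset['id']) as pkg:
        pkg['title'] = dataset['title']
        for file_info in dataset.get('files', []):
            with pkg.sync_resource(eid=file_info['id']) as res:
                res['name'] = file_info['name']
                res['url'] = file_info['url']
                for view_info in file_info.get('views', []):
                    with res.sync_view(eid=view_info['id']) as view:
                        view['view_type'] = view_info['type']
                        view['title'] = view_info['title']
```

Each inner context manager exits before its parent does, so a view is handed to CKAN before its resource block ends, and a resource before its package block ends.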
See the API Reference for more information.
Error Handling
A main design principle of ckanext.importer is to keep CKAN’s version of the
imported data in a well-defined state in case of an error. To support different
use cases, there are different approaches to error handling, which can be
configured using the on_error argument of Importer.sync_package(),
Package.sync_resource(), and Resource.sync_view():
- Re-raise the exception (OnError.reraise): If an exception occurs inside the context manager, log and re-raise it. Changes made inside the context manager are not uploaded to CKAN. The previous state of the entity is kept in CKAN. If a new entity was created at the beginning of the context manager then it is deleted. This is the default behavior.
- Swallow the exception and keep the entity (OnError.keep): The exception is logged, but not re-raised.
- Swallow the exception and delete the entity (OnError.delete): The exception is logged, but not re-raised. The entity is deleted from CKAN.
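With the default OnError.reraise behavior, a failing record aborts the whole import unless the caller catches the exception. One possible pattern, sketched below under the assumption that records are dicts with id and title keys, is to wrap each package in its own try block, remember which EIDs failed, and carry on with the next record. The function name is hypothetical.

```python
def sync_collecting_failures(imp, records):
    """Sync all records and return a list of (eid, exception) for failures.

    `imp` is expected to behave like a ckanext.importer Importer using the
    default `on_error` setting, i.e. exceptions raised inside the
    `sync_package` block propagate to the caller.
    """
    failed = []
    for record in records:
        try:
            with imp.sync_package(eid=record['id']) as pkg:
                # Raises KeyError if the (hypothetical) record has no title.
                pkg['title'] = record['title']
        except Exception as exc:
            # Per the rules above, the changes for this package were not
            # uploaded. Remember the failure and continue.
            failed.append((record['id'], exc))
    return failed
```

If the failures do not need to be collected, passing on_error=OnError.keep to sync_package() should have a similar effect on control flow, since the exception is then logged and swallowed by the context manager itself.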
License
Copyright © 2018, Stadt Karlsruhe.
Distributed under the GNU Affero General Public License. See the file LICENSE for details.
Changelog
See the file CHANGELOG.md.
API Reference
class ckanext.importer.Importer(id, api=None, default_owner_org=None)
An importer.
This class allows you to sync packages (and, from there, resources and views) between an external data source and CKAN.
id is the ID of this importer. The ID needs to be unique among all importers used on this CKAN instance. It is converted to Unicode automatically.
api is an optional instance of ckanapi.LocalCKAN or ckanapi.RemoteCKAN and specifies the CKAN instance to sync data with. If not given it defaults to ckanapi.LocalCKAN, i.e. the currently running local CKAN instance.
default_owner_org is the default setting for the owner_org field of packages created via sync_package() and can be either the name or the ID of an existing CKAN organization.
Importer.sync_package(eid, on_error=OnError.reraise)
Sync a package.
This is a context manager that returns a Package instance for the CKAN package corresponding to the given EID. The package can then be modified inside the context manager. Once the context manager exits, the modified package is uploaded to CKAN.
If no package exists for the given EID then one is created.
If the package is not modified inside the context manager then it is not re-uploaded to CKAN.
on_error is an instance of OnError and controls how exceptions inside the context manager are handled.
Importer.delete_unsynced_packages()
Delete packages that have not been synced.
This method deletes all packages belonging to this importer for which sync_package() has not been called since this Importer instance has been created.
It is intended to be called after all desired packages have been synced to delete those CKAN packages corresponding to objects that have been removed from the data source since the last import.
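A typical full import therefore syncs every record and only then removes the leftovers. The sketch below assumes records are dicts with id and title keys, and the function name full_sync is made up for illustration. Because the cleanup call sits after the loop, it is only reached if no sync raised, so an import that aborts halfway does not delete packages that simply had not been processed yet.

```python
def full_sync(imp, records):
    """Sync all records, then delete packages missing from the source.

    `imp` is expected to behave like a freshly created ckanext.importer
    Importer, since "unsynced" is tracked per Importer instance.
    Returns the number of synced records.
    """
    count = 0
    for record in records:
        with imp.sync_package(eid=record['id']) as pkg:
            pkg['title'] = record['title']
        count += 1
    # Only reached when every record above was synced without an exception.
    imp.delete_unsynced_packages()
    return count
```

The records passed in should cover the complete data source. Calling the function with only a subset would remove the packages for all other records.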
class ckanext.importer.OnError
Error handling constants.
Used for the on_error argument of Importer.sync_package(), Package.sync_resource(), and Resource.sync_view().
OnError.delete = 3
Swallow the exception and delete the entity.
OnError.keep = 2
Swallow the exception and keep the old version of the entity. If the entity was created at the beginning of the current context manager (i.e. if no entity for that EID existed before) then that entity is not kept.
OnError.reraise = 1
Reraise the exception. If the entity was created at the beginning of the current context manager (i.e. if no entity for that EID existed before) then that entity is deleted before the exception is reraised.
class ckanext.importer.Package(eid, pkg_dict, parent)
Wrapper around a CKAN package dict.
Not to be instantiated directly. Use Importer.sync_package() instead.
The package can be modified using the standard dict-interface:
with imp.sync_package('my-eid') as pkg:
    pkg['title'] = 'A new title'
Package.sync_resource(eid, on_error=OnError.reraise)
Sync a resource of this package.
This is a context manager that returns a Resource instance for the package’s CKAN resource corresponding to the given EID. The resource can then be modified inside the context manager. Once the context manager exits, the modified resource is uploaded to CKAN.
If no resource exists for the given EID then one is created.
If the resource is not modified inside the context manager then it is not re-uploaded to CKAN.
on_error is an instance of OnError and controls how exceptions inside the context manager are handled.
Package.delete_unsynced_resources()
Delete resources that have not been synced.
This method deletes all resources belonging to this package for which sync_resource() has not been called since this Package instance has been created.
It is intended to be called after all desired resources of this package have been synced to delete those CKAN resources corresponding to objects that have been removed from the data source since the last import.
Package.extras = None
dict-interface for package extras.
CKAN stores package extras as a list of key/value dicts, which makes modifying them cumbersome. This attribute allows you to access the extras like a regular dict instead:
with imp.sync_package('my-eid') as pkg:
    pkg.extras['my-extra'] = 'some value'
Reading an extra returns the value of the first extra with the given key or raises a KeyError if no extra with that key exists.
Writing an extra overwrites the value of the first extra with the given key or appends a new extra at the end of the extras list if no extra with the given key exists.
Deleting an extra deletes the first extra with the given key or raises a KeyError when no extra with that key exists.
If you need more control regarding extras with duplicate keys and the order of extras then you need to manage extras manually (using pkg['extras'] instead of pkg.extras).
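To make the read, write and delete rules above concrete, here is a small stand-alone model of a dict-like view over a CKAN-style extras list. It is an illustrative sketch written for this documentation, not the code ckanext.importer uses, and the class name ExtrasModel is hypothetical.

```python
class ExtrasModel:
    """Dict-like view over a list of {'key': ..., 'value': ...} dicts."""

    def __init__(self, extras_list):
        self._extras = extras_list

    def _index(self, key):
        # Position of the first extra with the given key.
        for i, extra in enumerate(self._extras):
            if extra['key'] == key:
                return i
        raise KeyError(key)

    def __getitem__(self, key):
        return self._extras[self._index(key)]['value']

    def __setitem__(self, key, value):
        try:
            self._extras[self._index(key)]['value'] = value
        except KeyError:
            # No extra with that key yet: append at the end of the list.
            self._extras.append({'key': key, 'value': value})

    def __delitem__(self, key):
        del self._extras[self._index(key)]


extras = [{'key': 'lang', 'value': 'de'}, {'key': 'lang', 'value': 'en'}]
model = ExtrasModel(extras)
model['lang'] = 'fr'      # overwrites the first 'lang' entry only
model['source'] = 'demo'  # no such key yet, so a new entry is appended
```

The second 'lang' entry is left untouched by the write, which is why duplicate keys need to be handled through the raw list when they matter.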
class ckanext.importer.Resource(eid, data_dict, parent)
Wrapper around a CKAN resource dict.
Do not instantiate directly. Use Package.sync_resource() instead.
The resource can be modified using the standard dict-interface:
with pkg.sync_resource('my-eid') as res:
    res['name'] = 'A new name'
Resource.sync_view(eid, on_error=OnError.reraise)
Sync a view of this resource.
This is a context manager that returns a View instance for the resource’s CKAN view corresponding to the given EID. The view can then be modified inside the context manager. Once the context manager exits, the modified view is uploaded to CKAN.
If no view exists for the given EID then one is created.
If the view is not modified inside the context manager then it is not re-uploaded to CKAN.
on_error is an instance of OnError and controls how exceptions inside the context manager are handled.
Resource.delete_unsynced_views()
Delete views that have not been synced.
This method deletes all views belonging to this resource for which sync_view() has not been called since this Resource instance has been created.
It is intended to be called after all desired views of this resource have been synced to delete those CKAN views that are no longer desired.
class ckanext.importer.View(eid, data_dict, parent)
Wrapper around a CKAN view.
Do not instantiate directly. Use Resource.sync_view() instead.
The view can be modified using the standard dict-interface:
with res.sync_view('my-eid') as view:
    view['title'] = 'A new title'