Overview

ckanext.importer provides utilities for easily importing metadata from an external data source into CKAN and keeping the CKAN metadata up-to-date when the contents of the data source are modified.

To achieve this, each entity (package, resource, view) in CKAN is linked to its counterpart in the original data source via an external ID (EID), for example the entity’s ID in the data source.

As an example, let’s create a package with a resource:

from ckanext.importer import Importer

imp = Importer('my-importer-id')

with imp.sync_package('my-package-eid') as pkg:
    # If no package with the given EID exists then it is
    # automatically created. Otherwise the existing package
    # is retrieved.

    # The package can be modified like a dict
    pkg['title'] = 'My Package Title'

    # For package extras there's the special `.extras` attribute
    # which provides a dict-interface:
    pkg.extras['my-extra-key'] = 'my-extra-value'

    with pkg.sync_resource('my-resource-eid') as res:
        # Just like packages, resources are automatically created
        # and retrieved based on their EID.
        res['name'] = 'My Resource Name'

    # Once the `sync_resource` context manager exits, the
    # created/updated resource is automatically uploaded to CKAN.

# Once the `sync_package` context manager exits, the created/updated
# package is automatically uploaded to CKAN.

For more details on how to use ckanext.importer please refer to Usage.

Installation

ckanext.importer uses the usual installation routine for CKAN extensions:

  1. Activate your CKAN virtualenv:

    cd /usr/lib/ckan/default
    source bin/activate
    
  2. Install ckanext.importer and its dependencies:

    pip install -e git+https://github.com/stadt-karlsruhe/ckanext-importer#egg=ckanext-importer
    pip install -r src/ckanext-importer/requirements.txt
    

    On a production system you’ll probably want to pin the latest release version of ckanext.importer instead:

    pip install -e git+https://github.com/stadt-karlsruhe/ckanext-importer@v0.2.0#egg=ckanext-importer
    
  3. Restart CKAN. For example, if you’re using Apache:

    sudo service apache2 restart
    

Usage

ckanext.importer provides utilities to write Python code for importing and synchronizing CKAN metadata from an external data source.

Note

At this point in time, ckanext.importer does not provide a web UI or any command line tools.

The starting point for using ckanext.importer is an Importer. Each Importer instance corresponds to a separate data source and is identified using an ID that can be freely chosen (but must be unique among all importers used on the target CKAN instance):

from ckanext.importer import Importer

imp = Importer('my-importer-id')

Once you have created an importer, you use its sync_package() method to create/update the CKAN metadata for a dataset. The CKAN package is linked to your external dataset using an external ID (EID). ckanext.importer automatically stores the EID along with the other package metadata inside CKAN. Like the importer ID, the package’s EID can be chosen freely, but must be unique among all packages for this importer.

with imp.sync_package(eid='my-package-eid') as pkg:
    # ckanext.importer has automatically checked whether a
    # package for this importer ID and package EID already
    # exists and -- if that is the case -- retrieved it.
    # Otherwise, a suitable package has been automatically
    # created for you.

    # Use the package's dict-interface to insert/update the
    # metadata from your data source. For example:
    pkg['title'] = 'My Package Title'

# Once the context manager exits, the modified package is
# automatically uploaded to CKAN.

Typically, you don’t have only one dataset, but an external data source (for example a database) containing many datasets to be imported:

for external_dataset in external_datasource:
    with imp.sync_package(eid=external_dataset.id) as pkg:
        pkg['title'] = external_dataset.name

Synchronizing a package’s resources works the same way: the object returned by sync_package() is an instance of Package and provides a sync_resource() method:

with imp.sync_package(eid='my-package-eid') as pkg:
    pkg['title'] = 'My Package Title'

    with pkg.sync_resource(eid='my-resource-eid') as res:
        res['name'] = 'My Resource Name'
        res['url'] = 'https://some-resource-url'

Resource EIDs need to be unique among all resources of the same package.

Finally, the same mechanism can be used to synchronize resource views via the Resource.sync_view() method (which returns a View instance):

with pkg.sync_resource(eid='my-resource-eid') as res:
    res['name'] = 'My Resource Name'
    res['url'] = 'https://some-resource-url'

    with res.sync_view(eid='my-view-eid') as view:
        view['view_type'] = 'text_view'
        view['title'] = 'My View Title'

See the API Reference for more information.

Error Handling

A main design principle of ckanext.importer is to keep CKAN’s version of the imported data in a well-defined state in case of an error. To support different use cases, there are different approaches to error handling, which can be configured using the on_error argument of Importer.sync_package(), Package.sync_resource(), and Resource.sync_view():

  • Re-raise the exception (OnError.reraise): If an exception occurs inside the context manager, log and re-raise it.

    Changes made inside the context manager are not uploaded to CKAN. The previous state of the entity is kept in CKAN. If a new entity was created at the beginning of the context manager then it is deleted.

    This is the default behavior.

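  • Swallow the exception and keep the entity (OnError.keep): The exception is logged, but not re-raised. The old version of the entity is kept in CKAN. If the entity was created at the beginning of the context manager (i.e. if no entity for that EID existed before) then it is not kept.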

  • Swallow the exception and delete the entity (OnError.delete): The exception is logged, but not re-raised. The entity is deleted from CKAN.
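The three policies can be illustrated with a small toy model. This is only a sketch of the behavior described above, not ckanext.importer's implementation: `sync_entity`, the plain-dict `store` that plays the role of CKAN, and the local `OnError` enum (which mirrors the documented member names and values) are all assumptions made for the example.

```python
import contextlib
import copy
import enum
import logging

log = logging.getLogger(__name__)


class OnError(enum.Enum):
    # Local stand-in that mirrors the documented constants
    reraise = 1
    keep = 2
    delete = 3


@contextlib.contextmanager
def sync_entity(store, eid, on_error=OnError.reraise):
    """Toy model of a sync context manager; `store` stands in for CKAN."""
    existed = eid in store
    if not existed:
        # Mirror "created at the beginning of the context manager"
        store[eid] = {}
    # Changes are made on a copy so that the stored version stays untouched
    working = copy.deepcopy(store[eid])
    try:
        yield working
    except Exception:
        log.exception('Error while syncing %r', eid)
        # A freshly created entity is never kept after an error
        if on_error is OnError.delete or not existed:
            del store[eid]
        if on_error is OnError.reraise:
            raise
        # For keep/delete the exception is swallowed here
    else:
        # No error: "upload" the modified entity
        store[eid] = working
```

In this model, an error under `OnError.keep` leaves an existing entry exactly as it was, while an error under `OnError.delete` removes the entry even if it existed before.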

License

Copyright © 2018, Stadt Karlsruhe.

Distributed under the GNU Affero General Public License. See the file LICENSE for details.

Changelog

See the file CHANGELOG.md.

API Reference

class ckanext.importer.Importer(id, api=None, default_owner_org=None)

An importer.

This class allows you to sync packages (and, from there, resources and views) between an external data source and CKAN.

id is the ID of this importer. The ID needs to be unique among all importers used on this CKAN instance. It is converted to Unicode automatically.

api is an optional instance of ckanapi.LocalCKAN or ckanapi.RemoteCKAN and specifies the CKAN instance to sync data with. If not given, it defaults to ckanapi.LocalCKAN, i.e. the currently running local CKAN instance.

default_owner_org is the default setting for the owner_org field of packages created via sync_package() and can be either the name or the ID of an existing CKAN organization.
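As a sketch of how these arguments fit together, the following helper builds an importer that talks to a remote CKAN instance. The helper name `make_remote_importer`, its parameters, and the organization name `my-organization` are placeholders chosen for this example; the imports are deferred into the function only so that the snippet can be loaded on a machine without CKAN installed.

```python
def make_remote_importer(ckan_url, api_key, importer_id='my-importer-id'):
    """Create an Importer that syncs to a remote CKAN instance."""
    # Deferred imports: both packages are only needed when the helper is called
    from ckanapi import RemoteCKAN
    from ckanext.importer import Importer

    api = RemoteCKAN(ckan_url, apikey=api_key)
    # 'my-organization' is a placeholder for an existing CKAN organization
    return Importer(importer_id, api=api, default_owner_org='my-organization')
```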

sync_package(eid, on_error=OnError.reraise)

Sync a package.

This is a context manager that returns a Package instance for the CKAN package corresponding to the given EID. The package can then be modified inside the context manager. Once the context manager exits, the modified package is uploaded to CKAN.

If no package exists for the given EID then one is created.

If the package is not modified inside the context manager then it is not re-uploaded to CKAN.

on_error is an instance of OnError and controls how exceptions inside the context manager are handled.
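The "only upload when modified" behavior can be pictured as comparing the entity before and after the block. The sketch below shows that idea under the assumption of a plain dict and a hypothetical `upload` callback; it is not necessarily how ckanext.importer detects changes.

```python
import contextlib
import copy


@contextlib.contextmanager
def sync_if_changed(original, upload):
    """Yield a working copy of `original`; call `upload` only if it changed."""
    working = copy.deepcopy(original)
    yield working
    # Reached only if the block finished without an exception
    if working != original:
        upload(working)
```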

delete_unsynced_packages()

Delete packages that have not been synced.

This method deletes all packages belonging to this importer for which sync_package() has not been called since this Importer instance was created.

It is intended to be called after all desired packages have been synced to delete those CKAN packages corresponding to objects that have been removed from the data source since the last import.
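A complete import run might therefore look like the following sketch. The function name `import_all` and the shape of `external_datasets` (dicts with `id` and `name` keys) are assumptions made for the example; `imp` is expected to behave like an Importer.

```python
def import_all(imp, external_datasets):
    """Sync every external dataset, then prune packages that disappeared.

    `imp` is expected to behave like ckanext.importer.Importer. Each item
    of `external_datasets` is assumed to be a dict with 'id' and 'name' keys.
    """
    for dataset in external_datasets:
        with imp.sync_package(eid=dataset['id']) as pkg:
            pkg['title'] = dataset['name']

    # Only reached if no sync raised: remove packages whose EID was not
    # seen during this run
    imp.delete_unsynced_packages()
```

With the default OnError.reraise, an error in any sync propagates out of the loop, so the cleanup step is skipped and no packages are deleted after a partial run.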

class ckanext.importer.OnError

Error handling constants.

Used for the on_error argument of Importer.sync_package(), Package.sync_resource(), and Resource.sync_view().

delete = 3

Swallow the exception and delete the entity.

keep = 2

Swallow the exception and keep the old version of the entity. If the entity was created at the beginning of the current context manager (i.e. if no entity for that EID existed before) then that entity is not kept.

reraise = 1

Reraise the exception. If the entity was created at the beginning of the current context manager (i.e. if no entity for that EID existed before) then that entity is deleted before the exception is reraised.

class ckanext.importer.Package(eid, pkg_dict, parent)

Wrapper around a CKAN package dict.

Not to be instantiated directly. Use Importer.sync_package() instead.

The package can be modified using the standard dict-interface:

with imp.sync_package('my-eid') as pkg:
    pkg['title'] = 'A new title'

sync_resource(eid, on_error=OnError.reraise)

Sync a resource of this package.

This is a context manager that returns a Resource instance for the package’s CKAN resource corresponding to the given EID. The resource can then be modified inside the context manager. Once the context manager exits, the modified resource is uploaded to CKAN.

If no resource exists for the given EID then one is created.

If the resource is not modified inside the context manager then it is not re-uploaded to CKAN.

on_error is an instance of OnError and controls how exceptions inside the context manager are handled.

delete_unsynced_resources()

Delete resources that have not been synced.

This method deletes all resources belonging to this package for which sync_resource() has not been called since this Package instance was created.

It is intended to be called after all desired resources of this package have been synced to delete those CKAN resources corresponding to objects that have been removed from the data source since the last import.

extras = None

dict-interface for package extras.

CKAN stores package extras as a list of key/value dicts, which makes modifying them cumbersome. This attribute allows you to access the extras like a regular dict instead:

with imp.sync_package('my-eid') as pkg:
    pkg.extras['my-extra'] = 'some value'

Reading an extra returns the value of the first extra with the given key or raises a KeyError if no extra with that key exists.

Writing an extra overwrites the value of the first extra with the given key or appends a new extra at the end of the extras list if no extra with the given key exists.

Deleting an extra deletes the first extra with the given key or raises a KeyError when no extra with that key exists.

If you need more control regarding extras with duplicate keys and the order of extras then you need to manage extras manually (using pkg['extras'] instead of pkg.extras).
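The read, write, and delete rules above can be modeled with a thin wrapper around CKAN's list of key/value dicts. The class name `ExtrasView` is hypothetical and the code is an illustration of the documented semantics, not the library's own implementation.

```python
class ExtrasView(object):
    """Dict-like access to a CKAN-style list of {'key': ..., 'value': ...}."""

    def __init__(self, extras):
        # The list is modified in place, so changes show up in the package
        self._extras = extras

    def _index(self, key):
        # Position of the first extra with the given key
        for i, extra in enumerate(self._extras):
            if extra['key'] == key:
                return i
        raise KeyError(key)

    def __getitem__(self, key):
        return self._extras[self._index(key)]['value']

    def __setitem__(self, key, value):
        try:
            # Overwrite the first extra with that key ...
            self._extras[self._index(key)]['value'] = value
        except KeyError:
            # ... or append a new one at the end of the list
            self._extras.append({'key': key, 'value': value})

    def __delitem__(self, key):
        del self._extras[self._index(key)]
```

Only the first extra with a given key is ever touched; later duplicates stay in the list, which is why full control over duplicates requires working with the raw list.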

class ckanext.importer.Resource(eid, data_dict, parent)

Wrapper around a CKAN resource dict.

Do not instantiate directly. Use Package.sync_resource() instead.

The resource can be modified using the standard dict-interface:

with pkg.sync_resource('my-eid') as res:
    res['name'] = 'A new name'

sync_view(eid, on_error=OnError.reraise)

Sync a view of this resource.

This is a context manager that returns a View instance for the resource’s CKAN view corresponding to the given EID. The view can then be modified inside the context manager. Once the context manager exits, the modified view is uploaded to CKAN.

If no view exists for the given EID then one is created.

If the view is not modified inside the context manager then it is not re-uploaded to CKAN.

on_error is an instance of OnError and controls how exceptions inside the context manager are handled.

delete_unsynced_views()

Delete views that have not been synced.

This method deletes all views belonging to this resource for which sync_view() has not been called since this Resource instance was created.

It is intended to be called after all desired views of this resource have been synced to delete those CKAN views that are no longer desired.

class ckanext.importer.View(eid, data_dict, parent)

Wrapper around a CKAN view.

Do not instantiate directly. Use Resource.sync_view() instead.

The view can be modified using the standard dict-interface:

with res.sync_view('my-eid') as view:
    view['title'] = 'A new title'