Harvesters and Transformers

A harvester gathers raw data from a source using its API.

A transformer takes the raw data gathered by a harvester and maps the fields to the defined SHARE models.

Start Up

  1. Install Docker.
  2. Make sure you’re using Python 3, installed with Miniconda or Homebrew.
  3. Install everything inside a virtual environment, created with Conda, Virtualenv, or your Python environment manager of choice.

Installation (inside a virtual environment):

pip install -r requirements.txt

# Creates, starts, and sets up containers for elasticsearch,
# postgres, and the server
docker-compose build web
docker-compose run --rm web ./bootstrap.sh

To run the server in a virtual environment instead of Docker:

docker-compose stop web
python manage.py runserver

To run a celery worker:

python manage.py celery worker -l DEBUG

Running Existing Harvesters and Transformers

To see a list of all sources and their names for harvesting, visit https://share.osf.io/api/sources/

Steps for gathering data:
  • Harvest data from the original source
  • Transform the data, creating a ChangeSet that maps it into the SHARE models
  • Accept the ChangeSet objects, saving them as AbstractCreativeWork objects in the SHARE database

Printing to the Console

It is possible to run the harvesters and transformers separately and print the results to the console for testing and debugging, using ./bin/share.

For general help documentation:

./bin/share --help

For harvest help:

./bin/share harvest --help

To harvest:

./bin/share harvest domain.source_name_here

If the harvester created a lot of files and you want to view a couple:

find <source dir, e.g. edu.icpsr/> -type f -name '*.json' | head -<number to list>

By default, the harvest command creates a new folder at the top level with the same name as the source, but you can specify a different folder with the --out argument.
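
For example, to harvest into a specific directory (the path here is illustrative):

./bin/share harvest domain.source_name_here --out path/to/output_dir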

To transform all harvested documents:

./bin/share transform domain.source_name_here dir_where_raw_docs_are/*

To transform just one document harvested:

./bin/share transform domain.source_name_here dir_where_raw_docs_are/filename.json

If the transformer hits an error while parsing a harvested document, it will automatically drop into a Python debugger.

To instead enter an enhanced Python debugger with access to a few more variables, such as data, run:

./bin/share debug domain.source_name_here dir_where_raw_docs_are/filename.json

To debug:

e(data, ctx.<field>)

Running Through the Full Pipeline

Note: celery must be running for --async tasks

Run a harvester and transformer:

python manage.py harvest domain.sourcename --async

To automatically accept all ChangeSet objects created:

python manage.py runbot automerge --async

To automatically add all harvested and accepted documents to Elasticsearch:

python manage.py runbot elasticsearch --async

Writing a Harvester and Transformer

See the transformers and harvesters located in the share/transformers/ and share/harvesters/ directories for more examples of syntax and best practices.

Adding a new source

  • Determine whether the source has an API for accessing its metadata
  • Create a source folder at share/sources/{source name}
    • Source names are typically the reversed domain name of the source, e.g. a source at http://example.com would have the name com.example
    • If the source name starts with a TLD not already listed (e.g. com, au, gov), add /TLD.*/ to .gitignore in the generated harvester data section
  • Create a file named source.yaml in the source folder
  • Determine whether the source makes their data available using the OAI-PMH protocol
  • Write the harvester (see best practices below)
  • Write the transformer (see best practices below)
  • Add the source’s icon
    • Visit www.domain.com/favicon.ico and download the favicon.ico file
    • Place the favicon as icon.ico in the source folder
  • Load the source
    • To make the source available in your local SHARE, run ./manage.py loadsources in the terminal

Writing a source.yaml file

The source.yaml file contains information about the source itself, and one or more configs that describe how to harvest and transform data from that source.

name: com.example
long_title: Example SHARE Source for Examples
home_page: http://example.com/
user: sources.com.example
configs:
- label: com.example.oai
  base_url: http://example.com/oai/
  harvester: oai
  harvester_kwargs:
      metadata_prefix: oai_datacite
  rate_limit_allowance: 5
  rate_limit_period: 1
  transformer: org.datacite
  transformer_kwargs: {}

See the Source and SourceConfig tables in the whitepaper for the available fields.

Best practices for OAI sources

Sources that use OAI-PMH make it easy to harvest their metadata.

  • Set harvester: oai in the source config.
  • Choose a metadata format to harvest.
    • Use the ListMetadataFormats OAI verb to see what formats the source supports (an example request follows this list).
    • Every OAI source supports oai_dc, but they usually also support at least one other format that has richer, more structured data, like oai_datacite or mods.
    • Choose the format that seems to have the most useful data for SHARE, especially if a transformer for that format already exists.
    • Choose oai_dc only as a last resort.
  • Add metadata_prefix: {prefix} to the harvester_kwargs in the source config.
  • If necessary, write a transformer for the chosen format.
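
ListMetadataFormats is a standard OAI-PMH request, so a source’s supported formats can be checked directly from the command line; the base URL below is the placeholder from the source.yaml example:

curl 'http://example.com/oai/?verb=ListMetadataFormats'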

Best practices for writing a non-OAI Harvester

  • The harvester should be defined in share/harvesters/{harvester name}.py.
  • When writing the harvester:
    • Inherit from share.harvest.BaseHarvester
    • Set the harvester version: VERSION = 1
    • Implement do_harvest(...) (and possibly additional helper functions) to make requests to the source and yield the harvested records; a sketch follows this list.
    • Check to see if the data returned by the source is paginated.
      • There will often be a resumption token to get the next page of results.
    • Check to see if the source’s API accepts a date range
      • If it does not, then (if possible) check the date on each record returned and stop harvesting once records are older than the specified start date.
  • Add the harvester to entry_points in setup.py
    • e.g. 'com.example = share.harvesters.com_example:ExampleHarvester',
    • run python setup.py develop to make the harvester available in your local SHARE
  • Test by running the harvester
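
As a rough illustration, here is a minimal sketch of a non-OAI harvester. The endpoint, pagination scheme, and response shape are all hypothetical, and details such as the exact yield format should be checked against the existing harvesters in share/harvesters/:

import requests

from share.harvest import BaseHarvester


class ExampleHarvester(BaseHarvester):
    VERSION = 1

    def do_harvest(self, start_date, end_date):
        # Hypothetical JSON API with page-based pagination and date filtering.
        url = 'http://example.com/api/records'
        page = 1
        while True:
            response = requests.get(url, params={
                'from': start_date.date().isoformat(),
                'until': end_date.date().isoformat(),
                'page': page,
            })
            response.raise_for_status()
            records = response.json().get('results', [])
            if not records:
                return
            for record in records:
                # Yield (identifier, datum) pairs to be stored as raw data.
                yield record['id'], record
            page += 1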

Best practices for writing a non-OAI Transformer

  • The transformer should be defined in share/transformers/{transformer name}.py.
  • When writing the transformer:
    • Determine what information from the source record should be stored as part of the CreativeWork model (i.e. if the record clearly defines a title, description, contributors, etc.).
    • Use the chain transformer tools as necessary to correctly parse the raw data.
      • Alternatively, implement share.transform.BaseTransformer to create a transformer from scratch.
    • Utilize the Extra class (a sketch follows this list)
      • Raw data that does not fit into a defined share model should be stored here.
      • Raw data that is otherwise altered in the transformer should also be stored here to ensure data integrity.
  • Add the transformer to entry_points in setup.py
    • e.g. 'com.example = share.transformers.com_example:ExampleTransformer',
    • run python setup.py develop to make the transformer available in your local SHARE
  • Test by running the transformer against raw data you have harvested.
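
As a rough illustration, here is a minimal chain transformer that uses the Extra class; the ctx paths are hypothetical and depend on the shape of the harvested records:

from share.transform.chain import ctx, links, ChainTransformer, Parser


class CreativeWork(Parser):
    title = ctx.record.title
    description = links.Try(ctx.record.abstract)

    class Extra:
        # Raw fields with no SHARE model equivalent are preserved here.
        internal_id = ctx.record.id
        # Values that are altered above should also be kept here in
        # their original form, for data integrity.
        raw_date = links.Try(ctx.record.date)


class ExampleTransformer(ChainTransformer):
    VERSION = 1
    root_parser = CreativeWork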

SHARE Chain Transformer

SHARE provides a set of tools for writing transformers, based on the idea of constructing chains for each field that lead from the root of the raw document to the data for that field. To write a chain transformer, add from share.transform.chain import links at the top of the file and make the transformer inherit share.transform.chain.ChainTransformer.

from share.transform.chain import ctx, links, ChainTransformer, Parser


class CreativeWork(Parser):
    title = ctx.title


class ExampleTransformer(ChainTransformer):
    VERSION = 1
    root_parser = CreativeWork

  • Concat

    To combine list or singular elements into a flat list:

    links.Concat(<string_or_list>, <string_or_list>)
    
  • Delegate

    To specify which class to use:

    links.Delegate(<class_name>)
    
  • Join

    To combine list elements into a single string:

    links.Join(<list>, joiner=' ')
    

    Elements are separated with the joiner. By default, the joiner is a newline.

  • Map

    To designate the class used for each instance of a value found:

    links.Map(links.Delegate(<class_name>), <chain>)
    

    Map uses the Delegate tool. See the SHARE models for which relations use a through table (anything that sets through=).

  • Maybe

    To transform data that is not consistently available:

    links.Maybe(<chain>, '<item_that_might_not_exist>')
    

    Indexing further if the path exists:

    links.Maybe(<chain>, '<item_that_might_not_exist>')['<item_that_will_exist_if_maybe_passes>']
    

    Nesting Maybe:

    links.Maybe(links.Maybe(<chain>, '<item_that_might_not_exist>')['<item_that_will_exist_if_maybe_passes>'], '<item_that_might_not_exist>')
    

    To avoid excessive nesting, use the Try link.

  • OneOf

    To specify two possible paths for a single value:

    links.OneOf(<chain_option_1>, <chain_option_2>)
    
  • ParseDate

    To determine a date from a string:

    links.ParseDate(<date_string>)
    
  • ParseLanguage

    To determine the ISO language code (e.g. ‘ENG’) from a string (e.g. ‘English’):

    links.ParseLanguage(<language_string>)
    

    Uses the pycountry package.

  • ParseName

    To determine the parts of a name (e.g. the first name) from a string:

    links.ParseName(<name_string>).first
    

    options:

    first
    last
    middle
    suffix
    title
    nickname
    

    Uses the nameparser package.

  • RunPython

    To run a defined Python function:

    links.RunPython('<function_name>', <chain>, *args, **kwargs)
    
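    The named function is looked up on the parser class itself. A minimal sketch (the function name and cleanup logic here are hypothetical):

    class CreativeWork(Parser):
        title = links.RunPython('format_title', ctx.title)

        def format_title(self, title):
            # Hypothetical cleanup of the raw value.
            return title.strip()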
  • Static

    To define a static field:

    links.Static(<static_value>)
    
  • Subjects

    To map a subject to the PLOS taxonomy based on defined mappings:

    links.Subjects(<subject_string>)
    
  • Try

    To transform data that is not consistently available and may throw an exception:

    links.Try(<chain>)
    
  • XPath

    To access data using XPath:

    links.XPath(<chain>, "<xpath_string>")