Building our repository ingest workflow


My last post was about how we built our Deposit Service, a web-based portal that allows users to securely deposit their digital collections with the Library. This post covers what we’re planning to do next with the content we collect; more specifically, the workflow we’ve built to process and store submitted content in our repository and preservation storage.

Overview

Our overall system design is conceptually quite simple:

Conceptual flow diagram for our DPS Collection Management system
  • Library staff first acquire or create content by using the Deposit & Transfer services, by uploading to the University’s institutional repository, or by digitising physical collection items
  • The next stage is pre-ingest, which generally involves collating and packaging the acquired metadata and files into a standard form, ready to submit for ingest
  • During ingest we then process each submission through a series of steps (detailed below), before storing the content and resultant metadata in our repository
  • Our repository is the core of the system; everything goes into it, even if an object doesn’t end up being selected for long-term preservation (which could happen for a variety of reasons). This decision helps us to quickly make content safer and discoverable, all in one central location
  • Most of the content in the repository will then be copied to preservation storage, where we store multiple copies across geographically separate locations
  • Finally, Administration and Access will support various repository management and operational functions, including the appraisal of selected content before preservation

What happens in the ingest workflow?

The steps in the workflow are applied to every submission processed, regardless of the source. The first version scheduled to go live will ingest eTheses exported from Apollo (the University’s institutional repository built on DSpace). The team will then develop it further to support submissions from other sources in a standardised format.

  1. The container file for the submission is unpacked
  2. Any embedded containers (e.g. tar.gz or zip) inside the submission are then themselves recursively unpacked, whilst retaining any embedded hierarchy (see the sketch after this list)
  3. The supplied METS file is parsed to extract key metadata fields
  4. The submission is checked to see if it’s a resubmission (i.e. a duplicate or an update), then handled accordingly
  5. Each file (binary) in the submission is:
  • scanned for viruses & other malware
  • analysed to identify its file format
  6. All metadata (both derived and supplied) is then stored in the repository
  7. Each file is stored in the repository after the supplied checksum is validated
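
To illustrate step 2, here’s a minimal sketch of what recursive unpacking might look like in Python. This isn’t our production code: the list of supported container types, the naming of the extraction directories and the (absent) error handling are all simplifying assumptions.

```python
import tarfile
import zipfile
from pathlib import Path

# Container types handled in this sketch; the real list of supported formats is an assumption.
CONTAINER_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz")

def unpack_embedded(root: Path) -> None:
    """Recursively expand any containers found under 'root', keeping the hierarchy.

    A production version would also guard against path traversal ('zip-slip')
    and pathological archives, which this sketch does not.
    """
    for path in list(root.rglob("*")):
        if not path.is_file() or not path.name.lower().endswith(CONTAINER_SUFFIXES):
            continue
        # Unpack alongside the container so the embedded hierarchy is preserved.
        target = path.parent / (path.name + ".unpacked")
        target.mkdir(parents=True, exist_ok=True)
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as archive:
                archive.extractall(target)
        elif tarfile.is_tarfile(path):
            with tarfile.open(path) as archive:
                archive.extractall(target)
        # The extracted content may itself contain further containers.
        unpack_embedded(target)
```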

Each invocation of the workflow is also assigned a unique identifier (a UUID), and each step generates and publishes an event; these events are stored and used to track and log the progress of each submission.
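
To make that concrete, here’s a hypothetical sketch of what publishing one of these events could look like using boto3 and DynamoDB. The table name, attribute names and event schema are illustrative assumptions rather than our actual implementation.

```python
import time
import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
events_table = dynamodb.Table("ingest-workflow-events")  # hypothetical table name

def publish_event(workflow_id: str, step: str, status: str, detail: dict | None = None) -> None:
    """Record a single workflow-step event so a submission's progress can be tracked."""
    events_table.put_item(
        Item={
            "workflow_id": workflow_id,            # UUID assigned to this workflow invocation
            "event_id": str(uuid.uuid4()),
            "step": step,                          # e.g. "unpack", "parse-mets", "virus-scan"
            "status": status,                      # e.g. "started", "succeeded", "failed"
            "timestamp": int(time.time() * 1000),  # epoch milliseconds
            "detail": detail or {},
        }
    )

# Example: record that METS parsing finished for one submission
publish_event(str(uuid.uuid4()), "parse-mets", "succeeded", {"mets_file": "mets.xml"})
```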

How did we build it?

We built our ingest workflow in the AWS Cloud using AWS Step Functions, which helped us meet our requirements for operational cost, scalability and integration with the other components of the system, including the Fedora 6 repository where our content is ingested and preserved.

AWS Step Functions is “a serverless orchestration service that lets you combine AWS Lambda functions and other AWS services to build business-critical applications”, which makes it ideal for our requirements. It has built-in error handling and retries, support for complex workflows with multiple steps, scaling and parallel executions, and can handle workflows running for up to a year. Application workflows (known as ‘State Machines’) are defined in the JSON-based ‘Amazon States Language’ and can be built with an easy-to-use visual workflow editor in the AWS console.
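
To give a flavour of what that looks like in practice, below is a heavily simplified, hypothetical fragment of a State Machine definition written as a Python dict and registered via boto3; the state names, ARNs and retry policy are placeholders rather than our production workflow.

```python
import json
import boto3

# Hypothetical, heavily simplified two-step workflow in Amazon States Language;
# the ARNs and state names are placeholders.
definition = {
    "Comment": "Minimal ingest sketch: unpack a submission, then parse its METS file",
    "StartAt": "UnpackSubmission",
    "States": {
        "UnpackSubmission": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:unpack-submission",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2}],
            "Next": "ParseMets",
        },
        "ParseMets": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:parse-mets",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="ingest-workflow-sketch",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/ingest-workflow-role",  # placeholder role
)
```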

Our ingest workflow on AWS serverless infrastructure

How it works:

  • When an exported object is uploaded to the submissions bucket, an event is published to Amazon EventBridge, where a rule is configured to trigger the execution of a State Machine workflow in AWS Step Functions
  • The State Machine executes a series of ‘Actions’ linked by ‘Flows’; these invoke individual Lambdas to perform functions such as unpacking the submission, recursively unpacking any embedded containers and parsing the supplied METS/metadata file(s)
  • We use DynamoDB Actions throughout the workflow to publish events to a datastore and to retrieve configuration/routing information
  • To submit each file for antivirus scanning and file format identification:
  1. the State Machine copies each file to an S3 bucket for processing — this publishes an event to Amazon EventBridge where a rule is configured to add a message to an SQS queue
  2. at this point the operation of the State Machine is paused and a ‘task token’ is issued, which is added to the object’s metadata in the S3 bucket
  3. the SQS queue is monitored by a CloudWatch alarm, which launches a number of Fargate tasks in the Amazon Elastic Container Service (ECS); the number of tasks is determined by the size of the queue and the configuration of an auto-scaling policy, which also stops any running tasks (after a defined cooldown period) once the queue is empty. This helps us avoid unnecessary costs when there is nothing to process, while still allowing us to scale up rapidly to meet spikes in demand
  4. the ‘tasks’ in this case are containerised instances of ClamAV (an “open-source antivirus engine for detecting trojans, viruses, malware & other malicious threats”) and Siegfried (a signature-based file format identification tool that uses the UK National Archives PRONOM file format signatures)
  5. each task container is configured as a consumer of the SQS queue, and once running will pick up any pending messages and process the associated files, before parsing the output, adding derived metadata to the objects and moving them to another location to trigger further processing (a simplified sketch of such a consumer appears after this list)
  6. the final location in the processing bucket triggers a Lambda which retrieves the State Machine task token stored in an object’s metadata and calls the Step Functions API, passing the token to resume the associated workflow from its paused state (also sketched after this list)
  • The final Action of the State Machine stores all of the files and metadata in our Fedora repository using multiple calls wrapped in a single atomic transaction (illustrated after this list); this ensures that if anything goes wrong during the submission, the changes made under the transaction are automatically rolled back, so the repository is never left in an inconsistent state and any errors are contained within the scope of the ingest workflow
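
As a rough illustration of steps 3 to 5 above, the sketch below shows the general shape of such a consumer: it polls the queue, runs ClamAV (clamscan) and Siegfried (sf) over each referenced file, then re-uploads the file with the derived metadata attached. The queue URL, prefixes, metadata keys and simplified message body are all assumptions; a real consumer would also preserve the object’s existing metadata (including the task token) and wouldn’t store large tool output directly as S3 object metadata.

```python
import json
import subprocess
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/ingest-scan-queue"  # placeholder

def scan_and_identify(local_path: str) -> dict:
    """Run ClamAV and Siegfried over one file and return the derived metadata as strings."""
    scan = subprocess.run(["clamscan", "--no-summary", local_path],
                          capture_output=True, text=True)
    sf = subprocess.run(["sf", "-json", local_path], capture_output=True, text=True)
    return {
        "virus-scan-clean": str(scan.returncode == 0).lower(),  # clamscan exits 0 when no threat is found
        "format-identification": sf.stdout.strip(),
    }

def consume() -> None:
    """Poll the queue and process each referenced S3 object, then move it on for further processing."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            # Simplified message body; a real EventBridge-to-SQS message wraps
            # the S3 details in an event envelope.
            body = json.loads(msg["Body"])
            bucket, key = body["bucket"], body["key"]

            s3.download_file(bucket, key, "/tmp/payload")
            derived = scan_and_identify("/tmp/payload")

            # Re-upload under a 'processed/' prefix with the derived metadata attached;
            # landing in that location is what triggers the next processing step.
            s3.upload_file("/tmp/payload", bucket, f"processed/{key}",
                           ExtraArgs={"Metadata": derived})
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```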
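
The pause-and-resume mechanism in steps 2 and 6 is easier to see in code. Here’s a minimal, hypothetical sketch of the Lambda that resumes the workflow, assuming it’s invoked by an S3 event notification and that the task token was stored under a ‘task-token’ metadata key (both illustrative details rather than our exact implementation).

```python
import json
import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

def handler(event, context):
    """Triggered when a processed file lands in the final prefix of the processing bucket.

    Reads the Step Functions task token stored in the object's metadata and
    resumes the paused State Machine execution.
    """
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        head = s3.head_object(Bucket=bucket, Key=key)
        task_token = head["Metadata"]["task-token"]  # assumed metadata key

        # Resume the execution; the output becomes the result of the paused task state.
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({"bucket": bucket, "key": key}),
        )
```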
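
And to show the shape of that final step, here’s a hedged sketch of storing a set of files inside a single Fedora 6 atomic transaction using Python’s requests library. It follows the Fedora 6 atomic operations API as we understand it (start a transaction, send its Atomic-ID header on each request, then commit or roll back), but the base URL, credentials and container paths are placeholders.

```python
import requests

FEDORA = "http://localhost:8080/rest"   # illustrative base URL
AUTH = ("fedoraAdmin", "fedoraAdmin")   # illustrative credentials

def ingest_atomically(files: dict[str, bytes]) -> None:
    """Store a set of binaries inside one Fedora 6 atomic transaction.

    If any request fails, the transaction is rolled back so no partial
    object is left behind; the container layout is illustrative.
    """
    # Start a transaction; Fedora returns its URI in the Location header.
    tx = requests.post(f"{FEDORA}/fcr:tx", auth=AUTH)
    tx.raise_for_status()
    tx_uri = tx.headers["Location"]

    try:
        for name, content in files.items():
            resp = requests.put(
                f"{FEDORA}/submissions/example-object/{name}",  # placeholder path
                data=content,
                headers={"Atomic-ID": tx_uri, "Content-Type": "application/octet-stream"},
                auth=AUTH,
            )
            resp.raise_for_status()
        # Commit: make all changes visible as one atomic unit.
        requests.put(tx_uri, auth=AUTH).raise_for_status()
    except Exception:
        # Roll back so the repository is never left in an inconsistent state.
        requests.delete(tx_uri, auth=AUTH)
        raise
```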

What’s next?

We’re in the final stages of developing and testing this initial workflow with the Fedora repository and aim to have it live by the end of this year. We’ll keep developing it further beyond that though, with plans to support ingest from additional sources and to add support for format validation and normalisation functions when required.

In my next post I’ll be taking a closer look at how we store content in our repository using OCFL and how we’re utilising extensions to that standard to support appraisal functionality and preserve our data.
