Under the hood: a look inside our Deposit Service

--

This post covers the technical implementation and design of the Deposit Service that we launched as an MVP (Minimum Viable Product) in October this year (2022).

What it does

Let’s start with a quick overview of what the Deposit Service does and how it is used — the diagram below shows the high-level workflow for the MVP, which provides the functionality required to meet our agreed minimum requirements, allowing depositors to authenticate and securely upload their digital assets into the service:

Diagram showing the Deposit Service MVP workflow

In summary:

  • our staff create a user for each agreed depositor using their email address
  • one or more ‘deposits’ are then defined for each user, and each individual deposit includes the identifier of its accession record in our Archive Management System
  • depositors enter their email address in the service’s web-based UI to request a passcode; a time-limited passcode is then sent to that address, allowing them to sign in to the Deposit Service in their browser and upload digital files of any type into their chosen deposit area
  • uploaded content is only visible to the user who uploaded it and our internal Archivist team
  • depositors can see the listing of files they’ve uploaded and even remove files from an open deposit, but cannot download them
  • to ensure file integrity, checksums are calculated on the user’s machine before upload, stored with each file and are used to verify the successful transfer of data

How it works

The Deposit Service has been built as a ‘cloud native’ application — this is a development approach that utilises the scalability, flexibility and resilience of cloud services and uses technologies such as microservices, serverless functions, containers and APIs, along with other services provided by the cloud platform provider. Our team followed an agile methodology during development of the service and we adopted processes such as continuous integration/continuous delivery (CI/CD) to help rapidly test and deploy new code and features. We also used an approach called infrastructure as code (IaC) to manage and provision our cloud resources through code instead of manual processes, using scripts and configuration files to define and deploy the underlying infrastructure.

We use AWS (Amazon Web Services) as our cloud provider, and the architecture is designed accordingly to take advantage of the services available on their platform.
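
As an illustration of the IaC approach mentioned above, the sketch below uses the AWS CDK in TypeScript to define a private S3 bucket fronted by a CloudFront distribution, broadly mirroring the frontend hosting described in the walkthrough below. It is purely illustrative; the actual service may well use a different IaC tool, and the resource names are placeholders.

```typescript
// Illustrative only: an AWS CDK (TypeScript) stack defining a private S3 bucket
// fronted by a CloudFront distribution. Resource names are placeholders and this
// is not the service's actual infrastructure code.
import { Stack, StackProps } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as s3 from "aws-cdk-lib/aws-s3";
import * as cloudfront from "aws-cdk-lib/aws-cloudfront";
import * as origins from "aws-cdk-lib/aws-cloudfront-origins";

export class DepositFrontendStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Private bucket that holds the static frontend files
    const siteBucket = new s3.Bucket(this, "SiteBucket", {
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
    });

    // CloudFront distribution that serves (and caches) the frontend over HTTPS
    new cloudfront.Distribution(this, "SiteDistribution", {
      defaultBehavior: {
        origin: new origins.S3Origin(siteBucket),
        viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
      },
      defaultRootObject: "index.html",
    });
  }
}
```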

Diagram showing the architecture of our Deposit Service on the AWS cloud platform

Here’s a step-by-step look at how the service works at a technical level and the AWS cloud services that we use:

  • When a depositor points their web browser at the URL of the Deposit Service, the resultant DNS request is routed to the AWS DNS service Amazon Route 53, to which we’ve delegated a managed zone that holds the records for the Deposit Service and other services that we’re creating for digital collections.
  • The DNS name resolves to Amazon CloudFront, a CDN (Content Delivery Network) that we use as a caching layer (and because S3 static website hosting cannot serve HTTPS with a custom domain on its own)
  • HTTPS on the CloudFront endpoint is enabled using an SSL/TLS certificate issued by AWS Certificate Manager
  • Amazon have their own certificate authority (Amazon Trust Services) which is trusted by default by most browsers and operating systems, allowing AWS to issue public certificates directly and making the process of enabling HTTPS much quicker and easier
  • CloudFront returns the Deposit Service website files to the depositor directly if the content is already in its cache, otherwise the request is proxied on to an Amazon S3 (Simple Storage Service) bucket which serves up the frontend
  • In the user interface (UI) of the Deposit Service, depositors enter their email address to get a time-limited, single use passcode emailed to them, a process which is managed by a custom authentication workflow in Amazon Cognito, a customer identity and access management service in AWS
  • Cognito has a User Pool set up for the Deposit Service, which is populated with a user for each depositor by our staff via a separate, protected admin section of the UI
  • The User Pool triggers several serverless AWS Lambda functions that generate a challenge and email the passcode/required response to the user using Amazon SES (Simple Email Service); a minimal sketch of one such function is shown after this list
  • Each Lambda function also sends its logs to Amazon CloudWatch, which allows us to search, visualise and report on their content and even trigger alarms if certain conditions are met
  • The depositor enters the received passcode in the UI and, on successful authentication, Cognito issues them with a token for their session
  • Using the token, the UI is then able to make authenticated calls to Amazon S3 on behalf of the user, to retrieve and display a list of defined deposits (held in a JSON document) and the existing contents/file listing for each of those deposits (the second sketch after this list illustrates this)
  • The depositor can then use the UI to upload one or more files from their own machine to the S3 bucket used to store deposits
  • Before each file is uploaded, the UI calculates a SHA-256 checksum (more on that shortly) and includes this value and the original filename as custom metadata fields in the objects in S3 (files are called ‘objects’ once they’re stored in S3)
  • An AWS IAM (Identity and Access Management) policy is used to enforce permissions on the S3 bucket, allowing only the original depositor and our archivist team to access and manage the uploaded files
  • To monitor the service, we use a third-party solution that integrates directly with our AWS account and gathers, reports and alerts on various relevant service metrics, including the number of objects in each bucket and their total size, bucket growth over time, the number of Lambda invocations, and so on
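
To make the passcode flow a little more concrete, here is a minimal sketch (not our production code) of a Cognito ‘Create Auth Challenge’ Lambda trigger that generates a one-time passcode and emails it via SES, using the AWS SDK for JavaScript v3; the sender address and message wording are placeholders.

```typescript
// Sketch of a Cognito "Create Auth Challenge" Lambda trigger (AWS SDK for JavaScript v3).
// The sender address and message wording are placeholders.
import { SESClient, SendEmailCommand } from "@aws-sdk/client-ses";
import { randomInt } from "crypto";

const ses = new SESClient({});

export const handler = async (event: any) => {
  // Generate a six-digit passcode for this sign-in attempt
  const passcode = randomInt(100000, 1000000).toString();

  // Email the passcode to the depositor via SES
  await ses.send(new SendEmailCommand({
    Source: "deposits@example.org", // placeholder sender address
    Destination: { ToAddresses: [event.request.userAttributes.email] },
    Message: {
      Subject: { Data: "Your Deposit Service passcode" },
      Body: { Text: { Data: `Your one-time passcode is ${passcode}` } },
    },
  }));

  // Cognito passes these values on to the "Verify Auth Challenge Response" trigger,
  // which compares the depositor's answer against the stored passcode
  event.response.privateChallengeParameters = { passcode };
  event.response.challengeMetadata = "PASSCODE_CHALLENGE";
  return event;
};
```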
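
And a second sketch showing how a browser UI can turn a Cognito sign-in into scoped calls to S3. This assumes a Cognito Identity Pool is used to exchange the User Pool token for temporary AWS credentials (a common pattern, but an assumption on our part here); the region, pool IDs, bucket name and key prefix are all placeholders.

```typescript
// Sketch only: exchange a Cognito ID token for temporary AWS credentials and list
// the objects in a depositor's deposit area. All identifiers are placeholders.
import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";
import { fromCognitoIdentityPool } from "@aws-sdk/credential-providers";

const REGION = "eu-west-2"; // placeholder region

export async function listDepositFiles(idToken: string, depositId: string) {
  const s3 = new S3Client({
    region: REGION,
    credentials: fromCognitoIdentityPool({
      clientConfig: { region: REGION },
      identityPoolId: "eu-west-2:00000000-0000-0000-0000-000000000000", // placeholder
      logins: {
        // Map the User Pool ID token to the (hypothetical) identity pool
        "cognito-idp.eu-west-2.amazonaws.com/eu-west-2_EXAMPLE": idToken,
      },
    }),
  });

  // List the files already uploaded to this deposit area
  const listing = await s3.send(new ListObjectsV2Command({
    Bucket: "example-deposit-bucket", // placeholder bucket
    Prefix: `deposits/${depositId}/`, // placeholder key layout
  }));

  return listing.Contents ?? [];
}
```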

Ensuring and maintaining the integrity of uploaded files

As a part of the Digital Preservation Service for Cambridge University Libraries, we need to ensure that any files given to us through the Deposit Service remain unchanged not only after we receive them but also during the upload process, which we achieve through the use of checksums.

A checksum (also called a “hash”) is a string of letters and numbers produced by feeding data through a checksum algorithm. There are many different algorithms available, but some of the most commonly used include MD5, SHA-256 and SHA-512. A checksum will change significantly if even a single bit changes in the source data, so checksums are extremely useful for checking files and other data for errors that may have occurred during transmission or storage and for ensuring that fixity has been maintained.
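
As an example of how a checksum can be calculated in the browser, the snippet below uses the Web Crypto API to produce a SHA-256 digest for a selected file. This is a minimal sketch rather than our exact implementation, and it reads the whole file into memory, so very large files would need a chunked approach.

```typescript
// Minimal sketch: SHA-256 checksum of a File in the browser using the Web Crypto API.
// Reads the entire file into memory, so large files would need chunked hashing instead.
async function sha256Hex(file: File): Promise<string> {
  const buffer = await file.arrayBuffer();
  const digest = await crypto.subtle.digest("SHA-256", buffer);
  // Convert the raw digest bytes to the familiar hexadecimal string form
  return Array.from(new Uint8Array(digest))
    .map((byte) => byte.toString(16).padStart(2, "0"))
    .join("");
}
```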

S3 supports uploading files up to 5GB in size using a single PUT operation, but anything larger (up to a maximum of 5TB) must be uploaded in ‘parts’ using a managed process called multipart upload which is supported by our Deposit Service.
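
For larger files, the AWS SDK for JavaScript v3 provides an Upload helper (in @aws-sdk/lib-storage) that manages the multipart process automatically. The sketch below is illustrative rather than our actual upload code, and the bucket name, part size and concurrency are placeholder choices.

```typescript
// Sketch of a multipart upload using the AWS SDK v3 "Upload" helper, which splits
// large files into parts and uploads them in parallel. Names/sizes are placeholders.
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

async function uploadLargeFile(s3: S3Client, file: File, key: string): Promise<void> {
  const upload = new Upload({
    client: s3,
    params: { Bucket: "example-deposit-bucket", Key: key, Body: file },
    partSize: 100 * 1024 * 1024, // 100 MB parts (S3's minimum part size is 5 MB)
    queueSize: 4,                // number of parts uploaded concurrently
  });

  // Progress events like this are also what a UI progress bar would hook into
  upload.on("httpUploadProgress", (progress) => {
    console.log(`Uploaded ${progress.loaded ?? 0} of ${progress.total ?? "?"} bytes`);
  });

  await upload.done();
}
```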

If the MD5 checksum of a file or part is provided as the value of the Content-MD5 header in an upload request, S3 will calculate its own checksum once the upload is complete and compare the two to confirm the integrity of the data, returning an error if they do not match. We use this feature in our Deposit Service to verify that each upload has completed successfully.

In addition to the MD5 checksum used by S3, the Deposit Service UI also calculates the SHA-256 checksum for each file before it gets uploaded. We store this value as custom metadata on the object in S3 and will use it to re-verify the file’s integrity at later stages of our preservation workflow. Although MD5 is a faster algorithm, it is now only considered suitable for detecting accidental data corruption and not for security applications, whereas SHA-256 is a strong cryptographic hash algorithm that is well suited to both.
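
Putting those two pieces together, a single PUT upload might look something like the sketch below: the Content-MD5 header lets S3 verify the transfer, and the SHA-256 and original filename are stored as custom metadata. The bucket name and metadata keys are placeholders, and the base64-encoded MD5 is assumed to have been computed elsewhere (the Web Crypto API does not provide MD5, so a small JavaScript library is typically used for that part).

```typescript
// Sketch of a single PUT upload with integrity checks: S3 recomputes the MD5 and
// rejects the upload if it doesn't match Content-MD5, while the SHA-256 and the
// original filename travel with the object as custom metadata. Names are placeholders.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

async function uploadWithChecksums(
  s3: S3Client,
  file: File,
  key: string,
  md5Base64: string,  // base64-encoded MD5 digest, computed before upload
  sha256Hex: string   // hex-encoded SHA-256 digest, computed before upload
): Promise<void> {
  await s3.send(new PutObjectCommand({
    Bucket: "example-deposit-bucket", // placeholder bucket
    Key: key,
    Body: file,
    ContentMD5: md5Base64,            // S3 verifies this after the upload completes
    Metadata: {
      sha256: sha256Hex,              // used later to re-verify fixity
      "original-filename": file.name, // preserved alongside the object
    },
  }));
}
```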

Although we don’t utilise it directly, it’s also worth briefly mentioning that every S3 object also has an ‘ETag’ field; for files/objects uploaded using a single PUT operation this contains the MD5 checksum, but for an object uploaded in parts the value is instead the MD5 checksum of the concatenated MD5 checksums of the individual parts, with the total number of parts appended.
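
For the curious, the sketch below (Node.js crypto, illustrative only) shows how that multipart ETag value is derived from the individual part checksums.

```typescript
// Illustrative sketch: derive a multipart-style ETag from part data. Each part's
// binary MD5 digest is concatenated, MD5'd again, and suffixed with "-<part count>".
import { createHash } from "crypto";

function multipartEtag(parts: Buffer[]): string {
  const partDigests = parts.map((part) => createHash("md5").update(part).digest());
  const combined = createHash("md5").update(Buffer.concat(partDigests)).digest("hex");
  return `${combined}-${parts.length}`;
}
```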

What’s next?

Delivering a functioning Deposit Service MVP that provides a mechanism for donors to transfer digital deposits safely and securely to the University Library was a great milestone for our team. Our immediate focus has now shifted to building the workflows and repository that will make up the backbone of our Digital Preservation Service, but there are still a few things on our roadmap that we plan to deliver:

  • Add progress bars for checksum calculation and upload in the UI
    The current UI doesn’t display what’s happening when file checksums are being calculated and when the files are then being uploaded, which can result in a poor user experience (especially when uploading larger files and large numbers of files), as there isn’t any visual feedback on what’s happening until the UI refreshes after each upload is complete.
  • Moving the deposited content into our ingest & preservation workflows
    We’re hoping to have a basic end-to-end workflow in place in 2023, on which we can then start to build additional functionality. Once this meets enough of our minimum requirements, we can start ingesting the content stored in the Deposit Service into our repository and preservation storage to make it discoverable and accessible.
  • Make the code open source
    Although the code of the Deposit Service has been kept private during initial development, once it’s a little more mature and we’ve smoothed off the rougher edges we’ll be making it open source for others in the community to use, adapt and develop as they wish, something we intend to do for all components of our Digital Preservation Service.

I hope this post has been an interesting insight into our recent activities and what we’ve been working on — a future post will focus on the end-to-end workflow we’re currently building and our implementation of a repository function to store and preserve our content.
