Identification and analysis of our research repository file formats using DROID

--

Hi, I’m John Gostick and I’m the Technical Lead for Digital Preservation at Cambridge University Library, where as part of our 5-year Digital Preservation Programme we’re working to design and build a cloud native CUL Digital Preservation Service (DPS).

This post is about a piece of work I was recently involved with that used DROID to identify the format of files held in Apollo (the University’s institutional repository).

What is Apollo?

Apollo is the institutional repository (IR) of the University of Cambridge, established in 2003 as a service for curating, storing and sharing the research outputs from the University’s more than 10,000 academic, research staff and PhD/research students, and includes publications, conference proceedings, book chapters, monographs, theses, datasets, working papers and other supplementary files.

The service is built on a software application called DSpace, an open-source repository platform typically used for creating open access repositories for scholarly and/or published digital content.

What’s in it?

As of the start of February 2022, Apollo holds 3719 datasets, 9739 theses and 59076 journal articles, in addition to many other forms of content such as reports, software and chemical structures. Users can ‘Browse by Type’ to see a full and current list of the content type categories and associated number of items held in the repository, and a recent file system query reported over 1.2 million files held on the underlying storage.

What we wanted to find out

Whilst DSpace can tell us the number of files on disk and how they are categorised in the repository itself, it doesn’t do robust analysis and format identification on files when they’re ingested. Instead, it infers the format of each file from its extension, which works well enough for more common formats but doesn’t work for unknown or missing extensions and cannot identify mismatches (where the given extension is different from that normally used for that file format). There are also more generic extensions like .dat, which are normally data files that store information relating and specific to the application that created the file.

A small proportion (approx. 2%) of the files in the repository storage (mainly supplementary content and research datasets) are ‘containers’, i.e. archive file formats such as zip and tar, where multiple files are ‘serialised’ together into a single file. This is useful for portability, saving space (many container formats support compression) and organisation (containers can include both files and folders and their hierarchy). Although these containers make up only a very small percentage of the files ‘on disk’, the number of files held inside them was believed to be significant, yet the repository holds no information on the contents of each container.

We wanted an accurate view and understanding of all files in the repository, in order to understand the requirements for submitting this content to a preservation workflow, support data management functions, inform preservation planning (such as file format migration to guard against obsolescence), and help facilitate and ensure access to the intellectual content. Library staff are also preparing Apollo for certification against the CoreTrustSeal standard, so this file identification work also provides information to support that application.

Our goal was therefore to identify as accurately as possible the correct format of every file in the repository, including those in containers, and to do this we used a tool called DROID.

What is DROID?

DROID (Digital Record Object IDentification) is an open-source software tool developed by The National Archives in the UK, that uses information about the identifying characteristics of known file types (a.k.a. ‘signatures’) to identify and report on the formats and versions of digital files.

A cross-platform Java application, it can run on any operating system with a Java Runtime Environment installed and includes a GUI and a command line interface (CLI) that enable its operation to be scripted.

The signature files DROID uses for format identification contain information from The National Archives’ PRONOM technical registry, which is an open, web-based registry of technical information about file formats and their software dependencies that is regularly updated.

DROID compares every file it scans to the byte sequences and other characteristics (such as extension) defined in these signatures, to try and correctly identify a format and version. The results (including the PRONOM Persistent Unique Identifier (PUID) of the identified format) are then stored in a database inside a ‘profile’ container file, that can be opened using the GUI and used to generate reports on the results and export them as tabular data in the CSV format.

Crucially for our requirements, DROID can recursively scan the contents of a wide variety of container formats, including ZIP, RAR, 7Zip, ISO and BZip2, so would also allow us to report on files inside containers and even those in containers which are themselves inside containers, giving us a complete picture of all files held in the Apollo repository.

Using DROID

I ran DROID v6.5 on a fresh copy of the live repository datastores that we mounted on a Linux development server. This was to avoid any impact on the performance and stability of the live repository service from high levels of storage I/O, memory and CPU usage while the application was being run.

For historical reasons, the files in the repository were split across 2 volumes: ‘assetstore’ and ‘assetstore2’. Before running DROID on each of these volumes, I first used the tree command to get an indication of how many files it would have to scan:

Number of files and folders found on repository storage using the tree command
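If tree isn’t installed, a rough equivalent can be put together with find. The sketch below is only an illustration: the function name count_entries is made up for this post, and the path in the usage comment is a placeholder. Like tree, it leaves the starting directory itself out of the directory count.

```shell
# Hypothetical helper: count directories and regular files beneath a path.
# Entries are NUL-separated so that odd filenames (e.g. containing
# newlines) are still counted exactly once each.
count_entries() {
    local root="$1"
    local files dirs
    files=$(find "$root" -mindepth 1 -type f -print0 | tr -dc '\0' | wc -c | tr -d '[:space:]')
    dirs=$(find "$root" -mindepth 1 -type d -print0 | tr -dc '\0' | wc -c | tr -d '[:space:]')
    printf '%s directories, %s files\n' "$dirs" "$files"
}

# Usage (placeholder path):
#   count_entries /path/to/assetstore
```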

Important Note!
Make sure you update the signature files before you run DROID. The GUI will prompt you to do this automatically, but you’ll need to do it manually if you are using the CLI. The latest DROID version will only include the signature versions that were available at the time of release, but new signature files get released more frequently. This caught me out when I first ran DROID, as I was running it from the command line on a server without internet access, so inadvertently used signature files that were over a year old, and a quick review of the release notes showed I was missing out on over 700 new and updated PRONOM signatures as a result!

The server I was using didn’t have a GUI so I ran DROID from the command line (in a bash shell terminal using SSH) using this command:

nohup ./droid.sh -R -a "PATH_1" "PATH_2" -p PROFILE_FILE &

Here’s a quick breakdown of each part of the command to explain what it does:

  • nohup
    This ensures the command will keep running even if you close your terminal session or get disconnected for some reason. It also redirects the outputs of the running command to a log file called nohup.out, which is useful for later troubleshooting and can be monitored during operation for error messages and to see the overall progress.
  • ./droid.sh
    The shell script that launches the DROID application and contains various runtime parameters that can be changed, such as the location of the temporary files created during each scan — we ended up using a dedicated 500GB volume as we discovered this can get very large indeed!
  • -R
    Tells DROID to recursively scan files in all subdirectories.
  • -a
    Adds one or more folders/directories for DROID to scan to build a profile.
  • -p
    Specifies the profile file to create when the scan is complete, which will contain the results.
  • &
    The ampersand is a standard shell operator which makes the preceding command run in the background.

As mentioned above, we observed that the working area where DROID stores its temporary files during operation can get very large indeed: at one point whilst scanning the larger assetstore2 volume it exceeded 150GB! The resultant profile file created at the end of each scan is significantly smaller, and the temporary files are deleted after use, but if you are scanning a similarly large number of files do be aware of the potential space requirements before you start. If the available space runs out (as happened to me more than once), DROID will throw an error and stop, and you’ll need to start the scan again.
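One way to avoid being caught out is to check the free space on the temporary volume before launching a scan. The following is a minimal sketch rather than anything DROID provides: has_free_space is a hypothetical helper, the 200GB threshold is an arbitrary assumption, and the paths in the usage comment are placeholders.

```shell
# Hypothetical helper: succeed only if the filesystem holding DIR has at
# least MIN_GB gigabytes available.
has_free_space() {
    local dir="$1" min_gb="$2"
    local avail_kb
    # df -P forces one line per filesystem; -k reports 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

# Usage (placeholder paths, arbitrary threshold):
#   has_free_space /path/to/droid-tmp 200 \
#       && nohup ./droid.sh -R -a "PATH_1" -p PROFILE_FILE &
```

This only checks the space available at the start, so it is still worth keeping an eye on the volume while the scan is running.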

To make the process a bit more manageable, I ran the tool on each volume separately and also split the (much larger) assetstore2 into 2 batches, with the intention of combining the 3 separate result sets at the end.

Exporting the results in each profile to a CSV file was done using this command:

./droid.sh -p PROFILE_FILE -e EXPORT_FILE.csv

It is worth mentioning that DROID assigns an identifier to each item which is unique within the results of each scan. However, when you combine the results of separate scans (as we did), the identifiers are no longer unique and you’ll get duplicate values, meaning you can no longer use the identifier to determine the relationship between items. To mitigate this, before combining our individual result sets I used the awk command to add a string prefix to the values of the ID and PARENT_ID columns:

awk -F ',' -v OFS=',' '{if(NR==1){print; next}; gsub(/"/, "", $1); gsub(/"/, "", $2); $1="\"<prefix>"$1"\""; $2="\"<prefix>"$2"\""; print}' export<#>.csv > export<#>_pf.csv
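To make the effect of that command easier to see, here is a sketch of the same idea wrapped in a function. The name prefix_ids and the prefix a1_ are made up for illustration. It differs from the one-liner above in one respect, which is an assumption about what is wanted: an empty PARENT_ID is left empty rather than being turned into the bare prefix, so top-level items stay parentless.

```shell
# Hypothetical helper: add PREFIX to the ID and PARENT_ID columns
# (columns 1 and 2) of a DROID CSV export, writing the result to stdout.
# Only the first two comma-separated fields are touched, so commas inside
# later quoted fields are split and rejoined unchanged.
prefix_ids() {
    local prefix="$1" infile="$2"
    awk -F ',' -v OFS=',' -v pfx="$prefix" '
        NR == 1 { print; next }    # keep the header row as-is
        {
            gsub(/"/, "", $1)
            gsub(/"/, "", $2)
            $1 = "\"" pfx $1 "\""
            # Assumption: an empty PARENT_ID stays empty.
            $2 = ($2 == "") ? "\"\"" : "\"" pfx $2 "\""
            print
        }' "$infile"
}

# Usage:
#   prefix_ids a1_ export1.csv > export1_pf.csv
```

With the prefix a1_, a row starting "2","1", would come out starting "a1_2","a1_1", while the header row passes through untouched.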

To combine the 3 separate CSV files into one, I used the sed command to remove the header row from the 2nd and 3rd export files before appending them to the first:

cp export1_pf.csv exports_combined.csv && sed '1d' export2_pf.csv >> exports_combined.csv && sed '1d' export3_pf.csv >> exports_combined.csv
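If you end up with more than a handful of exports, the same result can be had from a single awk call that keeps the header of the first file only. This is an alternative sketch, not the command we ran, and combine_exports is a hypothetical name; it assumes the first file listed is not empty.

```shell
# Hypothetical helper: concatenate any number of CSV exports, keeping the
# header row from the first file only.
# FNR restarts at 1 for each input file while NR keeps counting, so
# "FNR == 1 && NR != 1" matches the first line of every file except the
# first one.
combine_exports() {
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Usage:
#   combine_exports export1_pf.csv export2_pf.csv export3_pf.csv > exports_combined.csv
```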

The table below shows how long DROID took to process each volume, how many files and folders/directories were scanned and the size of the resultant profiles and CSV exports:

Now we had the results of our completed scans, the next step was to analyse them and report on what we found.

Analysing the results

To help contextualise the DROID results, we imported the combined results CSV file into a custom table in the DSpace PostgreSQL database where we could link each file to an associated repository item and include information such as the category, date of deposit and what collection the materials are in. To do this, SQL queries were developed that linked the NAME field in the DROID results to the bitstream internal_id in the Apollo database, as DSpace names each file in storage using its ‘internal identifier’ rather than retaining the original file name. For files stored inside containers though, the NAME field instead contains the original filename (as it’s only the parent container file that has an internal identifier assigned) — this meant we couldn’t use the NAME field to link these directly to an existing record in Apollo, and to get around this we used a regular expression to extract the parent internal identifier from the DROID URI path field.
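The exact expression we used isn’t reproduced here, but the idea can be sketched in shell. The sketch assumes that the URI of a file inside a container consists of the parent file’s path, then ‘!/’, then the path inside the container, and that the last segment of the parent’s path is its internal identifier. The function name and the example URI are made up for illustration.

```shell
# Hypothetical helper: print the name of the outermost container file from
# a DROID URI, or nothing if the URI does not point inside a container.
# [^!]*/ consumes everything up to the last slash before the first "!",
# and the captured group is the path segment between that slash and the "!".
parent_internal_id() {
    printf '%s\n' "$1" | sed -n -E 's#^[^!]*/([^/!]+)!.*$#\1#p'
}

# Usage (made-up URI):
#   parent_internal_id 'zip:file:/mnt/assetstore2/12/34/56/12345678901234567890!/docs/readme.txt'
```

Because the match stops at the first ‘!’, a file nested several containers deep still resolves to the outermost container, which is the one that has a record in the repository.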

Being able to link the results to the repository data was important, as there were some files included in the results that we wanted to exclude from our analysis, including derivatives and surrogates such as image thumbnails and text only versions generated from the original content to aid search and discovery. Files within an item in Apollo/DSpace are linked to ‘bundles’, e.g. ‘ORIGINAL’ for the originals and ‘THUMBNAIL’ for the thumbnails, which allowed us to focus our analysis accordingly. Some files had also been ‘soft deleted’ from the repository but hadn’t yet been purged from disk, so we excluded those as well.

With the DROID results successfully imported into the database, we were then able to query them and the existing repository data using SQL and visualise the data using a separate product called Grafana.

Although the analysis work is still ongoing, here is a brief overview of what we’ve found so far:

  • DROID scanned 6,164,393 files and folders in our repository storage file system, including files and folders inside containers
  • Only 2 files in the entire file system could not be scanned at all
  • Excluding folders, DROID found and scanned 5,249,799 files, which is nearly 4½ times the number of files known to the Apollo repository
  • 77% of the files scanned are held inside containers
  • This means that 2% of the files in the repository storage (the containers) contain 77% of the files overall

Visualising and querying the data using Grafana also showed some interesting results:

  • By file count, the most common format name was Tagged Image File Format (fmt/353) at 12% of the files scanned:
  • DROID was unable to identify 41% of the files it scanned in our repository storage
  • There were 14471 unique file extensions!
  • Of the files that DROID was unable to identify, the largest single grouping by extension within that set was .dat:
  • By size, the majority of files DROID identified were videos (8TB), followed by datasets (2.34TB):

A few other observations we made:

  • DSpace renames files when it stores them, using their system identifier and removing the original file extension, so this information is unavailable to DROID at the point of scanning. Files inside containers are unchanged, however, so DROID is able to see their original filename and extension.
  • Approximately 17K rows (0.3%) of the DROID results contained more than one identified format (each with a PUID, MIME type and format name). These results have a value greater than 1 in the FORMAT_COUNT column. The DROID user guide explains that this scenario occurs either when a format is identified purely based on its file extension (as different formats and versions can share the same extension), or when a file is matched to more than one signature (which would indicate that the signatures need to be tweaked to more clearly differentiate the different formats). Most of our results with multiple format matches were indeed matched on the file extension alone, however a small number (989) were matched by signatures.
  • I looked at a sample of these where DROID was reporting that the format matched both x-fmt/263 (‘ZIP Format’) and fmt/280 (‘LaTeX (Master document)’) and compared the bytes of the file (using a hex editor tool) to the byte patterns specified in the PRONOM signatures for each of those formats. LaTeX files are apparently plain text files (normally with a .tex extension) that contain markup commands, which are then fed through a LaTeX ‘compiler’ to render a document (typically in PDF format). The files in my sample were indeed ZIP files, and each contained a PDF seemingly rendered from a LaTeX file and an XML file that included LaTeX commands. The PRONOM signature for LaTeX document files simply looks for the string ‘\documentclass’ at any byte offset from 0 to 4096, so where the XML file is serialised ‘first’ in an uncompressed ZIP file, that byte pattern appears within the byte range specified in the signature. I’m not sure if there is any better way of identifying the LaTeX document format than the current signature, but this is an interesting example of where distinguishing and identifying formats programmatically can be difficult!
Screenshot showing the PRONOM signature for LaTeX (Master document) and the matching bytes in the ZIP file opened in a hex editor
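The LaTeX check described above can be roughly approximated from the command line, which is handy for spot-checking a file without opening a hex editor. This is a simplified sketch and not how DROID itself evaluates signatures; looks_like_latex is a hypothetical name.

```shell
# Hypothetical helper: succeed if the literal string \documentclass starts
# somewhere in the first 4096 bytes of the file (offsets 0 to 4096).
# The string is 14 bytes long, so the first 4096 + 14 bytes are read.
# grep -a treats binary input as text; -F matches the string literally.
looks_like_latex() {
    head -c $(( 4096 + 14 )) "$1" | LC_ALL=C grep -a -q -F '\documentclass'
}

# Usage (placeholder file name):
#   looks_like_latex some_file && echo "matches the LaTeX byte pattern"
```

Run against one of the ZIP files in my sample, a check like this would succeed for the same reason the signature matched: the uncompressed XML near the start of the file contains the string.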

What will we do with this information?

As discussed in our previous blog post on surveying formats, it is important to have up to date and accurate information about the content of the collections we are working on that are in scope for digital preservation, and we can now use the data gathered to help us plan and implement suitable workflows for ingest, preservation and content management under our Digital Preservation Service.

The repository managers have also found this information extremely helpful to understand what is in Apollo and will use it to guide and inform future format-related risk analysis and decision making.

--