Back to Class: Capturing the University of Cambridge Domain

--

The following post is by Caylin Smith and Leontien Talboom.

Although Cambridge University Libraries (CUL) has contributed to the UK Web Archive (UKWA) as part of its responsibilities as a UK Legal Deposit Library, this year marked the first time CUL staff attended the International Internet Preservation Consortium (IIPC) Web Archiving Conference.

The IIPC is a global community focused on the preservation of the web. Specifically, it aims to:

  • identify and develop best practices for selecting, harvesting, collecting, preserving and providing access to Internet content;
  • foster broad international coverage in web archive content through outreach and building curated collaborative collections;
  • develop international advocacy for initiatives and legislation that encourage the collection and preservation of Internet content;
  • encourage and facilitate research use of archived Internet content.

You can read more about the IIPC’s work in the About section of its website.

Besides learning about the ongoing work of members within this community worldwide, Leontien and I presented on scoping the preservation of Cambridge University’s web domain — https://www.cam.ac.uk. This blog post provides an overview of this presentation, and our slide deck can be found in the University’s open access repository.

The University’s web domain is going through a period of transformation. Part of this work involves reviewing existing content to determine whether it’s still relevant and up to date or whether it’s out of date and should be taken down. It also involves creating new templates for content published on websites and pages by members of the University community. And while it’s important to have relevant, up-to-date content on the live website, it’s equally important to keep an archived record of what members of the University community have published online.

Visitors to the University’s homepage are already presented with the new web template, which is more user-friendly on both desktop and mobile devices and looks more contemporary, in line with current web design principles.

Gif showing the University homepage with embedded video and links to social media accounts. Created on May 2, 2023.

A visitor might also notice links to University-created content that is hosted elsewhere on the web. A video hosted on YouTube is commonly embedded in the middle of the homepage, and embedded links also direct visitors to University-created content published on social media platforms.

The University’s website hasn’t always looked this way. The Internet Archive, from what we can tell, created the earliest capture of this website in 1997. The functionality and design are what you would expect for the time: the flat style of the page would’ve been common for the mid-1990s but would look out of date to a present-day user.

Screen capture taken of Internet Archive capture from February 12, 1997.

What’s also interesting to note is the University’s rich history in advancing science and technology disciplines, including developments that led to, or underpin, current web technology. For example, the year 1993 saw the world’s first webcam launched: the Trojan Room Coffee Machine webcam created by the University’s Computer Laboratory allowed staff and researchers to check whether a fresh pot of coffee was brewing without leaving their desks.

So, although it’s important to capture University content published online to have a record of information relating to teaching, learning, research, administration, amongst other topics, researchers consulting archived websites and pages might also be interested in how staff, researchers, and students have used the web over time to communicate about their work at the University.

Screen capture of the webpage about the Trojan Room Coffee Machine.

The University Archives are a main reason for capturing University websites and pages. Here’s a brief overview of this archive:

The University Archives is responsible for the selection, transfer and preservation of the internal, administrative records of the University of Cambridge, dating from 1266 to the present, and for making them available for administrative and research purposes. The University Archives aims to provide a full and richly varied picture over time of the University’s organisation and governance, its key functions and activities, major developments and achievements.

Historically, these materials have been in physical formats but are increasingly created in digital formats, including common office formats and for the web. The documents that inform whether something should be deposited to the Archives — the Records Retention Schedule and the University Archives Collection Policy — are both format agnostic, leaving room to include other formats within this archive as technology continues to develop and present creators with new ways of creating and publishing content.

Publicly available content published to websites or pages within the cam.ac.uk domain is already either captured by the UKWA as part of its annual domain crawl or manually added to this archive by CUL staff. And while CUL will continue to contribute to the UKWA as part of its responsibilities as a UK Legal Deposit Library and take part in this collaborative network of web archiving experts across the LDLs, CUL could benefit from its own web archiving service that would be in addition to the UKWA.
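
As a rough illustration of what "within the cam.ac.uk domain" could mean when scoping a capture, the Python sketch below checks whether a URL's host is cam.ac.uk or one of its subdomains. The function name `in_cam_domain` is hypothetical, and this is not how the UKWA crawler or any CUL tool decides scope. It only shows that a domain-based rule would exclude University-created content hosted elsewhere, such as an embedded YouTube video.

```python
from urllib.parse import urlsplit


def in_cam_domain(url: str, domain: str = "cam.ac.uk") -> bool:
    """Return True if the URL's host is the domain itself or a subdomain of it.

    Hypothetical helper for illustration only.
    """
    # hostname is None for URLs without a network location, so fall back to "".
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    # Require an exact match or a leading dot, so "notcam.ac.uk" is rejected.
    return host == domain or host.endswith("." + domain)
```

Matching on the parsed host rather than on the raw URL string avoids treating a look-alike address such as https://cam.ac.uk.example.com/ as in scope.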

The reasons for this are both policy- and technology-driven. For content published online, the UK’s Legal Deposit Regulations include only publicly available content, so anything behind a login screen is out of scope. As a result, content created by members of the University community that sits behind such a screen is not captured. On the technical side, the crawler used by the UKWA isn’t set up to capture online content that sits behind a login screen.

Screen capture of Cambridge University login screen.

Additionally, the University Archives are entirely separate to the Legal Deposit collection. This archive is guided by its own collection policy and criteria, as well as access conditions. Both the physical and digital files deposited to the University Archives are either held onsite or on CUL-managed infrastructure, further setting apart this archive from the UKWA, which is managed on behalf of all six UK Legal Deposit Libraries by the British Library.

To explore capturing content behind a login screen, we tested two tools: Conifer and HTTrack. Conifer is browser-based and outputs the results into a WARC file, making it directly ready for digital preservation. HTTrack is a website copier and creates the output in the original website structure, including the HTML, CSS, and any additional images or other documents. Both did a great job of capturing content behind the login screen, but each has its own benefits and drawbacks.

Conifer is great as no installation of software is needed, but it takes a long time to capture because the pages that need to be captured have to be visited one by one. HTTrack needs to be downloaded on to a local device, but the capture is then done automatically. However, the HTTrack output is not a WARC file, which raises questions around its suitability for preservation.
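
Because HTTrack writes an ordinary directory of files rather than a WARC file, one possible way to keep track of a mirror is to record a checksum for every file in it. The Python sketch below is illustrative only and was not part of the testing described in this post. The function name `build_manifest` and the idea of a SHA-256 manifest are assumptions, and the code does not call HTTrack or depend on any particular layout of its output.

```python
import hashlib
from pathlib import Path


def build_manifest(mirror_root):
    """Map each file's path (relative to mirror_root) to its SHA-256 digest.

    Hypothetical helper: mirror_root is assumed to be a directory of
    mirrored files, such as the output folder of a website copier.
    """
    root = Path(mirror_root)
    manifest = {}
    # Sort so entries are inserted in a stable, repeatable order.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads each file fully into memory, which is fine for a sketch.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest
```

A manifest like this does not turn the mirror into a WARC file. It only makes it possible to check later that the copied HTML, CSS, images, and other documents have not changed.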

It has been good to explore some of these tools and the possibility of capturing the Intranet, but there are still a number of other tools that we would like to try in the future. We are especially interested in the Browser Based Crawling System For All project and whether the outputs of this project are suitable for our needs.

In terms of next steps, the Digital Preservation team will continue to:

  • Work with CUL archivists to scope web content for University Archives or other areas of collecting (e.g., project websites).
  • Assess web archiving tools and services against requirements.
  • Create a proposal and plan for immediate action.
  • Address content that could disappear as part of Cambridge University website redevelopment.
  • Draft a proposal and plan for business-as-usual web archiving.

If you have any experience of, or lessons learned from, capturing web content behind a login screen, or anything else relevant to this post, please get in touch at digitalpreservation@lib.cam.ac.uk!

--
