Sharing Health Information at Scale: Using the datasets of the Archive of Tomorrow Project

--

The Archive of Tomorrow (AoT) project ran from 2022 to 2023 and focused on collecting health information published online. Cambridge University Library was one of the partners on this project and contributed to the Talking about Health collection, which can now be found on the UK Web Archive. One of our goals during the project was to explore how to make the material available to users who may want to work with it at scale in their research.

This was a difficult task because of the rules around the UK’s non-print legal deposit legislation: digitally published materials collected under this legislation must be accessed onsite at designated terminals in legal deposit libraries unless a licence has been acquired, and text and data mining, or any other type of computational access, is not permitted. These restrictions mean the material can be used for research when websites are viewed individually, but as more and more researchers are interested in using digital material as data, we decided to explore what is possible within the boundaries of the legislation.

This part of the project resulted in a JSON dataset that encapsulates all the metadata held for every target (e.g. a website, or part of a website, selected by a project member) in the Talking about Health collection. This metadata includes when the target was added to the UK Web Archive, which URLs are associated with it, and its descriptions and labels. Alongside this dataset we created a datasheet following the template designed by Emily Maemura et al., as well as a set of notebooks exploring the data.
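As a minimal illustration, a single target record can be explored with a few lines of Python. The field names and values below are invented for illustration only; the actual schema is documented in the dataset’s accompanying datasheet.

```python
import json

# A hypothetical target record; real field names are documented
# in the datasheet that accompanies the dataset.
record_json = """
{
    "title": "Example health blog",
    "created": "2022-03-01",
    "urls": ["https://example.org/health"],
    "description": "A blog discussing public health topics.",
    "labels": ["blogs"]
}
"""

target = json.loads(record_json)

# List the metadata fields available for this target.
for field, value in target.items():
    print(f"{field}: {value}")
```

The same loop, run over every record in the exported file, gives a quick overview of which metadata fields are populated across the collection.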

My colleague Mark Haydn, Metadata Specialist on the project, and I recently presented this work at the IIPC Conference, covering both the possibilities this dataset could open up for researchers and its limitations. The dataset is rich, but it will always be metadata, not the actual data that was crawled. Even so, this work is a great starting point for exploring access to archived web content beyond simply viewing the pages.

Furthermore, the project was very fortunate to work with two Methods Fellows, funded by Cambridge Digital Humanities, who used this dataset for their own research. We provided them with the dataset and any support they needed, including an in-depth tour of the backend tool for adding targets to the UK Web Archive and explanations of what the different fields in the JSON dataset meant.

The two fellows who worked with us were Dr Andrea Kocsis and Susan (Chen) Qu. Andrea focused her research on misinformation, and Susan on health inequalities that may be present in the dataset. Below you can read about their experiences and findings.

Andrea

During my fellowship, I proposed to play with the pandemic-related part of the Talking about Health collection. In my trial project, I was interested in understanding how Covid-19 affected the online filter bubble. For this research, I took one element of Eli Pariser’s concept (cf. 2011): users believe that what they can see on the internet is all the available information. Having already seen this inverted in the case of technological misinformation (5G), I hypothesised that an opposite bias operates in connection with Covid-19: users overestimate the amount of hidden information, which leads to a general mistrust of official information channels. To investigate this theory, I planned to compare the official and unofficial narratives around the pandemic to see where users add knowledge to, or subtract it from, what is offered to them online.

As a first step, I had to find an appropriate definition and framework for understanding what official and unofficial information is. Are we differentiating between information and misinformation? Verified and non-verified sources? Edited and non-edited media? None of these seemed flawless in the context of the online spread of information about the pandemic in the UK. To ground my categorisation, I looked into the frameworks that fact-checking media use in practice to evaluate content in their fight against fake news. This is how I ended up avoiding binary oppositions and settled on the categories of credible, non-credible, and questionable articles.

Secondly, I decided on a workflow to obtain, clean, and interrogate the data based on Digital Humanities methodologies (Fig. 1). As the UK Web Archive does not allow direct distant reading of its content, I worked from JSON files containing metadata of the collected articles, prepared by the AoT web archivists. First, I checked whether the URLs catalogued as Talking about Health articles were still available at the time of re-accessing them. Then I scraped the available full texts, keywords, summaries, and links into a data frame. In the third step, I had to find the subset of articles discussing the pandemic in any form, which I did by combining automated and manual filtering. Then I manually categorised the articles into the credible, non-credible, and questionable groups. Finally, I ran LDA topic modelling on the three categories to discover the most discussed subjects within each credibility frame.

Figure 1: The workflow of the pilot project
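The automated part of the third step, finding pandemic-related articles, can be sketched as a simple keyword match. The keyword list and sample titles below are invented for illustration; they are not the project’s actual filter.

```python
# Illustrative keyword filter for selecting pandemic-related articles.
PANDEMIC_KEYWORDS = {"covid", "coronavirus", "pandemic",
                     "lockdown", "vaccine", "vaccination"}

def mentions_pandemic(text: str) -> bool:
    """Return True if any pandemic keyword appears in the text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in PANDEMIC_KEYWORDS)

articles = [
    "New NHS guidance on booster vaccines",
    "Local history society reopens its reading room",
    "How lockdown changed our eating habits",
]

# Keep only articles that match at least one keyword; in the actual
# workflow this automated pass was followed by manual checking.
pandemic_subset = [a for a in articles if mentions_pandemic(a)]
print(pandemic_subset)
```

A pass like this over-selects and under-selects at the margins, which is why the automated filter was combined with manual review.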

The preliminary results of the topic modelling supported my hypothesis of an inverted filter bubble. The information available online is not everything: the credible sources (e.g. the NHS, governmental websites, universities) withheld information because the science and policy were still in progress. As a result, the non-credible sources (e.g. anti-vax blogs) cherry-picked their own experts, whom they trusted more than those behind the credible sources. The questionable sources (e.g. individuals in forum discussions, news outlets, and magazines) pointed out that the science was not ready yet; they could therefore mostly speculate, repeat credible statements, or look for alternative answers and community support.

As I used data that was still being collected, I consider this workflow a pilot for a more extensive investigation. I have already started to re-collect the data from the finished Talking about Health dataset and have also enlarged the pool with the help of the COVID-19 dataset of the UK Web Archive. As the resource is incredibly rich, I have planned a second phase of the project as well: an additional step to analyse, with the help of machine learning, whether the alteration of the narrative appearing in the non-credible sources is predictable from the different phrasings and wordings of the credible statements.

Nonetheless, my investigation had its limitations. Firstly, I could not study narrative changes over time: the need to re-download the texts for distant reading meant I could only see the articles at their current timestamps, which resulted in a significant loss of information. Secondly, my data collection could not be systematic, as the Talking about Health dataset is not balanced geographically or by genre. However, by factoring these flaws into the research it is possible to counterbalance them, and with fitting research questions the Talking about Health dataset is a fascinatingly rich resource for understanding how contemporary society communicates about health-related topics in the UK.

Susan

It was a privilege for my research on public health, policy impact, and the pandemic to use this unique archive, which also echoes my enduring interest in engineering and biology since childhood. Here I outline my experience of using the archive during the AoT project and my reflections on this interdisciplinary research topic.

The UK Web Archive gave me access to the Annotation Curation Tool, a browser-based tool in which web archivists add targets to the UK Web Archive. There I could browse interesting collections, from European Parliament Elections to Film in the UK, and export data in three formats. But an archive is often unlike Google: even in electronic form, a keyword-based search can be challenging, especially when the topic is as niche as mine.

Deeper exploration based on the exported health and medicine datasets proved helpful. I selected the records that included keywords and counted the most common words in them (see Fig. 2). The counts below suggest that mental health and Ireland might be the most prominent concerns in this dataset. But was this the whole truth, and how well can it represent the content of the data?

Figure 2: Most common words in the dataset
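A tally like the one behind Figure 2 can be reproduced with the standard library. The keyword strings below are invented for illustration; the real values come from the exported dataset’s keyword field.

```python
from collections import Counter

# Invented keyword strings standing in for the dataset's keyword field.
keywords = [
    "mental health, ireland",
    "mental health, children",
    "public health, ireland",
    "mental health, policy",
]

# Split each keyword string into individual words and tally them.
counts = Counter(
    word
    for entry in keywords
    for word in entry.replace(",", " ").split()
)

print(counts.most_common(3))
# → [('health', 4), ('mental', 3), ('ireland', 2)]
```

Counting single words like this splits multi-word phrases such as “mental health”, which is one reason the raw counts need careful interpretation.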

For instance, I then found some collected reports and blogs related to general topics, like “health”, rather than to policies during Covid-19. This revealed a flaw of the AoT dataset: even though topic modelling and sentiment analysis can be applied to the texts, unstructured data drawn from varied but limited sources, collected without a preset strategy, can skew targeted research both qualitatively and quantitatively, not to mention the time-consuming data cleaning and the ethical challenges involved.

Nevertheless, I see the potential of web archives for social, political, and health studies. The selected texts, after further text analysis, indicated that public health in the pandemic had been considered from several facets different from my hypothesis about physical activity and lockdowns. Certain topics, such as fatherhood in childcare, were also enlightening, expanding debates on public health and enabling a historical policy thread, which is both an advantage of and an amusement in archive studies. And thanks to emerging combinations of AI and Python, I believe we have not only Archives of, but also more Hopes for, Tomorrow.

I thank the team and the CDH community, especially Anne Alexander, Caylin Smith, Leontien Talboom, and the other fantastic fellow, Andrea Kocsis; it was encouraging to learn from and chat with you. This blog is dedicated to all who have fought against SARS and Covid-19 in China, the UK, and globally.

--