Archiving Public Health Discourse on the Web

Published in Digital Preservation at Cambridge University Libraries, May 23, 2022

I am Leontien Talboom, the recently appointed web archivist at Cambridge University Library (CUL). As one of the six UK Legal Deposit Libraries, CUL contributes to the UK Web Archive (UKWA), which tries to capture as much UK activity on the web as possible. My role entails ensuring that the web presence of the University is captured in the UKWA, but the biggest part of my job is dedicated to the Archive of Tomorrow (AoT) Project. Led by the National Library of Scotland, with Cambridge University Library, the Bodleian Libraries and Edinburgh University Library as co-partners, this project is building a collection of public health discourse on the web, and we have recently started collecting material for it.

The material within this collection will try to cover as many different perspectives as possible on different topics around health. To name a few examples, this content could be official government information, a blog post by someone online, or a support group on a social media platform. The topics being collected are also very broad, ranging from reproductive health to cancer. It should be noted, though, that as the collection framework starts in 2019, a big part of this collection will focus on information around Covid-19 and the pandemic.

This blog post will explore how online information around a certain topic is collected and labelled, using treatments for Covid-19 as the example throughout. I will first outline how information is found in the online sphere, then I will talk a little bit about how we are tagging the collected information.

Finding information online

As part of this project, we are trying to cover as much primary source material as possible, so that we capture as many perspectives as possible around a certain topic. This raises the question of how to start looking for information around a topic. For treatments for Covid-19, there are several very useful Wikipedia pages that outline the main treatments that can be explored. This is a great starting point and gives a good overview of potential search terms that can be used to find information online. However, it should be kept in mind that these Wikipedia pages could have been edited by anyone.

Other starting points could be official pages, such as those created by the UK’s National Health Service (NHS), as these also give a concise overview of treatments. However, this is an official source and therefore only covers approved treatments for Covid-19 and not necessarily home remedies or potentially dangerous treatments that we would also like to capture in this collection.

After having a look at pages like this and getting an idea of potential search terms, different search engines can be used to navigate and find targets (the term used for websites to archive) online. To ensure maximum coverage, the choice has been made to use several different search engines, as they seem to give different results. An example of this is a treatment promoted for Covid-19 called ‘Miracle Mineral Solution’. It is also referred to as ‘Mineral Miracle Solution’ or ‘Sodium Chlorite Solutions’, but is most often abbreviated to ‘MMS’.

When searching for this term on two different search engines, different results can be seen (see Figures 1 and 2). DuckDuckGo gives several pages with information on what MMS is and where to buy it, whereas Google’s top search results provide a very different view and showcase a number of official sources on why this treatment is a health risk: MMS is essentially diluted bleach. The approach that Google has taken is part of its wider mission to tackle misinformation online. This is great for the average user of Google, but not as great for us web archivists, who are trying to cover as many different information outputs as possible, not only the official sources warning about the potential risks.

***Figure 1: Search engine results for ‘Mineral Miracle Solution’ in an incognito browser on DuckDuckGo***

***Figure 2: Search engine results for ‘Mineral Miracle Solution’ in an incognito browser on Google***
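The comparison between search engines described above can be made a little more systematic. The following is only a minimal sketch, not a tool the project uses: it assumes the result URLs for each search engine have already been copied by hand into lists, and the function names `host` and `compare_results` are hypothetical. It reports which hosts appear in the results of one search engine but not in any other.

```python
from urllib.parse import urlsplit


def host(url: str) -> str:
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    netloc = urlsplit(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def compare_results(results: dict[str, list[str]]) -> dict[str, set[str]]:
    """For each search engine, return the hosts no other engine returned.

    `results` maps a search engine name to the result URLs collected for
    the same search term (assumed to be gathered manually).
    """
    hosts = {engine: {host(u) for u in urls} for engine, urls in results.items()}
    unique = {}
    for engine, found in hosts.items():
        # Everything that any of the *other* engines returned.
        others = set().union(*(h for e, h in hosts.items() if e != engine))
        unique[engine] = found - others
    return unique


if __name__ == "__main__":
    # Placeholder URLs only; these are not real search results.
    collected = {
        "engine_a": ["https://www.example.org/what-is-it", "https://shop.example.com/"],
        "engine_b": ["http://example.org/warning", "https://health.example.net/risks"],
    }
    for engine, only_here in compare_results(collected).items():
        print(engine, sorted(only_here))
```

Hosts that only one search engine surfaces are candidates for a closer look, since they are the ones most easily missed when relying on a single search engine.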

This is part of a wider issue: official health information is easier to capture and preserve. Official information comes from sources such as the UK government or the NHS, meaning that there is an infrastructure in place to maintain it online. However, smaller grassroots organisations, campaigns and social media pages may not have the same resources, which puts their published web content at higher risk: it may not stay online as long as the more official sources, and therefore we may miss out on capturing it.

Another issue is that search engines and social media platforms are tackling misinformation online. As the Google search results in Figure 2 showcase, they may provide extra information or present their search results in a different way. Some social media platforms, such as Facebook, take a more aggressive approach and try to remove this information altogether, to stop the spread of misinformation.

This is great, but it is again something that the AoT project needs to keep in mind, as this information will not end up in our collection, and therefore we may miss out on a source of information. We could also potentially archive an echo of this information. An example is a BBC article which outlines fake health advice around Covid-19. The article includes a number of screenshots and links to misinformation, which means that we can capture the article and showcase what this misinformation looked like, but it is not the primary source.

Labelling collections with public health discourse

After capturing web material, the web archivists working on the project label these targets in our collection. This makes it easier for future users to navigate the collection and explore a number of themes within it, which is especially important as the project is aiming to archive 10,000 URLs.

We have stepped away from using labels such as ‘information’ and ‘misinformation’. This choice was made quite early on in the project, as the divide between information and misinformation does not seem to be as black and white as first thought. This is highlighted well by the Poynter database set up by a number of fact-checking organisations. Their different ratings showcase how many different labels can be attributed to misinformation, ranging from ‘missing context’, to ‘misleading’, to ‘no evidence’.

The available information also evolves over time. This is especially true for the more uncertain areas of health, and the treatments for Covid-19 are a great example. MMS, discussed above, is quite easily seen as something dangerous, as it is essentially diluted bleach. However, there are other treatments, such as the drug Ivermectin, where this divide is not as clear. Ivermectin is a drug used to treat other diseases, and in 2021 it was part of a clinical trial at Oxford and seen as a potential treatment for Covid-19. The drug was seen as a possible cure at the time but has now been proved to be ineffective. This highlights how difficult it could be to label such material, and how changing trials and perspectives over time are part of this collection.

Instead of labelling information as true or false, we are labelling it by topic, such as treatments or Covid-19 for the above examples. We also hope to be more transparent about why we have added certain targets to our collection or why we have labelled a target in a certain way. We are still finding a way to present this to the user, but for now we are all making sure to document this in a shared spreadsheet. The same applies to making misinformation available as part of this collection: as we are not labelling it explicitly, there needs to be some type of warning or disclaimer for users when they access our collection.
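To illustrate what documenting a target with topic labels and a rationale could look like, here is a minimal sketch. It does not show the actual layout of the shared spreadsheet: the `Target` class, its field names and the `to_csv` helper are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Target:
    """One website to archive. All field names here are hypothetical."""
    url: str
    topics: list[str]   # topic labels, not true/false judgements
    rationale: str      # why the target was added or labelled this way


FIELDS = ["url", "topics", "rationale"]


def to_csv(targets: list[Target]) -> str:
    """Serialise targets as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for target in targets:
        writer.writerow({
            "url": target.url,
            # Sort so the same labels always produce the same cell.
            "topics": "; ".join(sorted(target.topics)),
            "rationale": target.rationale,
        })
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder entry; not a real record from the collection.
    example = Target(
        url="https://example.org/mms",
        topics=["Treatments", "Covid-19"],
        rationale="Placeholder: explains why this target was selected",
    )
    print(to_csv([example]))
```

Keeping the rationale next to the topic labels means the reasoning behind a selection travels with the record, whichever way it is eventually presented to users.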

This blog post outlines some of the current challenges that we are facing when collecting and labelling material for this project. The collaborative nature of the project is a great help with these challenges, as there is enough room for discussion with multiple organisations, and also with partner organisations such as the British Library. As the project is also trying to involve stakeholders and potential users of the collections, a number of workshops are being run over the course of the project to get feedback and ideas for these challenges.

Written by Leontien Talboom