Archives Unleashed Project scales up with Archive-It for better collection and analysis of digital history

Tuesday, July 28, 2020

By Joe Petrik, Cheriton School of Computer Science. This story is jointly issued by the faculties of Arts and Mathematics at the University of Waterloo and York University Libraries.

Suppose you’re an archivist, librarian, or historian who’s trying to document and preserve for posterity a narrative of the COVID-19 pandemic or the ongoing Black Lives Matter protests. You’ll naturally be gathering documents from the web, and with tools available today, it won’t be difficult to accumulate thousands or even millions of relevant records. How can you make sure that a scholar down the road can actually use the material that you’ve collected?


Right now, working with data at scale is difficult for historians and other scholars in the humanities and social sciences. Since 2017, the Archives Unleashed Project has been at the forefront of making this possible, through accessible tools, platforms, and learning materials. This next project will combine the Archives Unleashed Project’s analytical tools with the Internet Archive’s Archive-It service, a best-in-class web archiving and access solution and infrastructure, to further lower barriers in web archiving and provide an end-to-end process for collecting and studying archived web records and data.

The Andrew W. Mellon Foundation has awarded a $1,084,087 CAD grant to the University of Waterloo to support the “Integrating Archives Unleashed Cloud with Archive-It” project. Led by Professor Ian Milligan, from the University of Waterloo’s Department of History, alongside co-investigators Jimmy Lin, Professor and Cheriton Chair at Waterloo’s Cheriton School of Computer Science, Nick Ruest, Digital Assets Librarian in the Digital Scholarship Infrastructure department of York University Libraries, and Jefferson Bailey, Director of Web Archiving & Data Services at the Internet Archive, this project represents the next stage of the Archives Unleashed Project. With this funding from The Andrew W. Mellon Foundation, the project hopes to bridge the current gulf between web archiving collection, access, and data-driven analysis.

“Data is rapidly becoming the building blocks of our histories.”

- Prof. Ian Milligan

Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. It’s critical to preserve webpages: we have all encountered “404 Not Found” errors as we browse the web, reminders that information is continually lost, goes missing, or is deleted. Think of how many people have experienced the world during the social distancing of COVID-19: our news, social interactions, learning, working, and beyond. “Data is rapidly becoming the building blocks of our histories,” Milligan explains. As future historians try to piece together our current moment, from exploring misinformation to privacy concerns to social media phenomena, they will need tools and platforms to make sense of all this information.

Interdisciplinary collaboration essential

Partnering with librarians and archivists such as Bailey and Ruest is essential both for applying cutting-edge approaches to the ethically informed extraction and arrangement of web archival data and for creating the documentation and learning guides that ensure people can use these materials. Drawing also on Lin’s information retrieval background and Milligan’s subject-matter expertise as a historian, the interdisciplinary team is confident that future users will be able to make sense of the web archive data their tools generate.

This project represents a follow-up to an effort that began in 2017 with the same name, also funded by The Andrew W. Mellon Foundation, to develop web archive search and data analysis tools. Armed with these powerful tools, researchers, scholars and archivists now have the ability to access, share and investigate our online history since the early days of the World Wide Web, including many culturally significant events that are interwoven into the basic fabric of our collective consciousness such as 9/11.

The success of Archives Unleashed has resulted in The Andrew W. Mellon Foundation funding a new three-year phase of the project. This new effort will combine the services that Archives Unleashed has developed with those of the Internet Archive’s Archive-It and Archive-It Research Services programs. Archive-It is a web archiving and digital preservation service used by over 700 institutions around the world. Users, from universities and cultural organizations to governments and NGOs, have used the service to preserve tens of billions of web records and many petabytes of data. “Researchers, from both the sciences and the humanities, are finally starting to realize the massive trove of archived web materials that can support a wide variety of computational research,” said Bailey. “We are excited to scale up our collaboration with Archives Unleashed to make the petabytes of web and data archives collected by Archive-It partners and others more useful for scholarly analysis.”

Next logical step

“Our first stage of the Archives Unleashed Project,” explains Lin, “built a stand-alone service that turns web archive data into a format that scholars could easily use. We developed several tools, methods and cloud-based platforms that allow researchers to download a large web archive from which they can analyze all sorts of information, from text and network data to statistical information. The next logical step is to integrate our service with the Internet Archive, which will allow a scholar to run the full cycle of collecting and analyzing web archival content through one portal.”

Consider again the future historians struggling with thousands upon thousands of COVID-19 pages. With Archives Unleashed Project tools and platforms, they could take all the documents collected about COVID-19 and use them to explore research questions: What were the most common words people used to describe the pandemic? Which sites did pages link to for information about COVID-19? Were people linking to the Public Health Agency of Canada, to Public Health Ontario, to the World Health Organization, to various news websites, or to personal websites, maybe even conspiracy theory websites?

The funding from The Andrew W. Mellon Foundation will make the indispensable integration between collection and these forms of analysis a reality. Ruest explains, “With this new funding from The Andrew W. Mellon Foundation, the integration of Archives Unleashed and the Internet Archive’s analysis service will further unleash the potential of web archive data. Imagination, not tools, will become the limit of scholarship.”

Photo: Internet Archive servers (by Ian Milligan).