The Many Shapes of Archive-It

Shawn M. Jones; Alexander Nwala; Michele C. Weigle; Michael L. Nelson

arXiv:1806.06878 · cs.DL · January 26, 2021

TL;DR

This paper uses structural metadata from Archive-It web collections to characterize their curation and crawling behaviors. These structural features allow a classifier to predict a collection's semantic category (F1 score of 0.720) without downloading every memento, saving time and bandwidth.

Contribution

The paper introduces structural features and a classification method for categorizing Archive-It collections, connecting structural data to semantic understanding.

Findings

01

A Random Forest classifier achieved an F1 score of 0.720 in predicting collection categories.

02

Structural features effectively reveal curation and crawling behaviors.

03

The method reduces the time and bandwidth needed for collection analysis because it does not require downloading each memento.
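
For readers unfamiliar with the F1 score reported in the first finding, the following is a minimal sketch of how such a score can be computed for a multi-class prediction task. It is not the authors' evaluation code. The category labels are hypothetical, and macro averaging is an assumption, since this page does not say how the 0.720 figure was averaged across categories.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return a {label: F1} mapping computed from true and predicted labels."""
    labels = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1   # correct prediction for this label
        else:
            fp[guess] += 1   # predicted label was wrong
            fn[truth] += 1   # true label was missed
    scores = {}
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging, see lead-in)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical collection categories, for illustration only.
y_true = ["category_a", "category_a", "category_b", "category_b"]
y_pred = ["category_a", "category_b", "category_b", "category_b"]
score = macro_f1(y_true, y_pred)
```

A weighted average, which scales each category's F1 by its number of collections, would be an equally plausible reading of a single reported figure.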

Abstract

Web archives, a key area of digital preservation, meet the needs of journalists, social scientists, historians, and government organizations. The use cases for these groups often require that they guide the archiving process themselves, selecting their own original resources, or seeds, and creating their own web archive collections. We focus on the collections within Archive-It, a subscription service started by the Internet Archive in 2005 for the purpose of allowing organizations to create their own collections of archived web pages, or mementos. Understanding these collections could be done via their user-supplied metadata or via text analysis, but the metadata is applied inconsistently between collections and some Archive-It collections consist of hundreds of thousands of seeds, making it costly in terms of time to download each memento. Our work proposes using structural metadata…
