The Many Shapes of Archive-It
Shawn M. Jones, Alexander Nwala, Michele C. Weigle, Michael L. Nelson

TL;DR
This paper uses structural metadata from Archive-It web collections to understand how they were curated and crawled. These structural features support automatic semantic categorization of collections (0.720 F1 with a Random Forest classifier) without downloading every memento, which saves time and bandwidth.
Contribution
It introduces structural features and a classification method to categorize Archive-It collections, bridging structural data with semantic understanding.
Findings
A Random Forest classifier achieved an F1 score of 0.720 when predicting collection categories.
Structural features reveal curation and crawling behaviors.
The method reduces the time and bandwidth needed for collection analysis.
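
For readers unfamiliar with the F1 metric cited in the findings, the following is a minimal sketch of how a per-class and averaged F1 score can be computed from true and predicted labels. It is not the authors' evaluation code. The function names `f1_per_class` and `macro_f1`, the placeholder labels `"a"`, `"b"`, `"c"`, and the choice of macro averaging are all assumptions made for illustration; the summary does not say which averaging the paper used.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} for paired true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1          # correct prediction for this label
        else:
            fp[guess] += 1          # predicted label was wrong
            fn[truth] += 1          # true label was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (an assumed averaging choice)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical labels standing in for collection categories.
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
per_class = f1_per_class(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
```

In this toy data, one collection of category `"a"` is misclassified as `"b"`, which lowers the F1 of both classes while leaving `"c"` at 1.0. A score such as 0.720 therefore reflects a balance of missed and spurious category assignments rather than plain accuracy.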
Abstract
Web archives, a key area of digital preservation, meet the needs of journalists, social scientists, historians, and government organizations. The use cases for these groups often require that they guide the archiving process themselves, selecting their own original resources, or seeds, and creating their own web archive collections. We focus on the collections within Archive-It, a subscription service started by the Internet Archive in 2005 for the purpose of allowing organizations to create their own collections of archived web pages, or mementos. Understanding these collections could be done via their user-supplied metadata or via text analysis, but the metadata is applied inconsistently between collections and some Archive-It collections consist of hundreds of thousands of seeds, making it costly in terms of time to download each memento. Our work proposes using structural metadata…
