GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs
Ying-Hsiang Huang, Claire Gong, Shreya Shaji, Alison Yan, Leslie Harka, Albert Du, Anjali Gopal, Samuel J Klein, Shannon Zejiang Shen, Mark Phillips, Trevor Owens, Kyle Deeds, Benjamin Charles Germain Lee

TL;DR
GovScape is a multimodal search system that supports advanced filtering, semantic search, and visual search across more than 10 million government PDFs, improving the discoverability of federal documents.
Contribution
The paper introduces a scalable multimodal search platform for large collections of government PDFs, combining metadata, text, and visual search capabilities, with open source code.
Findings
GovScape supports four primary search modes, including semantic and visual search.
Pre-processing the roughly 10 million PDFs costs approximately $1,500, which the authors present as evidence of scalability.
The paper details the system architecture and provides open source code for community use.
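To put the reported pre-processing cost in perspective, the following is a minimal sketch of the unit-cost arithmetic. It uses the approximate $1,500 figure from the findings together with the PDF and page counts quoted in the abstract (10,015,993 PDFs, 70,958,487 pages). The function name `unit_costs` and the assumption that the $1,500 covers every page of every PDF are illustrative only and do not come from the paper.

```python
# Figures quoted in this summary; the cost is approximate as reported.
TOTAL_PDFS = 10_015_993
TOTAL_PAGES = 70_958_487
APPROX_COST_USD = 1_500


def unit_costs(total_cost_usd: float, pdfs: int, pages: int) -> dict[str, float]:
    """Derive per-PDF and per-page cost (hypothetical helper, not from the paper).

    Assumes the total cost covers pre-processing of all pages of all PDFs.
    """
    return {
        "usd_per_pdf": total_cost_usd / pdfs,
        "usd_per_page": total_cost_usd / pages,
        "pages_per_pdf": pages / pdfs,
    }


costs = unit_costs(APPROX_COST_USD, TOTAL_PDFS, TOTAL_PAGES)
for name, value in costs.items():
    print(f"{name}: {value:.4g}")
```

Under these assumptions the cost works out to roughly 0.015 cents per PDF, with an average of about seven pages per PDF, which is consistent with the collection being limited to PDFs of 50 pages or under.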
Abstract
Efforts over the past three decades have produced web archives containing billions of webpage snapshots and petabytes of data. The End of Term Web Archive alone contains, among other file types, millions of PDFs produced by the federal government. While preservation with web archives has been successful, significant challenges for access and discoverability remain. For example, current affordances for browsing the End of Term PDFs are limited to downloading and browsing individual PDFs, as well as performing basic keyword search across them. In this paper, we introduce GovScape, a public search system that supports multimodal searches across 10,015,993 federal government PDFs from the 2020 End of Term crawl (70,958,487 total PDF pages) - to our knowledge, all renderable PDFs in the 2020 crawl that are 50 pages or under. GovScape supports four primary forms of search over these 10…
