Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval
Adrià Molina, Oriol Ramos Terrades, Josep Lladós

TL;DR
Fetch-A-Set is a large-scale, OCR-free benchmark for retrieval over complex historical documents dating back to the 17th century. It targets extractive tasks in cultural heritage analysis and accommodates varying levels of document legibility.
Contribution
The paper introduces Fetch-A-Set, a comprehensive benchmark dataset for historical document retrieval, filling a gap in large-scale, OCR-free evaluation resources for cultural heritage research.
Findings
Provides a vast repository of documents dating back to the 17th century
Includes baseline results for retrieval tasks
Addresses varying levels of document legibility
Abstract
This paper introduces Fetch-A-Set (FAS), a comprehensive benchmark tailored for legislative historical document analysis systems, addressing the challenges of large-scale document retrieval in historical contexts. The benchmark comprises a vast repository of documents dating back to the XVII century, serving both as a training resource and an evaluation benchmark for retrieval systems. It fills a critical gap in the literature by focusing on complex extractive tasks within the domain of cultural heritage. The proposed benchmark tackles the multifaceted problem of historical document analysis, including text-to-image retrieval for queries and image-to-text topic extraction from document fragments, all while accommodating varying levels of document legibility. This benchmark aims to spur advancements in the field by providing baselines and data for the development and evaluation of robust…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Natural Language Processing Techniques
