TL;DR
This paper introduces the task of detecting Latin fragments in noisy, mixed-language historical documents, provides a multimodal benchmark dataset of 724 annotated pages, and evaluates how well large foundation models perform on it.
Contribution
It presents a novel multimodal benchmark dataset and evaluates large foundation models for Latin detection in noisy, mixed-language historical texts, establishing a baseline for future research.
Findings
Zero-shot models can reliably detect Latin fragments.
Current models lack a functional comprehension of the Latin language.
Benchmark dataset and code are publicly available.
Abstract
This paper presents a novel task of extracting low-resourced and noisy Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary zero-shot models is achievable, yet these models lack a functional comprehension of Latin. This study establishes a comprehensive baseline for processing Latin within mixed-language corpora, supporting quantitative analysis in intellectual history and historical linguistics. Both the dataset and code are available at https://github.com/COMHIS/EACL26-detect-latin.