Quid est VERITAS? A Modular Framework for Archival Document Analysis
Leonardo Bassanini, Ludovico Biancardi, Alfio Ferrara, Andrea Gamberini, Sergio Picascia, Folco Vaglienti

TL;DR
VERITAS is a modular framework that enhances digitisation of historical documents by integrating transcription, layout analysis, and semantic enrichment, improving accuracy and efficiency.
Contribution
VERITAS introduces a schema-driven, integrated workflow for archival document analysis that combines transcription, layout analysis, and semantic enrichment in a single flexible, model-agnostic pipeline.
Findings
Achieved a 67.6% reduction in word error rate compared with commercial OCR.
Reduced end-to-end processing time, manual correction included, by a factor of three.
Demonstrated utility in historical inquiry through retrieval-augmented generation.
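
To make the word error rate finding concrete, the following is a minimal sketch of how WER and a reduction between two systems are commonly computed. It assumes the standard definition (word-level edit distance divided by the number of reference words) and assumes the reported 67.6% is a relative reduction; the summary does not state either, and the function names and toy strings are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline error removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Toy strings, not data from the paper: one substitution and one deletion
# against a four-word reference.
toy_wer = word_error_rate("a b c d", "a x c")
```

Under this reading, a system that halves the baseline WER has a relative reduction of 0.5, whatever the absolute values are.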
Abstract
The digitisation of historical documents has traditionally been conceived as a process limited to character-level transcription, producing flat text that lacks the structural and semantic information necessary for substantive computational analysis. We present VERITAS (Vision-Enhanced Reading, Interpretation, and Transcription of Archival Sources), a modular, model-agnostic framework that reconceptualises digitisation as an integrated workflow encompassing transcription, layout analysis, and semantic enrichment. The pipeline is organised into four stages - Preprocessing, Extraction, Refinement, and Enrichment - and employs a schema-driven architecture that allows researchers to declaratively specify their extraction objectives. We evaluate VERITAS on the critical edition of Bernardino Corio's Storia di Milano, a Renaissance chronicle of over 1,600 pages. Results demonstrate that the…
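
The four-stage, schema-driven organisation described in the abstract can be sketched as follows. This is not the VERITAS API: every name here (`FieldSpec`, `Document`, `run`, the stage functions) is hypothetical, and a regular expression stands in for whatever model performs extraction. The sketch only shows the shape of the idea, in which a declarative schema is passed through interchangeable stages.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class FieldSpec:
    """One declaratively specified extraction target (hypothetical)."""
    name: str
    pattern: str  # regex used here as a stand-in for a model-based extractor


@dataclass
class Document:
    raw: str
    text: str = ""
    fields: dict[str, list[str]] = field(default_factory=dict)
    metadata: dict[str, object] = field(default_factory=dict)


Stage = Callable[[Document, list[FieldSpec]], Document]


def preprocess(doc: Document, schema: list[FieldSpec]) -> Document:
    # Stand-in for image cleanup: normalise whitespace in the raw input.
    doc.text = " ".join(doc.raw.split())
    return doc


def extract(doc: Document, schema: list[FieldSpec]) -> Document:
    # The schema, not the code, decides what is extracted.
    for spec in schema:
        doc.fields[spec.name] = re.findall(spec.pattern, doc.text)
    return doc


def refine(doc: Document, schema: list[FieldSpec]) -> Document:
    # Stand-in for correction: drop duplicates while keeping first-seen order.
    for name, values in doc.fields.items():
        doc.fields[name] = list(dict.fromkeys(values))
    return doc


def enrich(doc: Document, schema: list[FieldSpec]) -> Document:
    # Stand-in for semantic enrichment: attach simple derived metadata.
    doc.metadata["field_counts"] = {n: len(v) for n, v in doc.fields.items()}
    return doc


PIPELINE: list[Stage] = [preprocess, extract, refine, enrich]


def run(raw: str, schema: list[FieldSpec], stages: list[Stage] = PIPELINE) -> Document:
    doc = Document(raw=raw)
    for stage in stages:
        doc = stage(doc, schema)
    return doc


# Toy input, not text from the evaluated edition.
schema = [FieldSpec(name="year", pattern=r"\b1[0-9]{3}\b")]
result = run("Entry  for 1494\nand 1499; see also 1494.", schema)
```

Because each stage shares one signature, a stage can be swapped for a different model without touching the others, which is one plausible reading of "model-agnostic" in the abstract.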
