Quid est VERITAS? A Modular Framework for Archival Document Analysis

Leonardo Bassanini; Ludovico Biancardi; Alfio Ferrara; Andrea Gamberini; Sergio Picascia; Folco Vaglienti

arXiv:2603.28108·cs.DL·March 31, 2026

TL;DR

VERITAS is a modular framework that enhances digitisation of historical documents by integrating transcription, layout analysis, and semantic enrichment, improving accuracy and efficiency.

Contribution

It introduces a schema-driven, integrated workflow for archival document analysis, combining multiple stages into a flexible, model-agnostic pipeline.

Findings

01 Achieved a 67.6% reduction in word error rate relative to commercial OCR.

02 Cut end-to-end processing time threefold once manual correction is taken into account.

03 Demonstrated utility for historical inquiry through retrieval-augmented generation.
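
For readers unfamiliar with the metric behind the first finding, the following is a minimal sketch of how word error rate and a reduction in it are commonly computed. It assumes the reported figure is a relative reduction, which this page does not confirm. The function names and the toy numbers are illustrative and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if the words match)
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_wer, system_wer):
    """Fraction of the baseline error removed by the system."""
    return (baseline_wer - system_wer) / baseline_wer


# Toy values only (hypothetical, not results from the paper).
toy_wer = wer("nel anno mille quattrocento", "nel anno nille quattrocento")
toy_gain = relative_reduction(0.20, 0.05)
```

Under this definition, a system that lowers a baseline WER of 0.20 to 0.05 removes three quarters of the baseline's errors.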

Abstract

The digitisation of historical documents has traditionally been conceived as a process limited to character-level transcription, producing flat text that lacks the structural and semantic information necessary for substantive computational analysis. We present VERITAS (Vision-Enhanced Reading, Interpretation, and Transcription of Archival Sources), a modular, model-agnostic framework that reconceptualises digitisation as an integrated workflow encompassing transcription, layout analysis, and semantic enrichment. The pipeline is organised into four stages - Preprocessing, Extraction, Refinement, and Enrichment - and employs a schema-driven architecture that allows researchers to declaratively specify their extraction objectives. We evaluate VERITAS on the critical edition of Bernardino Corio's Storia di Milano, a Renaissance chronicle of over 1,600 pages. Results demonstrate that the…
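
The four-stage, schema-driven organisation described in the abstract can be pictured with a small sketch. This is not the VERITAS API: every name here (`FieldSpec`, `Document`, `run_pipeline`, and the stage functions) is hypothetical, and the stage bodies are trivial placeholders standing in for whatever models a real deployment would plug in.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FieldSpec:
    """One declaratively specified extraction target (hypothetical)."""
    name: str
    description: str


@dataclass
class Document:
    pages: list[str]
    text: str = ""
    records: list[dict] = field(default_factory=list)


# A stage takes the document and the schema and returns the document.
Stage = Callable[[Document, list[FieldSpec]], Document]


def preprocess(doc, schema):
    # Placeholder: drop empty pages and trim whitespace.
    doc.pages = [p.strip() for p in doc.pages if p.strip()]
    return doc


def extract(doc, schema):
    # Placeholder for transcription and layout analysis.
    doc.text = "\n".join(doc.pages)
    return doc


def refine(doc, schema):
    # Placeholder for post-correction: normalise whitespace.
    doc.text = " ".join(doc.text.split())
    return doc


def enrich(doc, schema):
    # Placeholder for semantic enrichment: one empty record shaped by the schema.
    doc.records = [{spec.name: None for spec in schema}]
    return doc


def run_pipeline(doc, schema, stages=(preprocess, extract, refine, enrich)):
    """Apply the stages in order; any stage can be swapped for another callable."""
    for stage in stages:
        doc = stage(doc, schema)
    return doc


schema = [FieldSpec("date", "date of the event"), FieldSpec("place", "where it occurred")]
result = run_pipeline(Document(pages=["  Anno 1450  ", "", "Milano"]), schema)
```

Keeping the schema as plain data and the stages as interchangeable callables is one way to read "model-agnostic": the extraction targets stay fixed while the component behind each stage can change.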
