Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin
Thibault Clérice, Rachel Bawden, Anthony Glaise, Ariane Pinche, David Smith

TL;DR
This paper introduces Pre-Editorial Normalization (PEN), a task that converts Automatic Text Recognition (ATR) output for medieval manuscripts into normalized, editorially consistent text, bridging the gap between palaeographic fidelity and practical usability.
Contribution
It defines the PEN task, creates a large training corpus and evaluation set, and develops a normalization model that significantly improves over previous approaches.
Findings
Achieved a 6.7% character error rate (CER) with the normalization model.
Created a new dataset aligned with Old French and Latin editions.
Demonstrated the effectiveness of ByT5-based models for normalization.
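For readers unfamiliar with the metric behind the first finding, CER is conventionally computed as the character-level edit distance between a system output and a reference, divided by the reference length. The following is a minimal sketch of that conventional definition, not the paper's evaluation code: the function names and the example strings are illustrative assumptions, and any preprocessing the authors may apply (Unicode normalization, whitespace handling) is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insertions, deletions and
    substitutions each cost 1."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example (not taken from the paper's data): a graphemic
    # reading with u/v left unresolved, scored against a normalized form.
    reference = "uns chevaliers"
    hypothesis = "vns cheualiers"
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this invented example the two strings differ by two substitutions over a 14-character reference, so the score is 2/14. Under this definition, a 6.7% CER corresponds to roughly one character edit for every 15 reference characters.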
Abstract
Recent advances in Automatic Text Recognition (ATR) have improved access to historical archives, yet a methodological divide persists between palaeographic transcriptions and normalized digital editions. While ATR models trained on more palaeographically-oriented datasets such as CATMuS have shown greater generalizability, their raw outputs remain poorly compatible with most readers and downstream NLP tools, thus creating a usability gap. On the other hand, ATR models trained to produce normalized outputs have been shown to struggle to adapt to new domains and tend to over-normalize and hallucinate. We introduce the task of Pre-Editorial Normalization (PEN), which consists in normalizing graphemic ATR output according to editorial conventions, which has the advantage of keeping an intermediate step with palaeographic fidelity while providing a normalized version for practical usability.…
Taxonomy
Topics: Topic Modeling · Digital Humanities and Scholarship · Handwritten Text Recognition Techniques
