Two Spelling Normalization Approaches Based on Large Language Models

Miguel Domingo; Francisco Casacuberta

arXiv:2506.23288·cs.CL·July 1, 2025

TL;DR

This paper introduces two large language model-based methods for spelling normalization in historical documents, comparing their effectiveness across diverse datasets and concluding that statistical machine translation remains the most effective approach.

Contribution

The study presents two novel approaches to spelling normalization based on large language models: one trained without supervision and one trained for machine translation. Both are evaluated on datasets covering multiple languages and historical periods.

Findings

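01 Both large language model approaches yielded encouraging results.

02 Statistical machine translation outperformed both large language model approaches and still seems to be the most suitable technology for the task.

03 The evaluation spanned multiple datasets covering diverse languages and historical periods.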

Abstract

The absence of standardized spelling conventions and the organic evolution of human language present an inherent linguistic challenge within historical documents, a longstanding concern for scholars in the humanities. Addressing this issue, spelling normalization endeavors to align a document's orthography with contemporary standards. In this study, we propose two new approaches based on large language models: one trained without supervision, and a second one trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to the conclusion that, while both approaches yielded encouraging results, statistical machine translation still seems to be the most suitable technology for this task.

Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Topic Modeling