Token and Span Classification for Entity Recognition in French Historical Encyclopedias
Ludovic Moncla, Hédi Zeghidi

TL;DR
This paper evaluates various NER methods on 18th-century French encyclopedias, highlighting the effectiveness of transformers and exploring few-shot prompting for low-resource historical texts.
Contribution
It introduces a dual token and span classification framework for nested entities and assesses generative models for low-resource historical NER tasks.
Findings
Transformer models outperform classical methods on nested entities.
Few-shot prompting shows promise in low-resource scenarios.
Hybrid approaches are suggested for complex historical texts.
Abstract
Named Entity Recognition (NER) in historical texts presents unique challenges due to non-standardized language, archaic orthography, and nested or overlapping entities. This study benchmarks a diverse set of NER approaches, ranging from classical Conditional Random Fields (CRFs) and spaCy-based models to transformer-based architectures such as CamemBERT and sequence-labeling models like Flair. Experiments are conducted on the GeoEDdA dataset, a richly annotated corpus derived from 18th-century French encyclopedias. We propose framing NER as both token-level and span-level classification to accommodate complex nested entity structures typical of historical documents. Additionally, we evaluate the emerging potential of few-shot prompting with generative language models for low-resource scenarios. Our results demonstrate that while transformer-based models achieve state-of-the-art…
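The contrast between token-level and span-level classification described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the labels `OUTER` and `INNER`, the example phrase, and both function names are hypothetical, chosen only to show why a single BIO tag per token cannot encode a nested entity while a per-span labeling can.

```python
from typing import Dict, List, Tuple

# (start token, end token exclusive, label); labels here are hypothetical
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Token-level view: one BIO tag per token.

    Longer spans are written first; any span overlapping an already
    tagged token is dropped, so nested entities are lost.
    """
    tags = ["O"] * n_tokens
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(tag == "O" for tag in tags[start:end]):
            tags[start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"
    return tags


def enumerate_spans(
    n_tokens: int, spans: List[Span], max_width: int
) -> Dict[Tuple[int, int], str]:
    """Span-level view: every candidate span up to max_width gets a label.

    Candidates that match no gold span are labeled "O". Nested gold
    spans coexist because each span is classified independently.
    """
    gold = {(start, end): label for start, end, label in spans}
    candidates = {}
    for start in range(n_tokens):
        for end in range(start + 1, min(start + max_width, n_tokens) + 1):
            candidates[(start, end)] = gold.get((start, end), "O")
    return candidates


# Hypothetical example: an outer entity covering the whole phrase,
# with an inner entity nested on its last token.
tokens = ["ville", "de", "France"]
gold_spans = [(0, 3, "OUTER"), (2, 3, "INNER")]

bio_tags = spans_to_bio(len(tokens), gold_spans)
span_labels = enumerate_spans(len(tokens), gold_spans, max_width=3)
```

In this sketch the BIO tags keep only the outer entity, whereas the span dictionary retains both `(0, 3)` and `(2, 3)`. The cost of the span view is that the number of candidates grows with the sentence length times `max_width`, which is why a width cap is assumed here.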
Taxonomy
Topics: Natural Language Processing Techniques · Web Data Mining and Analysis · Advanced Computational Techniques and Applications
