Token and Span Classification for Entity Recognition in French Historical Encyclopedias

Ludovic Moncla; Hédi Zeghidi

arXiv:2506.02872·cs.CL·June 4, 2025

Open Access

TL;DR

This paper benchmarks NER methods, from CRFs and spaCy-based models to CamemBERT and Flair, on the GeoEDdA corpus of 18th-century French encyclopedias, highlighting the effectiveness of transformer-based models and exploring few-shot prompting for low-resource historical texts.

Contribution

It introduces a dual token and span classification framework for nested entities and assesses generative models for low-resource historical NER tasks.

Findings

01

Transformer models outperform classical methods on nested entities.

02

Few-shot prompting shows promise in low-resource scenarios.

03

Hybrid approaches are suggested for complex historical texts.

Abstract

Named Entity Recognition (NER) in historical texts presents unique challenges due to non-standardized language, archaic orthography, and nested or overlapping entities. This study benchmarks a diverse set of NER approaches, ranging from classical Conditional Random Fields (CRFs) and spaCy-based models to transformer-based architectures such as CamemBERT and sequence-labeling models like Flair. Experiments are conducted on the GeoEDdA dataset, a richly annotated corpus derived from 18th-century French encyclopedias. We propose framing NER as both token-level and span-level classification to accommodate complex nested entity structures typical of historical documents. Additionally, we evaluate the emerging potential of few-shot prompting with generative language models for low-resource scenarios. Our results demonstrate that while transformer-based models achieve state-of-the-art…
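To make the contrast between token-level and span-level framing concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the example sentence, the `PLACE` label, and the helper names `spans_to_bio` and `enumerate_candidate_spans` are all hypothetical and chosen only for illustration. It shows that a single BIO tag per token cannot hold an entity nested inside another, while labeling candidate spans keeps both.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, label); the label set is hypothetical.
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Flatten spans to one BIO tag per token; an inner span that overlaps
    an already tagged outer span is dropped."""
    tags = ["O"] * n_tokens
    # Visit longer spans first so an outer span claims its tokens before inner ones.
    for start, end, label in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        if all(tag == "O" for tag in tags[start:end]):
            tags[start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"
    return tags


def enumerate_candidate_spans(n_tokens: int, max_len: int) -> List[Tuple[int, int]]:
    """List every (start, end) window of at most max_len tokens.
    A span classifier would assign a label (or "O") to each window."""
    return [
        (s, e)
        for s in range(n_tokens)
        for e in range(s + 1, min(s + max_len, n_tokens) + 1)
    ]


# Hypothetical example: "France" is nested inside "ville de France".
tokens = ["Paris", ",", "ville", "de", "France"]
gold_spans: List[Span] = [(0, 1, "PLACE"), (2, 5, "PLACE"), (4, 5, "PLACE")]

# Token-level view: one tag per token, so the nested "France" span is lost.
bio_tags = spans_to_bio(len(tokens), gold_spans)

# Span-level view: each candidate window gets its own label, so nesting survives.
candidates = enumerate_candidate_spans(len(tokens), max_len=3)
span_labels = {(s, e): label for s, e, label in gold_spans}
labeled = [(s, e, span_labels.get((s, e), "O")) for s, e in candidates]

print(bio_tags)
print([item for item in labeled if item[2] != "O"])
```

In the token-level output, the tokens of "ville de France" carry one `B-`/`I-` sequence and the inner "France" span has no tag of its own; in the span-level output, both spans appear as separate labeled windows. The cost is that the number of candidate windows grows with sentence length and `max_len`, which is one reason span-based models usually cap span width.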

Taxonomy

Topics: Natural Language Processing Techniques · Web Data Mining and Analysis · Advanced Computational Techniques and Applications