Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0
Francesco De Toni, Christopher Akiki, Javier de la Rosa, Clémentine Fourrier, Enrique Manjavacas, Stefan Schweter, Daniel van Strien

TL;DR
This paper investigates the zero-shot capabilities of T0 models for multilingual named entity recognition and for predicting the publication date and language of historical texts, highlighting both the potential and the limitations of this approach for under-resourced languages.
Contribution
It demonstrates the application of T0 models to historical texts for NER, date, and language prediction, revealing both challenges and opportunities in this domain.
Findings
Naive prompt-based NER is error-prone on historical texts.
T0 models can predict publication date and language of historical documents.
Zero-shot methods show potential for processing under-resourced historical languages.
Abstract
In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in three languages as a test bed, we use prompts to extract possible named entities. Our results show that a naive approach to prompt-based zero-shot multilingual Named Entity Recognition is error-prone, but they also highlight the potential of such an approach for historical languages lacking labeled datasets. Moreover, we find that T0-like models can be probed to predict the publication date and language of a document, which could be very relevant for the study of historical texts.
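The prompting setup described in the abstract can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the prompt wording, the names `PROMPTS`, `build_prompt`, `parse_entities`, and `extract_persons`, and the idea of keeping only answers that appear verbatim in the source text are hypothetical and are not taken from the paper. The model call is passed in as a plain function so that the sketch does not depend on any particular library.

```python
from typing import Callable, Dict, List

# Hypothetical prompt templates (NOT the prompts used in the paper).
PROMPTS: Dict[str, str] = {
    "persons": "Input: {text}\nIn the input, what are the names of persons? "
               "Separate the answers with commas.",
    "date": "Text: {text}\nIn what year was this text published?",
    "language": "Text: {text}\nWhat language is this text written in?",
}


def build_prompt(task: str, text: str) -> str:
    """Fill the template for one task with the (stripped) document text."""
    return PROMPTS[task].format(text=text.strip())


def parse_entities(answer: str, source: str) -> List[str]:
    """Split a comma-separated answer into candidate entities.

    Assumption: a candidate is kept only if it occurs verbatim in the
    source, which discards spans the model made up. Duplicates are dropped
    while preserving the order of first appearance.
    """
    seen = set()
    kept: List[str] = []
    for part in answer.split(","):
        span = part.strip()
        if span and span in source and span not in seen:
            seen.add(span)
            kept.append(span)
    return kept


def extract_persons(text: str, generate: Callable[[str], str]) -> List[str]:
    """Prompt a text-to-text model (via `generate`) for person names."""
    return parse_entities(generate(build_prompt("persons", text)), text)


if __name__ == "__main__":
    # Stand-in for a real model call, used only to exercise the code path.
    def fake_generate(prompt: str) -> str:
        return "M. Gambetta, Napoleon, M. Gambetta"

    sample = "Yesterday M. Gambetta spoke in Paris."
    print(extract_persons(sample, fake_generate))
```

In practice `generate` would wrap a text-to-text model such as a T0 checkpoint. The verbatim filter is one simple way to cope with the error-prone outputs the abstract mentions, at the cost of rejecting answers where the model normalizes historical spelling.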
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
