Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0

Francesco De Toni; Christopher Akiki; Javier de la Rosa; Clémentine Fourrier; Enrique Manjavacas; Stefan Schweter; Daniel van Strien

arXiv:2204.05211·cs.CL·April 12, 2022

TL;DR

This paper investigates the zero-shot capabilities of T0 models on historical texts for multilingual named entity recognition and for predicting a document's publication date and language, highlighting both the potential and the limitations of the approach for under-resourced languages.

Contribution

It demonstrates the application of T0 models to historical texts for NER, date, and language prediction, revealing both challenges and opportunities in this domain.

Findings

01. Naive prompt-based NER is error-prone on historical texts.

02. T0 models can predict the publication date and language of historical documents.

03. Zero-shot methods show potential for processing under-resourced historical languages.

Abstract

In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in three languages as a test bed, we use prompts to extract possible named entities. Our results show that a naive approach to prompt-based zero-shot multilingual Named Entity Recognition is error-prone, but they also highlight the potential of such an approach for historical languages lacking labeled datasets. Moreover, we find that T0-like models can be probed to predict the publication date and language of a document, which could be very relevant for the study of historical texts.
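
To make the prompting setup described in the abstract more concrete, here is a minimal sketch in Python of how one passage might be turned into zero-shot prompts for entities, date, and language. The template wording, the task names, and the helpers `build_prompts` and `parse_entities` are illustrative assumptions, not the prompts or code used by the authors. The model call itself is left out, so the sketch covers only prompt construction and a simple check on the generated answer.

```python
from typing import Dict, List

# Hypothetical prompt templates (assumed wording, not taken from the paper).
TEMPLATES: Dict[str, str] = {
    "ner_person": 'Text: "{text}"\nWhich people are mentioned in the text? '
                  "Answer with a comma-separated list.",
    "ner_location": 'Text: "{text}"\nWhich places are mentioned in the text? '
                    "Answer with a comma-separated list.",
    "date": 'Text: "{text}"\nIn which year was this text published?',
    "language": 'Text: "{text}"\nIn which language is this text written?',
}


def build_prompts(text: str) -> Dict[str, str]:
    """Fill every template with the same passage, giving one prompt per task."""
    return {task: template.format(text=text) for task, template in TEMPLATES.items()}


def parse_entities(answer: str, source: str) -> List[str]:
    """Split a comma-separated model answer into candidate entities.

    Candidates that do not occur verbatim in the source passage are dropped,
    a simple guard against entities the model made up. Duplicates are removed
    while keeping the order of first appearance.
    """
    entities: List[str] = []
    for piece in answer.split(","):
        candidate = piece.strip().strip(".")
        if candidate and candidate in source and candidate not in entities:
            entities.append(candidate)
    return entities
```

Each prompt string could then be passed to a T0-style text-to-text model, and the generated answer for the entity tasks fed to `parse_entities`. The verbatim-match filter is one possible way to reduce the errors a naive approach produces; it is an assumption of this sketch, not a step reported in the paper.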

Code & Models

Repositories

bigscience-workshop/historical_texts
pytorch

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies