PrOnto: Language Model Evaluations for 859 Languages

Luke Gessler

arXiv:2305.12612 · cs.CL · March 29, 2024 · 1 citation

TL;DR

PrOnto introduces an automated method to create evaluation datasets for hundreds of languages using New Testament translations, enabling assessment of pretrained language models without manual annotation.

Contribution

The paper presents a novel, annotation-free approach to generate language evaluation datasets from New Testament translations, covering 859 languages.

Findings

01 Evaluation datasets were created for 859 languages from 1051 New Testament translations.

02 Experiments indicate that the resulting datasets are useful for assessing language model quality.

03 The datasets are publicly released for evaluation across these languages.

Abstract

Evaluation datasets are critical resources for measuring the quality of pretrained language models. However, due to the high cost of dataset annotation, these resources are scarce for most languages other than English, making it difficult to assess the quality of language models in those languages. In this work, we present a new method for evaluation dataset construction which enables any language with a New Testament translation to receive a suite of evaluation datasets suitable for pretrained language model evaluation. The method critically involves aligning verses with those in the New Testament portion of English OntoNotes, and then projecting annotations from English to the target language, with no manual annotation required. We apply this method to 1051 New Testament translations in 859 languages and make the resulting datasets publicly available. Additionally, we conduct experiments which demonstrate the efficacy of our…
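
The verse alignment and projection step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `VerseId` key format, the `Example` record, the `project_labels` function, and the idea of a single verse-level label are all assumptions made for illustration. The sketch only shows the general principle that a label derived from English annotations can be attached to the same verse in another translation, because verse identifiers are shared across translations.

```python
from typing import Dict, List, NamedTuple, Tuple

# Hypothetical verse key: (book code, chapter, verse), e.g. ("MAT", 5, 3).
VerseId = Tuple[str, int, int]


class Example(NamedTuple):
    """One projected evaluation example (hypothetical record layout)."""
    verse_id: VerseId
    text: str   # verse text in the target language
    label: str  # label derived from the English annotation of the same verse


def project_labels(
    english_labels: Dict[VerseId, str],
    target_verses: Dict[VerseId, str],
) -> List[Example]:
    """Attach each English verse-level label to the aligned target verse.

    Verses that are missing or empty in the target translation are skipped,
    so no label is ever paired with text it was not aligned to.
    """
    examples = []
    for verse_id in sorted(english_labels):
        text = target_verses.get(verse_id)
        if text:  # skip verses the translation omits or leaves empty
            examples.append(Example(verse_id, text, english_labels[verse_id]))
    return examples
```

Keying both sides on the verse identifier means no word-level alignment or target-language annotation is needed in this sketch; only labels that make sense at the level of a whole verse can be carried over this way.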

Code & Models

Repositories

lgessler/pronto (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification