From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars

Albert Kornilov; Tatiana Shavrina

arXiv:2411.15577 · cs.CL · December 30, 2024

Open Access

TL;DR

This paper introduces benchmarks and a retrieval-augmented approach for evaluating and improving how well language models interpret the complex linguistic descriptions found in grammars of low-resource languages, in support of downstream tasks such as machine translation.

Contribution

The paper presents what the authors describe as the first comprehensive benchmarks, together with a retrieval-augmented method, for extracting and classifying typological features from descriptive grammars.

Findings

01

The benchmarks cover 248 languages across 142 language families.

02

The RAG-based approach improves feature-extraction accuracy.

03

Code and data are publicly available for further research.

Abstract

Recent advances in language modeling have demonstrated significant improvements in zero-shot capabilities, including in-context learning, instruction following, and machine translation for extremely under-resourced languages (Tanzer et al., 2024). However, many languages with limited written resources rely primarily on formal descriptions of grammar and vocabulary. In this paper, we introduce a set of benchmarks to evaluate how well models can extract and classify information from the complex descriptions found in linguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based approach that leverages these descriptions for downstream tasks such as machine translation. Our benchmarks encompass linguistic descriptions for 248 languages across 142 language families, focusing on typological features from WALS and Grambank. This set of benchmarks offers the first…
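
To make the idea of retrieval-augmented classification over a grammar more concrete, here is a minimal sketch. It is not the authors' pipeline: the retriever is a simple TF-IDF-style lexical scorer used as a stand-in, the grammar paragraphs are invented toy text, and every name (`retrieve`, `build_prompt`, `paragraphs`) is hypothetical. The step that sends the prompt to a language model is deliberately left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, paragraphs, k=2):
    """Return the k paragraphs with the highest IDF-weighted overlap with the query.

    Assumption: a lexical scorer stands in for whatever retriever is really used.
    """
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n = len(docs)
    # Document frequency: in how many paragraphs each token occurs.
    df = Counter(tok for doc in docs for tok in doc)
    q_tokens = set(tokenize(query))

    def score(doc):
        return sum(
            doc[t] * math.log((n + 1) / (df[t] + 1))
            for t in q_tokens
            if t in doc
        )

    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [paragraphs[i] for i in ranked[:k]]


def build_prompt(feature_question, options, retrieved):
    """Assemble a classification prompt from retrieved grammar excerpts."""
    context = "\n\n".join(retrieved)
    choices = "\n".join(f"- {o}" for o in options)
    return (
        f"Grammar excerpts:\n{context}\n\n"
        f"Question: {feature_question}\n"
        f"Answer with one of:\n{choices}\n"
    )


# Toy grammar paragraphs (invented for illustration, not from any real grammar).
paragraphs = [
    "The language has five vowel phonemes and no tone.",
    "In transitive clauses the subject precedes the object, and the verb comes last.",
    "Nouns take case suffixes marking ergative and absolutive.",
]
query = "What is the order of subject, object and verb?"
top = retrieve(query, paragraphs, k=1)
prompt = build_prompt(query, ["SOV", "SVO", "VSO"], top)
print(prompt)
```

The design choice worth noting is that classification is framed as a closed-set question: the typological feature (here, a WALS-style word-order feature) supplies the answer options, and retrieval only narrows the grammar down to the passages most likely to bear on that feature.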

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training