From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars
Albert Kornilov, Tatiana Shavrina

TL;DR
This paper introduces benchmarks and a retrieval-augmented approach for evaluating and improving how well language models interpret complex linguistic descriptions of low-resource languages, in support of downstream NLP tasks such as machine translation.
Contribution
The paper presents the first comprehensive benchmarks and a novel retrieval-augmented method for extracting linguistic features from descriptive grammars.
Findings
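The benchmarks cover linguistic descriptions for 248 languages across 142 language families, focusing on typological features from WALS and Grambank.
The RAG-based approach improves feature extraction accuracy.
Code and data are publicly available for further research.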
Abstract
Recent advances in language modeling have demonstrated significant improvements in zero-shot capabilities, including in-context learning, instruction following, and machine translation for extremely under-resourced languages (Tanzer et al., 2024). However, many languages with limited written resources rely primarily on formal descriptions of grammar and vocabulary. In this paper, we introduce a set of benchmarks to evaluate how well models can extract and classify information from the complex descriptions found in linguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based approach that leverages these descriptions for downstream tasks such as machine translation. Our benchmarks encompass linguistic descriptions for 248 languages across 142 language families, focusing on typological features from WALS and Grambank. This set of benchmarks offers the first…
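To make the retrieve-then-classify idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a grammar has already been split into paragraphs, ranks them against a typological question with a simple TF-IDF-style overlap score, and assembles a classification prompt for a language model. The function names (`tokenize`, `retrieve`, `build_prompt`), the scoring formula, the prompt wording, and the example paragraphs are all hypothetical; the paper's actual retriever, prompts, and models are not specified in this summary.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def retrieve(paragraphs, query, k=2):
    """Return the k paragraphs with the highest TF-IDF-weighted overlap with the query.

    Assumption: a lexical scorer stands in for whatever retriever the paper uses.
    """
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n = len(docs)
    # Document frequency: number of paragraphs containing each term.
    df = Counter(term for doc in docs for term in doc)
    query_terms = set(tokenize(query))

    def score(doc):
        # Term frequency times a smoothed inverse document frequency.
        return sum(
            doc[t] * math.log((n + 1) / (df[t] + 1))
            for t in query_terms
            if t in doc
        )

    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [paragraphs[i] for i in ranked[:k]]


def build_prompt(question, options, passages):
    """Assemble a classification prompt from retrieved passages (hypothetical wording)."""
    context = "\n".join(f"- {p}" for p in passages)
    choices = ", ".join(options)
    return (
        f"Grammar excerpts:\n{context}\n\n"
        f"Question: {question}\n"
        f"Answer with exactly one of: {choices}"
    )


# Invented example paragraphs, not taken from any real grammar.
paragraphs = [
    "The language has five vowels and a contrast in vowel length.",
    "In transitive clauses the subject precedes the object, and the verb comes last.",
    "Nouns are marked for number with a plural suffix.",
]
question = "What is the dominant order of subject, object and verb?"
options = ["SOV", "SVO", "VSO", "VOS", "OVS", "OSV", "No dominant order"]

top = retrieve(paragraphs, question, k=1)
prompt = build_prompt(question, options, top)
print(prompt)
```

The resulting prompt would then be passed to a language model of the reader's choice, and the predicted label compared against the value recorded in a typological database such as WALS or Grambank. A dense embedding retriever could replace the lexical scorer without changing the rest of the pipeline.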
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
