Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions
Kexun Zhang, Yee Man Choi, Zhenqiao Song, Taiqi He, William Yang Wang, Lei Li

TL;DR
This paper introduces LINGOLLM, a prompt-based method that leverages linguistic resources like dictionaries and grammar books to enable large language models to process and translate endangered languages without additional training.
Contribution
The paper presents a training-free approach that incorporates linguistic knowledge into LLM prompts, significantly improving translation performance for unseen endangered languages.
Findings
LINGOLLM improves GPT-4's BLEU score from 0 to 10.5 on endangered languages.
Using linguistic resources in prompts enhances LLM translation capabilities.
The approach demonstrates the value of linguistic knowledge for low-resource language processing.
Abstract
How can large language models (LLMs) process and translate endangered languages? Many languages lack a large corpus to train a decent LLM; therefore existing LLMs rarely perform well in unseen, endangered languages. On the contrary, we observe that 2000 endangered languages, though without a large corpus, have a grammar book or a dictionary. We propose LINGOLLM, a training-free approach to enable an LLM to process unseen languages that hardly occur in its pre-training. Our key insight is to demonstrate linguistic knowledge of an unseen language in an LLM's prompt, including a dictionary, a grammar book, and morphologically analyzed input text. We implement LINGOLLM on top of two models, GPT-4 and Mixtral, and evaluate their performance on 5 tasks across 8 endangered or low-resource languages. Our results show that LINGOLLM elevates translation capability from GPT-4's 0 to 10.5 BLEU for…
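The abstract describes placing a dictionary, a grammar book, and morphologically analyzed input into the LLM's prompt. A minimal sketch of that idea is given below, assuming the input has already been segmented into morphemes. The function names (`gloss_tokens`, `build_prompt`), the prompt wording, and the toy morphemes are all hypothetical illustrations, not the authors' implementation or data.

```python
from typing import Dict, List


def gloss_tokens(morphemes: List[List[str]], dictionary: Dict[str, str]) -> List[str]:
    """Gloss each word morpheme by morpheme; unknown morphemes become '?'."""
    glossed = []
    for word in morphemes:
        parts = [dictionary.get(m, "?") for m in word]
        glossed.append("-".join(parts))
    return glossed


def build_prompt(
    source: str,
    morphemes: List[List[str]],
    dictionary: Dict[str, str],
    grammar_excerpt: str,
    target_language: str = "English",
) -> str:
    """Assemble a translation prompt from linguistic resources (hypothetical layout)."""
    segmented = " ".join("-".join(word) for word in morphemes)
    glosses = " ".join(gloss_tokens(morphemes, dictionary))
    return "\n".join(
        [
            "Grammar notes:",
            grammar_excerpt,
            "",
            f"Source sentence: {source}",
            f"Morpheme segmentation: {segmented}",
            f"Dictionary glosses: {glosses}",
            "",
            f"{target_language} translation:",
        ]
    )


# Invented toy data, not taken from any real language or from the paper.
toy_dictionary = {"ka": "dog", "ti": "PL"}
toy_prompt = build_prompt(
    source="kati mo",
    morphemes=[["ka", "ti"], ["mo"]],
    dictionary=toy_dictionary,
    grammar_excerpt="Word order is subject-object-verb.",
)
print(toy_prompt)
```

The resulting string would then be sent to a model such as GPT-4 or Mixtral. No weights are updated, which is what makes the approach training-free.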
Taxonomy
Topics: Natural Language Processing Techniques · Interpreting and Communication in Healthcare
Methods: Linear Layer · Layer Normalization · Byte Pair Encoding · Dropout · Multi-Head Attention · Attention Is All You Need · Softmax · Dense Connections · Label Smoothing · Adam
