Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language
Prathamesh Devadiga, Paras Chopra

TL;DR
This study explores how large language models can converse in Tulu, an extremely low-resource language, using structured prompts without fine-tuning, reaching 85% grammatical accuracy and cutting vocabulary contamination from 80% to 5%.
Contribution
The paper shows that structured prompting alone, without additional training, can elicit basic conversational ability from LLMs in an extremely low-resource language such as Tulu.
Findings
Vocabulary contamination reduced from 80% to 5%.
Achieved 85% grammatical accuracy in Tulu.
Negative constraints improve performance across models.
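The summary does not say how vocabulary contamination is measured. As a rough illustration only, one plausible reading is the share of word tokens in a model response that belong to a lexicon of words from related languages. The sketch below assumes that definition; the function name `contamination_rate` and the placeholder lexicon are hypothetical and not taken from the paper.

```python
import re


def contamination_rate(text, contaminant_lexicon):
    """Return the fraction of word tokens in `text` found in `contaminant_lexicon`.

    Assumption: contamination is counted per token against a fixed word list.
    The paper may define and measure it differently.
    """
    # Lowercase and keep runs of letters only (drops digits, punctuation, underscores).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in contaminant_lexicon)
    return hits / len(tokens)


# Placeholder tokens stand in for real words; they are not Tulu or Kannada.
lexicon = {"alpha"}
rate = contamination_rate("alpha beta gamma delta", lexicon)
```

Under this reading, a response where one token in four matches the lexicon scores 0.25, and the reported drop from 80% to 5% would correspond to scores of 0.80 and 0.05.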
Abstract
Can large language models converse in languages virtually absent from their training data? We investigate this question through a case study on Tulu, a Dravidian language with over 2 million speakers but minimal digital presence. Rather than fine-tuning an LLM, we examine whether structured prompts alone can elicit basic conversational ability under controlled prompting. We systematically tackle the challenges posed by the absence of training data for Tulu by combining explicit grammar documentation, negative constraints to suppress high-probability tokens from related languages, romanization standardization, and quality-controlled synthetic data generation via self-play. Evaluated on a manually curated held-out set across three LLMs (Gemini 2.0 Flash, GPT-4o, Llama 3.1 70B) and validated by native speakers, our approach reduces vocabulary contamination from 80% to 5% while achieving…
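To make the prompt components named in the abstract concrete, here is a minimal sketch of how grammar documentation, romanization rules, and negative constraints might be assembled into a single prompt. The function `build_structured_prompt`, the section headings, and all example strings are assumptions for illustration; the authors' actual prompt format is not given in this summary.

```python
def build_structured_prompt(grammar_notes, banned_words, romanization_rules, user_message):
    """Assemble one prompt string from the components listed in the abstract.

    Hypothetical layout: the real prompt used in the paper may be organized differently.
    """
    sections = [
        "You are conversing in Tulu. Reply only in romanized Tulu.",
        # Explicit grammar documentation, one rule per bullet.
        "GRAMMAR NOTES:\n" + "\n".join(f"- {note}" for note in grammar_notes),
        # Romanization standardization, so spelling stays consistent.
        "ROMANIZATION RULES:\n" + "\n".join(f"- {rule}" for rule in romanization_rules),
        # Negative constraints: words from related languages to avoid.
        "DO NOT USE these words from related languages:\n"
        + ", ".join(sorted(banned_words)),
        "USER MESSAGE:\n" + user_message,
    ]
    return "\n\n".join(sections)


# Placeholder content only; these are not real grammar rules or vocabulary.
prompt = build_structured_prompt(
    grammar_notes=["<grammar rule 1>", "<grammar rule 2>"],
    banned_words={"<related-language word B>", "<related-language word A>"},
    romanization_rules=["<romanization rule 1>"],
    user_message="<user message>",
)
```

The returned string could be passed as the system or user prompt to any of the evaluated models. Sorting the banned words keeps the prompt deterministic across runs, which helps when comparing outputs between models.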
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Language and cultural evolution
