Resource-Lean Lexicon Induction for German Dialects
Robert Litschko, Barbara Plank, Diego Frassinelli

TL;DR
This paper demonstrates that simple statistical models trained on string similarity features can effectively induce German dialect lexicons, outperforming large language models and enabling resource-efficient cross-dialect transfer.
Contribution
It introduces a resource-lean, effective approach using random forests for dialect lexicon induction, surpassing LLMs and supporting cross-dialect transfer in low-resource settings.
Findings
Random forests outperform Mistral-123b on bilingual lexicon induction (BLI) while being more resource-lean.
On dialect IR with BM25, using the induced dialect dictionaries for query expansion yields relative improvements of up to 28.9% in nDCG@10.
Models transfer effectively across different German dialects with limited training data.
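The query-expansion finding can be illustrated with a minimal sketch. It assumes a lexicon that maps standard German terms to dialect variants; the direction of the mapping, the function name `expand_query`, and the example entries are illustrative assumptions rather than details taken from the paper. The expanded token list would then be passed to a BM25 retriever in place of the original query.

```python
from typing import Dict, List


def expand_query(query: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Append lexicon variants of each query token to the query.

    `lexicon` is a hypothetical mapping from a lowercased term to a list
    of variants (assumed here: standard German -> dialect spellings).
    """
    tokens = query.lower().split()
    expanded = list(tokens)  # keep the original terms first
    for token in tokens:
        for variant in lexicon.get(token, []):
            if variant not in expanded:  # avoid duplicate terms
                expanded.append(variant)
    return expanded


# Illustrative entries only; not drawn from the paper's dictionaries.
toy_lexicon = {"haus": ["hoos", "huus"]}
expanded_terms = expand_query("altes Haus", toy_lexicon)
```

Keeping the original terms ahead of the variants means documents written in the standard variety still match, while documents written in dialect become retrievable as well.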
Abstract
Automatic induction of high-quality dictionaries is essential for building lexical resources, yet low-resource languages and dialects pose several challenges: limited access to annotators, high degree of spelling variations, and poor performance of large language models (LLMs). We empirically show that statistical models (random forests) trained on string similarity features are surprisingly effective for inducing German dialect lexicons. They outperform LLMs, enable cross-dialect transfer, and offer a lightweight data-driven alternative. We evaluate our models intrinsically on bilingual lexicon induction (BLI) and extrinsically on dialect information retrieval (IR). On BLI, random forests outperform Mistral-123b while being more resource-lean. On dialect IR with BM25, using our dialect dictionaries for query expansion yields relative improvements of up to 28.9% in nDCG@10 and 50.7% in…
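To make the idea of "string similarity features" concrete, the sketch below computes a small feature vector for a (dialect word, standard German word) pair using only the Python standard library. The specific features (normalized edit distance, character-bigram Jaccard overlap, `difflib` ratio, relative length difference) are assumptions chosen for illustration; the abstract does not list the features the authors use. Vectors like these could be fed to a random forest classifier, for example scikit-learn's `RandomForestClassifier`, to score candidate translation pairs.

```python
from difflib import SequenceMatcher
from typing import List, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def char_ngrams(word: str, n: int = 2) -> Set[str]:
    """Character n-grams of a word padded with boundary markers."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity_features(dialect_word: str, german_word: str) -> List[float]:
    """Illustrative feature vector for one candidate word pair."""
    a, b = dialect_word.lower(), german_word.lower()
    longest = max(len(a), len(b)) or 1  # guard against empty strings
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    union = grams_a | grams_b
    return [
        1.0 - levenshtein(a, b) / longest,    # normalized edit similarity
        len(grams_a & grams_b) / len(union),  # bigram Jaccard overlap
        SequenceMatcher(None, a, b).ratio(),  # difflib similarity ratio
        abs(len(a) - len(b)) / longest,       # relative length difference
    ]


# Hypothetical example pair, used only to show the call.
features = similarity_features("hoam", "heim")
```

Because every feature depends only on the two surface strings, a model trained on pairs from one dialect can be applied unchanged to pairs from another, which is one plausible reason such features would support cross-dialect transfer.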
