Cross-Dialect Information Retrieval: Information Access in Low-Resource and High-Variance Languages
Robert Litschko, Oliver Kraus, Verena Blaschke, Barbara Plank

TL;DR
This paper addresses the challenges of cross-dialect information retrieval, introduces a new German dialect dataset, and evaluates lexical, zero-shot transfer, and translation-based approaches to retrieval in low-resource dialects.
Contribution
The paper presents the first German dialect retrieval dataset, WikiDIR, and analyzes the limitations of existing lexical and transfer methods, proposing translation as an effective solution.
Findings
Lexical methods struggle with the high lexical variation found in dialects.
Zero-shot transfer with multilingual encoders performs poorly in low-resource dialects.
Translation significantly reduces the dialect gap in retrieval tasks.
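The first finding can be illustrated with a minimal sketch of exact-match lexical scoring. The code below is a plain Okapi BM25 scorer written for this summary, not the paper's implementation; the function name `bm25_scores`, the parameter defaults, and the three toy sentences (including the dialect-like spelling) are invented for illustration and are not taken from WikiDIR.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # no exact token match, so no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (hypothetical strings, not real dataset entries).
docs = [
    "das brot ist frisch".split(),  # standard spelling
    "des brod is frisch".split(),   # invented dialect-like spelling of the same sentence
    "der zug ist voll".split(),     # unrelated sentence
]
query = "das brot".split()
scores = bm25_scores(query, docs)
```

Under these assumptions, the standard-spelling document receives a positive score, while the dialect-like variant shares no exact tokens with the query and scores zero, the same as the unrelated sentence. This is the kind of vocabulary mismatch that exact-match methods cannot bridge without normalization or translation.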
Abstract
A large amount of local and culture-specific knowledge (e.g., people, traditions, food) can only be found in documents written in dialects. While there has been extensive research on cross-lingual information retrieval (CLIR), the field of cross-dialect retrieval (CDIR) has received limited attention. Dialect retrieval poses unique challenges due to the limited availability of resources to train retrieval models and the high variability in non-standardized languages. We study these challenges using German dialects as an example and introduce the first German dialect retrieval dataset, dubbed WikiDIR, which consists of seven German dialects extracted from Wikipedia. Using WikiDIR, we demonstrate the weakness of lexical methods in dealing with high lexical variation in dialects. We further show that the commonly used zero-shot cross-lingual transfer approach with multilingual encoders…
Taxonomy
Topics: Natural Language Processing Techniques
