Investigating Language Preference of Multilingual RAG Systems
Jeonghyun Park, Hwanhee Lee

TL;DR
This paper investigates language preferences in multilingual RAG systems, revealing biases in retrieval and generation, and proposes DKM-RAG to mitigate these biases and improve multilingual performance.
Contribution
It systematically analyzes language biases in mRAG and introduces DKM-RAG, a novel framework that fuses translated passages to reduce language preference issues.
Findings
Retrievers prefer high-resource and query languages.
Generators favor query language or Latin scripts.
DKM-RAG improves multilingual retrieval and generation performance.
Abstract
Multilingual Retrieval-Augmented Generation (mRAG) systems enhance language models by integrating external multilingual information to produce context-aware responses. However, mRAG systems struggle to retrieve relevant information because of linguistic variations between queries and documents, and they generate inconsistent responses when multilingual sources conflict. In this work, we systematically investigate language preferences in both retrieval and generation of mRAG through a series of experiments. Our analysis indicates that retrievers tend to prefer high-resource and query languages, yet this preference does not consistently improve generation performance. Moreover, we observe that generators prefer the query language or Latin scripts, leading to inconsistent outputs. To overcome these issues, we propose Dual Knowledge Multilingual RAG (DKM-RAG), a simple yet effective framework that…
Taxonomy
Topics: Speech and dialogue systems
Methods: Attention Is All You Need · Byte Pair Encoding · Adam · Softmax · Dropout · Weight Decay · BART · WordPiece · Layer Normalization
