Retrieval-augmented generation in multilingual settings
Nadezhda Chirkova, David Rau, Hervé Déjean, Thibault Formal, Stéphane Clinchant, Vassilina Nikoulina

TL;DR
This paper explores how retrieval-augmented generation (RAG) can be adapted for multilingual use, addressing challenges like prompt engineering, evaluation metrics, and language-specific issues to establish a strong baseline for future research.
Contribution
It introduces a multilingual RAG pipeline (mRAG), analyzing necessary components and adjustments for effective multilingual retrieval and generation, and highlights key challenges and solutions.
Findings
Prompt engineering is essential for multilingual generation.
Evaluation metrics require adjustments for multilingual settings.
Current models face issues with code-switching and document reading accuracy.
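The first finding can be illustrated with a small prompt builder. This is a minimal sketch, not the prompt used in the paper: the function name build_mrag_prompt, the instruction wording, and the document numbering are all assumptions made for illustration. It only shows the general idea of stating the user's language explicitly, so the generator does not drift into the language of the retrieved documents.

```python
from typing import List


def build_mrag_prompt(query: str, documents: List[str], user_language: str) -> str:
    """Assemble a generation prompt that names the target language explicitly.

    Hypothetical helper: the wording below is illustrative, not taken from the paper.
    """
    # Number the retrieved documents so the model can tell them apart.
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer the question using the retrieved documents below.\n"
        f"Write the answer in {user_language}, even if the documents "
        "are written in another language.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_mrag_prompt(
    query="Qui a composé Le Lac des cygnes ?",
    documents=[
        "Swan Lake is a ballet composed by Pyotr Ilyich Tchaikovsky.",
        "Le ballet a été créé à Moscou en 1877.",
    ],
    user_language="French",
)
print(prompt)
```

The language name is passed in as a parameter rather than detected automatically, which keeps the sketch free of any language-identification dependency.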
Abstract
Retrieval-augmented generation (RAG) has recently emerged as a promising solution for incorporating up-to-date or domain-specific knowledge into large language models (LLMs) and improving LLM factuality, but is predominantly studied in English-only settings. In this work, we consider RAG in the multilingual setting (mRAG), i.e. with user queries and the datastore in 13 languages, and investigate which components and with which adjustments are needed to build a well-performing mRAG pipeline, that can be used as a strong baseline in future works. Our findings highlight that despite the availability of high-quality off-the-shelf multilingual retrievers and generators, task-specific prompt engineering is needed to enable generation in user languages. Moreover, current evaluation metrics need adjustments for multilingual setting, to account for variations in spelling named entities. The main…
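To make the point about named-entity spelling concrete, here is a small sketch comparing a word-level match with a character n-gram recall. The abstract as shown here does not say which adjustment the authors adopt, so treat character n-gram recall as one plausible option; the function names and the choice of n = 3 are assumptions for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Very short strings are treated as a single unit.
        return Counter([normalized]) if normalized else Counter()
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def char_ngram_recall(reference: str, prediction: str, n: int = 3) -> float:
    """Fraction of the reference's character n-grams found in the prediction."""
    ref = char_ngrams(reference, n)
    pred = char_ngrams(prediction, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, pred[gram]) for gram, count in ref.items())
    return matched / total


def exact_word_recall(reference: str, prediction: str) -> float:
    """Fraction of reference words that appear verbatim in the prediction."""
    ref_words = reference.lower().split()
    pred_words = set(prediction.lower().split())
    if not ref_words:
        return 0.0
    return sum(word in pred_words for word in ref_words) / len(ref_words)


reference = "Tchaikovsky"
prediction = "The composer was Chaikovsky"  # alternative transliteration
print(exact_word_recall(reference, prediction))
print(char_ngram_recall(reference, prediction))
```

The word-level score gives no credit for the alternative transliteration, while the character-level score credits the n-grams the two spellings share.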
Taxonomy
Topics: Speech and dialogue systems
Methods: Attention Is All You Need · Linear Layer · Weight Decay · Multi-Head Attention · Residual Connection · WordPiece · Softmax · Byte Pair Encoding · Layer Normalization
