IndicRAGSuite: Large-Scale Datasets and a Benchmark for Indian Language RAG Systems

Pasunuti Prasanjith; Prathmesh B More; Anoop Kunchukuttan; Raj Dabre

arXiv:2506.01615 · cs.CL · June 4, 2025

TL;DR

IndicRAGSuite provides large-scale training datasets and a multilingual benchmark for building and evaluating Retrieval-Augmented Generation (RAG) systems in Indian languages, which have so far lacked both evaluation benchmarks and large-scale retrieval training data.

Contribution

The paper introduces IndicMSMarco, a multilingual benchmark covering 13 Indian languages, along with large-scale training datasets derived from Indian language Wikipedias and from translated MS MARCO data.

Findings

01. Created the IndicMSMarco benchmark by manually translating 1000 queries from the MS MARCO dev set into 13 Indian languages.

02. Built large-scale training datasets from the Wikipedias of 19 Indian languages.

03. Enriched the training data with translated versions of the MS MARCO dataset.

Abstract

Retrieval-Augmented Generation (RAG) systems enable language models to access relevant information and generate accurate, well-grounded, and contextually informed responses. However, for Indian languages, the development of high-quality RAG systems is hindered by the lack of two critical resources: (1) evaluation benchmarks for retrieval and generation tasks, and (2) large-scale training datasets for multilingual retrieval. Most existing benchmarks and datasets are centered on English or other high-resource languages, making it difficult to extend RAG capabilities to the diverse linguistic landscape of India. To address the lack of evaluation benchmarks, we create IndicMSMarco, a multilingual benchmark for evaluating retrieval quality and response generation in 13 Indian languages, created via manual translation of 1000 diverse queries from the MS MARCO dev set. To address the need for…

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling