IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages

Aviral Dawar; Roshan Karanth; Vikram Goyal; Dhruv Kumar

arXiv:2604.13686 · cs.CL · April 16, 2026

TL;DR

IndicDB is a multilingual Text-to-SQL benchmark for Indian languages. It is built on realistic schemas and data complexity, and it highlights the challenges of cross-lingual semantic parsing.

Contribution

The paper introduces a new benchmark with realistic schemas and a novel data generation pipeline, and it evaluates state-of-the-art models across Indic languages, revealing significant performance gaps.

Findings

1. 9% performance drop from English to Indic languages
2. IndicDB includes 20 databases with 237 tables and over 15,000 tasks
3. The benchmark exposes schema linking and structural ambiguity challenges

Abstract

While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. The relational schemas are sourced from open-data platforms, including the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), ensuring realistic administrative data complexity. IndicDB comprises 20 databases across 237 tables. To convert denormalized government data into rich relational structures, we employ an iterative three-agent framework (Architect, Auditor, Refiner) to ensure structural rigor and high relational density (11.85 tables per database; join depths up to six). Our pipeline is value-aware,…
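The relational density figure quoted in the abstract follows directly from the table and database counts (237 / 20 = 11.85). The sketch below checks that arithmetic and illustrates one plausible reading of "join depth" as the number of joins along the longest chain of foreign-key references. That definition, the helper names, and the toy schema are assumptions made for illustration only; they are not taken from the paper or its repository.

```python
# Counts stated in the abstract.
NUM_DATABASES = 20
NUM_TABLES = 237


def tables_per_database(num_tables: int, num_databases: int) -> float:
    """Average number of tables per database."""
    return num_tables / num_databases


# Hypothetical foreign-key graph: table -> tables it references.
# Table names are invented for illustration; they are not from IndicDB.
TOY_FOREIGN_KEYS = {
    "state": [],
    "district": ["state"],
    "block": ["district"],
    "village": ["block"],
    "scheme": [],
    "enrollment": ["village", "scheme"],
}


def longest_join_chain(foreign_keys: dict[str, list[str]]) -> int:
    """Number of joins along the longest foreign-key chain.

    Assumes the foreign-key graph is acyclic.
    """
    memo: dict[str, int] = {}

    def depth(table: str) -> int:
        if table not in memo:
            refs = foreign_keys.get(table, [])
            memo[table] = 0 if not refs else 1 + max(depth(r) for r in refs)
        return memo[table]

    return max(depth(table) for table in foreign_keys)


density = tables_per_database(NUM_TABLES, NUM_DATABASES)
toy_depth = longest_join_chain(TOY_FOREIGN_KEYS)
print(f"tables per database: {density:.2f}")
print(f"longest join chain in the toy schema: {toy_depth}")
```

In the toy schema the longest chain runs from enrollment through village, block, and district to state, which takes four joins. Under the same reading, the paper's reported maximum of six would correspond to a chain spanning seven tables.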

Code & Models

Repositories

https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC
