TL;DR
IndicDB is a comprehensive multilingual Text-to-SQL benchmark for Indian languages, highlighting challenges in cross-lingual semantic parsing with realistic schemas and data complexity.
Contribution
The paper introduces a new benchmark with realistic schemas and a novel data generation pipeline, and it evaluates state-of-the-art models across Indic languages, revealing significant performance gaps.
Findings
9% performance drop from English to Indic languages
IndicDB includes 20 databases with 237 tables and over 15,000 tasks
The benchmark exposes schema linking and structural ambiguity challenges
Abstract
While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. The relational schemas are sourced from open-data platforms, including the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), ensuring realistic administrative data complexity. IndicDB comprises 20 databases with 237 tables in total. To convert denormalized government data into rich relational structures, we employ an iterative three-agent framework (Architect, Auditor, Refiner) to ensure structural rigor and high relational density (11.85 tables per database; join depths up to six). Our pipeline is value-aware,…
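The iterative Architect, Auditor, Refiner loop mentioned in the abstract can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the `Table` class, the `audit`, `refine`, and `build_schema` functions, the two audit rules, and the toy `district` / `crop_yield` draft are hypothetical, and plain Python functions stand in for what the paper describes as agents. It shows only the general audit-then-refine control flow, not the authors' pipeline.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Table:
    # Hypothetical schema record used only for this sketch.
    name: str
    columns: List[str]
    primary_key: Optional[str] = None
    # Maps a local column to the name of the table it references.
    foreign_keys: Dict[str, str] = field(default_factory=dict)


def audit(schema: List[Table]) -> List[str]:
    """Stand-in Auditor: list structural problems in a draft schema."""
    names = {t.name for t in schema}
    issues = []
    for t in schema:
        if t.primary_key is None:
            issues.append(f"{t.name}: missing primary key")
        for col, target in t.foreign_keys.items():
            if target not in names:
                issues.append(f"{t.name}.{col}: references unknown table {target}")
    return issues


def refine(schema: List[Table]) -> List[Table]:
    """Stand-in Refiner: add surrogate keys and drop dangling references."""
    names = {t.name for t in schema}
    refined = []
    for t in schema:
        columns = list(t.columns)
        pk = t.primary_key
        if pk is None:
            # Assumed repair rule: introduce a surrogate key named <table>_id.
            pk = f"{t.name}_id"
            if pk not in columns:
                columns.insert(0, pk)
        fks = {c: tgt for c, tgt in t.foreign_keys.items() if tgt in names}
        refined.append(Table(t.name, columns, pk, fks))
    return refined


def build_schema(draft: List[Table], max_rounds: int = 3) -> List[Table]:
    """Audit the draft and refine it until it is clean or rounds run out."""
    schema = draft
    for _ in range(max_rounds):
        if not audit(schema):
            break
        schema = refine(schema)
    return schema


# A draft such as a stand-in Architect might propose from one flat table.
draft = [
    Table("district", ["district_id", "district_name"], primary_key="district_id"),
    Table("crop_yield", ["district_id", "year", "yield_tonnes"],
          foreign_keys={"district_id": "district", "year": "calendar"}),
]
final = build_schema(draft)
```

In this toy run the draft has two problems (a missing primary key and a reference to a nonexistent `calendar` table), and a single refinement round clears both. Bounding the loop with `max_rounds` is a design choice of the sketch so that it terminates even if a repair rule cannot resolve an issue.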
