NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions
Shizheng Hou, Wenqi Pei, Nuo Chen, Quang-Trung Ta, Peng Lu, Beng Chin Ooi

TL;DR
NL2SQLBench is a modular benchmarking framework that systematically evaluates LLM-enabled NL2SQL systems across core modules, revealing significant gaps and guiding future improvements.
Contribution
It introduces a comprehensive, modular evaluation framework for NL2SQL systems, including novel metrics and multi-agent benchmarking across diverse approaches and datasets.
Findings
Existing NL2SQL methods show substantial accuracy gaps.
Current approaches are computationally inefficient.
Benchmark datasets and evaluation rules have critical shortcomings.
Abstract
Natural Language to SQL (NL2SQL) technology empowers non-expert users to query relational databases without requiring SQL expertise. While large language models (LLMs) have greatly improved NL2SQL algorithms, their rapid development outpaces systematic evaluation, leaving a critical gap in understanding their effectiveness, efficiency, and limitations. To this end, we present NL2SQLBench, the first modular evaluation and benchmarking framework for LLM-enabled NL2SQL approaches. Specifically, we dissect NL2SQL systems into three core modules: Schema Selection, Candidate Generation, and Query Revision. For each module, we comprehensively review existing strategies and propose novel fine-grained metrics that systematically quantify module-level effectiveness and efficiency. We further implement these metrics in a flexible multi-agent framework, allowing configurable benchmarking across…
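To make the three-module decomposition concrete, here is a minimal sketch of a pipeline that runs Schema Selection, Candidate Generation, and Query Revision in sequence and records each module's output and wall-clock time. Only the three module names come from the abstract. The function names, the toy heuristics standing in for LLM calls, and the table-recall measure are illustrative assumptions, not the framework's actual API or the paper's metrics.

```python
import sqlite3
import time

# All names below are hypothetical; they are not NL2SQLBench's API.

def select_schema(state):
    # Toy heuristic: keep tables whose name occurs in the question.
    question = state["question"].lower()
    return {t: cols for t, cols in state["schema"].items() if t in question}

def generate_candidates(state):
    # Stand-in for an LLM call: one deliberately broken candidate, one valid.
    table = next(iter(state["schema_selection"]))
    return [f"SELEC COUNT(*) FROM {table}", f"SELECT COUNT(*) FROM {table}"]

def revise_query(state):
    # Toy revision: return the first candidate that executes without error.
    for sql in state["candidate_generation"]:
        try:
            state["conn"].execute(sql).fetchall()
            return sql
        except sqlite3.Error:
            continue
    return None

STAGES = [
    ("schema_selection", select_schema),
    ("candidate_generation", generate_candidates),
    ("query_revision", revise_query),
]

def run_pipeline(question, schema, conn):
    """Run the stages in order, keeping each module's output and timing."""
    state = {"question": question, "schema": schema, "conn": conn}
    traces = []
    for name, stage in STAGES:
        start = time.perf_counter()
        state[name] = stage(state)
        traces.append((name, time.perf_counter() - start))
    return state, traces

def table_recall(selected, gold):
    # Assumed module-level measure: fraction of gold tables that were kept.
    gold = set(gold)
    return len(gold & set(selected)) / len(gold)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE concerts (id INTEGER, singer_id INTEGER)")
schema = {"singers": ["id", "name"], "concerts": ["id", "singer_id"]}

final, traces = run_pipeline("How many singers are there?", schema, conn)
for name, seconds in traces:
    print(f"{name}: {seconds * 1e3:.3f} ms")
print("final SQL:", final["query_revision"])
print("table recall:", table_recall(final["schema_selection"], ["singers"]))
```

Keeping every module's intermediate output in a shared state is what allows each stage to be scored on its own, rather than judging only the final SQL.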
