Beyond Text-to-SQL: Can LLMs Really Debug Enterprise ETL SQL?
Jing Ye, Yiwen Duan, Yonghong Yu, Victor Ma, Yang Gao, Xing Chen

TL;DR
This paper introduces OurBench, a benchmark for evaluating how well large language models debug complex enterprise SQL code. Even the best model reaches only about 36% accuracy on syntax errors, and most models stay below 20% on semantic errors.
Contribution
It presents an automated workflow that injects realistic bugs into large-scale SQL code to build debugging benchmarks, together with an execution-free evaluation framework tailored to enterprise settings.
Findings
Best model achieves only ~36% accuracy on syntax errors.
Most models score below 20% on semantic errors.
The benchmark includes highly complex, large-scale SQL queries.
Abstract
SQL is central to enterprise data engineering, yet generating fully correct SQL code in a single attempt remains difficult, even for experienced developers and advanced text-to-SQL LLMs, often requiring multiple debugging iterations. We introduce OurBench, the first benchmark for enterprise-level SQL reasoning and debugging. Our benchmark is built on two key innovations: (1) an automated construction workflow that uses reverse engineering to systematically inject realistic bugs into large-scale SQL code, enabling scalable and diverse benchmark generation; and (2) an execution-free evaluation framework tailored to enterprise settings, providing fast, accurate, and resource-efficient assessment. OurBench comprises 469 OurBenchSyn queries featuring syntax errors with explicit error messages, and 516 OurBenchSem queries targeting semantic errors in which the code fails to meet user…
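As a rough illustration of the two ideas in the abstract (injecting a bug into working SQL, then judging a repair without running it), here is a minimal Python sketch. It is not the authors' workflow: the function names `inject_unbalanced_paren`, `normalize`, and `is_fixed`, the sample query, and the whitespace-and-case comparison are all assumptions made for this example, and a real evaluator would need something far more robust than text matching.

```python
import re


def inject_unbalanced_paren(sql: str) -> str:
    """Hypothetical bug injector: delete the last ')' to create a syntax error."""
    head, sep, tail = sql.rpartition(")")
    return head + tail if sep else sql


def normalize(sql: str) -> str:
    """Collapse whitespace and lowercase the text (a simplification: this
    would also lowercase string literals, which a real tool must not do)."""
    return re.sub(r"\s+", " ", sql).strip().lower()


def is_fixed(candidate: str, reference: str) -> bool:
    """Execution-free check (assumed): compare normalized text, run nothing."""
    return normalize(candidate) == normalize(reference)


# Illustrative reference query; not taken from the benchmark.
reference = (
    "SELECT customer_id, SUM(amount) AS total "
    "FROM orders GROUP BY customer_id"
)
buggy = inject_unbalanced_paren(reference)

# A differently formatted but equivalent repair should still count as fixed.
repair = (
    "select customer_id,\n"
    "       sum(amount) as total\n"
    "from orders group by customer_id"
)

print(is_fixed(buggy, reference))
print(is_fixed(repair, reference))
```

Starting from a known-good query and breaking it means the reference answer is available by construction, which is what makes a check without execution possible in this sketch.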
Taxonomy
Topics: Advanced Database Systems and Queries · Web Application Security Vulnerabilities · Logic, programming, and type systems
