Pervasive Annotation Errors Break Text-to-SQL Benchmarks and Leaderboards
Tengjun Jin, Yoojin Choi, Yuxuan Zhu, Daniel Kang

TL;DR
This study reveals high annotation error rates in key text-to-SQL benchmarks, demonstrating that these errors significantly distort agent performance evaluations and leaderboard rankings, thereby impacting research progress and deployment decisions.
Contribution
The paper empirically quantifies annotation errors in major text-to-SQL benchmarks and shows their substantial effect on performance metrics and rankings.
Findings
Annotation error rates are 52.8% in BIRD Mini-Dev and 62.8% in Spider 2.0-Snow.
Agent performance and leaderboard rank change significantly when agents are re-evaluated on the corrected BIRD Dev subset.
Agent rankings on the uncorrected subset correlate strongly with rankings on the full BIRD Dev set, but only weakly with rankings on the corrected subset.
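The ranking comparison in the last finding can be illustrated with a rank correlation. The following is a minimal sketch, assuming Spearman's rank correlation as the measure; this summary does not name the statistic the authors used. The agent scores are invented toy values for illustration only, not results from the paper, and the function names are hypothetical.

```python
from statistics import mean


def average_ranks(scores):
    """Rank scores so the highest gets rank 1; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with the score at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if every score in one list is identical.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy accuracy scores for five hypothetical agents (NOT data from the paper).
original_subset = [70.0, 68.0, 65.0, 60.0, 55.0]
corrected_subset = [62.0, 71.0, 58.0, 66.0, 64.0]

print("original vs. original :", spearman(original_subset, original_subset))
print("original vs. corrected:", spearman(original_subset, corrected_subset))
```

A coefficient near 1 means two leaderboards order the agents almost identically. A coefficient near 0 means the ordering on one set says little about the ordering on the other, which is the pattern the finding describes between the uncorrected and corrected subsets.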
Abstract
Researchers have proposed numerous text-to-SQL techniques to streamline data analytics and accelerate the development of data-driven applications. To compare these techniques and select the best one for deployment, the community depends on public benchmarks and their leaderboards. Since these benchmarks heavily rely on human annotations during question construction and answer evaluation, the validity of the annotations is crucial. In this paper, we conduct an empirical study that (i) benchmarks annotation error rates for two widely used text-to-SQL benchmarks, BIRD and Spider 2.0-Snow, and (ii) corrects a subset of the BIRD development (Dev) set to measure the impact of annotation errors on text-to-SQL agent performance and leaderboard rankings. Through expert analysis, we show that BIRD Mini-Dev and Spider 2.0-Snow have error rates of 52.8% and 62.8%, respectively. We re-evaluate all…
Taxonomy
Topics: Scientific Computing and Data Management · Software Engineering Research · Advanced Database Systems and Queries
