Evaluating Cross-Domain Text-to-SQL Models and Benchmarks
Mohammadreza Pourreza, Davood Rafiei

TL;DR
This paper critically examines cross-domain Text-to-SQL benchmarks. It shows that a perfect score is unattainable because many samples admit more than one valid interpretation, and that queries from a recent GPT-4-based model can outperform the benchmarks' own reference queries in human evaluation.
Contribution
The study provides a comprehensive re-evaluation of Text-to-SQL benchmarks, highlighting limitations of current evaluation methods and uncovering that GPT-4 can surpass reference queries in certain benchmarks.
Findings
Perfect performance on these benchmarks is unfeasible because the provided samples admit multiple interpretations.
Current evaluation underestimates true model performance, and the models' relative performance changes after re-evaluation.
GPT-4 surpasses reference queries in human evaluation on Spider.
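The first finding can be illustrated with a small sketch. It is not taken from the paper: the `student` table, its rows, and both queries are hypothetical, and the comparison is a simplified stand-in for an execution-based metric. The question "Which student has the highest score?" does not say what to return on a tie, so a reference query that keeps one row and a predicted query that keeps all tied rows are both defensible, yet a result comparison marks the prediction as wrong.

```python
import sqlite3

# Hypothetical schema and data, invented only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("Ann", 95), ("Bob", 95), ("Cy", 80)],  # Ann and Bob tie for the top score
)

# Reference-style query: returns exactly one row, chosen arbitrarily on a tie.
reference_sql = "SELECT name FROM student ORDER BY score DESC LIMIT 1"

# Predicted-style query: returns every student who has the top score.
predicted_sql = (
    "SELECT name FROM student "
    "WHERE score = (SELECT MAX(score) FROM student)"
)


def run(connection, sql):
    """Execute a query and return its rows in a canonical (sorted) order."""
    return sorted(connection.execute(sql).fetchall())


reference_rows = run(conn, reference_sql)
predicted_rows = run(conn, predicted_sql)

# Simplified execution-based comparison: results must be identical.
execution_match = reference_rows == predicted_rows

print("reference rows:", reference_rows)
print("predicted rows:", predicted_rows)
print("execution match:", execution_match)
```

Which of the two tied names the `LIMIT 1` query returns is left to the database engine, so even two runs of the reference query on different engines could disagree. That is the kind of case where manual evaluation or rewriting into an equivalent expression is needed to judge the prediction fairly.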
Abstract
Text-to-SQL benchmarks play a crucial role in evaluating the progress made in the field and the ranking of different models. However, accurately matching a model-generated SQL query to a reference SQL query in a benchmark fails for various reasons, such as underspecified natural language queries, inherent assumptions in both model-generated and reference queries, and the non-deterministic nature of SQL output under certain conditions. In this paper, we conduct an extensive study of several prominent cross-domain text-to-SQL benchmarks and re-evaluate some of the top-performing models within these benchmarks, by both manually evaluating the SQL queries and rewriting them in equivalent expressions. Our evaluation reveals that attaining a perfect performance on these benchmarks is unfeasible due to the multiple interpretations that can be derived from the provided samples. Furthermore, we…
Taxonomy
Topics: Advanced Database Systems and Queries · Data Quality and Management · Semantic Web and Ontologies
