TL;DR
PolySQL introduces a novel method for evaluating text-to-SQL models across different SQL dialects without manual query translation, revealing significant performance gaps and dialect-specific challenges.
Contribution
PolySQL presents a dual-execution approach for cross-dialect evaluation, along with three datasets and a framework for large-scale, accurate benchmarking of text-to-SQL robustness across SQL dialects.
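As a rough illustration of what "comparing normalized execution results" could look like, the sketch below runs a reference query on SQLite and compares its rows with rows returned by another engine. This is only one plausible reading of the idea, not PolySQL's actual implementation: the helper names (`normalize_value`, `normalize_rows`, `results_match`), the rounding precision, the order-insensitive comparison, and the toy `items` table are all assumptions made for this example.

```python
import sqlite3
from collections import Counter
from decimal import Decimal


def normalize_value(v, ndigits=6):
    """Map engine-specific Python types onto one canonical form (assumed rules)."""
    if v is None:
        return None
    if isinstance(v, bool):
        # Some drivers return booleans, SQLite returns 0/1.
        return float(int(v))
    if isinstance(v, (int, float, Decimal)):
        # Treat 3, 3.0 and Decimal("3.00") as the same number.
        return round(float(v), ndigits)
    if isinstance(v, (bytes, bytearray)):
        return bytes(v).decode("utf-8", errors="replace")
    return str(v)


def normalize_rows(rows):
    """Turn a result set into a multiset of normalized row tuples."""
    return Counter(tuple(normalize_value(v) for v in row) for row in rows)


def results_match(gold_rows, pred_rows):
    """Order-insensitive comparison that still respects duplicate rows."""
    return normalize_rows(gold_rows) == normalize_rows(pred_rows)


# Reference side: execute a query on an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("apple", 3.0), ("pear", 2.5), ("fig", 0.5)],
)
gold_rows = conn.execute(
    "SELECT name, price FROM items WHERE price > 1"
).fetchall()
conn.close()

# Target side: hypothetical rows as another engine's driver might return them
# (different row order, Decimal and int instead of float).
pred_rows = [("pear", Decimal("2.50")), ("apple", 3)]

print(results_match(gold_rows, pred_rows))
```

Because only result sets are compared, no query ever has to be translated between dialects. A real harness would likely need an order-sensitive mode for queries with ORDER BY, plus dialect-specific handling of dates and timestamps, which this sketch leaves out.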
Findings
SQLite performance does not reliably indicate other dialects' performance.
Cross-dialect evaluation shows an average 10.1% accuracy drop when moving from SQLite to other dialects.
Most errors are logical rather than syntactic.
Abstract
SQL dialects vary in syntax, types, and functions across database engines. Text-to-SQL benchmarks, however, predominantly support only SQLite. This creates a critical evaluation gap: cross-dialect evaluation reveals weak per-query agreement (Cohen's κ), showing that SQLite performance is an unreliable proxy for other dialects. Yet such evaluation remains prohibitively difficult: existing approaches either require expensive manual query transpilation or rely on tools that often fail on complex SQL. To close this gap, we introduce PolySQL, a novel dual-execution method that eliminates the need for query transpilation by comparing normalized execution results. Notably, our approach achieves higher evaluation fidelity than query transpilation with 100% query coverage. PolySQL comprises three datasets, enabling the first large-scale cross-dialect study. Our study reveals a 10.1% average…
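The per-query agreement statistic named in the abstract, Cohen's kappa, is defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between two label sequences and p_e is the agreement expected by chance. The sketch below computes it for per-query correctness on two dialects. The function name `cohens_kappa` and the toy label vectors are made up for illustration and are not results from the paper.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of positions where the labels coincide.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the marginal label frequencies, summed.
    labels = set(a) | set(b)
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if p_e == 1.0:
        # Degenerate case: both sequences use a single identical label.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical per-query outcomes (1 = correct, 0 = wrong) for the same
# eight questions evaluated on SQLite and on one other dialect.
sqlite_correct = [1, 1, 1, 0, 0, 1, 0, 1]
other_correct = [1, 0, 1, 0, 1, 1, 0, 0]

print(cohens_kappa(sqlite_correct, other_correct))  # 0.25
```

A value near 1 would mean that the same queries succeed or fail on both dialects, while a value near 0 means agreement is no better than chance. That is why a low kappa undermines using SQLite results as a stand-in for other engines, even when aggregate accuracy looks similar.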
