SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification
Rocky Klopfenstein, Yang He, Andrew Tremante, Yuepeng Wang, Nina Narodytska, Haoze Wu

TL;DR
SpotIt introduces a formal verification approach to evaluate Text-to-SQL systems, revealing that traditional test-based methods may overlook differences between generated and ground-truth queries, thus providing a more reliable assessment.
Contribution
The paper presents a novel formal verification pipeline for Text-to-SQL evaluation, extending existing verifiers to support richer SQL subsets and improving evaluation reliability.
Findings
Test-based evaluation often misses differences between queries.
Formal verification uncovers more nuanced differences in system performance.
Current evaluation methods may overestimate the accuracy of Text-to-SQL models.
Abstract
Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of a generated SQL query and a human-labeled ground-truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while actually being different. In this work, we propose a new alternative evaluation pipeline, called SpotIt, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance…
Peer Reviews
Decision·ICLR 2026 Poster
S1. Strong and timely problem framing; evaluation reliability is critical for Text-to-SQL progress. S2. Clean pipeline with validation and cross-checking; minimal CEXs make root causes transparent. S3. Substantive empirical findings (accuracy/ranking shifts; gold SQL issues; ambiguity prevalence). S4. Formal treatment and proofs for extended string/date operators and set-semantics equivalence. S5. Practicality: seconds-scale per-instance runtime; high coverage relative to prior verifier.
W1. Bound sensitivity (K): The paper does not quantify how many non-equivalences require bounds beyond K=5, so the false-negative rate is unknown. A detection-vs-K curve and examples missed at K=5 are needed to calibrate trust in results. W2. Dataset scope: All results are on BIRD-dev, leaving external validity uncertain. Including Spider 2.0 and one enterprise-style schema would test generality and robustness of conclusions. W3. Cause attribution (gold vs ambiguity vs model): The labeling relie
1. This paper tackles an important problem in Text-to-SQL, assessing SQL correctness beyond execution equivalence. By searching for minimal counterexample databases where two SQL queries diverge, SpotIt offers a more rigorous correctness check and greatly simplifies debugging (e.g., identifying faulty ground truths in BIRD). 2. The paper extends the SQL syntax that can be encoded into an SMT solver, which is a great contribution. It consistently covers more than 90 percent of the queries generat
1. SPOTIT only searches for divergences up to a fixed tuple-count bound K. Semantic differences that manifest only on larger databases (e.g., join multiplicity) may go undetected if no small counterexample exists. 2. As schema size or query complexity grows (many tables, join many tables together, deep nesting, large numbers of predicates), the SMT solver’s search space can explode, leading to longer solve times. 3. There are still some unsupported SQL features, such as window and analytics func
In text to SQL evaluation plays a vital role and this paper acknowledged the issue. Current methods relying on test-based execution on a static database are optimistic. The proposed solution, SPOTIT, replaces the weak test-based check with a much stronger, formal verification method. This also showed that the reported accuracy of these methods drops by 11.3%–14.2% when switching from the official test-based evaluation to SPOTIT
The sensitivity of the bound parameter (k) is not analyzed, leaving unclear how verification completeness and runtime scale with k. The paper does not report statistics on timeouts or unsupported queries.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsWeb Application Security Vulnerabilities · Advanced Database Systems and Queries · Natural Language Processing Techniques
