ROSE: An Intent-Centered Evaluation Metric for NL2SQL
Wenqi Pei, Shizheng Hou, Boyan Li, Han Chen, Zhichao Shi, Yuyu Luo

TL;DR
ROSE is a new intent-centered evaluation metric for NL2SQL that better aligns with human judgment by focusing on whether the SQL answers the question, not just syntactic similarity.
Contribution
Introduces ROSE, a metric built on an adversarial Prover-Refuter cascade that makes NL2SQL evaluation more reliable by judging semantic correctness against the user's intent.
Findings
ROSE outperforms existing metrics by 24% in Cohen's Kappa.
ROSE achieves the best agreement with human experts on the expert-aligned validation set ROSE-VEC.
Re-evaluation of 19 NL2SQL methods reveals new insights.
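For readers unfamiliar with the agreement statistic cited above, Cohen's Kappa corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal illustration, not the paper's evaluation code; the function name `cohens_kappa` and the label lists are assumptions made up for this example.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa between two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)

    # Degenerate case: both raters always use the same single label.
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Illustrative (made-up) labels: 1 = "SQL judged correct", 0 = "judged incorrect".
metric_labels = [1, 1, 0, 0]
human_labels = [1, 1, 0, 1]
kappa = cohens_kappa(metric_labels, human_labels)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, which is why it is a stricter yardstick than raw accuracy when labels are imbalanced.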
Abstract
Execution Accuracy (EX), the widely used metric for evaluating the effectiveness of Natural Language to SQL (NL2SQL) solutions, is becoming increasingly unreliable. It is sensitive to syntactic variation, ignores that questions may admit multiple interpretations, and is easily misled by erroneous ground-truth SQL. To address this, we introduce ROSE, an intent-centered metric that focuses on whether the predicted SQL answers the question, rather than consistency with the ground-truth SQL under the reference-dependent paradigm. ROSE employs an adversarial Prover-Refuter cascade: SQL Prover assesses the semantic correctness of a predicted SQL against the user's intent independently, while Adversarial Refuter uses the ground-truth SQL as evidence to challenge and refine this judgment. On our expert-aligned validation set ROSE-VEC, ROSE achieves the best agreement with human experts,…
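The data flow of the Prover-Refuter cascade described in the abstract can be sketched as follows. This is a hedged illustration only: the names `Verdict`, `rose_like_judge`, `toy_prover`, and `toy_refuter` are hypothetical, the toy judges are trivial string rules standing in for whatever models the authors use, and how the paper actually combines the two stages is not stated in the text above. The one property taken from the abstract is that the Prover sees only the question and the predicted SQL, while the Refuter additionally receives the ground-truth SQL as evidence.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    correct: bool
    rationale: str


# Hypothetical interfaces (assumptions, not the paper's API).
Prover = Callable[[str, str], Verdict]                 # (question, predicted_sql)
Refuter = Callable[[str, str, str, Verdict], Verdict]  # (+ gold_sql, prover verdict)


def rose_like_judge(question: str, predicted_sql: str, gold_sql: str,
                    prover: Prover, refuter: Refuter) -> Verdict:
    # Stage 1: judge the prediction against the question alone, without the reference.
    first = prover(question, predicted_sql)
    # Stage 2: use the ground-truth SQL as evidence to challenge or keep that judgment.
    return refuter(question, predicted_sql, gold_sql, first)


def toy_prover(question: str, predicted_sql: str) -> Verdict:
    # Stand-in rule: accept anything that looks like a SELECT statement.
    ok = predicted_sql.strip().lower().startswith("select")
    return Verdict(ok, "looks like a query" if ok else "not a SELECT statement")


def toy_refuter(question: str, predicted_sql: str, gold_sql: str,
                first: Verdict) -> Verdict:
    # Stand-in rule: overturn an acceptance if the reference filters rows
    # and the prediction does not; otherwise keep the Prover's verdict.
    if first.correct and " where " in gold_sql.lower() and " where " not in predicted_sql.lower():
        return Verdict(False, "reference filters rows but prediction does not")
    return first


question = "Which customers live in Paris?"
gold = "SELECT c.name FROM customers AS c WHERE c.city = 'Paris'"
kept = rose_like_judge(question, "SELECT name FROM customers WHERE city = 'Paris'",
                       gold, toy_prover, toy_refuter)
overturned = rose_like_judge(question, "SELECT name FROM customers",
                             gold, toy_prover, toy_refuter)
```

In this toy run the first prediction differs syntactically from the reference yet is kept as correct, while the second is accepted by the Prover and then overturned once the reference is consulted.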
