Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems

Taslim Jamal Arif; Kuldeep Singh

arXiv:2604.28049·cs.AI·May 1, 2026

TL;DR

This paper introduces STEF, a schema-agnostic evaluation framework for Text-to-SQL systems that enables production-level monitoring and improvement without relying on database schemas or reference queries.

Contribution

The paper presents STEF, a novel production-native evaluation system for Text-to-SQL that operates solely on natural language inputs and generated SQL, removing schema dependency.

Findings

01

STEF provides a 0-100 accuracy score based on semantic alignment.

02

STEF enables continuous monitoring and feedback for production Text-to-SQL agents.

03

STEF accounts for schema variations and default heuristics to keep evaluation robust.

Abstract

Text-to-SQL (T2SQL) evaluation in production environments poses fundamental challenges that existing benchmarks do not address. Current evaluation methodologies, whether rule-based SQL matching or schema-dependent semantic parsers, assume access to ground-truth queries and a structured database schema, constraints that are rarely satisfied in real-world deployments. This disconnect leaves production T2SQL agents largely unevaluated beyond developer-time testing, so quality can degrade silently with no feedback mechanism for continuous improvement. We present STEF (Schema-agnostic Text-to-SQL Evaluation Framework), a production-native evaluation system that operates exclusively on natural language inputs (the user question, an enriched reformulation, and the generated SQL) without requiring a database schema or reference queries. STEF extracts semantic specifications from both natural…
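The idea of scoring semantic alignment on a 0-100 scale can be illustrated with a minimal sketch. This is not STEF's actual method, which the truncated abstract does not spell out. It assumes that a "semantic specification" has already been extracted upstream from the question and from the generated SQL, and that each one is a dictionary mapping a facet name to a set of items. The facet names (`aggregations`, `filters`, `grouping`), the function `alignment_score`, and the use of mean Jaccard overlap are all hypothetical choices made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; two empty sets count as a perfect match."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def alignment_score(intent: dict, sql: dict) -> float:
    """Mean per-facet Jaccard overlap, scaled to 0-100 (illustrative only)."""
    facets = sorted(set(intent) | set(sql))
    if not facets:
        return 100.0
    total = sum(
        jaccard(set(intent.get(f, ())), set(sql.get(f, ())))
        for f in facets
    )
    return round(100 * total / len(facets), 1)


# Hypothetical specs: the question asks for a count of active rows per region,
# but the generated SQL omits the grouping.
intent_spec = {
    "aggregations": {"count"},
    "filters": {"status = active"},
    "grouping": {"region"},
}
sql_spec = {
    "aggregations": {"count"},
    "filters": {"status = active"},
    "grouping": set(),
}

score = alignment_score(intent_spec, sql_spec)
print(score)
```

Because the comparison uses only the two extracted specifications, a score like this needs neither the database schema nor a reference query, which is the property the abstract emphasizes. A production scorer would presumably weight facets differently and handle paraphrased items; those details are left out here.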