Node-Level Uncertainty Estimation in LLM-Generated SQL
Hilaf Hasson, Ruocheng Guo

TL;DR
This paper introduces a node-level uncertainty estimation framework for detecting errors in LLM-generated SQL queries by analyzing individual syntax tree nodes, improving error detection accuracy and interpretability.
Contribution
The authors propose a novel, semantically aware, node-level error detection method that outperforms token probability baselines and enables targeted query repair and review.
Findings
Average AUC improves by 27.44% over a token log-probability baseline
Method maintains robustness across multiple databases
Enables fine-grained error diagnostics and targeted corrections
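To make the AUC finding concrete, here is a minimal sketch of how AUC could be computed over per-node error scores and compared against a baseline score. The function name `roc_auc` and all numbers are invented for illustration; they are not the paper's data, and the paper's exact evaluation protocol is not given in this summary.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen erroneous node (label 1)
    receives a higher score than a randomly chosen correct node (label 0).
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one erroneous and one correct node")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
node_is_error = [1, 0, 0, 1, 0]
classifier_scores = [0.9, 0.2, 0.4, 0.7, 0.1]  # hypothetical per-node error probabilities
baseline_scores = [0.6, 0.5, 0.7, 0.3, 0.2]    # hypothetical scores derived from token log-probabilities

auc_classifier = roc_auc(node_is_error, classifier_scores)
auc_baseline = roc_auc(node_is_error, baseline_scores)
print(auc_classifier, auc_baseline)
```

An AUC of 0.5 corresponds to a score that ranks erroneous and correct nodes no better than chance, so the gap between the two values is what a comparison of this kind measures.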
Abstract
We present a practical framework for detecting errors in LLM-generated SQL by estimating uncertainty at the level of individual nodes in the query's abstract syntax tree (AST). Our approach proceeds in two stages. First, we introduce a semantically aware labeling algorithm that, given a generated SQL and a gold reference, assigns node-level correctness without over-penalizing structural containers or alias variation. Second, we represent each node with a rich set of schema-aware and lexical features - capturing identifier validity, alias resolution, type compatibility, ambiguity in scope, and typo signals - and train a supervised classifier to predict per-node error probabilities. We interpret these probabilities as calibrated uncertainty, enabling fine-grained diagnostics that pinpoint exactly where a query is likely to be wrong. Across multiple databases and datasets, our method…
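The second stage described in the abstract can be illustrated with a small sketch. It assumes a simplified node type (`ColumnNode`, a column reference only), a toy schema, and hand-picked weights standing in for a trained classifier; none of these names or values come from the paper, and type-compatibility features are omitted. It shows how signals such as alias resolution, identifier validity, scope ambiguity, and typo likelihood might be turned into a per-node error probability.

```python
import difflib
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnNode:
    """Hypothetical stand-in for one AST node: a column reference such as u.name."""
    qualifier: Optional[str]  # alias or table name, if present
    column: str


# Toy schema and alias map (assumptions for illustration).
SCHEMA = {"users": {"id", "name", "email"}, "orders": {"id", "user_id", "total"}}
ALIASES = {"u": "users", "o": "orders"}


def node_features(node, schema, aliases):
    """Compute a few schema-aware and lexical features for one node."""
    in_scope = sorted(t for t in set(aliases.values()) if t in schema)
    if node.qualifier is None:
        alias_resolves = True
        tables = in_scope
    else:
        table = aliases.get(node.qualifier, node.qualifier)
        alias_resolves = table in schema
        tables = [table] if alias_resolves else []
    # Tables in scope that actually contain this column.
    owners = [t for t in tables if node.column in schema[t]]
    candidates = sorted({c for t in tables for c in schema[t]})
    near = difflib.get_close_matches(node.column, candidates, n=1, cutoff=0.7)
    return {
        "alias_unresolved": 0.0 if alias_resolves else 1.0,
        "identifier_invalid": 0.0 if owners else 1.0,
        "ambiguous_scope": 1.0 if len(owners) > 1 else 0.0,
        "typo_signal": 1.0 if (not owners and near) else 0.0,
    }


# Placeholder weights: a real system would learn these from labeled nodes.
WEIGHTS = {"alias_unresolved": 3.0, "identifier_invalid": 2.5,
           "ambiguous_scope": 1.5, "typo_signal": 1.0}
BIAS = -3.0


def error_probability(features):
    """Logistic score over the features, read as a per-node error probability."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


nodes = [ColumnNode("u", "name"), ColumnNode("u", "emial"),
         ColumnNode(None, "id"), ColumnNode("x", "total")]
for node in nodes:
    p = error_probability(node_features(node, SCHEMA, ALIASES))
    print(node, round(p, 3))
```

With these placeholder weights, the node with an unresolved alias scores highest, followed by the misspelled column, the ambiguous unqualified `id`, and the valid reference. A trained classifier would set the weights from the node-level labels produced in the first stage, and its outputs would need a calibration check before being read as uncertainty.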
Taxonomy
Topics: Advanced Database Systems and Queries · Data Quality and Management · Web Application Security Vulnerabilities
