Bridging Global Intent with Local Details: A Hierarchical Representation Approach for Semantic Validation in Text-to-SQL
Rihong Qiu, Zhibang Yang, Xinke Jiang, Weibin Liao, Xin Gao, Xu Chu, Junfeng Zhao, Yasha Wang

TL;DR
This paper introduces HEROSQL, a hierarchical SQL representation method combining global intent and local details, employing neural networks and data augmentation to improve semantic validation accuracy in Text-to-SQL systems.
Contribution
HEROSQL is a novel hierarchical approach that integrates logical plans and syntax trees with neural message passing and data augmentation for enhanced semantic validation.
Findings
Outperforms state-of-the-art methods in semantic inconsistency detection
Achieves 9.40% improvement in AUPRC
Achieves 12.35% improvement in AUROC
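For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch of how AUROC and AUPRC are commonly computed for a binary detector. It is not the paper's evaluation code. The function names, the toy labels and scores, and the choice of label 1 as the positive class are all assumptions made for illustration, and AUPRC is estimated here by average precision, which is one common estimator.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision@k over the ranks k that hold a positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data (assumed): label 1 marks the class the detector should rank highest.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auroc(labels, scores), average_precision(labels, scores))
```

AUROC depends only on how positives rank against negatives, while average precision is also sensitive to class balance, which is one reason the two are often reported together.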
Abstract
Text-to-SQL translates natural language questions into SQL statements grounded in a target database schema. Ensuring the reliability and executability of such systems requires validating generated SQL, but most existing approaches focus only on syntactic correctness, with few addressing semantic validation (detecting misalignments between questions and SQL). As a consequence, effective semantic validation still faces two key challenges: capturing both global user intent and SQL structural details, and constructing high-quality fine-grained sub-SQL annotations. To tackle these, we introduce HEROSQL, a hierarchical SQL representation approach that integrates global intent (via Logical Plans, LPs) and local details (via Abstract Syntax Trees, ASTs). To enable better information propagation, we employ a Nested Message Passing Neural Network (NMPNN) to capture inherent relational information…
Peer Reviews
Decision: Submitted to ICLR 2026
S1. Novel combination of LP-level global semantics and AST-level local structure for semantic validation, with a principled NMPNN over the hierarchy. S2. AST-driven negative augmentation specifically for validation (vs. generation) is useful. S3. Strong empirical results across diverse datasets and a useful demonstration of fine-grained feedback to LLMs.
W1. Limited justification for decoder-only embeddings vs. strong encoder baselines. Encoder-only models (e.g., E5, GTE, BGE, Contriever) are strong baselines for text embedding; decoder-only choices increase compute and may not help validation. W2. Reproducibility gaps (placeholders; limited detail on LP extraction). Placeholders (runs, stdevs) and missing versions reduce confidence and hinder replication. W3. Practical latency/throughput for interactive validation not reported. Validation is of
S1. Fine-grained negative augmentation enabling sub-SQL feedback is practical and useful. S2. The KV-cache and schema compression make the proposed method more practical. S3. The proposed method with small LLMs shows reasonable performance compared to baseline methods.
W1. My primary concern - using LPs to represent global intent is not convincing as LP is also based on the generated SQL, which may not reflect the semantic meaning in the original NL question. W2. The experimental evaluation is not solid. The authors only provide limited qualitative analysis. The chosen baselines are weak or general purpose error detection, not specific to Text-to-SQL. Hence the results are not convincing to show the actual effectiveness of the proposed method in real-world se
1. This paper offers a clear and rigorous definition of semantic correctness in Text-to-SQL, grounded in the alignment between unstructured user intent and structured SQL components (at the AST level). While many prior works use the term loosely, this formalization is precise and actionable, and sets a strong foundation for future research in this area. 2. The technical design is solid and well-motivated. By combining AST-level and logical-plan encodings through a Nested Message Passing Neural
1. AST perturbations are limited to the hand-crafted rule set (e.g. swapping operators, dropping predicates), which may not mimic the full spectrum of mistakes that LLMs actually make. LLM-generated negatives help diversify errors, but they’re still filtered purely by execution mismatch, so edge-case logic bugs that happen to produce the same result go unnoticed. 2. LLM-based baselines only include small LLMs. Could you justify this setup? What would be the performance if you use a commercial m
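The rule-based perturbation and execution-mismatch filtering described in the last review can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the toy `emp` table, the function names, and the whitespace-token operator swap (a stand-in for a real AST edit) are all assumptions.

```python
import sqlite3

# Hypothetical hand-crafted rule: flip a comparison operator.
SWAPS = {">": "<", "<": ">", ">=": "<=", "<=": ">="}


def make_db():
    """Build a toy in-memory database (assumed schema, illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO emp VALUES (?, ?)",
        [("Ann", 30), ("Bob", 41), ("Cy", 55)],
    )
    return conn


def swap_operators(sql):
    """Yield variants of sql with exactly one comparison operator flipped."""
    tokens = sql.split()
    for i, tok in enumerate(tokens):
        if tok in SWAPS:
            yield " ".join(tokens[:i] + [SWAPS[tok]] + tokens[i + 1:])


def run(conn, sql):
    """Execute sql and return its rows in a canonical (sorted) order."""
    return sorted(conn.execute(sql).fetchall())


def execution_filtered_negatives(conn, gold_sql):
    """Keep only perturbed queries whose result differs from the gold result."""
    gold = run(conn, gold_sql)
    return [v for v in swap_operators(gold_sql) if run(conn, v) != gold]


conn = make_db()
# The flipped operator changes the result, so the variant is kept as a negative.
print(execution_filtered_negatives(conn, "SELECT name FROM emp WHERE age > 40"))
# The flipped query is semantically wrong but returns the same count on this
# data, so an execution-only filter discards it.
print(execution_filtered_negatives(conn, "SELECT COUNT(*) FROM emp WHERE age > 41"))
```

The second call shows the edge case the reviewer raises: a perturbed query with a genuine logic error can match the gold result on a particular database instance, so a filter based only on execution mismatch never surfaces it as a negative.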
Taxonomy
Topics: Natural Language Processing Techniques · Advanced Database Systems and Queries · Web Application Security Vulnerabilities
