CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description

Shaoming Duan; Zirui Wang; Chuanyi Liu; Zhibin Zhu; Yuhao Zhang; Peiyi Han; Liang Yan; Zewu Peng

arXiv:2508.12769·cs.CL·August 21, 2025

TL;DR

CRED-SQL improves large-scale database Text-to-SQL parsing by using cluster retrieval and an intermediate language to better match natural language questions with SQL queries, achieving state-of-the-art results.

Contribution

The paper introduces CRED-SQL, a novel framework combining cluster-based schema retrieval and an intermediate language to enhance large-scale database Text-to-SQL accuracy.

Findings

01. Achieves state-of-the-art performance on large-scale benchmarks

02. Effectively reduces semantic mismatch and drift in SQL generation

03. Demonstrates scalability across cross-domain datasets

Abstract

Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and cause semantic drift during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description. CRED-SQL first performs cluster-based large-scale schema retrieval to pinpoint the tables and columns most relevant to a given NLQ, alleviating schema mismatch. It then introduces an intermediate natural language representation, Execution Description Language (EDL), to bridge the gap between NLQs…
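To make the first stage concrete, here is a minimal sketch of what cluster-based schema retrieval could look like. It is not the authors' implementation: the bag-of-words `embed` function, the greedy clustering rule, the `threshold` value, and the toy `SCHEMA` are all assumptions chosen so the example runs with the standard library alone. A real system would presumably use a learned embedding model and a proper clustering algorithm. The EDL stage is not sketched here because the abstract excerpt does not describe its format.

```python
import math
import re
from collections import Counter

# Hypothetical toy schema (table -> columns); not taken from the paper.
SCHEMA = {
    "orders": ["order_id", "customer_id", "order_date", "total_amount"],
    "order_items": ["order_id", "product_id", "quantity"],
    "employees": ["employee_id", "name", "salary", "department_id"],
    "departments": ["department_id", "department_name"],
}

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_tables(schema, threshold=0.5):
    """Greedy one-pass clustering (an assumed, simplified rule):
    a table joins the first cluster whose seed table is similar enough,
    otherwise it starts a new cluster."""
    clusters = []
    for table, columns in schema.items():
        vec = embed(table + " " + " ".join(columns))
        for cluster in clusters:
            if cosine(vec, cluster["seed"]) >= threshold:
                cluster["tables"].append(table)
                cluster["centroid"].update(vec)  # accumulate token counts
                break
        else:
            clusters.append(
                {"seed": vec, "centroid": Counter(vec), "tables": [table]}
            )
    return clusters

def retrieve(question, clusters):
    """Return the tables of the cluster closest to the question."""
    q = embed(question)
    best = max(clusters, key=lambda c: cosine(q, c["centroid"]))
    return best["tables"]

if __name__ == "__main__":
    clusters = cluster_tables(SCHEMA)
    print(retrieve("What is the average salary in each department?", clusters))
```

Grouping tables before retrieval means a question is matched against a handful of cluster summaries rather than every column in the database, which is the intuition behind narrowing the schema before SQL generation.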
