DCMM-SQL: Automated Data-Centric Pipeline and Multi-Model Collaboration Training for Text-to-SQL Model

Yuanzhen Xie; Liu Ye; Jiqun Chu; Mochi Gao; Hehuan Liu; Yunzhi Tan; Bo Hu; Zang Li

arXiv:2510.23284 · cs.CL · October 28, 2025

TL;DR

This paper introduces an automated data-centric pipeline and a multi-model collaboration training approach for text-to-SQL tasks, improving accuracy by repairing and augmenting the training data and by combining multiple models.

Contribution

It proposes a novel fully automated data-centric pipeline with adaptive data repair and error data augmentation, along with a multi-model collaboration training schema for text-to-SQL.

Findings

01. Achieved first place among lightweight text-to-SQL models (within 70B parameters).

02. Demonstrated the effectiveness of the data repair and augmentation strategies.

03. Validated that multi-model collaboration improves task accuracy.

Abstract

Text-to-SQL tasks have improved substantially since the release of ChatGPT, and agent-based frameworks have been widely adopted in this field. However, the impact of data-centric strategies on text-to-SQL tasks has rarely been explored. In this paper, we systematically design a fully automated data-centric pipeline for text-to-SQL tasks. It includes adaptive data repair, which automatically finds and fixes errors in the training dataset, and error data augmentation, which diffuses and enhances the erroneous data predicted by the initially trained models. We also propose a multi-model collaboration training schema that trains multiple models on different augmented data so that they acquire distinct capabilities and complement each other, because it has been found that the capability of a single fine-tuned model is very…
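As a rough illustration of the two data-centric steps named in the abstract, the sketch below sorts training examples into two groups: repair candidates, whose gold SQL fails to execute, and error cases, where an initial model's prediction returns different rows than the gold SQL. This is not the paper's implementation. The execution-based comparison, the `triage` function, the toy `singer` table, and `toy_predict` (a stand-in for an initially trained model) are all assumptions made for this example.

```python
import sqlite3


def execute(conn, sql):
    """Return the query's rows in a canonical order, or None if it fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def triage(conn, examples, predict):
    """Split examples into repair candidates and model error cases.

    Assumption for this sketch: a gold query that does not execute marks a
    training-data error, and a prediction whose result rows differ from the
    gold rows marks an erroneous case worth augmenting.
    """
    repair_candidates, error_cases = [], []
    for ex in examples:
        gold_rows = execute(conn, ex["sql"])
        if gold_rows is None:
            repair_candidates.append(ex)
            continue
        predicted = predict(ex["question"])
        if execute(conn, predicted) != gold_rows:
            error_cases.append({**ex, "predicted": predicted})
    return repair_candidates, error_cases


# Hypothetical toy database and training examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer(name TEXT, age INTEGER);
INSERT INTO singer VALUES ('Ann', 30), ('Bob', 45);
""")

examples = [
    {"question": "How many singers are there?",
     "sql": "SELECT COUNT(*) FROM singer"},
    {"question": "Who is the oldest singer?",
     "sql": "SELECT name FROM singer ORDER BY age DESC LIMIT 1"},
    {"question": "List all singer names.",
     "sql": "SELECT name FROM singers"},  # gold SQL names a missing table
]


def toy_predict(question):
    """Hypothetical stand-in for an initially trained text-to-SQL model."""
    canned = {
        "How many singers are there?": "SELECT COUNT(name) FROM singer",
        "Who is the oldest singer?": "SELECT name FROM singer ORDER BY age LIMIT 1",
    }
    return canned.get(question, "SELECT name FROM singer")


repair_candidates, error_cases = triage(conn, examples, toy_predict)
```

In this sketch the error cases would seed augmentation and the repair candidates would go to a separate fixing step. How the paper actually detects, repairs, and augments data is not stated in the excerpt above.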
