CodeS: Towards Building Open-source Language Models for Text-to-SQL
Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen

TL;DR
CodeS introduces an open-source series of language models tailored for text-to-SQL translation, achieving state-of-the-art accuracy and robustness while addressing privacy and accessibility concerns associated with proprietary models.
Contribution
The paper presents CodeS, a fully open-source, pre-trained language model series specifically designed for text-to-SQL tasks, with innovative training and augmentation techniques.
Findings
CodeS achieves new SOTA accuracy on multiple benchmarks.
CodeS demonstrates high robustness across diverse datasets.
Open-source models outperform comparable closed-source models.
Abstract
Language models have shown promising performance on the task of translating natural language questions into SQL queries (Text-to-SQL). However, most of the state-of-the-art (SOTA) approaches rely on powerful yet closed-source large language models (LLMs), such as ChatGPT and GPT-4, which may have the limitations of unclear model architectures, data privacy risks, and expensive inference overheads. To address the limitations, we introduce CodeS, a series of pre-trained language models with parameters ranging from 1B to 15B, specifically designed for the text-to-SQL task. CodeS is a fully open-source language model, which achieves superior accuracy with much smaller parameter sizes. This paper studies the research challenges in building CodeS. To enhance the SQL generation abilities of CodeS, we adopt an incremental pre-training approach using a specifically curated SQL-centric corpus.…
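To make the task in the abstract concrete, here is a minimal sketch of one text-to-SQL round trip: a schema and a question are serialized into a prompt, and the resulting SQL is executed against the database. The prompt layout, the `build_prompt` and `run_sql` helpers, the toy `singer` table, and the hard-coded `predicted_sql` string (standing in for a model's output) are all illustrative assumptions, not the format or pipeline used by CodeS.

```python
import sqlite3


def build_prompt(schema_ddl: str, question: str) -> str:
    """Serialize schema and question into one prompt (hypothetical layout)."""
    return (
        f"-- Database schema:\n{schema_ddl.strip()}\n"
        f"-- Question: {question}\n"
        f"-- SQL:"
    )


def run_sql(conn: sqlite3.Connection, sql: str) -> list:
    """Execute a query and return all rows."""
    return conn.execute(sql).fetchall()


# Toy schema and data, invented for this example.
SCHEMA = """
CREATE TABLE singer (
    singer_id INTEGER PRIMARY KEY,
    name TEXT,
    country TEXT,
    age INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO singer (name, country, age) VALUES (?, ?, ?)",
    [("Ana", "France", 32), ("Bo", "Japan", 41), ("Cy", "France", 25)],
)

question = "How many singers are from France?"
prompt = build_prompt(SCHEMA, question)

# Assumption: a text-to-SQL model would generate this string from `prompt`.
predicted_sql = "SELECT COUNT(*) FROM singer WHERE country = 'France'"

result = run_sql(conn, predicted_sql)
print(prompt)
print(result)
```

Executing the generated query against the database, rather than only comparing SQL strings, is one common way to check whether a prediction answers the question.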
Taxonomy
Topics: Advanced Database Systems and Queries · Distributed and Parallel Computing Systems · Semantic Web and Ontologies
Methods: Linear Layer · Dropout · Layer Normalization · Byte Pair Encoding · Multi-Head Attention · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax
