SQLong: Enhanced NL2SQL for Longer Contexts with LLMs

Dai Quoc Nguyen; Cong Duy Vu Hoang; Duy Vu; Gioacchino Tangari; Thanh Tien Vu; Don Dharmasiri; Yuan-Fang Li; Long Duong

arXiv:2502.16747·cs.CL·May 21, 2025

TL;DR

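SQLong is a data augmentation framework that improves large language models' handling of long-context NL2SQL tasks. It extends existing database schemas with additional CREATE TABLE commands and data rows sampled from other schemas in the training data, and models finetuned on the augmented data outperform those trained on the standard Spider and BIRD datasets.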

Contribution

SQLong introduces a data augmentation method for long-context NL2SQL that pads existing schemas with synthetic CREATE TABLE commands and corresponding data rows, so that large-schema inputs are simulated during both finetuning and evaluation.

Findings

01. Significant performance gains on the Spider and BIRD datasets.
02. Effective simulation of long-context scenarios during training.
03. Enhanced ability of LLMs to handle complex schemas.

Abstract

Open-weight large language models (LLMs) have significantly advanced performance in the Natural Language to SQL (NL2SQL) task. However, their effectiveness diminishes when dealing with large database schemas, as the context length increases. To address this limitation, we present SQLong, a novel and efficient data augmentation framework designed to enhance LLM performance in long-context scenarios for the NL2SQL task. SQLong generates augmented datasets by extending existing database schemas with additional synthetic CREATE TABLE commands and corresponding data rows, sampled from diverse schemas in the training data. This approach effectively simulates long-context scenarios during finetuning and evaluation. Through experiments on the Spider and BIRD datasets, we demonstrate that LLMs finetuned with SQLong-augmented data significantly outperform those trained on standard datasets. These…
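The schema-extension step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `extend_schema`, the character budget `max_chars` (a stand-in for a token-based context limit), the name-collision check, and the final shuffle are all assumptions made for illustration. Each pool entry is treated as one text block, which could hold a CREATE TABLE command plus any sample rows.

```python
import random


def table_name(stmt):
    """Extract the table name, assuming the form 'CREATE TABLE name (...)'."""
    return stmt.split("(", 1)[0].split()[-1].strip('`"').lower()


def extend_schema(schema, pool, max_chars, rng):
    """Pad `schema` with statements sampled from `pool` (hypothetical helper).

    schema    -- CREATE TABLE strings of the original database
    pool      -- CREATE TABLE strings collected from other training schemas
    max_chars -- assumed length budget, a rough proxy for context length
    rng       -- random.Random instance, passed in for reproducibility
    """
    used = {table_name(s) for s in schema}
    extended = list(schema)
    size = sum(len(s) for s in extended)

    candidates = list(pool)
    rng.shuffle(candidates)
    for stmt in candidates:
        name = table_name(stmt)
        # Skip tables that clash with an existing name or exceed the budget.
        if name in used or size + len(stmt) > max_chars:
            continue
        extended.append(stmt)
        used.add(name)
        size += len(stmt)

    # Assumption: shuffle so the original tables are not always listed first.
    rng.shuffle(extended)
    return extended


original = ["CREATE TABLE singer (id INT, name TEXT)"]
pool = [
    "CREATE TABLE airport (code TEXT, city TEXT)",
    "CREATE TABLE singer (sid INT)",  # name clash with the original schema
    "CREATE TABLE course (cid INT, title TEXT)",
]
augmented = extend_schema(original, pool, max_chars=200, rng=random.Random(0))
print("\n".join(augmented))
```

In this sketch the question and gold SQL query of a training example would stay unchanged; only the schema portion of the prompt grows, which is how a longer context is simulated under the stated assumptions.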