BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain

Rahul Kumar; Amar Raja Dibbu; Shrutendra Harsola; Vignesh Subrahmaniam; Ashutosh Modi

arXiv:2406.07860·cs.CL·June 13, 2024

TL;DR

This paper introduces BookSQL, a large-scale Text-to-SQL dataset tailored for the accounting domain, addressing the lack of domain-specific datasets and highlighting the need for specialized models.

Contribution

The paper presents a new dataset with 100k NL-SQL pairs for accounting, filling a critical gap in domain-specific resources for Text-to-SQL tasks.

Findings

01. Existing models show significant performance gaps on BookSQL.
02. The dataset enables benchmarking and development of more focused models.
03. Analysis highlights the need for domain-adapted approaches.

Abstract

Several large-scale datasets (e.g., WikiSQL, Spider) for developing natural language interfaces to databases have recently been proposed. These datasets cover a wide breadth of domains but fall short on some essential domains, such as finance and accounting. Given that accounting databases are used worldwide, particularly by non-technical people, there is an imminent need to develop models that could help extract information from accounting databases via natural language queries. In this resource paper, we aim to fill this gap by proposing a new large-scale Text-to-SQL dataset for the accounting and financial domain: BookSQL. The dataset consists of 100k natural language query-SQL pairs and accounting databases of 1 million records. We experiment with and analyze existing state-of-the-art models (including GPT-4) for the Text-to-SQL task on BookSQL. We find significant performance…
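
To make the task concrete, the following is a minimal sketch of how a natural language question, a gold SQL query, and a model prediction can be compared by execution. Everything in it is an assumption for illustration: the `invoices` table, its rows, the example question, and the helper names `build_toy_db` and `execution_match` are hypothetical and are not taken from BookSQL's actual schema, data, or evaluation code.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical accounting table.

    The schema and rows are invented for illustration; they are NOT BookSQL's.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (customer TEXT, amount REAL, paid INTEGER)")
    conn.executemany(
        "INSERT INTO invoices VALUES (?, ?, ?)",
        [("Acme", 120.0, 1), ("Acme", 80.0, 0), ("Globex", 50.0, 0)],
    )
    return conn


def execution_match(conn, gold_sql, predicted_sql):
    """Return True if both queries return the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    # Sort by repr so rows containing NULLs can still be ordered safely.
    return sorted(gold_rows, key=repr) == sorted(predicted_rows, key=repr)


# A hypothetical question with its gold query and two candidate predictions.
question = "How much does Acme still owe us?"
gold = "SELECT SUM(amount) FROM invoices WHERE customer = 'Acme' AND paid = 0"
pred_ok = "SELECT SUM(amount) FROM invoices WHERE paid = 0 AND customer = 'Acme'"
pred_bad = "SELECT SUM(amount) FROM invoices WHERE customer = 'Acme'"
```

Comparing executed results rather than query strings means `pred_ok` is accepted even though its clauses are ordered differently from the gold query, while `pred_bad`, which drops the unpaid filter, is rejected.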

Code & Models

Repositories

exploration-lab/booksql
Official

Datasets

Exploration-Lab/BookSQL
dataset · 57 downloads

Videos

BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain · Underline

Taxonomy

Topics: Financial Reporting and XBRL