BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain
Rahul Kumar, Amar Raja Dibbu, Shrutendra Harsola, Vignesh Subrahmaniam, Ashutosh Modi

TL;DR
This paper introduces BookSQL, a large-scale Text-to-SQL dataset tailored for the accounting domain, addressing the lack of domain-specific datasets and highlighting the need for specialized models.
Contribution
The paper presents a new dataset with 100k NL-SQL pairs for accounting, filling a critical gap in domain-specific resources for Text-to-SQL tasks.
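To make the idea of an NL-SQL pair concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `invoices` table, its rows, the example question, and the helper names `build_demo_db`, `run`, and `execution_match` are hypothetical and are not taken from BookSQL. The comparison shown is a simple execution-match check, a common way to score Text-to-SQL predictions; the summary above does not state which metrics the paper uses.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database.

    Hypothetical schema and rows, invented for illustration only;
    this is NOT the actual BookSQL schema or data.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices ("
        "id INTEGER PRIMARY KEY, customer TEXT, amount REAL, paid INTEGER)"
    )
    conn.executemany(
        "INSERT INTO invoices (customer, amount, paid) VALUES (?, ?, ?)",
        [("Acme", 120.0, 1), ("Acme", 80.0, 0), ("Globex", 50.0, 0)],
    )
    return conn


# One hypothetical NL-SQL pair: a question and the SQL that answers it.
PAIR = {
    "question": "What is the total unpaid invoice amount for Acme?",
    "sql": "SELECT SUM(amount) FROM invoices "
           "WHERE customer = 'Acme' AND paid = 0",
}


def run(conn, sql):
    """Execute a query and return its rows in a stable order."""
    return sorted(conn.execute(sql).fetchall(), key=repr)


def execution_match(conn, gold_sql, predicted_sql):
    """Return True if both queries yield the same rows, ignoring row order.

    A predicted query that fails to execute counts as a mismatch.
    """
    try:
        return run(conn, gold_sql) == run(conn, predicted_sql)
    except sqlite3.Error:
        return False


if __name__ == "__main__":
    conn = build_demo_db()
    # A model output that is worded differently but equivalent to the gold SQL.
    predicted = (
        "SELECT SUM(amount) FROM invoices "
        "WHERE paid = 0 AND customer = 'Acme'"
    )
    print(PAIR["question"])
    print(execution_match(conn, PAIR["sql"], predicted))
```

Comparing query results rather than query strings means two differently written but equivalent queries both count as correct, which is why the sketch executes both statements instead of matching text.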
Findings
Existing state-of-the-art Text-to-SQL models, including GPT-4, show significant performance gaps on BookSQL.
The dataset enables benchmarking and development of models specialized for accounting queries.
Analysis highlights the need for domain-adapted approaches.
Abstract
Several large-scale datasets (e.g., WikiSQL, Spider) for developing natural language interfaces to databases have recently been proposed. These datasets cover a wide breadth of domains but fall short on some essential domains, such as finance and accounting. Given that accounting databases are used worldwide, particularly by non-technical people, there is an imminent need to develop models that could help extract information from accounting databases via natural language queries. In this resource paper, we aim to fill this gap by proposing a new large-scale Text-to-SQL dataset for the accounting and financial domain: BookSQL. The dataset consists of 100k natural language queries-SQL pairs, and accounting databases of 1 million records. We experiment with and analyze existing state-of-the-art models (including GPT-4) for the Text-to-SQL task on BookSQL. We find significant performance…
Taxonomy
Topics: Financial Reporting and XBRL
