BEAVER: An Enterprise Benchmark for Text-to-SQL

Peter Baile Chen; Devin Yang; Weiyue Li; Fabian Wenz; Yi Zhang; Nesime Tatbul; Michael Cafarella; Çağatay Demiralp; Michael Stonebraker

arXiv:2409.02038 · cs.CL · May 14, 2026 · 3 cites

TL;DR

BEAVER is a new enterprise-focused Text-to-SQL benchmark derived from private data warehouses, highlighting the challenges and performance gaps of current models in complex, real-world scenarios.

Contribution

The paper introduces BEAVER, the first Text-to-SQL benchmark built from private enterprise data warehouses, together with detailed evaluation metrics and an analysis of model performance on complex queries.

Findings

01. State-of-the-art models achieve only 10.8% accuracy on BEAVER.

02. Providing subtask hints increases accuracy to 30.1%.

03. Major challenges include handling advanced functions and complex query structures.

Abstract

Existing text-to-SQL benchmarks have largely been constructed from public databases with well-structured schemas and simplistic question-SQL pairs. While large language models (LLMs) excel in these settings, their efficacy in complex private enterprise environments, characterized by intricate schemas, domain knowledge, and analytical user queries involving sophisticated structures and functions, remains unproven. To bridge this gap, we introduce BEAVER, the first text-to-SQL benchmark derived from private data warehouses. It comprises 9128 question-SQL pairs sourced from real-world query logs and 812 tables across 19 diverse domains. Building this benchmark is challenging because (1) enterprise query logs are scarce due to privacy constraints, and (2) existing all-or-nothing evaluation metrics based on accuracy make error diagnosis difficult -- especially when producing a correct query…
