BEAVER: An Enterprise Benchmark for Text-to-SQL
Peter Baile Chen, Devin Yang, Weiyue Li, Fabian Wenz, Yi Zhang, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker

TL;DR
BEAVER is a new enterprise-focused Text-to-SQL benchmark derived from private data warehouses, highlighting the challenges and performance gaps of current models in complex, real-world scenarios.
Contribution
The paper introduces BEAVER, the first Text-to-SQL benchmark built from private enterprise data warehouses, along with detailed evaluation metrics and an analysis of model performance on complex queries.
Findings
State-of-the-art models achieve only 10.8% accuracy on BEAVER.
Providing subtask hints increases accuracy to 30.1%.
Major challenges include handling advanced functions and complex query structures.
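The summary does not say how the accuracy figures above are computed. As a rough illustration only, the sketch below assumes execution accuracy, a common all-or-nothing convention in Text-to-SQL evaluation: a predicted query counts as correct only if it returns the same rows as the gold query. The `orders` table, the example queries, and the helper names `execution_match` and `execution_accuracy` are hypothetical and are not taken from BEAVER.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries yield the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A predicted query that fails to run is scored as wrong.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as multisets so that row order does not matter.
    return Counter(predicted_rows) == Counter(gold_rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)


# Hypothetical toy database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'east', 10.0), (2, 'west', 20.0), (3, 'east', 5.0);
    """
)

pairs = [
    # Same result, different surface form: counted as correct.
    (
        "SELECT region, SUM(amount) FROM orders GROUP BY region",
        "SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY region",
    ),
    # Nearly right (one extra filter): counted as fully wrong.
    (
        "SELECT COUNT(*) FROM orders WHERE region = 'east'",
        "SELECT COUNT(*) FROM orders",
    ),
    # Does not execute: also counted as fully wrong.
    ("SELECT nonexistent FROM orders", "SELECT id FROM orders"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.3f}")
```

Under this kind of metric, the second and third pairs receive the same score even though one is a near miss and the other does not run at all, which is why a single accuracy number says little about where a model goes wrong.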
Abstract
Existing text-to-SQL benchmarks have largely been constructed from public databases with well-structured schemas and simplistic question-SQL pairs. While large language models (LLMs) excel in these settings, their efficacy in complex private enterprise environments, characterized by intricate schemas, domain knowledge, and analytical user queries involving sophisticated structures and functions, remains unproven. To bridge this gap, we introduce BEAVER, the first text-to-SQL benchmark derived from private data warehouses. It comprises 9128 question-SQL pairs sourced from real-world query logs and 812 tables across 19 diverse domains. Building this benchmark is challenging because (1) enterprise query logs are scarce due to privacy constraints, and (2) existing all-or-nothing evaluation metrics based on accuracy make error diagnosis difficult -- especially when producing a correct query…
