Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas

TL;DR
This paper evaluates the performance and cost implications of large language model agents in Text-to-Big SQL tasks, introducing new metrics for real-world, large-scale data scenarios.
Contribution
It proposes novel metrics for assessing Text-to-Big SQL systems, emphasizing efficiency, cost, and scalability, and evaluates frontier models using these metrics.
Findings
GPT-4o achieves 7% lower accuracy but 12.16x faster execution.
GPT-5.2 is over twice as cost-effective as Gemini 3 Pro at large scales.
Existing text-to-SQL metrics are insufficient for large-scale data evaluation.
Abstract
Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly. In the real world, Text-to-SQL systems are often embedded with Big Data workflows, such as large-scale data processing or interactive data analytics. We refer to this as "Text-to-Big SQL". However, existing text-to-SQL benchmarks remain narrowly scoped and overlook the cost and performance implications that arise at scale. For instance, translation errors that are minor on small datasets lead to substantial cost and latency overheads as data scales, a relevant issue completely ignored by text-to-SQL metrics. In this paper, we overcome this overlooked challenge by introducing novel and representative metrics for evaluating Text-to-Big SQL. Our study focuses on production-level LLM agents, a database-agnostic system adaptable to diverse user needs. Via an…
