Evaluating the Text-to-SQL Capabilities of Large Language Models
Nitarshan Rajkumar, Raymond Li, Dzmitry Bahdanau

TL;DR
This paper empirically evaluates the Text-to-SQL abilities of the Codex language model, showing strong zero-shot performance and improvements with few-shot prompting across multiple benchmarks.
Contribution
It demonstrates that Codex, without fine-tuning, is a competitive baseline for Text-to-SQL tasks and highlights the effectiveness of few-shot prompting in improving performance.
Findings
Without any finetuning, Codex is a strong baseline on the Spider benchmark.
Given a small number of in-domain examples in the prompt, Codex outperforms state-of-the-art models finetuned on those few-shot examples on GeoQuery and Scholar.
The paper also analyzes the failure modes of Codex on Spider.
Abstract
We perform an empirical evaluation of Text-to-SQL capabilities of the Codex language model. We find that, without any finetuning, Codex is a strong baseline on the Spider benchmark; we also analyze the failure modes of Codex in this setting. Furthermore, we demonstrate on the GeoQuery and Scholar benchmarks that a small number of in-domain examples provided in the prompt enables Codex to perform better than state-of-the-art models finetuned on such few-shot examples.
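To make the difference between the zero-shot and few-shot settings concrete, here is a minimal sketch of how such prompts might be assembled. The prompt layout (schema first, then solved question/SQL pairs as SQL comments, then the new question followed by a leading `SELECT`), the `Example` class, the `build_prompt` helper, and the toy `state` table are all illustrative assumptions, not the exact format or data used in the paper. The sketch only builds the prompt string; sending it to a model is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """A solved in-domain example: a question and its SQL (hypothetical type)."""
    question: str
    sql: str


def build_prompt(schema: str, examples: List[Example], question: str) -> str:
    """Assemble a prompt: schema, then solved examples, then the new question.

    An empty `examples` list gives a zero-shot prompt; a non-empty list
    gives a few-shot prompt. The layout is an assumption for illustration.
    """
    parts = [schema.strip(), ""]
    for ex in examples:
        parts.append(f"-- Question: {ex.question}")
        parts.append(ex.sql.strip())
        parts.append("")
    parts.append(f"-- Question: {question}")
    # End with the start of a query so the model continues it.
    parts.append("SELECT")
    return "\n".join(parts)


# Toy schema, invented for this sketch.
schema = """
CREATE TABLE state (
    state_name TEXT PRIMARY KEY,
    population INTEGER,
    area REAL
);
"""

few_shot = [
    Example("How many states are there?", "SELECT COUNT(*) FROM state;"),
]

question = "Which state has the largest area?"
zero_shot_prompt = build_prompt(schema, [], question)
few_shot_prompt = build_prompt(schema, few_shot, question)
print(few_shot_prompt)
```

The only difference between the two prompts is the block of solved examples, which is what the few-shot experiments on GeoQuery and Scholar vary.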
Taxonomy
Topics: Topic Modeling · Semantic Web and Ontologies · Natural Language Processing Techniques
