LLM NL2SQL Robustness: Surface Noise vs. Linguistic Variation in Traditional and Agentic Settings
Lifu Tu, Rongguang Wang, Tao Sheng, Sujjith Ravi, Dan Roth

TL;DR
This paper evaluates how robustly large language models translate natural language to SQL under roughly ten types of perturbation, in both traditional and agentic settings. The models hold up under several perturbations but degrade under surface-level noise and semantics-preserving linguistic variation.
Contribution
Introduces a robustness benchmark for NL2SQL systems covering approximately ten perturbation types, and compares multiple state-of-the-art LLMs in both traditional and agentic settings.
Findings
Models are robust against some perturbations but struggle with surface noise and linguistic variation.
Surface noise impacts traditional pipelines more significantly.
Linguistic variation challenges agentic NL2SQL systems.
Abstract
Robustness evaluation for Natural Language to SQL (NL2SQL) systems is essential because real-world database environments are dynamic, noisy, and continuously evolving, whereas conventional benchmark evaluations typically assume static schemas and well-formed user inputs. In this work, we introduce a robustness evaluation benchmark containing approximately ten types of perturbations and conduct evaluations under both traditional and agentic settings. We assess multiple state-of-the-art large language models (LLMs), including Grok-4.1, Gemini-3-Pro, Claude-Opus-4.6, and GPT-5.2. Our results show that these models generally maintain strong performance under several perturbations; however, notable performance degradation is observed for surface-level noise (e.g., character-level corruption) and linguistic variation that preserves semantics while altering lexical or syntactic forms.…
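To make "surface-level noise (e.g., character-level corruption)" concrete, here is a minimal sketch of how such a perturbation and the resulting accuracy drop could be computed. The abstract does not describe the benchmark's actual perturbation procedure, so the function names (`corrupt_characters`, `accuracy_drop`), the three edit operations, and the `rate` parameter are all illustrative assumptions, not the authors' method.

```python
import random
import string
from typing import Sequence


def corrupt_characters(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Hypothetical character-level noise: randomly delete, substitute,
    or swap letters. Non-letter characters are never chosen for editing."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(["delete", "substitute", "swap"])
            if op == "delete":
                i += 1  # drop this character
                continue
            if op == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], ch])  # transpose with the next character
                i += 2
                continue
        out.append(ch)  # unchanged (or a swap requested on the last character)
        i += 1
    return "".join(out)


def accuracy_drop(clean_correct: Sequence[bool], noisy_correct: Sequence[bool]) -> float:
    """Absolute drop in accuracy between clean and perturbed questions.
    Each sequence holds one correctness flag per benchmark example."""
    if not clean_correct or len(clean_correct) != len(noisy_correct):
        raise ValueError("inputs must be non-empty and of equal length")
    clean_acc = sum(clean_correct) / len(clean_correct)
    noisy_acc = sum(noisy_correct) / len(noisy_correct)
    return clean_acc - noisy_acc
```

In use, one would perturb each question with `corrupt_characters`, obtain SQL from the model for both the clean and the perturbed version, and pass the per-example correctness flags to `accuracy_drop`. How correctness is judged (for example, by execution match) is left open here, since the abstract does not specify it.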
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification
