EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution
Tianshu Zhang, Kun Qian, Siddhartha Sahai, Yuan Tian, Shaddy Garg, Huan Sun, Yunyao Li

TL;DR
EvoSchema is a comprehensive benchmark designed to evaluate and improve the robustness of neural text-to-SQL models against real-world schema evolution, addressing a critical challenge in dynamic database environments.
Contribution
The paper introduces EvoSchema, a novel benchmark with a schema evolution taxonomy, and demonstrates how training on diverse schema perturbations enhances model robustness.
Findings
Table-level perturbations significantly impact model performance.
Models trained on EvoSchema data show improved robustness.
EvoSchema provides insights into model behavior under schema changes.
Abstract
Neural text-to-SQL models, which translate natural language questions (NLQs) into SQL queries given a database schema, have achieved remarkable performance. However, database schemas frequently evolve to meet new requirements. Such schema evolution often leads to performance degradation for models trained on static schemas. Existing work either mainly focuses on simply paraphrasing some syntactic or semantic mappings among NLQ, DB and SQL, or lacks a comprehensive and controllable way to investigate the model robustness issue under the schema evolution, which is insufficient when facing the increasingly complex and rich database schema changes in reality, especially in the LLM era. To address the challenges posed by schema evolution, we present EvoSchema, a comprehensive benchmark designed to assess and enhance the robustness of text-to-SQL systems under real-world schema changes.…
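
The schema changes this page refers to (adding, removing, and renaming columns; adding, removing, renaming, splitting, and merging tables) can be illustrated with a minimal sketch. This is not the authors' benchmark-generation code: the dictionary representation of a schema, the function names, and the `patient` example table are all hypothetical and chosen only for illustration.

```python
from copy import deepcopy
from typing import Dict, List

# Hypothetical toy representation: table name -> ordered list of column names.
Schema = Dict[str, List[str]]


def add_column(schema: Schema, table: str, column: str) -> Schema:
    """Column-level perturbation: append a new column to a table."""
    new = deepcopy(schema)
    new[table].append(column)
    return new


def remove_column(schema: Schema, table: str, column: str) -> Schema:
    """Column-level perturbation: drop a column from a table."""
    new = deepcopy(schema)
    new[table].remove(column)
    return new


def rename_column(schema: Schema, table: str, old: str, new_name: str) -> Schema:
    """Column-level perturbation: rename a column in place."""
    new = deepcopy(schema)
    new[table] = [new_name if c == old else c for c in new[table]]
    return new


def rename_table(schema: Schema, old: str, new_name: str) -> Schema:
    """Table-level perturbation: rename a table, keeping table order."""
    return {(new_name if t == old else t): list(cols) for t, cols in schema.items()}


def split_table(schema: Schema, table: str, key: str,
                moved: List[str], new_table: str) -> Schema:
    """Table-level perturbation: move `moved` columns into `new_table`,
    duplicating `key` there so the two tables can still be joined."""
    new = deepcopy(schema)
    new[table] = [c for c in schema[table] if c not in moved]
    new[new_table] = [key] + [c for c in schema[table] if c in moved]
    return new


def merge_tables(schema: Schema, left: str, right: str, merged: str) -> Schema:
    """Table-level perturbation: combine two tables into one,
    keeping each shared column (such as the join key) only once."""
    new = {t: list(c) for t, c in schema.items() if t not in (left, right)}
    new[merged] = list(schema[left]) + [c for c in schema[right] if c not in schema[left]]
    return new


# Hypothetical example schema.
schema = {"patient": ["id", "name", "phone", "address"]}
split = split_table(schema, "patient", "id", ["phone", "address"], "patient_contact")
merged = merge_tables(split, "patient", "patient_contact", "patient")
```

In this sketch, splitting and then merging the same table restores the original schema. A query written against the original `patient` table would need a join on `id` after the split, which is the kind of SQL change a text-to-SQL model has to track when the schema evolves.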
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
For originality: The paper introduces a new approach to enhance the performance of text-to-SQL models. It also defines in detail the types of structural transformations used to standardize test cases and study the stability of text-to-SQL models: adding, removing, and renaming at the column level, and adding, removing, renaming, splitting, and merging at the table level. This lays the foundation for further research. For quality: Creating augmented data by combining heuristics and GPT models ensures that
The paper has a major flaw in the benchmark datasets and experiments: 1. The method is benchmarked only on the dataset prepared by the authors themselves; no other dataset is used to benchmark the data augmentation task in the paper. More external benchmark datasets are needed. 2. The paper does not explore further fine-tuning methods to improve model performance. In addition, comparisons with closed-source models such as GPT models are still limited. For example, prompting methods for GP
1. The introduction and related work sections are well organized. It clearly sketches the existing problem, the proposed argument, and related work for text-to-SQL. 2. The topic of this paper is interesting and practical since real-world databases are always evolving. 3. Intensive experiments have been conducted to verify the performance of the proposed model.
1. The method of dividing the training and testing sets is unclear in this paper. 2. How to get the new training and testing data with evolved databases is not clear. Do all the training and testing data share the same schema structure after the database evolves? If so, how can data leakage be avoided? If not, does it mean that there is only one change of database schema for each instance, and different instances share different structures of database schema? 3. Cost and Efficiency Analysis.
1. Focused on a critical problem with LLMs and overall tackled it well. 2. This is a comprehensive framework for studying and improving robustness of Text to SQL against schema updates/changes. 3. Proposed metrics are robust and can serve as benchmark/baseline for further improvements in this specific area. 4. Training paradigm is diverse and well designed to assess the perturbations correctly. 5. Well articulated paper overall and easy to read. 6. Defined metrics are intuitive and appropriate f
1. First of all, the scope of the work is narrow, i.e., it focuses on only one type of robustness challenge in text-to-SQL. 2. The paper does not offer novelty or originality, nor does it focus on any strong theoretical foundations behind robustness issues in LLMs or other ML models. 3. There are several other works that tackle robustness, including specific papers such as ADVETA that address the robustness of SQL generation under schema changes. This work neither compares with them nor goes into detail about why thi
1. This paper addresses a critical problem: enhancing model robustness in the face of data evolution. In the context of Text-to-SQL, data evolution specifically refers to schema changes. 2. The paper develops a taxonomy for schema evolution, mapping each type of perturbation to a real-world scenario. 3. The paper presents a thorough evaluation across a wide range of open-source models as well as state-of-the-art commercial models.
1. The benchmark generation section is too brief and lacks critical details, which weakens the validity of the paper's results. My main question here is: how do you ensure that adding tables or columns does not create alternative correct SQL answers? Adding a table can introduce alternative join paths or semantically similar columns, which could also be used to answer the NLQ accurately. For example, in Figure 2, when adding an "appointment" table, if this table contains fields such as the patie
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Database Systems and Queries
