On the Flakiness of LLM-Generated Tests for Industrial and Open-Source Database Management Systems
Alexander Berndt, Thomas Bach, Rainer Gemulla, Marcus Kessel, Sebastian Baltes

TL;DR
This study investigates the prevalence and causes of flakiness in tests generated by large language models for database systems, revealing that such tests often inherit flakiness from existing tests and emphasizing the need for tailored context in LLM prompts.
Contribution
It provides the first comprehensive analysis of LLM-generated test flakiness in database systems, identifying common root causes and differences between open-source and closed-source systems.
Findings
Generated tests have a slightly higher proportion of flaky tests than existing tests.
Most flakiness is caused by reliance on unordered collections.
Flakiness transfer from existing to generated tests is common, especially in closed-source systems.
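To illustrate the unordered-collection finding, here is a minimal sketch that assumes the unordered collection is a SQL result set queried without ORDER BY. That is one plausible instance and is not taken from the paper. The `users` table, the `fetch_names` helper, and the data are hypothetical. SQL leaves row order unspecified without ORDER BY, so a test that compares such a result against an ordered list can pass or fail depending on the execution plan.

```python
import sqlite3
from collections import Counter

def fetch_names(conn, order_by=False):
    """Return all names; row order is only defined when order_by=True."""
    sql = "SELECT name FROM users"  # hypothetical table for illustration
    if order_by:
        sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("carol",), ("alice",), ("bob",)],
)

expected = ["alice", "bob", "carol"]

# Fragile pattern (not executed here): comparing an unordered result to an
# ordered list relies on a row order the database does not guarantee.
#   assert fetch_names(conn) == expected

# Robust option 1: make the order part of the query.
ordered = fetch_names(conn, order_by=True)

# Robust option 2: compare as multisets, ignoring order but keeping duplicates.
unordered_match = Counter(fetch_names(conn)) == Counter(expected)
```

Either option removes the dependence on row order. The multiset comparison is useful when the test should not constrain the query itself.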
Abstract
Flaky tests are a common problem in software testing. They produce inconsistent results when executed multiple times on the same code, invalidating the assumption that a test failure indicates a software defect. Recent work on LLM-based test generation has identified flakiness as a potential problem with generated tests. However, its prevalence and underlying causes are unclear. We examined the flakiness of LLM-generated tests in the context of four relational database management systems: SAP HANA, DuckDB, MySQL, and SQLite. We amplified test suites with two LLMs, GPT-4o and Mistral-Large-Instruct-2407, to assess the flakiness of the generated test cases. Our results suggest that generated tests have a slightly higher proportion of flaky tests compared to existing tests. Based on a manual inspection, we found that the most common root cause of flakiness was the reliance of a test on a…
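The definition above (inconsistent results across repeated executions on the same code) can be sketched as a simple rerun check. This is an illustrative assumption, not the procedure used in the study. `classify` and `make_intermittent` are hypothetical helpers, and the intermittent test is simulated with a call counter so the example is deterministic.

```python
from typing import Callable

def classify(test: Callable[[], bool], runs: int = 10) -> str:
    """Run a test repeatedly on unchanged code and label the observed outcomes."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    outcomes = {bool(test()) for _ in range(runs)}
    if outcomes == {True}:
        return "pass"
    if outcomes == {False}:
        return "fail"
    return "flaky"  # both passing and failing runs were observed

def make_intermittent(period: int) -> Callable[[], bool]:
    """Simulated test that fails on every `period`-th call (hypothetical)."""
    calls = {"n": 0}

    def test() -> bool:
        calls["n"] += 1
        return calls["n"] % period != 0

    return test

stable_result = classify(lambda: True)
broken_result = classify(lambda: False)
flaky_result = classify(make_intermittent(3), runs=10)
```

Rerunning can only confirm flakiness, never rule it out: a test that passes in every observed run may still fail in a later one.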
Taxonomy
Topics: Software Testing and Debugging Techniques · Software System Performance and Reliability · Software Engineering Research
