On the Flakiness of LLM-Generated Tests for Industrial and Open-Source Database Management Systems

Alexander Berndt; Thomas Bach; Rainer Gemulla; Marcus Kessel; Sebastian Baltes

arXiv:2601.08998·cs.SE·January 15, 2026

Open Access

TL;DR

This study investigates the prevalence and causes of flakiness in tests generated by large language models for database systems, revealing that such tests often inherit flakiness from existing tests and emphasizing the need for tailored context in LLM prompts.

Contribution

The paper provides the first comprehensive analysis of flakiness in LLM-generated tests for database systems, identifying common root causes and differences between open-source and closed-source systems.

Findings

01

Generated tests have a slightly higher proportion of flaky tests than existing tests.

02

Most flakiness is caused by reliance on unordered collections.

03

Flakiness transfer from existing to generated tests is common, especially in closed-source systems.
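
To illustrate finding 02, here is a minimal sketch of how a database test can come to depend on the order of an unordered result. It is not taken from the paper: the table `t`, the helper names `fetch_names` and `same_rows_any_order`, and the use of Python's built-in `sqlite3` module are assumptions chosen for illustration. SQL does not guarantee row order without an `ORDER BY` clause, so a test that compares such a result against a fixed list may pass or fail depending on the engine's execution plan.

```python
import sqlite3
from collections import Counter


def fetch_names(conn, order_by=False):
    """Return names from the hypothetical table `t`, optionally sorted."""
    query = "SELECT name FROM t" + (" ORDER BY name" if order_by else "")
    return [row[0] for row in conn.execute(query)]


def same_rows_any_order(actual, expected):
    """Compare two result lists as multisets, ignoring row order."""
    return Counter(actual) == Counter(expected)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("b",), ("a",), ("c",)])

unordered = fetch_names(conn)               # row order is not guaranteed by SQL
ordered = fetch_names(conn, order_by=True)  # row order is fixed by the query

# Fragile assertion (depends on an unspecified order):
#   assert unordered == ["b", "a", "c"]

# Two more robust alternatives:
assert same_rows_any_order(unordered, ["a", "b", "c"])
assert ordered == ["a", "b", "c"]
```

Either comparing results as multisets or making the order explicit in the query removes the dependence on an unspecified order; which one fits depends on whether row order is part of the behavior under test.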

Abstract

Flaky tests are a common problem in software testing. They produce inconsistent results when executed multiple times on the same code, invalidating the assumption that a test failure indicates a software defect. Recent work on LLM-based test generation has identified flakiness as a potential problem with generated tests. However, its prevalence and underlying causes are unclear. We examined the flakiness of LLM-generated tests in the context of four relational database management systems: SAP HANA, DuckDB, MySQL, and SQLite. We amplified test suites with two LLMs, GPT-4o and Mistral-Large-Instruct-2407, to assess the flakiness of the generated test cases. Our results suggest that generated tests have a slightly higher proportion of flaky tests compared to existing tests. Based on a manual inspection, we found that the most common root cause of flakiness was the reliance of a test on a…

Taxonomy

Topics: Software Testing and Debugging Techniques · Software System Performance and Reliability · Software Engineering Research