TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

Elena Bruches; Vadim Alperovich; Dari Baturova; Roman Derunets; Daniil Grebenkin; Georgy Mkrtchyan; Oleg Sedukhin; Mikhail Klementev; Ivan Bondarenko; Nikolay Bushkov; Stanislav Moiseev

arXiv:2601.18241 · cs.SE · January 27, 2026

TL;DR

TAM-Eval is a benchmark for evaluating how well large language models perform automated test suite maintenance (creating, repairing, and updating tests) across multiple programming languages, in a way that reflects real-world workflows.

Contribution

We introduce TAM-Eval, a novel framework and benchmark for assessing LLMs on test maintenance tasks at the test file level with full repository context, a significant step beyond prior function-level evaluations.

Findings

01

State-of-the-art LLMs show limited effectiveness in realistic test maintenance scenarios.

02

Current models yield only marginal improvements in test suite effectiveness.

03

The benchmark covers 1,539 scenarios across Python, Java, and Go.

Abstract

While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM-Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function-level tasks, TAM-Eval operates at the test file level, while maintaining access to full repository context during isolated evaluation, better reflecting real-world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM-Eval supports system-agnostic evaluation of both raw LLMs and agentic workflows, using…

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software System Performance and Reliability