Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness

Wendkûuni C. Ouédraogo; Yinghua Li; Xueqi Dang; Xin Zhou; Anil Koyuncu; Jacques Klein; David Lo; Tegawendé F. Bissyandé

arXiv:2506.06767 · cs.SE · October 21, 2025

TL;DR

This paper introduces CTSES, a composite evaluation metric for LLM-generated test refactorings that balances semantic similarity, readability, and structural alignment, aiming to better reflect human judgment.

Contribution

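The paper proposes CTSES, a composite metric that combines CodeBLEU, METEOR, and ROUGE-L to evaluate LLM-based test refactorings. It targets two limitations of the individual metrics: CodeBLEU penalizes beneficial renamings and edits, and semantic similarity measures overlook readability and modularity.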

Findings

1. CTSES reduces false negatives compared to individual metrics.

2. CTSES provides more interpretable signals, better aligned with developer judgments.

3. An evaluation on 5,000+ refactorings from Defects4J and SF110, produced by GPT-4o and Mistral-Large, supports the composite approach as a proof of concept.

Abstract

Large Language Models (LLMs) are increasingly used to refactor unit tests, improving readability and structure while preserving behavior. Evaluating such refactorings, however, remains difficult: metrics like CodeBLEU penalize beneficial renamings and edits, while semantic similarities overlook readability and modularity. We propose CTSES, a first step toward human-aligned evaluation of refactored tests. CTSES combines CodeBLEU, METEOR, and ROUGE-L into a composite score that balances semantics, lexical clarity, and structural alignment. Evaluated on 5,000+ refactorings from Defects4J and SF110 (GPT-4o and Mistral-Large), CTSES reduces false negatives and provides more interpretable signals than individual metrics. Our emerging results illustrate that CTSES offers a proof-of-concept for composite approaches, showing their promise in bridging automated metrics and developer judgments.
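The idea of a composite score can be illustrated with a minimal sketch. This listing does not state how CTSES weights its three components, so the weighted average below, the equal default weights, and the names `MetricScores` and `composite_score` are all assumptions made for illustration, not the authors' formula. The three input scores are assumed to be precomputed and normalized to the range [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricScores:
    """Precomputed similarity scores for one refactored test, each in [0, 1]."""
    codebleu: float  # structural and syntactic alignment
    meteor: float    # semantic overlap
    rouge_l: float   # longest-common-subsequence overlap


def composite_score(scores, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of the three metrics.

    Equal weights are a placeholder assumption; the actual CTSES
    weighting is not given in this listing.
    """
    values = (scores.codebleu, scores.meteor, scores.rouge_l)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("each metric score must lie in [0, 1]")
    if len(weights) != 3 or any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be three non-negative numbers with a positive sum")
    total = sum(weights)
    # Normalize by the weight total so the result stays in [0, 1].
    return sum(w * v for w, v in zip(weights, values)) / total


if __name__ == "__main__":
    # Made-up scores: a refactoring that renames identifiers may score
    # low on CodeBLEU while the other two metrics remain high.
    renamed = MetricScores(codebleu=0.3, meteor=0.6, rouge_l=0.9)
    print(round(composite_score(renamed), 3))
    print(round(composite_score(renamed, weights=(0.5, 0.3, 0.2)), 3))
```

With these made-up inputs, a refactoring that a CodeBLEU-only threshold might reject receives a higher combined score, which is the kind of false negative the abstract says a composite metric is meant to reduce.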

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Topic Modeling