RM-RF: Reward Model for Run-Free Unit Test Evaluation

Elena Bruches; Daniil Grebenkin; Mikhail Klementev; Vadim Alperovich; Roman Derunets; Dari Baturova; Georgy Mkrtchyan; Oleg Sedukhin; Ivan Bondarenko; Nikolay Bushkov; Stanislav Moiseev

arXiv:2601.13097 · cs.SE · January 21, 2026

TL;DR

RM-RF is a lightweight, run-free reward model that predicts, from source and test code alone, whether generated unit tests compile and run, increase code coverage, and improve the mutation kill rate, enabling faster and cheaper evaluation of generated unit tests.

Contribution

The paper introduces RM-RF, a run-free reward model trained on a multilingual dataset (Java, Python, Go) to evaluate generated unit tests without executing them.

Findings

01

Achieved an average F1 score of 0.69 across the three prediction targets.

02

Substantially lower latency and infrastructure cost than evaluation that compiles and executes each candidate test.

03

Effective across multiple programming languages and model tuning regimes.
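
To make the headline metric concrete, here is a minimal sketch of how an average F1 across three binary targets could be computed. It assumes the average is the unweighted mean of per-target binary F1 scores; the paper may define the average differently. The target names and the toy labels are hypothetical and are not taken from the paper's data.

```python
from typing import Dict, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(per_target: Dict[str, Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Unweighted mean of per-target F1 (an assumption about the averaging)."""
    scores = [f1_score(y_true, y_pred) for y_true, y_pred in per_target.values()]
    return sum(scores) / len(scores)


# Hypothetical toy labels: (ground truth from execution, model prediction).
toy_targets = {
    "runs_ok": ([1, 1, 0, 1], [1, 0, 0, 1]),
    "coverage_increased": ([1, 0, 0, 1], [1, 1, 0, 1]),
    "mutation_improved": ([0, 1, 0, 1], [0, 1, 0, 1]),
}

if __name__ == "__main__":
    for name, (y_true, y_pred) in toy_targets.items():
        print(f"{name}: F1 = {f1_score(y_true, y_pred):.2f}")
    print(f"average F1 = {average_f1(toy_targets):.2f}")
```

Reporting the per-target scores alongside the mean is useful here because a single average can hide a weak target, for example a model that predicts compilation success well but mutation improvement poorly.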

Abstract

We present RM-RF, a lightweight reward model for run-free evaluation of automatically generated unit tests. Instead of repeatedly compiling and executing candidate tests, RM-RF predicts, from source and test code alone, three execution-derived signals: (1) whether the augmented test suite compiles and runs successfully, (2) whether the generated test cases increase code coverage, and (3) whether the generated test cases improve the mutation kill rate. To train and evaluate RM-RF, we assemble a multilingual dataset (Java, Python, Go) of focal files, test files, and candidate test additions labeled by an execution-based pipeline, and we release an associated dataset and methodology for comparative evaluation. We tested multiple model families and tuning regimes (zero-shot, full fine-tuning, and PEFT via LoRA), achieving an average F1 of 0.69 across the three targets. Compared to…
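
As an illustration of the three signals listed in the abstract, the sketch below shows one way an execution-based pipeline could turn measurements into binary labels. It is not the authors' pipeline: the `ExecutionResult` and `Labels` types, their field names, and the rule that a failing suite gets negative coverage and mutation labels are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionResult:
    """Hypothetical measurements from running a test suite once."""
    passed: bool            # suite compiled and all tests ran successfully
    line_coverage: float    # fraction of focal-file lines covered, 0.0 to 1.0
    mutants_killed: int     # number of mutants detected by the suite


@dataclass(frozen=True)
class Labels:
    """The three binary targets a run-free model would be trained to predict."""
    runs_ok: bool
    coverage_increased: bool
    mutation_improved: bool


def label_candidate(baseline: ExecutionResult, augmented: ExecutionResult) -> Labels:
    """Compare the original suite with the suite plus the candidate tests."""
    runs_ok = augmented.passed
    # Assumption: if the augmented suite does not run, the other two
    # signals are labeled negative rather than left undefined.
    coverage_increased = runs_ok and augmented.line_coverage > baseline.line_coverage
    mutation_improved = runs_ok and augmented.mutants_killed > baseline.mutants_killed
    return Labels(runs_ok, coverage_increased, mutation_improved)


if __name__ == "__main__":
    baseline = ExecutionResult(passed=True, line_coverage=0.60, mutants_killed=10)
    augmented = ExecutionResult(passed=True, line_coverage=0.72, mutants_killed=10)
    print(label_candidate(baseline, augmented))
```

A run-free model then replaces the two executions with a single prediction over the focal file, the existing test file, and the candidate addition, which is where the latency and infrastructure savings would come from.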

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software System Performance and Reliability