VeRO: An Evaluation Harness for Agents to Optimize Agents

Varun Ursekar; Apaar Shanker; Veronica Chatrath; Yuan Xue; Sam Denton

arXiv:2602.22480·cs.AI·May 12, 2026

TL;DR

VERO is an evaluation harness and benchmark suite for measuring how well coding agents perform agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles.

Contribution

The paper introduces VERO, a reproducible evaluation harness and benchmark suite for agent optimization, enabling systematic analysis of agent performance improvements.

Findings

01. VERO facilitates structured evaluation of agent optimization strategies.

02. Empirical results identify modifications that reliably enhance agent performance.

03. The framework supports reproducible research in agent optimization.

Abstract

An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study…
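The edit-execute-evaluate cycle described in the abstract can be illustrated with a small sketch. This is not VERO's API: `Harness`, `Trace`, `optimize`, `propose_edit`, `run`, and `score` are hypothetical names chosen for illustration, the agent is represented as a plain source string, and the budget is assumed to be a cap on the number of full evaluations. The sketch only shows how versioned snapshots, a budget check, and per-task traces might fit together.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Trace:
    """One structured record per (agent version, task) execution."""
    version: int
    task: str
    output: str
    reward: float


@dataclass
class Harness:
    """Hypothetical harness: snapshots every evaluated version, enforces a budget."""
    tasks: List[str]
    score: Callable[[str, str], float]  # (task, output) -> reward
    budget: int                         # assumed: max number of full evaluations
    snapshots: List[str] = field(default_factory=list)
    traces: List[Trace] = field(default_factory=list)
    evaluations_used: int = 0

    def evaluate(self, source: str, run: Callable[[str, str], str]) -> Optional[float]:
        # Refuse to evaluate once the budget is exhausted.
        if self.evaluations_used >= self.budget:
            return None
        self.evaluations_used += 1
        # Store the exact agent source so every score maps to a version.
        self.snapshots.append(source)
        version = len(self.snapshots) - 1
        rewards = []
        for task in self.tasks:
            output = run(source, task)
            reward = self.score(task, output)
            self.traces.append(Trace(version, task, output, reward))
            rewards.append(reward)
        return sum(rewards) / len(rewards) if rewards else 0.0


def optimize(
    harness: Harness,
    initial: str,
    propose_edit: Callable[[str, List[Trace]], str],
    run: Callable[[str, str], str],
) -> Tuple[str, Optional[float]]:
    """Greedy loop: edit the best version, execute it, keep it if the reward improves."""
    best_source = initial
    best_reward = harness.evaluate(initial, run)
    while True:
        candidate = propose_edit(best_source, harness.traces)
        reward = harness.evaluate(candidate, run)
        if reward is None:  # budget exhausted
            break
        if reward > best_reward:
            best_source, best_reward = candidate, reward
    return best_source, best_reward
```

In a real setting `propose_edit` would be the coding agent under evaluation and `run` would execute the target agent, including its LLM calls. The greedy acceptance rule here is an assumption made to keep the sketch short.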

Code & Models

Datasets

GloriaaaM/LLM-Agent-Harness-Survey
dataset · 1.3k downloads
