Towards Direct Evaluation of Harness Optimizers via Priority Ranking

Kai Tzu-iunn Ong; Minseok Kang; Dongwook Choi; Junhee Cho; Seungju Kim; Seungwon Lim; Geunha Jang; Minwoo Oh; Bogyung Jeong; Sunghwan Kim; Taeyoon Kwon; Jinyoung Yeo

arXiv:2605.22505 · cs.AI · May 22, 2026

TL;DR

This paper introduces priority ranking, a low-cost method for directly evaluating harness optimizers that predicts their multi-step optimization performance without costly rollouts.

Contribution

The paper proposes priority ranking, an approach for evaluating harness optimizers directly. The approach is validated on a diverse set of scenarios, and its results correlate well with actual optimization success.

Findings

01. Priority ranking correlates with multi-step optimization performance.

02. The method enables cost-effective, step-level evaluation of harness optimizers.

03. A collection of 182 human-verified scenarios supports evaluation across domains.

Abstract

Harness optimization enables automated agent creation by having an optimizer agent iteratively update the harness of target agents. Despite its success, current studies evaluate optimizers solely by observing target agents' performance gains. This indirect end-improvement evaluation neglects optimizers' actions at intermediate steps, which are often erroneous and hinder agent performance. Therefore, it is unclear whether harness optimization is driven by optimizers' informed update actions or simply trial-and-error. This necessitates direct evaluation of harness optimizers. However, evaluating harness optimizers directly is non-trivial and costly due to the lack of oracle harnesses. To address this, we present a simple, low-cost design to directly evaluate them, namely priority ranking. By asking harness optimizers to rank components (e.g., tools) in a given harness by their potential…
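The abstract is cut off before it says how an optimizer's ranking is scored, so the following is only a minimal sketch of one plausible scoring scheme, not the authors' method. It assumes each scenario comes with a reference ordering of harness components and compares the optimizer's ordering against it using Kendall's tau. The function name `kendall_tau`, the component names, and the choice of Kendall's tau are all assumptions made for illustration.

```python
from itertools import combinations


def kendall_tau(predicted, reference):
    """Kendall rank correlation between two orderings of the same items.

    Both arguments are lists of component names, highest priority first.
    Returns a value in [-1, 1]: 1 for identical orderings, -1 for reversed.
    """
    if (
        len(predicted) != len(reference)
        or len(predicted) != len(set(predicted))
        or set(predicted) != set(reference)
    ):
        raise ValueError("rankings must be permutations of the same distinct items")
    if len(predicted) < 2:
        raise ValueError("need at least two items to compare rankings")

    # Position of each component in each ranking.
    pos_pred = {item: i for i, item in enumerate(predicted)}
    pos_ref = {item: i for i, item in enumerate(reference)}

    concordant = 0
    discordant = 0
    for a, b in combinations(predicted, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_pred[a] - pos_pred[b]) * (pos_ref[a] - pos_ref[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical scenario: component names are illustrative, not from the paper.
reference_ranking = ["search_tool", "system_prompt", "memory", "calculator"]
optimizer_ranking = ["system_prompt", "search_tool", "memory", "calculator"]
score = kendall_tau(optimizer_ranking, reference_ranking)
```

In this hypothetical scenario the two rankings disagree on one of six component pairs, so the score is (5 - 1) / 6. A single ranking query like this needs no rollout of the target agent, which is the cost advantage the abstract describes.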

Code & Models

Repositories

k59118/Harness_Optimizer_Evaluation
github

Datasets

LangAGI-Lab/SHOR
dataset · 60 downloads
