Effective Harness Engineering for Algorithm Discovery with Coding Agents
Yoichi Ishibashi, Taro Yano, Masafumi Oyamada

TL;DR
This paper examines how the design of the execution harness affects algorithm discovery with LLMs and evolutionary search, and finds that deeper thinking per algorithm matters more than the number of algorithms generated.
Contribution
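It introduces Vesper, an algorithm discovery framework whose harness improvements address the depth-versus-quantity trade-off under a fixed token budget, the handling of evaluation hacks, and safe parallel execution of agents that need full filesystem access.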
Findings
Fewer, deeper algorithms outperform many shallow ones within the same token budget.
Higher-capability models tend to generate more evaluation hacks, necessitating better detection.
Deeper thinking per algorithm is more cost-effective than increasing the number of algorithms.
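The trade-off behind these findings is simple arithmetic: under a fixed token budget, every extra token spent thinking about one algorithm reduces how many algorithms can be produced. The sketch below only illustrates that relationship. The function name and the budget figures are hypothetical and are not taken from the paper.

```python
def candidates_under_budget(total_tokens: int, tokens_per_algorithm: int) -> int:
    """Number of algorithms that fit in a fixed token budget.

    Both arguments are illustrative quantities, not values from the paper.
    """
    if tokens_per_algorithm <= 0:
        raise ValueError("tokens_per_algorithm must be positive")
    return total_tokens // tokens_per_algorithm


# Hypothetical budget of one million tokens.
budget = 1_000_000
shallow = candidates_under_budget(budget, 10_000)   # many algorithms, brief thought
deep = candidates_under_budget(budget, 100_000)     # few algorithms, deeper thought
print(shallow, deep)
```

The paper's claim is that the second regime (fewer, deeper candidates) tends to win at equal budget; the code shows only the budget accounting, not that result.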
Abstract
AlphaEvolve and FunSearch have demonstrated the potential of combining large language models (LLMs) with evolutionary search for automated algorithm discovery. However, discovery success is shaped not only by model capability but also significantly by the design of the execution infrastructure, i.e., the harness. This paper investigates effective harness design through three questions: under a fixed token budget, is it better to produce many algorithms with brief thought or fewer algorithms with deeper thought? How should the harness handle evaluation hacks, where generated programs exploit the scoring function? And how can agents that require full filesystem access execute safely in parallel? Using Vesper, an algorithm discovery framework that incorporates harness improvements addressing these questions, we evaluate on Circle Packing under the same token budget. Interestingly,…
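One way a harness can guard against evaluation hacks is to re-validate each candidate's output independently of the program that produced it. The sketch below does this for a circle packing task, assuming the common formulation in which non-overlapping circles must lie inside the unit square and the score is the sum of radii. That formulation, the function name `validate_packing`, and the tolerance are assumptions for illustration; this is not Vesper's actual detection mechanism.

```python
import math
from itertools import combinations


def validate_packing(circles, tol=1e-9):
    """Independently check a packing given as (x, y, r) tuples.

    Returns (is_valid, score), where score is the sum of radii for a valid
    packing and 0.0 otherwise. Assumes the container is the unit square.
    """
    for x, y, r in circles:
        # Reject NaN/inf values and non-positive radii, which can trick naive scorers.
        if not all(math.isfinite(v) for v in (x, y, r)) or r <= 0:
            return False, 0.0
        # Each circle must lie entirely inside [0, 1] x [0, 1].
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return False, 0.0
    # No two circles may overlap (tangency is allowed within tol).
    for (x1, y1, r1), (x2, y2, r2) in combinations(circles, 2):
        if math.hypot(x1 - x2, y1 - y2) < r1 + r2 - tol:
            return False, 0.0
    return True, sum(r for _, _, r in circles)


ok, score = validate_packing([(0.25, 0.25, 0.25), (0.75, 0.75, 0.25)])
print(ok, score)
```

Recomputing the score from the raw geometry, rather than trusting a value reported by the generated program, closes off the simplest class of hacks, though it cannot catch every exploit of a scoring function.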
