General Agent Evaluation

Elron Bandel; Asaf Yehudai; Lilach Eden; Yehoshua Sagron; Yotam Perlitz; Elad Venezian; Natalia Razinkov; Natan Ergas; Shlomit Shachor Ifergan; Segev Shlomov; Michal Jacovi; Leshem Choshen; Liat Ein-Dor; Yoav Katz; Michal Shmueli-Scheuer

arXiv:2602.22953·cs.AI·May 12, 2026·2 cites

TL;DR

This study systematically evaluates how different agent architectures and backbone models influence performance across diverse, unfamiliar environments using a unified benchmarking framework.

Contribution

The paper introduces a unifying protocol, an evaluation harness, and the first Open General Agent Leaderboard for comparing agent architectures and backbone models.

Findings

01. Performance varies significantly with agent architecture within the same model.

02. Backbone model choice has a greater impact than architecture on overall performance.

03. Open-weight models show 'generality sinks' and are less robust than closed-source models.

Abstract

General-purpose agents perform tasks in unfamiliar environments without domain-specific manual customization. Yet no study has systematically measured how agent architecture shapes performance across heterogeneous protocols and diverse unfamiliar environments. This is the first systematic study, comparing tool-calling, MCP, code-generation, and CLI agents on the same benchmarks with the same models. Two gaps blocked such a study: existing harnesses require per-benchmark wiring or fixed protocol classes (web for BrowserGym, CLI for Harbor), and benchmarks themselves expect human-authored prompts, context, and integration glue. To enable this study, we contribute (1) a unifying protocol that bridges existing benchmark and agent protocols; (2) an evaluation harness that surfaces any benchmark to any general-purpose agent and backbone model; and (3) the first Open General Agent Leaderboard…
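
To make the idea of surfacing one benchmark to several agent architectures concrete, here is a minimal sketch in Python. It is not the paper's protocol or harness: every name in it (`Task`, `ToolCallingAgent`, `CliAgent`, `evaluate`) is hypothetical, and the "agents" are toy stand-ins with no backbone model behind them. It only illustrates the general shape, assuming a benchmark can be reduced to an instruction, a set of callable tools, and a success check.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Task:
    """Hypothetical benchmark task: an instruction, exposed tools, and a success check."""
    instruction: str
    tools: dict[str, Callable[[str], str]]
    check: Callable[[str], bool]  # benchmark-side success criterion


class Agent(Protocol):
    """Anything with a run(task) method counts as an agent in this sketch."""
    def run(self, task: Task) -> str: ...


class ToolCallingAgent:
    """Toy agent that invokes one named tool directly on the instruction."""
    def __init__(self, tool_name: str) -> None:
        self.tool_name = tool_name

    def run(self, task: Task) -> str:
        return task.tools[self.tool_name](task.instruction)


class CliAgent:
    """Toy agent that reaches the same tools through a command-line-style string."""
    def __init__(self, command: str) -> None:
        self.command = command

    def run(self, task: Task) -> str:
        # Split "name arg" into the tool name and its argument.
        name, _, arg = self.command.partition(" ")
        return task.tools[name](arg or task.instruction)


def evaluate(agents: dict[str, Agent], tasks: list[Task]) -> dict[str, float]:
    """Run every agent on every task and return each agent's success rate."""
    scores: dict[str, float] = {}
    for name, agent in agents.items():
        passed = sum(task.check(agent.run(task)) for task in tasks)
        scores[name] = passed / len(tasks)
    return scores


if __name__ == "__main__":
    tasks = [
        Task(
            instruction="abc",
            tools={"upper": str.upper, "rev": lambda s: s[::-1]},
            check=lambda out: out == "ABC",
        )
    ]
    agents = {"tool": ToolCallingAgent("upper"), "cli": CliAgent("rev abc")}
    print(evaluate(agents, tasks))
```

The design point of the sketch is that the benchmark side (`Task`) and the agent side (`Agent`) meet at a single interface, so adding an architecture or a benchmark does not require per-pair wiring. A real harness would additionally need to handle multi-step interaction, model calls, and protocol translation, none of which is modeled here.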
