Active Evaluation of General Agents: Problem Definition and Comparison of Baseline Algorithms

Marc Lanctot; Kate Larson; Ian Gemp; Michael Kaisers

arXiv:2601.07651 · cs.AI · February 12, 2026

Open Access

TL;DR

This paper introduces a formal framework for active, online evaluation of general agents across multiple tasks. It compares baseline ranking algorithms, including Elo and Soft Condorcet Optimization, by how their rankings perform as the number of evaluation samples grows.

Contribution

The paper proposes an online evaluation framework in which the ranking algorithm adaptively chooses which tasks and agents to sample, and it compares baseline ranking algorithms in this setting.

Findings

01. The Elo rating system is a reliable choice for reducing ranking error in practice.

02. Soft Condorcet Optimization outperforms Elo on real Atari data.

03. Selecting tasks based on proportional representation improves ranking accuracy.

Abstract

As intelligent agents become more generally capable, i.e., able to master a wide variety of tasks, the complexity and cost of properly evaluating them rise significantly. Tasks that assess specific capabilities of the agents can be correlated and stochastic, so accurate comparisons require many samples, which adds cost. In this paper, we propose a formal definition and a conceptual framework for active evaluation of agents across multiple tasks, which assesses the performance of ranking algorithms as a function of the number of evaluation data samples. Rather than curating, filtering, or compressing existing data sets as a preprocessing step, we propose an online framing: on every iteration, the ranking algorithm chooses the task and agents to sample scores from. Then, evaluation algorithms report a ranking of agents on each iteration and their performance is assessed with respect…
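The online framing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Several pieces are assumptions made only for illustration: the synthetic Gaussian score model, the hypothetical per-task offsets, uniform random choice of task and agent pair, the Elo K-factor of 16, and Kendall tau distance to a known ground-truth ranking as the error measure.

```python
import random


def elo_update(ratings, winner, loser, k=16.0):
    """Apply one standard Elo update in place after `winner` beats `loser`."""
    expected = 1.0 / (1.0 + 10.0 ** ((ratings[loser] - ratings[winner]) / 400.0))
    delta = k * (1.0 - expected)
    ratings[winner] += delta
    ratings[loser] -= delta


def kendall_tau_distance(rank_a, rank_b):
    """Count agent pairs that the two rankings order differently."""
    pos_b = {agent: i for i, agent in enumerate(rank_b)}
    return sum(
        1
        for i in range(len(rank_a))
        for j in range(i + 1, len(rank_a))
        if pos_b[rank_a[i]] > pos_b[rank_a[j]]
    )


def active_evaluation(skill, num_tasks=3, iterations=200, noise=1.0, seed=0):
    """Run a toy active-evaluation loop and record ranking error per iteration.

    `skill` is an assumed ground-truth skill per agent (illustrative only).
    """
    rng = random.Random(seed)
    agents = list(range(len(skill)))
    truth = sorted(agents, key=lambda a: -skill[a])
    # Hypothetical per-task, per-agent offsets so that tasks disagree slightly.
    offset = [[rng.gauss(0.0, 0.5) for _ in agents] for _ in range(num_tasks)]
    ratings = [0.0] * len(agents)
    errors = []
    for _ in range(iterations):
        # Selection step: here uniform at random, an assumed placeholder policy.
        task = rng.randrange(num_tasks)
        a, b = rng.sample(agents, 2)
        # Sample one noisy score per chosen agent on the chosen task.
        score_a = skill[a] + offset[task][a] + rng.gauss(0.0, noise)
        score_b = skill[b] + offset[task][b] + rng.gauss(0.0, noise)
        if score_a != score_b:
            winner, loser = (a, b) if score_a > score_b else (b, a)
            elo_update(ratings, winner, loser)
        # Report a ranking on every iteration and score it against the truth.
        ranking = sorted(agents, key=lambda x: -ratings[x])
        errors.append(kendall_tau_distance(ranking, truth))
    return ratings, errors
```

Calling `active_evaluation([3.0, 2.0, 1.0, 0.0])` returns the final ratings and one error value per iteration, so the error can be plotted against the number of samples. Replacing the uniform selection step with another policy is the natural place to experiment with task selection.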

Taxonomy

Topics: Artificial Intelligence in Games · Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research