
TL;DR
This paper introduces Soft Tournament Equilibrium (STE), a differentiable framework that evaluates AI agents by computing set-valued tournament solutions from pairwise comparisons, so that non-transitive (cyclic) interactions are not forced into a linear ranking.
Contribution
STE is a novel, differentiable method that computes set-valued tournament solutions directly from pairwise comparison data, with theoretical guarantees and practical evaluation.
Findings
STE accurately recovers classical solutions in the zero-temperature limit.
It demonstrates stability and consistency on cyclic and preference-based benchmarks.
The method provides calibrated membership scores for core agents.
Abstract
The evaluation of general-purpose artificial agents, particularly those based on LLMs, presents a significant challenge due to the non-transitive nature of their interactions. When agent A defeats B, B defeats C, and C defeats A, traditional ranking methods that force a linear ordering can be misleading and unstable. We argue that for such cyclic domains, the fundamental object of evaluation should not be a ranking alone but a set-valued core, as conceptualized in classical tournament theory. This paper introduces Soft Tournament Equilibrium (STE), a differentiable framework for learning and computing set-valued tournament solutions directly from pairwise comparison data. STE first learns a probabilistic tournament model, potentially conditioned on rich contextual information. It then employs differentiable operators for soft reachability and soft covering to compute continuous…
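To make the set-valued idea concrete, here is a minimal sketch of two classical tournament solutions (the top cycle and the uncovered set) on a four-agent example in which A, B, and C form a cycle and all beat D. The last function is only an illustrative relaxation: the sigmoid edge softening, the max-product path composition, the function names, and the temperature parameter `tau` are assumptions made for this example, not the operators defined in the paper.

```python
import numpy as np

def top_cycle(wins):
    """Agents that reach every other agent along a chain of wins."""
    n = len(wins)
    reach = wins.astype(bool) | np.eye(n, dtype=bool)
    for k in range(n):  # Warshall transitive closure
        reach |= reach[:, [k]] & reach[[k], :]
    return {i for i in range(n) if reach[i].all()}

def uncovered_set(wins):
    """Agents not covered by any other agent.

    i covers j when i beats j and i also beats everything j beats.
    """
    w = wins.astype(bool)
    n = len(w)

    def covers(i, j):
        return w[i, j] and all(w[i, k] for k in range(n) if w[j, k])

    return {j for j in range(n)
            if not any(covers(i, j) for i in range(n) if i != j)}

def soft_top_cycle_scores(p, tau):
    """Hypothetical soft membership scores in [0, 1] (illustration only).

    p[i, j] is the estimated probability that i beats j.
    Smaller tau pushes the soft edges toward hard 0/1 wins.
    """
    n = len(p)
    edge = 1.0 / (1.0 + np.exp(-(p - 0.5) / tau))  # soft "i beats j"
    np.fill_diagonal(edge, 1.0)                    # each agent reaches itself
    reach = edge.copy()
    for _ in range(n):
        # Best path strength i -> k -> j under max-product composition.
        via = (reach[:, :, None] * edge[None, :, :]).max(axis=1)
        reach = np.maximum(reach, via)
    return reach.min(axis=1)  # weakest link to any other agent

# Agents A, B, C form a cycle (A > B > C > A); all three beat D.
P = np.array([
    [0.5, 0.9, 0.1, 0.9],
    [0.1, 0.5, 0.9, 0.9],
    [0.9, 0.1, 0.5, 0.9],
    [0.1, 0.1, 0.1, 0.5],
])
wins = P > 0.5

print(top_cycle(wins))
print(uncovered_set(wins))
print(soft_top_cycle_scores(P, tau=0.05))
```

In this example both classical solutions return the cycle members and exclude D, and the soft scores approach the same 0/1 membership pattern as `tau` shrinks, which mirrors the zero-temperature behavior described under Findings.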
