Inferring Capabilities from Task Performance with Bayesian Triangulation
John Burden, Konstantinos Voudouris, Ryan Burnell, Danaja Rutar, Lucy Cheke, José Hernández-Orallo

TL;DR
This paper introduces a Bayesian method to infer the cognitive capabilities of machine learning systems from diverse task performance data, enabling richer evaluation beyond traditional metrics.
Contribution
It presents a novel Bayesian triangulation approach using measurement layouts to infer system capabilities from complex, non-populational data.
Findings
Successfully inferred cognitive profiles for real and synthetic agents.
Demonstrated capability-oriented evaluation in diverse scenarios.
Showcased the method's potential for richer system characterization.
Abstract
As machine learning models become more general, we need to characterise them in richer, more meaningful ways. We describe a method to infer the cognitive profile of a system from diverse experimental data. To do so, we introduce measurement layouts that model how task-instance features interact with system capabilities to affect performance. These features must be triangulated in complex ways to be able to infer capabilities from non-populational data -- a challenge for traditional psychometric and inferential tools. Using the Bayesian probabilistic programming library PyMC, we infer different cognitive profiles for agents in two scenarios: 68 actual contestants in the AnimalAI Olympics and 30 synthetic agents for O-PIAAGETS, an object permanence battery. We showcase the potential for capability-oriented evaluation.
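The idea of a measurement layout can be illustrated with a minimal sketch. It assumes a toy layout with two capabilities (visual acuity, navigation) and two task demands (goal size, goal distance), the pairing the reviews below attribute to the first experiment. The product-of-sigmoids link, the parameter ranges, and all function names are illustrative assumptions, not the authors' model. The paper fits its layouts with PyMC; this sketch substitutes a plain grid approximation of the posterior so that it runs with the Python standard library alone.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def p_success(visual, navigation, size_demand, distance_demand):
    # Assumed non-compensatory link: the agent must both see the goal
    # and reach it, so the two capability margins multiply.
    return sigmoid(visual - size_demand) * sigmoid(navigation - distance_demand)


def simulate(visual, navigation, n_instances, rng):
    # Each instance draws its two demands independently, so a failure can
    # be attributed to one capability or the other across the dataset.
    data = []
    for _ in range(n_instances):
        size_demand = rng.uniform(0.0, 6.0)
        distance_demand = rng.uniform(0.0, 6.0)
        p = p_success(visual, navigation, size_demand, distance_demand)
        data.append((size_demand, distance_demand, rng.random() < p))
    return data


def grid_posterior(data, grid):
    # Uniform prior over the grid; Bernoulli likelihood per task instance.
    log_post = {}
    for v in grid:
        for n in grid:
            ll = 0.0
            for size_demand, distance_demand, success in data:
                p = p_success(v, n, size_demand, distance_demand)
                p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
                ll += math.log(p if success else 1.0 - p)
            log_post[(v, n)] = ll
    peak = max(log_post.values())
    weights = {k: math.exp(ll - peak) for k, ll in log_post.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def posterior_means(post):
    mean_visual = sum(v * w for (v, _), w in post.items())
    mean_navigation = sum(n * w for (_, n), w in post.items())
    return mean_visual, mean_navigation


rng = random.Random(0)
grid = [i * 0.25 for i in range(25)]  # capability values from 0.0 to 6.0
data = simulate(visual=4.5, navigation=1.5, n_instances=500, rng=rng)
post = grid_posterior(data, grid)
est_visual, est_navigation = posterior_means(post)
print(f"visual acuity ~ {est_visual:.2f}, navigation ~ {est_navigation:.2f}")
```

Only binary success per instance is observed, yet because the two demands vary independently across instances, the posterior can separate the two capabilities. If the demands were strongly correlated, the same data would constrain the profile far less, which is the identifiability concern raised in the reviews.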
Peer Reviews
Decision: Submitted to ICLR 2026
1. Applying psychometric ideas to AI evaluation is creative and timely.
2. Three increasingly complex scenarios provide scope for this work.
1. This is your key insight, and it's a real problem. The paper invokes "triangulation" as though it's a general principle, but the method fundamentally requires that task dimensions are sufficiently independent and align with capability structure. The core issue: in Experiment 1, you have 2 task demands (goal size, goal distance) and 2 capabilities (visual acuity, navigation). That's nearly square, which leaves barely enough degrees of freedom. In Experiment 2, you have 12 task dimensions but 7 capa
- The paper is mostly well-written and easy to follow.
- Measurement layouts provide an interesting means to evaluate ML models on static benchmarks (and the experiments demonstrate that).
- The major limitation of this work is that the measurement layout has to be manually designed by humans.
- The population of agents used to fit the measurement layouts seems to have been created in an ad-hoc manner.
Originality -- This is, to my knowledge, the most thorough application of HBNs to 3D RL agents.
Quality -- Leans on strong and established software (PyMC, AAI) inspiring confidence in code correctness. The three experiments span a range of complexity, validating the work in simple situations before taking the work to address complex real-world data. The inclusion of human performance is an added bonus. The paper applies HBNs to a battery of environments, with many tasks, agents, and roll-outs.
One of the stated advantages of this approach is the possibility to introduce domain knowledge. However, the "bitter lesson" suggests that this approach could be outperformed by simpler approaches combined with scaled compute/data. Although this Bayesian approach is arguably more principled, taking expert domain knowledge into account, in my opinion it needs a comparison to a baseline of a simple exploratory / confirmatory GLM / PCA such as those mentioned in the Related Work. Minor points: in
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Bayesian Modeling and Causal Inference · Decision-Making and Behavioral Economics
