Evaluating Multimodal Interactive Agents
Josh Abramson, Arun Ahuja, Federico Carnevale, Petko Georgiev, Alex Goldin, Alden Hung, Jessica Landon, Timothy Lillicrap, Alistair Muldal, Blake Richards, Adam Santoro, Tamara von Glehn, Greg Wayne, Nathaniel Wong, Chen Yan

TL;DR
This paper introduces the Standardised Test Suite (STS), a novel, efficient evaluation method for multimodal interactive agents that uses real interaction scenarios and human annotations to better assess naturalistic human-agent interactions.
Contribution
The paper proposes the STS, a new evaluation framework that leverages real interaction data and offline analysis to improve the assessment of interactive agents.
Findings
STS is faster and more controlled than collecting online human-agent interactions.
STS correlates better with naturalistic interactive evaluation than faster proxy metrics do.
The approach is interpretable and scalable.
Abstract
Creating agents that can interact naturally with humans is a common goal in artificial intelligence (AI) research. However, evaluating these interactions is challenging: collecting online human-agent interactions is slow and expensive, yet faster proxy metrics often do not correlate well with interactive evaluation. In this paper, we assess the merits of these existing evaluation metrics and present a novel approach to evaluation called the Standardised Test Suite (STS). The STS uses behavioural scenarios mined from real human interaction data. Agents see replayed scenario context, receive an instruction, and are then given control to complete the interaction offline. These agent continuations are recorded and sent to human annotators to mark as success or failure, and agents are ranked according to the proportion of continuations in which they succeed. The resulting STS is fast,…
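The ranking step described in the abstract (annotators mark each continuation as success or failure, and agents are ordered by their proportion of successes) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rank_agents`, the tuple layout, and the sample records are all hypothetical, and it assumes exactly one binary annotation per continuation.

```python
from collections import defaultdict


def rank_agents(annotations):
    """Rank agents by the fraction of continuations marked as success.

    `annotations` is an iterable of (agent_name, scenario_id, success)
    tuples, where `success` is a bool. This layout is an assumption made
    for illustration; the paper does not specify a data format.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for agent, _scenario, success in annotations:
        totals[agent] += 1
        successes[agent] += int(success)

    # Success rate = successful continuations / all annotated continuations.
    rates = {agent: successes[agent] / totals[agent] for agent in totals}

    # Highest success rate first.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up annotations for two hypothetical agents on two scenarios.
example = [
    ("agent_a", "s1", True),
    ("agent_a", "s2", False),
    ("agent_b", "s1", True),
    ("agent_b", "s2", True),
]
ranking = rank_agents(example)
print(ranking)
```

In practice several annotators might label the same continuation, in which case their labels would need to be aggregated (for example by majority vote) before computing the proportions; that step is omitted here.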
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · AI in Service Interactions
