Evaluating Multimodal Interactive Agents

Josh Abramson; Arun Ahuja; Federico Carnevale; Petko Georgiev; Alex Goldin; Alden Hung; Jessica Landon; Timothy Lillicrap; Alistair Muldal; Blake Richards; Adam Santoro; Tamara von Glehn; Greg Wayne; Nathaniel Wong; Chen Yan

arXiv:2205.13274 · cs.LG · July 15, 2022 · 1 citation

Open Access

TL;DR

This paper introduces the Standardised Test Suite (STS), a novel, efficient evaluation method for multimodal interactive agents that uses real interaction scenarios and human annotations to better assess naturalistic human-agent interactions.

Contribution

The paper proposes the STS, a new evaluation framework that leverages real interaction data and offline analysis to improve the assessment of interactive agents.

Findings

01. STS is faster and more controlled than traditional methods.

02. STS correlates better with naturalistic interactions.

03. The approach is interpretable and scalable.

Abstract

Creating agents that can interact naturally with humans is a common goal in artificial intelligence (AI) research. However, evaluating these interactions is challenging: collecting online human-agent interactions is slow and expensive, yet faster proxy metrics often do not correlate well with interactive evaluation. In this paper, we assess the merits of these existing evaluation metrics and present a novel approach to evaluation called the Standardised Test Suite (STS). The STS uses behavioural scenarios mined from real human interaction data. Agents see replayed scenario context, receive an instruction, and are then given control to complete the interaction offline. These agent continuations are recorded and sent to human annotators to mark as success or failure, and agents are ranked according to the proportion of continuations in which they succeed. The resulting STS is fast,…
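The ranking step described in the abstract (annotators mark each continuation as success or failure, and agents are ordered by their proportion of successes) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `rank_agents`, the `(agent, success)` input format, and the example agent names and labels are all assumptions made up for this sketch.

```python
from collections import defaultdict


def rank_agents(annotations):
    """Rank agents by the fraction of continuations annotated as successful.

    `annotations` is a hypothetical input format: an iterable of
    (agent_name, success) pairs, one pair per annotated continuation.
    """
    totals = defaultdict(int)     # continuations seen per agent
    successes = defaultdict(int)  # continuations marked as success per agent
    for agent, success in annotations:
        totals[agent] += 1
        successes[agent] += int(success)

    # Success rate = successes / total annotated continuations.
    rates = {agent: successes[agent] / totals[agent] for agent in totals}

    # Highest success rate first.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up annotations for two hypothetical agents.
example = [
    ("agent_a", True), ("agent_a", False), ("agent_a", True), ("agent_a", True),
    ("agent_b", True), ("agent_b", False), ("agent_b", False), ("agent_b", False),
]
ranking = rank_agents(example)
```

With these made-up labels, `agent_a` succeeds in 3 of 4 continuations and `agent_b` in 1 of 4, so `agent_a` ranks first. The sketch ignores details the abstract does not specify, such as how multiple annotators' judgements on the same continuation would be combined.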

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · AI in Service Interactions