Using Game Play to Investigate Multimodal and Conversational Grounding in Large Multimodal Models

Sherzod Hakimov; Yerkezhan Abdullayeva; Kushal Koshti; Antonia Schmidt; Yan Weiser; Anne Beyer; David Schlangen

arXiv:2406.14035·cs.CL·December 12, 2024

Open Access

TL;DR

This paper introduces a goal-oriented game evaluation paradigm for multimodal models, assessing their visual understanding and conversational grounding, revealing performance gaps between large closed and open models.

Contribution

It adapts goal-oriented game (self) play, an evaluation method recently developed for text-only models, to multimodal models, providing a new benchmark for assessing visual and conversational grounding capabilities.

Findings

01. The largest closed models perform rather well on the games.

02. Even the best open-weight models struggle with the games.

03. The deep captioning capabilities of the largest models drive some of the performance.

Abstract

While the situation has improved for text-only models, it again seems to be the case currently that multimodal (text and image) models develop faster than ways to evaluate them. In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models, namely evaluation through the goal-oriented game (self) play, complementing reference-based and preference-based evaluation. Specifically, we define games that challenge a model's capability to represent a situation from visual information and align such representations through dialogue. We find that the largest closed models perform rather well on the games that we define, while even the best open-weight models struggle with them. On further analysis, we find that the exceptional deep captioning capabilities of the largest models drive some of the performance. There is still room to grow for both kinds of…

Taxonomy

Topics: Speech and dialogue systems

Methods: ALIGN