MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models
Vanya Cohen, Raymond Mooney

TL;DR
MET-Bench is a benchmark for evaluating how well vision-language models track entity states across text and image modalities. It reveals a significant performance gap between text-based and image-based tracking and points to the need for better multimodal reasoning.
Contribution
This work presents MET-Bench, a multimodal entity tracking benchmark, shows that current models struggle with visual reasoning, and proposes a reinforcement learning approach to improve performance.
Findings
Significant performance gap between text-based and image-based entity tracking.
The gap stems primarily from deficits in visual reasoning rather than perception.
Reinforcement learning improves multimodal tracking performance.
Abstract
Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate vision-language models' ability to track entity states across modalities. Using two structured domains, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based entity tracking. We empirically show this discrepancy primarily stems from deficits in visual reasoning rather than perception. We further show that explicit text-based reasoning strategies improve performance, yet limitations remain in long-horizon multimodal tasks. We develop a reinforcement learning method…
Taxonomy
Topics: Speech and dialogue systems · Semantic Web and Ontologies
