ViSTa Dataset: Do vision-language models understand sequential tasks?
Evžen Wybitul, Evan Ryan Gunter, Mikhail Seleznyov, David Lindner

TL;DR
This paper introduces ViSTa, a dataset for evaluating how well vision-language models understand and judge sequential tasks in virtual home, Minecraft, and real-world environments. Evaluating current models on it reveals their limitations in this area.
Contribution
The paper presents ViSTa, a hierarchical dataset for assessing VLMs' understanding of sequential tasks, and evaluates leading models, highlighting their shortcomings.
Findings
VLMs excel at object recognition but struggle with sequential task understanding.
GPT-4o shows some capability in understanding sequential tasks.
Most models fail to grasp task complexity beyond basic recognition.
Abstract
Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs' potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. Its novel hierarchical structure -- basic single-step tasks composed into more and more complex sequential tasks -- allows a fine-grained understanding of how well VLMs can judge tasks with varying complexity. To illustrate this, we use ViSTa to evaluate state-of-the-art VLMs, including CLIP, ViCLIP, and GPT-4o. We…
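As a rough illustration of how an embedding-based VLM such as CLIP can be used to judge a video against step-by-step descriptions, the sketch below ranks candidate descriptions by cosine similarity to a video embedding. This is a minimal sketch under stated assumptions, not the paper's evaluation protocol: the embeddings are hand-written toy vectors rather than model outputs, and the task descriptions and the function names `cosine_similarity` and `rank_descriptions` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_descriptions(video_embedding, description_embeddings):
    """Return (description, score) pairs, highest similarity first."""
    scored = [
        (description, cosine_similarity(video_embedding, embedding))
        for description, embedding in description_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real model embeddings (assumption).
video = [1.0, 0.0, 1.0]

# Hypothetical descriptions: same objects, different step order.
descriptions = {
    "pick up a mug, then put it in the sink": [0.9, 0.1, 0.8],
    "put a mug in the sink, then pick it up": [0.1, 0.9, 0.2],
}

ranking = rank_descriptions(video, descriptions)
best_description = ranking[0][0]
```

The two candidate descriptions mention the same objects and differ only in step order, which is the kind of distinction a final-state score cannot capture. A model that only recognizes objects would tend to assign both descriptions similar scores.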
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
