ViSTa Dataset: Do vision-language models understand sequential tasks?

Evžen Wybitul; Evan Ryan Gunter; Mikhail Seleznyov; David Lindner

arXiv:2411.13211 · cs.CV · November 22, 2024

TL;DR

This paper introduces ViSTa, a dataset designed to evaluate vision-language models' ability to understand and judge complex sequential tasks across various environments, revealing current models' limitations in this area.

Contribution

The paper presents ViSTa, a hierarchical dataset for assessing VLMs' understanding of sequential tasks, and evaluates leading models, highlighting their shortcomings.

Findings

01. VLMs excel at object recognition but struggle with sequential task understanding.

02. GPT-4o shows some capability in understanding sequential tasks.

03. Most models fail to grasp task complexity beyond basic recognition.

Abstract

Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs' potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. Its novel hierarchical structure -- basic single-step tasks composed into more and more complex sequential tasks -- allows a fine-grained understanding of how well VLMs can judge tasks with varying complexity. To illustrate this, we use ViSTa to evaluate state-of-the-art VLMs, including CLIP, ViCLIP, and GPT-4o. We…
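
The abstract describes using models such as CLIP to judge whether a video matches a task description. The following is a minimal sketch of one common way to do that with a CLIP-style model, not the paper's evaluation code: embed each frame, pool the frames into a single video embedding, and pick the description with the highest cosine similarity. The function names, the choice of mean pooling, and the toy 2-D embeddings are all assumptions made for illustration; the excerpt above does not specify the actual protocol.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(frame_embs: Sequence[Vector]) -> List[float]:
    """Average per-frame embeddings into one video embedding (assumed pooling)."""
    n = len(frame_embs)
    return [sum(column) / n for column in zip(*frame_embs)]


def score_descriptions(
    frame_embs: Sequence[Vector], text_embs: Dict[str, Vector]
) -> Dict[str, float]:
    """Score every candidate description against the pooled video embedding."""
    video = mean_pool(frame_embs)
    return {desc: cosine(video, emb) for desc, emb in text_embs.items()}


def best_description(
    frame_embs: Sequence[Vector], text_embs: Dict[str, Vector]
) -> str:
    """Return the candidate description with the highest similarity score."""
    scores = score_descriptions(frame_embs, text_embs)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical embeddings; a real setup would obtain these from a VLM encoder.
    frames = [[1.0, 0.0], [0.8, 0.2]]
    candidates = {
        "open the drawer, then pick up the mug": [1.0, 0.1],
        "pick up the mug, then open the drawer": [0.1, 1.0],
    }
    print(best_description(frames, candidates))
    # Mean pooling ignores frame order: reversing the video leaves scores unchanged.
    print(score_descriptions(frames, candidates))
    print(score_descriptions(frames[::-1], candidates))
```

In this sketch, mean pooling discards frame order, so a reversed video receives the same scores as the original. That illustrates why a scorer built this way can recognize what appears in a video without being able to tell in which order the steps happened.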

Code & Models

Repositories

eugleo/vista-dataset (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training