Can Vision-Language Models Solve the Shell Game?

Tiedong Liu; Wee Sun Lee

arXiv:2603.08436 · cs.CV · March 10, 2026

Open Access

TL;DR

This paper introduces VET-Bench, a synthetic testbed showing that current vision-language models cannot reliably track visually identical objects over time, and proposes SGCoT, which raises accuracy on the benchmark above 90%.

Contribution

The paper presents VET-Bench for diagnosing tracking limitations in VLMs, gives a theoretical analysis of the expressivity constraints of fixed-depth transformer-based VLMs, and introduces SGCoT, which raises tracking accuracy above 90%.

Findings

01

Current state-of-the-art VLMs perform at or near chance level on VET-Bench.

02

Fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision.

03

SGCoT achieves over 90% accuracy on VET-Bench.
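
To make "chance level" concrete, here is a minimal sketch of a shell game with visually identical cups. It is an illustration only, not VET-Bench itself: the number of cups, the number of swaps, and the function names (`play_shell_game`, `random_swaps`, `guess_accuracy`) are all assumptions chosen for the example. A guesser that sees only the final frame has no cue to tell the cups apart, so its accuracy should sit near 1/num_cups.

```python
import random


def play_shell_game(swaps, start):
    """Return the ball's final position after applying the swaps in order."""
    pos = start
    for a, b in swaps:
        # The ball moves only if its cup is one of the two being swapped.
        if pos == a:
            pos = b
        elif pos == b:
            pos = a
    return pos


def random_swaps(num_cups, num_swaps, rng):
    """Draw a sequence of swaps between two distinct cup positions."""
    return [tuple(rng.sample(range(num_cups), 2)) for _ in range(num_swaps)]


def guess_accuracy(num_cups=3, num_swaps=5, trials=20000, seed=0):
    """Accuracy of a guesser that ignores the motion entirely.

    Hypothetical baseline: because the cups look identical, the final
    frame alone is uninformative, so the guess is uniform at random.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        start = rng.randrange(num_cups)
        swaps = random_swaps(num_cups, num_swaps, rng)
        truth = play_shell_game(swaps, start)
        guess = rng.randrange(num_cups)
        hits += guess == truth
    return hits / trials


if __name__ == "__main__":
    print(f"accuracy with 3 cups: {guess_accuracy():.3f}")
```

Under these assumptions the printed accuracy hovers around one third, whereas `play_shell_game` recovers the correct cup every time because it follows each swap.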

Abstract

Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To…
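
The connection to state tracking can be sketched in a few lines: each swap is a permutation of positions, and answering the final question requires composing all of them in order. The example below is a hedged illustration of that idea, not the paper's SGCoT method; `track_with_trace` and its representation (`arrangement[i]` is the object currently at position `i`) are assumptions made for the sketch. Writing out the arrangement after every swap is one simple form that intermediate states could take.

```python
def track_with_trace(num_objects, swaps):
    """Apply swaps in order and record the arrangement after each one.

    arrangement[i] is the id of the object currently at position i.
    The returned trace has len(swaps) + 1 entries, starting with the
    initial arrangement.
    """
    arrangement = list(range(num_objects))
    trace = [tuple(arrangement)]
    for a, b in swaps:
        # One swap is a transposition; the running arrangement is the
        # composition of every transposition seen so far.
        arrangement[a], arrangement[b] = arrangement[b], arrangement[a]
        trace.append(tuple(arrangement))
    return trace


if __name__ == "__main__":
    for step, state in enumerate(track_with_trace(3, [(0, 1), (1, 2)])):
        print(step, state)
```

Each entry in the trace depends only on the previous entry and one swap, which is why spelling out the steps turns a long composition into a sequence of easy updates.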

Code & Models

Models

tiedong/Molmo2-SGCoT (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition