Can Vision-Language Models Solve the Shell Game?
Tiedong Liu, Wee Sun Lee

TL;DR
This paper introduces VET-Bench, a synthetic testbed showing that current vision-language models cannot track visually identical objects over time, and proposes SGCoT, which raises tracking accuracy on the benchmark to over 90%.
Contribution
The paper presents VET-Bench for diagnosing tracking limitations in VLMs, provides a theoretical analysis of model constraints, and introduces SGCoT to enhance tracking accuracy to over 90%.
Findings
Current VLMs perform at chance level on VET-Bench.
Fixed-depth transformers are fundamentally limited in tracking indistinguishable objects.
SGCoT achieves over 90% accuracy on VET-Bench.
Abstract
Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To…
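The state-tracking framing in the abstract can be made concrete with a small, purely symbolic sketch of the shell game. This is an illustration only, not VET-Bench's video generator or the authors' SGCoT procedure, neither of which is specified in this summary. The names `simulate_shell_game` and `track_ball`, the cup count, and the swap representation are all hypothetical. Because the cups are identical, the only way to recover the ball's final position is to update the state after every swap; a guesser that ignores the swaps is right with probability 1/num_cups, which is what "chance level" refers to.

```python
import random


def simulate_shell_game(num_cups=3, num_swaps=5, seed=0):
    """Generate a hypothetical shell-game episode as (start, swaps).

    start: index of the cup that initially hides the ball.
    swaps: list of (i, j) position pairs whose cups are exchanged in order.
    """
    rng = random.Random(seed)
    start = rng.randrange(num_cups)
    swaps = [tuple(rng.sample(range(num_cups), 2)) for _ in range(num_swaps)]
    return start, swaps


def track_ball(start, swaps):
    """Follow the ball through each swap, one state update per step."""
    position = start
    for i, j in swaps:
        if position == i:
            position = j
        elif position == j:
            position = i
        # Swaps that do not involve the ball's cup leave the state unchanged.
    return position


if __name__ == "__main__":
    start, swaps = simulate_shell_game(num_cups=3, num_swaps=5, seed=0)
    print("start:", start, "swaps:", swaps, "final:", track_ball(start, swaps))
```

The sequential loop in `track_ball` is the point of the sketch: each step depends on the result of the previous one, which is the kind of dependency the abstract's expressivity argument about fixed-depth models concerns.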
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
