VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following
Hyesoo Hong, Minsoo Kim, Wonje Jeung, Sangyeon Yoon, Dongjae Jeon, Albert No

TL;DR
This paper investigates why vision-language models fail at visual path following. It identifies local competition from nearby, similar-looking distractors as a key cause and finds that standard remedies such as scaling and reasoning offer only limited improvement.
Contribution
It introduces controlled tracing tasks to diagnose VLM failures and analyzes the impact of local competition and distractors on path following performance.
Findings
VLMs frequently switch to nearby distractors during path following.
Standard remedies like scaling and reasoning only partially mitigate failures.
Path-switching issues persist in complex, real-world scenes.
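As a toy illustration of the path-switching failure listed above, the following sketch builds a target path and a nearby non-crossing distractor, then reports the first step at which a trace drifts onto the distractor. It is not the paper's benchmark or metric: the function names (make_parallel_paths, nearest_path, first_switch), the straight-line geometry, and the nearest-vertex rule are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def make_parallel_paths(n_steps: int, gap: float) -> Tuple[List[Point], List[Point]]:
    """Hypothetical task generator: a target path and a distractor that run
    side by side, `gap` units apart, and never cross or overlap."""
    target = [(float(x), 0.0) for x in range(n_steps)]
    distractor = [(float(x), gap) for x in range(n_steps)]
    return target, distractor


def nearest_path(p: Point, paths: List[List[Point]]) -> int:
    """Return the index of the path whose closest vertex is nearest to p.
    Ties resolve to the lowest index."""
    def dist2(a: Point, b: Point) -> float:
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    return min(range(len(paths)), key=lambda i: min(dist2(p, q) for q in paths[i]))


def first_switch(trace: List[Point], paths: List[List[Point]],
                 target_idx: int = 0) -> Optional[int]:
    """Return the first step at which the trace is closer to a non-target
    path than to the target, or None if it never leaves the target."""
    for step, p in enumerate(trace):
        if nearest_path(p, paths) != target_idx:
            return step
    return None


target, distractor = make_parallel_paths(n_steps=5, gap=1.0)
# An assumed model trace that starts on the target and drifts upward.
drifting_trace = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.6), (3.0, 1.0)]
switch_step = first_switch(drifting_trace, [target, distractor])
```

Shrinking gap in this sketch moves the distractor closer to the target, which is one simple way to vary how strongly the two paths compete locally.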
Abstract
Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited…
