VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following

Hyesoo Hong; Minsoo Kim; Wonje Jeung; Sangyeon Yoon; Dongjae Jeon; Albert No

arXiv:2605.15672·cs.CV·May 18, 2026

TL;DR

This paper examines why vision-language models (VLMs) fail at visual path following. It identifies local competition from nearby, similar-looking distractors as the main cause and finds that standard remedies such as model scaling and reasoning yield only limited improvements.

Contribution

It introduces controlled tracing tasks to diagnose VLM failures and analyzes the impact of local competition and distractors on path following performance.

Findings

1. VLMs frequently switch to nearby distractors during path following.
2. Standard remedies like scaling and reasoning only partially mitigate failures.
3. Path-switching issues persist in complex, real-world scenes.

Abstract

Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited…
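
To make the idea of a nearby competitor concrete, here is a minimal toy sketch in Python. It is not the paper's task generator, model, or metric: the point-list representation of paths, the `radius` threshold, and the function names `competing_steps` and `classify_endpoint` are all assumptions made for illustration. The sketch marks the steps of a target path where a separate, non-touching distractor comes close, and labels a traced endpoint as correct or switched.

```python
from math import hypot

def competing_steps(target, distractor, radius):
    """Return indices of target points with a distractor point within `radius`.

    Hypothetical helper: paths are lists of (x, y) sample points, and
    `radius` is an assumed stand-in for "nearby".
    """
    return [
        i for i, (tx, ty) in enumerate(target)
        if any(hypot(tx - dx, ty - dy) <= radius for dx, dy in distractor)
    ]

def classify_endpoint(predicted, target_end, distractor_end):
    """Label a traced endpoint: correct, switched to the distractor, or other."""
    if predicted == target_end:
        return "correct"
    if predicted == distractor_end:
        return "switched"
    return "other"

# Toy scene (an assumption, not from the paper): the target runs along y = 0.
# The distractor is sampled at y = 3 and dips to y = 1 for x in 4..6.
# The two paths never cross or overlap.
target = [(x, 0) for x in range(10)]
distractor = [(x, 1 if 4 <= x <= 6 else 3) for x in range(10)]

# Steps where the distractor competes locally with the target.
print(competing_steps(target, distractor, radius=1.5))  # [3, 4, 5, 6, 7]

# A trace that ends on the distractor's endpoint counts as a switch.
print(classify_endpoint((9, 3), target[-1], distractor[-1]))  # switched
```

Under these assumptions, shrinking `radius` or moving the distractor farther away empties the list of competing steps, which loosely mirrors a control condition with no nearby competitor.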
