TraversalBench: Challenging Paths to Follow for Vision Language Models
Clara Petrova, Zhuo Chen, Marin Soljačić

TL;DR
TraversalBench is a new benchmark designed to evaluate vision-language models' ability to follow complex visual paths, highlighting the impact of self-intersections and confounding lines on performance.
Contribution
The paper introduces a controlled, diagnostic benchmark for assessing path-following visual reasoning in multimodal models, emphasizing structural factors and error localization.
Findings
Self-intersections are the main source of difficulty for models.
Performance drops sharply after the first crossing in the path.
Layouts favoring left-to-right reading order are more common but do not fully explain performance.
Abstract
Vision-language models (VLMs) perform strongly on many multimodal benchmarks. However, the ability to follow complex visual paths, a task that human observers typically find straightforward, remains under-tested. We introduce TraversalBench, a controlled benchmark for exact visual path traversal. Each instance contains a single continuous polyline, a unique start marker, and markers placed at path vertices; the task is to recover the exact ordered sequence encountered when traversing the path from start to finish. The benchmark explicitly balances key path-structural factors including self-intersection count, tortuosity, vertex count, and nearby confounding lines, while minimizing reliance on OCR, world knowledge, and open-ended planning. We find that self-intersections are the dominant source of difficulty. A first-crossing analysis shows that errors are sharply localized:…
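The task setup described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' generator or scoring code: the function names, the representation of a path as a list of (x, y) vertices, the restriction to proper crossings between non-adjacent segments, and the strict exact-match scoring rule are all assumptions made for illustration.

```python
from itertools import combinations


def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def segments_cross(a, b, c, d):
    """True if segments ab and cd cross at a point strictly inside both."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def count_self_intersections(vertices):
    """Count proper crossings between non-adjacent segments of a polyline."""
    segs = list(zip(vertices, vertices[1:]))
    return sum(
        segments_cross(*segs[i], *segs[j])
        for i, j in combinations(range(len(segs)), 2)
        if j > i + 1  # adjacent segments only share an endpoint
    )


def exact_match(predicted, truth):
    """Assumed scoring rule: the full ordered marker sequence must match."""
    return list(predicted) == list(truth)


# Hypothetical instance: four vertices whose first and last segments cross.
path = [(0, 0), (4, 4), (4, 0), (0, 4)]
labels = ["S", "A", "B", "C"]  # marker at each vertex, in traversal order

crossings = count_self_intersections(path)
correct = exact_match(["S", "A", "B", "C"], labels)
swapped = exact_match(["S", "B", "A", "C"], labels)
```

Under these assumptions the example path has one self-intersection, and a prediction that swaps two markers is scored as wrong even though it contains the right set of labels.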
