Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning
Guangfu Guo, Xiaoqian Lu, Yue Feng, Mingming Sun

TL;DR
This paper introduces SSV-CoT, a method that models visual reasoning as a sequential process guided by saliency maps, improving multimodal LLMs' goal-driven visual understanding.
Contribution
It proposes a novel structured sequential visual reasoning approach that explicitly models visual importance and reasoning order without external annotations.
Findings
Improves performance across diverse visual reasoning benchmarks.
Validates the effectiveness of structured, sequential visual cognition.
Trains end-to-end with text CoT and answer supervision, without region-level annotations or external tools.
Abstract
Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention shifts selectively and sequentially from the most informative regions to secondary cues, we propose Structural Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning follows this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. The method is trained end-to-end using text CoT and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show gains, validating structured and sequential visual…
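The two steps described in the abstract (score regions by question relevance, then visit them from most to least salient) can be illustrated with a toy sketch. This is not the authors' implementation: the dot-product similarity, the softmax normalization, the top-k cutoff, and the names `saliency_scores` and `ordered_regions` are all assumptions made for illustration, and the paper's saliency map is learned end-to-end rather than computed this way.

```python
import math

def saliency_scores(patch_feats, question_vec):
    """Hypothetical saliency: softmax over patch-question dot products."""
    logits = [sum(p * q for p, q in zip(patch, question_vec))
              for patch in patch_feats]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ordered_regions(scores, k):
    """Indices of the k most salient patches, most salient first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy data (assumed): four image patches with 2-d features and one question vector.
patch_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 0.0]]
question_vec = [1.0, 0.0]

scores = saliency_scores(patch_feats, question_vec)
order = ordered_regions(scores, k=3)
print(order)  # patches to attend to, primary cue first
```

In a sequential visual chain-of-thought setting, a model would consume the patches in `order` one step at a time, so that the primary cue conditions the reasoning before the secondary cues are introduced.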
