Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction
Yichang Jian, Boyuan Xiao, Zhenyuan Huang, Yifei Peng, Yao-Xiang Ding

TL;DR
This paper introduces pattern induction and pattern inference for visual planning in vision-language models (VLMs). Perception is decomposed into simpler, reusable steps, with the aim of improving both efficiency and accuracy.
Contribution
The paper proposes two strategies for VLMs, Pattern Induction and Pattern Inference, which let the model actively recognize and infer visual patterns to improve planning efficiency.
Findings
Pattern Induction discovers reusable visual patterns from experience.
Pattern Inference enables direct inference of local world models.
The two approaches balance accuracy and efficiency across multiple domains.
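As a rough illustration of why reusable patterns can save perception effort, the toy sketch below caches the result of a costly perception call per distinct local patch, so a repeated patch is looked up rather than perceived again. It is not the paper's algorithm: `expensive_perceive`, `PatternLibrary`, the patch encoding, and the use of the raw patch as a lookup key are all hypothetical stand-ins chosen for illustration.

```python
def expensive_perceive(patch):
    """Hypothetical stand-in for a costly perception call.
    Returns the (row, col) offsets of blocked cells ('#') inside the patch."""
    return frozenset(
        (r, c)
        for r, row in enumerate(patch)
        for c, ch in enumerate(row)
        if ch == "#"
    )


class PatternLibrary:
    """Toy cache of local models, keyed by patch content (an assumption)."""

    def __init__(self):
        self.patterns = {}
        self.expensive_calls = 0

    def local_model(self, patch):
        key = tuple(patch)
        if key not in self.patterns:
            # First time this pattern is seen: pay for a full perception call.
            self.patterns[key] = expensive_perceive(patch)
            self.expensive_calls += 1
        # Otherwise the stored local model is reused directly.
        return self.patterns[key]


# Five patches, but only two distinct patterns.
patches = [("#.", ".."), ("..", ".."), ("#.", ".."), ("..", ".."), ("#.", "..")]
library = PatternLibrary()
models = [library.local_model(p) for p in patches]
```

In this toy run, five patches are resolved with two costly calls. A real system would need a cheaper and more robust way to recognize that a patch matches a known pattern than exact equality.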
Abstract
Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs) when the complexity of the input exceeds their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect on an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks far beyond their initial capabilities, but at a cost: too many TWI operations significantly increase the computational overhead. To further improve…
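The idea of building a world model from local visual evidence and then planning over it can be sketched on a toy grid. This is a minimal illustration under stated assumptions, not the authors' implementation: the character-grid scene, `perceive_patch` (a stand-in for one local perception call on a crop), the fixed patch size, and the breadth-first planner are all hypothetical.

```python
from collections import deque

# Hypothetical toy scene: '#' wall, '.' free cell, 'S' start, 'G' goal.
SCENE = [
    "S.#.",
    "..#.",
    "....",
    "#..G",
]


def perceive_patch(scene, top, left, size):
    """Stand-in for one local perception call on a crop of the scene.
    Returns {(row, col): char} for the cells inside the patch."""
    cells = {}
    for r in range(top, min(top + size, len(scene))):
        for c in range(left, min(left + size, len(scene[0]))):
            cells[(r, c)] = scene[r][c]
    return cells


def build_world_model(scene, patch_size=2):
    """Merge local observations into one model and count perception calls."""
    model, calls = {}, 0
    for top in range(0, len(scene), patch_size):
        for left in range(0, len(scene[0]), patch_size):
            model.update(perceive_patch(scene, top, left, patch_size))
            calls += 1
    return model, calls


def plan(model):
    """Breadth-first search over the assembled model; returns a shortest path."""
    start = next(p for p, ch in model.items() if ch == "S")
    goal = next(p for p, ch in model.items() if ch == "G")
    queue, parent = deque([start]), {start: None}
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in model and model[nxt] != "#" and nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None  # goal unreachable in the perceived model


model, calls = build_world_model(SCENE)
path = plan(model)
```

The number of perception calls grows with the number of patches, which mirrors the overhead concern raised in the abstract: finer-grained local perception gives a more reliable model but costs more calls.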
