Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

Yichang Jian; Boyuan Xiao; Zhenyuan Huang; Yifei Peng; Yao-Xiang Ding

arXiv:2605.16848·cs.CV·May 19, 2026

TL;DR

This paper introduces a novel pattern induction and inference approach to enhance visual planning in vision-language models by decomposing perception into simpler, reusable steps, improving efficiency and accuracy.

Contribution

The paper proposes Pattern Induction and Pattern Inference strategies for VLMs, which let the model actively recognize visual patterns and infer from them, improving planning efficiency.

Findings

01

Pattern Induction discovers reusable visual patterns from experience.

02

Pattern Inference enables direct inference of local world models.

03

The proposed approaches balance accuracy and efficiency across multiple domains.

Abstract

Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs) when the complexity of the input exceeds their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks far beyond their initial capabilities, at the cost that a large number of TWI operations significantly increases the computational overhead. To further improve…
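The idea of decomposing perception into local steps and assembling an internal world model before planning can be illustrated with a toy sketch. This is not the authors' implementation: the grid scene, `perceive_patch` (a stand-in for one crop-and-inspect call to a VLM), `build_world_model`, and `plan` are all hypothetical names chosen for illustration, and breadth-first search is simply a convenient planner for the example.

```python
from collections import deque

# Hypothetical 5x5 scene: '#' wall, '.' free, 'S' start, 'G' goal.
SCENE = [
    "S.#..",
    ".##..",
    "...#.",
    ".#...",
    "...#G",
]

def perceive_patch(scene, top, left, size):
    """Stand-in for one local perception step (e.g. a crop shown to a VLM)."""
    cells = {}
    for r in range(top, min(top + size, len(scene))):
        for c in range(left, min(left + size, len(scene[0]))):
            cells[(r, c)] = scene[r][c]
    return cells

def build_world_model(scene, patch=2):
    """Merge local observations into one internal map and count the steps used."""
    model, steps = {}, 0
    for top in range(0, len(scene), patch):
        for left in range(0, len(scene[0]), patch):
            model.update(perceive_patch(scene, top, left, patch))
            steps += 1
    return model, steps

def plan(model):
    """Breadth-first search over the assembled model; returns a list of cells or None."""
    start = next(p for p, v in model.items() if v == "S")
    goal = next(p for p, v in model.items() if v == "G")
    parent = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in model and model[nxt] != "#" and nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None

model, steps = build_world_model(SCENE)
path = plan(model)
```

In this toy setting the number of perception steps grows with the scene size and shrinks with the patch size, which loosely mirrors the overhead trade-off the abstract describes for many TWI operations.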
