Autonomous Imagination: Closed-Loop Decomposition of Visual-to-Textual Conversion in Visual Reasoning for Multimodal Large Language Models
Jingming Liu, Yumeng Li, Boyuan Xiao, Yichang Jian, Ziang Qin, Tianjia Shao, Yao-Xiang Ding, Kun Zhou

TL;DR
This paper introduces autonomous imagination, a method enabling multimodal large language models to iteratively modify visual inputs, decomposing complex visual reasoning tasks into manageable sub-steps without retraining.
Contribution
It proposes a novel closed-loop visual modification approach that enhances MLLMs' reasoning capabilities by decomposing visual-to-textual conversion into iterative steps.
Findings
MLLMs can solve previously unsolvable visual tasks with visual modification.
Closed-loop visual modification improves reasoning without retraining.
The approach is effective across various visual reasoning tasks.
Abstract
Under pure textual modality, Large Language Models (LLMs) have demonstrated remarkable success in complex reasoning tasks by decomposing them into simpler sub-problems. However, Multimodal Large Language Models (MLLMs) still struggle with some seemingly straightforward visual tasks, such as counting and solving jigsaw puzzles. We argue that these tasks challenge the ability of visual-to-textual conversion, where MLLMs convert visual information perceived from the input scene, to textual information for further reasoning and generating the answer. If the complexity of the visual input is beyond the perceptual capability of the MLLMs, without decomposing this conversion process, simply scaling inference-time reasoning cannot solve the task because it repeatedly encounters the same perceptual bottleneck. We propose an approach, autonomous imagination, to enable MLLMs to iteratively modify…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications
MethodsJigsaw
