Autonomous Imagination: Closed-Loop Decomposition of Visual-to-Textual Conversion in Visual Reasoning for Multimodal Large Language Models

Jingming Liu; Yumeng Li; Boyuan Xiao; Yichang Jian; Ziang Qin; Tianjia Shao; Yao-Xiang Ding; Kun Zhou

arXiv:2411.18142 · cs.CV · October 7, 2025

TL;DR

This paper introduces autonomous imagination, a method enabling multimodal large language models to iteratively modify visual inputs, decomposing complex visual reasoning tasks into manageable sub-steps without retraining.

Contribution

It proposes a novel closed-loop visual modification approach that enhances MLLMs' reasoning capabilities by decomposing visual-to-textual conversion into iterative steps.

Findings

01. MLLMs can solve previously unsolvable visual tasks with visual modification.

02. Closed-loop visual modification improves reasoning without retraining.

03. The approach is effective across various visual reasoning tasks.

Abstract

In the purely textual modality, Large Language Models (LLMs) have demonstrated remarkable success in complex reasoning tasks by decomposing them into simpler sub-problems. However, Multimodal Large Language Models (MLLMs) still struggle with some seemingly straightforward visual tasks, such as counting and solving jigsaw puzzles. We argue that these tasks challenge the ability of visual-to-textual conversion, in which MLLMs convert visual information perceived from the input scene into textual information for further reasoning and answer generation. If the complexity of the visual input exceeds the perceptual capability of the MLLM and the conversion process is not decomposed, simply scaling inference-time reasoning cannot solve the task, because it repeatedly encounters the same perceptual bottleneck. We propose an approach, autonomous imagination, to enable MLLMs to iteratively modify…
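
The argument about the perceptual bottleneck can be illustrated with a toy sketch. This is not the paper's implementation: the grid scene, the `CAPACITY` limit, and the functions `perceive_count`, `split`, and `closed_loop_count` are all hypothetical stand-ins chosen for illustration. The sketch assumes a perceiver that counts reliably only up to a fixed number of objects, and shows how a closed loop that modifies the visual input (here, by cropping) breaks one conversion into several that each fit within that limit.

```python
from typing import List, Tuple

# Toy scene: 1 marks an object, 0 marks empty space (hypothetical representation).
Scene = List[List[int]]

# Hypothetical perceptual limit: objects that can be counted reliably in one look.
CAPACITY = 4


def perceive_count(scene: Scene) -> Tuple[int, bool]:
    """Toy stand-in for a model's perception: returns (count, reliable)."""
    true_count = sum(sum(row) for row in scene)
    if true_count <= CAPACITY:
        return true_count, True
    # Beyond the limit the estimate saturates, however often it is retried.
    return CAPACITY, False


def split(scene: Scene) -> List[Scene]:
    """Toy visual modification: crop the scene into two halves."""
    rows, cols = len(scene), len(scene[0])
    if cols >= rows and cols > 1:
        mid = cols // 2
        return [[r[:mid] for r in scene], [r[mid:] for r in scene]]
    mid = rows // 2
    return [scene[:mid], scene[mid:]]


def closed_loop_count(scene: Scene) -> int:
    """Perceive, and if the view is too complex, modify it and look again."""
    pending = [scene]
    total = 0
    while pending:
        view = pending.pop()
        count, reliable = perceive_count(view)
        if reliable:
            total += count
        else:
            pending.extend(split(view))
    return total


if __name__ == "__main__":
    scene = [
        [1, 1, 0, 1],
        [1, 0, 1, 1],
        [0, 1, 1, 0],
        [1, 1, 0, 1],
    ]
    print(perceive_count(scene))     # single pass hits the bottleneck
    print(closed_loop_count(scene))  # decomposed passes recover the full count
```

In this toy setting, calling `perceive_count` repeatedly on the full scene always returns the same saturated estimate, which mirrors the claim that scaling inference-time reasoning alone does not help. Only changing the input to each perception step lets the loop make progress.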

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Jigsaw