Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning

Yunhao Gou; Kai Chen; Zhili Liu; Lanqing Hong; Xin Jin; Zhenguo Li; James T. Kwok; Yu Zhang

arXiv:2506.04559 · cs.CV · March 24, 2026

TL;DR

This paper introduces RAPID, a modular approach that decouples perception and reasoning in multi-modal models, enabling scalable, inference-time improvements by pairing perception modules with external reasoning LLMs, without retraining.
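The decoupling described above can be pictured as two text-connected stages. The following is a minimal sketch, not the authors' implementation: the `Perceiver` and `Reasoner` interfaces, the prompt layout, and the toy stand-in functions are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical interfaces, named here for illustration only.
Perceiver = Callable[[str, str], str]  # (image reference, question) -> text description
Reasoner = Callable[[str], str]        # text prompt -> answer


def answer(image_ref: str, question: str, perceive: Perceiver, reason: Reasoner) -> str:
    """Run perception first, then hand only text to the reasoner."""
    description = perceive(image_ref, question)
    # Assumed prompt layout; the paper's actual templates are not shown on this page.
    prompt = f"Image description: {description}\nQuestion: {question}"
    return reason(prompt)


# Toy stand-ins so the sketch runs without any model.
def toy_perceiver(image_ref: str, question: str) -> str:
    return "A right triangle with legs 3 and 4."


def weak_reasoner(prompt: str) -> str:
    return "unknown"


def strong_reasoner(prompt: str) -> str:
    return "5" if "legs 3 and 4" in prompt else "unknown"


question = "What is the length of the hypotenuse?"
print(answer("triangle.png", question, toy_perceiver, weak_reasoner))
print(answer("triangle.png", question, toy_perceiver, strong_reasoner))
```

Because the reasoner only ever sees text, swapping `weak_reasoner` for `strong_reasoner` changes the answer without touching the perception stage, which is the property the TL;DR refers to as inference-time scaling.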

Contribution

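The paper proposes a perception-reasoning decoupling framework and a novel reinforcement learning algorithm, Visual Perception Optimization (VPO), to improve multi-modal reasoning without the costly vision-language alignment retraining otherwise needed to upgrade an MLLM's internal LLM.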

Findings

01. Significant performance gains on multi-modal reasoning benchmarks.

02. Enables inference-time scaling by pairing perception modules with external reasoners.

03. Achieves faithful, query-relevant captioning through VPO.

Abstract

Recent breakthroughs in reasoning language models have significantly advanced text-based reasoning. On the other hand, Multi-modal Large Language Models (MLLMs) still lag behind, hindered by their outdated internal LLMs. Upgrading these LLMs is often prohibitively expensive, as it requires costly vision-language alignment retraining. To address this issue, we introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. This approach redefines the MLLM's role to convert multi-modal inputs into detailed textual outputs that can be processed by any powerful, external, text-only LLM reasoners. To align the MLLM's perceptual output with the final reasoning task, we propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO). VPO rewards the MLLM based on the correctness of answers generated by…
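The outcome-based reward can be sketched as follows, under stated assumptions. The reviews on this page describe the reward as the correctness of the reasoner's answer and call VPO a variant of GRPO; the exact-match scoring, the prompt layout, the group-normalized advantage, and every function name below are assumptions for illustration, not the paper's definitions.

```python
from statistics import mean, pstdev
from typing import Callable, List


def outcome_reward(caption: str, question: str, gold_answer: str,
                   reason: Callable[[str], str]) -> float:
    """Score a caption by whether the reasoner answers correctly from it."""
    prompt = f"Image description: {caption}\nQuestion: {question}"
    # Assumption: exact string match stands in for the real answer checker.
    return 1.0 if reason(prompt).strip() == gold_answer.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style normalization: compare each sample with its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy reasoner: answers correctly only if the caption carries the key detail.
def toy_reasoner(prompt: str) -> str:
    return "5" if "legs 3 and 4" in prompt else "unknown"


question = "What is the length of the hypotenuse?"
captions = [
    "A right triangle with legs 3 and 4.",
    "A triangle on a white background.",
    "A geometric figure.",
    "A right triangle; the legs 3 and 4 are labeled.",
]

rewards = [outcome_reward(c, question, "5", toy_reasoner) for c in captions]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this toy group, captions that preserve the detail the question depends on receive positive advantages and vague captions receive negative ones, which illustrates how an outcome reward could push a captioner toward query-relevant descriptions.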

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 3

Strengths

1. The paper is very well-written and logically structured, making the core ideas easy to understand. The figures are clear and effectively illustrate the main concepts.
2. The motivation is straightforward.

Weaknesses

1. This paper lacks novelty. The paper presents a relatively straightforward idea. It uses a multimodal large language model (MLLM) to describe an input image and then relies on a separate language model for textual reasoning. This setup mainly combines existing components rather than introducing a new methodological contribution. The multimodal stage performs description rather than genuine reasoning, which limits the conceptual depth of the approach.
2. The proposed pipeline does not offer cl

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. The paper proposes training a visual captioner using the accuracy of LLM responses as a reward signal. Although combining VLM-based captioning with LLM reasoning has become common, to the best of my knowledge, introducing RL for training a visual captioner is novel. Therefore, the work demonstrates strong originality.
2. The experiments are very solid, particularly as the conclusions are validated across models of multiple scales, with comprehensive ablation studies provided.
3. The paper i

Weaknesses

The main limitation of the paper is that it does not provide a sufficiently detailed justification for the necessity of adopting a VLM captioner + LLM reasoner framework. In fact, many existing open-source and closed-source MLLMs still rely on unified reasoning, and the unified model appears to be an important trend in the development of large models. The authors could elaborate on their perspective regarding the future direction of large models: whether they believe the field will move toward

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The pipeline diagrams and prompt templates (referenced figures/appendices) make it easy to follow the two-stage flow and what exactly is optimized.
2. Thorough ablations: perception variants (none / cap / qcap / sol / cap+sol / qcap+sol), with and without VPO/GRPO, with/without penalties, and different LLMs for training vs. inference.
3. VPO is a neat twist on GRPO for caption supervision by outcome: the reward comes from a downstream verifier (the reasoner’s answer correctness), not from cap

Weaknesses

1. Comparisons to verification-augmented or tool-enabled LMMs (e.g., visual verification modules, external OCR/detection tools) are missing. This matters because a strong verifier might reduce drift without multi-turn agent interaction.
2. The method introduces two RL phases (GRPO then VPO) plus group rollouts and external LLM calls for rewards. While appendices mention batch sizes/steps, the wall-clock/cost, GPU hours, and reasoner-call counts per step are not surfaced prominently in the main t

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques