Training-Free Reasoning and Reflection in MLLMs

Hongchen Wei; Zhenzhong Chen

arXiv:2505.16151·cs.CV·May 23, 2025

Open Access · 3 Reviews

TL;DR

FRANK is a training-free method that gives off-the-shelf multimodal large language models reasoning and reflection abilities through hierarchical, layer-wise weight merging: shallow decoder layers stay oriented toward visual perception, while deeper layers take on textual reasoning. It reports state-of-the-art results with no gradient updates.

Contribution

The paper introduces a novel layer-wise fusion mechanism that imparts reasoning capabilities to off-the-shelf MLLMs without retraining or extra supervision.

Findings

01

FRANK-38B achieves 69.2% accuracy on the MMMU benchmark.

02

Outperforms the baseline InternVL2.5-38B by 5.3 percentage points on MMMU.

03

Surpasses proprietary GPT-4o in multimodal reasoning tasks.

Abstract

Recent advances in Reasoning LLMs (e.g., DeepSeek-R1 and OpenAI-o1) have showcased impressive reasoning capabilities via reinforcement learning. However, extending these capabilities to Multimodal LLMs (MLLMs) is hampered by the prohibitive costs of retraining and the scarcity of high-quality, verifiable multimodal reasoning datasets. This paper introduces FRANK Model, a training-FRee ANd r1-liKe MLLM that imbues off-the-shelf MLLMs with reasoning and reflection abilities, without any gradient updates or extra supervision. Our key insight is to decouple perception and reasoning across MLLM decoder layers. Specifically, we observe that compared to the deeper decoder layers, the shallow decoder layers allocate more attention to visual tokens, while the deeper decoder layers concentrate on textual semantics. This observation motivates a hierarchical weight merging approach that combines a…
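The layer-wise idea in the abstract can be illustrated with a minimal sketch. It assumes the vision-side MLLM decoder and a reasoning model share one base model and one architecture, so that per-layer differences from the base are well defined. The function name `merge_layerwise`, the list-of-arrays representation, and the linear depth schedule are hypothetical choices made for illustration; they are not the paper's implementation or its coefficients.

```python
import numpy as np


def merge_layerwise(base, vision, reasoning, vision_weight):
    """Merge two fine-tuned models layer by layer (illustrative sketch).

    base, vision, reasoning: lists of per-layer weight arrays with equal shapes.
    vision_weight: per-layer weight in [0, 1] on the vision-side delta
                   (assumed schedule, not the paper's closed-form value).
    """
    merged = []
    for theta0, theta_v, theta_r, w in zip(base, vision, reasoning, vision_weight):
        tau_v = theta_v - theta0  # vision-side delta for this layer
        tau_r = theta_r - theta0  # reasoning-side delta for this layer
        merged.append(theta0 + w * tau_v + (1.0 - w) * tau_r)
    return merged


# Toy example: 4 "layers" of 3x3 weights with a weight that decreases with depth,
# so shallow layers lean toward the vision model and deep layers toward reasoning.
rng = np.random.default_rng(0)
toy_base = [rng.normal(size=(3, 3)) for _ in range(4)]
toy_vision = [w + 0.1 * rng.normal(size=(3, 3)) for w in toy_base]
toy_reasoning = [w + 0.1 * rng.normal(size=(3, 3)) for w in toy_base]
toy_schedule = np.linspace(1.0, 0.0, num=4)
toy_merged = merge_layerwise(toy_base, toy_vision, toy_reasoning, toy_schedule)
```

With this schedule the first layer equals the vision model's layer and the last equals the reasoning model's layer; any real schedule would sit between these extremes.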

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper replaces heuristic task-arithmetic with a per-layer merger derived from a second-order Taylor approximation under NTK-style linearization, yielding analytic coefficients (their Eq. (13), prior-weighted in Eq. (15)) that require no grid search or supervision.
- The attention-guided prior is fit to the model’s observed decay of visual attention across depth and then used to bias fusion via a simple exponential schedule, tightening the link between empirical signal and architectural cho
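As a rough illustration of fitting an exponential schedule to a depth-wise attention profile, the sketch below uses a log-linear least-squares fit. The function name `fit_exponential_decay`, the parametrization `a * exp(-b * layer)`, and the synthetic profile are assumptions made for this example; the paper's own fitting procedure is not shown in this summary.

```python
import numpy as np


def fit_exponential_decay(visual_attention):
    """Fit a * exp(-b * layer) to a positive per-layer attention profile.

    Takes the share of attention on visual tokens at each layer and returns
    (a, b) from a straight-line fit to log(attention) against layer index.
    """
    y = np.asarray(visual_attention, dtype=float)
    layers = np.arange(len(y))
    # polyfit returns coefficients highest degree first: [slope, intercept]
    slope, intercept = np.polyfit(layers, np.log(y), 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic profile (not measured data): attention to visual tokens decays with depth.
synthetic_profile = 0.6 * np.exp(-0.1 * np.arange(12))
a_hat, b_hat = fit_exponential_decay(synthetic_profile)
```

The log-linear fit is chosen for simplicity; it weights small attention values more heavily than a direct nonlinear fit would.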

Weaknesses

- The closed-form per-layer coefficients rest on NTK-style linearization and an isotropic Hessian surrogate, but the paper provides no error bounds or diagnostics quantifying deviation from these ideals in finite-width, multimodal transformers; optimality can drift when curvature/anisotropy is non-negligible.
- The derivation effectively requires near-orthogonality between the vision and reasoning deltas; when vectors correlate, performance materially degrades, and the method lacks a correlatio
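One simple way to probe the near-orthogonality concern is to measure the per-layer cosine similarity between the two deltas. The sketch below is a generic diagnostic, not something the paper or the reviewer specifies; `layer_cosine` and `orthogonality_report` are hypothetical names, and the models are assumed to be lists of per-layer arrays sharing a common base.

```python
import numpy as np


def layer_cosine(tau_a, tau_b, eps=1e-12):
    """Cosine similarity between two flattened per-layer deltas."""
    a = np.ravel(tau_a)
    b = np.ravel(tau_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def orthogonality_report(base, model_a, model_b):
    """Per-layer cosine between (model_a - base) and (model_b - base).

    Values near 0 are consistent with the near-orthogonality assumption;
    values far from 0 flag layers where the two deltas correlate.
    """
    return [layer_cosine(a - p, b - p) for p, a, b in zip(base, model_a, model_b)]
```

Layers with large absolute cosine would be the natural candidates for closer inspection before merging.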

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The method requires no gradient updates, reinforcement learning, or extra labeled data. Fusion coefficients are given in closed form, avoiding grid search and validation-set tuning. Under NTK linearization and approximate orthogonality of task vectors, a second-order Taylor expansion yields closed-form fusion weights whose dependence only on layer-wise task-vector norms is both concise and interpretable. Empirical results demonstrate successful transfer of R1-like reasoning and self-reflection b
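To see why these assumptions leave only task-vector norms in the objective, consider the following worked expansion. The notation ($\theta_0$ for the shared base, $\tau_v$ and $\tau_r$ for the vision and reasoning task vectors, $\lambda_v$ and $\lambda_r$ for the fusion weights, $h_v$ for an isotropic curvature scale) is assumed for illustration, and the gradient term is dropped on the assumption that $\theta_v$ is a stationary point of $\mathcal{L}_v$. It is a sketch of the general mechanism, not a reproduction of the paper's Eq. (13).

```latex
% Illustrative sketch; all symbols are assumed notation, not the paper's.
\begin{aligned}
\theta &= \theta_0 + \lambda_v \tau_v + \lambda_r \tau_r,
  \qquad \theta_v = \theta_0 + \tau_v, \\
\mathcal{L}_v(\theta) &\approx \mathcal{L}_v(\theta_v)
  + \tfrac{1}{2}\, \delta_v^{\top} H_v\, \delta_v,
  \qquad \delta_v = \theta - \theta_v = (\lambda_v - 1)\,\tau_v + \lambda_r \tau_r, \\
\delta_v^{\top} H_v\, \delta_v &\approx h_v \lVert \delta_v \rVert^2
  \qquad (H_v \approx h_v I), \\
\lVert \delta_v \rVert^2 &= (\lambda_v - 1)^2 \lVert \tau_v \rVert^2
  + \lambda_r^2 \lVert \tau_r \rVert^2
  + 2(\lambda_v - 1)\lambda_r \langle \tau_v, \tau_r \rangle .
\end{aligned}
```

When $\langle \tau_v, \tau_r \rangle \approx 0$, the cross term vanishes and the approximate loss gap depends on the task vectors only through $\lVert \tau_v \rVert^2$ and $\lVert \tau_r \rVert^2$. The same expansion holds for the reasoning task with the roles of $v$ and $r$ exchanged.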

Weaknesses

1. A central insight that shallow layers handle perception while deep layers handle reasoning (Fig. 2) has already been articulated in [1], which diminishes the contribution.
2. The paper lacks discussion and comparison with closely related work. Both the proposed method and [2] reformulate differences between the merged model and task-specific models via Taylor approximation (paired with NTK linearization and high-dimensional approximate orthogonality) into a data-free computable objective. In

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The derivation of closed-form fusion weights using Taylor expansion and NTK linearization is rigorous and well-motivated.
- Comprehensive experiments across multiple benchmarks and model scales (8B to 38B) show consistent and significant improvements over strong baselines.
- The paper includes thorough ablations to validate the contribution of each component (e.g., modality priors, layer-wise fusion).

Weaknesses

- The method relies heavily on the orthogonality of task vectors and the NTK linearization assumption, which may not hold universally, especially for smaller models or non-standard architectures. The authors are encouraged to provide results on more mainstream architectures such as LLaVA, Qwen, etc. Have the authors considered the potential negative impact of task vector interference when the orthogonality assumption is violated? Are there fallback mechanisms?
- The approach is only validated o

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Topic Modeling

Methods: Softmax · Attention Is All You Need