Training-Free Reasoning and Reflection in MLLMs
Hongchen Wei, Zhenzhong Chen

TL;DR
FRANK is a training-free method that enhances multimodal large language models with reasoning and reflection abilities by hierarchically merging visual and textual processing layers, achieving state-of-the-art results without additional training.
Contribution
The paper introduces a novel layer-wise fusion mechanism that imparts reasoning capabilities to off-the-shelf MLLMs without retraining or extra supervision.
Findings
FRANK-38B achieves 69.2% accuracy on the MMMU benchmark.
Outperforms baseline InternVL2.5-38B by +5.3%.
Surpasses proprietary GPT-4o in multimodal reasoning tasks.
Abstract
Recent advances in Reasoning LLMs (e.g., DeepSeek-R1 and OpenAI-o1) have showcased impressive reasoning capabilities via reinforcement learning. However, extending these capabilities to Multimodal LLMs (MLLMs) is hampered by the prohibitive costs of retraining and the scarcity of high-quality, verifiable multimodal reasoning datasets. This paper introduces FRANK Model, a training-FRee ANd r1-liKe MLLM that imbues off-the-shelf MLLMs with reasoning and reflection abilities, without any gradient updates or extra supervision. Our key insight is to decouple perception and reasoning across MLLM decoder layers. Specifically, we observe that compared to the deeper decoder layers, the shallow decoder layers allocate more attention to visual tokens, while the deeper decoder layers concentrate on textual semantics. This observation motivates a hierarchical weight merging approach that combines a…
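The layer-wise merging idea in the abstract can be illustrated with a minimal sketch. It is not the paper's closed-form rule, which is not reproduced on this page. The sketch assumes that the MLLM's language decoder and the reasoning LLM share a common base checkpoint, that each layer is merged as a convex combination of the two task vectors, and that the weight on the visual side decays exponentially with depth. The function names, the `decay` value, and the convex-combination form are all hypothetical.

```python
import numpy as np


def exponential_visual_prior(num_layers, decay=0.15):
    """Hypothetical depth prior: weight on the visual task vector.

    Layer 0 gets weight 1.0 and the weight shrinks as exp(-decay * depth),
    mirroring the observation that shallow layers attend more to visual tokens.
    The decay rate is an illustrative assumption, not a value from the paper.
    """
    depth = np.arange(num_layers)
    return np.exp(-decay * depth)


def merge_layerwise(base, vision_model, reasoning_model, visual_weight):
    """Merge per-layer weights with a depth-dependent convex combination.

    Each argument except `visual_weight` is a list with one array per layer.
    merged_l = base_l + w_l * tau_v_l + (1 - w_l) * tau_r_l, where tau_* are
    task vectors (fine-tuned weights minus base weights). This form is an
    assumption for illustration only.
    """
    merged = []
    for layer, w in enumerate(visual_weight):
        tau_v = vision_model[layer] - base[layer]      # visual task vector
        tau_r = reasoning_model[layer] - base[layer]   # reasoning task vector
        merged.append(base[layer] + w * tau_v + (1.0 - w) * tau_r)
    return merged


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_layers, dim = 6, 4
    base = [rng.normal(size=dim) for _ in range(n_layers)]
    vision = [b + 0.1 * rng.normal(size=dim) for b in base]
    reasoning = [b + 0.1 * rng.normal(size=dim) for b in base]
    weights = exponential_visual_prior(n_layers)
    merged = merge_layerwise(base, vision, reasoning, weights)
    print(len(merged), merged[0].shape)
```

With this schedule the first layer is identical to the visual model's layer, and deeper layers move progressively toward the reasoning model, with no gradient updates involved.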
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper replaces heuristic task-arithmetic with a per-layer merger derived from a second-order Taylor approximation under NTK-style linearization, yielding analytic coefficients (their Eq. (13), prior-weighted in Eq. (15)) that require no grid search or supervision.
- The attention-guided prior is fit to the model's observed decay of visual attention across depth and then used to bias fusion via a simple exponential schedule, tightening the link between empirical signal and architectural cho
- The closed-form per-layer coefficients rest on NTK-style linearization and an isotropic Hessian surrogate, but the paper provides no error bounds or diagnostics quantifying deviation from these ideals in finite-width, multimodal transformers; optimality can drift when curvature/anisotropy is non-negligible.
- The derivation effectively requires near-orthogonality between the vision and reasoning deltas; when vectors correlate, performance materially degrades, and the method lacks a correlatio
The method requires no gradient updates, reinforcement learning, or extra labeled data. Fusion coefficients are given in closed form, avoiding grid search and validation-set tuning. Under NTK linearization and approximate orthogonality of task vectors, a second-order Taylor expansion yields closed-form fusion weights whose dependence only on layer-wise task-vector norms is both concise and interpretable. Empirical results demonstrate successful transfer of R1-like reasoning and self-reflection b
1. A central insight that shallow layers handle perception while deep layers handle reasoning (Fig. 2) has already been articulated in [1], which diminishes the contribution.
2. The paper lacks discussion and comparison with closely related work. Both the proposed method and [2] reformulate differences between the merged model and task-specific models via Taylor approximation (paired with NTK linearization and high-dimensional approximate orthogonality) into a data-free computable objective. In
- The derivation of closed-form fusion weights using Taylor expansion and NTK linearization is rigorous and well-motivated.
- Comprehensive experiments across multiple benchmarks and model scales (8B to 38B) show consistent and significant improvements over strong baselines.
- The paper includes thorough ablations to validate the contribution of each component (e.g., modality priors, layer-wise fusion).
- The method relies heavily on the orthogonality of task vectors and the NTK linearization assumption, which may not hold universally, especially for smaller models or non-standard architectures. The authors are encouraged to provide results on more mainstream architectures such as LLaVA, Qwen, etc. Have the authors considered the potential negative impact of task vector interference when the orthogonality assumption is violated? Are there fallback mechanisms?
- The approach is only validated o
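Several of the reviews above ask for a diagnostic of the orthogonality assumption between the vision and reasoning task vectors. One simple way such a check could look is a per-layer cosine similarity, sketched below. This is not part of the paper: the function names and the `threshold` value are hypothetical, and the task vectors are assumed to be available as one array per layer.

```python
import numpy as np


def layer_cosine(tau_a, tau_b, eps=1e-12):
    """Cosine similarity between two task vectors, flattened to 1-D.

    `eps` guards against division by zero when a task vector is all zeros.
    """
    a = np.ravel(tau_a)
    b = np.ravel(tau_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def flag_correlated_layers(taus_vision, taus_reasoning, threshold=0.1):
    """Return per-layer cosines and the indices whose |cosine| exceeds threshold.

    The threshold is an arbitrary illustrative cutoff, not a value from the paper.
    """
    cosines = [layer_cosine(v, r) for v, r in zip(taus_vision, taus_reasoning)]
    flagged = [i for i, c in enumerate(cosines) if abs(c) > threshold]
    return cosines, flagged


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random high-dimensional vectors are nearly orthogonal, so few or no
    # layers should be flagged in this toy example.
    taus_v = [rng.normal(size=4096) for _ in range(8)]
    taus_r = [rng.normal(size=4096) for _ in range(8)]
    cosines, flagged = flag_correlated_layers(taus_v, taus_r)
    print(flagged)
```

Layers flagged by such a check are where the near-orthogonality assumption is most strained, so they would be natural candidates for closer inspection before merging.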
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Topic Modeling
Methods: Softmax · Attention Is All You Need
