TL;DR
VOLD transfers reasoning skills from text-only teacher models to vision-language models by combining on-policy distillation with reinforcement learning (GRPO), improving reasoning performance over GRPO alone.
Contribution
The paper presents VOLD, a new method combining on-policy distillation with reinforcement learning for transferring reasoning capabilities from text-only teachers to vision-language models.
Findings
VOLD outperforms baseline models on multiple reasoning benchmarks.
Cold-start alignment is crucial for effective reasoning transfer.
VOLD achieves state-of-the-art results across diverse datasets.
Abstract
Training vision-language models (VLMs) for complex reasoning remains a challenging task, among other reasons because of the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between…
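To make the combination described in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It shows group-relative advantages computed from a group of rollout rewards, and a token-level KL term between student and teacher distributions evaluated on the student's own rollout. The function names, the weight `beta`, and the KL direction (student relative to teacher) are all assumptions made for illustration. GRPO's clipped importance ratio is omitted for brevity.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward within its rollout group (GRPO-style advantage)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) at one token position; assumes a shared vocabulary."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def combined_loss(
    token_logps: Sequence[float],
    advantage: float,
    student_logits_seq: Sequence[Sequence[float]],
    teacher_logits_seq: Sequence[Sequence[float]],
    beta: float = 0.1,  # hypothetical distillation weight, not taken from the paper
) -> float:
    """Simplified policy-gradient term plus beta-weighted mean token-level KL."""
    pg = -advantage * sum(token_logps) / len(token_logps)
    kl = sum(
        reverse_kl(s, t) for s, t in zip(student_logits_seq, teacher_logits_seq)
    ) / len(student_logits_seq)
    return pg + beta * kl


# Toy usage: four rollouts for one prompt, rewarded 1 if the answer verified.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
loss = combined_loss(
    token_logps=[-1.0, -2.0],
    advantage=advantages[0],
    student_logits_seq=[[2.0, 0.5, 0.1], [0.3, 1.2, 0.0]],
    teacher_logits_seq=[[1.5, 0.7, 0.2], [0.1, 1.5, 0.1]],
)
```

Because the KL term is computed position by position over the vocabulary, the sketch only works when teacher and student share a tokenizer. The reward gives one scalar per rollout, while the KL term gives a signal at every token.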
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Clear, principled framework: Unifies RL (GRPO) with on-policy distillation on shared rollouts, offering dense token-level guidance at minimal extra cost. Strong empirical results: Consistent improvements across diverse multimodal benchmarks, including MathVision and LogicVista, despite training exclusively on text. Careful ablations: Cold-start alignment analysis convincingly shows that distributional alignment is a prerequisite for effective on-policy distillation. Component analysis isolates
Limitations of visual training and performance ceiling: The proposed method primarily targets the transfer and enhancement of reasoning patterns, with training signals derived entirely from the text domain (the RL stage also uses purely text-based, verifiable tasks). It does not explicitly strengthen visual representations or cross-modal alignment. Consequently, for perception-intensive tasks that rely on fine-grained visual perception and spatial relationship modeling, the performance ceiling a
1. The paper correctly identifies policy alignment as a premise for on-policy distillation.
1. The paper studies whether complex multimodal reasoning can be significantly improved by training only on textual reasoning data. The authors motivate this by the scarcity of multimodal data, but they fail to address the necessity of it. - From both an intuitive and logical perspective, multimodal reasoning is not just textual reasoning with an image attached; it is an integrated task where visual perception and abstract reasoning are inextricably linked. - In human cognition, perception inf
1. The paper is well written. 2. The motivation is clear: on-policy distillation enables effective transfer of reasoning capabilities from the text-only domain to vision-language tasks. The SFT "cold-start" aligns the distributions of teacher and student. 3. The proposed method is simple yet effective according to the empirical results. 4. Comprehensive ablations are presented to analyze the proposed method.
1. **Limited novelty**. Distillation from large language models has been widely studied on various tasks (including Qwen3) [1,2,3]. VOLD presents a simple combination of normal RL training and on-policy distillation. Besides, recent work [4] also studies the transfer between text to visual reasoning with a two-stage recipe (SFT for stage one and RL for stage two). These works raise my concern about the novelty and contribution of the method. 2. **Lack of analysis** about the behavior of finetune
1. Novel and Effective Framework: The core strength is the VOLD framework itself—a well-motivated and coherent two-stage process that effectively combines SFT, RL, and knowledge distillation for cross-modal reasoning transfer. 2. Thorough and Convincing Ablation Studies: The paper's quality is significantly elevated by its rigorous ablation studies. The experiments systematically validate the necessity of the policy alignment stage (Table 2), the contribution of each component (Table 3), the im
1. Generalizability to Heterogeneous Models: The method's reliance on a shared tokenizer between the teacher and student is a key requirement for the KL divergence calculation. This is explicitly mentioned but also means the study is confined to the Qwen model family. This raises questions about the framework's generalizability to scenarios where one might want to use a teacher and student from different model families (e.g., a GPT-4 teacher and a Llama-based VLM student). A discussion of potent
1. The method is well-formulated, with clear mathematical definitions and a coherent training pipeline. Ablation studies and component analyses validate the necessity of each design choice. 2. The paper is clearly structured and easy to follow, explaining both theoretical motivations and empirical findings with sufficient detail.
1. The related work section omits several near-contemporary methods in teacher-guided or reasoning-enhanced multimodal training, which weakens the positioning of its originality claim. Specifically, it does not sufficiently distinguish itself from recent concurrent works that explore similar ideas of reasoning transfer or guided RL training, such as: [1] Yan, J., Li, Y., Hu, Z., Wang, Z., Cui, G., Qu, X., Cheng, Y., & Zhang, Y. (2025, April 21). Learning to Reason under Off-Policy Guidance. arX
