Empowering Small VLMs to Think with Dynamic Memorization and Exploration
Jiazhen Liu, Yuchuan Deng, and Long Chen

TL;DR
This paper introduces DyME, a training paradigm that dynamically balances memorization and exploration to equip small vision-language models with thinking capabilities and improve their performance on specialized tasks.
Contribution
The paper proposes DyME, a method that adaptively combines supervised fine-tuning (SFT) and reinforcement learning with verifiable reward (RLVR) for stable, effective training of small VLMs.
Findings
DyME stabilizes training and improves performance across diverse tasks.
Dynamic selection between SFT and RLVR enhances model thinking abilities.
Experiments on three models with 0.5-1B parameters show gains over SFT, GRPO, and two-stage training.
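The dynamic selection between SFT and RLVR can be sketched in a few lines. This is a minimal illustration based only on the binary rule as one of the peer reviews below paraphrases it (at least one correct sample leads to an RLVR step, otherwise an SFT step). The function name `choose_mode`, the reward convention (1.0 for a verified-correct response), and the threshold are assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def choose_mode(rewards: List[float], correct_threshold: float = 1.0) -> str:
    """Pick the training mode for one prompt from its sampled responses.

    `rewards` holds one verifiable reward per sampled response.
    Assumption: a reward >= `correct_threshold` means "verified correct".
    """
    if not rewards:
        raise ValueError("need at least one sampled response")
    # Binary rule as paraphrased in the reviews:
    # any correct sample -> explore with RLVR, otherwise -> memorize with SFT.
    any_correct = any(r >= correct_threshold for r in rewards)
    return "rlvr" if any_correct else "sft"


if __name__ == "__main__":
    print(choose_mode([0.0, 1.0, 0.0, 0.0]))  # one correct sample
    print(choose_mode([0.0, 0.0, 0.0, 0.0]))  # no correct sample
```

Under this reading, the model only explores on prompts where it can already reach a correct answer at least once, and falls back to supervised targets elsewhere.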
Abstract
Small-scale Vision-Language Models (SVLMs) are exceptionally well-suited for proprietary tasks. Equipping them with thinking capabilities is a critical step to enhance their performance and reliability in these specific domains. However, existing training paradigms, including Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR), impose substantial demands on the base VLM, exceeding the capacity of SVLMs. Consequently, directly applying these paradigms to SVLMs fails to instill the desired thinking abilities. A natural solution is to combine SFT and RLVR, leveraging their complementarity to reduce the dependence on model capacity. Yet the core challenge lies in managing the inherent trade-off: excessive reliance on SFT can force the model to memorize pseudo thinking traces, while over-emphasizing RLVR can lead to unstable exploration (i.e., advantage…
Peer Reviews
Decision: ICLR 2026 Poster
1. The proposed new training procedures seem to be simple yet effective. DyME can be applied to SVLM and can achieve significant performance gains across different domains. The method can also reduce the advantage collapse and constrained exploration. 2. The experiment is comprehensive. The paper compares the proposed training strategy with two-stage, GRPO, and SFT on three different models with 0.5-1B parameters. The ablation study shows the importance of the proposed training strategy and visu
1. Some baselines are pretty old. The paper needs to include some newer LVLMs such as Qwen2.5-VL, etc. The experiment section is also purely a quantitative evaluation. Some qualitative evaluation or human evaluation can help readers to understand the quality of the chain better. For example, the length of CoT after using DyME compared to the two-stage baseline. Adding additional experiments, such as pure textual or pure vision tasks, can help readers understand the performance gain better. The current ev
1. The paper tackles a practical and important problem of enabling complex reasoning on small, efficient models. 2. The paper is generally well-written, and the problem is clearly motivated. Figure 1, in particular, provides a good illustration of why existing paradigms might fail on SVLMs and how DyME tries to solve this. 3. Strong empirical results. The authors show in Tab. 1 that standard SFT, RLVR, and a two-stage approach degrade the performance of SVLMs on these tasks, while DyME consist
1. The "vision supervision" modules (Visual Checker and Refiner) involve additional dependencies. The modules are critical to the method's performance (as shown in Tab. 2), but they are implemented via prompting an external Qwen2.5-14B. This introduces additional knowledge and makes the comparisons unfair. 2. Question regarding the novelty of this work. Compared with existing hybrid SFT+RL methods, the authors claim the main novelty is the dynamic switching criterion. However, the criterion used (fall
1. Timely and Meaningful Problem: Focuses on reasoning for small, efficient VLMs—highly relevant for real-world deployment on edge devices. 2. Elegant and Effective Approach: The dynamic “memorize–explore” mechanism intuitively balances stability and exploration, well-suited to SVLM limitations. 3. Strong Experimental Evidence: - Baselines clearly show SFT/RLVR failures, motivating DyME. - Consistent, significant improvements across all domains. - Ablation studies confirm each component’s
1. Reliance on External LLM: The visual checker and refiner rely on a large external model (Qwen2.5-14B), introducing extra complexity, cost, and dependency. This makes performance partly contingent on the external LLM’s capability, slightly undermining the goal of a self-contained small-model framework. 2. Rigid Switching Heuristic: The binary rule (“if one correct → RLVR, else → SFT”) may cause abrupt shifts; a softer, reward-based switch could yield smoother training. 3. Limited Task Generali
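The "advantage collapse" mentioned in the reviews can be illustrated with a group-normalized advantage of the kind used in GRPO-style training. The sketch below is an assumption-laden illustration, not the paper's code: it assumes each response's advantage is its reward minus the group mean, divided by the group standard deviation, and the name `group_relative_advantages` and the `eps` constant are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses.

    Assumption: GRPO-style normalization, (r - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Every sample is wrong: all advantages are zero, so the RL update
    # carries no learning signal for this prompt.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
    # One sample is right: the correct response gets a positive advantage
    # and the incorrect ones get negative advantages.
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```

Read this way, a group with no correct sample gives RLVR nothing to learn from, which is consistent with the binary rule quoted above sending such prompts to SFT instead.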
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
