Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, Jeany Son

TL;DR
This paper proposes a think-answer distillation framework for vision-language models (VLMs) that masks the student's salient reasoning prefixes during distillation, so the student relies more on visual evidence while reasoning, which improves multimodal reasoning performance.
Contribution
It introduces a masking-based distillation method that encourages students to rely more on visual evidence, improving reasoning capabilities in VLMs.
Findings
Outperforms recent open-source VLMs on reasoning benchmarks.
Enhances visual evidence utilization during reasoning.
Improves student model performance through novel masking strategies.
Abstract
Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their computational cost becomes substantial, especially for larger VLMs. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace, as long think-answer traces suffer from visual forgetting issues. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient…
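To make the prefix-masking idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract is truncated before the masking strategies are described, so the function name `mask_salient_prefix`, the `ratio` parameter, the `MASK_ID` placeholder, and the assumption that a per-token saliency score is already available are all hypothetical. It shows only the general pattern of hiding the most salient tokens of a reasoning prefix, so that a model predicting the next token has fewer textual cues to lean on.

```python
from typing import List, Sequence

MASK_ID = -1  # hypothetical placeholder id standing in for a masked token


def mask_salient_prefix(
    prefix_ids: Sequence[int],
    saliency: Sequence[float],
    ratio: float = 0.3,
    mask_id: int = MASK_ID,
) -> List[int]:
    """Replace the most salient tokens of a reasoning prefix with mask_id.

    prefix_ids: token ids of the reasoning prefix generated so far.
    saliency:   one score per prefix token (how it is computed is an
                assumption left open here; the source does not specify it).
    ratio:      fraction of prefix tokens to mask (hypothetical knob).
    """
    if len(prefix_ids) != len(saliency):
        raise ValueError("prefix_ids and saliency must have the same length")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")

    # Number of tokens to hide, rounded down.
    k = int(len(prefix_ids) * ratio)

    # Indices of the k highest-saliency tokens.
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    to_mask = set(ranked[:k])

    # Keep the prefix length unchanged; only the selected positions are masked.
    return [mask_id if i in to_mask else tok for i, tok in enumerate(prefix_ids)]


masked = mask_salient_prefix([10, 11, 12, 13], [0.1, 0.9, 0.2, 0.8], ratio=0.5)
print(masked)
```

In a distillation loop, the masked prefix would be fed to the student alongside the image, with the unmasked teacher output as the target. That wiring is omitted here because the source does not describe it.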
