Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

Seonghoon Yu; Dongjun Nam; Byung-Kwan Lee; Jeany Son

arXiv:2605.11651 · cs.CV · May 18, 2026

TL;DR

This paper proposes a think-answer distillation framework for vision-language models that masks the student's salient reasoning prefixes during distillation, pushing the student to rely on visual evidence while it reasons and improving multimodal reasoning performance.

Contribution

It introduces a masking-based distillation method that encourages students to rely more on visual evidence, improving reasoning capabilities in VLMs.

Findings

01. Outperforms recent open-source VLMs on reasoning benchmarks.

02. Enhances visual evidence utilization during reasoning.

03. Improves student model performance through novel masking strategies.

Abstract

Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their computational cost becomes substantial, especially for larger VLMs. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace, as long think-answer traces suffer from visual forgetting issues. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient…
