Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation
Yiming Ren, Yujiu Yang, Junjie Wang

TL;DR
This paper introduces Input-Adaptive Depth Aggregation (IADA), a lightweight method that enhances reasoning in vision-language models by improving cross-depth access, significantly boosting reasoning scores with minimal additional parameters.
Contribution
The paper reveals that preserving cross-depth representations is crucial for reasoning in VLMs and proposes IADA, a novel input-adaptive mechanism that improves reasoning performance efficiently.
Findings
On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points over LoRA-only fine-tuning.
IADA improves the average perception score by 3.3 points over the same baseline.
IADA requires only 0.14M additional parameters and is especially effective in low-rank settings.
Abstract
Supervised fine-tuning (SFT) on visual instruction data often improves perceptual capabilities in vision-language models (VLMs) while degrading reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation is related to disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning. Building on this observation, we propose Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning with only…
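The abstract does not spell out how the aggregation is computed, so the following is only an illustrative sketch of what "input-adaptive, modality-aware, low-rank" cross-depth aggregation could look like, not the authors' implementation. Everything named here is an assumption made for illustration: the function names, the use of the last layer's hidden state as the query, the softmax over depth, the per-modality bias, and the toy dimensions. The fixed variant simply averages over layers; the adaptive variant predicts per-token weights over layers through a rank-r bottleneck.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fixed_depth_aggregation(layer_states):
    """Fixed baseline (assumed form): uniform average over depth.

    layer_states has shape (num_layers, num_tokens, dim).
    """
    return layer_states.mean(axis=0)

def adaptive_depth_aggregation(layer_states, is_image, w_down, w_up, modality_bias):
    """Hypothetical input-adaptive variant.

    Each token's last-layer state is projected through a low-rank
    bottleneck (dim -> rank -> num_layers) to get logits over depth.
    A per-modality bias (row 0 = text, row 1 = image) makes the
    weighting modality-aware.
    """
    query = layer_states[-1]                                 # (T, d)
    logits = (query @ w_down) @ w_up                         # (T, L)
    logits = logits + modality_bias[is_image.astype(int)]    # (T, L)
    weights = softmax(logits, axis=-1)                       # (T, L), rows sum to 1
    mixed = np.einsum("tl,ltd->td", weights, layer_states)   # (T, d)
    return mixed, weights

# Toy dimensions chosen for illustration only.
rng = np.random.default_rng(0)
num_layers, num_tokens, dim, rank = 4, 6, 16, 2
states = rng.normal(size=(num_layers, num_tokens, dim))
is_image = np.array([True, True, True, False, False, False])

w_down = rng.normal(scale=0.1, size=(dim, rank))
w_up = rng.normal(scale=0.1, size=(rank, num_layers))
modality_bias = np.zeros((2, num_layers))

mixed, weights = adaptive_depth_aggregation(
    states, is_image, w_down, w_up, modality_bias
)

# Extra parameters of this sketch: d*r + r*L + 2*L.
extra_params = w_down.size + w_up.size + modality_bias.size
```

In this sketch the added parameter count grows as d*r + r*L + 2*L, which is why a small bottleneck rank keeps the module light. The toy sizes above are not meant to reproduce the 0.14M figure reported for the paper's method.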
