Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation

Yiming Ren; Yujiu Yang; Junjie Wang

arXiv:2603.26330·cs.CV·March 30, 2026

TL;DR

This paper introduces Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that restores cross-depth access during vision-language fine-tuning. On Qwen3-VL-2B, it raises the average reasoning score by 9.5 points over LoRA-only fine-tuning while adding only 0.14M parameters.

Contribution

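The paper finds that even fixed cross-depth aggregation substantially restores the reasoning lost during supervised fine-tuning, which suggests that preserved cross-depth access is an important missing factor in VLM fine-tuning. It then proposes IADA, which makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck.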

Findings

01

IADA improves the average reasoning score by 9.5 points over LoRA-only fine-tuning on Qwen3-VL-2B.

02

IADA improves the average perception score by 3.3 points over LoRA-only fine-tuning on Qwen3-VL-2B.

03

IADA adds only 0.14M parameters and is especially effective in low-rank settings.

Abstract

Supervised fine-tuning (SFT) on visual instruction data often improves perceptual capabilities in vision-language models (VLMs) while degrading reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation is related to disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning. Building on this observation, we propose Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning with only…
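The contrast the abstract draws between fixed and input-adaptive cross-depth aggregation can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the class name `AdaptiveDepthAggregator`, the use of the deepest layer's token state as the query, the per-modality bias, and the softmax mixing over layers are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mix_layers(layer_states, weights):
    """Weighted sum of per-layer hidden states for one token."""
    dim = len(layer_states[0])
    return [sum(w * h[i] for w, h in zip(weights, layer_states)) for i in range(dim)]


def fixed_aggregate(layer_states, layer_logits):
    """Fixed aggregation: the same layer weights for every input."""
    return mix_layers(layer_states, softmax(layer_logits))


class AdaptiveDepthAggregator:
    """Hypothetical input-adaptive variant with a low-rank bottleneck."""

    def __init__(self, hidden_dim, num_layers, rank, seed=0):
        rng = random.Random(seed)
        # Low-rank bottleneck (assumed shape): hidden_dim -> rank -> num_layers.
        self.down = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                     for _ in range(rank)]
        self.up = [[rng.uniform(-0.1, 0.1) for _ in range(rank)]
                   for _ in range(num_layers)]
        # Assumed form of modality awareness: one bias vector per token type.
        self.modality_bias = {"text": [0.0] * num_layers,
                              "image": [0.0] * num_layers}

    def num_parameters(self):
        rank, hidden_dim, num_layers = len(self.down), len(self.down[0]), len(self.up)
        return rank * hidden_dim + num_layers * rank + 2 * num_layers

    def depth_weights(self, token_state, modality):
        """Layer weights computed from the token itself (input-adaptive)."""
        logits = matvec(self.up, matvec(self.down, token_state))
        bias = self.modality_bias[modality]
        return softmax([l + b for l, b in zip(logits, bias)])

    def aggregate(self, layer_states, modality):
        # Assumption: the deepest layer's state serves as the query.
        weights = self.depth_weights(layer_states[-1], modality)
        return mix_layers(layer_states, weights), weights


# Toy usage: one image token with 4 layers of 8-dimensional hidden states.
rng = random.Random(1)
layer_states = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
aggregator = AdaptiveDepthAggregator(hidden_dim=8, num_layers=4, rank=2)
mixed, weights = aggregator.aggregate(layer_states, "image")
uniform_mix = fixed_aggregate(layer_states, [0.0] * 4)
```

In this toy setup the adaptive path adds `rank * hidden_dim + num_layers * rank + 2 * num_layers` parameters, which shows why a small bottleneck rank keeps the overhead low. The 0.14M figure reported for IADA depends on the actual architecture and is not reproduced here.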
