TL;DR
This paper investigates how individual layers in vision-language models can hinder or help downstream tasks, introducing a method to bypass interfering layers at inference time to improve performance.
Contribution
It systematically analyzes layer influence in VLMs, introduces the Task-Layer Interaction Vector, and proposes TaLo, a training-free method to bypass interfering layers for better task performance.
Findings
Intervening on specific layers can improve task performance.
Task-interfering layers exhibit task-specific sensitivity patterns.
TaLo enhances model accuracy without parameter updates.
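The text here does not spell out how TaLo decides which layer to bypass. As a minimal sketch only, assume a per-layer list of accuracy changes has already been measured on a small probe set for the task, and that the rule is "bypass the single layer with the largest positive gain, otherwise keep the full model". The function name `select_layer_to_bypass` and the `min_gain` threshold are hypothetical, not taken from the paper.

```python
from typing import Optional, Sequence


def select_layer_to_bypass(
    deltas: Sequence[float], min_gain: float = 0.0
) -> Optional[int]:
    """Pick one layer to bypass from per-layer accuracy changes.

    deltas[i] is assumed to be (accuracy with layer i bypassed) minus
    (accuracy of the unmodified model) on a small probe set.
    Returns the index of the layer with the largest gain, or None when
    no layer improves on the base model by more than `min_gain`.
    No gradients or parameter updates are involved.
    """
    if not deltas:
        return None
    best = max(range(len(deltas)), key=lambda i: deltas[i])
    return best if deltas[best] > min_gain else None


# Hypothetical probe-set measurements for a three-layer model.
choice = select_layer_to_bypass([0.0, 0.5, -0.1])
```

Returning `None` when no layer clears the threshold keeps the unmodified model as the fallback, which matches the observation that some tasks appear to need all layers.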
Abstract
Current VLMs have demonstrated capabilities across a wide range of multimodal tasks. Typically, in a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. We find that intervening on a single layer, such as by zeroing its parameters, can improve performance on certain tasks, indicating that some layers hinder rather than help downstream tasks. We systematically investigate how individual layers influence different tasks via layer intervention. Specifically, we measure the change in performance relative to the base model after intervening on each layer and observe improvements when bypassing specific layers. This improvement generalizes across models and datasets, indicating the presence of Task-Interfering Layers that harm downstream task performance. We introduce the Task-Layer Interaction Vector, which quantifies the effect of…
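The layer-intervention measurement described in the abstract can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' implementation: the "model" is a stack of scalar residual layers, "bypassing" a layer means dropping its residual branch, and the resulting list of per-layer accuracy changes stands in for a task's interaction vector (the abstract's exact definition is truncated above). The names `forward`, `accuracy`, and `interaction_vector` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Layer = Callable[[float], float]
Example = Tuple[float, bool]  # (input, label)


def forward(layers: Sequence[Layer], x: float, skip: Optional[int] = None) -> float:
    """Run a residual stack; the layer at index `skip` is bypassed."""
    h = x
    for i, layer in enumerate(layers):
        if i == skip:
            continue  # only the residual path is kept: h passes through unchanged
        h = h + layer(h)
    return h


def accuracy(layers: Sequence[Layer], data: Sequence[Example],
             skip: Optional[int] = None) -> float:
    """Fraction of examples where the sign of the output matches the label."""
    correct = sum((forward(layers, x, skip) > 0) == label for x, label in data)
    return correct / len(data)


def interaction_vector(layers: Sequence[Layer], data: Sequence[Example]) -> List[float]:
    """Per-layer change in accuracy relative to the unmodified model."""
    base = accuracy(layers, data)
    return [accuracy(layers, data, skip=i) - base for i in range(len(layers))]


# Toy model: layer 1 adds a constant negative shift that hurts this task.
toy_layers: List[Layer] = [
    lambda h: 0.5 * h,
    lambda h: -3.0,
    lambda h: 0.1 * h,
]
toy_data: List[Example] = [(-2.0, False), (-1.0, False), (1.0, True), (2.0, True)]

deltas = interaction_vector(toy_layers, toy_data)
print(deltas)  # [0.0, 0.5, 0.0]
```

In this toy setup only the middle layer has a positive entry, so it plays the role of a task-interfering layer: bypassing it raises accuracy from 0.5 to 1.0, while bypassing either of the other layers changes nothing.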
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1) The proposed TaLo framework is training-free and lightweight, aligning with the current direction of test-time adaptation for Vision-Language Models. It can make a meaningful contribution to this area and provide useful inspiration and reference for future studies. 2) The experimental design is rigorous and comprehensive: Across five multimodal benchmarks (MMStar, MMBench, MMMU, ScienceQA, SEEDBench), TaLo achieves consistent gains, up to +10.4% on LLaVA and +16.6% on Qwen-VL, demo
1) The paper explores the internal structure of vision-language models at a coarse, layer-wise granularity, investigating how entire layers influence downstream task performance. However, it does not articulate the advantages of this approach over prior studies that conduct finer-grained, parameter-level analyses within each layer (e.g., Regularized Mask Tuning [1]). In fact, the latter may provide a more precise and informative perspective, as parameters within the same layer can simultaneously
S1. The idea and empirical observations are very interesting, and the motivation of this paper is clear. S2. The approach is simple. It is easy to understand and to implement. S3. The “Task-Adaptive Layer Knockout” method is effective for some tasks and computationally cheap during test-time. S4. Their backpropagation-free “Dynamic Layer Selection” stage is lightweight computationally, which is a great advantage over fine-tuning methods. It is practical for large models under resource-constr
W1. Limited novelty. The empirical novelty is limited, overlapping substantially with existing literature on layer importance, redundancy, and pruning. The core analysis tool, the "Task-Layer Interaction Vector," is very similar to the layer importance analysis presented in prior work in [1]. Furthermore, the intervention techniques employed are not novel. Parameter zeroing is a standard technique extensively explored in the model pruning literature. Similarly, the uniform weight intervention st
The paper is well written and easy to follow. The finding that knocking out certain layers can increase accuracy on certain tasks is novel. The study of layer importance has been prevalent in LLMs, and the application to VLMs is new.
I spot the following weaknesses. Imprecise description of “layer intervention”. The intervention occurs on the attention part, and the MLPs are never altered. In this case, the wording of “parameter zeroing” for “individual layers” is a bit misleading, and the paper has never ablated the effects of MLPs. Do they have similar effects? Will the claim still hold for MLPs? If not, the theme of the paper should be adjusted to put an emphasis on the importance of the attention mechanism, not on the f
1. The writing and illustrations are clear, concise, and easy to follow. 2. The authors present a well-designed empirical framework, including layer-wise intervention, the introduction of the Task-Layer Interaction Vector, and systematic evaluation across multiple tasks and models. 3. The results are consistently validated across diverse VLM architectures (LLaVA, Qwen-VL, InternVL), benchmarks, and test sets, demonstrating robustness of the proposed method.
1. While the paper offers hypotheses about the origins of task-interfering layers, it's still difficult to understand why certain benchmarks consistently exhibit the phenomenon, while others seem to require all layers. 2. The study primarily evaluates pretrained models and does not investigate whether instruction finetuning (e.g., using task-specific training data such as extensive math datasets) alters the presence or impact of task-interfering layers. Assessing instruction-finetuned models would
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
