LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
Tongtian Yue, Longteng Guo, Yepeng Tang, Zijia Zhao, Xinxin Zhu, Hua Huang, Jing Liu

TL;DR
LaVi introduces an efficient vision-language model that modulates internal features within large language models, significantly reducing computational costs while maintaining state-of-the-art performance across multiple benchmarks.
Contribution
LaVi proposes a novel internal feature modulation mechanism for seamless vision-language integration, avoiding long-context expansion and improving scalability and efficiency.
Findings
Achieves state-of-the-art multimodal performance on 15 benchmarks.
Reduces FLOPs by 94.0% compared to LLaVA-OV-7B.
Increases inference speed by 3.1 times and halves memory usage.
Abstract
Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient visual-language integration. Current methods either disrupt the model's inherent structure or introduce severe long-context computational burden, severely limiting scalability and efficiency. In this paper, we rethink multimodal integration and present LaVi, a novel LVLM that enables seamless and efficient vision-language fusion through internal feature modulation within the Large Language Models (LLMs). Unlike dominant LVLMs that rely on visual token concatenation, LaVi bypasses long-context expansion by introducing a lightweight and adaptive transformation, which incorporates visual context by injecting token-wise vision-conditioned deltas into the affine parameters of layer normalization. This mechanism directly modulates linguistic hidden…
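The mechanism the abstract describes, adding token-wise vision-conditioned deltas to the affine parameters of layer normalization, can be illustrated with a minimal sketch. This is not the authors' implementation. The conditioning step (attention-pooling the visual features with each text token as the query, then two linear projections) and every name below (`vision_conditioned_deltas`, `modulate_sequence`, `w_gamma`, `w_beta`) are assumptions made for illustration only.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard layer normalization over one token's hidden vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(vec, weight):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def vision_conditioned_deltas(token, visual_feats, w_gamma, w_beta):
    """Hypothetical conditioning module (an assumption, not from the paper):
    pool the visual features with the text token as the attention query,
    then project the pooled context to a scale delta and a shift delta."""
    scores = [sum(t * f for t, f in zip(token, feat)) for feat in visual_feats]
    weights = softmax(scores)
    context = [sum(w * feat[i] for w, feat in zip(weights, visual_feats))
               for i in range(len(token))]
    return linear(context, w_gamma), linear(context, w_beta)

def modulate_sequence(hidden, visual_feats, gamma, beta, w_gamma, w_beta):
    """Apply layer norm to each text token with its own shifted affine
    parameters. No visual tokens are appended to the sequence."""
    out = []
    for token in hidden:
        d_gamma, d_beta = vision_conditioned_deltas(
            token, visual_feats, w_gamma, w_beta)
        g = [a + d for a, d in zip(gamma, d_gamma)]
        b = [a + d for a, d in zip(beta, d_beta)]
        out.append(layer_norm(token, g, b))
    return out

# Toy inputs: 2 text tokens and 3 visual features, hidden size 4.
dim = 4
hidden = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
visual = [[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, -1.0, 0.5], [0.0, 0.5, 0.5, 0.0]]
gamma = [1.0] * dim
beta = [0.0] * dim
# Illustrative projection weights (0.1 * identity), chosen arbitrarily.
w_small = [[0.1 if i == j else 0.0 for j in range(dim)] for i in range(dim)]

modulated = modulate_sequence(hidden, visual, gamma, beta, w_small, w_small)
print(len(hidden), len(modulated))  # sequence length before and after
```

In this sketch the output has as many tokens as the text input, however many visual features are supplied, which is the property the abstract contrasts with visual token concatenation. If the projection weights are all zero, the function reduces to plain layer normalization.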
Peer Reviews
Decision · Submitted to ICLR 2026
1. LaVi modulates the affine parameters of layer normalization, avoiding the complexity issues caused by excessively long contexts. 2. LaVi achieves significant efficiency improvements.
1. Lack of necessary theoretical analysis (detailed in questions). 2. Lack of comparison with some existing work: [a] Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng. LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token. [b] Bo Tong, Bokai Lai, Yiyi Zhou, Gen Luo, Yunhang Shen, Ke Li, Xiaoshuai Sun, Rongrong Ji. FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression.
1. The approach is well motivated: it improves the efficiency of vision-language models by changing how vision information is injected into language models. 2. The overall idea and implementation of feature modulation injection are simple. 3. Empirical results show that LaVi is highly efficient, using less than 10% of the computation and 50% of the latency to achieve performance similar to token-concatenation baselines. 4. Ablation studies are conducted to show the effectiveness of the proposed modules.
1. One major flaw of the proposed feature modulation injection is that, like adaLN in diffusion transformers, it supports only one set of visual input. For multiple visual inputs (multiple input images, not the multiple frames or image tiles mentioned in the manuscript), the difficulty lies in choosing which visual input should condition the layer norm. This limits the application of the proposed feature modulation injection to broader applications of VLMs, hence hardly able to be employed i
1. **Addresses a Critical Problem:** The paper tackles the highly relevant and important problem of computational inefficiency in Large Vision-Language Models (LVLMs). As models grow in capability and are applied to longer visual contexts (e.g., high-resolution images, videos), the quadratic complexity of self-attention becomes a major bottleneck. The work's focus on creating a more scalable and practical framework is well-motivated and timely. 2. **Impressive Efficiency Gains:** The primary
1. **Limited Novelty:** The core idea of modulating normalization layer parameters with external conditioning is not new. This concept is well-established in computer vision, most notably with Adaptive Instance Normalization (AdaIN) for style transfer (Dumoulin et al., 2016) and conditional normalization in generative models (Brock et al., 2018). The paper's primary contribution is the application of this existing idea to the domain of LVLMs. While this extension is acknowledged, it can be view
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
