From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion

Cheng Chen; Yuyu Guo; Pengpeng Zeng; Jingkuan Song; Peng Di; Hang Yu; Lianli Gao

arXiv:2601.10710·cs.CV·January 16, 2026

TL;DR

This paper introduces Cross-Layer Injection (CLI), a lightweight framework that lets the LLM in a vision-language model dynamically draw on hierarchical visual features from multiple vision-encoder layers, rather than only the encoder's final output, improving multimodal understanding.

Contribution

The paper proposes a novel, lightweight cross-layer injection framework with adaptive modules that dynamically connect vision and language models, surpassing static architectures.

Findings

01. CLI improves performance on 18 benchmarks.

02. Dynamic feature integration enhances multimodal understanding.

03. CLI is scalable and adaptable across models.

Abstract

Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual…
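
To make the many-to-many idea concrete, here is a minimal sketch of how features from several vision layers could be projected into the LLM's hidden size and then injected into one LLM layer through a learned gate. It is not the paper's implementation: the abstract is truncated and gives no equations, so the function names (`project_layers`, `gated_inject`), the dot-product relevance scores, the softmax weighting, and the scalar sigmoid gate are all assumptions chosen for illustration.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def project_layers(vision_feats, projections):
    """Assumed stand-in for multi-projection: one linear map per vision
    layer, bringing each layer's feature into the LLM hidden size."""
    return [matvec(W, v) for W, v in zip(projections, vision_feats)]


def gated_inject(hidden, projected, gate_w, gate_b=0.0):
    """Assumed stand-in for gated fusion at a single LLM layer.

    hidden    : LLM hidden state (length d)
    projected : list of projected vision-layer features (each length d)
    gate_w    : hypothetical gate weights (length d)
    """
    d = len(hidden)
    # Relevance of each vision layer to the current hidden state.
    scores = [
        sum(h * p for h, p in zip(hidden, feat)) / math.sqrt(d)
        for feat in projected
    ]
    weights = softmax(scores)
    # Weighted mix of the vision layers.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, projected))
        for i in range(d)
    ]
    # Scalar gate decides how much visual signal enters the residual stream.
    gate = sigmoid(sum(g * h for g, h in zip(gate_w, hidden)) + gate_b)
    out = [h + gate * f for h, f in zip(hidden, fused)]
    return out, weights, gate


# Usage: two vision layers (dim 3) injected into an LLM state of dim 2.
vision_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
projections = [
    [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]],  # layer-0 projection (2x3)
    [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]],  # layer-1 projection (2x3)
]
projected = project_layers(vision_feats, projections)
out, weights, gate = gated_inject([1.0, 0.0], projected, gate_w=[0.0, 0.0])
```

Calling `gated_inject` once per LLM layer, each time over the same set of projected vision layers, is what would make the connection many-to-many in this sketch; a one-to-one baseline would instead feed only the final vision layer to the LLM input.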

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications