Sparse Shortcuts: Facilitating Efficient Fusion in Multimodal Large Language Models

Jingrui Zhang; Feng Liang; Yong Zhang; Wei Wang; Runhao Zeng; Xiping Hu

arXiv:2602.00505 · cs.CV · February 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces SparseCut, a novel architecture with sparse shortcut connections for multimodal large language models, enabling efficient hierarchical visual feature fusion that improves performance without added computational costs.

Contribution

SparseCut provides a new cross-modal fusion method with sparse shortcuts and multi-grained feature fusion, enhancing semantic integration in MLLMs efficiently and scalably.

Findings

01 Significantly improves MLLM performance on multiple benchmarks.

02 Enables hierarchical visual feature fusion without increasing computational overhead.

03 Demonstrates generality across different base LLMs.
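
To illustrate why shortcut-style fusion need not lengthen the LLM input, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes that visual features are already projected to the LLM hidden width and are added element-wise onto the existing visual-token positions at a few selected LLM layers. The names `add_shortcut`, `run_with_shortcuts`, and the layer indices in `shortcuts` are hypothetical, and the transformer blocks themselves are omitted.

```python
def add_shortcut(hidden, visual, start):
    """Return a copy of `hidden` with `visual` added element-wise
    onto rows start .. start + len(visual) - 1 (the visual-token positions)."""
    out = [row[:] for row in hidden]
    for i, feat in enumerate(visual):
        for j, value in enumerate(feat):
            out[start + i][j] += value
    return out


def run_with_shortcuts(hidden, shortcuts, num_llm_layers, start):
    """Walk through the LLM layers and inject features only where a shortcut exists.

    `shortcuts` maps an LLM layer index to already-projected visual features
    (an assumption made for this sketch).
    """
    for layer in range(num_llm_layers):
        # A real model would apply its transformer block here; omitted in this sketch.
        if layer in shortcuts:
            hidden = add_shortcut(hidden, shortcuts[layer], start)
    return hidden


# Toy sizes: 6 tokens (2 text, 3 visual, 1 text), hidden width 4.
hidden = [[0.0] * 4 for _ in range(6)]
visual_start, num_visual = 2, 3

# Hypothetical sparse schedule: only LLM layers 0 and 4 receive a shortcut.
shortcuts = {
    0: [[1.0] * 4 for _ in range(num_visual)],
    4: [[0.5] * 4 for _ in range(num_visual)],
}

fused = run_with_shortcuts(hidden, shortcuts, num_llm_layers=8, start=visual_start)
print(len(fused))  # sequence length is unchanged: 6
```

Because the features are added in place rather than appended as extra tokens, the sequence length seen by every LLM layer stays the same in this toy setup, and the text-token rows are untouched.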

Abstract

With the remarkable success of large language models (LLMs) in natural language understanding and generation, multimodal large language models (MLLMs) have rapidly advanced in their ability to process data across multiple modalities. While most existing efforts focus on scaling up language models or constructing higher-quality training data, limited attention has been paid to effectively integrating cross-modal knowledge into the language space. In vision-language models, for instance, aligning modalities using only high-level visual features often discards the rich semantic information present in mid- and low-level features, limiting the model's ability of cross-modality understanding. To address this issue, we propose SparseCut, a general cross-modal fusion architecture for MLLMs, introducing sparse shortcut connections between the cross-modal encoder and the LLM. These shortcut…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Addresses two critical pain points of existing MLLMs: loss of mid/low-level visual semantics (by leveraging multi-level vision encoder layers) and high computation from multi-resolution features (by fusing features before shortcut injection), filling gaps in current cross-modal fusion designs.
2. SparseCut is compatible with diverse base LLMs (Vicuna, Phi-3) and scales across model sizes (3.5B–13B). The shortcut pattern (order/distribution/density) is configurable, making it a flexible framewor…

Weaknesses

1. While the paper tests sparse/uniform and dense/bottom patterns, it lacks a systematic exploration of why the U-shaped order is optimal (e.g., no comparison to linear/random connection orders) or how to dynamically adjust shortcut density/distribution for different tasks (e.g., fine-grained recognition vs. coarse visual reasoning).
2. The paper mentions freezing the vision encoder during training but provides no analysis of training stability (e.g., whether sparse shortcuts mitigate overfitting…

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The overall idea is conceptually clear and well motivated.
2. The evaluation covers a reasonably broad set of benchmarks, demonstrating the generalization capabilities of the approach.
3. The manuscript is clearly written and easy to read.

Weaknesses

1. The experiments are confined to the Vicuna-based LLaVA framework. To support the claim of wide applicability, additional validation on more diverse and up-to-date LLM backbones (e.g., Qwen2.5 series) and multimodal architectures (e.g., Qwen2.5-VL, InternVL-2.5) would be essential.
2. The paper emphasizes efficiency on the language side but neglects the additional cost incurred by processing higher-resolution images through the vision encoder. Reporting overall metrics such as end-to-end FLOPs…

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. This paper proposes a method that can efficiently integrate multi-level and multi-resolution visual features without increasing computational cost.
2. Through shortcut connections, SparseCut effectively incorporates multi-granularity visual features into the LLM while preserving its original context length and computational efficiency.
3. The experimental results demonstrate strong generalization and scalability across different base LLMs.

Weaknesses

1. The choice of shortcut pattern (density, distribution) may require manual tuning.
2. The method relies on a frozen vision encoder, potentially limiting deeper cross-modal alignment.
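
To make the density and distribution knobs raised in the reviews concrete, here is a small sketch of how a shortcut pattern might be parameterized. It is illustrative only and does not reproduce the paper's configuration. The function `shortcut_layers` is hypothetical, "uniform" is assumed to mean evenly spaced LLM layers, and "bottom" is assumed to mean the earliest LLM layers. The pairing order between encoder and LLM layers (e.g., the U-shaped order) is deliberately left out.

```python
def shortcut_layers(num_llm_layers, num_shortcuts, distribution="uniform"):
    """Pick which LLM layers receive a shortcut.

    num_shortcuts controls density (few = sparse, many = dense);
    distribution controls placement. Both interpretations are assumptions
    made for this sketch, not definitions taken from the paper.
    """
    if not 0 < num_shortcuts <= num_llm_layers:
        raise ValueError("num_shortcuts must be in 1..num_llm_layers")
    if distribution == "uniform":
        # Evenly spaced layer indices, computed with integer arithmetic.
        return [(i * num_llm_layers) // num_shortcuts for i in range(num_shortcuts)]
    if distribution == "bottom":
        # Concentrate all shortcuts in the earliest LLM layers.
        return list(range(num_shortcuts))
    raise ValueError(f"unknown distribution: {distribution!r}")


sparse_uniform = shortcut_layers(32, 4, "uniform")
dense_bottom = shortcut_layers(32, 16, "bottom")
print(sparse_uniform)  # [0, 8, 16, 24]
print(dense_bottom[:4], len(dense_bottom))  # [0, 1, 2, 3] 16
```

With only two parameters, a small grid over `num_shortcuts` and `distribution` is enough to enumerate candidate patterns, which is one way the manual tuning mentioned by Reviewer 03 could be organized.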

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications