Understanding and Harnessing Sparsity in Unified Multimodal Models

Shwai He; Chaorui Deng; Ang Li; Shen Yan

arXiv:2512.02351 · cs.CV · December 3, 2025

TL;DR

This paper systematically analyzes the sparsity and compressibility of the components of unified multimodal models. It finds that understanding components are highly compressible while generation components are sensitive to compression, and it proposes a Mixture-of-Experts adaptation to improve inference efficiency.

Contribution

The paper provides the first systematic analysis of component-wise sparsity in unified multimodal models and introduces a Mixture-of-Experts adaptation to improve inference efficiency.

Findings

01

Understanding components are highly compressible in both tasks.

02

Generation components are sensitive to compression: performance drops under moderate pruning.

03

Mixture-of-Experts adaptation achieves comparable performance with only half the parameters activated.

Abstract

Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast,…
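
To make the two probing operations named in the abstract concrete, here is a minimal pure-Python sketch of training-free depth pruning (skipping whole blocks) and width reduction (dropping hidden neurons) on a toy residual MLP stack. It is an illustration under stated assumptions, not the paper's implementation: the toy block, the helper names (`mlp_block`, `forward`, `width_reduce`), and the weight-norm importance score are all hypothetical choices made for this example, and the abstract does not specify which criteria the authors use.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def mlp_block(x, W_in, W_out):
    """Toy residual block: x + W_out @ relu(W_in @ x).

    W_in has one row per hidden neuron; W_out has one column per hidden neuron.
    """
    h = [max(0.0, v) for v in matvec(W_in, x)]
    y = matvec(W_out, h)
    return [a + b for a, b in zip(x, y)]


def forward(x, blocks, keep=None):
    """Run blocks in order. Depth pruning: skip every index not in `keep`.

    Because each block is residual, a skipped block passes x through unchanged.
    """
    for i, (W_in, W_out) in enumerate(blocks):
        if keep is not None and i not in keep:
            continue
        x = mlp_block(x, W_in, W_out)
    return x


def width_reduce(W_in, W_out, ratio):
    """Width reduction: keep the top `ratio` fraction of hidden neurons.

    Assumed importance score (hypothetical, chosen for simplicity):
    norm of the neuron's input row times norm of its output column.
    No retraining is performed; weights are only sliced.
    """
    hidden = len(W_in)
    n_keep = max(1, int(hidden * ratio))

    def score(j):
        in_norm = math.sqrt(sum(w * w for w in W_in[j]))
        out_norm = math.sqrt(sum(row[j] ** 2 for row in W_out))
        return in_norm * out_norm

    kept = sorted(sorted(range(hidden), key=score, reverse=True)[:n_keep])
    new_in = [W_in[j] for j in kept]
    new_out = [[row[j] for j in kept] for row in W_out]
    return new_in, new_out
```

In a probing study of this kind, one would apply `forward(..., keep=...)` or `width_reduce(...)` to one component at a time (for example, only the understanding blocks or only the generation blocks) and compare task metrics before and after, which is how per-component sensitivity can be read off without any retraining.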

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications