UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths
Weijia Mao, Zhenheng Yang, Mike Zheng Shou

TL;DR
UniMoD introduces a task-aware token pruning method for unified multimodal transformers, significantly reducing training computational costs while maintaining or improving performance across multiple benchmarks.
Contribution
The paper proposes UniMoD, a novel token pruning approach with separate routers per task, tailored for unified multimodal transformers to enhance efficiency.
Findings
Reduces training FLOPs by up to 40% on benchmark tasks.
Analyzes token redundancy influenced by task and layer variations.
Maintains or improves performance despite reduced computation.
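To make the link between token pruning and FLOPs savings concrete, here is a rough back-of-the-envelope sketch. It is not the paper's accounting: the per-layer cost formula is a common approximation for a standard transformer block, the function names and the example sizes are hypothetical, and router overhead and the backward pass are ignored.

```python
def layer_flops(n_tokens, d_model, mlp_ratio=4):
    """Approximate forward FLOPs of one transformer block (assumed cost model)."""
    proj = 8 * n_tokens * d_model ** 2              # Q, K, V and output projections
    attn = 4 * n_tokens ** 2 * d_model              # QK^T scores plus weighted sum of values
    mlp = 4 * mlp_ratio * n_tokens * d_model ** 2   # two MLP matmuls with hidden size mlp_ratio * d_model
    return proj + attn + mlp


def forward_flops_saving(n_layers, n_tokens, d_model, pruned_layers, keep_ratio):
    """Fraction of forward FLOPs saved when `pruned_layers` blocks process
    only `keep_ratio` of the tokens and the remaining blocks stay dense."""
    dense = n_layers * layer_flops(n_tokens, d_model)
    kept = max(1, int(n_tokens * keep_ratio))
    sparse = ((n_layers - pruned_layers) * layer_flops(n_tokens, d_model)
              + pruned_layers * layer_flops(kept, d_model))
    return 1.0 - sparse / dense


if __name__ == "__main__":
    # Illustrative sizes only; these are not figures reported in the paper.
    saving = forward_flops_saving(n_layers=24, n_tokens=1024, d_model=2048,
                                  pruned_layers=12, keep_ratio=0.5)
    print(f"estimated forward FLOPs saved: {saving:.1%}")
```

Because the attention term grows quadratically with the number of tokens, halving the tokens in a block saves slightly more than half of that block's cost under this model.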
Abstract
Unified multimodal transformers, which handle both generation and understanding tasks within a shared parameter space, have received increasing attention in recent research. Although various unified transformers have been proposed, training these models is costly due to redundant tokens and heavy attention computation. Previous studies on large language models have demonstrated that token pruning methods, such as Mixture of Depths (MoD), can significantly improve computational efficiency. MoD employs a router to select the most important tokens for processing within a transformer layer. However, directly applying MoD-based token pruning to unified transformers results in suboptimal performance because different tasks exhibit varying levels of token redundancy. In our work, we analyze the unified transformers by (1) examining attention weight patterns, (2) evaluating the layer…
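The routing idea described in the abstract can be sketched as follows. This is a minimal illustration, assuming a linear router per task and a fixed keep ratio per task; the names `mod_layer`, `routers`, and `keep_ratios`, the task labels, and the toy per-token `block` are all hypothetical and are not taken from the paper or its code. A real MoD layer would run attention and an MLP over the selected tokens and typically scale the block output by the router weight.

```python
def router_scores(tokens, weights):
    """Score each token with a linear router (dot product with the router weights)."""
    return [sum(x * w for x, w in zip(tok, weights)) for tok in tokens]


def select_top_k(scores, keep_ratio):
    """Return the indices of the highest-scoring tokens, in original order."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def mod_layer(tokens, task, routers, keep_ratios, block):
    """Apply `block` only to tokens chosen by the router for `task`.

    Tokens that are not selected skip the block and pass through unchanged,
    which is the source of the compute saving.
    """
    scores = router_scores(tokens, routers[task])
    kept = set(select_top_k(scores, keep_ratios[task]))
    return [block(tok) if i in kept else tok for i, tok in enumerate(tokens)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
    # Hypothetical setup: one router and one keep ratio per task.
    routers = {"understanding": [1.0, 0.0], "generation": [0.0, 1.0]}
    keep_ratios = {"understanding": 0.5, "generation": 0.25}
    block = lambda tok: [x + 10.0 for x in tok]  # stand-in for the transformer block
    for task in routers:
        print(task, mod_layer(tokens, task, routers, keep_ratios, block))
```

Keeping a separate router and keep ratio per task lets each task prune at its own rate, which is the motivation the abstract gives for not sharing a single MoD router across tasks.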
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper is generally well written. Especially, the clarity of the problem formulation and literature review effectively sets up this novel approach.
- The paper successfully grounds its motivation in a strong empirical analysis by directly examining attention weight patterns across various tasks and modalities in unified transformers, identifying significant differences in redundancy based on task and layer.
- This observation, combined with the evaluation of layer importance and token redun
- While the detailed analysis is provided, some findings, such as the observation that attention weight patterns differ between tasks (Observation 1) or that token redundancy differs based on modeling methods (Observation 3), might be viewed as expected or previously implied in related transformer studies.
- Similarly, the finding that early layers are more critical for the final outcome (Observation 2) is a phenomenon often observed in general deep neural networks.
- Although the application
1. Novelty and Relevance: The paper tackles the important issue of computational efficiency in unified multimodal transformers, introducing a novel task-aware MoD mechanism.
2. Comprehensive Analysis: The empirical studies (attention weights, ARank, layer importance, task competition) are thorough and well-motivated.
3. General Applicability: The approach works for both autoregressive and diffusion-based models.
1. Limited Task Diversity: Only two unified models (Show-o, Emu3) are tested; results on larger or more diverse multimodal architectures would strengthen claims.
2. Ablation Details: Although some ablation studies are provided, the analysis of router behavior (e.g., routing distributions, token importance dynamics) could be more in-depth.
3. Limited Discussion on Trade-offs: The paper could elaborate more on how pruning ratios affect different modalities’ representations and downstream tasks.
4.
S1. The paper analyzes the existing token redundancy problem and reveals that it varies across tasks and layers. Based on this finding, it proposes a task-aware token pruning method. The causality between the problem and the proposed modules is clear, and its effectiveness is verified.
S2. The proposed method (UniMoD) experimentally shows that it reduces the computational cost of existing methods while maintaining comparable or even superior performance. For example, it reduces training FLOPs b
W1. Only 2 main models (Show-o and Emu3) are thoroughly evaluated: 1) Show-o: 1.4B params, diffusion + autoregressive, 2) Emu3: 8.5B params, fully autoregressive. Related work mentions MoMa (Lin et al., 2024b) which also integrates MoE/MoD into Chameleon (a unified model). A more direct comparison or clearer differentiation would be beneficial, although the paper notes MoMa lacked results on generation/most understanding tasks.
W2. The method relies on calculating ARank values beforehand to se
Taxonomy
Topics: Modular Robots and Swarm Intelligence · Neural Networks and Applications
