Resolving Task Objective Conflicts in Unified Model via Task-Aware Mixture-of-Experts
Jiaxing Zhang, Hao Tang

TL;DR
This paper introduces UTAMoE, a novel mixture-of-experts framework that decouples internal autoregressive modules in multimodal large language models to resolve task conflicts and improve performance.
Contribution
It proposes a task-aware MoE layer and a two-stage training strategy to effectively mitigate task conflicts in unified multimodal models.
Findings
Reports state-of-the-art results on multimodal understanding and generation benchmarks.
Effectively reduces task interference and improves task-specific performance.
Validates approach through extensive ablation studies.
Abstract
Unified multimodal large language models (MLLMs) based on end-to-end autoregressive (AR) transformers effectively integrate both understanding and generation tasks within a single framework. However, intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges, often leading to suboptimal trade-offs and task interference. Existing solutions, such as decoupling shared visual encoders, fall short of fundamentally resolving these conflicts because of the inherent AR architecture. In this paper, we propose a novel approach that decouples internal components of the AR transformer to resolve task objective conflicts. Specifically, we design UTAMoE, a Unified Task-Aware Mixture-of-Experts (MoE) framework that decouples internal AR modules via a Task-Aware MoE Layer to create task-specific optimization…
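The abstract is cut off before it describes how the Task-Aware MoE Layer works, so the following is only a minimal sketch of the general idea of routing by task, not the authors' implementation. It assumes hard routing on an explicit task label and uses tiny linear maps as stand-ins for experts. The names `Expert` and `TaskAwareMoE`, the two task labels, and the routing rule are all hypothetical.

```python
import random


class Expert:
    """A tiny linear map y = W x, standing in for a feed-forward expert."""

    def __init__(self, dim, rng):
        # Random weights only; a real expert would be a trained sub-network.
        self.w = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, x):
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in self.w]


class TaskAwareMoE:
    """Hypothetical layer: one expert per task, selected by the task label.

    Assumption: routing is a hard lookup on the task name. The paper's
    actual routing mechanism is not given in the truncated abstract.
    """

    def __init__(self, dim, tasks=("understanding", "generation"), seed=0):
        rng = random.Random(seed)
        self.experts = {task: Expert(dim, rng) for task in tasks}

    def __call__(self, x, task):
        if task not in self.experts:
            raise ValueError(f"unknown task: {task!r}")
        # Each task only ever touches its own expert's parameters.
        return self.experts[task](x)


layer = TaskAwareMoE(dim=4)
x = [1.0, 0.5, -0.5, 2.0]
u = layer(x, "understanding")  # processed by the understanding expert
g = layer(x, "generation")     # same input, processed by the generation expert
```

Because each task reads and would update only its own expert, the gradients of one objective cannot overwrite parameters used by the other, which is the intuition behind decoupling modules to reduce task interference.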
Taxonomy
Topics: Human-Automation Interaction and Safety · Context-Aware Activity Recognition Systems · Multi-Agent Systems and Negotiation
