TL;DR
DeepOmni introduces DeepTalk, an adaptive Mixture of Experts framework for native multimodal large language models, enhancing speech interaction by reducing catastrophic forgetting and maintaining low latency.
Contribution
The paper proposes DeepTalk, a novel MoE-based adaptive modality expert learning framework that significantly improves native MLLMs' performance and interaction smoothness.
Findings
The performance drop with DeepTalk is only 5.5%, much lower than the 20% drop typical of native MLLMs.
End-to-end dialogue latency remains within 0.5 seconds.
DeepTalk achieves comparable performance to modular MLLMs while preserving richer paralinguistic features.
Abstract
Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk…
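The reviewer summaries on this page describe the expert partition as selecting modality-specific experts based on modality load. The sketch below is one way such a selection could look, not the paper's actual procedure: the function names (`modality_load`, `partition_experts`), the scoring rule (audio load minus text load), and the toy routing traces are all assumptions made for illustration.

```python
from collections import Counter


def modality_load(routed_experts, num_experts):
    """Return, per expert, the fraction of this modality's tokens routed to it.

    `routed_experts` is a hypothetical routing trace: one expert index per token.
    """
    counts = Counter(routed_experts)
    total = len(routed_experts)
    if total == 0:
        return [0.0] * num_experts
    return [counts.get(e, 0) / total for e in range(num_experts)]


def partition_experts(audio_routes, text_routes, num_experts, num_audio_experts):
    """Split experts into audio and text groups by relative modality load.

    Assumed rule (not taken from the paper): rank experts by
    audio load minus text load and assign the top `num_audio_experts`
    to audio; the remaining experts stay with text.
    """
    audio_load = modality_load(audio_routes, num_experts)
    text_load = modality_load(text_routes, num_experts)
    score = [a - t for a, t in zip(audio_load, text_load)]
    # Sort by descending score; break ties by expert index for determinism.
    ranked = sorted(range(num_experts), key=lambda e: (-score[e], e))
    audio_experts = sorted(ranked[:num_audio_experts])
    text_experts = sorted(ranked[num_audio_experts:])
    return audio_experts, text_experts


# Toy routing traces (made up): expert index chosen for each token.
audio_routes = [0, 0, 1, 3, 3, 3]
text_routes = [1, 1, 2, 2, 2, 0]
audio_experts, text_experts = partition_experts(
    audio_routes, text_routes, num_experts=4, num_audio_experts=2
)
print(audio_experts, text_experts)
```

In this toy trace, the experts that audio tokens visit most and text tokens visit least end up in the audio group. The intuition is that experts which text rarely uses can be adapted to speech at less cost to text ability.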
Peer Reviews
Decision: Submitted to ICLR 2026
1. Using a MoE architecture for developing MLLMs has been explored in earlier works, as well as dynamic modality expert selection such as in prior works of LLMoE etc. The single-modality expert training and then cross-modal expert training has also been explored in the prior Uni-MoE framework. The main contribution of this work seems to investigate the impact of these previously proposed approaches on mitigating catastrophic forgetting of text capabilities in LALMs and omni models, which is an i
1. Some important related, non-contemporaneous works are missing in theoretical and empirical comparisons, for example, strong MLLMs such as Kimi-audio, Ming-lite-omni (which is also a MoE-based omni model). Hence, the presentation of the experimental results is misleading. For example, in Table 2 performance on Spoken QA, Kimi-audio and Qwen2.5-omni achieved much better performance than the proposed DeepOmni, yet their evaluation results are missing. In Table 3 evaluating the T2T performance a
1. The work claims to be the first native MLLM built upon an MoE-based LLM backbone with a 3-stage post-training and addresses the catastrophic forgetting in native MLLM. Solid and highly effective. 2. It proposes an effective and intuitive expert partition strategy that selects modality-specific experts based on modality load, and the proposed model achieves a low performance drop in language capacity.
1. The paper claims native MLLMs preserve richer paralinguistic features as part of its motivation, but the evaluation lacks essential quality-based metrics to substantiate this claim and compare the expressive quality of the proposed model against other native baselines.
1. Addresses Important Problem: Catastrophic forgetting in native multimodal speech models is a genuine and pressing challenge. The paper tackles a real bottleneck that limits the practical deployment of end-to-end speech interaction systems. 2. Novel Adaptive Selection Strategy: The adaptive modality expert partitioning based on modality load is creative and well-motivated. Unlike random assignment, this data-driven approach intelligently identifies which experts are suitable for audio vs. text
1. Weak Baselines in Comparison: The results section appears to compare against relatively weak baselines. Why do Tables 2–5 not include comparisons with Qwen-2.5-OMNI and Kimi-Audio? Notably, Kimi-Audio is itself a non-modular speech LLM, making it an important baseline for fair evaluation. 2. Questionable Claims About Modular SLM Limitations: The paper’s claims regarding the limitations of Modular Speech Language Models are not fully substantiated. These models remain end-to-end differentiabl
