DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

Hang Shao; Heting Gao; Yunhang Shen; Jiawei Chen; Zuwei Long; Dong Yang; Ke Li; Xing Sun

arXiv:2506.21864·cs.CL·October 28, 2025

TL;DR

DeepOmni introduces DeepTalk, an adaptive Mixture of Experts framework for native multimodal large language models, enhancing speech interaction by reducing catastrophic forgetting and maintaining low latency.

Contribution

The paper proposes DeepTalk, a novel MoE-based adaptive modality expert learning framework that significantly improves native MLLMs' performance and interaction smoothness.

Findings

1. DeepTalk incurs a performance drop of only 5.5%, much lower than the roughly 20% drop typical of native MLLMs.
2. End-to-end dialogue latency remains within 0.5 seconds.
3. DeepTalk achieves performance comparable to modular MLLMs while preserving richer paralinguistic features.

Abstract

Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

1. Using a MoE architecture for developing MLLMs has been explored in earlier works, as well as dynamic modality expert selection such as in prior works of LLMoE etc. The single-modality expert training and then cross-modal expert training has also been explored in the prior Uni-MoE framework. The main contribution of this work seems to investigate the impact of these previously proposed approaches on mitigating catastrophic forgetting of text capabilities in LALMs and omni models, which is an i

Weaknesses

1. Some important related, non-contemporaneous works are missing in theoretical and empirical comparisons, for example, strong MLLMs such as Kimi-audio, Ming-lite-omni (which is also a MoE-based omni model). Hence, the presentation of the experimental results is misleading. For example, in Table 2 performance on Spoken QA, Kimi-audio and Qwen2.5-omni achieved much better performance than the proposed DeepOmni, yet their evaluation results are missing. In Table 3 evaluating the T2T performance a

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The work claims to be the first native MLLM built upon an MoE-based LLM backbone with a 3-stage post-training and addresses the catastrophic forgetting in native MLLM. Solid and highly effective.
2. It proposes an effective and intuitive expert partition strategy that selects modality-specific experts based on modality load, and the proposed model achieves a low performance drop in language capacity.
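The expert partition strategy mentioned above, which picks modality-specific experts according to modality load, can be illustrated with a small sketch. This is not the paper's implementation: the function names, the use of per-expert routed-token counts as the "load" signal, and the audio-minus-text scoring rule are all assumptions made for illustration.

```python
from typing import List, Tuple


def modality_load(counts: List[int]) -> List[float]:
    """Normalize per-expert routed-token counts into load fractions."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def partition_experts(
    audio_counts: List[int],
    text_counts: List[int],
    num_audio_experts: int,
) -> Tuple[List[int], List[int]]:
    """Split expert indices into (audio_experts, text_experts).

    Hypothetical rule (an assumption, not taken from the paper): experts
    whose audio load most exceeds their text load become audio experts.
    """
    if len(audio_counts) != len(text_counts):
        raise ValueError("count lists must cover the same experts")
    if not 0 <= num_audio_experts <= len(audio_counts):
        raise ValueError("num_audio_experts is out of range")

    audio_load = modality_load(audio_counts)
    text_load = modality_load(text_counts)

    # Score each expert by how much more it is used for audio than for text.
    scores = [a - t for a, t in zip(audio_load, text_load)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    audio_experts = sorted(ranked[:num_audio_experts])
    text_experts = sorted(ranked[num_audio_experts:])
    return audio_experts, text_experts


if __name__ == "__main__":
    # Toy routing statistics for four experts (made-up numbers).
    audio, text = partition_experts([90, 5, 5, 0], [10, 40, 30, 20], 1)
    print("audio experts:", audio)
    print("text experts:", text)
```

A data-driven split of this kind contrasts with random assignment, which ignores how the router already distributes tokens of each modality across experts.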

Weaknesses

1. The paper claims native MLLMs preserve richer paralinguistic features as part of its motivation, but the evaluation lacks essential quality-based metrics to substantiate this claim and compare the expressive quality of the proposed model against other native baselines.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Addresses Important Problem: Catastrophic forgetting in native multimodal speech models is a genuine and pressing challenge. The paper tackles a real bottleneck that limits the practical deployment of end-to-end speech interaction systems.
2. Novel Adaptive Selection Strategy: The adaptive modality expert partitioning based on modality load is creative and well-motivated. Unlike random assignment, this data-driven approach intelligently identifies which experts are suitable for audio vs. text

Weaknesses

1. Weak Baselines in Comparison: The results section appears to compare against relatively weak baselines. Why do Tables 2–5 not include comparisons with Qwen-2.5-OMNI and Kimi-Audio? Notably, Kimi-Audio is itself a non-modular speech LLM, making it an important baseline for fair evaluation.
2. Questionable Claims About Modular SLM Limitations: The paper’s claims regarding the limitations of Modular Speech Language Models are not fully substantiated. These models remain end-to-end differentiabl
