
TL;DR
Qwen3.5-Omni is a large multi-modal model with hundreds of billions of parameters. It supports extensive audio-visual understanding, reasoning, and interaction, and introduces innovations such as ARIA for speech synthesis stability.
Contribution
Qwen3.5-Omni advances the Qwen-Omni family with a Hybrid Attention MoE architecture, multi-modal capabilities, multilingual support, and a novel Audio-Visual Vibe Coding feature.
Findings
Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks.
Supports over 10 hours of audio and 400 seconds of video processing.
Introduces ARIA for improved speech synthesis stability.
Abstract
In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction,…
