Qwen3.5-Omni Technical Report

Qwen Team

arXiv:2604.15804 · cs.CL · April 22, 2026

TL;DR

Qwen3.5-Omni is an omni-modal model that scales to hundreds of billions of parameters and supports audio-visual understanding, reasoning, and interaction. It also introduces ARIA to improve speech synthesis stability.

Contribution

Qwen3.5-Omni advances the Qwen-Omni family with a Hybrid Attention Mixture-of-Experts (MoE) architecture for both Thinker and Talker, omni-modal capabilities, multilingual support, and a new Audio-Visual Vibe Coding feature.

Findings

01

Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks.

02

Supports over 10 hours of audio and 400 seconds of video processing.

03

Introduces ARIA for improved speech synthesis stability.

Abstract

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction,…
