VIBEVOICE-ASR Technical Report
Zhiliang Peng, Jianwei Yu, Yaoyao Chang, Zilong Wang, Li Dong, Yingbo Hao, Yujie Tu, Chenyu Yang, Wenhui Wang, Songchen Xu, Yutao Sun, Hangbo Bao, Weijiang Xu, Yi Zhu, Zehua Wang, Ting Song, Yan Xia, Zewen Chi, Shaohan Huang, Liang Wang, Chuang Ding, Shuai Wang, Xie Chen

TL;DR
VibeVoice-ASR is a versatile, end-to-end speech understanding framework capable of processing long-form audio with multi-language and code-switching support, improving accuracy through prompt-based context injection.
Contribution
It introduces VibeVoice-ASR, a single-pass, multi-task framework for long-form audio that unifies recognition, diarization, and timestamping, with novel context injection for domain adaptation.
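One way to picture a unified output is a single record type that carries the recognized text, the speaker label, and the time span together, instead of three separate pipeline outputs that must be aligned afterwards. The sketch below is illustrative only: the `Segment` schema, its field names, and the helper functions are assumptions made for this example, not the output format defined by the report.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript unit holding all three outputs (hypothetical schema)."""
    speaker: str   # diarization label
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # recognized words


def talk_time(segments):
    """Sum speaking time per speaker, in seconds."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def render(segments):
    """Format segments as '[start-end] speaker: text' lines, ordered by start time."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{s.start:.2f}-{s.end:.2f}] {s.speaker}: {s.text}" for s in ordered
    )


# Made-up example data, including a code-switched utterance.
segments = [
    Segment("Speaker 1", 0.0, 2.5, "Welcome to the meeting."),
    Segment("Speaker 2", 2.5, 4.0, "Thanks, 我们开始吧."),
    Segment("Speaker 1", 4.0, 7.0, "First item is the budget."),
]
print(render(segments))
print(talk_time(segments))
```

Because every record already contains who spoke, when, and what was said, downstream steps such as per-speaker statistics need no separate alignment pass.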
Findings
Supports up to 60 minutes of audio in a single pass
Handles over 50 languages without explicit language setting
Improves accuracy with prompt-based context injection
Abstract
This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts) that remain despite recent advancements in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing for up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving…
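The abstract does not spell out what the injected context looks like. As a rough illustration of the idea, user-supplied context could be assembled into a short text prompt before transcription. Everything in this sketch is hypothetical: the function `build_context_prompt`, its parameters, and the `Background:` / `Key terms:` layout are assumptions for this example and not the prompt format used by VibeVoice-ASR.

```python
def build_context_prompt(hotwords=None, background=None):
    """Assemble user-supplied context into one prompt string.

    Hypothetical helper: the layout below is an assumption, not the
    format the report defines.
    """
    parts = []
    if background:
        parts.append(f"Background: {background.strip()}")
    if hotwords:
        # Drop blanks and duplicates while keeping the original order.
        unique = list(dict.fromkeys(w.strip() for w in hotwords if w.strip()))
        if unique:
            parts.append("Key terms: " + ", ".join(unique))
    return "\n".join(parts)


prompt = build_context_prompt(
    hotwords=["VibeVoice", "diarization", "VibeVoice"],
    background="Weekly speech team sync",
)
print(prompt)
```

Domain-specific names and jargon are the kind of tokens a recognizer is most likely to misspell, which is why this sketch treats key terms as a first-class input.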
Code & Models
- 🤗 microsoft/VibeVoice-ASR · model · 513k dl · ♡ 973
- 🤗 microsoft/VibeVoice-ASR-HF · model · 258k dl · ♡ 67
- 🤗 lemuriandezapada/VibeVoice-ASR-awq-int4 · model · 749 dl · ♡ 1
- 🤗 bezzam/VibeVoice-ASR-7B · model · 479 dl · ♡ 4
- 🤗 williamchangtw/VibeVoice-ASR · model · 3 dl
- 🤗 bealore/vibevoice-asr-fp8 · model · 45 dl
- 🤗 dmartu/VibeVoice-ASR-HFI · model · 5 dl
- 🤗 lemuriandezapada/VibeVoice-ASR-gptq-int4 · model · 132 dl
- 🤗 newworldcrimson/vibe-voice-model · model · 15 dl
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
