M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection
Anna Wang, Da Liu, Zhiyu Zhang, Shengqiang Liu, Jie Gao, Yali Li

TL;DR
This paper introduces M$^{3}$V, a multi-modal multi-view learning approach that improves device-directed speech detection by effectively handling ASR errors and surpassing human judgment performance.
Contribution
The paper presents a novel multi-view learning framework that incorporates unimodal and alignment views, significantly enhancing detection accuracy over existing models.
Findings
M$^{3}$V outperforms both single-modal and multi-modal models in accuracy.
It surpasses human judgment on ASR error data.
The approach effectively mitigates ASR errors in device-directed speech detection.
Abstract
With the goal of more natural and human-like interaction with virtual voice assistants, recent research in the field has focused on full duplex interaction mode without relying on repeated wake-up words. This requires that in scenes with complex sound sources, the voice assistant must classify utterances as device-oriented or non-device-oriented. The dual-encoder structure, which is jointly modeled by text and speech, has become the paradigm of device-directed speech detection. However, in practice, these models often produce incorrect predictions for unaligned input pairs due to the unavoidable errors of automatic speech recognition (ASR). To address this challenge, we propose M$^{3}$V, a multi-modal multi-view approach for device-directed speech detection, which frames the problem as a multi-view learning task that introduces unimodal views and a text-audio alignment view in…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
