M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection

Anna Wang; Da Liu; Zhiyu Zhang; Shengqiang Liu; Jie Gao; Yali Li

arXiv:2409.09284·cs.SD·September 17, 2024

TL;DR

This paper introduces M$^{3}$V, a multi-modal multi-view learning approach that improves device-directed speech detection by handling ASR errors and that surpasses human judgment on data containing ASR errors.

Contribution

The paper presents a novel multi-view learning framework that incorporates unimodal and alignment views, significantly enhancing detection accuracy over existing models.

Findings

01

M$^{3}$V outperforms single and multi-modal models in accuracy.

02

It surpasses human judgment on ASR error data.

03

The approach effectively mitigates ASR errors in device-directed speech detection.

Abstract

With the goal of more natural and human-like interaction with virtual voice assistants, recent research in the field has focused on a full-duplex interaction mode that does not rely on repeated wake-up words. This requires that, in scenes with complex sound sources, the voice assistant classify utterances as device-oriented or non-device-oriented. The dual-encoder structure, which jointly models text and speech, has become the paradigm for device-directed speech detection. However, in practice, these models often produce incorrect predictions for unaligned input pairs due to the unavoidable errors of automatic speech recognition (ASR). To address this challenge, we propose M$^{3}$V, a multi-modal multi-view approach for device-directed speech detection, which frames the problem as a multi-view learning task that introduces unimodal views and a text-audio alignment view in…
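The abstract is truncated here, so the paper's actual objective is not shown. As a rough illustration of what a multi-view training objective could look like, the sketch below sums one binary cross-entropy term per view: a text-only view, an audio-only view, a fused multi-modal view, and a text-audio alignment view. Everything in it is an assumption made for illustration: the function names, the view names, the equal default weights, and the choice of binary cross-entropy are hypothetical and are not taken from the paper.

```python
import math


def bce_with_logit(logit: float, label: int) -> float:
    """Numerically stable binary cross-entropy computed from a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def multi_view_loss(view_logits, directed, aligned, weights=None):
    """Hypothetical multi-view objective (an assumption, not the paper's loss).

    view_logits: dict with logits for the keys
        "text", "audio", "fused" (device-directed or not) and
        "alignment" (does the transcript match the audio or not).
    directed: 1 if the utterance is device-directed, else 0.
    aligned: 1 if the text and audio are aligned (no ASR error), else 0.
    weights: optional per-view weights; defaults to 1.0 for every view.
    """
    weights = weights or {}
    # The unimodal and fused views share the device-directed label;
    # the alignment view has its own label.
    targets = {
        "text": directed,
        "audio": directed,
        "fused": directed,
        "alignment": aligned,
    }
    return sum(
        weights.get(view, 1.0) * bce_with_logit(view_logits[view], target)
        for view, target in targets.items()
    )


# Example: a device-directed utterance whose ASR transcript is wrong.
logits = {"text": -0.5, "audio": 2.0, "fused": 1.0, "alignment": -1.5}
loss = multi_view_loss(logits, directed=1, aligned=0)
```

In this sketch the audio-only term can still push toward the correct label when the transcript is unreliable, which is one plausible reading of why unimodal views might help with ASR errors; the paper itself should be consulted for the real formulation.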

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis