Rule-embedded network for audio-visual voice activity detection in live musical video streams
Yuanbo Hou, Yi Deng, Bilei Zhu, Zejun Ma, Dick Botteldooren

TL;DR
This paper introduces a rule-embedded audio-visual network that improves voice activity detection in live musical streams by effectively fusing audio and visual cues, outperforming audio-only methods.
Contribution
It proposes a novel rule-embedded network for audio-visual fusion in VAD, utilizing visual data as a mask to enhance target voice detection in noisy environments.
Findings
Bi-modal model outperforms audio-only models.
Cross-modal fusion improves detection accuracy.
Introduces a new live musical video dataset.
Abstract
Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-based VAD struggles to focus on the target voice in noisy environments. With the help of visual information, this paper proposes a rule-embedded network that fuses audio-visual (A-V) inputs to help the model better detect the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and to use visual representations as a mask to filter out information from non-target sounds. Experiments show that: 1) with the help of cross-modal fusion by the proposed rule, the detection result of the A-V branch outperforms that of the audio branch; 2) the performance of the bi-modal model far outperforms that of audio-only models,…
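The idea of using visual representations as a mask can be illustrated with a small sketch. The abstract does not specify the exact form of the rule, so the code below assumes a simple element-wise product between per-frame audio probabilities and a per-frame visual mask. The function name `rule_masked_fusion`, the logit inputs, and the 0.5 threshold are all hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    """Map real-valued logits to the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-x))


def rule_masked_fusion(audio_logits, visual_logits):
    """Fuse per-frame audio and visual scores by masking (assumed rule).

    audio_logits:  shape (T,), how strongly the audio suggests a voice.
    visual_logits: shape (T,), how strongly the video suggests the
                   anchor is vocalizing in that frame.
    Returns per-frame probabilities that the target voice is active.
    """
    audio_prob = sigmoid(np.asarray(audio_logits, dtype=float))
    visual_mask = sigmoid(np.asarray(visual_logits, dtype=float))
    # Assumption: the visual mask gates the audio evidence, so a voice
    # heard while the anchor is visually inactive is suppressed.
    return audio_prob * visual_mask


# Three frames: anchor singing, background voice only, anchor silent.
audio_logits = [4.0, 4.0, -4.0]
visual_logits = [4.0, -4.0, 4.0]
fused = rule_masked_fusion(audio_logits, visual_logits)
target_active = fused > 0.5  # hypothetical decision threshold
```

In this toy input, only the first frame has both audio and visual support, so it is the only frame flagged as target voice; the second frame, where a voice is audible but the anchor is not visually active, is filtered out by the mask.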
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
