Rule-embedded network for audio-visual voice activity detection in live musical video streams

Yuanbo Hou; Yi Deng; Bilei Zhu; Zejun Ma; Dick Botteldooren

arXiv:2010.14168·cs.SD·November 3, 2020

TL;DR

This paper introduces a rule-embedded audio-visual network that improves voice activity detection in live musical streams by effectively fusing audio and visual cues, outperforming audio-only methods.

Contribution

It proposes a novel rule-embedded network for audio-visual fusion in VAD, utilizing visual data as a mask to enhance target voice detection in noisy environments.
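
The idea of using visual data as a mask can be illustrated with a small sketch. This is not the authors' architecture: it assumes, for illustration only, that the audio branch yields frame-level class probabilities, that the visual branch yields a frame-level probability that the on-screen anchor is vocalizing, and that masking is an element-wise product. The function name `mask_audio_with_visual` and the example values are hypothetical.

```python
import numpy as np


def mask_audio_with_visual(audio_probs, visual_probs):
    """Gate audio predictions with a visual mask (illustrative assumption).

    audio_probs:  shape (T, C), per-frame class probabilities from an
                  audio branch (hypothetical output format).
    visual_probs: shape (T,), per-frame probability that the anchor
                  is vocalizing, taken from a visual branch.
    Returns an array of shape (T, C): audio evidence is kept only where
    the visual branch indicates the target is active.
    """
    audio_probs = np.asarray(audio_probs, dtype=float)
    visual_probs = np.asarray(visual_probs, dtype=float)
    if audio_probs.shape[0] != visual_probs.shape[0]:
        raise ValueError("audio and visual streams must have the same number of frames")
    # Broadcast the (T,) mask across the C class columns.
    return audio_probs * visual_probs[:, None]


# Three frames, two audio classes (made-up numbers).
audio = np.array([[0.9, 0.1],
                  [0.8, 0.7],
                  [0.2, 0.9]])
visual = np.array([1.0, 0.0, 0.5])  # frame 2: anchor not vocalizing
fused = mask_audio_with_visual(audio, visual)
```

In this toy example the second frame is suppressed entirely because the visual mask is zero there, which is the sense in which non-target sound is filtered out. How the actual rule combines the two branches is defined in the paper and the linked repository.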

Findings

01 Bi-modal model outperforms audio-only models.

02 Cross-modal fusion improves detection accuracy.

03 Introduces a new live musical video dataset.

Abstract

Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-based VAD struggles to focus on the target voice in noisy environments. With the help of visual information, this paper proposes a rule-embedded network that fuses the audio-visual (A-V) inputs to help the model better detect the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and to use visual representations as a mask that filters out non-target sound. Experiments show that: 1) with the help of cross-modal fusion by the proposed rule, the detection result of the A-V branch outperforms that of the audio branch; 2) the performance of the bi-modal model far outperforms that of audio-only models,…

Code & Models

Repositories

Yuanbo2020/Audio-Visual-VAD (Official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies