Waveform-based Voice Activity Detection Exploiting Fully Convolutional networks with Multi-Branched Encoders

Cheng Yu; Kuo-Hsuan Hung; I-Fan Lin; Szu-Wei Fu; Yu Tsao; Jeih-weih Hung

arXiv:2006.11139 · eess.AS · June 22, 2020 · 6 cites

TL;DR

This paper introduces a waveform-based voice activity detection system using fully convolutional networks with multi-branched encoders, demonstrating improved accuracy over traditional spectral feature-based methods.

Contribution

The study presents a novel waveform-based VAD system with multi-branched encoders, outperforming existing spectral feature-based methods and enabling attribute-based ensemble extensions.

Findings

01. WVAD outperforms state-of-the-art VAD algorithms on AURORA2.

02. WEVAD achieves better performance than WVAD by incorporating multiple attributes.

03. The proposed methods effectively utilize raw waveforms for more accurate VAD.

Abstract

In this study, we propose an encoder-decoder system built from fully convolutional networks that performs voice activity detection (VAD) directly on the time-domain waveform. The proposed system processes the input waveform and classifies each of its segments as either speech or non-speech. This novel waveform-based VAD algorithm, abbreviated "WVAD", has two main distinguishing features. First, whereas most conventional VAD systems use spectral features, the raw waveforms employed in WVAD contain more comprehensive information and are therefore expected to facilitate more accurate speech/non-speech predictions. Second, thanks to its multi-branched architecture, WVAD can be extended with an ensemble of encoders, referred to as WEVAD, that incorporates multiple kinds of attribute information in utterances and can thus yield better VAD performance for specified acoustic conditions. We…
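To make the data flow described in the abstract concrete, the following is a minimal sketch in pure Python, not the authors' implementation. It assumes one strided 1-D convolution with ReLU per encoder branch, branch merging by averaging, and a single pointwise decoder followed by a sigmoid. The function names (`conv1d`, `encoder_branch`, `wvad_sketch`), the hand-picked kernels, the stride, and the decoder weights are all hypothetical and untrained; the actual layer configuration is not given in the text above.

```python
import math


def conv1d(signal, kernel, stride=1):
    """Valid 1-D cross-correlation of `signal` with `kernel`, with a stride."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encoder_branch(waveform, kernel, stride):
    """One hypothetical encoder branch: strided convolution, then ReLU."""
    return [max(0.0, v) for v in conv1d(waveform, kernel, stride)]


def wvad_sketch(waveform, branch_kernels, stride, dec_weight, dec_bias,
                threshold=0.5):
    """Return per-frame speech probabilities and 0/1 labels.

    Assumption: branches are merged by averaging and decoded by a single
    pointwise linear map plus sigmoid. This only illustrates the shape of
    a multi-branched encoder-decoder operating on raw samples.
    """
    # Every branch reads the same raw waveform (no spectral features).
    branches = [encoder_branch(waveform, k, stride) for k in branch_kernels]
    n_frames = min(len(b) for b in branches)
    merged = [
        sum(b[t] for b in branches) / len(branches) for t in range(n_frames)
    ]
    probs = [sigmoid(dec_weight * m + dec_bias) for m in merged]
    labels = [1 if p >= threshold else 0 for p in probs]
    return probs, labels


# Toy input: 16 samples of silence followed by 16 samples of an
# alternating +1/-1 signal standing in for "speech".
waveform = [0.0] * 16 + [1.0 if i % 2 == 0 else -1.0 for i in range(16)]

# Two hand-picked (untrained) kernels, one per branch.
branch_kernels = [[1.0, -1.0, 1.0, -1.0], [-1.0, 1.0, -1.0, 1.0]]

probs, labels = wvad_sketch(
    waveform, branch_kernels, stride=4, dec_weight=2.0, dec_bias=-2.0
)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

In this toy setup each output frame covers four input samples, so the 32-sample input yields eight frame-level decisions. A WEVAD-style extension would, under the same assumptions, add further branches specialized for particular attributes and merge them in the same way.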

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing