End-Point Detection with State Transition Model based on Chunk-Wise Classification
Juntae Kim, Jaesung Bae, Minsoo Hahn

TL;DR
This paper introduces a robust end-point detection method using a chunk-wise classification-based state transition model that reduces errors caused by noisy environments, improving speech/non-speech detection accuracy.
Contribution
It proposes a novel chunk-wise classification approach for state transition modeling in end-point detection, enhancing robustness against VAD errors in noisy conditions.
Findings
Improved speech/non-speech detection accuracy in noisy environments.
Fewer undesired state transitions, because chunk-wise aggregation smooths isolated VAD errors.
Lower phone error rate in evaluations.
Abstract
A state transition model (STM) based on chunk-wise classification is proposed for end-point detection (EPD). In general, EPD is built on frame-wise voice activity detection (VAD) with an additional STM, in which state transitions are driven by the VAD's frame-level decisions (speech or non-speech). However, VAD errors occur frequently in noisy environments, even with a state-of-the-art deep neural network based VAD, and these errors cause undesired state transitions in the STM. In this work, to build a robust STM, state transitions are instead driven by chunk-wise classification, since EPD does not need to operate at the frame level. A chunk consists of multiple frames, and each chunk is classified as speech or non-speech by aggregating the VAD decisions for its frames, so that some undesired VAD errors in a chunk can be smoothed by other correct VAD…
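The chunk-wise idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the majority-vote aggregation rule, the chunk size, the threshold, the two-state machine, and the names `chunk_decisions`, `detect_end_point`, `min_speech_chunks`, and `min_silence_chunks` are all assumptions made for illustration. The paper's actual aggregation and transition rules may differ.

```python
from typing import List, Optional, Tuple


def chunk_decisions(frame_decisions: List[int],
                    chunk_size: int = 10,
                    threshold: float = 0.5) -> List[int]:
    """Aggregate frame-level VAD decisions (1 = speech, 0 = non-speech)
    into one decision per non-overlapping chunk.

    Assumption: a chunk is speech when the fraction of speech frames
    reaches `threshold`. Trailing frames that do not fill a chunk are dropped.
    """
    labels = []
    for start in range(0, len(frame_decisions) - chunk_size + 1, chunk_size):
        chunk = frame_decisions[start:start + chunk_size]
        ratio = sum(chunk) / chunk_size
        labels.append(1 if ratio >= threshold else 0)
    return labels


def detect_end_point(chunk_labels: List[int],
                     min_speech_chunks: int = 2,
                     min_silence_chunks: int = 3) -> Optional[Tuple[int, int]]:
    """Hypothetical two-state STM driven by chunk labels.

    Returns (start_chunk, end_chunk), where end_chunk is the first chunk of
    the closing non-speech run, or None if no complete utterance is found.
    """
    state = "non_speech"
    speech_run = 0
    silence_run = 0
    start = None
    for i, label in enumerate(chunk_labels):
        if state == "non_speech":
            # Count consecutive speech chunks; any non-speech chunk resets.
            speech_run = speech_run + 1 if label == 1 else 0
            if speech_run >= min_speech_chunks:
                state = "speech"
                start = i - min_speech_chunks + 1
                silence_run = 0
        else:
            # Count consecutive non-speech chunks; any speech chunk resets.
            silence_run = silence_run + 1 if label == 0 else 0
            if silence_run >= min_silence_chunks:
                return (start, i - min_silence_chunks + 1)
    return None


# Toy frame-level VAD output with a few isolated errors in each region.
leading = [0] * 20
leading[5] = 1                      # false alarm in leading silence
speech = [1] * 40
speech[12] = 0
speech[13] = 0                      # missed frames inside speech
trailing = [0] * 40
trailing[7] = 1                     # false alarm in trailing silence
frames = leading + speech + trailing

labels = chunk_decisions(frames)
print(labels)                       # chunk-level labels
print(detect_end_point(labels))     # (start_chunk, end_chunk)
```

In this toy input the isolated frame errors never flip a chunk's label, so the state machine sees a clean speech segment, whereas a frame-level STM would see every spurious decision.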
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques
