Optimising The Input Window Alignment in CD-DNN Based Phoneme Recognition for Low Latency Processing
Akash Kumar Dhaka, Giampiero Salvi

TL;DR
This paper investigates how shifting the input feature window asymmetrically affects the accuracy and latency of CD-DNN based phoneme recognisers, finding that a window shifted toward past frames, which needs fewer future frames, reduces latency without degrading accuracy.
Contribution
It introduces a systematic analysis of asymmetric input windows in phoneme recognition, demonstrating potential latency reductions while maintaining performance.
Findings
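Performance does not degrade when the input window is shifted up to 5 frames into the past, at which point no future frames are used.
An asymmetric window with 8 past and 2 future context frames yields the best results, rather than the commonly used symmetric window.
In the authors' settings, the 5-frame shift reduces latency by 50 ms without accuracy loss.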
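The window configurations in the findings above can be illustrated with a minimal frame-splicing sketch. This is not the authors' implementation: the function names `splice_frames` and `lookahead_latency_ms`, the edge padding by frame repetition, and the 10 ms frame shift (inferred from 5 frames corresponding to 50 ms) are all assumptions made for illustration.

```python
import numpy as np


def splice_frames(features, n_past, n_future):
    """Stack each frame with n_past preceding and n_future following frames.

    features: array of shape (T, D), one row per frame.
    Edges are padded by repeating the first/last frame (an assumption;
    the padding used in the paper is not stated in this summary).
    Returns an array of shape (T, (n_past + 1 + n_future) * D), where
    block k of each row holds frame t - n_past + k.
    """
    T, _ = features.shape
    padded = np.concatenate([
        np.repeat(features[:1], n_past, axis=0),
        features,
        np.repeat(features[-1:], n_future, axis=0),
    ])
    width = n_past + 1 + n_future
    # Each slice is the whole utterance delayed or advanced by one offset.
    return np.concatenate([padded[k:k + T] for k in range(width)], axis=1)


def lookahead_latency_ms(n_future, frame_shift_ms=10.0):
    """Latency due to waiting for future frames (assumed 10 ms frame shift)."""
    return n_future * frame_shift_ms


# Hypothetical toy input: 10 frames of 2-dimensional features.
feats = np.arange(20, dtype=float).reshape(10, 2)
symmetric = splice_frames(feats, n_past=5, n_future=5)     # assumed baseline
past_only = splice_frames(feats, n_past=10, n_future=0)    # shifted 5 frames
best = splice_frames(feats, n_past=8, n_future=2)          # 8 past, 2 future
saved_ms = lookahead_latency_ms(5) - lookahead_latency_ms(0)
```

All three configurations span 11 frames, so the network input size stays the same; only the number of future frames, and with it the look-ahead latency, changes.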
Abstract
We present a systematic analysis on the performance of a phonetic recogniser when the window of input features is not symmetric with respect to the current frame. The recogniser is based on Context Dependent Deep Neural Networks (CD-DNNs) and Hidden Markov Models (HMMs). The objective is to reduce the latency of the system by reducing the number of future feature frames required to estimate the current output. Our tests performed on the TIMIT database show that the performance does not degrade when the input window is shifted up to 5 frames in the past compared to common practice (no future frame). This corresponds to improving the latency by 50 ms in our settings. Our tests also show that the best results are not obtained with the symmetric window commonly employed, but with an asymmetric window with eight past and two future context frames, although this observation should be…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
