Deep Triphone Embedding Improves Phoneme Recognition
Mohit Yadav, Vivek Tyagi

TL;DR
This paper introduces Deep Triphone Embeddings (DTE), a DNN-derived feature representation that improves phoneme recognition accuracy by encoding discriminative information from adjoining speech frames.
Contribution
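The paper proposes DTEs, 300-dimensional features obtained by reducing the last-hidden-layer activations of a DNN trained for tied-triphone classification. The DTEs are fed, together with MFCC features, into a second-stage DNN, which improves phoneme recognition over traditional triphone systems.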
Findings
DTE improves phoneme recognition accuracy by 2.11%.
DTEs encapsulate discriminative information present in adjoining speech frames.
The method outperforms existing triphone-based systems.
Abstract
In this paper, we present a novel Deep Triphone Embedding (DTE) representation derived from a Deep Neural Network (DNN) to encapsulate the discriminative information present in the adjoining speech frames. DTEs are generated in the first stage using a DNN with four hidden layers of 3000 nodes each. This DNN is trained with tied-triphone classification accuracy as the optimization criterion. Thereafter, we retain the 3000-dimensional activation vector of the last hidden layer for each speech MFCC frame and perform dimension reduction to obtain a 300-dimensional representation, which we term the DTE. DTEs, along with MFCC features, are fed into a second-stage DNN with four hidden layers, which is subsequently trained for the task of tied-triphone classification. Both DNNs are trained using triphone labels generated from a tied-state triphone HMM-GMM system, by performing a…
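The two-stage data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration of the shapes involved, not the authors' implementation: the weights are random and untrained, and the sigmoid activation, the use of PCA for the dimension reduction step, the number of tied-triphone states, and all function names are assumptions introduced here for illustration.

```python
import numpy as np

def init_mlp(rng, in_dim, hidden_dim, n_hidden, out_dim):
    """Random (untrained) weights for an MLP with n_hidden hidden layers."""
    dims = [in_dim] + [hidden_dim] * n_hidden + [out_dim]
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def last_hidden(params, x):
    """Run the hidden layers only and return the last hidden activations."""
    h = x
    for W, b in params[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))  # sigmoid: an assumption
    return h

def logits(params, x):
    """Tied-triphone scores (pre-softmax) from the output layer."""
    W, b = params[-1]
    return last_hidden(params, x) @ W + b

def pca_reduce(acts, k):
    """Project activations onto their top-k principal directions.
    PCA is an assumed choice; the abstract only says 'dimension reduction'.
    Needs at least k frames to return k dimensions."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def dte_pipeline(mfcc, hidden_dim=3000, dte_dim=300, n_states=2000, seed=0):
    """mfcc: (frames, features). n_states is a hypothetical tied-state count."""
    rng = np.random.default_rng(seed)
    # Stage 1: four hidden layers; in practice trained on tied-triphone labels.
    stage1 = init_mlp(rng, mfcc.shape[1], hidden_dim, 4, n_states)
    acts = last_hidden(stage1, mfcc)        # (frames, hidden_dim)
    dte = pca_reduce(acts, dte_dim)         # (frames, dte_dim)
    # Stage 2: DTE concatenated with MFCC, again four hidden layers.
    stage2_in = np.concatenate([mfcc, dte], axis=1)
    stage2 = init_mlp(rng, stage2_in.shape[1], hidden_dim, 4, n_states)
    return dte, logits(stage2, stage2_in)
```

For a quick shape check, the sketch can be run with small sizes, for example `dte_pipeline(mfcc, hidden_dim=64, dte_dim=8, n_states=10)` on a `(50, 39)` array of random features. Training both networks against forced-alignment labels is omitted here.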
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
