Application of Word2vec in Phoneme Recognition
Xin Feng, Lei Wang

TL;DR
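This paper introduces a hybrid phoneme recognition system that combines Word2vec embeddings with an attention-based end-to-end model. It reaches 16.5% PER on TIMIT by initializing the embedding matrix from Word2vec and by augmenting the training data through an inverse 61-39 phoneme mapping, followed by corrective training on the standard dataset.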
Contribution
It integrates Word2vec embeddings into an attention-based end-to-end speech recognition model (Listen, Attend and Spell) and proposes a new training method to address overfitting in 61-phoneme recognition on TIMIT.
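The embedding initialization can be pictured with a minimal sketch. It assumes a hypothetical dictionary `pretrained` of phoneme vectors (for example, exported from a Word2vec model trained on phoneme sequences) and copies them into the rows of the embedding matrix, falling back to small random values for symbols without a pretrained vector. The names `build_embedding_matrix`, `vocab`, and `pretrained`, the toy vectors, and the random range are illustrative assumptions, not the authors' code.

```python
import random

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return one embedding row per vocabulary symbol.

    Rows are copied from `pretrained` when a vector exists; otherwise they
    are drawn uniformly from [-0.1, 0.1] (an assumed fallback, e.g. for
    special tokens that a Word2vec model would not have seen).
    """
    rng = random.Random(seed)
    matrix = []
    for symbol in vocab:
        if symbol in pretrained:
            vec = list(pretrained[symbol])
            if len(vec) != dim:
                raise ValueError(f"vector for {symbol!r} has wrong dimension")
        else:
            vec = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        matrix.append(vec)
    return matrix

# Toy data: three phonemes with made-up 3-dimensional vectors.
vocab = ["<sos>", "<eos>", "aa", "ae", "sh"]
pretrained = {
    "aa": [0.9, 0.1, 0.0],
    "ae": [0.8, 0.2, 0.1],
    "sh": [-0.7, 0.5, 0.3],
}
embedding = build_embedding_matrix(vocab, pretrained, dim=3)
```

In a real system the resulting matrix would be loaded into the decoder's embedding layer before training; whether the rows are then fine-tuned or frozen is a design choice the summary does not specify.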
Findings
Achieved 16.5% PER (Phoneme Error Rate) on the TIMIT dataset.
Initializing the embedding matrix with Word2vec increased the distance among phoneme vectors.
Used a 61-39 phoneme mapping table in reverse to generate additional 61-phoneme training data.
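For readers unfamiliar with the metric, PER is conventionally computed as the edit distance (substitutions, insertions, and deletions) between predicted and reference phoneme sequences, divided by the total number of reference phonemes. The sketch below follows that common definition; the function names and the toy sequences are illustrative, and the paper's exact scoring setup is not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total

# Toy example: one substitution (iy -> ix) and one deletion (d).
reference = ["sh", "iy", "hh", "ae", "d"]
hypothesis = ["sh", "ix", "hh", "ae"]
per = phoneme_error_rate([reference], [hypothesis])  # 2 errors / 5 phonemes
```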
Abstract
In this paper, we show how to hybridize a Word2vec model with an attention-based end-to-end speech recognition model. We build a phoneme recognition system based on the Listen, Attend and Spell model. To improve performance, the phoneme recognition model uses a Word2vec model to initialize its embedding matrix, which increases the distance among the phoneme vectors. In addition, to address overfitting of the 61-phoneme recognition model on the TIMIT dataset, we propose a new training method: a 61-39 phoneme mapping table is used to inverse-map the phonemes of the dataset, generating additional 61-phoneme training data. At the end of training, the augmented dataset is replaced with the standard dataset for corrective training. Our model achieves a best result of 16.5% PER (Phoneme Error Rate) on the TIMIT dataset.
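One plausible reading of the inverse-mapping augmentation is sketched below: the 61-to-39 folding table is inverted so that each folded class lists the fine-grained phonemes that map to it, and each phoneme in a 61-label transcript is then replaced by a randomly chosen member of the same class. The abstract does not give the exact procedure, so the sampling strategy is an assumption. The table `FOLD_61_TO_39` is only a small excerpt in the style of the commonly used TIMIT folding, not the paper's full table, and the function names are hypothetical.

```python
import random
from collections import defaultdict

# Excerpt of a 61-to-39 folding table (illustrative subset only).
FOLD_61_TO_39 = {
    "aa": "aa", "ao": "aa",
    "ah": "ah", "ax": "ah",
    "ih": "ih", "ix": "ih",
    "sh": "sh", "zh": "sh",
    "n": "n", "en": "n", "nx": "n",
}

def invert_mapping(fold):
    """Map each folded class to the list of fine-grained phonemes in it."""
    inverse = defaultdict(list)
    for fine, coarse in fold.items():
        inverse[coarse].append(fine)
    return dict(inverse)

def augment_sequence(seq, fold, inverse, rng):
    """Replace each phoneme with a random member of its folded class.

    The result is a new 61-label sequence that folds to the same
    39-label sequence as the input (assumed augmentation strategy).
    """
    return [rng.choice(inverse[fold[p]]) for p in seq]

inverse = invert_mapping(FOLD_61_TO_39)
rng = random.Random(0)
original = ["sh", "ix", "n", "aa"]
augmented = augment_sequence(original, FOLD_61_TO_39, inverse, rng)
```

Under this reading, the model would first train on the original plus augmented transcripts and then, as the abstract describes, switch to the standard dataset for a final corrective training phase.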
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
