Word-level Speech Recognition with a Letter to Word Encoder
Ronan Collobert, Awni Hannun, Gabriel Synnaeve

TL;DR
This paper introduces a direct-to-word speech recognition model that learns word embeddings from letters, improving accuracy and efficiency over sub-word models while handling unseen words without retraining.
Contribution
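The paper presents a word-level sequence model built around a word network that learns word embeddings from letters. The word network can be combined with different sequence models, including Connectionist Temporal Classification (CTC) and attention-based encoder-decoder models.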
Findings
Achieves lower word error rates than sub-word models.
Can predict unseen words without retraining.
Can use a larger stride than a sub-word model while maintaining accuracy, which makes training and inference more efficient.
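To make the stride finding concrete, here is a small arithmetic sketch of how the stride sets the number of output frames the model has to process. The input length and the two stride values are hypothetical illustrations, not figures from the paper, and the formula ignores any padding a real convolutional front end might apply.

```python
import math


def output_frames(num_input_frames, stride):
    """Number of frames left after subsampling the input by `stride`."""
    return math.ceil(num_input_frames / stride)


num_input_frames = 1000  # hypothetical utterance length in input frames
for stride in (8, 16):   # hypothetical strides, chosen only for illustration
    print(stride, output_frames(num_input_frames, stride))
```

Doubling the stride roughly halves the sequence length seen by the later layers, which is where the efficiency gain would come from.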
Abstract
We propose a direct-to-word sequence model which uses a word network to learn word embeddings from letters. The word network can be integrated seamlessly with arbitrary sequence models including Connectionist Temporal Classification and encoder-decoder models with attention. We show our direct-to-word model can achieve word error rate gains over sub-word level models for speech recognition. We also show that our direct-to-word approach retains the ability to predict words not seen at training time without any retraining. Finally, we demonstrate that a word-level model can use a larger stride than a sub-word level model while maintaining accuracy. This makes the model more efficient both for training and inference.
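The idea of building word embeddings from letters, and scoring words that were not in the training lexicon, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' architecture: the abstract does not describe the word network's layers, so a position-weighted sum of letter vectors stands in for it, and the names `word_embedding`, `word_posteriors`, `LETTER_VECS`, and `DIM` are hypothetical. The randomly initialized letter vectors stand in for trained parameters, and the resulting scores would normally feed a CTC or attention-based decoder, which is not shown.

```python
import math
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz'"
DIM = 8  # hypothetical embedding size

rng = random.Random(0)
# Stand-in for trained parameters: one vector per letter (assumption).
LETTER_VECS = {c: [rng.uniform(-1, 1) for _ in range(DIM)] for c in LETTERS}


def word_embedding(word):
    """Map a word's letters to a fixed-size vector (stand-in for the word network)."""
    vec = [0.0] * DIM
    for pos, ch in enumerate(word):
        weight = 1.0 / (pos + 1)  # crude position weighting so letter order matters
        for i, v in enumerate(LETTER_VECS[ch]):
            vec[i] += weight * v
    return [math.tanh(x) for x in vec]


def word_posteriors(frame, lexicon):
    """Softmax over dot products between one encoder frame and each word embedding."""
    scores = [sum(f * e for f, e in zip(frame, word_embedding(w))) for w in lexicon]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {w: x / total for w, x in zip(lexicon, exps)}


lexicon = ["speech", "word", "letter"]
frame = [0.1] * DIM  # placeholder for one output frame of an acoustic encoder

before = word_posteriors(frame, lexicon)
# Extending the lexicon adds no parameters: the new embedding is computed from letters.
after = word_posteriors(frame, lexicon + ["stride"])
print(sorted(after))
```

Because every word embedding is computed from the same letter parameters, adding a word to the lexicon only requires running the word network on its spelling, which is what allows prediction of unseen words without retraining.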
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
