Less is More: Accurate Speech Recognition & Translation without Web-Scale Data
Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

TL;DR
This paper presents Canary, a multilingual speech recognition and translation model that achieves state-of-the-art accuracy without web-scale data by using a novel architecture, synthetic data, and advanced training techniques.
Contribution
The paper introduces Canary, a data-efficient multilingual speech model that outperforms larger models using significantly less training data and innovative training strategies.
Findings
Canary outperforms Whisper, OWSM, and Seamless-M4T on multiple languages.
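Reaches this accuracy with an order of magnitude less training data than those models.
Relies on synthetic data generated with machine translation, plus data balancing, dynamic data blending, dynamic bucketing, and noise-robust fine-tuning.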
Abstract
Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary, a multilingual ASR and speech translation model, outperforms the current state-of-the-art models Whisper, OWSM, and Seamless-M4T on English, French, Spanish, and German, while being trained on an order of magnitude less data than these models. Three key factors enable such a data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture, (2) training on synthetic data generated with machine translation, and (3) advanced training techniques: data balancing, dynamic data blending, dynamic bucketing, and noise-robust fine-tuning. The model, weights, and training code will be open-sourced.
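The abstract names dynamic bucketing without describing it. As an illustration only, here is a minimal Python sketch of one common reading of the idea: utterances are grouped by duration, and each batch is capped by total audio seconds rather than by a fixed utterance count, so short utterances form larger batches. The function name `bucket_batches`, the bucket edges, and the `max_batch_seconds` budget are hypothetical and are not taken from the paper or its training code.

```python
from typing import Dict, List


def bucket_batches(
    durations: List[float],
    bucket_edges: List[float],
    max_batch_seconds: float,
) -> List[List[int]]:
    """Group utterance indices into batches of similar duration.

    Each utterance is assigned to the first bucket whose upper edge is
    greater than or equal to its duration. A bucket's pending batch is
    emitted when adding the next utterance would exceed the budget.
    """
    pending: Dict[int, List[int]] = {i: [] for i in range(len(bucket_edges))}
    totals: Dict[int, float] = {i: 0.0 for i in range(len(bucket_edges))}
    batches: List[List[int]] = []

    for idx, dur in enumerate(durations):
        bucket = next(
            (i for i, edge in enumerate(bucket_edges) if dur <= edge), None
        )
        if bucket is None:
            # Assumption of this sketch: utterances longer than the
            # largest bucket edge are simply skipped.
            continue
        if pending[bucket] and totals[bucket] + dur > max_batch_seconds:
            batches.append(pending[bucket])
            pending[bucket], totals[bucket] = [], 0.0
        pending[bucket].append(idx)
        totals[bucket] += dur

    # Flush whatever is left in each bucket, shortest bucket first.
    for bucket in range(len(bucket_edges)):
        if pending[bucket]:
            batches.append(pending[bucket])
    return batches


durations = [1.0, 9.0, 2.0, 8.0, 1.5, 30.0]
batches = bucket_batches(durations, bucket_edges=[5.0, 10.0], max_batch_seconds=12.0)
```

With these toy durations, the three short utterances share one batch while the two long ones land in separate batches, which keeps padding low and keeps the amount of audio per batch roughly bounded.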
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Softmax · Attention Is All You Need
