Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

Krishna C. Puvvada; Piotr Żelasko; He Huang; Oleksii Hrinchuk; Nithin Rao Koluguri; Kunal Dhawan; Somshubra Majumdar; Elena Rastorgueva; Zhehuai Chen; Vitaly Lavrukhin; Jagadeesh Balam; Boris Ginsburg

arXiv:2406.19674·cs.CL·July 1, 2024·1 cites

TL;DR

This paper presents Canary, a multilingual speech recognition and translation model that reaches state-of-the-art accuracy without web-scale data by combining a FastConformer-based attention encoder-decoder architecture, synthetic data generated with machine translation, and a set of advanced training techniques.

Contribution

The paper introduces Canary, a data-efficient multilingual speech model that outperforms larger models using significantly less training data and innovative training strategies.

Findings

01

Canary outperforms Whisper, OWSM, and Seamless-M4T on multiple languages.

02

Reaches this accuracy while being trained on an order of magnitude less data than those models.

03

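Relies on synthetic data generated with machine translation and on training techniques such as data balancing, dynamic data blending, dynamic bucketing, and noise-robust fine-tuning.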

Abstract

Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary, a multilingual ASR and speech translation model, outperforms current state-of-the-art models (Whisper, OWSM, and Seamless-M4T) on English, French, Spanish, and German, while being trained on an order of magnitude less data than these models. Three key factors enable such a data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture, (2) training on synthetic data generated with machine translation, and (3) advanced training techniques: data balancing, dynamic data blending, dynamic bucketing, and noise-robust fine-tuning. The model, weights, and training code will be open-sourced.
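Of the training techniques the abstract lists, dynamic bucketing is the easiest to illustrate in isolation. The sketch below is a simplified, assumed version of the general idea, not the authors' implementation: utterances are grouped into duration buckets so that each batch contains clips of similar length (less padding), and batches are capped by total audio seconds rather than by a fixed utterance count. The function name `bucket_by_duration` and its parameters `bucket_edges` and `max_batch_seconds` are hypothetical, and shuffling and random bucket selection are omitted for brevity.

```python
from typing import Dict, List, Sequence


def bucket_by_duration(
    durations: Sequence[float],
    bucket_edges: Sequence[float],
    max_batch_seconds: float,
) -> List[List[int]]:
    """Group utterance indices into batches of similar duration.

    Hypothetical helper for illustration only. Each batch is capped by
    total audio seconds, so short clips form large batches and long
    clips form small ones.
    """
    # Bucket index = number of edges the duration strictly exceeds.
    buckets: Dict[int, List[int]] = {i: [] for i in range(len(bucket_edges) + 1)}
    for idx, dur in enumerate(durations):
        b = sum(dur > edge for edge in bucket_edges)
        buckets[b].append(idx)

    batches: List[List[int]] = []
    for b in sorted(buckets):
        current: List[int] = []
        current_seconds = 0.0
        for idx in buckets[b]:
            # Close the batch if adding this clip would exceed the cap.
            if current and current_seconds + durations[idx] > max_batch_seconds:
                batches.append(current)
                current, current_seconds = [], 0.0
            current.append(idx)
            current_seconds += durations[idx]
        if current:
            batches.append(current)
    return batches


durations = [1.0, 2.0, 9.0, 1.5, 10.0, 2.5]
batches = bucket_by_duration(durations, bucket_edges=[3.0, 6.0], max_batch_seconds=5.0)
print(batches)
```

In this toy run the four short clips are batched together up to the 5-second cap, while each long clip ends up in its own batch. A clip longer than the cap still gets a batch to itself rather than being dropped.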

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need