Anatomy of Industrial Scale Multilingual ASR

Francis McCann Ramirez; Luka Chkhetiani; Andrew Ehrenberg; Robert McHardy; Rami Botros; Yash Khare; Andrea Vanzo; Taufiquzzaman Peyash; Gabriel Oexle; Michael Liang; Ilya Sklyar; Enver Fakhan; Ahmed Etefy; Daniel McCrystal; Sam Flamini; Domenic Donato; Takuya Yoshioka

arXiv:2404.09841 · eess.AS · April 17, 2024 · 1 citation

TL;DR

This paper presents AssemblyAI's large-scale multilingual ASR system, highlighting its architecture, training data, and performance advantages over existing models in real-world, industrial applications.

Contribution

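It describes an industrial-scale ASR system that pairs a full-context 600M-parameter Conformer encoder, pre-trained with BEST-RQ, with an RNN-T decoder fine-tuned jointly with the encoder. The system is trained on unsupervised, supervised, and pseudo-labeled data across four languages and is optimized for speed, noise robustness, and multilingual capabilities.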

Findings

01 Competitive WERs against larger models
02 5x inference speedup over Whisper baseline
03 30% reduction in hallucination rate

Abstract

This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an…
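The abstract reports results as word error rates (WERs). As a reminder of what that metric measures, here is a minimal sketch: the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration; the text normalization used in the paper's evaluation is not described on this page and is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are obtained by splitting on whitespace, with no
    casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.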

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Speech Recognition and Synthesis