HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition

Florian Mai, Juan Zuluaga-Gomez, Titouan Parcollet, Petr Motlicek

arXiv:2305.18281 · cs.CL · May 30, 2023 · 1 citation

TL;DR

HyperConformer brings a multi-head HyperMixer module, a linear-complexity alternative to attention, into the Conformer architecture. It reaches comparable or better recognition accuracy than Conformer while improving inference speed, memory use, and parameter count.

Contribution

The paper extends HyperMixer to the Conformer architecture for speech recognition. The resulting model, HyperConformer, lowers computational cost and resource requirements while matching or exceeding Conformer's recognition performance.

Findings

01

Achieves 2.9% WER on Librispeech test-clean.

02

Uses fewer than 8M parameters, with a peak training memory of 5.7GB.

03

Speeds up inference by 38-56% compared to Conformer.
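
For readers unfamiliar with the metric in the first finding, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate` and the example sentences are made up for illustration and are not taken from the paper or its evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Under this definition, a WER of 2.9% corresponds to roughly 29 word-level errors per 1000 reference words.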

Abstract

State-of-the-art ASR systems have achieved promising results by modeling local and global interactions separately. While the former can be computed efficiently, global interactions are usually modeled via attention mechanisms, which are expensive for long input sequences. Here, we address this by extending HyperMixer, an efficient alternative to attention exhibiting linear complexity, to the Conformer architecture for speech recognition, leading to HyperConformer. In particular, multi-head HyperConformer achieves comparable or higher recognition performance while being more efficient than Conformer in terms of inference speed, memory, parameter count, and available training data. HyperConformer achieves a word error rate of 2.9% on Librispeech test-clean with less than 8M neural parameters and a peak memory during training of 5.7GB, hence trainable with accessible hardware. Encoder…
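
To make the "linear complexity" claim more concrete, here is a minimal NumPy sketch of HyperMixer-style token mixing with several heads. It is an illustration under stated assumptions, not the authors' implementation: the mixing weights are generated from the tokens themselves by a small per-token MLP (the "hypernetwork"), the second mixing matrix is tied to the first, the heads are formed by splitting the feature dimension, and position information, normalization, and any output projection are omitted. All names (`make_head_params`, `hypermixer_head`, `multi_head_hypermixer`) and sizes are hypothetical.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def make_head_params(rng, d_head, d_hidden):
    # Hypothetical hypernetwork parameters: a two-layer MLP applied to each
    # token independently, producing that token's row of the mixing matrix.
    return {
        "a": rng.standard_normal((d_head, d_head)) / np.sqrt(d_head),
        "b": rng.standard_normal((d_head, d_hidden)) / np.sqrt(d_head),
    }


def hypermixer_head(x, params):
    # x: (N, d_head), N tokens (speech frames) with d_head features each.
    w1 = gelu(x @ params["a"]) @ params["b"]  # (N, d_hidden), one row per token
    hidden = gelu(w1.T @ x)                   # (d_hidden, d_head): pools over tokens
    return w1 @ hidden                        # (N, d_head): redistributes to tokens
    # No N x N matrix is ever formed; every product costs O(N * d_hidden * d_head).


def multi_head_hypermixer(x, heads):
    # Assumption: heads split the feature dimension and are concatenated back.
    chunks = np.split(x, len(heads), axis=-1)
    mixed = [hypermixer_head(c, p) for c, p in zip(chunks, heads)]
    return np.concatenate(mixed, axis=-1)


rng = np.random.default_rng(0)
heads = [make_head_params(rng, d_head=4, d_hidden=8) for _ in range(4)]
frames = rng.standard_normal((50, 16))  # 50 frames, 16 features
mixed_frames = multi_head_hypermixer(frames, heads)
```

Because the same parameters work for any sequence length, doubling the number of frames doubles the cost of each matrix product, whereas self-attention builds an N by N score matrix whose cost grows quadratically with N.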

Code & Models

Repositories

speechbrain/speechbrain (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing