HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition
Florian Mai, Juan Zuluaga-Gomez, Titouan Parcollet, Petr Motlicek

TL;DR
HyperConformer introduces a multi-head HyperMixer module into the Conformer architecture, achieving efficient speech recognition with comparable or better accuracy and significantly improved inference speed and resource usage.
Contribution
The paper extends HyperMixer to the Conformer architecture, yielding HyperConformer, which reduces computational complexity and resource requirements while maintaining high recognition performance.
Findings
Achieves 2.9% WER on Librispeech test-clean.
Uses fewer than 8M parameters, with a peak training memory of 5.7GB.
Speeds up inference by 38-56% compared to Conformer.
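For context on the metric quoted above: word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and its reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition only. It is not the paper's evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, which real ASR scoring pipelines usually
    precede with text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 2.9% corresponds to roughly 29 word errors per 1,000 reference words.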
Abstract
State-of-the-art ASR systems have achieved promising results by modeling local and global interactions separately. While the former can be computed efficiently, global interactions are usually modeled via attention mechanisms, which are expensive for long input sequences. Here, we address this by extending HyperMixer, an efficient alternative to attention exhibiting linear complexity, to the Conformer architecture for speech recognition, leading to HyperConformer. In particular, multi-head HyperConformer achieves comparable or higher recognition performance while being more efficient than Conformer in terms of inference speed, memory, parameter count, and available training data. HyperConformer achieves a word error rate of 2.9% on Librispeech test-clean with less than 8M neural parameters and a peak memory during training of 5.7GB, hence trainable with accessible hardware. Encoder…
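To make the linear-complexity claim concrete, here is a minimal NumPy sketch of HyperMixer-style token mixing with several heads. It rests on stated assumptions rather than on the authors' implementation: the mixing matrix is generated from the tokens by a small per-token MLP (the hypernetwork), the same generated matrix is used for the input and output projections (tied weights), heads are formed by splitting the feature dimension evenly, and position information, normalization, and residual connections are omitted. All function and parameter names (`hyper_token_mixing`, `multi_head_mixing`, `w_a`, `w_b`) are hypothetical.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def hyper_token_mixing(x, w_a, w_b):
    """Mix information across tokens without an N x N attention matrix.

    x:   (N, d) sequence of N tokens with d features.
    w_a: (d, h) and w_b: (h, k) are hypothetical hypernetwork parameters.
    """
    # Hypernetwork: each token independently produces one row of the
    # mixing matrix, so the matrix adapts to any sequence length N.
    w1 = gelu(x @ w_a) @ w_b   # (N, k)
    # Project the N tokens down to k mixed slots, then back to N tokens.
    hidden = gelu(w1.T @ x)    # (k, d)
    return w1 @ hidden         # (N, d)


def multi_head_mixing(x, head_params):
    """Split features into equal heads and mix each head independently.

    Assumes d is divisible by the number of heads.
    """
    heads = np.split(x, len(head_params), axis=1)
    mixed = [hyper_token_mixing(h, w_a, w_b) for h, (w_a, w_b) in zip(heads, head_params)]
    return np.concatenate(mixed, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_tokens, d_model, n_heads, hyper_hidden, k = 50, 8, 2, 16, 4
    d_head = d_model // n_heads
    params = [
        (rng.normal(size=(d_head, hyper_hidden)), rng.normal(size=(hyper_hidden, k)))
        for _ in range(n_heads)
    ]
    x = rng.normal(size=(n_tokens, d_model))
    print(multi_head_mixing(x, params).shape)
```

Every matrix product here costs time proportional to N for fixed d and k, whereas self-attention forms an N x N score matrix. That difference is the sense in which this style of mixing is linear in sequence length.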
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
