Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR

Yael Segal-Feldman; Aviv Shamsian; Aviv Navon; Gill Hetz; Joseph Keshet

arXiv:2409.15869·eess.AS·September 25, 2024

TL;DR

Whisper-Medusa is a multi-head decoding approach for transformer-based speech recognition that cuts inference latency by 50% with only a minimal increase in Word Error Rate (WER).

Contribution

It introduces Whisper-Medusa, a multi-head decoding method that extends the Whisper architecture to predict multiple tokens per iteration, reducing inference latency with minimal impact on WER.

Findings

1. 50% reduction in decoding latency
2. Minimal impact on Word Error Rate
3. Effective across various datasets and learning setups

Abstract

Large transformer-based models have significant potential for speech transcription and translation. Their self-attention mechanisms and parallel processing enable them to capture complex patterns and dependencies in audio sequences. However, this potential comes with challenges: these large, computationally intensive models are slow at inference time. Various optimization strategies have been proposed to improve performance, including efficient hardware utilization and algorithmic enhancements. In this paper, we introduce Whisper-Medusa, a novel approach designed to enhance processing speed with minimal impact on Word Error Rate (WER). The proposed model extends OpenAI's Whisper architecture by predicting multiple tokens per iteration, resulting in a 50% reduction in latency. We showcase the effectiveness of Whisper-Medusa across different learning setups and datasets.
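
To make "predicting multiple tokens per iteration" concrete, the following is a minimal toy sketch of a generic draft-and-verify decoding loop. It is not the paper's implementation and does not use the Whisper API: `base_next`, `draft`, `TARGET`, and `multi_token_decode` are hypothetical names invented for illustration, the "model" is a lookup into a fixed token list, and the extra heads are simulated by a drafter that is deliberately wrong at every third position. The sketch only illustrates why accepting several verified tokens per iteration reduces the number of decoding iterations while leaving the greedy output unchanged.

```python
from typing import Callable, List, Sequence, Tuple

# Toy stand-ins (assumptions for illustration, not Whisper-Medusa code).
TARGET = [5, 3, 8, 1, 4, 7, 2, 0]  # token 0 plays the role of end-of-sequence


def base_next(tokens: Sequence[int]) -> int:
    """Toy 'base model': the greedy next token given the tokens so far."""
    return TARGET[len(tokens)]


def draft(tokens: Sequence[int], k: int) -> List[int]:
    """Toy 'extra heads': guess up to k future tokens, wrong at every 3rd position."""
    start = len(tokens)
    guesses = TARGET[start:start + k]
    return [g if (start + i) % 3 != 2 else -1 for i, g in enumerate(guesses)]


def multi_token_decode(
    base_next: Callable[[Sequence[int]], int],
    draft: Callable[[Sequence[int], int], List[int]],
    prompt: Sequence[int],
    eos: int,
    num_heads: int = 4,
    max_new: int = 64,
) -> Tuple[List[int], int]:
    """Greedy decoding that can accept several tokens per iteration.

    Returns the newly generated tokens and the number of iterations used.
    """
    tokens = list(prompt)
    produced, iterations, done = 0, 0, False
    while not done and produced < max_new:
        iterations += 1
        guesses = draft(tokens, num_heads)
        # Verification. A real model would score all guesses in one batched
        # forward pass; the toy base model is queried position by position.
        # The trailing None lets one extra base-model token through when
        # every guess was accepted.
        for guess in guesses + [None]:
            target = base_next(tokens)
            tokens.append(target)  # always keep the base model's own token
            produced += 1
            if target == eos or produced >= max_new:
                done = True
                break
            if guess != target:
                break  # first mismatch ends this iteration
    return tokens[len(prompt):], iterations


out, iters = multi_token_decode(base_next, draft, [], eos=0, num_heads=4)
baseline_out, baseline_iters = multi_token_decode(base_next, draft, [], eos=0, num_heads=0)
print(out == baseline_out, iters, baseline_iters)
```

With `num_heads=0` the loop degenerates to ordinary one-token-per-iteration greedy decoding, which gives a baseline for comparison. Fewer iterations do not translate one-to-one into lower latency in a real system, since each iteration with extra heads and verification does more work than a plain decoding step.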
