Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models

Dominik Wagner; Ilja Baumann; Korbinian Riedhammer; Tobias Bocklet

arXiv:2406.11022 · cs.SD · June 18, 2024

TL;DR

This paper applies a recently proposed gating mechanism in attention blocks to reduce outliers in transformer-based speech models, improving post-training quantization quality and speech recognition accuracy.

Contribution

It applies a recently proposed gating mechanism in the attention modules of distilled student models to mitigate outliers, enabling effective 8-bit post-training quantization and lower word error rates in large speech models.

Findings

01. Outliers are prevalent in transformer-based speech models and hinder quantization.

02. Gating attention blocks effectively reduce outliers, enabling better quantization.

03. Gated student models achieve lower word error rates than their ungated counterparts.
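Findings 01 and 02 rest on a simple mechanism: in uniform quantization, a single large value stretches the quantization step for every other value in the tensor. The sketch below illustrates this under stated assumptions: it uses symmetric per-tensor absmax INT8 quantization (a common PTQ scheme, not necessarily the one used in the paper), the activation values are made up, and the helper names are hypothetical.

```python
# Illustrative only: symmetric per-tensor absmax INT8 quantization (an assumed
# PTQ scheme) applied to made-up activation values.


def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one absmax-derived scale."""
    scale = max(abs(v) for v in values) / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]


def int8_roundtrip_error(tensor, n_typical):
    """Mean absolute round-trip error over the first n_typical entries."""
    codes, scale = quantize_int8(tensor)
    recon = dequantize(codes, scale)
    err = sum(abs(x - y) for x, y in zip(tensor[:n_typical], recon[:n_typical]))
    return err / n_typical, scale


# Hypothetical activations: 77 small values in roughly [-0.5, 0.488] ...
typical = [0.013 * i - 0.5 for i in range(77)]
# ... and the same values plus a single large outlier.
with_outlier = typical + [60.0]

for name, tensor in [("no outlier", typical), ("with outlier", with_outlier)]:
    err, scale = int8_roundtrip_error(tensor, len(typical))
    print(f"{name}: step={scale:.4f}, mean abs error on typical values={err:.5f}")
```

With the outlier present, the step size grows from about 0.004 to about 0.47, so nearly all typical values collapse onto the codes -1, 0 and 1. Per-channel scales or clipping are other common mitigations; the paper instead targets the outliers themselves.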

Abstract

This paper explores the improvement of post-training quantization (PTQ) after knowledge distillation in the Whisper speech foundation model family. We address the challenge of outliers in weights and activation tensors, known to impede quantization quality in transformer-based language and vision models. Extending this observation to Whisper, we demonstrate that these outliers are also present when transformer-based models are trained to perform automatic speech recognition, necessitating mitigation strategies for PTQ. We show that outliers can be reduced by a recently proposed gating mechanism in the attention blocks of the student model, enabling effective 8-bit quantization and lower word error rates compared to student models without the gating mechanism in place.
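The abstract does not specify the form of the gate. As a minimal sketch, assuming a single attention head whose output is multiplied element-wise by a sigmoid gate computed from the input tokens through a linear layer, gated attention could look like the code below. The gate parameters `w_g` and `b_g` and all helper names are hypothetical, and a real model would use a tensor library and multiple heads. Because the gate lies in (0, 1), a head can shrink its update toward zero without changing its attention probabilities.

```python
import math


def matmul(a, b):
    """Multiply an (n x k) by a (k x m) matrix given as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_attention(x, w_q, w_k, w_v, w_g, b_g):
    """Single-head self-attention whose output is scaled by a sigmoid gate.

    x: (seq_len x d_model); w_q, w_k, w_v, w_g: (d_model x d_head);
    b_g: length-d_head bias. The gate form is an assumption for illustration.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(w_q[0])
    scores = matmul(q, transpose(k))
    probs = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    attn = matmul(probs, v)
    # Hypothetical gate: linear projection of each input token, then a sigmoid.
    gate_logits = matmul(x, w_g)
    gate = [[sigmoid(g + b) for g, b in zip(row, b_g)] for row in gate_logits]
    return [[g * a for g, a in zip(grow, arow)] for grow, arow in zip(gate, attn)]


# Usage: identical projections, gate weights zero, only the gate bias differs.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
open_gate = gated_attention(x, eye, eye, eye, zeros, [10.0, 10.0])
closed_gate = gated_attention(x, eye, eye, eye, zeros, [-10.0, -10.0])
```

With `b_g` set to 10 the gate is almost fully open and the output is close to plain attention; with -10 it is almost closed and the output is near zero.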

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing