How Redundant Is the Transformer Stack in Speech Representation Models?

Teresa Dorszewski; Albert Kjøller Jacobsen; Lenka Tětková; Lars Kai Hansen

arXiv:2409.16302·eess.AS·January 20, 2025

Open Access

TL;DR

This paper investigates layer redundancy in transformer-based speech representation models. It finds high similarity between layers and shows that pruning and distillation can substantially reduce model size and computation with minimal performance loss.

Contribution

The study provides a detailed analysis of layer similarity in speech transformers and shows that pruning and knowledge distillation can effectively reduce model complexity.

Findings

01 Up to 40% reduction in transformer layers with minimal performance loss

02 Knowledge distillation reduces model size by 95-98%

03 Inference time decreases by up to 94%

Abstract

Self-supervised speech representation models, particularly those leveraging transformer architectures, have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection. Recent studies on transformer models revealed a high redundancy between layers and the potential for significant pruning, which we will investigate here for transformer-based speech representation models. We perform a detailed analysis of layer similarity in speech representation models using three similarity metrics: cosine similarity, centered kernel alignment, and mutual nearest-neighbor alignment. Our findings reveal a block-like structure of high similarity, suggesting two main processing steps and significant redundancy of layers. We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for…
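The layer-similarity analysis described in the abstract can be illustrated with a minimal sketch. It computes two of the three named metrics, cosine similarity and linear centered kernel alignment (CKA), between every pair of layers and collects them into a similarity matrix. The function names, the averaging of cosine similarity over frames, the choice of the linear CKA variant, and the random toy activations are all assumptions made for illustration; this is not the authors' implementation, and mutual nearest-neighbor alignment is omitted.

```python
import numpy as np


def mean_cosine_similarity(x, y):
    """Average cosine similarity between matching rows of x and y.

    Both inputs have shape (n_frames, dim) and must share the same dim.
    """
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    return float(np.mean(np.sum(x_unit * y_unit, axis=1)))


def linear_cka(x, y):
    """Linear CKA between two representations with the same number of rows."""
    # Center each feature so the Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def layer_similarity_matrix(layers, metric):
    """Apply `metric` to every pair of layer outputs."""
    n = len(layers)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = metric(layers[i], layers[j])
    return sim


# Toy stand-in for real hidden states (hypothetical data): four "layers" of
# 50 frames x 16 dims, built as two pairs of near-identical activations so
# that a two-block structure is present by construction.
rng = np.random.default_rng(0)
block_a = rng.normal(size=(50, 16))
block_b = rng.normal(size=(50, 16))
layers = [
    block_a,
    block_a + 0.01 * rng.normal(size=(50, 16)),
    block_b,
    block_b + 0.01 * rng.normal(size=(50, 16)),
]

cka = layer_similarity_matrix(layers, linear_cka)
cos = layer_similarity_matrix(layers, mean_cosine_similarity)
print(np.round(cka, 2))
```

With real models, each entry of `layers` would hold the hidden states of one transformer layer for the same set of input frames. Blocks of high values along the diagonal of the resulting matrix are the kind of block-like structure the abstract refers to.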

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Pruning · Knowledge Distillation