How Redundant Is the Transformer Stack in Speech Representation Models?
Teresa Dorszewski, Albert Kjøller Jacobsen, Lenka Tětková, Lars Kai Hansen

TL;DR
This paper investigates layer redundancy in transformer-based speech representation models. It finds high similarity between layers and shows that pruning and knowledge distillation can sharply reduce model size and computation with minimal performance loss.
Contribution
The study provides a detailed analysis of layer similarity in speech transformers and applies pruning and knowledge distillation to reduce model size and inference cost.
Findings
Up to 40% reduction in transformer layers with minimal performance loss
Knowledge distillation reduces model size by 95-98%
Inference time decreases by up to 94%
Abstract
Self-supervised speech representation models, particularly those leveraging transformer architectures, have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection. Recent studies on transformer models revealed a high redundancy between layers and the potential for significant pruning, which we will investigate here for transformer-based speech representation models. We perform a detailed analysis of layer similarity in speech representation models using three similarity metrics: cosine similarity, centered kernel alignment, and mutual nearest-neighbor alignment. Our findings reveal a block-like structure of high similarity, suggesting two main processing steps and significant redundancy of layers. We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for…
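To illustrate the kind of layer-similarity analysis the abstract describes, here is a minimal sketch of one of the three named metrics, linear centered kernel alignment (CKA), applied to every pair of layers. It is not the authors' code: the function names, the toy data, and the use of the linear CKA variant are all assumptions made for illustration, and random matrices stand in for real per-layer activations from a speech model.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two (n_samples, dim) activation matrices."""
    # Center each feature so the metric ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def layer_similarity_matrix(layers):
    """Pairwise CKA for a list of per-layer activation matrices."""
    n = len(layers)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = linear_cka(layers[i], layers[j])
    return sim


# Toy stand-in for activations (assumption): two groups of three "layers",
# each group a noisy copy of its own random base representation.
rng = np.random.default_rng(0)
base_a = rng.normal(size=(200, 16))
base_b = rng.normal(size=(200, 16))
layers = [base_a + 0.05 * rng.normal(size=base_a.shape) for _ in range(3)]
layers += [base_b + 0.05 * rng.normal(size=base_b.shape) for _ in range(3)]

sim = layer_similarity_matrix(layers)
print(np.round(sim, 2))
```

In this toy setup the matrix shows two diagonal blocks of high similarity, which is the kind of block-like pattern the abstract refers to. With a real model, each entry of `layers` would hold the hidden states of one transformer layer collected over the same set of input frames.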
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
Methods: Pruning · Knowledge Distillation
