Interpreting A Pre-trained Model Is A Key For Model Architecture Optimization: A Case Study On Wav2Vec 2.0
Liu Chen, Meysam Asgari

TL;DR
This paper introduces a method to diagnose and avoid abnormal attention patterns in pre-trained Transformer models, specifically Wav2Vec 2.0, leading to significant performance improvements in speech recognition tasks.
Contribution
It proposes a novel approach to analyze and filter abnormal attention patterns in pre-trained models, enhancing their performance without retraining from scratch.
Findings
Avoiding abnormal patterns improves Wav2Vec 2.0 by about 4.8% absolute WER on test-clean with Viterbi decoding, and by 0.9% when decoding with a 4-gram language model
Diagnosing attention patterns helps understand model behavior
Filtering abnormal patterns is key to model optimization
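The findings above are stated in word error rate (WER). As a reminder of what that metric measures, here is a minimal Python sketch, assuming the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names `word_error_rate`, `absolute_gain`, and `relative_gain` are illustrative and do not come from the paper, and this is not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_gain(baseline_wer: float, new_wer: float) -> float:
    """Absolute improvement: a plain difference of the two rates."""
    return baseline_wer - new_wer


def relative_gain(baseline_wer: float, new_wer: float) -> float:
    """Relative improvement: the difference as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Hypothetical rates chosen only to illustrate the two kinds of gain.
    print(absolute_gain(0.10, 0.08), relative_gain(0.10, 0.08))
```

An "absolute" improvement, as reported here, is the plain difference between two WER values, so it should not be read as a relative reduction.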
Abstract
A deep Transformer model with a good evaluation score does not mean that each subnetwork (a.k.a. Transformer block) learns a reasonable representation. Diagnosing abnormal representations and avoiding them can contribute to a better evaluation score. We propose an innovative perspective for analyzing attention patterns: summarize block-level patterns and assume that abnormal patterns have a negative influence. We use Wav2Vec 2.0 as the research target and analyze a pre-trained model's patterns. All experiments use Librispeech-100-clean as training data. By avoiding the diagnosed abnormal patterns, our custom Wav2Vec 2.0 outperforms the original version by about 4.8% absolute word error rate (WER) on test-clean with Viterbi decoding. Our version is still 0.9% better when decoding with a 4-gram language model. Moreover, we identify that avoiding abnormal patterns is the main contributor for…
Taxonomy
Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Model-Driven Software Engineering Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Byte Pair Encoding · Attention Is All You Need · Residual Connection · Layer Normalization · Adam · Label Smoothing
