Interpreting A Pre-trained Model Is A Key For Model Architecture Optimization: A Case Study On Wav2Vec 2.0

Liu Chen; Meysam Asgari

arXiv:2104.02851·cs.CL·April 8, 2021·1 cites

Open Access

TL;DR

This paper introduces a method to diagnose and avoid abnormal attention patterns in pre-trained Transformer models, using Wav2Vec 2.0 as the case study. Avoiding the diagnosed patterns lowers word error rate (WER) on test-clean by about 4.8% absolute with Viterbi decoding and by 0.9% with a 4-gram language model.

Contribution

It proposes a novel approach to analyze and filter abnormal attention patterns in pre-trained models, enhancing their performance without retraining from scratch.
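
As a rough illustration of what summarizing attention at the block level could look like, the sketch below averages the per-head attention maps of one Transformer block and scores how much attention mass lies near the diagonal. This summary does not state the paper's actual diagnostic criteria, so the helper names (`average_heads`, `near_diagonal_mass`) and the near-diagonal statistic itself are assumptions chosen only for illustration.

```python
from typing import List

Matrix = List[List[float]]


def average_heads(head_maps: List[Matrix]) -> Matrix:
    """Average the attention maps of all heads in one block (hypothetical helper)."""
    n_heads = len(head_maps)
    size = len(head_maps[0])
    return [
        [sum(h[i][j] for h in head_maps) / n_heads for j in range(size)]
        for i in range(size)
    ]


def near_diagonal_mass(attn: Matrix, width: int = 1) -> float:
    """Mean fraction of each row's attention that falls within `width`
    positions of the diagonal. An illustrative statistic, not the paper's."""
    total = 0.0
    for i, row in enumerate(attn):
        total += sum(w for j, w in enumerate(row) if abs(i - j) <= width)
    return total / len(attn)
```

A block whose score differs sharply from its neighbours might then be flagged for closer inspection; what counts as "abnormal" is a judgment this sketch leaves open.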

Findings

01

Avoiding abnormal patterns improves Wav2Vec 2.0 by about 4.8% absolute WER on test-clean with Viterbi decoding

02

Diagnosing attention patterns helps understand model behavior

03

Filtering abnormal patterns is key to model optimization

Abstract

A deep Transformer model with a good evaluation score does not necessarily learn a reasonable representation in each subnetwork (a.k.a. Transformer block). Diagnosing abnormal representations and avoiding them can contribute to a better evaluation score. We propose an innovative perspective for analyzing attention patterns: summarize block-level patterns and assume that abnormal patterns have a negative influence. We use Wav2Vec 2.0 as the research target and analyze a pre-trained model's patterns. All experiments use Librispeech-100-clean as training data. By avoiding the diagnosed abnormal patterns, our custom Wav2Vec 2.0 outperforms the original version by about 4.8% absolute word error rate (WER) on test-clean with Viterbi decoding. Our version is still 0.9% better when decoding with a 4-gram language model. Moreover, we identify that avoiding abnormal patterns is the main contributor for…
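
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the recognizer's hypothesis, divided by the number of reference words. An "absolute" improvement is the plain difference between two WER values. The following is a minimal sketch of that standard definition; the function names are illustrative and are not taken from the paper or from any toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_wer_gain(baseline_wer: float, new_wer: float) -> float:
    """Absolute improvement: the plain difference between two WER values."""
    return baseline_wer - new_wer
```

For example, a hypothesis with one substituted and one dropped word against a six-word reference gives a WER of 2/6.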

Taxonomy

Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Model-Driven Software Engineering Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Byte Pair Encoding · Attention Is All You Need · Residual Connection · Layer Normalization · Adam · Label Smoothing