Dissecting Lottery Ticket Transformers: Structural and Behavioral Study of Sparse Neural Machine Translation

Rajiv Movva; Jason Y. Zhao

arXiv:2009.13270·cs.CL·October 14, 2020

TL;DR

This paper investigates how magnitude pruning changes the internal representations and behavior of sparse Transformer models for neural machine translation. Complex semantic information degrades first, higher layers diverge most from their dense counterparts, early layers take on more of the encoding, and attention stays consistent as sparsity increases.

Contribution

The paper analyzes the structural and behavioral changes that pruning causes in sparse Transformers, highlighting layer-specific effects and the stability of attention mechanisms.

Findings

01 Semantic information degrades with pruning

02 Higher layers diverge more than lower layers

03 Attention mechanisms remain stable despite sparsity

Abstract

Recent work on the lottery ticket hypothesis has produced highly sparse Transformers for NMT while maintaining BLEU. However, it is unclear how such pruning techniques affect a model's learned representations. By probing Transformers with more and more low-magnitude weights pruned away, we find that complex semantic information is first to be degraded. Analysis of internal activations reveals that higher layers diverge most over the course of pruning, gradually becoming less complex than their dense counterparts. Meanwhile, early layers of sparse models begin to perform more encoding. Attention mechanisms remain remarkably consistent as sparsity increases.
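The abstract refers to pruning away low-magnitude weights. As a rough illustration of that idea only, the sketch below builds a one-shot, global magnitude mask over a dictionary of NumPy weight arrays. The function names (`global_magnitude_masks`, `apply_masks`), the global (rather than per-layer) threshold, and the toy weight shapes are all assumptions made for this example; the paper's actual procedure (for instance, iterative pruning rounds or weight rewinding) is not described in this listing and is not reproduced here.

```python
import numpy as np


def global_magnitude_masks(weights, sparsity):
    """Build boolean keep-masks that drop the smallest-magnitude weights.

    weights:  dict mapping a (hypothetical) parameter name to a NumPy array.
    sparsity: fraction of all weights to prune, in [0, 1).
    Assumption: one global threshold is shared by every tensor.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")

    # Pool the absolute values of every weight into one flat vector.
    magnitudes = np.concatenate([np.abs(w).ravel() for w in weights.values()])
    n_prune = int(sparsity * magnitudes.size)
    if n_prune == 0:
        return {name: np.ones(w.shape, dtype=bool) for name, w in weights.items()}

    # The n_prune-th smallest magnitude is the pruning threshold.
    threshold = np.partition(magnitudes, n_prune - 1)[n_prune - 1]

    # Keep strictly larger magnitudes. If several weights tie exactly at the
    # threshold, slightly more than n_prune weights are pruned.
    return {name: np.abs(w) > threshold for name, w in weights.items()}


def apply_masks(weights, masks):
    """Zero out pruned weights, leaving the kept weights unchanged."""
    return {name: w * masks[name] for name, w in weights.items()}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for two weight matrices; not real Transformer parameters.
    toy_weights = {
        "layer0": rng.normal(size=(8, 8)),
        "layer1": rng.normal(size=(4, 16)),
    }
    toy_masks = global_magnitude_masks(toy_weights, sparsity=0.5)
    pruned = apply_masks(toy_weights, toy_masks)
    zeros = sum(int((p == 0).sum()) for p in pruned.values())
    total = sum(p.size for p in pruned.values())
    print(f"fraction of zeroed weights: {zeros / total:.2f}")
```

A global threshold lets layers end up with different sparsity levels, whereas a per-layer threshold would force the same fraction in each tensor; which variant matches the paper's setup would need to be checked against the full text.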

Taxonomy

Methods: Pruning