A Dynamic Head Importance Computation Mechanism for Neural Machine Translation
Akshay Goindani, Manish Shrivastava

TL;DR
This paper introduces DHICM, a dynamic mechanism to compute and utilize head importance in Transformer models for neural machine translation, improving performance especially with limited training data.
Contribution
The paper proposes a novel dynamic head importance computation mechanism that enhances Transformer efficiency and translation quality by adaptively identifying important attention heads.
Findings
DHICM outperforms traditional Transformer models in NMT tasks.
DHICM is especially effective with limited training data.
The added importance mechanism improves resource utilization and translation accuracy.
Abstract
Multiple parallel attention mechanisms that use multiple attention heads improve the performance of the Transformer model in various applications, e.g., Neural Machine Translation (NMT) and text classification. In the multi-head attention mechanism, different heads attend to different parts of the input. However, multiple heads might attend to the same part of the input, making some heads redundant and leaving the model's resources under-utilized. One approach to avoid this is to prune the least important heads based on some importance score. In this work, we focus on designing a Dynamic Head Importance Computation Mechanism (DHICM) to dynamically calculate the importance of a head with respect to the input. Our insight is to design an additional attention layer together with multi-head attention, and utilize the outputs of the multi-head attention…
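The abstract describes an additional attention layer that scores each head from the multi-head attention outputs. A minimal pure-Python sketch of that idea is given here. The function names, the `query` vector (standing in for a learned parameter), the scaled dot-product scoring, and the softmax over heads are all assumptions for illustration, since the truncated abstract does not give the paper's exact formulation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_importance(head_outputs, query):
    # head_outputs: one output vector per attention head (for a single token).
    # query: hypothetical learned vector used to score the heads (assumption).
    d = len(query)
    scores = [
        sum(h_i * q_i for h_i, q_i in zip(h, query)) / math.sqrt(d)
        for h in head_outputs
    ]
    # Importance is input-dependent: it changes whenever head_outputs change.
    return softmax(scores)

def weight_heads(head_outputs, importance):
    # Scale each head's output by its importance score.
    return [[w * x for x in h] for h, w in zip(head_outputs, importance)]

# Toy input: heads 0 and 1 are redundant (identical outputs), head 2 differs.
head_outputs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query = [0.0, 2.0]

importance = head_importance(head_outputs, query)
weighted = weight_heads(head_outputs, importance)
```

In this toy input the two identical heads receive identical scores, and the scores sum to one across heads. Unlike static pruning, nothing is removed: the weights are recomputed for every input.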
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Dense Connections · Label Smoothing · Residual Connection · Adam · Byte Pair Encoding
