EIT: Enhanced Interactive Transformer
Tong Zheng, Bei Li, Huiwen Bao, Tong Xiao, Jingbo Zhu

TL;DR
This paper introduces EMHA, an enhanced multi-head self-attention mechanism that removes the one-to-one mapping between queries and keys across subspaces and encourages consensus among heads, leading to better performance on language tasks such as machine translation, abstractive summarization, and grammar correction.
Contribution
The paper proposes EMHA, which incorporates inner- and cross-subspace interactions to improve multi-head self-attention by balancing complementarity and consensus.
Findings
Outperforms existing models on multiple language tasks
Achieves superior results with minimal increase in model size
Demonstrates the effectiveness of enhanced consensus mechanisms
Abstract
Two principles, the complementary principle and the consensus principle, are widely acknowledged in the multi-view learning literature. However, the current design of multi-head self-attention, an instance of multi-view learning, prioritizes complementarity while ignoring consensus. To address this problem, we propose an enhanced multi-head self-attention (EMHA). First, to satisfy the complementary principle, EMHA removes the one-to-one mapping constraint among queries and keys in multiple subspaces and allows each query to attend to multiple keys. On top of that, we develop a method to fully encourage consensus among heads by introducing two interaction models, namely inner-subspace interaction and cross-subspace interaction. Extensive experiments on a wide range of language tasks (e.g., machine translation, abstractive summarization and grammar correction, language…
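The difference between the one-to-one mapping and the many-to-many mapping described in the abstract can be illustrated with a minimal pure-Python sketch. It assumes plain scaled dot-product attention. The function names (`one_to_one_maps`, `many_to_many_maps`, `fuse_mean`) and the toy sizes are hypothetical. The fusion step is a simple average used only as a placeholder; it is not the paper's inner-subspace or cross-subspace interaction model, which this summary does not specify.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(q_head, k_head):
    """Scaled dot-product attention weights between one query head and one key head.

    q_head, k_head: lists of seq_len vectors, each of length head_dim.
    Returns a seq_len x seq_len matrix whose rows sum to 1.
    """
    scale = math.sqrt(len(q_head[0]))
    return [
        softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k_head])
        for qi in q_head
    ]


def one_to_one_maps(q, k):
    """Standard multi-head attention: query head h is paired only with key head h."""
    return [attention_map(q[h], k[h]) for h in range(len(q))]


def many_to_many_maps(q, k):
    """Every query head is paired with every key head: heads x heads maps."""
    return [[attention_map(q_head, k_head) for k_head in k] for q_head in q]


def fuse_mean(maps):
    """Placeholder fusion (assumption): average the maps of each query head.

    This only reduces heads x heads maps back to heads maps; it does NOT
    reproduce the paper's interaction models.
    """
    fused = []
    for per_query_head in maps:  # one map per key head
        n = len(per_query_head)
        seq_len = len(per_query_head[0])
        fused.append([
            [sum(m[i][j] for m in per_query_head) / n for j in range(seq_len)]
            for i in range(seq_len)
        ])
    return fused


def random_heads(rng, heads, seq_len, head_dim):
    """Toy inputs: heads x seq_len x head_dim Gaussian values."""
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(head_dim)] for _ in range(seq_len)]
        for _ in range(heads)
    ]


rng = random.Random(0)
heads, seq_len, head_dim = 4, 5, 8  # hypothetical toy sizes
q = random_heads(rng, heads, seq_len, head_dim)
k = random_heads(rng, heads, seq_len, head_dim)

standard = one_to_one_maps(q, k)    # heads attention maps
expanded = many_to_many_maps(q, k)  # heads * heads attention maps
fused = fuse_mean(expanded)         # back to heads attention maps

print("one-to-one maps:", len(standard))
print("many-to-many maps:", sum(len(row) for row in expanded))
```

In this sketch the diagonal entries `expanded[h][h]` coincide with the standard maps, so the one-to-one design is a special case of the many-to-many one. The extra off-diagonal maps are the material that an interaction model would then combine.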
Taxonomy
Topics: Topic Modeling · Online Learning and Analytics · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dropout
