EIT: Enhanced Interactive Transformer

Tong Zheng; Bei Li; Huiwen Bao; Tong Xiao; Jingbo Zhu

arXiv:2212.10197·cs.CL·June 6, 2024·1 cites

TL;DR

This paper introduces EMHA, an improved multi-head self-attention mechanism that enhances consensus among heads and relaxes the complementarity constraints, leading to better performance on various language tasks.

Contribution

The paper proposes EMHA, which incorporates inner- and cross-subspace interactions to improve multi-head self-attention by balancing complementarity and consensus.

Findings

1. Outperforms existing models on multiple language tasks
2. Achieves superior results with minimal increase in model size
3. Demonstrates the effectiveness of enhanced consensus mechanisms

Abstract

Two principles, the complementary principle and the consensus principle, are widely acknowledged in the multi-view learning literature. However, the current design of multi-head self-attention, an instance of multi-view learning, prioritizes complementarity while ignoring consensus. To address this problem, we propose an enhanced multi-head self-attention (EMHA). First, to satisfy the complementary principle, EMHA removes the one-to-one mapping constraint among queries and keys in multiple subspaces and allows each query to attend to multiple keys. On top of that, we develop a method to fully encourage consensus among heads by introducing two interaction models, namely inner-subspace interaction and cross-subspace interaction. Extensive experiments on a wide range of language tasks (e.g., machine translation, abstractive summarization and grammar correction, language…
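
The relaxed query-key pairing described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes one plausible reading, in which every query head is scored against every key head, so that M heads yield M × M attention maps instead of M. The function names, tensor sizes, and random inputs are all hypothetical, and the inner-subspace and cross-subspace interaction models are not shown because the abstract does not specify them.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(q, k):
    """Scaled dot-product attention weights for one (query head, key head) pair.

    q and k are lists of seq_len vectors, each of length d_head.
    Returns a seq_len x seq_len matrix whose rows sum to 1.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]


def one_to_one_maps(q_heads, k_heads):
    """Standard multi-head attention: query head i is paired only with key head i."""
    return [attention_map(q, k) for q, k in zip(q_heads, k_heads)]


def many_to_many_maps(q_heads, k_heads):
    """Assumed relaxed pairing: every query head is scored against every key head."""
    return [[attention_map(q, k) for k in k_heads] for q in q_heads]


def random_heads(num_heads, seq_len, d_head, rng):
    """Hypothetical stand-in for projected queries or keys."""
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(d_head)] for _ in range(seq_len)]
        for _ in range(num_heads)
    ]


rng = random.Random(0)
q_heads = random_heads(4, 3, 8, rng)  # 4 heads, 3 tokens, 8 dims per head
k_heads = random_heads(4, 3, 8, rng)

standard = one_to_one_maps(q_heads, k_heads)
relaxed = many_to_many_maps(q_heads, k_heads)

print(len(standard))                    # 4 attention maps
print(len(relaxed) * len(relaxed[0]))   # 16 attention maps
```

Under this reading, the diagonal entries `relaxed[i][i]` coincide with the standard maps, and the off-diagonal entries are the additional query-key pairings that a later interaction step would have to combine.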

Taxonomy

Topics: Topic Modeling · Online Learning and Analytics · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dropout