Are Neighbors Enough? Multi-Head Neural n-gram can be Alternative to Self-attention
Mengsay Loem, Sho Takase, Masahiro Kaneko, Naoaki Okazaki

TL;DR
This paper proposes replacing self-attention in Transformers with a multi-head neural n-gram approach, which considers local context and can match or outperform traditional self-attention in sequence-to-sequence tasks.
Contribution
The study introduces a multi-head neural n-gram mechanism as an alternative to self-attention, demonstrating competitive performance and potential for combined use in Transformers.
Findings
On sequence-to-sequence tasks, replacing self-attention with multi-head neural n-gram achieves results comparable to or better than the Transformer.
Combining neural n-gram with self-attention further improves performance.
Neural n-gram complements self-attention, enhancing Transformer models.
Abstract
Impressive performance of Transformer has been attributed to self-attention, where dependencies between entire input in a sequence are considered at every position. In this work, we reform the neural n-gram model, which focuses on only several surrounding representations of each position, with the multi-head mechanism as in Vaswani et al. (2017). Through experiments on sequence-to-sequence tasks, we show that replacing self-attention in Transformer with multi-head neural n-gram can achieve comparable or better performance than Transformer. From various analyses on our proposed method, we find that multi-head neural n-gram is complementary to self-attention, and their combinations can further improve performance of vanilla Transformer.
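The abstract describes the mechanism only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes a left-only (causal) window of n positions, a ReLU after each head's linear layer, and an output projection; the function name `multihead_ngram` and the weight shapes are hypothetical choices made for illustration. The point it shows is that each position's output depends only on its n nearest inputs, in contrast to self-attention, which looks at the whole sequence.

```python
import numpy as np

def multihead_ngram(x, w_heads, w_out, n):
    """Hypothetical multi-head neural n-gram layer (illustrative only).

    x:       (seq_len, d_model) input representations
    w_heads: (heads, n * d_head, d_head) per-head weights (assumed shape)
    w_out:   (d_model, d_model) output projection (assumed)
    n:       window size; position t sees positions t-n+1 .. t
    """
    seq_len, d_model = x.shape
    heads = w_heads.shape[0]
    d_head = d_model // heads
    # Left-pad with zeros so early positions still have a full window.
    padded = np.concatenate([np.zeros((n - 1, d_model)), x], axis=0)
    # Split the feature dimension into heads: (heads, seq_len + n - 1, d_head).
    split = padded.reshape(seq_len + n - 1, heads, d_head).transpose(1, 0, 2)
    outputs = []
    for h in range(heads):
        # Concatenate the n window slots per position: (seq_len, n * d_head).
        windows = np.concatenate(
            [split[h, i:i + seq_len] for i in range(n)], axis=1
        )
        # Linear layer followed by ReLU (the activation is an assumption).
        outputs.append(np.maximum(windows @ w_heads[h], 0.0))
    # Concatenate heads and apply the output projection.
    return np.concatenate(outputs, axis=1) @ w_out

# Toy usage with random weights; all sizes are arbitrary.
rng = np.random.default_rng(0)
seq_len, d_model, heads, n = 6, 8, 2, 3
d_head = d_model // heads
x = rng.normal(size=(seq_len, d_model))
w_heads = rng.normal(size=(heads, n * d_head, d_head))
w_out = rng.normal(size=(d_model, d_model))
y = multihead_ngram(x, w_heads, w_out, n)
```

Because the window here covers only positions t-n+1 through t, perturbing the first input leaves every output from position n onward unchanged. An encoder-side variant might instead use a window centered on each position; the abstract's phrase "surrounding representations" is compatible with either choice.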
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Dropout · Adam · Byte Pair Encoding · Label Smoothing · Multi-Head Attention · Residual Connection
