Are Neighbors Enough? Multi-Head Neural n-gram can be Alternative to Self-attention

Mengsay Loem; Sho Takase; Masahiro Kaneko; Naoaki Okazaki

arXiv:2207.13354 · cs.CL · July 28, 2022 · 1 cites

TL;DR

This paper proposes replacing self-attention in Transformers with a multi-head neural n-gram, which looks only at a few surrounding representations of each position, and reports that it can match or outperform self-attention on sequence-to-sequence tasks.

Contribution

The study introduces a multi-head neural n-gram mechanism as an alternative to self-attention, demonstrating competitive performance and potential for combined use in Transformers.

Findings

01. Multi-head neural n-gram achieves comparable or better results than self-attention.

02. Combining neural n-gram with self-attention further improves performance.

03. Neural n-gram complements self-attention, enhancing Transformer models.

Abstract

The impressive performance of the Transformer has been attributed to self-attention, in which dependencies across the entire input sequence are considered at every position. In this work, we reform the neural $n$-gram model, which focuses on only several surrounding representations of each position, with the multi-head mechanism of Vaswani et al. (2017). Through experiments on sequence-to-sequence tasks, we show that replacing self-attention in the Transformer with multi-head neural $n$-gram can achieve comparable or better performance than the Transformer. From various analyses of our proposed method, we find that multi-head neural $n$-gram is complementary to self-attention, and that their combinations can further improve the performance of the vanilla Transformer.
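The idea of a layer that sees only a few neighboring representations per position, split across heads, can be illustrated with a minimal sketch. This is not the authors' implementation: the left-only window with zero padding, the single linear map plus ReLU per head, the absence of an output projection, and the names `multi_head_ngram` and `init_weights` are all assumptions made for illustration.

```python
import random

def init_weights(num_heads, d_h, n, seed=0):
    """Hypothetical helper: one random (n * d_h) x d_h matrix per head."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1, 1) for _ in range(d_h)]
             for _ in range(n * d_h)]
            for _ in range(num_heads)]

def multi_head_ngram(x, weights, n):
    """x: list of T vectors of size d; weights: one matrix per head.

    Each head reads its own slice of the n positions ending at t
    (an assumed left-only window) and maps the concatenation to d_h values.
    """
    num_heads = len(weights)
    d_h = len(x[0]) // num_heads
    out = []
    for t in range(len(x)):
        row = []
        for h, w in enumerate(weights):
            lo, hi = h * d_h, (h + 1) * d_h
            # Concatenate this head's slice of positions t-n+1 .. t,
            # zero-padding positions before the start of the sequence.
            window = []
            for s in range(t - n + 1, t + 1):
                window += x[s][lo:hi] if s >= 0 else [0.0] * d_h
            # Linear map of the concatenated window, followed by ReLU.
            for j in range(d_h):
                z = sum(window[i] * w[i][j] for i in range(n * d_h))
                row.append(max(0.0, z))
        out.append(row)  # head outputs concatenated back to size d
    return out

# Usage: 6 positions, hidden size 4, 2 heads, window of 3 positions.
T, d, H, n = 6, 4, 2, 3
rng = random.Random(1)
x = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
w = init_weights(H, d // H, n)
y = multi_head_ngram(x, w, n)
```

Unlike self-attention, the output at position t in this sketch depends only on positions t-n+1 through t, so changing a distant input leaves it untouched; the cost per position is fixed by n rather than by sequence length.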

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Dropout · Adam · Byte Pair Encoding · Label Smoothing · Multi-Head Attention · Residual Connection