MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning

Guangxiang Zhao; Xu Sun; Jingjing Xu; Zhiyuan Zhang; Liangchen Luo

arXiv:1911.09483 · cs.CL · November 22, 2019 · 41 cites

TL;DR

MUSE introduces a parallel multi-scale attention mechanism for sequence-to-sequence learning, enhancing long-sequence modeling and outperforming previous models in machine translation tasks.

Contribution

The paper proposes MUSE, a parallel multi-scale attention model that combines convolution and self-attention to improve translation of long sequences.

Findings

01. Outperforms previous models on three machine translation benchmarks.

02. Achieves substantial improvements, especially on long sequences.

03. Offers potential for faster inference because the architecture is parallel.

Abstract

In sequence to sequence learning, the self-attention mechanism proves to be highly effective, and achieves significant improvements in many tasks. However, the self-attention mechanism is not without its own flaws. Although self-attention can model extremely long dependencies, the attention in deep layers tends to overconcentrate on a single token, leading to insufficient use of local information and difficulty in representing long sequences. In this work, we explore parallel multi-scale representation learning on sequence data, striving to capture both long-range and short-range language structures. To this end, we propose the Parallel MUlti-Scale attEntion (MUSE) and MUSE-simple. MUSE-simple contains the basic idea of parallel multi-scale sequence representation learning, and it encodes the sequence in parallel, in terms of different scales with the help of self-attention, and…
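
The parallel multi-scale idea summarized above can be illustrated with a small sketch. It is not the authors' implementation: it assumes an unparameterized self-attention branch (queries, keys, and values are the raw inputs), a fixed depthwise convolution kernel as the local branch, and a plain sum to fuse the two. The function names `self_attention`, `local_conv`, and `parallel_multi_scale`, and the kernel values, are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    # Global branch (assumption: no learned projections, Q = K = V = seq).
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def local_conv(seq, kernel):
    # Local branch (assumption: one fixed kernel shared by every channel,
    # zero padding at the sequence boundaries).
    n, d, half = len(seq), len(seq[0]), len(kernel) // 2
    out = []
    for i in range(n):
        row = [0.0] * d
        for offset, w in enumerate(kernel):
            pos = i + offset - half
            if 0 <= pos < n:
                for j in range(d):
                    row[j] += w * seq[pos][j]
        out.append(row)
    return out


def parallel_multi_scale(seq, kernel=(0.25, 0.5, 0.25)):
    # Both branches read the same input; their outputs are summed per position.
    global_out = self_attention(seq)
    local_out = local_conv(seq, kernel)
    return [[a + b for a, b in zip(g, l)] for g, l in zip(global_out, local_out)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-dimensional token vectors
fused = parallel_multi_scale(tokens)
```

Because neither branch depends on the other's output, the two calls inside `parallel_multi_scale` could run concurrently, which is the property a parallel design of this kind relies on.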

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Adam · Dropout · Multi-Head Attention · Byte Pair Encoding · Dense Connections