Multi-Stream Transformers

Mikhail Burtsev; Anna Rumshisky

arXiv:2107.10342 · cs.CL · July 23, 2021

TL;DR

This paper introduces a Multi-stream Transformer architecture that splits the encoder into multiple streams, so that alternative hypotheses are preserved and explored in parallel and merged only at the end of encoding. The authors report that this improves performance over a standard Transformer encoder.

Contribution

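The paper proposes a multi-stream encoder design that lets the model keep several representational hypotheses and merge them at the end of encoding. It reports improved performance from this design, and a further improvement from adding a skip connection between the first and the final encoder layer.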

Findings

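01 Splitting the Transformer encoder into multiple streams improves performance.

02 Adding a skip connection between the first and the final encoder layer improves performance further.

03 The multi-stream encoder preserves alternative hypotheses and merges them at the end of the encoding process.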

Abstract

Transformer-based encoder-decoder models produce a fused token-wise representation after every encoder layer. We investigate the effects of allowing the encoder to preserve and explore alternative hypotheses, combined at the end of the encoding process. To that end, we design and examine a Multi-stream Transformer architecture and find that splitting the Transformer encoder into multiple encoder streams and allowing the model to merge multiple representational hypotheses improves performance, with further improvement obtained by adding a skip connection between the first and the final encoder layer.
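
The data flow described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the toy affine "layers" stand in for real Transformer encoder layers, the streams are merged by elementwise averaging, and the skip connection is modeled by adding the first layer's output to the merged representation before the final layer. The abstract does not specify the merge rule or the exact form of the skip connection, and every name here (`make_toy_layer`, `multi_stream_encode`, and so on) is hypothetical.

```python
from typing import Callable, List, Sequence

Tokens = List[List[float]]          # one feature vector per token
Layer = Callable[[Tokens], Tokens]  # stand-in for an encoder layer


def make_toy_layer(scale: float, shift: float) -> Layer:
    """Hypothetical stand-in for an encoder layer: an elementwise affine map."""
    def layer(tokens: Tokens) -> Tokens:
        return [[scale * x + shift for x in tok] for tok in tokens]
    return layer


def add(a: Tokens, b: Tokens) -> Tokens:
    """Elementwise sum of two token sequences of the same shape."""
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a, b)]


def average(outputs: Sequence[Tokens]) -> Tokens:
    """Elementwise mean over stream outputs (an assumed merge rule)."""
    total = outputs[0]
    for out in outputs[1:]:
        total = add(total, out)
    return [[x / len(outputs) for x in tok] for tok in total]


def multi_stream_encode(
    tokens: Tokens,
    first_layer: Layer,
    streams: Sequence[Sequence[Layer]],
    final_layer: Layer,
    skip: bool = True,
) -> Tokens:
    """Shared first layer -> parallel streams -> merge -> final layer."""
    h_first = first_layer(tokens)

    # Each stream processes the same first-layer output independently,
    # so no fused representation is forced between streams.
    stream_outputs = []
    for stream in streams:
        h = h_first
        for layer in stream:
            h = layer(h)
        stream_outputs.append(h)

    # The alternative hypotheses are combined only at the end.
    merged = average(stream_outputs)

    # Assumed form of the skip connection from the first to the final layer.
    if skip:
        merged = add(merged, h_first)
    return final_layer(merged)


tokens = [[1.0, 2.0]]
first = make_toy_layer(2.0, 0.0)
streams = [[make_toy_layer(1.0, 1.0)], [make_toy_layer(3.0, 0.0)]]
final = make_toy_layer(1.0, 0.0)

with_skip = multi_stream_encode(tokens, first, streams, final, skip=True)
without_skip = multi_stream_encode(tokens, first, streams, final, skip=False)
```

In a real model each stream would be a stack of self-attention and feed-forward layers with its own parameters. The point of the sketch is only the topology: the streams never exchange information until the merge step.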

Code & Models

Repositories

lucidrains/multistream-transformers (PyTorch)

Taxonomy

Topics: Topic Modeling · Music and Audio Processing · Ferroelectric and Negative Capacitance Devices

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Softmax · Dense Connections · Adam · Layer Normalization