Multi-Stream Transformers
Mikhail Burtsev, Anna Rumshisky

TL;DR
This paper introduces a Multi-stream Transformer architecture that splits the encoder into multiple streams so the model can preserve and explore alternative hypotheses, which are merged at the end of encoding. The authors report that this improves performance.
Contribution
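The paper proposes a multi-stream encoder design in which several encoder streams develop alternative representational hypotheses that are merged at the end of encoding. It also reports a further gain from adding a skip connection between the first and the final encoder layer.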
Findings
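Splitting the encoder into multiple streams improves performance.
Adding a skip connection between the first and the final encoder layer improves performance further.
The multi-stream design lets the model preserve alternative hypotheses until they are merged at the end of encoding.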
Abstract
Transformer-based encoder-decoder models produce a fused token-wise representation after every encoder layer. We investigate the effects of allowing the encoder to preserve and explore alternative hypotheses, combined at the end of the encoding process. To that end, we design and examine a Multi-stream Transformer architecture and find that splitting the Transformer encoder into multiple encoder streams and allowing the model to merge multiple representational hypotheses improves performance, with further improvement obtained by adding a skip connection between the first and the final encoder layer.
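The data flow described in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the names `multi_stream_encode` and `make_toy_layer` are hypothetical, the layers are toy stand-ins rather than real Transformer layers, and two details the abstract does not specify are assumed here, namely that stream outputs are merged by element-wise averaging and that the skip connection adds the first layer's output to the merged representation before the final layer.

```python
from typing import Callable, List

Vector = List[float]
Tokens = List[Vector]               # one vector per token
Layer = Callable[[Tokens], Tokens]  # maps a token sequence to a token sequence


def make_toy_layer(scale: float, shift: float) -> Layer:
    """Hypothetical stand-in for an encoder layer: an element-wise affine map."""
    def layer(x: Tokens) -> Tokens:
        return [[scale * v + shift for v in tok] for tok in x]
    return layer


def average(seqs: List[Tokens]) -> Tokens:
    """Element-wise mean over several token sequences of identical shape."""
    n = len(seqs)
    return [[sum(vals) / n for vals in zip(*toks)] for toks in zip(*seqs)]


def add(a: Tokens, b: Tokens) -> Tokens:
    """Element-wise sum of two token sequences of identical shape."""
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a, b)]


def multi_stream_encode(
    x: Tokens,
    first_layer: Layer,
    streams: List[List[Layer]],
    final_layer: Layer,
    use_skip: bool = True,
) -> Tokens:
    # Shared first encoder layer.
    h0 = first_layer(x)

    # Each stream processes the same input independently,
    # so each can develop its own "hypothesis".
    stream_outputs = []
    for stream in streams:
        h = h0
        for layer in stream:
            h = layer(h)
        stream_outputs.append(h)

    # ASSUMPTION: merge by averaging; the source only says hypotheses are merged.
    merged = average(stream_outputs)

    # ASSUMPTION: the first-to-final skip connection is an additive residual.
    if use_skip:
        merged = add(merged, h0)

    return final_layer(merged)
```

With real Transformer layers substituted for the toy layers, setting `use_skip=False` gives the plain multi-stream variant, and passing a single stream reduces the sketch to an ordinary stacked encoder.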
Taxonomy
Topics: Topic Modeling · Music and Audio Processing · Ferroelectric and Negative Capacitance Devices
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Softmax · Dense Connections · Adam · Layer Normalization
