TranSFormer: Slow-Fast Transformer for Machine Translation
Bei Li, Yi Jing, Xu Tan, Zhen Xing, Tong Xiao, Jingbo Zhu

TL;DR
TranSFormer is a dual-branch Transformer that feeds character-level features into a subword-level translation model, improving translation quality across benchmarks while keeping the added character branch lightweight.
Contribution
It proposes a novel Slow-Fast two-stream Transformer architecture that incorporates character-level features into machine translation models.
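To make the two streams concrete, the sketch below builds both views of a single sentence: a subword sequence for the slow branch and a character sequence for the fast branch. The subword split is hand-written for illustration (a real system would obtain it from a learned BPE model), the `_` word-boundary marker is an assumed convention, and the names `subword_view` and `char_view` are hypothetical rather than taken from the paper.

```python
# Illustrative only: a hand-written table stands in for a learned BPE model.
TOY_SUBWORDS = {
    "translation": ["trans", "lation"],
    "quality": ["qual", "ity"],
}


def subword_view(sentence):
    """Split each word using the toy table; words not in the table stay whole."""
    tokens = []
    for word in sentence.split():
        tokens.extend(TOY_SUBWORDS.get(word, [word]))
    return tokens


def char_view(sentence):
    """Return the character sequence, with '_' marking word boundaries."""
    return list(sentence.replace(" ", "_"))


sentence = "translation quality matters"
slow_input = subword_view(sentence)  # short sequence for the slow branch
fast_input = char_view(sentence)     # much longer sequence for the fast branch

print(len(slow_input), slow_input)
print(len(fast_input), "".join(fast_input))
```

Even on this short sentence the character view is several times longer than the subword view, which is why the branch that reads characters has to be cheap per position.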
Findings
Improves BLEU by more than 1 point on several machine translation benchmarks.
Efficiently combines character and subword information in translation.
Demonstrates the effectiveness of multiscale features in Transformer models.
Abstract
Learning multiscale Transformer models has been shown to be a viable approach to improving machine translation systems. Prior research has primarily treated subwords as the basic units in such systems. However, incorporating fine-grained character-level features into multiscale Transformers has not yet been explored. In this work, we present a Slow-Fast two-stream learning model, referred to as TranSFormer, which uses a "slow" branch to process subword sequences and a "fast" branch to process longer character sequences. The model is efficient because the fast branch is made very lightweight by reducing its model width, yet it still provides useful fine-grained features to the slow branch. Our TranSFormer shows consistent BLEU improvements (larger than 1 BLEU point) on several machine translation benchmarks.
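The efficiency argument in the abstract can be checked with back-of-the-envelope arithmetic. The sketch below estimates the weight count and rough multiply-accumulate cost of one standard Transformer encoder layer, then compares a wide branch on a short sequence with a narrow branch on a sequence five times longer. All widths and lengths here are assumptions chosen for illustration, not the paper's configuration, and the cost model ignores biases, layer normalization, and the branch-fusion mechanism.

```python
def layer_params(d_model, d_ff):
    """Weights in one encoder layer: 4 attention projections plus 2 FFN matrices.

    Biases and LayerNorm parameters are ignored.
    """
    return 4 * d_model * d_model + 2 * d_model * d_ff


def layer_macs(seq_len, d_model, d_ff):
    """Rough multiply-accumulate count for one layer on a length-seq_len input."""
    projections = seq_len * layer_params(d_model, d_ff)
    attention = 2 * seq_len * seq_len * d_model  # QK^T scores plus weighted sum over V
    return projections + attention


# Hypothetical settings (not from the paper): wide branch on subwords,
# narrow branch on a character sequence five times longer.
slow = dict(seq_len=30, d_model=512, d_ff=2048)
fast = dict(seq_len=150, d_model=128, d_ff=512)

slow_params = layer_params(slow["d_model"], slow["d_ff"])
fast_params = layer_params(fast["d_model"], fast["d_ff"])
slow_cost = layer_macs(**slow)
fast_cost = layer_macs(**fast)

print("params per layer:", slow_params, "vs", fast_params)
print("MACs per layer:  ", slow_cost, "vs", fast_cost)
```

Under these assumed settings, cutting the width by a factor of 4 cuts per-layer weights by a factor of 16, so the narrow branch stays cheaper than the wide one despite the longer input. The quadratic attention term grows with sequence length, so the margin shrinks as character sequences get longer.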
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Bioinformatics
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer · Label Smoothing · Adam · Residual Connection
