Local Byte Fusion for Neural Machine Translation
Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng, Junjie Hu

TL;DR
This paper introduces Local Byte Fusion (LOBEF), a method that enhances byte-based neural machine translation by aggregating local semantic information, leading to improved performance and efficiency over traditional subword models.
Contribution
The paper proposes a novel byte-based translation method, LOBEF, which aggregates local information using byte n-grams and word boundaries, and which outperforms subword techniques in multilingual translation and domain adaptation.
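The two kinds of local context named above, byte n-grams and word boundaries, can be illustrated with a minimal sketch. This is not the LOBEF model, whose architecture is not described in this summary. It only shows what those groupings look like on raw UTF-8 bytes. The function names `byte_ngrams` and `word_byte_spans` are hypothetical, and treating whitespace as the word boundary is an assumption made for illustration.

```python
import re


def byte_ngrams(data: bytes, n: int) -> list[bytes]:
    """Return every contiguous window of n bytes (sliding window, stride 1)."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def word_byte_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets of each whitespace-delimited word.

    Assumption: words are separated by whitespace. Offsets index into
    text.encode("utf-8"), with end exclusive.
    """
    spans = []
    offset = 0
    # The capturing group keeps the whitespace runs, so offsets stay aligned.
    for piece in re.split(r"(\s+)", text):
        size = len(piece.encode("utf-8"))
        if piece and not piece.isspace():
            spans.append((offset, offset + size))
        offset += size
    return spans


if __name__ == "__main__":
    sentence = "a b\u00e9"  # the second word contains a 2-byte character
    raw = sentence.encode("utf-8")
    print(byte_ngrams(raw, 3))
    print(word_byte_spans(sentence))
```

A model could pool or convolve byte embeddings within such windows or spans before the upper layers. How LOBEF actually does this is specified in the paper, not in this sketch.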
Findings
Consistent improvement over traditional byte-based models.
Outperforms subword techniques in multilingual translation and domain adaptation.
Models are parameter-efficient and train faster.
Abstract
Subword tokenization schemes are the dominant technique used in current NLP models. However, such schemes can be rigid, and tokenizers built on one corpus do not adapt well to other parallel corpora. It has also been observed that in multilingual corpora, subword tokenization schemes over-segment low-resource languages, leading to a drop in translation performance. A simple alternative to subword tokenizers is byte-based methods, i.e., tokenization into byte sequences using encoding schemes such as UTF-8. Byte tokens often represent inputs at a sub-character granularity, i.e., one character can be represented by a sequence of multiple byte tokens. This results in byte sequences that are significantly longer than character sequences. Enforcing aggregation of local information in the lower layers can guide the model to build higher-level semantic information. We propose a Local Byte Fusion…
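The sub-character granularity mentioned in the abstract can be seen directly from Python's built-in UTF-8 encoder. The sketch below is only an illustration of byte-level tokenization in general. The helper name `byte_tokens` and the sample strings are assumptions chosen for the example, not taken from the paper.

```python
def byte_tokens(text: str) -> list[int]:
    """Return the UTF-8 byte values of text, one integer token (0-255) per byte."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    # ASCII letters take 1 byte each; many other characters take 2 to 4 bytes.
    samples = ["cat", "\u00e9", "日本語"]
    for s in samples:
        tokens = byte_tokens(s)
        # len(s) counts Unicode code points, used here as a proxy for characters.
        print(f"{s!r}: {len(s)} code points -> {len(tokens)} byte tokens")
```

For scripts whose characters need three bytes each in UTF-8, the byte sequence is three times as long as the character sequence, which is the length blow-up the abstract refers to.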
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
