Applying the Transformer to Character-level Transduction

Shijie Wu; Ryan Cotterell; Mans Hulden

arXiv:2005.10213 · cs.CL · January 29, 2021

TL;DR

This paper shows that, given a large enough batch size and a simple technique for feature-guided transduction, the transformer outperforms recurrent models on character-level transduction tasks, achieving state-of-the-art results on morphological inflection and historical text normalization.

Contribution

The study shows empirically that batch size, unlike in recurrent sequence-to-sequence models, plays a crucial role in transformer performance on character-level tasks, and it introduces a simple technique for feature-guided character-level transduction that further improves performance.

Findings

01 With a large enough batch size, the transformer outperforms recurrent models on character-level tasks.

02 A simple technique further improves feature-guided character-level transduction.

03 State-of-the-art results are achieved on morphological inflection and historical text normalization.

Abstract

The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks. Yet for character-level transduction tasks, e.g. morphological inflection generation and historical text normalization, there are few works that outperform recurrent models using the transformer. In an empirical study, we uncover that, in contrast to recurrent sequence-to-sequence models, the batch size plays a crucial role in the performance of the transformer on character-level tasks, and we show that with a large enough batch size, the transformer does indeed outperform recurrent models. We also introduce a simple technique to handle feature-guided character-level transduction that further improves performance. With these insights, we achieve state-of-the-art performance on morphological inflection and historical text normalization. We also show…
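The abstract does not spell out the "simple technique" for feature-guided transduction, so the following is only an illustrative sketch of one way such inputs could be prepared. It assumes (this is an assumption, not something stated above) that morphological features form an unordered set, so every feature token shares position 0, characters are counted from 1, and a separate type id marks each token as a feature or a character. The function name `encode_positions` and the example tags are hypothetical.

```python
from typing import List, Sequence, Tuple


def encode_positions(
    features: Sequence[str], chars: Sequence[str]
) -> Tuple[List[str], List[int], List[int]]:
    """Build tokens, position ids, and type ids for a feature-guided input.

    Assumption (illustrative only): features are order-free, so they all
    receive position 0; characters are numbered 1..len(chars).
    Type id 0 marks a feature token, type id 1 marks a character token.
    """
    tokens = list(features) + list(chars)
    position_ids = [0] * len(features) + list(range(1, len(chars) + 1))
    type_ids = [0] * len(features) + [1] * len(chars)
    return tokens, position_ids, type_ids


# Hypothetical inflection input: lemma "walk" with tags V and PST.
tokens, position_ids, type_ids = encode_positions(["V", "PST"], "walk")
print(tokens)        # ['V', 'PST', 'w', 'a', 'l', 'k']
print(position_ids)  # [0, 0, 1, 2, 3, 4]
print(type_ids)      # [0, 0, 1, 1, 1, 1]
```

Under this assumption, reordering the feature tags changes neither the position ids nor the type ids, so a model fed these ids would receive no positional signal that distinguishes one tag order from another.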

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding