Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Kenton Murray, Jeffery Kinnison, Toan Q. Nguyen, Walter Scheirer, David Chiang

arXiv:1910.06717 · cs.CL · October 16, 2019

TL;DR

This paper introduces auto-sizing, which uses regularization to prune neurons from a Transformer during a single training run, yielding smaller models with higher BLEU scores on low-resource machine translation.

Contribution

It presents a novel auto-sizing approach that integrates architecture search into a single training run using regularization to prune neurons, improving efficiency and translation quality.

Findings

01

BLEU scores improved by up to 3.9 points on low-resource pairs

02

Model size reduced by one-third through neuron pruning

03

Architecture search is folded into a single training run, avoiding the many runs that grid or random search require

Abstract

Neural sequence-to-sequence models, particularly the Transformer, are the state of the art in machine translation. Yet these neural networks are very sensitive to architecture and hyperparameter settings. Optimizing these settings by grid or random search is computationally expensive because it requires many training runs. In this paper, we incorporate architecture search into a single training run through auto-sizing, which uses regularization to delete neurons in a network over the course of training. On very low-resource language pairs, we show that auto-sizing can improve BLEU scores by up to 3.9 points while removing one-third of the parameters from the model.
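
The abstract describes regularization that deletes neurons during training. One common way to get that effect is a group regularizer applied through a proximal step after each gradient update: every row of a weight matrix (treated here as one neuron) has its L2 norm shrunk by a fixed amount, and rows whose norm falls below that amount become exactly zero. The following is a minimal sketch under those assumptions, not the authors' implementation; the function names `prox_l21_rows` and `live_rows`, the row-as-neuron layout, and the toy values are all hypothetical.

```python
import math

def prox_l21_rows(weights, step):
    """Group soft-thresholding (assumed L2,1 proximal step).

    Each row's L2 norm is reduced by `step`; a row whose norm is
    at most `step` is set to all zeros, i.e. the neuron is deleted.
    """
    out = []
    for row in weights:
        norm = math.sqrt(sum(w * w for w in row))
        # Scale factor is clipped at zero so small rows vanish entirely.
        scale = max(0.0, 1.0 - step / norm) if norm > 0.0 else 0.0
        out.append([w * scale for w in row])
    return out

def live_rows(weights):
    """Indices of rows (neurons) that still have a nonzero weight."""
    return [i for i, row in enumerate(weights) if any(w != 0.0 for w in row)]

# Toy weight matrix: the middle row is small and should be pruned.
W = [[3.0, 4.0], [0.03, 0.04], [0.0, 1.0]]
W_new = prox_l21_rows(W, step=0.1)
print(live_rows(W_new))  # [0, 2]
```

In a real training loop the `step` value would typically be the learning rate times the regularization strength, and rows that reach zero could be physically removed from the layer to shrink the model.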

Code & Models

Repositories

KentonMurray/ProxGradPytorch
PyTorch · Official

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Random Search · Residual Connection · Adam · Dropout · Multi-Head Attention · Byte Pair Encoding