A Tensorized Transformer for Language Modeling

Xindian Ma; Peng Zhang; Shuai Zhang; Nan Duan; Yuexian Hou; Dawei Song; Ming Zhou

arXiv:1906.09777·cs.CL·November 7, 2019·61 cites

TL;DR

This paper introduces a tensorized self-attention mechanism for Transformers that reduces model size and improves performance on language modeling and translation tasks by leveraging tensor decomposition and parameter sharing.

Contribution

The paper proposes a novel Multi-linear attention model built on Block-Term Tensor Decomposition, reported to be both more compact and more accurate than existing Transformer variants.

Findings

01 Significant reduction in model parameters.

02 Performance improvements on the PTB, WikiText-103, and One-billion language modeling benchmarks.

03 Outperforms Transformer, Transformer-XL, and tensor-train-based models.

Abstract

Recent neural models connect the encoder and decoder through a self-attention mechanism. In particular, Transformer, which is based solely on self-attention, has led to breakthroughs in Natural Language Processing (NLP) tasks. However, the multi-head attention mechanism, a key component of Transformer, limits the effective deployment of the model in resource-limited settings. In this paper, based on the ideas of tensor decomposition and parameter sharing, we propose a novel self-attention model (namely Multi-linear attention) with Block-Term Tensor Decomposition (BTD). We test and verify the proposed attention method on three language modeling tasks (i.e., PTB, WikiText-103 and One-billion) and a neural machine translation task (i.e., WMT-2016 English-German). Multi-linear attention can not only largely compress the model parameters but also obtain performance…
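
To make the combination of tensor decomposition and parameter sharing concrete, the following is a minimal sketch, not the authors' implementation. A Block-Term decomposition writes a tensor as a sum of Tucker-form blocks. The sketch assumes each block has a diagonal core (a vector `g`), that all blocks reuse the same factor matrices `Q`, `K`, and `V`, and that the blocks are averaged; those three choices, and the names `single_block` and `multi_block`, are illustrative assumptions not stated on this page.

```python
def single_block(Q, K, V, g):
    """One Tucker-form block with a diagonal core (assumed for illustration).

    Q, K, V are lists of rows, each row of length R; g holds the R diagonal
    core entries. Returns T with
    T[i][j][m] = sum over r of g[r] * Q[i][r] * K[j][r] * V[m][r].
    """
    R = len(g)
    return [[[sum(g[r] * q[r] * k[r] * v[r] for r in range(R))
              for v in V]
             for k in K]
            for q in Q]


def multi_block(Q, K, V, cores):
    """Average several blocks that share the same Q, K, V (parameter sharing).

    Only the small core vectors differ between blocks, so adding a block
    costs R extra parameters in this sketch.
    """
    blocks = [single_block(Q, K, V, g) for g in cores]
    h = len(blocks)
    return [[[sum(b[i][j][m] for b in blocks) / h
              for m in range(len(V))]
             for j in range(len(K))]
            for i in range(len(Q))]


# Tiny hand-checkable inputs (hypothetical values).
Q = [[1, 0], [0, 1]]
K = [[1, 2], [3, 4]]
V = [[1, 1], [2, 0]]
cores = [[2.0, 0.5], [0.0, 1.5]]

T_one = single_block(Q, K, V, cores[0])
T_avg = multi_block(Q, K, V, cores)
print(len(T_avg), len(T_avg[0]), len(T_avg[0][0]))
```

Because the reconstruction is linear in the core, averaging the blocks in this sketch gives the same tensor as a single block whose core is the average of the cores. The saving comes from storing only `Q`, `K`, `V`, and a few short core vectors instead of separate projections per head.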

Code & Models

Repositories

szhangtju/The-compression-of-Transformer (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Tensor decomposition and applications · Speech Recognition and Synthesis

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Variational Dropout · Adaptive Input Representations · Adaptive Softmax · Linear Warmup With Cosine Annealing · Transformer-XL · Residual Connection