Pretrained Transformers as Universal Computation Engines

Kevin Lu; Aditya Grover; Pieter Abbeel; Igor Mordatch

arXiv:2103.05247 · cs.LG · July 1, 2021 · 99 cites

TL;DR

This paper explores how pretrained language transformers can be adapted to perform well on diverse non-language tasks with minimal modifications, demonstrating their versatility as universal computation engines.

Contribution

The paper introduces the Frozen Pretrained Transformer (FPT), a language-pretrained transformer whose self-attention and feedforward layers are left frozen during finetuning, and shows that it is effective across multiple modalities.
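The freezing rule described here can be sketched in a few lines. This is a minimal illustration, not the authors' code: the parameter names and the `.attn.` / `.mlp.` markers are hypothetical, loosely modeled on a GPT-2-style layout, and the only rule taken from the paper is that self-attention and feedforward weights in the residual blocks are not finetuned.

```python
# Hypothetical markers identifying self-attention and feedforward weights
# inside residual blocks. Real checkpoints may use different names.
FROZEN_MARKERS = (".attn.", ".mlp.")


def is_trainable(param_name: str) -> bool:
    """Return False for self-attention / feedforward parameters, True otherwise."""
    return not any(marker in param_name for marker in FROZEN_MARKERS)


def split_parameters(param_names):
    """Partition parameter names into (trainable, frozen) lists, preserving order."""
    trainable = [name for name in param_names if is_trainable(name)]
    frozen = [name for name in param_names if not is_trainable(name)]
    return trainable, frozen


# Illustrative parameter names for a one-block model (assumed, not from the paper).
EXAMPLE_NAMES = [
    "input_proj.weight",
    "pos_emb.weight",
    "blocks.0.ln_1.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.0.attn.out.weight",
    "blocks.0.ln_2.weight",
    "blocks.0.mlp.fc_in.weight",
    "blocks.0.mlp.fc_out.weight",
    "ln_f.weight",
    "output_head.weight",
]

trainable, frozen = split_parameters(EXAMPLE_NAMES)
print("trainable:", trainable)
print("frozen:", frozen)
```

In a framework such as PyTorch, the frozen set would typically be handled by setting `requires_grad = False` on the matching parameters before building the optimizer.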

Findings

01 Pretraining on language improves non-language task performance.

02 FPT achieves strong results with minimal finetuning.

03 Language-pretrained transformers outperform randomly initialized models.

Abstract

We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, we show that pretraining on natural language can improve performance and compute efficiency on non-language downstream tasks. Additionally, we perform an analysis of the architecture, comparing the performance of a random initialized transformer to a random LSTM. Combining the two insights, we find language-pretrained transformers…

Code & Models

Videos

Pretrained Transformers as Universal Computation Engines (Machine Learning Research Paper Explained) · YouTube

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Dropout · Attention Is All You Need · Layer Normalization · Adam