Pretrained Transformers as Universal Computation Engines
Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch

TL;DR
This paper explores how pretrained language transformers can be adapted to perform well on diverse non-language tasks with minimal modifications, demonstrating their versatility as universal computation engines.
Contribution
The paper introduces the Frozen Pretrained Transformer (FPT) and shows that it is effective across multiple modalities without finetuning the self-attention or feedforward layers.
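As a rough illustration of what "frozen" means here, the sketch below splits a list of parameter names into a frozen group (self-attention and feedforward weights) and a trainable group (everything else). It is not the authors' code. The `.attn.` and `.mlp.` markers assume GPT-2-style parameter naming, and `input_proj` and `output_head` are hypothetical names for task-specific input and output layers.

```python
# Sketch only: choose which parameters to freeze, based on their names.
# Assumption: GPT-2-style names, where ".attn." marks self-attention weights
# and ".mlp." marks feedforward weights inside each residual block.
FROZEN_MARKERS = (".attn.", ".mlp.")


def is_frozen(param_name: str) -> bool:
    """Return True if the parameter belongs to a self-attention or feedforward layer."""
    return any(marker in param_name for marker in FROZEN_MARKERS)


def split_parameters(param_names):
    """Split names into (frozen, trainable) lists, preserving input order."""
    frozen = [name for name in param_names if is_frozen(name)]
    trainable = [name for name in param_names if not is_frozen(name)]
    return frozen, trainable


example_names = [
    "input_proj.weight",       # hypothetical task-specific input layer
    "wpe.weight",              # positional embeddings
    "h.0.ln_1.weight",         # layer norm inside block 0
    "h.0.attn.c_attn.weight",  # self-attention, frozen
    "h.0.mlp.c_fc.weight",     # feedforward, frozen
    "ln_f.weight",             # final layer norm
    "output_head.weight",      # hypothetical task-specific output layer
]

frozen, trainable = split_parameters(example_names)
print("frozen:", frozen)
print("trainable:", trainable)
```

With a PyTorch model, the same predicate could be applied while iterating over `model.named_parameters()`, setting `param.requires_grad = False` on the frozen group so that only the remaining parameters receive gradient updates.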
Findings
Pretraining on language improves non-language task performance.
FPT achieves strong results with minimal finetuning.
Language-pretrained transformers outperform randomly initialized models.
Abstract
We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, we show that pretraining on natural language can improve performance and compute efficiency on non-language downstream tasks. Additionally, we perform an analysis of the architecture, comparing the performance of a random initialized transformer to a random LSTM. Combining the two insights, we find language-pretrained transformers…
Videos
Pretrained Transformers as Universal Computation Engines (Machine Learning Research Paper Explained) · YouTube
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Dropout · Attention Is All You Need · Layer Normalization · Adam
