Efficient GPT Model Pre-training using Tensor Train Matrix Representation

Viktoriia Chekalina; Georgii Novikov; Julia Gusak; Ivan Oseledets; Alexander Panchenko

arXiv:2306.02697·cs.AI·June 6, 2023·1 cites

Open Access

TL;DR

This paper introduces a tensor train matrix-based approach to reduce parameters in GPT-2, maintaining performance while decreasing model size and training costs.

Contribution

It proposes replacing fully-connected layer matrices with tensor train matrices in GPT-2, enabling parameter reduction and efficient training.

Findings

01. The model stores up to 40% fewer parameters.

02. Perplexity is comparable to that of the original GPT-2.

03. The model performs similarly to the original GPT-2 on downstream tasks.

Abstract

Large-scale transformer models have shown remarkable performance in language modelling tasks. However, such models feature billions of parameters, which makes them difficult to deploy and prohibitively expensive to train from scratch. To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure. We also customize the forward and backward operations through the TTM-based layer for simplicity and for the stability of further training. The resulting GPT-2-based model stores up to 40% fewer parameters while showing perplexity comparable to the original model. On downstream tasks, including language understanding and text summarization, the model performs similarly to the original GPT-2 model. The proposed tensorized layers could be used to efficiently pre-train other…
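
To make the idea of a TTM-based fully-connected layer concrete, here is a minimal NumPy sketch. It is not the authors' implementation, and it does not reproduce their customized forward and backward operations. The mode factorization (768 = 8·8·12, 3072 = 12·16·16), the TT-ranks (1, 16, 16, 1), and all function names are illustrative assumptions, not values taken from the paper. The sketch stores a weight matrix as a chain of 4-way cores, counts its parameters, and applies it to a batch of inputs without forming the dense matrix.

```python
import numpy as np

def ttm_param_count(in_modes, out_modes, ranks):
    """Parameters stored by TTM cores of shape (r_k, i_k, j_k, r_{k+1})."""
    return sum(ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
               for k in range(len(in_modes)))

def random_ttm_cores(in_modes, out_modes, ranks, seed=0):
    """Random cores; ranks has length d + 1 with ranks[0] == ranks[-1] == 1."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((ranks[k], in_modes[k], out_modes[k], ranks[k + 1]))
            for k in range(len(in_modes))]

def ttm_to_dense(cores):
    """Contract all cores into the dense (prod(in_modes), prod(out_modes)) matrix."""
    full = cores[0]                                   # (1, i_1, j_1, r_1)
    for core in cores[1:]:
        # Join on the shared rank, keeping input and output modes separate.
        full = np.einsum('aijr,rklb->aikjlb', full, core)
        a, I, i, J, j, b = full.shape
        full = full.reshape(a, I * i, J * j, b)
    return full[0, :, :, 0]

def ttm_matvec(cores, x):
    """Compute x @ W core by core, never materializing the dense W."""
    batch = x.shape[0]
    # state layout: (batch, outputs produced so far, current rank, inputs left)
    state = x.reshape(batch, 1, 1, -1)
    for core in cores:
        r, i, j, r_next = core.shape
        b, J, _, rest = state.shape
        state = state.reshape(b, J, r, i, rest // i)
        # Sum over the current rank r and the current input mode i.
        state = np.einsum('bJrim,rijs->bJjsm', state, core)
        state = state.reshape(b, J * j, r_next, rest // i)
    return state.reshape(batch, -1)

# Illustrative (assumed) shapes for a 768 -> 3072 feed-forward projection.
in_modes, out_modes, ranks = (8, 8, 12), (12, 16, 16), (1, 16, 16, 1)
cores = random_ttm_cores(in_modes, out_modes, ranks)

dense_params = int(np.prod(in_modes)) * int(np.prod(out_modes))
tt_params = ttm_param_count(in_modes, out_modes, ranks)
print(f"dense: {dense_params} parameters, TTM: {tt_params} parameters")

x = np.random.default_rng(1).standard_normal((2, int(np.prod(in_modes))))
y_dense = x @ ttm_to_dense(cores)
y_tt = ttm_matvec(cores, x)
print("outputs match:", np.allclose(y_dense, y_tt))
```

The per-layer saving in this toy setting is much larger than the overall 40% reported above. The sketch does not try to reproduce that figure, which depends on the ranks chosen, on which layers are tensorized, and on the parts of the model that stay dense.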

Taxonomy

Topics: Topic Modeling · Computational Physics and Python Applications · Machine Learning in Healthcare

Methods: Attention Is All You Need · Cosine Annealing · Label Smoothing · Absolute Position Encodings · Linear Layer · Position-Wise Feed-Forward Layer · Layer Normalization · Byte Pair Encoding · Softmax · Linear Warmup With Cosine Annealing