Dynamic Layer Tying for Parameter-Efficient Transformers

Tamir David Hay; Lior Wolf

arXiv:2401.12819 · cs.LG · January 24, 2024 · 1 citation

Open Access · 3 Reviews

TL;DR

This paper introduces a reinforcement learning-based method for dynamically tying transformer layers to reduce parameters and memory usage, while maintaining or improving performance.

Contribution

It presents a novel dynamic layer tying approach using RL to select and share layers during training, enhancing efficiency and regularization.

Findings

01. The model modestly outperforms the baseline transformer in perplexity.

02. The number of trainable parameters is drastically reduced.

03. Memory consumption during training is up to one order of magnitude lower than with conventional training.

Abstract

In the pursuit of reducing the number of trainable parameters in deep transformer networks, we employ Reinforcement Learning to dynamically select layers during training and tie them together. Every few iterations, the RL agent is asked whether to train each layer $i$ independently or to copy the weights of a previous layer $j < i$ . This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique. Experimental evaluations validate that our model modestly outperforms the baseline transformer model with regard to perplexity and drastically reduces the number of trainable parameters. In particular, the memory consumption during training is up to one order of magnitude less than the conventional training method.
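The tying decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`resolve_ties`, `sample_actions`, `count_trainable`), the epsilon-greedy selection over a plain table of Q-values, and the convention that layer 0 always trains independently are all assumptions made for illustration. An action `j == i` means "train layer `i` independently", and `j < i` means "copy the weights of layer `j`".

```python
import random


def resolve_ties(actions):
    """Map each layer to the layer that owns its weights.

    actions[i] == i  -> layer i trains its own weights.
    actions[i] == j  -> layer i reuses the weights of an earlier layer j < i.
    If layer j is itself tied, layer i ends up sharing j's owner.
    """
    owners = []
    for i, j in enumerate(actions):
        if not 0 <= j <= i:
            raise ValueError(f"layer {i} can only tie to layers 0..{i}, got {j}")
        owners.append(i if j == i else owners[j])
    return owners


def sample_actions(num_layers, q_values, epsilon, rng):
    """Hypothetical epsilon-greedy policy: q_values[i] holds i + 1 scores,
    one per admissible choice j in 0..i for layer i."""
    actions = [0]  # assumption: layer 0 has no earlier layer to copy
    for i in range(1, num_layers):
        if rng.random() < epsilon:
            actions.append(rng.randint(0, i))  # explore: any j <= i
        else:
            row = q_values[i]
            actions.append(max(range(i + 1), key=row.__getitem__))  # exploit
    return actions


def count_trainable(owners, params_per_layer):
    """Only layers that own their weights contribute trainable parameters."""
    return len(set(owners)) * params_per_layer


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative Q-values only; a real agent would learn these from reward.
    q = [[0.0] * (i + 1) for i in range(6)]
    acts = sample_actions(6, q, epsilon=0.5, rng=rng)
    owners = resolve_ties(acts)
    print(acts, owners, count_trainable(owners, params_per_layer=10))
```

In this sketch the number of trainable parameters depends only on how many distinct owners remain, which is one way to see why tying more layers shrinks the trainable set. How the reward is defined and how often the agent is queried are left out here, since the abstract says only "every few iterations".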

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. Large transformer models have demonstrated excellent performance, but they are often very expensive in terms of computational resources. The paper addresses a very timely problem: ensuring transformer models can be applied in a practical setting without excessive cost. 2. The proposed method, parameter tying via Q-learning, is an intuitive and sensible approach to this problem. 3. The quantitative results, on the axes of both performance and computational resources, are convincing.

Weaknesses

1. The approach is very specific in its focus. The only architecture considered is the transformer, and evaluation is limited to transformers in the language domain. It is unclear whether these results hold for transformers in other data domains (such as vision transformers). It is also unclear whether this approach (specifically parameter tying) could work for other neural architectures.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. By employing a simple deep reinforcement learning algorithm, the number of trainable parameters of GPT-2 and BERT has been significantly reduced. The algorithm is easy to implement and holds the potential for application in other Transformer-based neural networks. 2. The algorithm has been thoroughly analyzed, with a detailed explanation of the reasons for its effectiveness.

Weaknesses

1. The Methods section contains a considerable amount of redundant content, providing a step-by-step explanation of the algorithmic process. 2. The experimental results only compare the proposed method with the conventional training method, lacking comparison with other baselines and related methods. 3. The related work is not adequately summarized. Only one publication after 2020 appears in the Related Work section, and there is a lack of research focusing on improvin

Reviewer 03 · Rating 10 (strong accept, should be highlighted at the conference) · Confidence 4

Strengths

1. Clear, concise and precise writing. 2. The idea attempted itself is intriguing. 3. The idea is executed very well, resulting in excellent outcomes. 4. The results are very promising. 5. The results open up a lot of new questions, as well as motivate the need for research in the pathway the authors have opened up. 6. The paper represents what could be a seminal moment in architecture search methods, as well as understanding modularity in NNs.

Weaknesses

1. The method section, while technically sound, suffers from a lack of clarity as to the method being presented. For such a significant contribution, it is important to ensure that one can understand the fundamentals of the method without having to crunch through all of the equations presented and fill many of the gaps with one's own imagination. Perhaps something like a functional diagram, a more intuitive algorithm, or just a page that lays out the ingredients one by one, as well as how the

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Neural Networks and Applications