Fast-ELECTRA for Efficient Pre-training
Chengyu Dong, Liyuan Liu, Hao Cheng, Jingbo Shang, Jianfeng Gao, Xiaodong Liu

TL;DR
Fast-ELECTRA improves pre-training efficiency by replacing the auxiliary model with an existing language model and using a temperature-scaled curriculum, achieving comparable performance with reduced computational costs.
Contribution
Fast-ELECTRA replaces the jointly trained auxiliary model with an existing language model, and smooths that model's output distribution with a temperature that follows a descending schedule, which serves as a learning curriculum for the main model.
Findings
Achieves similar performance to state-of-the-art ELECTRA pre-training.
Reduces training cost and memory usage significantly.
Improves training stability and hyper-parameter robustness.
Abstract
ELECTRA pre-trains language models by detecting tokens in a sequence that have been replaced by an auxiliary model. Although ELECTRA offers a significant boost in efficiency, its potential is constrained by the training cost brought by the auxiliary model. Notably, this model, which is jointly trained with the main model, only serves to assist the training of the main model and is discarded post-training. This results in a substantial amount of training cost being expended in vain. To mitigate this issue, we propose Fast-ELECTRA, which leverages an existing language model as the auxiliary model. To construct a learning curriculum for the main model, we smooth its output distribution via temperature scaling following a descending schedule. Our approach rivals the performance of state-of-the-art ELECTRA-style pre-training methods, while significantly eliminating the computation and memory…
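The curriculum described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear decay, the start and end temperatures, and the names `temperature_schedule`, `scaled_softmax`, and `corrupt` are all assumptions made for illustration, since the abstract only states that the temperature follows a descending schedule. The sketch shows how dividing the auxiliary model's logits by a temperature above 1 flattens its output distribution, and how the sampled replacements yield the binary labels that the main model learns to detect.

```python
import math
import random


def temperature_schedule(step, total_steps, t_start=2.0, t_end=1.0):
    """Hypothetical descending schedule: linear decay from t_start to t_end.

    The decay shape and the endpoint values are assumptions, not taken
    from the paper.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + (t_end - t_start) * frac


def scaled_softmax(logits, temperature):
    """Softmax over logits / temperature; a larger temperature is flatter."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def corrupt(tokens, aux_logits, temperature, rng):
    """Replace selected positions with samples from the smoothed distribution.

    tokens: list of token ids.
    aux_logits: dict mapping a position to the (fixed) auxiliary model's
        logits over the vocabulary at that position.
    Returns the corrupted sequence and per-token labels (1 = replaced).
    """
    corrupted = list(tokens)
    for pos, logits in aux_logits.items():
        probs = scaled_softmax(logits, temperature)
        corrupted[pos] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # A sampled token that equals the original counts as "not replaced".
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [3, 1, 4, 1, 5]
    aux_logits = {1: [0.1, 4.0, 0.2, 0.0, 0.3, 0.1], 3: [0.0, 3.5, 0.5, 0.2, 0.1, 0.4]}
    for step in (0, 50, 100):
        t = temperature_schedule(step, total_steps=100)
        print(step, t, corrupt(tokens, aux_logits, t, rng))
```

In this sketch the auxiliary logits are simply given as inputs, mirroring the idea that the auxiliary model is an existing model rather than one trained jointly with the main model.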
Peer Reviews
Decision · ICLR 2024 poster
Novelty: The idea of using a fixed aux model for efficiency is interesting, and novel to my knowledge (although I'm not entirely sure since I could not find much discussion about this). Similarly, the idea of using decaying temperature as a curriculum in this context is quite interesting.
Quality: The paper provides a nice analysis of computation and memory benefits of the method.
Clarity: The paper is easy to follow for the most part. Connections to prior work and some other details could be presen
Comparison to prior work: - It would be helpful to highlight the most relevant work in Table 1 that a reader should focus on. Additionally, is there any evaluation on prior work that uses a fixed generator? It would also help to include some FLOPs comparison to the baselines used in Table 1. The paper would also benefit from a discussion of the accuracy-efficiency tradeoff. The lack of such discussion made it harder to assess the full value of the proposed method. - Recent paper (Dong et al.) from ICML 2023 p
1. Simple and effective method. 2. Good performance. 3. Very clear presentation.
The scale of models in experiments seems a bit limited under the current standard. Have you tried larger models?
1. The method is both intuitive and effective. 2. The problem it tackles is highly practical.
1. The proposed method appears tailored specifically for ELECTRA, potentially limiting its applicability and community interest. 2. Could we consider applying a continual learning method (e.g., [1]) to enhance ELECTRA's efficiency? [1]: Adapting a Language Model While Preserving its General Knowledge, Ke et al., EMNLP 2022
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Linear Warmup With Linear Decay · Residual Connection · WordPiece · Adam · Layer Normalization · Attention Dropout
