Multi-armed bandits for resource efficient, online optimization of language model pre-training: the use case of dynamic masking
Iñigo Urteaga, Moulay-Zaïdane Draïdia, Tomer Lancewicki, and Shahram Khadivi

TL;DR
This paper introduces a Bayesian optimization framework using multi-armed bandits and Gaussian processes to efficiently select hyperparameters during language model pre-training, reducing computational costs and improving performance.
Contribution
It presents a novel Thompson sampling approach with Gaussian process reward modeling for dynamic hyperparameter selection in Transformer-based language model (TLM) pre-training, avoiding costly grid searches.
Findings
GP-TS accelerates pre-training by reducing the number of epochs needed.
GP-TS achieves lower MLM loss than pre-training with fixed masking hyperparameters.
Models pre-trained with GP-TS attain competitive downstream performance.
Abstract
We design and evaluate a Bayesian optimization framework for resource efficient pre-training of Transformer-based language models (TLMs). TLM pre-training requires high computational resources and introduces many unresolved design choices, such as selecting its pre-training hyperparameters. We propose a multi-armed bandit framework for the sequential selection of TLM pre-training hyperparameters, aimed at optimizing language model performance, in a resource efficient manner. We design a Thompson sampling algorithm, with a surrogate Gaussian process reward model of the Masked Language Model (MLM) pre-training objective, for its sequential minimization. Instead of MLM pre-training with fixed masking probabilities, the proposed Gaussian process-based Thompson sampling (GP-TS) accelerates pre-training by sequentially selecting masking hyperparameters that improve performance. We empirically…
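To make the selection loop described in the abstract concrete, the following is a minimal sketch of Gaussian process-based Thompson sampling over a discrete grid of masking probabilities. It is not the authors' implementation: the RBF kernel, its length scale, the noise level, the arm grid, and `toy_reward` (a hypothetical stand-in for a reward derived from the MLM loss after a pre-training interaction) are all assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.1, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of arm values."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_grid, noise=1e-2):
    """Posterior mean and covariance of a zero-mean GP evaluated on x_grid."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_go = rbf_kernel(x_grid, x_obs)
    k_gg = rbf_kernel(x_grid, x_grid)
    mean = k_go @ np.linalg.solve(k_oo, y_obs)
    cov = k_gg - k_go @ np.linalg.solve(k_oo, k_go.T)
    return mean, cov


def toy_reward(p, rng):
    """Hypothetical reward: peaks at p = 0.2, with small observation noise.

    In a real run this would be computed from the MLM loss observed after
    pre-training for one interaction with masking probability p.
    """
    return -(p - 0.2) ** 2 + 0.01 * rng.standard_normal()


def gp_thompson_sampling(n_rounds=30, seed=0):
    """Sequentially pick masking probabilities by sampling from the GP posterior."""
    rng = np.random.default_rng(seed)
    arms = np.linspace(0.05, 0.5, 10)  # assumed candidate masking probabilities
    x_obs, y_obs = [], []
    for _ in range(n_rounds):
        if not x_obs:
            # No data yet: pick the first arm uniformly at random.
            idx = int(rng.integers(len(arms)))
        else:
            mean, cov = gp_posterior(np.array(x_obs), np.array(y_obs), arms)
            # Symmetrize and add jitter for numerical stability before sampling.
            cov = 0.5 * (cov + cov.T) + 1e-8 * np.eye(len(arms))
            sample = rng.multivariate_normal(mean, cov)
            # Thompson sampling: play the arm that is best under the sampled function.
            idx = int(np.argmax(sample))
        p = arms[idx]
        x_obs.append(p)
        y_obs.append(toy_reward(p, rng))
    return arms, np.array(x_obs), np.array(y_obs)


if __name__ == "__main__":
    arms, chosen, rewards = gp_thompson_sampling()
    print("masking probabilities played:", np.round(chosen, 2))
```

Drawing one function from the posterior and playing its maximizer is what balances exploration and exploitation here: arms with uncertain rewards occasionally produce high samples and get tried, while arms the model is confident about are played in proportion to how likely they are to be optimal.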
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
Methods: Gaussian Process
