AlphaAdam: Asynchronous Masked Optimization with Dynamic Alpha for Selective Updates
Da Chang, Yu Li, Ganzhao Yuan

TL;DR
AlphaAdam introduces an innovative optimization framework for large language models that employs dynamic intra-layer parameter masking and adaptive update strengths, leading to faster convergence and enhanced training stability.
Contribution
It proposes a novel intra-layer asynchronous masked optimization method with dynamic alpha adjustment, improving efficiency and stability over existing optimizers for LLM training.
Findings
Outperforms AdamW in convergence speed
Enhances training stability for LLMs
Applicable to various large models like GPT-2, RoBERTa, and Llama-7B
Abstract
In the training of large language models (LLMs), updating parameters efficiently and stably has long been an important challenge. To make parameter updates more efficient, existing methods typically aim for performance comparable to full-parameter updates through techniques such as low-dimensional decomposition or layer-wise selective updates. In this work, we propose AlphaAdam, an optimization framework for LLMs from the perspective of intra-layer parameter updates. By decoupling parameter updates and dynamically adjusting their strength, AlphaAdam accelerates convergence and improves training stability. We construct parameter masks based on the consistency of historical momentum and gradient direction and combine them with an adaptive mask strength strategy to ensure efficient optimization and theoretical convergence guarantees, which is also applicable to most momentum-based…
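The abstract is truncated here and does not give the exact update rule, so the following is only a minimal NumPy sketch of the general idea it describes: a mask built from sign agreement between the historical momentum and the current gradient, plus a scalar strength applied to the masked update. The function name `masked_adam_step`, the sign-agreement test, and the norm-matching rule for `alpha` are all illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

def masked_adam_step(param, grad, m, v, step, lr=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative masked Adam-style step (a sketch, not the paper's method)."""
    # ASSUMPTION: keep coordinates where the previous momentum and the current
    # gradient agree in sign; zero momentum counts as agreeing.
    mask = (m * grad >= 0).astype(grad.dtype)

    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    update = m_hat / (np.sqrt(v_hat) + eps)

    # Apply the mask inside the layer: masked-out coordinates are not updated.
    masked = mask * update

    # ASSUMPTION: a hypothetical "dynamic alpha" that rescales the masked update
    # so its norm matches the norm of the full (unmasked) update.
    masked_norm = np.linalg.norm(masked)
    alpha = np.linalg.norm(update) / masked_norm if masked_norm > 0 else 0.0

    new_param = param - lr * alpha * masked
    return new_param, m, v, mask, alpha
```

Calling the function once per step with the parameter array, its gradient, and the running moments `m` and `v` returns the updated values together with the mask and the strength used, which makes it easy to inspect how many coordinates were skipped. With this particular choice `alpha` is never below 1 when any coordinate is kept, since masking can only shrink the update norm.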
Taxonomy
Topics: Parallel Computing and Optimization Techniques
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · WordPiece · BERT · Cosine Annealing · Adam · Dropout · Byte Pair Encoding · Residual Connection
