AlphaAdam: Asynchronous Masked Optimization with Dynamic Alpha for Selective Updates

Da Chang; Yu Li; Ganzhao Yuan

arXiv:2501.18094·cs.LG·February 6, 2025

Open Access

TL;DR

AlphaAdam is an optimization framework for large language models that masks parameters dynamically within each layer and adapts the update strength, leading to faster convergence and more stable training.

Contribution

The paper proposes an intra-layer asynchronous masked optimization method with dynamic alpha adjustment, improving efficiency and stability over existing optimizers for LLM training.

Findings

01. Outperforms AdamW in convergence speed

02. Enhances training stability for LLMs

03. Applicable to various large models such as GPT-2, RoBERTa, and Llama-7B

Abstract

In the training of large language models (LLMs), updating parameters efficiently and stably has long been an important challenge. Existing methods typically pursue efficiency through low-dimensional decomposition or layer-wise selective updates, aiming for performance comparable to full-parameter updates. In this work, we propose AlphaAdam, an optimization framework for LLMs that takes the perspective of intra-layer parameter updates. By decoupling parameter updates and dynamically adjusting their strength, AlphaAdam accelerates convergence and improves training stability. We construct parameter masks based on the consistency between historical momentum and the gradient direction, and combine them with an adaptive mask strength strategy to ensure efficient optimization with theoretical convergence guarantees, which is also applicable to most momentum-based…
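The masking idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the abstract does not give the exact mask or alpha rules, so the sign-agreement mask, the norm-matching `alpha`, the all-ones fallback, and the function name `masked_adam_step` are all assumptions made for illustration. The sketch applies a plain Adam-style step only to coordinates where the previous momentum and the current gradient agree in sign, then rescales the masked update so its norm matches that of the unmasked one.

```python
import numpy as np


def masked_adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One toy masked Adam-style step (illustrative only, not the paper's method)."""
    # Assumed mask rule: keep coordinates where the previous momentum and
    # the current gradient point in the same direction.
    mask = (m * grad > 0).astype(param.dtype)
    if not mask.any():
        # Assumed fallback: if nothing agrees (e.g. zero momentum at step 1),
        # update every coordinate.
        mask = np.ones_like(param)

    # Standard Adam moment estimates with bias correction.
    m_new = beta1 * m + (1 - beta1) * grad
    v_new = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m_new / (1 - beta1 ** t)
    v_hat = v_new / (1 - beta2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps)

    masked_update = mask * update
    # Hypothetical alpha: rescale the masked update so that its L2 norm
    # matches the L2 norm of the full (unmasked) update.
    masked_norm = np.linalg.norm(masked_update)
    alpha = np.linalg.norm(update) / masked_norm if masked_norm > 0 else 1.0

    new_param = param - lr * alpha * masked_update
    return new_param, m_new, v_new, mask, alpha


# Example: the second coordinate's momentum and gradient disagree in sign,
# so that coordinate is left untouched on this step.
param = np.array([1.0, -2.0, 3.0])
grad = np.array([0.5, -0.5, 0.5])
m = np.array([0.1, 0.1, 0.1])
v = np.zeros(3)
new_param, m, v, mask, alpha = masked_adam_step(param, grad, m, v, t=2)
print(mask, alpha, new_param)
```

Note that the moment estimates are still updated for every coordinate, including masked ones; only the parameter update is selective. Whether AlphaAdam does the same is not stated in the abstract.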

Taxonomy

Topics: Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · WordPiece · BERT · Cosine Annealing · Adam · Dropout · Byte Pair Encoding · Residual Connection