COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

Haocheng Xi; Han Cai; Ligeng Zhu; Yao Lu; Kurt Keutzer; Jianfei Chen; Song Han

arXiv:2410.19313 · cs.LG · February 14, 2025

TL;DR

COAT is an FP8 training framework that reduces memory usage and speeds up training of large models by compressing optimizer states and activations with two quantization techniques: dynamic range expansion and mixed-granularity activation quantization.

Contribution

The paper introduces COAT, which applies dynamic range expansion to optimizer states and mixed-granularity quantization to activations, reducing memory use and speeding up FP8 training.
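To make the idea of dynamic range expansion concrete, here is a minimal NumPy sketch. It rests on several assumptions that this page does not confirm: the expansion function is taken to be a sign-preserving power function `f(x) = sign(x)|x|^k`, with `k` chosen so that the data's max/min ratio matches that of the E4M3 format; FP8 rounding is simulated in float64 by the hypothetical helper `round_to_e4m3`; and the whole tensor is treated as one quantization group. It illustrates the principle and is not the authors' implementation.

```python
import numpy as np

E4M3_MAX = 448.0       # largest finite E4M3 value
E4M3_MIN = 2.0 ** -9   # smallest positive (subnormal) E4M3 value

def round_to_e4m3(x):
    """Simulate E4M3 rounding in float64: 3 mantissa bits, clamp to +/-448."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    # Binade exponent, floored at -6 so smaller values use the subnormal step.
    exp = np.floor(np.log2(np.maximum(np.abs(x), 2.0 ** -6)))
    step = 2.0 ** (exp - 3)
    return np.round(x / step) * step

def quantize_plain(x):
    """Scale so max|x| maps to the FP8 maximum, round, then rescale."""
    scale = np.max(np.abs(x)) / E4M3_MAX
    return round_to_e4m3(x / scale) * scale

def quantize_expanded(x):
    """Assumed expander: y = sign(x)|x/amax|^k, quantize y, then invert.

    Assumes x has at least one nonzero entry.
    """
    mag = np.abs(x)
    amax = mag.max()
    r_x = amax / mag[mag > 0].min()     # dynamic range of the data
    r_fp8 = E4M3_MAX / E4M3_MIN         # dynamic range of the format
    k = np.log(r_fp8) / np.log(max(r_x, 1.0 + 1e-6))
    y = np.sign(x) * (mag / amax) ** k  # values now span the FP8 range
    y_hat = quantize_plain(y)
    return np.sign(y_hat) * amax * np.abs(y_hat) ** (1.0 / k)

# Synthetic data with a narrow dynamic range (magnitudes in [1, 2]).
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 2.0, size=4096) * rng.choice([-1.0, 1.0], size=4096)

err_plain = np.mean((x - quantize_plain(x)) ** 2)
err_expanded = np.mean((x - quantize_expanded(x)) ** 2)
print("MSE without expansion:", err_plain)
print("MSE with expansion:   ", err_expanded)
```

When the data occupies only a sliver of the format's range, plain scaling uses just a few of the available FP8 values. Spreading the data across the full range before rounding uses many more of them, which is why the error drops in this synthetic case.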

Findings

01 Reduces training memory footprint by 1.54x compared to BF16.

02 Achieves nearly lossless performance across multiple tasks.

03 Provides a 1.43x speedup over BF16 training.

Abstract

FP8 training has emerged as a promising method for improving training efficiency. Existing frameworks accelerate training by applying FP8 computation to linear layers while leaving optimizer states and activations in higher precision, which fails to fully optimize memory usage. This paper introduces COAT (Compressing Optimizer States and Activations for FP8 Training), a novel FP8 training framework designed to significantly reduce memory footprint when training large models. COAT addresses current limitations through two key innovations: (1) Dynamic Range Expansion, which aligns optimizer state distributions more closely with the FP8 representation range, thereby reducing quantization error, and (2) Mixed-Granularity Activation Quantization, which optimizes activation memory using a combination of per-tensor and per-group quantization strategies. Experiments demonstrate that COAT…
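The difference between the per-tensor and per-group strategies named in the abstract can be seen in a small NumPy sketch. The abstract does not say which tensors get which granularity, so the code only compares the two on synthetic data with one outlier. The helper `round_to_e4m3`, the group size of 16, and the function names are illustrative assumptions, and FP8 rounding is simulated in float64 rather than run on hardware.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value

def round_to_e4m3(x):
    """Simulate E4M3 rounding in float64: 3 mantissa bits, clamp to +/-448."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    # Binade exponent, floored at -6 so smaller values use the subnormal step.
    exp = np.floor(np.log2(np.maximum(np.abs(x), 2.0 ** -6)))
    step = 2.0 ** (exp - 3)
    return np.round(x / step) * step

def quantize_per_tensor(x):
    """One scale for the whole tensor."""
    scale = max(np.max(np.abs(x)), 1e-12) / E4M3_MAX
    return round_to_e4m3(x / scale) * scale

def quantize_per_group(x, group_size=16):
    """One scale per contiguous group; x.size must be divisible by group_size."""
    groups = x.reshape(-1, group_size)
    amax = np.max(np.abs(groups), axis=1, keepdims=True)
    scale = np.maximum(amax, 1e-12) / E4M3_MAX
    return (round_to_e4m3(groups / scale) * scale).reshape(x.shape)

# Small-magnitude values plus a single large outlier.
rng = np.random.default_rng(0)
x = rng.normal(scale=1e-3, size=1024)
x[0] = 1e3

err_tensor = np.mean((x - quantize_per_tensor(x)) ** 2)
err_group = np.mean((x - quantize_per_group(x)) ** 2)
print("per-tensor MSE:", err_tensor)
print("per-group MSE: ", err_group)
```

With a single scale, the outlier forces most small values to round to zero. With per-group scales, the damage is confined to the outlier's own group, at the cost of storing one scale per group instead of one per tensor.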

Code & Models

Repositories

nvlabs/coat (PyTorch, official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing