SparseOptimizer: Sparsify Language Models through Moreau-Yosida Regularization and Accelerate via Compiler Co-design

Fu-Ming Guo

arXiv:2306.15656·cs.LG·July 19, 2023

TL;DR

This paper presents SparseOptimizer, a regularization-based optimizer that induces sparsity in large language models, enabling significant acceleration through compiler co-design while maintaining comparable performance.

Contribution

It introduces a novel optimizer using Moreau-Yosida regularization with an embedded shrinkage operator, enabling universal sparsification without code changes and demonstrating inference speedups.
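To make the idea of an embedded shrinkage operator concrete, here is a minimal sketch, not the paper's implementation. It assumes the sparsity penalty is an L1 norm, in which case the minimizer of the Moreau-Yosida envelope (the proximal operator) has the closed form known as soft-thresholding. The function names `soft_threshold` and `sparse_sgd_step`, the plain gradient step, and all numeric values are illustrative assumptions.

```python
from typing import List


def soft_threshold(w: float, threshold: float) -> float:
    """Proximal operator of threshold * |w|: shrink w toward zero.

    Assumption: an L1 penalty; the summary above does not state the exact penalty.
    """
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0  # weights within [-threshold, threshold] become exactly zero


def sparse_sgd_step(
    weights: List[float], grads: List[float], lr: float, l1_strength: float
) -> List[float]:
    """One plain gradient step followed by shrinkage (a proximal gradient step)."""
    return [
        soft_threshold(w - lr * g, lr * l1_strength)
        for w, g in zip(weights, grads)
    ]


# Hypothetical weights and gradients, chosen only for illustration.
weights = [0.5, -0.03, 0.02, -0.8]
grads = [0.1, 0.0, 0.0, -0.1]

updated = sparse_sgd_step(weights, grads, lr=0.1, l1_strength=0.5)
sparsity = sum(1 for w in updated if w == 0.0) / len(updated)
print(updated, sparsity)
```

Because the shrinkage is applied inside the update step, small weights are set to exactly zero during training rather than pruned afterwards, which is one way an optimizer could sparsify a model without changes to the model code.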

Findings

01

Sparse models achieve comparable accuracy to dense models on benchmarks.

02

Compiler co-design accelerates inference by up to 7.15x.

03

The optimizer is plug-and-play, and its shrinkage operator has an analytical solution backed by a theoretical framework.

Abstract

This paper introduces SparseOptimizer, a novel deep learning optimizer that exploits Moreau-Yosida regularization to naturally induce sparsity in large language models such as BERT, ALBERT and GPT. Key to the design of SparseOptimizer is an embedded shrinkage operator, which imparts sparsity directly within the optimization process. This operator, backed by a sound theoretical framework, includes an analytical solution, thereby reinforcing the optimizer's robustness and efficacy. Crucially, SparseOptimizer's plug-and-play functionality eradicates the need for code modifications, making it a universally adaptable tool for a wide array of large language models. Empirical evaluations on benchmark datasets such as GLUE, RACE, SQuAD1, and SQuAD2 confirm that SparseBERT and SparseALBERT, when sparsified using SparseOptimizer, achieve performance comparable to their dense counterparts, BERT…

Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · LAMB · Linear Layer · Linear Warmup With Cosine Annealing · Layer Normalization · Attention Dropout · WordPiece · Dense Connections