SparseOptimizer: Sparsify Language Models through Moreau-Yosida Regularization and Accelerate via Compiler Co-design
Fu-Ming Guo

TL;DR
This paper presents SparseOptimizer, a regularization-based optimizer that induces sparsity in large language models. Combined with compiler co-design, the sparse models run inference substantially faster while keeping accuracy comparable to their dense counterparts.
Contribution
It introduces a novel optimizer using Moreau-Yosida regularization with an embedded shrinkage operator, enabling universal sparsification without code changes and demonstrating inference speedups.
Findings
Sparse models achieve comparable accuracy to dense models on benchmarks.
Significant inference acceleration (up to 7.15x) with compiler co-design.
The optimizer is plug-and-play, and its shrinkage operator has an analytical solution backed by a theoretical framework.
Abstract
This paper introduces SparseOptimizer, a novel deep learning optimizer that exploits Moreau-Yosida regularization to naturally induce sparsity in large language models such as BERT, ALBERT and GPT. Key to the design of SparseOptimizer is an embedded shrinkage operator, which imparts sparsity directly within the optimization process. This operator, backed by a sound theoretical framework, includes an analytical solution, thereby reinforcing the optimizer's robustness and efficacy. Crucially, SparseOptimizer's plug-and-play functionality eradicates the need for code modifications, making it a universally adaptable tool for a wide array of large language models. Empirical evaluations on benchmark datasets such as GLUE, RACE, SQuAD1, and SQuAD2 confirm that SparseBERT and SparseALBERT, when sparsified using SparseOptimizer, achieve performance comparable to their dense counterparts, BERT…
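The abstract describes a shrinkage operator embedded in the optimizer step but does not give its form here. The sketch below assumes the standard case: an L1 penalty, whose proximal operator (the minimizer underlying its Moreau-Yosida envelope) is the closed-form soft-thresholding map. The function names `soft_threshold` and `prox_sgd_step`, the plain gradient update, and the values of `lr` and `lam` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def soft_threshold(w, thresh):
    """Proximal operator of thresh * ||w||_1 (assumed form of the shrinkage).

    Each entry is pulled toward zero by `thresh`; entries whose magnitude is
    below `thresh` become exactly zero, which is what produces sparsity.
    """
    return np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)


def prox_sgd_step(w, grad, lr, lam):
    """One proximal-gradient step: gradient update, then shrinkage.

    A plain SGD update stands in for whatever base optimizer is used;
    this is a hypothetical simplification for illustration.
    """
    return soft_threshold(w - lr * grad, lr * lam)


# Toy weights and gradients (made-up values, not from the paper).
w = np.array([0.80, -0.03, 0.02, -1.20])
grad = np.array([0.10, 0.00, 0.01, -0.20])

w_new = prox_sgd_step(w, grad, lr=0.1, lam=0.5)

# Fraction of weights driven exactly to zero by the shrinkage step.
sparsity = float(np.mean(w_new == 0.0))
print(w_new, sparsity)
```

Because the shrinkage is applied inside the update rather than in the model code, a step of this shape can wrap an existing training loop without touching the model definition, which is one plausible reading of the plug-and-play claim.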
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · LAMB · Linear Layer · Linear Warmup With Cosine Annealing · Layer Normalization · Attention Dropout · WordPiece · Dense Connections
