LOST: Low-rank and Sparse Pre-training for Large Language Models
Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Liwu Xu, Tianjin Huang, Wenwu Wang, Shiwei Liu, Xilu Wang

TL;DR
LOST introduces a novel low-rank and sparse pre-training method for large language models, significantly reducing computational costs while maintaining or improving performance compared to traditional full-rank training.
Contribution
The paper proposes an innovative integration of low-rank and sparse structures for efficient LLM pretraining, addressing limitations of previous simplistic approaches.
Findings
Achieves competitive or better performance than full-rank models.
Reduces memory and compute requirements significantly.
Effective for models ranging from 60M to 7B parameters.
Abstract
While large language models (LLMs) have achieved remarkable performance across a wide range of tasks, their massive scale incurs prohibitive computational and memory costs for pre-training from scratch. Recent studies have investigated the use of low-rank parameterization as a means of reducing model size and training cost. In this context, sparsity is often employed as a complementary technique to recover important information lost in low-rank compression by capturing salient features in the residual space. However, existing approaches typically combine low-rank and sparse components in a simplistic or ad hoc manner, often resulting in undesirable performance degradation compared to full-rank training. In this paper, we propose LOw-rank and Sparse pre-Training (LOST) for LLMs, a novel method that ingeniously integrates low-rank and sparse structures…
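To make the low-rank plus sparse idea concrete, here is a minimal NumPy sketch that splits a weight matrix into low-rank factors built from the leading singular values and a sparse term built from the residual (the remaining singular values). It is an illustration under stated assumptions, not the paper's algorithm: the function name `lowrank_sparse_split`, the `rank` and `density` parameters, and the choice to keep the largest-magnitude residual entries are all hypothetical, and the actual LOST sparsification rule may differ.

```python
import numpy as np


def lowrank_sparse_split(W, rank, density):
    """Split W into low-rank factors (B, A) and a sparse residual S.

    Hypothetical helper for illustration only; not the LOST reference code.
    """
    # Thin SVD: W = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)

    # Low-rank part from the top-`rank` singular triplets.
    B = U[:, :rank] * s[:rank]   # shape (m, rank)
    A = Vt[:rank, :]             # shape (rank, n)

    # Residual rebuilt from the remaining singular values.
    R = (U[:, rank:] * s[rank:]) @ Vt[rank:, :]

    # ASSUMPTION: sparsify by keeping the largest-magnitude residual entries.
    k = max(1, int(density * R.size))
    thresh = np.partition(np.abs(R).ravel(), -k)[-k]
    mask = np.abs(R) >= thresh
    S = R * mask
    return B, A, S, mask


# Usage: compare stored parameters against the dense matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
B, A, S, mask = lowrank_sparse_split(W, rank=8, density=0.05)
dense_params = W.size
compressed_params = B.size + A.size + int(mask.sum())  # sparse indices not counted
```

In a pre-training setting the factors `B`, `A` and the non-zero entries of `S` would be the trainable parameters, with `B @ A + S` standing in for the dense weight. The sketch only shows the decomposition step and the resulting parameter count.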
Peer Reviews
Decision: Submitted to ICLR 2026
Many previous works have shown empirically that large models exhibit substantial parameter redundancy. The success of LoRA also demonstrates that tuning a low-rank part of the weight matrix can effectively learn new things. Extending this idea to pre-training is promising and could have a broad impact on the large-model community.
1. My major concern is the lack of a scaling law. When comparing architectures/optimization paradigms, scaling laws are a crucial and widely used tool. Different methods and settings often require different optimal hyperparameters, and relative rankings can flip as scale changes (e.g., attention variants such as MLA, GQA, linear attention exhibit size-dependent crossovers). Although this work reports results of multiple model sizes, the comparisons could still be confounded by hyperparameter sub
1. Principled "Co-Design" Methodology: Instead of naively combining low-rank and sparse matrices, it uses Singular Value Decomposition (SVD) to purposefully create the sparse component from the residual singular values. This "co-design" ensures the two components are complementary from the start. 2. Strong Empirical Performance: LOST demonstrates state-of-the-art results. It achieves perplexity scores that are competitive with, or even superior to (very suspicious they use different batch size a
(Please respond to the questions section) 1. Limited Novelty Compared to SLTrain 2. Weak Justification for Sparse Component Design 3. Inconsistent and Unfair Experimental Comparisons
1. The paper is well written and easy to read. 2. Significant efficiency and performance advantages. The core results (Table 1, Figure 1) show that while drastically reducing memory footprint, LOST achieves lower Perplexity than full-rank models across most model scales, demonstrating the notable superiority of the proposed method. 3. The experiments cover a variety of model sizes ranging from 60M to 7B parameters, verifying the method’s universality and scalability. Comparisons are conducted wi
1. There are only experimental results of pre-training on LLMs, with a lack of fine-tuning results on downstream tasks (e.g., GLUE). The fine-tuning hyperparameters for the GLUE dataset are provided in Table 13; however, the actual experimental results of the GLUE fine-tuning task are not presented in the main text. 2. The authors only conducted experiments on models from the Llama family, and there is a lack of experimental results on LLMs with other architectures. It remains unclear whether LO
Taxonomy
Topics: Topic Modeling · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications
