Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction
Björn Roman Kohlberger (EctoSpace, Dublin, Ireland)

TL;DR
Spectral Compact Training (SCT) replaces dense weight matrices with truncated SVD factors that are never expanded back into a dense matrix, cutting memory per MLP layer by up to 199x at rank 32 and letting full training steps of 70B-parameter architectures fit on a Steam Deck.
Contribution
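Introduces SCT, which stores each weight matrix permanently as truncated SVD factors U, s, and V, trains those factors with standard backpropagation, and retracts U and V to the Stiefel manifold by QR decomposition after each optimizer step, so that large models can be trained on consumer hardware.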
Findings
Up to 199x memory reduction per MLP layer at rank 32.
Ran full training steps of 70B-parameter architectures on a Steam Deck (7.2 GB peak memory vs. 1,245 GB for dense FP32 training with Adam).
Rank 128 offers the best trade-off among the tested ranks (32-256), with 11.7x compression and the lowest perplexity.
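The per-layer figure above can be sanity-checked with simple parameter counting. The sketch below assumes that a dense m x n matrix is replaced by U (m x r), s (r), and V (n x r), and that the MLP projection has shape 8192 x 28672. Those dimensions are an assumption typical of 70B-class architectures and are not stated in this summary, and the function names are illustrative only.

```python
def dense_params(m: int, n: int) -> int:
    """Parameter count of a dense m x n weight matrix."""
    return m * n


def spectral_params(m: int, n: int, rank: int) -> int:
    """Parameter count of the factors U (m x r), s (r), V (n x r)."""
    return rank * (m + n + 1)


def compression_ratio(m: int, n: int, rank: int) -> float:
    """How many times smaller the factored form is than the dense matrix."""
    return dense_params(m, n) / spectral_params(m, n, rank)


# Hypothetical MLP projection shape (an assumption, not taken from the paper).
HIDDEN, INTERMEDIATE = 8192, 28672

for rank in (32, 64, 128, 256):
    ratio = compression_ratio(HIDDEN, INTERMEDIATE, rank)
    print(f"rank {rank:>3}: {ratio:6.1f}x fewer parameters")
# rank 32 comes out at about 199.1x under these assumed dimensions
```

Under these assumptions the rank-32 ratio lands close to the reported 199x. The count covers weights only; gradients and optimizer state scale with the same factor sizes, so the ratio carries over to them as well.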
Abstract
The memory wall remains the primary bottleneck for training large language models on consumer hardware. We introduce Spectral Compact Training (SCT), a method that replaces dense weight matrices with permanent truncated SVD factors W = U diag(s) V^T, where the full dense matrix is never materialized during training or inference. Gradients flow through the compact spectral factors via standard backpropagation, and U, V are retracted to the Stiefel manifold via QR decomposition after each optimizer step. SCT achieves up to 199x memory reduction per MLP layer at rank 32, enabling full training steps of 70B-parameter architectures on a Steam Deck handheld (7.2 GB peak memory vs. 1,245 GB for dense FP32 training with Adam). Rank-sweep experiments on SmolLM2-1.7B (ranks 32-256, 2000 steps, NVIDIA A100) show that all tested ranks converge to the same loss floor (~4.2-4.5), identifying the…
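The training loop described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's implementation: it uses plain gradient descent on a squared-error loss instead of Adam, writes the gradients out by hand, and discards the R factor of the QR decomposition during retraction. All function names are hypothetical.

```python
import numpy as np


def qr_retract(a):
    """Return a matrix with orthonormal columns spanning the columns of a."""
    q, r = np.linalg.qr(a)            # reduced QR: q has the same shape as a
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs                  # sign fix makes the result unique


def init_factors(m, n, rank, rng):
    """Random orthonormal U (m x r), V (n x r) and unit singular values s."""
    u = qr_retract(rng.standard_normal((m, rank)))
    v = qr_retract(rng.standard_normal((n, rank)))
    return u, np.ones(rank), v


def forward(x, u, s, v):
    """Apply W = U diag(s) V^T to a batch x of shape (batch, n).

    The m x n matrix W is never formed; only rank-sized intermediates exist.
    """
    return ((x @ v) * s) @ u.T


def sgd_step(x, target, u, s, v, lr=1e-2):
    """One gradient step on 0.5 * mean ||y - target||^2, then QR retraction."""
    h = x @ v                         # (batch, r)
    z = h * s                         # (batch, r)
    y = z @ u.T                       # (batch, m)
    g = (y - target) / x.shape[0]     # dL/dy
    grad_u = g.T @ z                  # (m, r)
    dz = g @ u                        # (batch, r)
    grad_s = (dz * h).sum(axis=0)     # (r,)
    grad_v = x.T @ (dz * s)           # (n, r)
    u = qr_retract(u - lr * grad_u)   # back onto the Stiefel manifold
    v = qr_retract(v - lr * grad_v)
    s = s - lr * grad_s
    return u, s, v


rng = np.random.default_rng(0)
u, s, v = init_factors(m=64, n=48, rank=8, rng=rng)
x = rng.standard_normal((4, 48))
target = rng.standard_normal((4, 64))
u, s, v = sgd_step(x, target, u, s, v)
print(np.allclose(u.T @ u, np.eye(8)), np.allclose(v.T @ v, np.eye(8)))
```

The forward pass costs O(r(m + n)) per example rather than O(mn), and the retraction keeps the columns of U and V orthonormal so the three factors keep the form of a truncated SVD after every update.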
