Cut Less, Fold More: Model Compression through the Lens of Projection Geometry
Olga Saukh, Dong Wang, Haris Šikić, Yun Cheng, Lothar Thiele

TL;DR
This paper introduces a geometry-based approach to neural network compression, demonstrating that model folding often outperforms pruning in accuracy and theoretical robustness across various models and training conditions.
Contribution
It formalizes folding and pruning as orthogonal projections, proving folding's advantages in reconstruction error and functional perturbation within a rank distance of one.
Findings
Folding yields smaller parameter reconstruction error than pruning.
Folding generally achieves higher post-compression accuracy.
The benefits of folding are most pronounced at moderate to high compression levels.
Abstract
Compressing neural networks without retraining is vital for deployment at scale. We study calibration-free compression through the lens of projection geometry: structured pruning is an axis-aligned projection, whereas model folding performs a low-rank projection via weight clustering. We formalize both as orthogonal operators and show that, within a rank distance of one, folding provably yields smaller parameter reconstruction error, and under mild smoothness assumptions, smaller functional perturbations than pruning. At scale, we evaluate >1000 checkpoints spanning ResNet18, PreActResNet18, ViT-B/32, and CLIP ViT-B/32 on CIFAR-10 and ImageNet-1K, covering diverse training hyperparameters (optimizers, learning rates, augmentations, regularization, sharpness-aware training), as well as multiple LLaMA-family 60M and 130M parameter models trained on C4. We show that folding typically…
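The projection view described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that neurons are the rows of a toy weight matrix `W`, that pruning keeps the `k` rows with the largest L2 norm, and that the "rank distance of one" means folding may use `k + 1` clusters against `k` retained rows. The helper names `pruning_projector` and `folding_projector` are hypothetical, and the cluster assignment is fixed by hand instead of being found by a clustering algorithm.

```python
import numpy as np


def pruning_projector(n, keep):
    """Axis-aligned projector: diagonal 0/1 matrix that keeps the rows in `keep`."""
    P = np.zeros((n, n))
    P[keep, keep] = 1.0
    return P


def folding_projector(labels):
    """Cluster-averaging projector: each row is replaced by the mean of its cluster."""
    labels = np.asarray(labels)
    n = labels.size
    P = np.zeros((n, n))
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        P[np.ix_(idx, idx)] = 1.0 / idx.size
    return P


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))  # toy layer: 8 neurons (rows), 5 inputs
n = W.shape[0]

# Pruning (assumption: magnitude criterion): keep the k rows with the largest norm.
k = 4
norms = np.linalg.norm(W, axis=1)
keep = np.sort(np.argsort(norms)[-k:])
P_prune = pruning_projector(n, keep)

# Folding with k + 1 clusters (hand-built assignment, not a clustering result):
# every kept row is its own cluster, all remaining rows share one merged cluster.
labels = np.full(n, k)
labels[keep] = np.arange(k)
P_fold = folding_projector(labels)

# Squared Frobenius reconstruction error of each projection.
err_prune = np.linalg.norm(W - P_prune @ W) ** 2
err_fold = np.linalg.norm(W - P_fold @ W) ** 2

print("rank (prune, fold):", int(round(np.trace(P_prune))), int(round(np.trace(P_fold))))
print("reconstruction error (prune, fold):", err_prune, err_fold)
```

Both matrices are symmetric and idempotent, so each is an orthogonal projector. With this particular assignment the folding error can never exceed the pruning error: the kept rows are reproduced exactly, and replacing the remaining rows by their mean is at least as close to them as replacing them by zero. A real folding method would choose the clusters from the weights, which this sketch does not attempt.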
Peer Reviews
Decision·ICLR 2026 Poster
• **Solid theoretical contribution**: Formalization of pruning and folding as orthogonal projections with provable guarantees on parameter reconstruction error, providing principled understanding of compression methods
• **Extensive empirical validation with thorough ablations**: Over 1,000 checkpoints across diverse architectures (CNNs/ViTs) and datasets, with systematic investigation of learning rates, SAM, augmentation, and regularization effects that clearly identify when folding excels
**Major Weaknesses**
- **Limited Scope to Small-Scale Models and Datasets**: Evaluations are confined to relatively small architectures such as ResNet18 (11M parameters) and ViT-B/32 (~86M parameters) on CIFAR-10 and ImageNet-1K. The absence of experiments on large-scale models like LLMs (e.g., GPT series) or diffusion models limits confidence in the method’s scalability and its relevance to modern deployment settings with billions of parameters.
- **Insufficient Comparisons to State-of-the-Ar
- The projection-theoretic framing of pruning and folding is reasonable. Framing compression as orthogonal projection provides a solid interpretation.
- The paper is well written, with good visualizations.
- The main argument is questionable. The paper argues that model folding is better than pruning because the folded model stays closer to the original model. This claim is problematic. Pruning has been developed broadly and deeply over the past decade, and many pruning methods can produce pruned models that even surpass the original models. The reasoning is that pruning can eliminate harmful neurons, making the pruned model stronger. Therefore, the distance to original model
- Casting pruning and folding as orthogonal projections cleanly explains their geometric differences; theorems show folding’s strictly smaller projection error and hence tighter loss perturbation under mild smoothness.
- The evaluation spans >1000 checkpoints and multiple architectures/datasets, with consistent wins for folding at moderate to high compression and robustness to training variations.
- Folding’s advantage persists after minimal recalibration (e.g., BN/LayerNorm reset) and 1-5 epoc
- The core guarantee uses a one-rank slack (pruning rank vs. folding rank). Although the authors state that experiments match retained parameters/FLOPs, a theoretical result at exactly matched rank would remove any residual ambiguity, especially at low compression where a single unit can matter. Consider strengthening the theory (e.g., conditions under which folding dominates at equal rank) or adding tighter empirical per-layer equality checks to preclude hidden capacity differences.
- Results focus on
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · 3D Shape Modeling and Analysis
