Towards Universal & Efficient Model Compression via Exponential Torque Pruning
Sarthak Ketanbhai Modi, Zi Pong Lim, Shourya Kuchhal, Yushi Cao, Yupeng Cheng, Yon Shin Teo, Shang-Wei Lin, Zhiming Li

TL;DR
This paper introduces Exponential Torque Pruning (ETP), a regularization method that replaces the linear force scheme of Torque-inspired pruning with an exponential one, reaching higher compression rates than previous methods with negligible accuracy loss.
Contribution
The paper proposes a new exponential force application scheme for model pruning, enhancing compression rates while maintaining accuracy better than existing torque-inspired regularization methods.
Findings
Achieves higher compression rates than previous methods.
Maintains negligible accuracy drop during pruning.
Demonstrates effectiveness across vision, language, graph, and time-series tasks.
Abstract
The rapid growth in the complexity and size of modern deep neural networks (DNNs) has increased challenges related to computational cost and memory usage, spurring growing interest in efficient model compression techniques. The previous state-of-the-art approach proposes a Torque-inspired regularization that forces the weights of neural modules around a selected pivot point. However, we observe that the pruning effect of this approach is far from perfect: the post-trained network is still dense and also suffers from a high accuracy drop. In this work, we attribute this ineffectiveness to the default linear force application scheme, which imposes inappropriate force on neural modules at different distances. To efficiently prune the redundant and distant modules while retaining those that are close and necessary for effective inference, we propose Exponential Torque…
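The contrast between the linear and exponential force schemes described in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' formulation: the excerpt above does not give the exact penalty, so the function names, the use of index distance from the pivot, and the `exp(d) - 1` force are all assumptions made for illustration.

```python
import math


def l2_norm(weights):
    """Euclidean norm of one module's weights (e.g. a flattened filter)."""
    return math.sqrt(sum(w * w for w in weights))


def torque_penalty(modules, pivot=0, scheme="exponential"):
    """Distance-scaled sum of module norms (hypothetical helper).

    ASSUMPTION: distance is the index gap to the pivot module, and the
    exponential force is exp(d) - 1 so that the pivot itself is unpenalized.
    The paper's actual definitions may differ.
    """
    total = 0.0
    for index, weights in enumerate(modules):
        distance = abs(index - pivot)
        if scheme == "linear":
            force = float(distance)
        elif scheme == "exponential":
            force = math.exp(distance) - 1.0
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        total += force * l2_norm(weights)
    return total


# Three toy modules with identical norm 5.0; only their distance differs.
modules = [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
linear = torque_penalty(modules, scheme="linear")
exponential = torque_penalty(modules, scheme="exponential")
```

Under these assumptions the farthest module receives a force of about 6.39 in the exponential scheme versus 2 in the linear one, so distant modules are pushed toward zero much harder relative to near ones. In training, such a term would typically be added to the task loss with a weighting coefficient.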
Peer Reviews
Decision·Submitted to ICLR 2026
1. Improved Regularization Design: ETP introduces an exponential force application scheme that better aligns with the intuition of structured pruning, effectively addressing the imbalance in the original Torque method.
2. Strong Empirical Results Across Domains: The experiments span vision, language, graph, and time-series tasks, demonstrating the universality and robustness of the approach across architectures.
1. Limited Novelty: The core idea, replacing the linear regularization force in Torque with an exponential one, is conceptually straightforward. Similar non-linear or adaptive regularization ideas (e.g., GReg, Wang et al., 2020) have been explored before. The theoretical novelty could be better articulated.
2. Lack of Real Hardware Validation: While the paper claims ETP is suitable for edge deployment, it only reports computational reduction via MACs, not real inference latency or energy consu
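To make the reviewer's distinction concrete, the sketch below computes the MACs of a single convolution layer before and after channel pruning. The layer shape is hypothetical and is not taken from the paper. The resulting ratio is a theoretical speed-up only; measured latency and energy depend on the hardware and kernels and need not match it.

```python
def conv2d_macs(c_in, c_out, kernel, h_out, w_out):
    """Multiply-accumulate count of a standard 2D convolution (no bias)."""
    return c_in * c_out * kernel * kernel * h_out * w_out


# Hypothetical layer: 64 -> 128 channels, 3x3 kernel, 32x32 output map.
dense_macs = conv2d_macs(64, 128, 3, 32, 32)
# Structured pruning removes half of the output channels.
pruned_macs = conv2d_macs(64, 64, 3, 32, 32)
macs_speedup = dense_macs / pruned_macs
```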
The strengths can be summarized as follows.
- (1) The idea is clear and simple. Replacing a linear distance weight with an exponential weight is easy to adopt and aligns with the intended near-keep, far-shrink behavior of torque-style penalties.
- (2) The applicability is wide. The same formulation works across CNNs, Transformers for vision and NLP, a graph model, and a time-series model, which supports the claim of generality.
- (3) The empirical gains over the chosen baselines are consis
The weaknesses are summarized as follows.
- (1) Novelty is relatively low. The contribution is a schedule change within an existing regularization template rather than a new pruning paradigm or analysis. A theoretical treatment that connects the exponential weight to an ideal selection boundary would strengthen the work.
- (2) The SOTA claim is not fully established. Since only one paper published in 2025 has been cited, the comparison set omits several recent and strong methods for L
- Simple, general idea with clear intuition. Replacing linear with exponential penalization aligns with the goal “strongly suppress far modules, preserve near ones,” and is easy to add as a loss term next to the task loss. The method is architecture-agnostic and demonstrated across several domains.
- Broad empirical coverage. Results on CNNs/ViT, BERT/RoBERTa, GAT and Informer show favorable accuracy at equal MACs speed-up, and robustness under more aggressive compression.
- Section 4.4 reports the 50% sparsity perplexities on OPT-350M/WikiText and compares to SparseGPT, Wanda, DepGraph, LLM-Pruner, but does not state whether ETP was used with additional training/finetuning (task loss + ETP loss) or applied in a purely post-training fashion; there are no training hyperparameters for OPT in Appendix 7.1, unlike other tasks. This raises questions about the exact pipeline and reproducibility.
- The paper says, “All methods are constrained to 50% sparsity for a fair c
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
