Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs
Pierre Abillama, Changwoo Lee, Juechu Dong, David Blaauw, Dennis Sylvester, Hun-Seok Kim

TL;DR
This paper introduces memory-efficient custom kernels for block low-rank compressed foundation models, significantly reducing inference latency and model size on resource-constrained GPUs while maintaining accuracy.
Contribution
It develops optimized Triton kernels with partial fusion and layout improvements for BLR models, enabling faster inference and compression on limited GPU hardware.
Findings
Achieves up to 3.76x speedup on NVIDIA Jetson Orin Nano and A40
Provides 3x model size compression over dense baselines
Supports various models including Llama, GPT2, DiT, and ViT
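To make the size reduction concrete, here is a minimal NumPy sketch of a generic block low-rank layer. It assumes the weight matrix is split into a b x b grid of blocks, each stored as a rank-r product of two factors. This layout, and the names `blr_param_count`, `blr_matmul`, and `blr_to_dense`, are illustrative assumptions; the sketch is not the exact Monarch or BLAST parameterization, nor the paper's Triton kernels.

```python
import numpy as np

def blr_param_count(m, n, b, r):
    """Parameters of an m x n matrix stored as a b x b grid of rank-r blocks."""
    # Each block is (m/b) x (n/b) and is stored as two factors of rank r.
    return b * b * r * (m // b + n // b)

def blr_matmul(U, V, x):
    """Apply the block low-rank matrix to x without forming it densely.

    U: (b, b, m/b, r) left factors, V: (b, b, n/b, r) right factors,
    x: (n, tokens) input activations.
    """
    b = U.shape[0]
    p = V.shape[2]
    xb = x.reshape(b, p, -1)                      # split input rows per column block
    t = np.einsum("ijpr,jpt->ijrt", V, xb)        # stage 1: project into rank-r space
    y = np.einsum("ijqr,ijrt->iqt", U, t)         # stage 2: expand and sum over blocks
    return y.reshape(-1, x.shape[1])

def blr_to_dense(U, V):
    """Materialize the dense matrix (for checking only)."""
    b, _, q, _ = U.shape
    p = V.shape[2]
    blocks = np.einsum("ijqr,ijpr->iqjp", U, V)   # block (i, j) = U[i, j] @ V[i, j].T
    return blocks.reshape(b * q, b * p)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, b, r, tokens = 8, 12, 2, 3, 5           # toy sizes, chosen arbitrarily
    U = rng.standard_normal((b, b, m // b, r))
    V = rng.standard_normal((b, b, n // b, r))
    x = rng.standard_normal((n, tokens))
    print(np.allclose(blr_matmul(U, V, x), blr_to_dense(U, V) @ x))
    print(m * n, blr_param_count(m, n, b, r))
```

The intermediate tensor `t` is what a two-stage implementation would write to and read back from memory between the two matmuls, which is the kind of traffic that kernel fusion aims to reduce.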
Abstract
Recent advances in transformer-based foundation models have made them the default choice for many tasks, but their rapidly growing size makes fitting a full model on a single GPU increasingly difficult and their computational cost prohibitive. Block low-rank (BLR) compression techniques address this challenge by learning compact representations of weight matrices. While traditional low-rank (LR) methods often incur sharp accuracy drops, BLR approaches such as Monarch and BLAST can better capture the underlying structure, thus preserving accuracy while reducing computations and memory footprints. In this work, we use roofline analysis to show that, although BLR methods achieve theoretical savings and practical speedups for single-token inference, multi-token inference often becomes memory-bound in practice, increasing latency despite compiler-level optimizations in PyTorch. To address…
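The roofline argument in the abstract can be illustrated with a rough arithmetic-intensity estimate. The sketch below assumes a simple cost model in which each matmul reads both operands and writes its output exactly once, with 2 bytes per element (BF16). The function names, matrix shapes, and peak numbers are hypothetical placeholders, not measurements or hardware specifications from the paper.

```python
def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul under a read-once/write-once model."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound: the lesser of the compute peak and bandwidth times intensity."""
    return min(peak_flops, intensity * peak_bandwidth)

if __name__ == "__main__":
    tokens = 512                                   # hypothetical multi-token batch
    dense = matmul_intensity(4096, 4096, tokens)   # dense 4096 x 4096 weight
    stage = matmul_intensity(64, 4096, tokens)     # one rank-64 factor, as an assumed example
    print(f"dense intensity:    {dense:.1f} FLOPs/byte")
    print(f"low-rank stage:     {stage:.1f} FLOPs/byte")

    # Placeholder peaks, chosen only to show how the bound is applied.
    peak_flops, peak_bw = 100e12, 1e12
    for name, ai in (("dense", dense), ("low-rank stage", stage)):
        bound = attainable_flops(ai, peak_flops, peak_bw)
        kind = "compute-bound" if bound == peak_flops else "memory-bound"
        print(f"{name}: {kind}")
```

Under this model, shrinking the inner dimension to a small rank lowers the FLOPs-per-byte ratio of each stage, so a factorized layer can fall below the ridge point of the roofline even when the dense layer sits above it.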
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
1. Demonstrating up to 3.76× speedup and 3× model size reduction on resource-constrained GPUs (Jetson Orin Nano, A40) highlights real-world relevance for edge and low-memory environments. 2. The kernel fusion method is clearly explained.
1. The main issue lies in the limited novelty of the contribution. Using kernel fusion to reduce I/O overhead is a well-established engineering practice in modern LLM systems. State-of-the-art training frameworks such as Megatron-LM already include numerous fused operations that consistently outperform vanilla PyTorch implementations. Therefore, the proposed optimization should be viewed primarily as an engineering refinement rather than an academic innovation. For comparison, methods like Flash
1. It explores the rarely studied problem of optimizing the GPU kernels of block low-rank compression. 2. It presents a set of practical solutions for the problem.
1. Given that the accuracy of low-rank compression is poor, the paper lacks a well-argued motivation for optimizing this kind of algorithm specifically. 2. The roofline model analysis is neither novel nor necessary. The performance bottleneck of the BLR computation is obvious from a systems perspective: the matmul shapes are quite small (especially the intermediate K dimension), so the computation becomes memory-bound. 3. The GPU kernel optimization is straightforward and does not have new contribution
* The paper evaluates the method on the compression of a broad range of foundation models, demonstrating the general applicability of the proposed method. Reported speedups are obtained under realistic conditions, such as BF16 precision and multi-token inference, which enhances the practical relevance of the experiments. * The speedup analysis is conducted at both the layer and end-to-end levels, providing insights into the sources of efficiency gains. The trade-off between accuracy and inferenc
My main concern with this submission is its positioning with respect to a very close paper [1] that was not mentioned in the submission. [1] also proposes to reduce the cost of data movement operations when performing matrix multiplication on GPU with the so-called Kronecker-sparse factors. These matrices are typically involved in butterfly factorizations and in the Monarch matrices discussed in this submission. In [1], it is observed that the original implementation of Monarch matrix multiplica
The paper is clearly written and well structured, making the technical contributions easy to follow. On quality, the empirical results are strong: the method achieves up to 3.76× speedup over the dense baseline and demonstrates an approximately 3× improvement in model size, indicating meaningful efficiency gains. The implementation choices further support robustness: the kernels are written in Triton, positioning the work to benefit from ongoing compiler and hardware-backend optimizations. On or
Significance/motivation. The case for block-low-rank (BLR) matrices is under-motivated. It remains unclear how frequently BLR kernels arise in real-world workloads and whether practitioners deploy them at scale. Evaluation scope. All experiments use batch size = 1, which limits generality. Performance on modern accelerators often changes with batch size, sequence length, block size/rank, and tensor shapes. Results sweeping batch size, sequence length, and BLR configurations would make the findi
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Advanced Neural Network Applications
