UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu

TL;DR
UniQL is a framework that combines quantization and low-rank compression to deploy large language models efficiently on mobile devices, reducing memory footprint and increasing token throughput while maintaining accuracy.
Contribution
UniQL introduces a unified post-training compression framework with on-device configurable pruning, integrating quantization and low-rank methods for diverse edge LLMs.
Findings
Reduces model memory by 4x-5.7x.
Improves token throughput by 2.7x-3.4x.
Maintains accuracy within 5% of the original models at a 15% pruning rate.
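As a back-of-envelope reading of the memory figure above, the sketch below assumes 16-bit baseline weights and 4-bit quantized weights (the `w4a16` suffix on the released checkpoints suggests 4-bit weights with 16-bit activations), and ignores quantization scales, activations, and caches. The function names and the 8e9 parameter count are illustrative assumptions, and this is not the paper's own accounting. Under this simplified model, quantization alone gives 4x, and removing roughly 30% of the weights on top of it gives about 5.7x.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float,
                     pruning_rate: float = 0.0) -> float:
    """Approximate weight storage in GB (1e9 bytes).

    Simplification (assumption): counts only the weights that are kept,
    ignoring quantization scales, activations, and any runtime caches.
    """
    kept = n_params * (1.0 - pruning_rate)
    return kept * bits_per_weight / 8 / 1e9


def reduction_factor(bits_before: float, bits_after: float,
                     pruning_rate: float = 0.0) -> float:
    """Ratio of baseline weight memory to compressed weight memory."""
    return bits_before / (bits_after * (1.0 - pruning_rate))


if __name__ == "__main__":
    n_params = 8e9  # hypothetical 8B-parameter model
    for rate in (0.0, 0.15, 0.30):
        size = weight_memory_gb(n_params, 4, rate)
        factor = reduction_factor(16, 4, rate)
        print(f"pruning {rate:.0%}: {size:.2f} GB, {factor:.2f}x smaller than 16-bit")
```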
Abstract
Deploying large language models (LLMs) on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the current device workload, adding to the uncertainty of model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a…
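To make the idea of combining low-rank factors with quantization and a pruning rate chosen at load time more concrete, here is a minimal NumPy sketch. It uses a plain truncated SVD and symmetric round-to-nearest 4-bit quantization as stand-ins; it is not UniQL's quantization-aware SVD or its structured weight-sorting method. All function names are hypothetical, and "pruning rate" here means the fraction of rank components dropped, which is an assumption made for illustration.

```python
import numpy as np


def low_rank_factors(W):
    """Factor W ~= A @ B via SVD; components come out ordered by singular value."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U * np.sqrt(S)             # shape (out, r), column i scaled by sqrt(S[i])
    B = np.sqrt(S)[:, None] * Vt   # shape (r, in), row i scaled by sqrt(S[i])
    return A, B


def quantize_int4(X, axis):
    """Symmetric round-to-nearest quantization to integers in [-8, 7].

    One scale is kept per slice along `axis`, so each rank component keeps
    its own scale and can be dropped independently later.
    """
    max_abs = np.max(np.abs(X), axis=axis, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    q = np.clip(np.round(X / scale), -8, 7).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    return q.astype(np.float32) * scale


def reconstruct(qA, sA, qB, sB, pruning_rate=0.0):
    """Rebuild the weight from the leading rank components only.

    Because components are stored in importance order, the pruning rate can
    be picked at load time without re-running the compression.
    """
    r = qA.shape[1]
    keep = max(1, int(round(r * (1.0 - pruning_rate))))
    A = dequantize(qA[:, :keep], sA[:, :keep])
    B = dequantize(qB[:keep, :], sB[:keep, :])
    return A @ B


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48)).astype(np.float32)  # toy weight matrix
A, B = low_rank_factors(W)
qA, sA = quantize_int4(A, axis=0)   # one scale per column of A
qB, sB = quantize_int4(B, axis=1)   # one scale per row of B

for rate in (0.0, 0.15, 0.35):
    W_hat = reconstruct(qA, sA, qB, sB, rate)
    err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
    print(f"rank pruning {rate:.2f}: relative error {err:.3f}")
```

The design point the sketch tries to show is that a single stored artifact can serve several memory budgets: the factors are quantized once, and a smaller model is obtained by slicing off trailing components.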
Peer Reviews
Decision·ICLR 2026 Poster
- Unified Framework: UniQL supports Transformers, State Space Models (SSMs), and hybrid architectures, addressing a wide range of LLM structures.
- On-device Adaptive Pruning: Enables users to prune the model at inference time based on the current device memory state.
- While the results are generally strong, some inconsistencies exist in the evaluation setup: In Tables 1 and 2, the latency results for different models and methods are evaluated on different hardware platforms (Llama-3.1-8B and Nemotron-H-8B on A6000; Qwen-2.5-7B and Mamba2-8B on Nano 8G). Additionally, Table 2 lacks baseline comparisons such as TRT-AWQ for some models. This raises concerns about the consistency and comparability of latency evaluations across models and methods. Can the author
1. The adaptive LLM memory problem discussed in this paper is important and interesting.
2. The proposed methods are evaluated on different model structures.
1. The contribution is limited, and the proposed methods are very incremental.
   a) Methods applied to different model structures appear more as a systematic engineering effort than a novel algorithmic advancement.
   b) The combination of quantization and pruning has already been explored.
2. Insufficient empirical evaluation.
   a) The paper claims adaptive deployment of LLMs on the edge, but there is no real deployment with different workloads. Crucially, the paper does not address the core systems challenge: how
- Wide Model Architecture Coverage: This approach systematically supports post-training quantization and structured pruning for Transformer, SSM, and hybrid models for the first time.
- Strong On-Device Adaptability: This approach dynamically adjusts model size based on memory and compute resources after deployment to adapt to the dynamic load of edge devices.
- High Compression Efficiency: This approach significantly accelerates the compression process (up to 22x faster than MoDeGPT) by avoiding ps
- Accuracy-compression tradeoff not yet optimal: At high pruning rates (e.g., 35%), the accuracy of some models (e.g., Mamba2) drops significantly (down to 57.7%).
- Sensitive to calibration data: Structured ordering and quantization depend on the calibration set; their robustness to different data distributions is not analyzed.
- Lack of comparison with unstructured methods: No comparison with popular unstructured pruning methods (e.g., SparseGPT) or hybrid sparse methods is provided.
- Device-side
Code & Models
- 🤗 ut-enyac/Llama-3.1-8B-uniql-1.0-masked-lora-rft-w4a16 (model · 7 dl)
- 🤗 ut-enyac/Llama-2-7b-hf-uniql-1.0-masked-lora-rft-w4a16 (model)
- 🤗 ut-enyac/Nemotron-H-8B-Base-8K-uniql-1.0-masked-lora-rft-w4a16 (model · 4 dl)
- 🤗 ut-enyac/Qwen2.5-7B-uniql-1.0-masked-lora-rft-w4a16 (model)
- 🤗 ut-enyac/Bamba-9B-v2-uniql-1.0-masked-lora-rft-w4a16 (model · 1 dl)
- 🤗 ut-enyac/mamba2-8b-converted-uniql-1.0-masked-lora-rft-w4a16 (model · 4 dl)
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Topic Modeling
