CLIP-Map: Structured Matrix Mapping for Parameter-Efficient CLIP Compression
Kangjie Zhang, Wenxuan Huang, Xin Zhou, Boxiang Zhou, Dejia Song, Yuan Xie, Baochang Zhang, Lizhuang Ma, Nemo Chen, Xu Tang, Yao Hu, and Shaohui Lin

TL;DR
CLIP-Map introduces a learnable matrix-based compression method for CLIP that preserves more information and outperforms existing select-based methods, especially at high compression ratios.
Contribution
The paper proposes CLIP-Map, a novel structured matrix mapping approach using learnable matrices and Kronecker factorization for efficient CLIP compression.
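To make the contrast between selecting and mapping concrete, here is a minimal pure-Python sketch. It assumes that "selection" can be written as multiplication by a 0/1 selection matrix and that a "mapping" is a dense matrix built as a Kronecker product of two smaller factors. The weight `W`, the factors `A` and `B`, and the helper names are all hypothetical; this is not the paper's actual formulation, shapes, or training procedure.

```python
# Illustrative sketch only: all shapes, values, and names are assumptions.

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]

def matmul(a, b):
    """Plain matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def param_counts(a_shape, b_shape):
    """Parameters of a dense map versus its two Kronecker factors."""
    (a1, a2), (b1, b2) = a_shape, b_shape
    full = (a1 * b1) * (a2 * b2)      # dense (a1*b1) x (a2*b2) mapping matrix
    factored = a1 * a2 + b1 * b2      # only the two small factors are stored
    return full, factored

# Hypothetical pretrained weight with 4 output rows and 2 input columns.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Select-based inheritance: keep rows 0 and 2, discard rows 1 and 3.
S = [[1, 0, 0, 0], [0, 0, 1, 0]]
W_select = matmul(S, W)

# Mapping-based inheritance: every original row contributes to the result.
A = [[0.5, 0.5]]                      # 1 x 2 factor (hypothetical values)
B = [[1.0, 0.0], [0.0, 1.0]]          # 2 x 2 factor (hypothetical values)
M = kron(A, B)                        # 2 x 4 mapping matrix
W_map = matmul(M, W)

print(W_select)                       # rows 1 and 3 of W are lost
print(W_map)                          # each output row averages two rows of W
print(param_counts((16, 32), (24, 24)))  # a 384 x 768 map from two small factors
```

In this toy setting the selection matrix is itself a special case of a mapping matrix, which is one way to read the claim that mapping can retain information that selection discards. The parameter count shows why a factorized map stays cheap to learn even when the dense map would be large.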
Findings
Outperforms select-based methods across various compression ratios.
Significant improvements under high compression settings.
Uses Diagonal Inheritance Initialization for better optimization.
Abstract
Contrastive Language-Image Pre-training (CLIP) has been widely applied in various computer vision tasks, e.g., text-to-image generation, image-text retrieval, and image captioning. However, CLIP suffers from high memory and computation costs, which prohibit its use in resource-limited application scenarios. Existing CLIP compression methods typically reduce the size of pre-trained CLIP weights by selecting a subset of them as the weight inheritance for further retraining, via mask optimization or weight-importance measurement. However, this select-based weight inheritance often compromises the feature representation ability, especially under extreme compression. In this paper, we propose a novel mapping-based CLIP compression framework, CLIP-Map. It leverages learnable matrices to map and combine pretrained weights by Full-Mapping with Kronecker Factorization, aiming to preserve as…
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper is well-organized and easy to follow.
- The proposed method is effective, and the use of structured matrix mapping is novel in the context of CLIP compression.
- The experimental results are comprehensive and convincing.
- This paper greatly violates ICLR's formatting guidelines (see suggestions below). The paper should not be accepted in its current form.
- Equation (10) is not rigorous. The Kronecker product of $F^{in}$ and $F^{out}$ initialized with Equation (9) is not an identity matrix (the diagonal entries are not all 1 and there are non-zero off-diagonal entries).
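The objection to Equation (10) rests on a general property of the Kronecker product: for square factors, $A \otimes B$ equals the identity only when $A = cI$ and $B = \frac{1}{c}I$ for some non-zero scalar $c$. The sketch below checks this numerically. Equation (9) is not reproduced on this page, so the factor `F_in` here is a hypothetical stand-in with one off-diagonal entry, not the paper's initialization.

```python
# Illustrative check only: F_in is a made-up factor, not the paper's Equation (9).

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]

def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def is_identity(m, tol=1e-12):
    """True if m is square and matches the identity within tol."""
    n = len(m)
    return all(
        len(row) == n and all(abs(v - (1.0 if i == j else 0.0)) <= tol
                              for j, v in enumerate(row))
        for i, row in enumerate(m)
    )

I2, I3 = identity(2), identity(3)

# Identity factors give an identity product.
exact = is_identity(kron(I2, I3))

# Reciprocal scalings of the identity also give an identity product.
scaled = is_identity(kron([[2.0 * v for v in r] for r in I2],
                          [[0.5 * v for v in r] for r in I3]))

# A single off-diagonal entry in one factor carries over to the product.
F_in = [[1.0, 0.5], [0.0, 1.0]]       # hypothetical factor
K = kron(F_in, I3)
perturbed = is_identity(K)

print(exact, scaled, perturbed)
```

Any non-zero off-diagonal entry in either factor is multiplied into an entire block of the product, so a factor pair that is only approximately diagonal cannot yield an exact identity.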
The methods proposed in this paper are technically sound, and both contributions offer relevant improvements for CLIP compression. Moreover, the experiments are comprehensive, with thorough exploration conducted on both classification and retrieval tasks. The task itself is also quite important.
Can the compressed CLIP model be widely applied to downstream tasks? The experiments in this paper only evaluated retrieval and classification tasks, without assessing other tasks such as generation or some comprehension tasks. If the model can only improve performance in classification and retrieval but not be applicable to these other tasks, its applicability in a broader range of fields would be significantly reduced. It would be even better if comparisons with other compressed models on othe
This paper addresses a practical and timely issue: large multimodal models incur substantial computational and storage overhead, which is critical for resource-constrained devices. Furthermore, the core approach of replacing selection and discard with mapping and combining pre-trained weights is reasonably justified. This method can preserve the original model's information, demonstrating particularly significant effects at high compression rates, whereas comparable methods may result in critica
1) The paper's core idea, that mapping preserves more information than selection, currently remains at an intuitive level and lacks rigorous theoretical support. For instance, the authors do not analyse from a mathematical perspective why this mapping-based structure can more effectively maintain the knowledge fidelity of pre-trained models. 2) The experiments lack ablation studies. For instance, the paper provides no ablation experiments to distinguish the respective contributions of width an
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Data Compression Techniques
