TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge

Shu-Hao Zhang; Wei-Cheng Tang; Chen Wu; Peng Hu; Nan Li; Liang-Jie Zhang; Qi Zhang; Shao-Qun Zhang

arXiv:2510.21879·cs.CV·October 28, 2025

TL;DR

TernaryCLIP introduces a method to compress large vision-language models by converting weights to ternary format with minimal performance loss, enabling efficient deployment on resource-limited devices.

Contribution

The paper presents TernaryCLIP, the first framework to convert CLIP's weights into ternary format using quantization-aware training and distillation, achieving high compression and inference acceleration.

Findings

01 Up to 99% of weights ternarized, at 1.58 bits per weight

02 16.98× compression ratio and 2.3× inference speedup

03 Strong zero-shot performance maintained across 41 datasets
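
A short note on where the 1.58-bit figure comes from: a ternary weight takes one of three values, so it carries at most log2(3) bits of information. The second expression is only an illustrative upper bound on the compression ratio, under two assumptions that this page does not state (a 32-bit floating-point baseline and every weight ternarized); the reported 16.98× lies below that bound.

```latex
\[
\log_2 3 \approx 1.585 \ \text{bits per weight},
\qquad
\frac{32}{\log_2 3} \approx 20.19 .
\]
```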

Abstract

Recent years have witnessed an increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose the TernaryCLIP, a lightweight computational framework that converts connection weights of both vision and text encoders of CLIP into the ternary format, instead of full-precision or floating ones. TernaryCLIP incorporates quantization-aware training and distillation modules, preventing precision degradation and enabling low-cost and high-efficiency computations. Comprehensive experiments demonstrate that TernaryCLIP can achieve up to 99% ternarized weights with 1.58-bit representation, 16.98× compression ratio, 2.3× inference acceleration, 16× storage reduction, 10× memory optimization, and 60% sparsity while maintaining promising performance on zero-shot image…
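
To make "ternary format" concrete, here is a minimal NumPy sketch of mapping a float weight matrix to values in {-1, 0, +1} with a single per-tensor scale. The mean-absolute-value scaling rule and the function names `ternarize` and `dequantize` are assumptions chosen for illustration; the abstract does not specify the quantizer, and the sketch leaves out the quantization-aware training and distillation that the paper relies on.

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Quantize a float array to {-1, 0, +1} plus one scale.

    Assumption: the scale is the mean absolute weight (a common choice
    in ternary quantization, not confirmed as the paper's rule).
    """
    scale = float(np.mean(np.abs(w))) + eps
    # Round the rescaled weights, then clamp to the ternary range.
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float array from ternary codes."""
    return q.astype(np.float32) * scale

# Hypothetical weight matrix standing in for one encoder layer.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

q, scale = ternarize(w)
sparsity = float(np.mean(q == 0))  # fraction of weights mapped to zero
w_hat = dequantize(q, scale)
print("unique codes:", np.unique(q), "sparsity:", round(sparsity, 3))
```

Weights that round to zero can be skipped at inference time, which is one way a ternary model ends up sparse; the sparsity of this random matrix says nothing about the 60% reported in the abstract.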

Code & Models

Models

🤗 zhangsq-nju/TernaryCLIP_ViT-B-16 (model · 6 dl · ♡ 4)
