Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation

Li Zhong; Ahmed Ghazal; Jun-Jun Wan; Frederik Zilly; Patrick Mackens; Joachim E. Vollrath; Bogdan Sorin Coseriu

arXiv:2505.18039 · cs.CV · May 26, 2025

TL;DR

This paper introduces Clip4Retrofit, a model distillation framework that enables real-time image labeling on resource-limited edge devices by compressing CLIP into a lightweight, efficient model suitable for practical deployment.

Contribution

The paper presents a novel distillation approach that pairs an EfficientNet-B3 backbone with MLP projection heads, retaining CLIP's cross-modal alignment while reducing computational demands for edge deployment.
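To make the student design concrete, here is a minimal sketch of an MLP projection head and a feature-distillation loss. It is an illustration under stated assumptions, not the authors' implementation: the loss (one minus cosine similarity to the teacher embedding), the toy dimensions, and the helper names `mlp_head` and `distillation_loss` are all hypothetical, and random vectors stand in for EfficientNet-B3 features and CLIP embeddings.

```python
import math
import random


def mlp_head(x, w1, b1, w2, b2):
    """Two-layer MLP projection head: linear -> ReLU -> linear."""
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(h * w for h, w in zip(hidden, row)) + b for row, b in zip(w2, b2)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # epsilon guards against zero vectors


def distillation_loss(student_emb, teacher_emb):
    """Assumed loss: 0 when the embeddings point the same way, 2 when opposite."""
    return 1.0 - cosine_similarity(student_emb, teacher_emb)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
BACKBONE_DIM, HIDDEN_DIM, CLIP_DIM = 8, 16, 4  # toy sizes, not the real dimensions

w1 = random_matrix(HIDDEN_DIM, BACKBONE_DIM, rng)
b1 = [0.0] * HIDDEN_DIM
w2 = random_matrix(CLIP_DIM, HIDDEN_DIM, rng)
b2 = [0.0] * CLIP_DIM

# Stand-ins: pooled student backbone features and a frozen teacher embedding.
backbone_features = [rng.uniform(0.0, 1.0) for _ in range(BACKBONE_DIM)]
teacher_embedding = [rng.uniform(-1.0, 1.0) for _ in range(CLIP_DIM)]

student_embedding = mlp_head(backbone_features, w1, b1, w2, b2)
loss = distillation_loss(student_embedding, teacher_embedding)
print(f"distillation loss: {loss:.4f}")
```

In a training loop, the gradient of this loss would update the backbone and head weights while the teacher stays frozen; the head exists so the student's output lands in the same embedding space as CLIP.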

Findings

01. Distilled model achieves real-time performance on edge devices.

02. Maintains effective cross-modal alignment comparable to CLIP.

03. Enables practical deployment in autonomous vehicles and retrofitting scenarios.

Abstract

Foundation models like CLIP (Contrastive Language-Image Pretraining) have revolutionized vision-language tasks by enabling zero-shot and few-shot learning through cross-modal alignment. However, their computational complexity and large memory footprint make them unsuitable for deployment on resource-constrained edge devices, such as in-car cameras used for image collection and real-time processing. To address this challenge, we propose Clip4Retrofit, an efficient model distillation framework that enables real-time image labeling on edge devices. The framework is deployed on the Retrofit camera, a cost-effective edge device retrofitted into thousands of vehicles, despite strict limitations on compute performance and memory. Our approach distills the knowledge of the CLIP model into a lightweight student model, combining EfficientNet-B3 with multi-layer perceptron (MLP) projection heads…
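Because the student is trained to keep CLIP's cross-modal alignment, image labeling can in principle work the way CLIP zero-shot classification does: compare the image embedding with text embeddings of candidate labels and pick the closest. The sketch below illustrates that idea only. The label names, the three-dimensional vectors, the temperature value, and the function `label_image` are hypothetical; in practice the text embeddings would come from CLIP's text encoder and the image embedding from the distilled student.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def label_image(image_emb, text_embs, temperature=0.01):
    """Return the best label and a softmax over cosine similarities."""
    img = normalize(image_emb)
    logits = {
        label: sum(a * b for a, b in zip(img, normalize(emb))) / temperature
        for label, emb in text_embs.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs


# Hypothetical text embeddings, assumed to be precomputed offline.
text_embs = {
    "pedestrian": [0.9, 0.1, 0.0],
    "traffic sign": [0.1, 0.9, 0.1],
    "construction site": [0.0, 0.2, 0.9],
}
# Hypothetical image embedding produced by the student on the device.
image_emb = [0.2, 1.1, 0.05]

label, probs = label_image(image_emb, text_embs)
print(label)
```

Precomputing the text embeddings means only the image branch has to run on the device, which is one plausible reason this setup suits a memory-constrained camera.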

Taxonomy

Methods: Contrastive Language-Image Pre-training