MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
Min Yang, Zihan Jia, Zhilin Dai, Sheng Guo, Limin Wang

TL;DR
MobileViCLIP is a lightweight, efficient video-text model optimized for mobile devices, achieving high speed and strong zero-shot performance by integrating temporal reparameterization into an image-text architecture.
Contribution
The paper introduces MobileViCLIP, a novel mobile-friendly video-text model that combines temporal structural reparameterization with efficient architecture for fast inference and competitive zero-shot capabilities.
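The summary does not spell out how the temporal structural reparameterization is built, so the following is only a minimal sketch of the general idea behind structural reparameterization, under stated assumptions: a training-time block with two parallel branches (a depthwise temporal convolution over frames plus an identity skip) is folded into a single equivalent convolution for inference. The kernel size, the identity branch, and all function names here are hypothetical illustrations, not the authors' actual design.

```python
def temporal_conv(x, kernel):
    """Depthwise 1D convolution over the time axis with zero padding.

    x is a list of per-frame feature values for one channel; the kernel
    length is assumed to be odd so the output keeps the same length.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[t + j] for j in range(len(kernel)))
        for t in range(len(x))
    ]


def train_time_block(x, kernel):
    """Hypothetical training-time block: temporal conv branch + identity branch."""
    conv_out = temporal_conv(x, kernel)
    return [c + v for c, v in zip(conv_out, x)]


def reparameterize(kernel):
    """Fold the identity branch into the centre tap of the temporal kernel."""
    merged = list(kernel)
    merged[len(kernel) // 2] += 1.0
    return merged


def inference_block(x, merged_kernel):
    """Inference-time block: a single convolution, no extra branch."""
    return temporal_conv(x, merged_kernel)


# Toy check on one channel across four frames (made-up values).
frames = [0.5, -1.0, 2.0, 0.25]
kernel = [0.1, 0.3, -0.2]

two_branch = train_time_block(frames, kernel)
single_branch = inference_block(frames, reparameterize(kernel))
max_diff = max(abs(p - q) for p, q in zip(two_branch, single_branch))
print("max difference between the two forms:", max_diff)
```

Because both branches are linear, the merged kernel produces the same output (up to floating-point rounding) while the inference graph contains one operation instead of two, which is the property that makes this kind of block attractive on mobile hardware.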
Findings
MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 on mobile devices.
Achieves zero-shot retrieval performance comparable to larger models.
Outperforms similar models on the MSR-VTT dataset.
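For readers unfamiliar with how zero-shot text-to-video retrieval is usually scored, here is a minimal sketch assuming CLIP-style embeddings compared by cosine similarity and evaluated with recall@k. The embeddings below are made-up toy vectors, and the function names are hypothetical; none of this comes from the paper's code or its reported numbers.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def cosine_similarity_matrix(text_embs, video_embs):
    """Return sim[i][j] = cosine similarity between text i and video j."""
    texts = [l2_normalize(t) for t in text_embs]
    videos = [l2_normalize(v) for v in video_embs]
    return [
        [sum(a * b for a, b in zip(t, v)) for v in videos]
        for t in texts
    ]


def recall_at_k(sim, k):
    """Fraction of text queries whose paired video (same index) is in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy embeddings: text i is assumed to be the caption of video i.
text_embs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.2], [0.7, 0.7, 0.0]]
video_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3], [0.2, 0.1, 0.9]]

sim = cosine_similarity_matrix(text_embs, video_embs)
print("R@1:", recall_at_k(sim, 1))
print("R@3:", recall_at_k(sim, 3))
```

No task-specific fine-tuning appears anywhere in this loop, which is what "zero-shot" refers to: the pretrained encoders are applied directly to the benchmark's captions and videos.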
Abstract
Efficient lightweight neural networks are receiving increasing attention due to their faster inference speed and easier deployment on mobile devices. However, existing video pre-trained models still focus on the common ViT architecture with high latency, and few works attempt to build efficient architectures for mobile devices. This paper bridges this gap by introducing temporal structural reparameterization into an efficient image-text model and training it on a large-scale, high-quality video-text dataset. The result is MobileViCLIP, an efficient video-text model that can run on mobile devices with strong zero-shot classification and retrieval capabilities. In particular, in terms of inference speed on mobile devices, our MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14. In terms of zero-shot retrieval performance, our…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
