Distilling Vision-Language Pretraining for Efficient Cross-Modal Retrieval
Young Kyun Jang, Donghyun Kim, Ser-nam Lim

TL;DR
This paper presents DCMQ, a distillation-based method that uses vision-language pre-trained (VLP) models to improve cross-modal hashing for efficient image-text retrieval, outperforming existing supervised cross-modal hashing methods.
Contribution
Introduces DCMQ, a distillation framework that uses VLP models to improve hash representations, and PQG, a quantization method for better codebook learning.
Findings
DCMQ outperforms existing supervised cross-modal hashing methods.
The use of VLP models significantly enhances hash representation quality.
PQG improves codebook balance and retrieval accuracy.
Abstract
"Learning to hash" is a practical solution for efficient retrieval, offering fast search speed and low storage cost. It is widely applied in various applications, such as image-text cross-modal search. In this paper, we explore the potential of enhancing the performance of learning to hash with the proliferation of powerful large pre-trained models, such as Vision-Language Pre-training (VLP) models. We introduce a novel method named Distillation for Cross-Modal Quantization (DCMQ), which leverages the rich semantic knowledge of VLP models to improve hash representation learning. Specifically, we use the VLP as a "teacher" to distill knowledge into a "student" hashing model equipped with codebooks. This process involves the replacement of supervised labels, which are composed of multi-hot vectors and lack semantics, with the rich semantics of VLP. In the end, we apply a transformation…
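The teacher-student setup described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the soft-assignment product quantizer, the similarity-matching loss, the sharing of one set of codebooks across both modalities, and all names and sizes (`product_quantize`, `similarity_distillation_loss`, `N`, `M`, `K`, `d`) are assumptions chosen for illustration, and random arrays stand in for real VLP and student features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def product_quantize(features, codebooks, temperature=0.1):
    """Generic product quantization (hypothetical helper, not the paper's PQG).

    features:  (N, M * d) student embeddings
    codebooks: (M, K, d)  M sub-codebooks with K codewords each
    Returns hard codes (N, M) and a soft reconstruction (N, M * d).
    """
    n = features.shape[0]
    m, k, d = codebooks.shape
    sub = features.reshape(n, m, d)
    # Squared distance from every sub-vector to every codeword: (N, M, K).
    dist = ((sub[:, :, None, :] - codebooks[None, :, :, :]) ** 2).sum(-1)
    codes = dist.argmin(-1)
    # Soft assignment: a weighted mix of codewords instead of a hard pick.
    weights = softmax(-dist / temperature, axis=-1)
    soft = np.einsum("nmk,mkd->nmd", weights, codebooks).reshape(n, m * d)
    return codes, soft


def similarity_distillation_loss(teacher_img, teacher_txt, student_img, student_txt):
    """Assumed loss: MSE between teacher and student image-text cosine similarities."""
    t = l2_normalize(teacher_img) @ l2_normalize(teacher_txt).T
    s = l2_normalize(student_img) @ l2_normalize(student_txt).T
    return float(np.mean((t - s) ** 2))


# Random stand-ins for teacher (VLP) and student features.
rng = np.random.default_rng(0)
N, M, K, d = 8, 4, 16, 8
teacher_img = rng.normal(size=(N, 64))
teacher_txt = rng.normal(size=(N, 64))
student_img = rng.normal(size=(N, M * d))
student_txt = rng.normal(size=(N, M * d))
codebooks = rng.normal(size=(M, K, d))  # assumption: shared across modalities

img_codes, img_soft = product_quantize(student_img, codebooks)
txt_codes, txt_soft = product_quantize(student_txt, codebooks)
loss = similarity_distillation_loss(teacher_img, teacher_txt, img_soft, txt_soft)

# Each item is stored as M codeword indices of log2(K) bits each.
bits_per_item = M * int(np.log2(K))
```

In this sketch the teacher's similarity structure takes the place of multi-hot labels as the training target, and only the `M` codeword indices per item would need to be stored at retrieval time.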
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Constraint Satisfaction and Optimization
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
