Distilling Vision-Language Pretraining for Efficient Cross-Modal Retrieval

Young Kyun Jang; Donghyun Kim; Ser-nam Lim

arXiv:2405.14726·cs.CV·May 24, 2024

Open Access

TL;DR

This paper presents DCMQ, a distillation-based method that uses vision-language pre-trained (VLP) models to improve cross-modal hashing for efficient image-text retrieval, outperforming existing supervised cross-modal hashing methods.

Contribution

Introduces DCMQ, a distillation framework that uses VLP models to improve hash representations, together with PQG, a quantization method for better codebook learning.

Findings

01 DCMQ outperforms existing supervised cross-modal hashing methods.

02 The use of VLP models significantly enhances hash representation quality.

03 PQG improves codebook balance and retrieval accuracy.

Abstract

"Learning to hash" is a practical solution for efficient retrieval, offering fast search speed and low storage cost. It is widely applied in various applications, such as image-text cross-modal search. In this paper, we explore the potential of enhancing the performance of learning to hash with the proliferation of powerful large pre-trained models, such as Vision-Language Pre-training (VLP) models. We introduce a novel method named Distillation for Cross-Modal Quantization (DCMQ), which leverages the rich semantic knowledge of VLP models to improve hash representation learning. Specifically, we use the VLP as a "teacher" to distill knowledge into a "student" hashing model equipped with codebooks. This process involves the replacement of supervised labels, which are composed of multi-hot vectors and lack semantics, with the rich semantics of VLP. In the end, we apply a transformation…
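The teacher-student setup described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no formulas, so the softmax-weighted product quantizer, the KL-based similarity loss, and every name below (`soft_quantize`, `similarity_distillation_loss`, `temperature`, `tau`) are assumptions chosen only to show the general idea of a codebook-equipped student matching a teacher's image-text similarity structure instead of multi-hot labels.

```python
import numpy as np


def soft_quantize(x, codebooks, temperature=1.0):
    """Soft product quantization (illustrative assumption, not the paper's PQG).

    x:         (n, d) student embeddings
    codebooks: (M, K, d // M) array, one codebook of K codewords per sub-vector
    Returns the soft-quantized embeddings (n, d) and the hard codes (n, M).
    """
    n, d = x.shape
    M, K, sub = codebooks.shape
    assert d == M * sub, "embedding dim must equal M * sub-vector dim"
    parts = x.reshape(n, M, sub)
    out = np.empty_like(parts)
    codes = np.empty((n, M), dtype=np.int64)
    for m in range(M):
        # Squared Euclidean distance from each sub-vector to each codeword: (n, K)
        diff = parts[:, m, None, :] - codebooks[m][None, :, :]
        dist = (diff ** 2).sum(axis=-1)
        # Softmax over negative distances gives soft assignment weights
        logits = -dist / temperature
        logits = logits - logits.max(axis=1, keepdims=True)
        w = np.exp(logits)
        w = w / w.sum(axis=1, keepdims=True)
        out[:, m, :] = w @ codebooks[m]
        # Hard code = index of the nearest codeword (what would be stored)
        codes[:, m] = dist.argmin(axis=1)
    return out.reshape(n, d), codes


def _l2_normalize(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def _log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def similarity_distillation_loss(t_img, t_txt, s_img, s_txt, tau=0.1):
    """Mean KL(teacher || student) over image-to-text similarity distributions.

    The teacher's pairwise similarities act as soft targets, replacing
    multi-hot class labels (a hypothetical loss, assumed for illustration).
    """
    t_logits = _l2_normalize(t_img) @ _l2_normalize(t_txt).T / tau
    s_logits = _l2_normalize(s_img) @ _l2_normalize(s_txt).T / tau
    log_p = _log_softmax(t_logits)
    log_q = _log_softmax(s_logits)
    p = np.exp(log_p)
    return float((p * (log_p - log_q)).sum(axis=1).mean())
```

In a training loop under these assumptions, the student's image and text embeddings would each pass through `soft_quantize` before the loss is computed, so that gradients reach the codebooks; at retrieval time only the hard codes would be stored, which is where the storage and speed benefits of hashing come from.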


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Constraint Satisfaction and Optimization

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings