CLIP Multi-modal Hashing: A new baseline CLIPMH
Jian Zhu, Mingkai Sheng, Mingda Ke, Zhangmin Huang, Jingfei Chang

TL;DR
This paper introduces CLIPMH, a multi-modal hashing method leveraging CLIP to extract and fuse image and text features, significantly improving retrieval accuracy over existing methods by enhancing feature expressiveness.
Contribution
The paper proposes a new baseline, CLIPMH, that uses CLIP for feature extraction and then fuses the extracted features, addressing the low retrieval accuracy of existing multi-modal hashing methods.
Findings
CLIPMH outperforms state-of-the-art methods, with an accuracy increase of up to 8.38%.
CLIP enhances feature expressiveness for better retrieval performance.
The method demonstrates significant improvements over traditional backbone networks.
Abstract
Multi-modal hashing is widely used in multimedia retrieval. It fuses multi-source data to generate binary hash codes. However, current multi-modal hashing methods suffer from low retrieval accuracy. The reason is that the individual backbone networks have limited feature expression capabilities and are not jointly pre-trained on large-scale unsupervised multi-modal data. To solve this problem, we propose a new baseline, the CLIP Multi-modal Hashing (CLIPMH) method. It uses the CLIP model to extract text and image features, and then fuses them to generate hash codes. CLIP improves the expressiveness of each modality's features. In this way, it can greatly improve the retrieval performance of multi-modal hashing methods. In comparison to state-of-the-art unsupervised and supervised multi-modal hashing methods, experiments reveal that the proposed CLIPMH can significantly enhance performance…
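The pipeline the abstract describes (extract per-modality features, fuse them, binarize into a hash code) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and not the paper's method: the feature vectors are plain lists standing in for CLIP image and text embeddings, fusion is simple concatenation of L2-normalized vectors, and the hash layer is a fixed random projection followed by a sign threshold. The function names `l2_normalize`, `make_projection`, `fuse_and_hash`, and `hamming` are hypothetical.

```python
import random


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm > 0 else list(vec)


def make_projection(in_dim, n_bits, seed=0):
    """Assumed stand-in for a learned hash layer: a fixed Gaussian random matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(n_bits)]


def fuse_and_hash(image_feat, text_feat, projection):
    """Fuse two modality features and binarize them into a hash code.

    Fusion by concatenation is an assumption; the source only says the
    features are fused, without specifying how.
    """
    fused = l2_normalize(image_feat) + l2_normalize(text_feat)
    # One bit per projection row: 1 if the projected value is non-negative.
    return [
        1 if sum(w * x for w, x in zip(row, fused)) >= 0 else 0
        for row in projection
    ]


def hamming(code_a, code_b):
    """Number of bit positions where two equal-length hash codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


# Toy features standing in for CLIP image/text embeddings (hypothetical values).
image_feat = [0.2, -0.5, 0.9, 0.1]
text_feat = [0.7, 0.3, -0.4, 0.6]
projection = make_projection(in_dim=8, n_bits=16, seed=0)
code = fuse_and_hash(image_feat, text_feat, projection)
```

At retrieval time, a query is hashed the same way and database items are ranked by `hamming` distance to the query code, which is what makes binary codes cheap to compare. In a trained system the random projection would be replaced by learned parameters.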
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
