CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval

Yating Liu; Yaowei Li; Zimo Liu; Wenming Yang; Yaowei Wang; Qingmin Liao

arXiv:2309.09496·cs.CV·January 3, 2024·2 cites

TL;DR

This paper introduces a CLIP-based method for text-based person retrieval that transfers knowledge between the vision and language modalities, outperforming state-of-the-art methods on three benchmarks while training only 7.4% of the model's parameters.

Contribution

The paper proposes a synergistic knowledge transfer framework that combines bidirectional prompts with dual adapters to adapt CLIP to text-based person retrieval (TPR) while keeping the number of trained parameters small.

Findings

01 Outperforms state-of-the-art methods on three benchmarks

02 Uses only 7.4% of total model parameters for training

03 Demonstrates high efficiency, effectiveness, and generalization

Abstract

Text-based Person Retrieval (TPR) aims to retrieve images of a target person given a textual query. The primary challenge lies in bridging the substantial gap between the vision and language modalities, especially when large-scale datasets are limited. In this paper, we introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for TPR. Specifically, to explore CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module constructed from text-to-image and image-to-text bidirectional prompts and coupling projections. Second, Dual Adapters Transferring (DAT) is designed to transfer knowledge on the output side of Multi-Head Attention (MHA) in both vision and language. This synergistic two-way collaborative mechanism promotes early-stage feature fusion and efficiently exploits the existing knowledge of CLIP. CSKT outperforms the…
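
The retrieval step that the abstract describes can be illustrated with a minimal sketch: given one text embedding and a gallery of image embeddings, rank the images by cosine similarity, as is common for CLIP-style models. This is not the CSKT implementation. The function names (`l2_normalize`, `cosine_similarity`, `rank_gallery`), the three-dimensional toy vectors, and the image identifiers are all hypothetical, and the BPT and DAT modules are not modeled here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_gallery(text_embedding, image_embeddings):
    """Return (image_id, score) pairs sorted from best to worst match.

    image_embeddings maps an image id to its embedding. In a real
    system both embeddings would come from the text and image encoders;
    here they are toy values chosen by hand.
    """
    scores = [
        (image_id, cosine_similarity(text_embedding, emb))
        for image_id, emb in image_embeddings.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed for illustration only).
query = [0.9, 0.1, 0.0]
gallery = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.5, 0.5, 0.0],
}

ranking = rank_gallery(query, gallery)
ranked_ids = [image_id for image_id, _ in ranking]
print(ranked_ids)
```

In this toy setup the image whose embedding points in nearly the same direction as the query is ranked first. Retrieval benchmarks typically score such a ranking by whether the correct identity appears among the top-k results.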

Code & Models

Repositories

Liu-Yating/CSKT
PyTorch · Official

Taxonomy

Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Softmax · Contrastive Language-Image Pre-training