TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval

Kaibin Tian; Ruixiang Zhao; Hu Hu; Runquan Xie; Fengzong Lian; Zhanhui Kang; Xirong Li

arXiv:2308.01217·cs.CV·August 3, 2023

TL;DR

TeachCLIP introduces a multi-grained teaching approach that enables efficient text-to-video retrieval by distilling knowledge from advanced models into a lightweight CLIP4Clip-based student, enhancing performance without added retrieval overhead.

Contribution

The paper proposes a novel multi-grained teaching framework with an Attentional frame-Feature Aggregation (AFA) block to improve student learning in efficient T2VR models.

Findings

01 Effective knowledge transfer from heavy models to a lightweight student.

02 Improved retrieval accuracy demonstrated on multiple datasets.

03 AFA enhances fine-grained learning without extra retrieval cost.

Abstract

For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods dominate. Compared to CLIP4Clip, which is efficient and compact, state-of-the-art models tend to compute video-text similarity by fine-grained cross-modal feature interaction and matching, putting their scalability for large-scale T2VR into doubt. For efficient T2VR, we propose TeachCLIP with multi-grained teaching to let a CLIP4Clip-based student network learn from more advanced yet computationally heavy models such as X-CLIP, TS2-Net and X-Pool. To improve the student's learning capability, we add an Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage/computation overhead at the retrieval stage. While attentive weights produced by AFA are commonly used for combining frame-level features, we propose a novel use of the…
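
The abstract notes that attentive weights are commonly used to combine frame-level features into one video-level feature. A minimal sketch of that common use follows, in pure Python. It is not the paper's AFA implementation: the scoring rule (a dot product with a vector named `score_vec`), the function names, and the toy 2-D features are all assumptions made for illustration. Because the aggregated video vector can be computed once offline, query-time matching is a single vector comparison, which is one way to read the "no extra overhead at the retrieval stage" claim.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_frames(frame_feats, score_vec):
    """Combine frame features into one video feature with attention weights.

    Assumption: each frame is scored by a dot product with `score_vec`,
    a hypothetical stand-in for whatever learned scorer a real model uses.
    """
    scores = [sum(f * w for f, w in zip(frame, score_vec)) for frame in frame_feats]
    weights = softmax(scores)
    dim = len(frame_feats[0])
    # Weighted sum of the frame features, one dimension at a time.
    video = [sum(w * frame[d] for w, frame in zip(weights, frame_feats))
             for d in range(dim)]
    return video, weights


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data (hypothetical): three frames with 2-D features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score_vec = [0.5, -0.5]

# Offline step: one vector per video, stored in the index.
video_vec, weights = aggregate_frames(frames, score_vec)

# Query-time step: a single similarity against the text feature.
text_vec = [1.0, 0.0]
similarity = cosine(video_vec, text_vec)
```

In this toy setup the first frame receives the largest weight because it scores highest against `score_vec`; the stored index still holds only one vector per video, the same footprint as plain mean pooling.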

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization