TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm

Bingqing Zhang, Zhuo Cao, Heming Du, Xin Yu, Xue Li, Jiajun Liu, and Sen Wang

arXiv:2409.19865·cs.CV·October 1, 2024

TL;DR

TokenBinder introduces a novel one-to-many alignment paradigm for text-video retrieval, inspired by human comparative judgment, significantly improving fine-grained matching accuracy across multiple datasets.

Contribution

The paper proposes a two-stage framework whose Focused-view Fusion Network dynamically aligns a query with multiple candidate videos, addressing the limitations of one-to-one alignment paradigms in text-video retrieval (TVR).

Findings

01. Outperforms state-of-the-art methods on six benchmark datasets

02. Demonstrates robustness and effectiveness of fine-grained alignment

03. Bridges intra- and inter-modality information gaps in TVR

Abstract

Text-Video Retrieval (TVR) methods typically match query-candidate pairs by aligning text and video features in coarse-grained, fine-grained, or combined (coarse-to-fine) manners. However, these frameworks predominantly employ a one(query)-to-one(candidate) alignment paradigm, which struggles to discern nuanced differences among candidates, leading to frequent mismatches. Inspired by Comparative Judgement in human cognitive science, where decisions are made by directly comparing items rather than evaluating them independently, we propose TokenBinder. This innovative two-stage TVR framework introduces a novel one-to-many coarse-to-fine alignment paradigm, imitating the human cognitive process of identifying specific items within a large collection. Our method employs a Focused-view Fusion Network with a sophisticated cross-attention mechanism, dynamically aligning and comparing features…
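The contrast between one-to-one and one-to-many alignment can be illustrated with a toy sketch. This is not the authors' Focused-view Fusion Network: the function names, the 2-D features, and the softmax-over-all-candidate-frames scoring are assumptions chosen only to show the idea of a coarse shortlist followed by a comparative re-ranking in which candidates compete for the same attention mass.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_pool(vectors):
    """Average a list of vectors into one coarse-grained vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def coarse_rank(query_tokens, videos, k):
    """Stage 1 (one-to-one): score each video independently, keep the top k."""
    q = mean_pool(query_tokens)
    scores = {vid: cosine(q, mean_pool(frames)) for vid, frames in videos.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def comparative_rerank(query_tokens, videos, candidates, temperature=0.1):
    """Stage 2 (one-to-many, hypothetical stand-in): each query token attends
    over the frames of ALL shortlisted candidates at once, so a candidate's
    score depends on how it compares with the others."""
    totals = {vid: 0.0 for vid in candidates}
    for token in query_tokens:
        logits = [(vid, cosine(token, frame) / temperature)
                  for vid in candidates for frame in videos[vid]]
        peak = max(value for _, value in logits)  # subtract max for stability
        weights = [(vid, math.exp(value - peak)) for vid, value in logits]
        z = sum(w for _, w in weights)
        for vid, w in weights:
            totals[vid] += w / z  # attention mass captured by this candidate
    n = len(query_tokens)
    return sorted(((vid, s / n) for vid, s in totals.items()),
                  key=lambda pair: pair[1], reverse=True)

# Toy data (assumed): two query tokens and three videos with two frames each.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
videos = {
    "a": [[1.0, 0.0], [1.0, 0.0]],    # matches only the first token
    "b": [[1.0, 0.0], [0.0, 1.0]],    # matches both tokens
    "c": [[-1.0, 0.0], [0.0, -1.0]],  # unrelated
}

shortlist = coarse_rank(query_tokens, videos, k=2)
ranking = comparative_rerank(query_tokens, videos, shortlist)
print(shortlist, ranking)
```

Because the softmax in the second stage spans every frame of every shortlisted video, the scores sum to one across candidates, which is what makes the judgement comparative rather than independent.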

Code & Models

Repositories

bingqingzhang/TokenBinder
PyTorch · Official

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques