TVPR: Text-to-Video Person Retrieval and a New Benchmark

Xu Zhang; Fan Ni; Guan-Nan Dong; Aichun Zhu; Jianhui Wu; Mingcheng Ni; Hui Liu

arXiv:2307.07184 · cs.CV · April 22, 2025

TL;DR

This paper introduces the TVPR task and a new dataset for text-to-video person retrieval, along with a novel learning strategy that significantly improves retrieval performance by leveraging cross-modal representations.

Contribution

The paper presents the first text-based person retrieval method that operates on videos, and constructs a large-scale dataset with natural language annotations for this task.

Findings

01. Achieved state-of-the-art results on the TVPReid dataset.

02. Demonstrated the effectiveness of the MFGF strategy in cross-modal learning.

03. Provided a new benchmark for future research in text-to-video person retrieval.

Abstract

Most existing methods for text-based person retrieval focus on text-to-image person retrieval. Nevertheless, because isolated frames lack dynamic information, performance is hampered when the person is obscured or when variable motion details are missed in isolated frames. To overcome this, we propose a novel Text-to-Video Person Retrieval (TVPR) task. Since there is no dataset or benchmark that describes person videos with natural language, we construct a large-scale cross-modal person video dataset containing detailed natural language annotations, termed the Text-to-Video Person Re-identification (TVPReid) dataset. In this paper, we introduce a Multielement Feature Guided Fragments Learning (MFGF) strategy, which leverages the cross-modal text-video representations to provide strong text-visual and text-motion matching information to tackle uncertain occlusion…

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Gait Recognition and Analysis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Linear Warmup With Linear Decay · Residual Connection · Adam · Dense Connections · Dropout