Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification
Rifen Lin, Alex Jinpeng Wang, Jiawei Mo, Min Li

TL;DR
This paper introduces a novel skeleton-driven pretraining framework for video-based person re-identification that leverages skeleton data to improve motion understanding and achieves state-of-the-art results.
Contribution
It presents the first skeleton-based pretraining paradigm for ReID, including a contrastive learning approach and a motion-aware temporal modeling module.
Findings
Achieves state-of-the-art results on standard benchmarks
Demonstrates strong generalization to skeleton-only ReID tasks
Outperforms previous methods significantly
Abstract
Multimodal pretraining has revolutionized visual understanding, but its impact on video-based person re-identification (ReID) remains underexplored. Existing approaches often rely on video-text pairs, yet suffer from two fundamental limitations: (1) lack of genuine multimodal pretraining, and (2) text poorly captures fine-grained temporal motion, an essential cue for distinguishing identities in video. In this work, we take a bold departure from text-based paradigms by introducing the first skeleton-driven pretraining framework for ReID. To achieve this, we propose Contrastive Skeleton-Image Pretraining for ReID (CSIP-ReID), a novel two-stage method that leverages skeleton sequences as a spatiotemporally informative modality aligned with video frames. In the first stage, we employ contrastive learning to align skeleton and visual features at the sequence level. In the second stage, we…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Gait Recognition and Analysis · Human Pose and Action Recognition
