Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification

Rifen Lin; Alex Jinpeng Wang; Jiawei Mo; Min Li

arXiv:2511.13150 · cs.CV · November 18, 2025

TL;DR

This paper introduces CSIP-ReID, a skeleton-driven pretraining framework for video-based person re-identification that aligns skeleton sequences with video frames to capture fine-grained motion, and reports state-of-the-art results.

Contribution

The authors present what they describe as the first skeleton-driven pretraining paradigm for ReID, combining sequence-level contrastive alignment of skeleton and visual features with a motion-aware temporal modeling module.

Findings

01. Achieves state-of-the-art results on standard benchmarks

02. Demonstrates strong generalization to skeleton-only ReID tasks

03. Outperforms previous methods significantly

Abstract

Multimodal pretraining has revolutionized visual understanding, but its impact on video-based person re-identification (ReID) remains underexplored. Existing approaches often rely on video-text pairs, yet suffer from two fundamental limitations: (1) they lack genuine multimodal pretraining, and (2) text poorly captures fine-grained temporal motion, an essential cue for distinguishing identities in video. In this work, we take a bold departure from text-based paradigms by introducing the first skeleton-driven pretraining framework for ReID. To achieve this, we propose Contrastive Skeleton-Image Pretraining for ReID (CSIP-ReID), a novel two-stage method that leverages skeleton sequences as a spatiotemporally informative modality aligned with video frames. In the first stage, we employ contrastive learning to align skeleton and visual features at the sequence level. In the second stage, we…
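To make the first stage more concrete, here is a minimal NumPy sketch of sequence-level contrastive alignment between skeleton and visual embeddings. It is an illustration under stated assumptions, not the paper's implementation: the symmetric InfoNCE form, the mean pooling over frames, the temperature of 0.07, and the function names `pool_sequence` and `sequence_contrastive_loss` are all hypothetical choices borrowed from common CLIP-style practice. It also assumes each batch row is a distinct tracklet whose skeleton and video embeddings form the only positive pair.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def pool_sequence(frame_feats):
    """Mean-pool per-frame features (batch, frames, dim) into (batch, dim).

    Assumption: simple averaging stands in for whatever sequence
    aggregation the paper actually uses.
    """
    return frame_feats.mean(axis=1)


def sequence_contrastive_loss(skeleton_feats, video_feats, temperature=0.07):
    """Symmetric InfoNCE loss over paired sequence-level embeddings.

    skeleton_feats, video_feats: arrays of shape (batch, dim), where row i
    of each array comes from the same tracklet (the positive pair).
    """
    s = l2_normalize(skeleton_feats)
    v = l2_normalize(video_feats)
    logits = s @ v.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(s))                 # positives lie on the diagonal
    loss_s2v = -log_softmax(logits)[idx, idx].mean()    # skeleton -> video
    loss_v2s = -log_softmax(logits.T)[idx, idx].mean()  # video -> skeleton
    return 0.5 * (loss_s2v + loss_v2s)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    skel = pool_sequence(rng.normal(size=(8, 4, 16)))  # 8 tracklets, 4 frames
    vid = pool_sequence(rng.normal(size=(8, 4, 16)))
    print(sequence_contrastive_loss(skel, vid))
```

Minimizing this loss pulls each tracklet's skeleton embedding toward its own video embedding and pushes it away from the other tracklets in the batch. An identity-aware variant would treat all tracklets of the same person as positives, which this sketch does not attempt.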

Taxonomy

Topics: Video Surveillance and Tracking Methods · Gait Recognition and Analysis · Human Pose and Action Recognition