Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition
Xiao Wang, Qian Zhu, Jiandong Jin, Jun Zhu, Futian Wang, Bo Jiang, Yaowei Wang, Yonghong Tian

TL;DR
This paper introduces a novel spatio-temporal side-tuning method for pre-trained foundation models to improve video-based pedestrian attribute recognition by leveraging temporal information and multi-modal fusion.
Contribution
It proposes a parameter-efficient spatio-temporal side-tuning strategy for CLIP, integrating attribute semantics and video features for enhanced pedestrian attribute recognition.
Findings
Significant performance improvements on large-scale datasets.
Effective utilization of temporal information in video frames.
Parameter-efficient fine-tuning of pre-trained models.
Abstract
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images; however, their performance is unreliable in challenging scenarios such as heavy occlusion and motion blur. In this work, we propose to understand human attributes from video frames, which lets us make full use of temporal information, by efficiently fine-tuning a pre-trained multi-modal foundation model. Specifically, we formulate video-based PAR as a vision-language fusion problem and adopt the pre-trained foundation model CLIP to extract the visual features. More importantly, we propose a novel spatio-temporal side-tuning strategy to achieve parameter-efficient optimization of the pre-trained vision foundation model. To better utilize the semantic information, we take the full attribute list that needs to be recognized as another input and transform the attribute words/phrases into the…
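The side-tuning idea in the abstract can be illustrated with a toy forward pass. This is a minimal sketch under loud assumptions, not the authors' architecture: a fixed random matrix stands in for the frozen CLIP visual encoder, the side branch is a generic down-project/up-project bottleneck, plain mean pooling over frames is swapped in for the paper's temporal modeling, and the attribute-text branch is omitted. All names and dimensions (`frozen_backbone`, `side_down`, `side_up`, `attr_head`, `FRAME_DIM`, and so on) are hypothetical.

```python
import math
import random

random.seed(0)

def linear(rows, cols, scale=0.1):
    """Hypothetical helper: a rows x cols matrix of small random weights."""
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(w, x):
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy sizes, chosen only for illustration.
FRAME_DIM, FEAT_DIM, SIDE_DIM, NUM_ATTRS = 64, 32, 4, 5

# Stand-in for the frozen pre-trained visual encoder (would never be updated).
frozen_backbone = linear(FEAT_DIM, FRAME_DIM)
# Lightweight side branch (would be trained): down-project, ReLU, up-project.
side_down = linear(SIDE_DIM, FRAME_DIM)
side_up = linear(FEAT_DIM, SIDE_DIM)
# Multi-label attribute head (would be trained): one score per attribute.
attr_head = linear(NUM_ATTRS, FEAT_DIM)

def encode_frame(frame):
    """Spatial step: frozen feature plus the side branch's additive correction."""
    base = matvec(frozen_backbone, frame)
    hidden = [max(0.0, h) for h in matvec(side_down, frame)]
    side = matvec(side_up, hidden)
    return [b + s for b, s in zip(base, side)]

def predict_attributes(frames):
    """Temporal step (assumed mean pooling), then independent sigmoid scores."""
    feats = [encode_frame(f) for f in frames]
    pooled = [sum(col) / len(feats) for col in zip(*feats)]
    return [sigmoid(z) for z in matvec(attr_head, pooled)]

def count(w):
    """Number of scalar weights in a matrix."""
    return sum(len(row) for row in w)

# A fake 6-frame clip; each frame is a flat feature vector.
video = [[random.random() for _ in range(FRAME_DIM)] for _ in range(6)]
probs = predict_attributes(video)

frozen = count(frozen_backbone)
trainable = count(side_down) + count(side_up) + count(attr_head)
print(f"attribute probabilities: {[round(p, 3) for p in probs]}")
print(f"frozen weights: {frozen}, trainable weights: {trainable}")
```

Only the side branch and the head would receive gradients during training, which is where the parameter savings come from. With a real backbone the frozen count would dwarf the trainable count far more than in this toy.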
Taxonomy
Topics: Video Surveillance and Tracking Methods · Automated Road and Building Extraction · Human Pose and Action Recognition
Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Dense Connections · Contrastive Language-Image Pre-training
