Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition

Xiao Wang; Qian Zhu; Jiandong Jin; Jun Zhu; Futian Wang; Bo Jiang; Yaowei Wang; Yonghong Tian

arXiv:2404.17929 · cs.CV · April 30, 2024 · 1 citation

TL;DR

This paper introduces a novel spatio-temporal side-tuning method for pre-trained foundation models to improve video-based pedestrian attribute recognition by leveraging temporal information and multi-modal fusion.

Contribution

It proposes a parameter-efficient spatio-temporal side-tuning strategy for CLIP, integrating attribute semantics and video features for enhanced pedestrian attribute recognition.

Findings

01. Significant performance improvements on large-scale datasets.

02. Effective utilization of temporal information in video frames.

03. Parameter-efficient fine-tuning of pre-trained models.

Abstract

Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images; however, their performance is unreliable in challenging scenarios such as heavy occlusion and motion blur. In this work, we propose to understand human attributes from video frames, which makes it possible to fully use temporal information, by efficiently fine-tuning a pre-trained multi-modal foundation model. Specifically, we formulate video-based PAR as a vision-language fusion problem and adopt the pre-trained foundation model CLIP to extract visual features. More importantly, we propose a novel spatio-temporal side-tuning strategy to achieve parameter-efficient optimization of the pre-trained vision foundation model. To better utilize the semantic information, we take the full list of attributes to be recognized as another input and transform the attribute words/phrases into the…
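As a rough illustration of the side-tuning idea described in the abstract, the PyTorch sketch below freezes a stand-in backbone and trains only a small side network that reads the backbone's intermediate features. It is not the authors' implementation: `FrozenBackbone`, `SideNetwork`, `SideTunedPAR`, all layer sizes, the mean pooling over patches and frames, and the dot-product scoring against random attribute embeddings are assumptions made for this example. In practice the visual backbone and the attribute embeddings would come from CLIP's encoders.

```python
import torch
import torch.nn as nn


class FrozenBackbone(nn.Module):
    """Toy stand-in for a pre-trained visual encoder (hypothetical, not CLIP)."""

    def __init__(self, in_dim=64, dim=128, depth=4):
        super().__init__()
        self.embed = nn.Linear(in_dim, dim)
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
        )
        for p in self.parameters():
            p.requires_grad = False  # backbone weights are never updated

    def forward(self, x):
        feats = []
        x = self.embed(x)
        for blk in self.blocks:
            x = x + blk(x)  # residual block
            feats.append(x)
        return feats  # one feature tensor per block


class SideNetwork(nn.Module):
    """Small trainable branch that consumes the frozen intermediate features."""

    def __init__(self, dim=128, side_dim=16, depth=4):
        super().__init__()
        self.down = nn.ModuleList([nn.Linear(dim, side_dim) for _ in range(depth)])
        self.mix = nn.ModuleList([nn.Linear(side_dim, side_dim) for _ in range(depth)])
        self.up = nn.Linear(side_dim, dim)

    def forward(self, feats):
        h = 0
        for f, down, mix in zip(feats, self.down, self.mix):
            h = h + torch.relu(mix(down(f) + h))  # accumulate a low-dim side state
        return self.up(h)  # project back to the backbone width


class SideTunedPAR(nn.Module):
    def __init__(self, num_attrs=10, dim=128):
        super().__init__()
        self.backbone = FrozenBackbone(dim=dim)
        self.side = SideNetwork(dim=dim)
        # Assumption: random vectors stand in for text embeddings of attribute phrases.
        self.register_buffer("attr_emb", torch.randn(num_attrs, dim))

    def forward(self, video):  # video: (batch, frames, patches, in_dim)
        with torch.no_grad():  # no activations are kept for the frozen backbone
            feats = self.backbone(video)
        tokens = feats[-1] + self.side(feats)  # (batch, frames, patches, dim)
        clip_feat = tokens.mean(dim=(1, 2))  # pool over frames and patches
        return clip_feat @ self.attr_emb.t()  # (batch, num_attrs) logits


torch.manual_seed(0)
model = SideTunedPAR()
video = torch.randn(2, 6, 49, 64)  # 2 clips, 6 frames, 49 patch tokens each
logits = model(video)
probs = torch.sigmoid(logits)  # independent probability per attribute

# One multi-label training step: only the side network receives gradients.
targets = torch.randint(0, 2, logits.shape).float()
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
loss.backward()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} of {total}")
```

Because the backbone runs under `torch.no_grad()`, gradients never flow through it, so both the number of updated parameters and the memory needed for backpropagation are limited to the side branch in this sketch.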

Taxonomy

Topics: Video Surveillance and Tracking Methods · Automated Road and Building Extraction · Human Pose and Action Recognition

Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Dense Connections · Contrastive Language-Image Pre-training