VLM-PAR: A Vision Language Model for Pedestrian Attribute Recognition
Abdellah Zakaria Sellam, Salah Eddine Bekhouche, Fadi Dornaika, Cosimo Distante, Abdenour Hadid

TL;DR
VLM-PAR is a novel vision-language model that improves pedestrian attribute recognition accuracy by aligning image and text embeddings, effectively addressing class imbalance and domain shifts, and achieving state-of-the-art results on multiple benchmarks.
Contribution
The paper introduces VLM-PAR, a modular framework that uses frozen SigLIP 2 multilingual encoders and a compact cross-attention fusion to improve pedestrian attribute recognition performance.
Findings
Achieves new state-of-the-art accuracy on the PA100K benchmark.
Significant improvements in mean accuracy on PETA and Market-1501.
Effectively handles class imbalance and domain shifts in PAR.
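The findings above are reported in terms of mean accuracy. As a minimal sketch, assuming this refers to the label-based mean accuracy (mA) commonly used in PAR, the metric averages the recall on positive and negative samples for each attribute, then averages over attributes. The function name and the toy data are illustrative and not taken from the paper.

```python
def mean_accuracy(y_true, y_pred):
    """Label-based mean accuracy (mA), assumed here to be the metric meant.

    y_true, y_pred: lists of rows, one row per image, one 0/1 entry per attribute.
    For each attribute, average the recall on positives and on negatives,
    then average the result over all attributes.
    """
    num_attrs = len(y_true[0])
    per_attr = []
    for j in range(num_attrs):
        pos = [p[j] for t, p in zip(y_true, y_pred) if t[j] == 1]
        neg = [p[j] for t, p in zip(y_true, y_pred) if t[j] == 0]
        rates = []
        if pos:  # recall on positive samples
            rates.append(sum(pos) / len(pos))
        if neg:  # recall on negative samples
            rates.append(sum(1 - v for v in neg) / len(neg))
        per_attr.append(sum(rates) / len(rates))
    return sum(per_attr) / num_attrs


# Toy case: one rare attribute, positive in 1 of 10 images.
y_true = [[1]] + [[0]] * 9
y_pred = [[0]] * 10  # a classifier that always predicts "absent"

plain_acc = sum(t[0] == p[0] for t, p in zip(y_true, y_pred)) / len(y_true)
ma = mean_accuracy(y_true, y_pred)
```

In this toy case plain accuracy is 0.9 while mA is 0.5, which is why mA is the more informative number on heavily imbalanced attribute sets.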
Abstract
Pedestrian Attribute Recognition (PAR) involves predicting fine-grained attributes such as clothing color, gender, and accessories from pedestrian imagery, yet is hindered by severe class imbalance, intricate attribute co-dependencies, and domain shifts. We introduce VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders. VLM-PAR aligns image and prompt embeddings by refining visual features through a compact cross-attention fusion. It achieves a significant accuracy improvement on the highly imbalanced PA100K benchmark, setting a new state of the art, while also delivering significant gains in mean accuracy on the PETA and Market-1501 benchmarks. These results underscore the efficacy of integrating large-scale vision-language pretraining with targeted cross-modal refinement to overcome imbalance and generalization challenges in…
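The cross-modal refinement described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that attribute prompt embeddings act as queries over image patch embeddings, uses a single attention head with no learned projections, and substitutes random vectors for the frozen encoder outputs. All function names and dimensions are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention, no learned projections."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    width = len(values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        # Each output row is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(width)])
    return out


def attribute_probabilities(prompt_emb, patch_emb):
    """Assumed design: prompts query the patches, then each attribute is
    scored by the similarity between its fused feature and its prompt."""
    fused = cross_attention(prompt_emb, patch_emb, patch_emb)
    return [1.0 / (1.0 + math.exp(-dot(f, p)))
            for f, p in zip(fused, prompt_emb)]


# Random stand-ins for frozen encoder outputs (hypothetical sizes).
random.seed(0)
dim, num_patches, num_attrs = 8, 16, 5
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
prompts = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_attrs)]
probs = attribute_probabilities(prompts, patches)  # one probability per attribute
```

Because the encoders stay frozen in the described framework, only a fusion stage of this kind (plus any classification head) would need training; a real fusion module would add learned query, key, and value projections.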
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods
