VLM-PAR: A Vision Language Model for Pedestrian Attribute Recognition

Abdellah Zakaria Sellam; Salah Eddine Bekhouche; Fadi Dornaika; Cosimo Distante; Abdenour Hadid

arXiv:2512.22217 · cs.CV · December 30, 2025

TL;DR

VLM-PAR is a vision-language model for pedestrian attribute recognition that aligns image and prompt embeddings to counter class imbalance and domain shift. It reports state-of-the-art accuracy on PA100K and gains in mean accuracy on PETA and Market-1501.

Contribution

The paper introduces VLM-PAR, a modular vision-language framework that combines frozen SigLIP 2 multilingual encoders with a compact cross-attention fusion to improve pedestrian attribute recognition.

Findings

01. Achieves new state-of-the-art accuracy on the PA100K benchmark.

02. Delivers significant improvements in mean accuracy on PETA and Market-1501.

03. Effectively handles class imbalance and domain shifts in PAR.
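The listing does not define "mean accuracy". In the PAR literature the term usually refers to the label-based metric mA, which averages the positive and negative recall of each attribute and then averages over attributes. The sketch below assumes that convention applies here; the function name and the toy labels are illustrative and do not come from the paper.

```python
def mean_accuracy(y_true, y_pred):
    """Label-based mean accuracy (mA), assuming the usual PAR convention.

    y_true, y_pred: one list per image, each holding a 0/1 label per attribute.
    """
    num_attrs = len(y_true[0])
    per_attr = []
    for j in range(num_attrs):
        pos = sum(1 for t in y_true if t[j] == 1)
        neg = len(y_true) - pos
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 0)
        tpr = tp / pos if pos else 0.0  # recall on positive samples
        tnr = tn / neg if neg else 0.0  # recall on negative samples
        per_attr.append((tpr + tnr) / 2)
    return sum(per_attr) / num_attrs


# Toy data (hypothetical): two rare attributes, predictor always outputs 0.
y_true = [[1, 0], [0, 0], [0, 1], [0, 0]]
y_pred = [[0, 0], [0, 0], [0, 0], [0, 0]]
print(mean_accuracy(y_true, y_pred))  # 0.5
```

On this toy data plain per-label accuracy would be 6/8 = 0.75, while mA is 0.5, which is why the metric is preferred when attributes are heavily imbalanced.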

Abstract

Pedestrian Attribute Recognition (PAR) involves predicting fine-grained attributes such as clothing color, gender, and accessories from pedestrian imagery, yet is hindered by severe class imbalance, intricate attribute co-dependencies, and domain shifts. We introduce VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders. By first aligning image and prompt embeddings and refining visual features through a compact cross-attention fusion, VLM-PAR achieves a significant accuracy improvement on the highly imbalanced PA100K benchmark, setting a new state of the art, while also delivering significant gains in mean accuracy across the PETA and Market-1501 benchmarks. These results underscore the efficacy of integrating large-scale vision-language pretraining with targeted cross-modal refinement to overcome imbalance and generalization challenges in…
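The abstract names a compact cross-attention fusion over frozen encoder outputs but gives no architectural detail. The following is a minimal single-head sketch of one plausible reading, in which each attribute prompt embedding queries the image patch embeddings to pool an attribute-specific visual feature. Everything here is an assumption made for illustration: random arrays stand in for the frozen SigLIP 2 outputs, and the dimensions, weight matrices, choice of queries, and linear scoring head are hypothetical rather than the authors' design.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fusion(prompt_emb, patch_emb, w_q, w_k, w_v):
    """Single-head cross-attention: prompts are queries, patches are keys/values."""
    q = prompt_emb @ w_q  # (A, D)
    k = patch_emb @ w_k   # (P, D)
    v = patch_emb @ w_v   # (P, D)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (A, P), rows sum to 1
    return attn @ v, attn  # per-attribute visual features, attention map


rng = np.random.default_rng(0)
A, P, D = 5, 16, 32  # hypothetical: attributes, image patches, embedding size

# Stand-ins for frozen encoder outputs (random, NOT real SigLIP 2 features).
prompt_emb = rng.standard_normal((A, D))
patch_emb = rng.standard_normal((P, D))

# Hypothetical trainable parameters of the small fusion module.
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
w_out = rng.standard_normal(D) / np.sqrt(D)

refined, attn = cross_attention_fusion(prompt_emb, patch_emb, w_q, w_k, w_v)
logits = refined @ w_out              # one score per attribute, shape (A,)
probs = 1.0 / (1.0 + np.exp(-logits))  # independent sigmoid per attribute
print(refined.shape, probs.shape)
```

Only the fusion weights would be trained in a setup like this, since the encoders stay frozen; the per-attribute sigmoid reflects that PAR is a multi-label problem.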

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods