Human Pose Descriptions and Subject-Focused Attention for Improved Zero-Shot Transfer in Human-Centric Classification Tasks

Muhammad Saif Ullah Khan; Muhammad Ferjad Naeem; Federico Tombari; Luc Van Gool; Didier Stricker; Muhammad Zeshan Afzal

arXiv:2403.06904 · cs.CV · October 30, 2024 · 1 citation

Open Access · 1 dataset

TL;DR

This paper combines LLM-generated human pose descriptions with Subject-Focused Attention in CLIP to improve zero-shot human-centric classification, raising average accuracy from 25.04% to 33.65% across five unseen datasets covering three tasks.

Contribution

It presents the MPII Pose Descriptions dataset, which provides natural language pose annotations for 17,367 images spanning 410 activities, and the FocusCLIP framework, which adds Subject-Focused Attention (SFA) to CLIP for better zero-shot classification.

Findings

01 Average accuracy improves by 8.61 percentage points over CLIP (33.65% vs. 25.04%)

02 Gains in activity, age, and emotion recognition

03 Pose descriptions are effective for zero-shot transfer

Abstract

We present a novel LLM-based pipeline for creating contextual descriptions of human body poses in images using only auxiliary attributes. This approach facilitates the creation of the MPII Pose Descriptions dataset, which includes natural language annotations for 17,367 images containing people engaged in 410 distinct activities. We demonstrate the effectiveness of our pose descriptions in enabling zero-shot human-centric classification using CLIP. Moreover, we introduce the FocusCLIP framework, which incorporates Subject-Focused Attention (SFA) in CLIP for improved text-to-image alignment. Our models were pretrained on the MPII Pose Descriptions dataset, and their zero-shot performance was evaluated on five unseen datasets covering three tasks. FocusCLIP outperformed the baseline CLIP model, raising average accuracy by 8.61 percentage points (33.65% compared to CLIP's 25.04%). Notably,…
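To make the zero-shot classification step mentioned in the abstract concrete, here is a minimal sketch of how a CLIP-style model typically assigns a class: each class is written as a text prompt, and the image is given the class whose text embedding has the highest cosine similarity with the image embedding. This is not the authors' code. The function names, the prompts, the temperature value of 0.07, and the hand-made 3-dimensional embeddings are all assumptions chosen for illustration; in practice the embeddings would come from the image and text encoders of CLIP or FocusCLIP.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(image_embeddings, text_embeddings, temperature=0.07):
    """Hypothetical helper: pick, for each image, the most similar class prompt.

    image_embeddings: array of shape (num_images, dim)
    text_embeddings:  array of shape (num_classes, dim), one row per class prompt
    temperature:      assumed softmax temperature (illustrative value)
    """
    img = l2_normalize(np.asarray(image_embeddings, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_embeddings, dtype=np.float64))

    # Cosine similarity between every image and every class prompt.
    logits = img @ txt.T / temperature

    # Numerically stable softmax over the class dimension.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    return probs.argmax(axis=1), probs


# Illustrative prompts; the real prompts and classes depend on the dataset.
class_prompts = [
    "a photo of a person running",
    "a photo of a person sitting",
    "a photo of a person cycling",
]

# Toy stand-ins for encoder outputs (assumption: real embeddings are
# produced by the text and image encoders, not written by hand).
text_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_embeddings = np.array([
    [0.9, 0.1, 0.0],  # closest to the "running" prompt
    [0.1, 0.2, 0.8],  # closest to the "cycling" prompt
])

predictions, probabilities = zero_shot_classify(image_embeddings, text_embeddings)
for idx in predictions:
    print(class_prompts[idx])
```

Because no classifier head is trained on the target classes, the same procedure can be reused on an unseen dataset simply by swapping the list of prompts.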

Code & Models

Datasets

saifkhichi96/mpii-human-pose-captions
dataset · 26 downloads

Taxonomy

Topics: Human-Automation Interaction and Safety

Methods: Contrastive Language-Image Pre-training