Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images

Cristina Mahanta; Gagan Bhatia

arXiv:2506.13458·cs.CV·June 17, 2025

Open Access

TL;DR

This paper shows that fine-tuning CLIP, a contrastively pre-trained vision-language model, raises human activity recognition accuracy in still images from 41% (CNNs trained from scratch) to 76%.

Contribution

The study shows that a fine-tuned pre-trained vision-language model (CLIP) recognises activities in static images far more accurately than CNNs trained from scratch, on a 285-image, four-class subset of MSCOCO.

Findings

01

CNNs trained from scratch scored 41% accuracy on four-class activity recognition (walking, running, sitting, standing)

02

Fine-tuning CLIP raised accuracy to 76%

03

Contrastive vision-language pre-training is effective for still-image action recognition

Abstract

Recognising human activity in a single photo enables indexing, safety and assistive applications, yet a still image lacks motion cues. On 285 MSCOCO images labelled as walking, running, sitting, and standing, CNNs trained from scratch scored 41% accuracy. Fine-tuning multimodal CLIP raised this to 76%, demonstrating that contrastive vision-language pre-training decisively improves still-image action recognition in real-world deployments.
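
To make the idea of contrastive vision-language classification concrete, here is a minimal sketch of how a CLIP-style model scores one image against the four activity labels: cosine similarity between an image embedding and one text embedding per class, followed by a temperature-scaled softmax. This is not the authors' implementation. The embeddings are made-up toy vectors standing in for encoder outputs, and the function names and the temperature value are illustrative assumptions.

```python
import math

# The four activity labels named in the abstract.
CLASSES = ["walking", "running", "sitting", "standing"]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_emb, text_embs, temperature=0.01):
    """Return (predicted label, class probabilities).

    `temperature` is an illustrative value, not one taken from the paper.
    """
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs


# Toy stand-ins for encoder outputs (assumption): one text embedding per
# class prompt, and one image embedding that lies closest to "sitting".
text_embs = [
    [1.0, 0.0, 0.0, 0.0],  # walking
    [0.0, 1.0, 0.0, 0.0],  # running
    [0.0, 0.0, 1.0, 0.0],  # sitting
    [0.0, 0.0, 0.0, 1.0],  # standing
]
image_emb = [0.1, 0.2, 0.9, 0.1]

label, probs = classify(image_emb, text_embs)
print(label, [round(p, 3) for p in probs])
```

In a real pipeline the vectors would come from the model's image and text encoders, and fine-tuning would adjust those encoders (or a head on top of them) using the labelled images.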

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems