Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images
Cristina Mahanta, Gagan Bhatia

TL;DR
This paper shows that fine-tuning CLIP, a contrastively pre-trained vision-language model, raises human activity recognition accuracy on still images from 41% (CNNs trained from scratch) to 76%.
Contribution
Using 285 MSCOCO images labelled as walking, running, sitting, or standing, the study shows that a fine-tuned pre-trained vision-language model (CLIP) recognises activities in static images far more accurately than CNNs trained from scratch.
Findings
CNNs trained from scratch scored 41% accuracy on four-class activity recognition (walking, running, sitting, standing)
Fine-tuning CLIP raised accuracy to 76%
Contrastive vision-language pre-training is effective for still-image action recognition
Abstract
Recognising human activity in a single photo enables indexing, safety and assistive applications, yet a single photo lacks motion cues. Using 285 MSCOCO images labelled as walking, running, sitting, and standing, CNNs trained from scratch scored 41% accuracy. Fine-tuning multimodal CLIP raised this to 76%, demonstrating that contrastive vision-language pre-training decisively improves still-image action recognition in real-world deployments.
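As a rough illustration of how a CLIP-style model turns the four activity labels into a prediction, the sketch below scores one image embedding against one text embedding per label using cosine similarity followed by a temperature-scaled softmax. It is not the authors' fine-tuning procedure, which the abstract does not describe. The embeddings are hypothetical toy vectors standing in for real encoder outputs, and the function names and the temperature value are assumptions chosen for the example.

```python
import math

# The four activity labels named in the abstract.
LABELS = ["walking", "running", "sitting", "standing"]


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def clip_style_probs(image_emb, text_embs, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities (temperature is an assumed value)."""
    img = normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)
    # Subtract the maximum logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(image_emb, text_embs):
    """Return the label whose text embedding is most similar to the image embedding."""
    probs = clip_style_probs(image_emb, text_embs)
    return LABELS[probs.index(max(probs))]


# Hypothetical toy embeddings: one per label, in the same order as LABELS.
text_embs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Hypothetical image embedding that lies closest to the "sitting" direction.
image_emb = [0.1, 0.2, 0.9, 0.1]

probs = clip_style_probs(image_emb, text_embs)
prediction = predict(image_emb, text_embs)
print(prediction)
```

In a real pipeline the toy vectors would be replaced by the outputs of the image and text encoders, with the text side typically fed a short prompt per label.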
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems
