Advancing Vision-based Human Action Recognition: Exploring Vision-Language CLIP Model for Generalisation in Domain-Independent Tasks

Utkarsh Shandilya; Marsha Mariya Kappan; Sanyam Jain; Vijeta Sharma

arXiv:2507.18675 · cs.CV · August 1, 2025

TL;DR

This paper evaluates the CLIP vision-language model for human action recognition across diverse domains, analyzing its limitations under masking strategies, and proposes a noise-based enhancement to improve its robustness and generalization in healthcare applications.

Contribution

The paper systematically assesses CLIP's performance on action recognition, identifies its limitations under masking, and introduces a novel noise-injection method to enhance its generalization capabilities.

Findings

01. CLIP shows inconsistent behavior with frequent misclassifications under masking.

02. Incorporating class-specific noise improves accuracy and reduces bias.

03. The approach offers potential for better generalization in healthcare scenarios.

Abstract

Human action recognition plays a critical role in healthcare and medicine, supporting applications such as patient behavior monitoring, fall detection, surgical robot supervision, and procedural skill assessment. While traditional models like CNNs and RNNs have achieved moderate success, they often struggle to generalize across diverse and complex actions. Recent advancements in vision-language models, especially the transformer-based CLIP model, offer promising capabilities for generalizing action recognition from video data. In this work, we evaluate CLIP on the UCF-101 dataset and systematically analyze its performance under three masking strategies: (1) percentage-based and shape-based black masking at 10%, 30%, and 50%, (2) feature-specific masking to suppress bias-inducing elements, and (3) isolation masking that retains only class-specific regions. Our results reveal that CLIP…
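
The percentage-based black masking named in the abstract (10%, 30%, and 50%) can be illustrated with a short sketch. This is not the authors' code: the function name `black_mask`, the centered rectangular placement, and the synthetic white frame are all assumptions made for illustration, since the abstract does not say where or in what shape the masks are applied.

```python
import numpy as np


def black_mask(frame: np.ndarray, fraction: float) -> np.ndarray:
    """Return a copy of `frame` with a centered rectangle set to black.

    The rectangle keeps the frame's aspect ratio and covers roughly
    `fraction` of the pixels. Centered placement is an assumption of
    this sketch, not something stated in the paper's abstract.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    h, w = frame.shape[:2]
    # Scaling each side by sqrt(fraction) scales the area by `fraction`.
    scale = fraction ** 0.5
    mh, mw = round(h * scale), round(w * scale)
    top, left = (h - mh) // 2, (w - mw) // 2
    masked = frame.copy()  # leave the input frame untouched
    masked[top:top + mh, left:left + mw] = 0
    return masked


# Hypothetical stand-in for a video frame: a 100x100 white RGB image.
frame = np.full((100, 100, 3), 255, dtype=np.uint8)
for fraction in (0.10, 0.30, 0.50):
    masked = black_mask(frame, fraction)
    # Fraction of pixels that are black in every channel.
    covered = (masked == 0).all(axis=-1).mean()
    print(f"requested {fraction:.0%}, covered {covered:.1%}")
```

Because the mask dimensions are rounded to whole pixels, the covered area only approximates the requested percentage. In an evaluation like the one described, each masked frame would then be passed to the classifier in place of the original frame.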

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Context-Aware Activity Recognition Systems