Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings
Abderrazek Abid, Thanh-Cong Ho, Fakhri Karray

TL;DR
This paper explores the application of Vision Language Models (VLMs) for human activity recognition in healthcare, introducing a new dataset and evaluation methods, and demonstrating their competitive performance against traditional models.
Contribution
It introduces a descriptive caption dataset and comprehensive evaluation methods for VLMs in healthcare activity recognition, providing a new benchmark for future research.
Findings
VLMs achieve accuracy comparable to that of state-of-the-art deep learning models.
VLMs sometimes outperform traditional models in recognition tasks.
The work establishes a benchmark for VLMs in healthcare applications.
Abstract
As generative AI continues to evolve, Vision Language Models (VLMs) have emerged as promising tools in various healthcare applications. One area that remains relatively underexplored is their use in human activity recognition (HAR) for remote health monitoring. VLMs offer notable strengths, including greater flexibility and the ability to overcome some of the constraints of traditional deep learning models. However, a key challenge in applying VLMs to HAR lies in the difficulty of evaluating their dynamic and often non-deterministic outputs. To address this gap, we introduce a descriptive caption dataset and propose comprehensive methods for evaluating VLMs in HAR. Through comparative experiments with state-of-the-art deep learning models, our findings demonstrate that VLMs achieve comparable performance and, in some cases, even surpass conventional approaches in terms of…
