LS-HAR: Language Supervised Human Action Recognition with Salient Fusion, Construction Sites as a Use-Case
Mohammad Mahdavian, Mohammad Loni, Ted Samuelsson, Mo Chen

TL;DR
LS-HAR introduces a language-supervised approach for human action recognition that fuses skeleton and visual data using attention mechanisms, and provides a new dataset for construction site applications.
Contribution
The paper presents a novel language-guided feature extraction and salient fusion method for HAR, along with a new dataset for real-world construction site scenarios.
Findings
Achieves promising accuracy on multiple datasets
Demonstrates robustness across modalities
Provides a new dataset for construction site HAR
Abstract
Detecting human actions is a crucial task for autonomous robots and vehicles, often requiring the integration of various data modalities for improved accuracy. In this study, we introduce a novel approach to Human Action Recognition (HAR) using language supervision named LS-HAR based on skeleton and visual cues. Our method leverages a language model to guide the feature extraction process in the skeleton encoder. Specifically, we employ learnable prompts for the language model conditioned on the skeleton modality to optimize feature representation. Furthermore, we propose a fusion mechanism that combines dual-modality features using a salient fusion module, incorporating attention and transformer mechanisms to address the modalities' high dimensionality. This fusion process prioritizes informative video frames and body joints, enhancing the recognition accuracy of human actions.…
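The abstract describes the salient fusion module only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes per-frame visual features and per-joint skeleton features of equal dimension, and uses plain scaled dot-product attention to weight frames and joints. The names `attention_pool` and `salient_fusion`, the mean-vector queries, and the toy inputs are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def attention_pool(items, query):
    """Pool feature vectors (frames or joints) by scaled dot-product attention.

    Returns the weighted sum and the attention weights, so the most
    salient items can be inspected.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(item, query) / scale for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[d] for w, item in zip(weights, items))
              for d in range(dim)]
    return pooled, weights


def salient_fusion(frame_feats, joint_feats):
    """Hypothetical fusion: each modality's mean serves as the query
    for the other modality; the pooled vectors are concatenated.
    Assumes both modalities share the same feature dimension."""
    visual, frame_w = attention_pool(frame_feats, mean_vector(joint_feats))
    skeleton, joint_w = attention_pool(joint_feats, mean_vector(frame_feats))
    return visual + skeleton, frame_w, joint_w


# Toy inputs (made up): 3 frames and 2 joints, each a 2-d feature vector.
frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
joints = [[1.0, 1.0], [3.0, 3.0]]
fused, frame_w, joint_w = salient_fusion(frames, joints)
```

Here `frame_w` and `joint_w` each sum to one, and the frame or joint most aligned with the other modality receives the largest weight. A trained model would replace the mean-vector queries with learned projections and transformer layers.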
Taxonomy
Topics: Occupational Health and Safety Research
Methods: Softmax · Attention Is All You Need
