JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts
Taein Son, Soo Won Seo, Jisong Kim, Seok Hwan Lee, Jun Won Choi

TL;DR
JoVALE is a novel multi-modal video action detection system that integrates audio, visual, and scene language context using a transformer-based architecture, achieving state-of-the-art results on key benchmarks.
Contribution
This work introduces the first video action detection (VAD) method to combine audio, visual, and scene-language features through an actor-centric transformer model.
Findings
Achieves new state-of-the-art performance on AVA, UCF101-24, and JHMDB51-21 benchmarks.
Demonstrates that multi-modal integration significantly improves action detection accuracy.
Validates the effectiveness of scene descriptive context in enhancing VAD performance.
Abstract
Video Action Detection (VAD) entails localizing and categorizing action instances within videos, which inherently consist of diverse information sources such as audio, visual cues, and surrounding scene contexts. Leveraging this multi-modal information effectively for VAD poses a significant challenge, as the model must identify action-relevant cues with precision. In this study, we introduce a novel multi-modal VAD architecture, referred to as the Joint Actor-centric Visual, Audio, Language Encoder (JoVALE). JoVALE is the first VAD method to integrate audio and visual features with scene descriptive context sourced from large-capacity image captioning models. At the heart of JoVALE is the actor-centric aggregation of audio, visual, and scene descriptive information, enabling adaptive integration of crucial features for recognizing each actor's actions. We have developed a…
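To make the idea of actor-centric aggregation concrete, the sketch below shows one plausible reading of it: each actor feature acts as a query that attends over the concatenated audio, visual, and scene-description tokens. This is an illustrative assumption, not the authors' implementation. The function name `actor_centric_aggregate`, the single attention head, the absence of learned projections, and all tensor shapes are hypothetical choices made to keep the example small.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def actor_centric_aggregate(actors, audio, visual, text):
    """Hypothetical single-head cross-attention (no learned weights).

    actors: (num_actors, d) actor query features.
    audio, visual, text: (n_i, d) token arrays, one per modality.
    Returns the updated actor features and the attention weights.
    """
    # Pool all modality tokens into one context sequence.
    context = np.concatenate([audio, visual, text], axis=0)  # (n_total, d)
    d = actors.shape[-1]
    # Scaled dot-product scores between each actor and each context token.
    scores = actors @ context.T / np.sqrt(d)                 # (num_actors, n_total)
    weights = softmax(scores, axis=-1)
    # Residual connection: add the attended context to each actor feature.
    updated = actors + weights @ context
    return updated, weights

# Toy inputs with made-up sizes: 3 actors, 4 audio, 10 visual, 6 text tokens.
rng = np.random.default_rng(0)
d = 16
actors = rng.normal(size=(3, d))
audio = rng.normal(size=(4, d))
visual = rng.normal(size=(10, d))
text = rng.normal(size=(6, d))

updated, weights = actor_centric_aggregate(actors, audio, visual, text)
print(updated.shape, weights.shape)
```

Because the attention weights are computed per actor, each actor can draw on a different mix of audio, visual, and text tokens, which is the adaptive behavior the abstract describes. A real model would add learned query, key, and value projections, multiple heads, and layer normalization.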
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition
Methods: Attention Is All You Need · Linear Layer · Dropout · Multi-Head Attention · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Softmax
