JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts

Taein Son; Soo Won Seo; Jisong Kim; Seok Hwan Lee; Jun Won Choi

arXiv:2412.13708·cs.CV·February 4, 2025

TL;DR

JoVALE is a multi-modal video action detection (VAD) system that integrates audio, visual, and scene-descriptive language context in a transformer-based architecture. It reports state-of-the-art results on the AVA, UCF101-24, and JHMDB51-21 benchmarks.

Contribution

This work introduces what the authors describe as the first video action detection (VAD) method to combine audio, visual, and scene-descriptive language features, using an actor-centric transformer model that aggregates these cues for each actor.

Findings

01 Achieves new state-of-the-art performance on the AVA, UCF101-24, and JHMDB51-21 benchmarks.

02 Demonstrates that multi-modal integration significantly improves action detection accuracy.

03 Validates the effectiveness of scene-descriptive context in enhancing VAD performance.

Abstract

Video Action Detection (VAD) entails localizing and categorizing action instances within videos, which inherently consist of diverse information sources such as audio, visual cues, and surrounding scene contexts. Leveraging this multi-modal information effectively for VAD poses a significant challenge, as the model must identify action-relevant cues with precision. In this study, we introduce a novel multi-modal VAD architecture, referred to as the Joint Actor-centric Visual, Audio, Language Encoder (JoVALE). JoVALE is the first VAD method to integrate audio and visual features with scene descriptive context sourced from large-capacity image captioning models. At the heart of JoVALE is the actor-centric aggregation of audio, visual, and scene descriptive information, enabling adaptive integration of crucial features for recognizing each actor's actions. We have developed a…
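The actor-centric aggregation described in the abstract can be illustrated with a toy example. The sketch below is not the authors' architecture: it assumes a single attention head with no learned projections, and the function name `actor_centric_aggregate`, the modality labels, and the feature vectors are all hypothetical. It only shows the general idea of one actor feature acting as a query that weights audio, visual, and scene-text tokens by relevance.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def actor_centric_aggregate(actor_query, context_tokens):
    """Fuse context tokens for one actor with scaled dot-product attention.

    actor_query: feature vector for a single actor (hypothetical input).
    context_tokens: list of (modality_name, feature_vector) pairs.
    Returns the fused vector and the attention mass assigned to each modality.
    """
    dim = len(actor_query)
    scores = [dot(actor_query, vec) / math.sqrt(dim) for _, vec in context_tokens]
    weights = softmax(scores)

    # Weighted sum of the context vectors, one output value per feature dimension.
    fused = [
        sum(w * vec[i] for w, (_, vec) in zip(weights, context_tokens))
        for i in range(dim)
    ]

    # Total attention weight per modality, to see which source the actor relied on.
    mass = {}
    for w, (modality, _) in zip(weights, context_tokens):
        mass[modality] = mass.get(modality, 0.0) + w
    return fused, mass


# Toy inputs (made up for illustration): the visual token aligns with the actor query.
actor = [1.0, 0.0, 0.0, 0.0]
tokens = [
    ("visual", [2.0, 0.0, 0.0, 0.0]),
    ("audio", [0.0, 1.0, 0.0, 0.0]),
    ("scene_text", [0.0, 0.0, 1.0, 0.0]),
]
fused, mass = actor_centric_aggregate(actor, tokens)
print(mass)
```

In this toy setup the visual token receives the largest share of attention because it is the only one aligned with the actor query. A real model would use learned query, key, and value projections and multiple heads.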

Code & Models

Repositories

taeiin/aaai2025-jovale · Official

Videos

JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts · Underline

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition

Methods: Attention Is All You Need · Linear Layer · Dropout · Multi-Head Attention · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Softmax