JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling

Seok Hwan Lee; Taein Son; Soo Won Seo; Jisong Kim; Jun Won Choi

arXiv:2408.03612 · cs.CV · September 18, 2024

TL;DR

JARViS introduces a novel two-stage video action detection framework that leverages unified actor-scene context modeling with Transformer attention, significantly improving performance over existing methods.

Contribution

The paper presents JARViS, a new two-stage video action detection (VAD) framework that models cross-modal actor-scene relations using Transformer attention and achieves state-of-the-art results on AVA, UCF101-24, and JHMDB51-21.

Findings

01. Outperforms existing VAD methods on the AVA, UCF101-24, and JHMDB51-21 datasets.

02. Achieves significant performance improvements with Transformer-based context modeling.

03. Demonstrates the effectiveness of unified actor-scene relation modeling in video action detection.

Abstract

Video action detection (VAD) is a formidable vision task that involves the localization and classification of actions within the spatial and temporal dimensions of a video clip. Among the myriad VAD architectures, two-stage VAD methods utilize a pre-trained person detector to extract the region of interest features, subsequently employing these features for action detection. However, the performance of two-stage VAD methods has been limited as they depend solely on localized actor features to infer action semantics. In this study, we propose a new two-stage VAD framework called Joint Actor-scene context Relation modeling based on Visual Semantics (JARViS), which effectively consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention. JARViS employs a person detector to produce densely sampled actor features from a…
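The abstract credits Transformer attention with relating actor features to scene context, but the excerpt ends before the architecture is described. As a rough illustration only, the sketch below runs generic single-head scaled dot-product attention over a concatenated set of actor and scene tokens. It is not the JARViS architecture: the function name `unified_attention`, the token counts, the feature width, and the random projection matrices are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def unified_attention(actor_tokens, scene_tokens, w_q, w_k, w_v):
    """Single-head self-attention over actor and scene tokens jointly.

    Hypothetical helper for illustration; not taken from the paper.
    actor_tokens: (n_actors, d), scene_tokens: (n_scene, d).
    """
    # Treat both modalities as one token set so every actor can
    # attend to every scene token and to every other actor.
    tokens = np.concatenate([actor_tokens, scene_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    out = weights @ v
    n_actors = actor_tokens.shape[0]
    # Return only the rows belonging to actors: their context-enriched
    # features and the attention they pay to all tokens.
    return out[:n_actors], weights[:n_actors]

# Assumed sizes: 3 actors, 8 scene tokens, 16-dimensional features.
rng = np.random.default_rng(0)
d = 16
actor_tokens = rng.normal(size=(3, d))
scene_tokens = rng.normal(size=(8, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

actor_out, attn = unified_attention(actor_tokens, scene_tokens, w_q, w_k, w_v)
print(actor_out.shape, attn.shape)
```

Each row of `attn` is a distribution over all actor and scene tokens, which is one simple way to let an actor's representation draw on context beyond its own region of interest.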

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications

Methods: Sparse Evolutionary Training · Linear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings