JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling
Seok Hwan Lee, Taein Son, Soo Won Seo, Jisong Kim, Jun Won Choi

TL;DR
JARViS is a two-stage video action detection (VAD) framework that models actor and scene context jointly with Transformer attention, and it reports improved performance over existing VAD methods.
Contribution
The paper presents JARViS, a new two-stage VAD framework that effectively models cross-modal actor-scene relations using Transformer attention, achieving state-of-the-art results.
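To make the idea of unified actor-scene relation modeling concrete, here is a minimal sketch of actor tokens attending over a joint set of actor and scene tokens with scaled dot-product attention. It is an illustration under stated assumptions, not the authors' implementation: the function name `unified_attention`, the single attention head, the absence of learned projections, and the random toy features are all hypothetical simplifications.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unified_attention(actor_feats, scene_feats):
    """Let each actor token attend over the joint actor + scene token set.

    Hypothetical simplification: one head and no learned projections, so the
    tokens themselves serve as queries, keys, and values.
    """
    context = actor_feats + scene_feats  # unified actor-scene token set
    dim = len(actor_feats[0])
    updated, weights = [], []
    for query in actor_feats:
        # Scaled dot-product scores against every context token.
        scores = [dot(query, key) / math.sqrt(dim) for key in context]
        attn = softmax(scores)
        # Attention-weighted sum of the context tokens for this actor.
        mixed = [sum(w * tok[i] for w, tok in zip(attn, context))
                 for i in range(dim)]
        updated.append(mixed)
        weights.append(attn)
    return updated, weights


# Toy inputs (assumed shapes): 2 actor tokens and 5 scene tokens, 8 dims each.
random.seed(0)
actors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(2)]
scene = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]

updated, weights = unified_attention(actors, scene)
print(len(updated), len(updated[0]), len(weights[0]))
```

Each actor's attention weights span both the other actors and the scene tokens, which is the sense in which the context is "unified" here. A full model would add learned query, key, and value projections, multiple heads, and positional encodings.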
Findings
Outperforms existing VAD methods on the AVA, UCF101-24, and JHMDB51-21 datasets.
Achieves significant performance improvements with Transformer-based context modeling.
Demonstrates the effectiveness of unified actor-scene relation modeling in video action detection.
Abstract
Video action detection (VAD) is a formidable vision task that involves the localization and classification of actions within the spatial and temporal dimensions of a video clip. Among the myriad VAD architectures, two-stage VAD methods utilize a pre-trained person detector to extract the region of interest features, subsequently employing these features for action detection. However, the performance of two-stage VAD methods has been limited as they depend solely on localized actor features to infer action semantics. In this study, we propose a new two-stage VAD framework called Joint Actor-scene context Relation modeling based on Visual Semantics (JARViS), which effectively consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention. JARViS employs a person detector to produce densely sampled actor features from a…
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications
Methods: Sparse Evolutionary Training · Linear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings
