Feature Hallucination for Self-supervised Action Recognition
Lei Wang, Piotr Koniusz

TL;DR
This paper introduces a multimodal self-supervised framework for action recognition that combines feature hallucination, two novel descriptors, and uncertainty modeling to improve accuracy without extra computational cost at test time.
Contribution
The paper proposes a deep translational framework that pairs two domain-specific descriptors (ODF and SDF) with uncertainty-aware hallucination of auxiliary features to improve action recognition.
Findings
Achieves state-of-the-art results on Kinetics-400, Kinetics-600, and Something-Something V2 datasets.
Effectively integrates multimodal features and auxiliary cues for improved recognition.
Demonstrates robustness to feature noise through uncertainty modeling.
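The summary does not say how the uncertainty modeling is formulated. One common way to make a feature-regression loss tolerant of noisy targets is to scale each squared error by a learned variance; the sketch below illustrates that general idea only. The function name `uncertainty_weighted_loss` and the log-variance parameterization are assumptions for illustration, not the authors' method.

```python
import math


def uncertainty_weighted_loss(pred, target, log_var):
    """Assumed heteroscedastic-style loss (not taken from the paper).

    Each squared error is scaled by exp(-log_var), so dimensions judged
    noisy (large log_var) contribute less, while the + 0.5 * log_var term
    penalizes claiming high uncertainty everywhere.
    """
    total = 0.0
    for p, t, s in zip(pred, target, log_var):
        total += 0.5 * math.exp(-s) * (p - t) ** 2 + 0.5 * s
    return total / len(pred)


# With zero log-variance the loss reduces to half the mean squared error.
baseline = uncertainty_weighted_loss([0.0, 3.0], [0.0, 0.0], [0.0, 0.0])
# Raising the log-variance on the badly predicted dimension lowers the loss.
tolerant = uncertainty_weighted_loss([0.0, 3.0], [0.0, 0.0], [0.0, 2.0])
```

In a setup like this, the log-variance would typically be predicted by the network alongside the hallucinated feature, so that unreliable target dimensions are down-weighted during training.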
Abstract
Understanding human actions in videos requires more than raw pixel analysis; it relies on high-level semantic reasoning and effective integration of multimodal features. We propose a deep translational action recognition framework that enhances recognition accuracy by jointly predicting action concepts and auxiliary features from RGB video frames. At test time, hallucination streams infer missing cues, enriching feature representations without increasing computational overhead. To focus on action-relevant regions beyond raw pixels, we introduce two novel domain-specific descriptors. Object Detection Features (ODF) aggregate outputs from multiple object detectors to capture contextual cues, while Saliency Detection Features (SDF) highlight spatial and intensity patterns crucial for action recognition. Our framework seamlessly integrates these descriptors with auxiliary modalities such as…
Taxonomy
Topics: Anomaly Detection Techniques and Applications
Methods: Dropout · Dense Connections · Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Transformer · Focus
