DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
Junho Yoon, Jaemo Jung, Hyunju Kim, Dongman Lee

TL;DR
DETACH introduces a decomposed spatio-temporal alignment framework for exocentric video and ambient sensors, addressing local detail preservation and semantic grounding, leading to improved human action recognition without wearable sensors.
Contribution
The paper proposes a novel two-stage, decomposed alignment method with online clustering and contrastive loss, overcoming limitations of global sequence encoding in exocentric-ambient settings.
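To make the contrastive part of this contribution concrete, here is a minimal sketch of a symmetric InfoNCE-style loss between paired video and sensor embeddings. It is an illustration under stated assumptions, not the paper's actual objective: the summary only says a contrastive loss is used, so the specific loss form, the function names (`l2_normalize`, `log_softmax`, `symmetric_info_nce`), the temperature value, and the embedding shapes are all hypothetical. The online clustering and the two-stage schedule are not modeled here.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(video_emb, sensor_emb, temperature=0.07):
    # Assumption: row i of video_emb and row i of sensor_emb come from the
    # same time window (a positive pair); all other rows act as negatives.
    v = l2_normalize(video_emb)
    s = l2_normalize(sensor_emb)
    logits = v @ s.T / temperature            # shape (batch, batch)
    idx = np.arange(len(v))
    # Video-to-sensor direction: each row should pick its own column.
    loss_v2s = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Sensor-to-video direction: each column should pick its own row.
    loss_s2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2s + loss_s2v)

# Toy data (hypothetical): 8 paired windows with 16-dimensional embeddings.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
aligned_loss = symmetric_info_nce(video, video + 0.01 * rng.normal(size=(8, 16)))
shuffled_loss = symmetric_info_nce(video, rng.normal(size=(8, 16)))
```

On this toy data the loss for nearly identical pairs (`aligned_loss`) comes out lower than the loss for unrelated pairs (`shuffled_loss`), which is the behavior a contrastive alignment objective is meant to reward.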
Findings
Significant accuracy improvements on Opportunity++ and HWU-USP datasets.
Effective preservation of local motion details and semantic context.
Robust alignment across diverse human actions.
Abstract
Aligning egocentric video with wearable sensors has shown promise for human action recognition, but faces practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions sharing similar temporal patterns with different spatio-semantic contexts. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features…
Taxonomy
Topics: Human Pose and Action Recognition · Context-Aware Activity Recognition Systems · Emotion and Mood Recognition
