OWL (Observe, Watch, Listen): Audiovisual Temporal Context for Localizing Actions in Egocentric Videos

Merey Ramazanova; Victor Escorcia; Fabian Caba Heilbron; Chen Zhao; Bernard Ghanem

arXiv:2202.04947·cs.CV·October 27, 2022

Open Access

TL;DR

This paper introduces OWL, a method that uses audiovisual temporal context to improve temporal action localization in egocentric videos. It improves mAP over visual-only models by +2.23% on EPIC-Kitchens and +3.35% on HOMAGE.

Contribution

The paper presents an audiovisual approach to egocentric temporal action localization (TAL) that combines audio and visual signals with their temporal context, whereas most current localization methods use third-person videos and only visual information.

Findings

01. Audiovisual context improves localization performance.

02. OWL boosts mAP by over 2% on EPIC-Kitchens.

03. OWL achieves over 3% mAP improvement on HOMAGE.

Abstract

Egocentric videos capture sequences of human activities from a first-person perspective and can provide rich multimodal signals. However, most current localization methods use third-person videos and only incorporate visual information. In this work, we take a deep look into the effectiveness of audiovisual context in detecting actions in egocentric videos and introduce a simple yet effective approach via Observing, Watching, and Listening (OWL). OWL leverages audiovisual information and context for egocentric temporal action localization (TAL). We validate our approach on two large-scale datasets, EPIC-Kitchens and HOMAGE. Extensive experiments demonstrate the relevance of the audiovisual temporal context. Namely, we boost the localization performance (mAP) over visual-only models by +2.23% and +3.35% on these datasets, respectively.

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications