Seeing and Hearing Egocentric Actions: How Much Can We Learn?

Alejandro Cartas, Jordi Luque, Petia Radeva, Carlos Segura, and Mariella Dimiccoli

arXiv:1910.06693 · cs.CV · October 16, 2019

TL;DR

This paper introduces a multimodal approach combining audio and visual data for egocentric action recognition in kitchens, demonstrating improved accuracy over unimodal methods through late fusion and sparse sampling.

Contribution

It presents a novel multimodal model that integrates audio and visual streams with a sparse sampling strategy for egocentric action recognition.

Findings

01. Achieved a 5.18% improvement over the state of the art in verb classification on the EPIC-Kitchens dataset.

02. Multimodal integration outperforms unimodal approaches.

03. Late fusion of the audio, spatial, and temporal streams enhances recognition performance.

Abstract

Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed by focusing on a single modality. In particular, only a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
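
The two ideas named in the abstract, sparse temporal sampling and late fusion, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the sampling rule or the fusion rule, so the code assumes one frame drawn at random from each of a few equal-length segments and a weighted average of per-stream softmax probabilities. The function names, the segment count, and the example scores are all hypothetical.

```python
import math
import random


def sparse_sample_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of num_segments equal-length segments.

    Assumption: one random snippet per segment; the paper may sample differently.
    """
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random(0)
    # Segment boundaries: segment i covers [bounds[i], bounds[i + 1]).
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(stream_scores, weights=None):
    """Fuse streams by a weighted average of their class probabilities.

    Assumption: score averaging is only one possible late-fusion rule.
    """
    names = list(stream_scores)
    weights = weights or {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    num_classes = len(stream_scores[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_scores[name])
        for c in range(num_classes):
            fused[c] += weights[name] * probs[c] / total_weight
    return fused


# Hypothetical raw scores for three verb classes from each stream.
indices = sparse_sample_indices(num_frames=90, num_segments=3)
scores = {
    "audio": [2.0, 0.5, 0.1],
    "spatial": [0.2, 1.5, 0.3],
    "temporal": [1.8, 0.4, 0.2],
}
fused = late_fusion(scores)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this toy input the spatial stream alone would favor the second class, while the fused prediction follows the audio and temporal streams. That is the kind of disagreement late fusion is meant to resolve, since each stream is scored independently and combined only at the decision level.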

Code & Models

Repositories

gorayni/seeing_and_hearing (PyTorch, official)