TL;DR
This paper introduces a multimodal approach combining audio and visual data for egocentric action recognition in kitchens, demonstrating improved accuracy over unimodal methods through late fusion and sparse sampling.
Contribution
It presents a novel multimodal model that integrates audio and visual streams with a sparse sampling strategy for egocentric action recognition.
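The sparse sampling idea can be illustrated with a short sketch. The summary does not give the paper's exact sampling procedure, so the code below assumes a common segment-based scheme: split the clip into equal-length segments and draw one frame index from each. The function name `sparse_sample_indices` and its parameters are hypothetical, chosen only for illustration.

```python
import random


def sparse_sample_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of `num_segments` equal-length segments.

    Assumption: segment-based sparse sampling; not confirmed as the
    paper's exact procedure.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        # Integer segment boundaries [start, end) covering the whole clip.
        start = (k * num_frames) // num_segments
        end = ((k + 1) * num_frames) // num_segments
        if end <= start:
            # Clip shorter than the number of segments: reuse a valid index.
            indices.append(min(start, num_frames - 1))
        else:
            indices.append(rng.randrange(start, end))
    return indices


# Example: a 90-frame clip summarized by 3 frames, one per 30-frame segment.
frames = sparse_sample_indices(90, 3, random.Random(0))
```

Sampling one frame per segment keeps the cost per clip fixed regardless of clip length while still covering the whole action, which is the usual motivation for this kind of strategy.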
Findings
Achieved a 5.18% improvement over the state of the art in verb classification on the EPIC-Kitchens dataset.
Multimodal integration outperforms unimodal approaches.
Late fusion of audio and visual data enhances recognition performance.
Abstract
Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed by focusing on a single modality. In particular, only a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
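As a rough illustration of late fusion across the audio, spatial, and temporal streams, the sketch below turns each stream's class scores into probabilities and takes a weighted average. The abstract does not specify the fusion rule, so the weighted averaging, the function name `late_fusion`, and the example scores are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(stream_logits, weights=None):
    """Fuse per-stream class scores by a weighted average of probabilities.

    stream_logits: dict mapping a stream name to its list of class scores.
    weights: optional dict of per-stream weights (hypothetical parameter);
             defaults to equal weighting.
    """
    names = list(stream_logits)
    if not names:
        raise ValueError("at least one stream is required")
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    num_classes = len(stream_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_logits[name])
        if len(probs) != num_classes:
            raise ValueError("all streams must score the same classes")
        for i, p in enumerate(probs):
            fused[i] += weights[name] * p / total_weight
    return fused


# Made-up scores for three verb classes; each stream is predicted separately
# and only the final class probabilities are combined.
scores = {
    "audio": [2.0, 0.5, 0.1],
    "spatial": [0.2, 1.5, 0.3],
    "temporal": [1.8, 0.4, 0.2],
}
fused = late_fusion(scores)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because fusion happens on the outputs, each stream can be trained and evaluated on its own, which makes it straightforward to compare the fused model against each unimodal baseline.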
