Compact CNN for Indexing Egocentric Videos
Yair Poleg, Ariel Ephrat, Shmuel Peleg, Chetan Arora

TL;DR
This paper introduces a compact 3D CNN architecture that recognizes long-term activities in egocentric videos from sparse optical flow, reaching 89% classification accuracy and enabling temporal indexing of long, unstructured footage.
Contribution
A compact 3D CNN that classifies the camera wearer's activity from a sparse optical flow volume and outperforms existing methods built on hand-tuned features.
Findings
Achieves 89% activity classification accuracy, outperforming the previous state of the art by 19%.
Classifies twice as many categories as the current state of the art on an extended egocentric video dataset.
Recognizes whether a video is egocentric with 99.2% accuracy.
Abstract
While egocentric video is becoming increasingly popular, browsing it is very difficult. In this paper we present a compact 3D Convolutional Neural Network (CNN) architecture for long-term activity recognition in egocentric videos. Recognizing long-term activities enables us to temporally segment (index) long and unstructured egocentric videos. Existing methods for this task are based on hand-tuned features derived from visible objects, location of hands, as well as optical flow. Given a sparse optical flow volume as input, our CNN classifies the camera wearer's activity. We obtain classification accuracy of 89%, which outperforms the current state of the art by 19%. Additional evaluation is performed on an extended egocentric video dataset, classifying twice as many categories as the current state of the art. Furthermore, our CNN is able to recognize whether a video is egocentric…
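The indexing step mentioned in the abstract (turning activity predictions into a temporal segmentation) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a classifier has already produced one activity label per fixed-length, non-overlapping window, and the function name `index_video`, the 4-second window length, and the label names are all hypothetical.

```python
from itertools import groupby


def index_video(window_labels, window_seconds=4.0):
    """Merge consecutive windows that share a label into (start, end, label) segments.

    window_labels: one predicted activity label per window, in temporal order.
    window_seconds: assumed window length; hypothetical, not taken from the paper.
    """
    segments = []
    position = 0  # index of the first window in the current run
    for label, run in groupby(window_labels):
        length = len(list(run))
        start = position * window_seconds
        end = (position + length) * window_seconds
        segments.append((start, end, label))
        position += length
    return segments


# Hypothetical per-window predictions for a short clip.
labels = ["walking", "walking", "standing", "standing", "standing", "driving"]
for start, end, label in index_video(labels):
    print(f"{start:6.1f}s - {end:6.1f}s  {label}")
```

Each resulting segment is an index entry: a time range paired with the activity predicted for it. A real pipeline would likely smooth the per-window labels before merging, since a single misclassified window would otherwise split a segment.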
