Convolutional Long Short-Term Memory Networks for Recognizing First Person Interactions
Swathikiran Sudhakaran, Oswald Lanz

TL;DR
This paper introduces a deep learning model combining convolutional neural networks and convolutional LSTMs to recognize first-person interactions, effectively capturing both short-term and long-term spatio-temporal features.
Contribution
It proposes a novel architecture that preserves spatio-temporal structure and outperforms existing RGB-based methods on first-person interaction datasets.
Findings
Outperforms the state of the art on the UTKinect-FirstPerson dataset
Surpasses previous RGB-only methods by more than 20% in recognition accuracy
Effective in recognizing complex ego-motion interactions
Abstract
In this paper, we present a novel deep-learning-based approach for addressing the problem of interaction recognition from a first-person perspective. The proposed approach uses a pair of convolutional neural networks, whose parameters are shared, for extracting frame-level features from successive frames of the video. The frame-level features are then aggregated using a convolutional long short-term memory. The hidden state of the convolutional long short-term memory, after all the input video frames are processed, is used for classification into the respective categories. The two branches of the convolutional neural network perform feature encoding on a short time interval, whereas the convolutional long short-term memory encodes the changes over a longer temporal duration. In our network, the spatio-temporal structure of the input is preserved until the very final processing stage.…
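The pipeline described in the abstract can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: frames are single-channel grids, the shared-weight CNN branch is replaced by one convolution plus ReLU (`frame_features`), the two branch outputs are combined by a simple difference, and the gates follow the standard ConvLSTM formulation without peephole connections. All function and parameter names are hypothetical.

```python
import math
import random

def conv2d_same(x, k):
    """Zero-padded 'same' 2-D cross-correlation of grid x with an odd-sized kernel k."""
    H, W, r = len(x), len(x[0]), len(k) // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W:
                        out[i][j] += x[ii][jj] * k[di + r][dj + r]
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def frame_features(frame, k_feat):
    """Hypothetical stand-in for one CNN branch: a single convolution followed by ReLU."""
    return [[max(0.0, v) for v in row] for row in conv2d_same(frame, k_feat)]

def convlstm_step(x, h, c, params):
    """One ConvLSTM update. Gates are 2-D maps, so the spatial layout is preserved."""
    H, W = len(x), len(x[0])
    pre = {}
    for name in ("i", "f", "o", "g"):
        w_x, w_h, bias = params[name]
        a_x, a_h = conv2d_same(x, w_x), conv2d_same(h, w_h)
        pre[name] = [[a_x[r][s] + a_h[r][s] + bias for s in range(W)] for r in range(H)]
    new_h = [[0.0] * W for _ in range(H)]
    new_c = [[0.0] * W for _ in range(H)]
    for r in range(H):
        for s in range(W):
            i_gate = sigmoid(pre["i"][r][s])
            f_gate = sigmoid(pre["f"][r][s])
            o_gate = sigmoid(pre["o"][r][s])
            cand = math.tanh(pre["g"][r][s])
            new_c[r][s] = f_gate * c[r][s] + i_gate * cand
            new_h[r][s] = o_gate * math.tanh(new_c[r][s])
    return new_h, new_c

def encode_video(frames, k_feat, params):
    """Return the ConvLSTM hidden state after all successive frame pairs are processed."""
    H, W = len(frames[0]), len(frames[0][0])
    h = [[0.0] * W for _ in range(H)]
    c = [[0.0] * W for _ in range(H)]
    for prev, cur in zip(frames, frames[1:]):
        # The same kernel serves both branches, mirroring the shared parameters.
        f_prev = frame_features(prev, k_feat)
        f_cur = frame_features(cur, k_feat)
        # Assumption: branch outputs are combined by a difference (short-term change).
        x = [[f_cur[r][s] - f_prev[r][s] for s in range(W)] for r in range(H)]
        h, c = convlstm_step(x, h, c, params)
    return h

def rand_kernel(rng, size=3):
    return [[rng.uniform(-0.5, 0.5) for _ in range(size)] for _ in range(size)]

rng = random.Random(0)
k_feat = rand_kernel(rng)
params = {name: (rand_kernel(rng), rand_kernel(rng), 0.0) for name in "ifog"}
frames = [[[rng.random() for _ in range(6)] for _ in range(5)] for _ in range(4)]
h_final = encode_video(frames, k_feat, params)  # a 5x6 grid, same layout as the input
```

In this sketch `h_final` keeps the spatial layout of the input frames, which is the property the abstract emphasizes. A classifier would then operate on this final hidden state; how it is pooled and classified is not specified in the text above, so that step is left out.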
