A Unified Method for First and Third Person Action Recognition
Ali Javidani, Ahmad Mahmoudi-Aznaveh

TL;DR
This paper introduces a unified video classification approach effective for both first and third person videos, combining appearance and motion analysis through dual independent streams and novel feature extraction techniques.
Contribution
The paper presents a novel dual-stream framework that efficiently captures appearance and motion features using pre-trained networks and a new pooling operator, applicable to both first and third person videos.
Findings
Achieves state-of-the-art results on multiple datasets.
Effectively captures long-term motion dynamics.
Demonstrates versatility across different video perspectives.
Abstract
This paper proposes a new video classification methodology that can be applied to both first- and third-person videos. The main idea is to capture complementary appearance and motion information efficiently by running two independent streams on the videos. The first stream aims to capture long-term motions from shorter ones by tracking how elements in optical flow images change over time. Optical flow images are described by networks pre-trained on large-scale image datasets. A set of multi-channel time series is obtained by aligning these descriptions side by side. To extract motion features from these time series, the PoT representation method is used together with a novel pooling operator, owing to several advantages. The second stream extracts appearance features, which are vital in the case…
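To make the motion stream more concrete, the following is a minimal sketch of how per-frame descriptors could be stacked into a multi-channel time series and pooled over time. It assumes that PoT refers to a pooled time series representation built from max pooling, sum pooling, and sums of positive and negative temporal gradients over a temporal pyramid. The function names, the toy descriptors, and the pyramid depth are hypothetical, and the paper's novel pooling operator is not described in this excerpt, so it is not reproduced here.

```python
import numpy as np


def pool_time_series(series: np.ndarray) -> np.ndarray:
    """Pool a (T, D) multi-channel time series into a fixed-length vector.

    Each of the D columns is one channel, i.e. one descriptor element
    tracked across T frames. The four operators below are common pooled
    time series choices, used here as an assumption.
    """
    diffs = np.diff(series, axis=0)                   # frame-to-frame change, shape (T-1, D)
    pos_grad = np.clip(diffs, 0, None).sum(axis=0)    # total increase per channel
    neg_grad = np.clip(-diffs, 0, None).sum(axis=0)   # total decrease per channel
    return np.concatenate(
        [series.max(axis=0), series.sum(axis=0), pos_grad, neg_grad]
    )


def temporal_pyramid_features(series: np.ndarray, levels: int = 2) -> np.ndarray:
    """Apply the pooling to 1, 2, 4, ... temporal segments and concatenate."""
    if series.shape[0] < 2 ** (levels - 1):
        raise ValueError("too few frames for the requested pyramid depth")
    feats = []
    for level in range(levels):
        for segment in np.array_split(series, 2 ** level, axis=0):
            feats.append(pool_time_series(segment))
    return np.concatenate(feats)


# Toy stand-in for network descriptors of 3 optical flow images, 2 channels each.
descriptors = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 3.0]])
features = pool_time_series(descriptors)
pyramid = temporal_pyramid_features(descriptors, levels=2)
```

In a real pipeline the rows of `descriptors` would come from a pre-trained network applied to each optical flow image. The gradient terms are what let the pooled vector reflect how each descriptor element changes over time instead of only its overall magnitude.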
