Simple means Faster: Real-Time Human Motion Forecasting in Monocular First Person Videos on CPU
Junaid Ahmed Ansari, Brojeshwar Bhowmick

TL;DR
This paper introduces a lightweight, real-time RNN-based framework for human motion forecasting in first-person videos that operates efficiently on CPU, surpassing state-of-the-art accuracy and speed with minimal computational resources.
Contribution
The authors propose a simple, low-memory neural network relying solely on bounding boxes, achieving high prediction accuracy and speed on CPU, with effective transferability across datasets.
Findings
Outperforms state-of-the-art methods in accuracy on the CityWalks dataset.
Achieves 78 trajectory predictions per second on CPU.
Model size is approximately 17 MB, enabling deployment on low-power devices.
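To put the two reported figures in perspective, the following is a rough back-of-the-envelope sketch, not something taken from the paper. It converts the throughput into a per-prediction latency and estimates a parameter count from the model size. The 4-bytes-per-parameter figure (uncompressed float32 weights) and the 1 MB = 10^6 bytes convention are assumptions, and the helper names are hypothetical.

```python
# Back-of-the-envelope numbers derived from the reported findings.
PREDICTIONS_PER_SECOND = 78   # reported CPU throughput
MODEL_SIZE_MB = 17            # reported approximate model size
BYTES_PER_PARAM = 4           # ASSUMPTION: float32 weights, no compression


def latency_ms(predictions_per_second: float) -> float:
    """Average time budget per trajectory prediction, in milliseconds."""
    return 1000.0 / predictions_per_second


def approx_param_count(size_mb: float, bytes_per_param: int = BYTES_PER_PARAM) -> int:
    """Rough parameter count, ASSUMING 1 MB = 10**6 bytes and weights only."""
    return int(size_mb * 1_000_000 / bytes_per_param)


print(f"{latency_ms(PREDICTIONS_PER_SECOND):.1f} ms per trajectory")
print(f"~{approx_param_count(MODEL_SIZE_MB) / 1e6:.2f} M parameters")
```

Under these assumptions, each prediction takes roughly 12.8 ms and the model holds on the order of 4 million parameters. The true count depends on how the weights are actually stored.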
Abstract
We present a simple, fast, and light-weight RNN-based framework for forecasting future locations of humans in first person monocular videos. The primary motivation for this work was to design a network which could accurately predict future trajectories at a very high rate on a CPU. Typical applications of such a system would be a social robot or a visual assistance system for all, as both cannot afford to have high compute power to avoid getting heavier, less power efficient, and costlier. In contrast to many previous methods which rely on multiple types of cues such as camera ego-motion or 2D pose of the human, we show that a carefully designed network model which relies solely on bounding boxes can not only perform better but also predict trajectories at a very high rate while being quite small, at approximately 17 MB. Specifically, we demonstrate that having an auto-encoder in…
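As an illustration of what "relies solely on bounding boxes" could look like at the input stage, here is a minimal sketch that turns a track of past boxes into per-frame feature vectors. The feature layout (centre, width, height, plus frame-to-frame differences) is an assumption made for this example and is not confirmed by the abstract. The names `box_to_state` and `track_to_features` are hypothetical.

```python
from typing import List, Tuple

# ASSUMPTION: boxes are given as (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_to_state(box: Box) -> Tuple[float, float, float, float]:
    """Convert a corner-format box to (centre_x, centre_y, width, height)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2, x_max - x_min, y_max - y_min)


def track_to_features(track: List[Box]) -> List[Tuple[float, ...]]:
    """Build one 8-dimensional feature vector per frame after the first:
    the current state followed by its change since the previous frame."""
    states = [box_to_state(b) for b in track]
    features = []
    for prev, cur in zip(states, states[1:]):
        delta = tuple(c - p for c, p in zip(cur, prev))
        features.append(cur + delta)
    return features


# A person moving 2 px to the right between two frames.
past_boxes = [(0, 0, 10, 20), (2, 0, 12, 20)]
print(track_to_features(past_boxes))
```

A sequence of such vectors is the kind of input a recurrent encoder could consume without any ego-motion or pose cues. The encoding the authors actually use may differ.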
