TL;DR
This paper introduces a novel multi-modal Transformer-based approach for egocentric human trajectory forecasting using a new dataset, improving accuracy in predicting future paths in crowded environments.
Contribution
The paper presents a new egocentric trajectory forecasting dataset and a multi-modal Transformer model that fuses its input modalities through a cascaded cross-attention mechanism.
Findings
The proposed model is reported to outperform state-of-the-art methods in trajectory forecasting accuracy.
Multi-modal fusion improves prediction by incorporating scene semantics and depth.
The dataset enables better understanding of egocentric navigation in crowded spaces.
Abstract
In this paper, we address the problem of forecasting the trajectory of an egocentric camera wearer (ego-person) in crowded spaces. The trajectory forecasting ability learned from the data of different camera wearers walking around in the real world can be transferred to assist visually impaired people in navigation, as well as to instill human navigation behaviours in mobile robots, enabling better human-robot interactions. To this end, a novel egocentric human trajectory forecasting dataset was constructed, containing real trajectories of people navigating in crowded spaces wearing a camera, as well as extracted rich contextual data. We extract and utilize three different modalities to forecast the trajectory of the camera wearer, i.e., his/her past trajectory, the past trajectories of nearby people, and the environment such as the scene semantics or the depth of the scene. A…
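To make the idea of cascaded cross-attention over the three modalities concrete, here is a minimal NumPy sketch. It is not the authors' architecture: the abstract is cut off before the model is described, so the cascade order (ego trajectory attends to nearby people first, then to the scene), the single attention head, the residual connections, the random weights, and every name (`cross_attention`, `cascaded_fusion`, `params`) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention with a residual connection.

    queries: (n_q, d) tokens that ask the questions (here: the ego trajectory).
    context: (n_c, d) tokens that are attended to (another modality).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_q, n_c)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return queries + weights @ v, weights


def cascaded_fusion(ego, neighbours, scene, params):
    """Hypothetical cascade: the output of one stage is the query of the next."""
    # Stage 1 (assumed order): ego tokens attend to nearby people's trajectories.
    fused, w_social = cross_attention(ego, neighbours, *params["social"])
    # Stage 2: the socially-aware tokens attend to scene tokens
    # (e.g. embedded semantics or depth features).
    fused, w_scene = cross_attention(fused, scene, *params["scene"])
    return fused, w_social, w_scene


rng = np.random.default_rng(0)
d = 16  # embedding size, chosen arbitrarily for the sketch


def init_weights():
    """Random query/key/value projections standing in for learned weights."""
    return [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]


params = {"social": init_weights(), "scene": init_weights()}

ego = rng.normal(size=(8, d))          # 8 past time steps of the camera wearer
neighbours = rng.normal(size=(40, d))  # e.g. 5 nearby people x 8 time steps
scene = rng.normal(size=(12, d))       # 12 scene tokens (semantics or depth)

fused, w_social, w_scene = cascaded_fusion(ego, neighbours, scene, params)
print(fused.shape)  # (8, 16)
```

The fused tokens keep the shape of the ego-trajectory tokens, so in a full model they could be passed to a decoder that regresses future positions. A parallel design, in which each modality is attended to independently and the results are concatenated, would be an equally plausible reading of "multi-modal fusion"; the cascade is shown here only because the summary names it.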
