How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction
Sejoon Jun, Hai Nguyen-Truong, Luigi Seminara, Lorenzo Torresani

TL;DR
This paper introduces TrajPilot, a model that predicts the future egocentric camera trajectory and uses it as a conditioning signal for action and goal prediction, outperforming VLM and structured-planner baselines across multiple datasets and tasks.
Contribution
The paper shows that conditioning on predicted future camera trajectories significantly improves egocentric action prediction and planning, and that trajectory outperforms language as a conditioning signal.
Findings
TrajPilot outperforms VLM and structured-planner baselines on multiple datasets.
Trajectory-based conditioning improves prediction, especially at longer horizons.
The model maintains performance with RGB-only camera-pose estimation.
Abstract
Predicting how a person's first-person view will evolve (what action will follow, what plan completes a task, whether an in-progress shot will score) is fundamentally under-specified: the same context admits many plausible futures, and a model trained to minimize prediction error is forced to hedge or average across them, getting it wrong either way. Two findings shape our approach. First, the future camera trajectory, the path the head carves through space, lets the model commit to one of those futures: it carries the operator's intent in a form fine enough to determine how an action will unfold, substantially outperforming language as a conditioning signal. Second, this same intent makes the trajectory itself partially predictable from the context at hand, enough that the trajectory need not be observed at test time to recover most of the gain. We instantiate these findings as TrajPilot,…
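The claim that an error-minimizing model is forced to average across futures can be illustrated with a toy example. The sketch below is not the paper's method: the two-outcome setup, the 1-D "trajectory" cue, the noise level, and all variable names are assumptions chosen only to show why a conditioning signal that carries intent removes the averaging problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical, not from the paper): the same context leads to one
# of two equally likely futures, encoded as +1.0 or -1.0.
n = 10_000
mode = rng.integers(0, 2, size=n)          # latent intent, 0 or 1
future = np.where(mode == 1, 1.0, -1.0)    # outcome the model must predict

# Unconditioned predictor: the constant that minimizes mean squared error is
# the mean of the outcomes, which lies between the two modes and matches neither.
uncond_pred = future.mean()
uncond_mse = np.mean((future - uncond_pred) ** 2)

# Conditioned predictor: a noisy 1-D cue correlated with intent stands in for
# the trajectory (an assumption for illustration only).
trajectory = future + rng.normal(0.0, 0.3, size=n)
cond_pred = np.where(trajectory > 0, 1.0, -1.0)
cond_mse = np.mean((future - cond_pred) ** 2)

print(f"unconditioned prediction: {uncond_pred:.3f}")
print(f"unconditioned MSE:        {uncond_mse:.3f}")
print(f"conditioned MSE:          {cond_mse:.3f}")
```

In this toy setting the unconditioned prediction lands near zero, between the two futures, while the conditioned predictor commits to one of them and incurs far lower error.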
