Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

Yifan Li; Xinyu Zhou; Yunhao Ge; Yu Kong

arXiv:2605.20085·cs.CV·May 20, 2026

TL;DR

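This paper introduces Spatially Prompted Visual Trajectory Prediction (SP-VTP), a setting for egocentric manipulation in which initial spatial prompts, such as bounding boxes or points, specify the task and the model predicts future end-effector trajectories. The work is supported by a new dataset, EgoSPT, and a dedicated model, SPOT.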

Contribution

The paper formalizes the SP-VTP problem, introduces the EgoSPT dataset, and proposes SPOT, a model that uses spatial prompts to improve trajectory prediction in egocentric scenes.

Findings

01

SPOT outperforms non-prompted baselines in cross-scene trajectory prediction.

02

The EgoSPT dataset provides egocentric manipulation trajectories annotated with first-frame spatial prompts.

03

Spatial prompts enhance the generalization of trajectory prediction models.

Abstract

Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This novel setting utilizes initial spatial prompts (like bounding boxes or points) to define task objectives, tasking the model with forecasting future end-effector trajectories from egocentric streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is…
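
To make the problem setting above concrete, the following is a minimal sketch of how one spatially prompted example might be represented. It assumes a sample pairs a first-frame object prompt and target prompt (each a point or a bounding box) with a sequence of future 3D end-effector positions. The names `SPVTPSample` and `prompt_center`, and the pixel and coordinate conventions, are hypothetical and are not taken from the paper or from the EgoSPT dataset format.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Assumed conventions (not from the paper): prompts are in first-frame pixels.
Point2D = Tuple[float, float]               # (x, y)
Box2D = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
Prompt = Union[Point2D, Box2D]
Point3D = Tuple[float, float, float]        # (x, y, z) end-effector position


@dataclass
class SPVTPSample:
    """Hypothetical container for one spatially prompted example."""
    object_prompt: Prompt       # what to move, marked on the first frame
    target_prompt: Prompt       # where to place it, marked on the first frame
    trajectory: List[Point3D]   # future 3D end-effector positions, in order


def prompt_center(prompt: Prompt) -> Point2D:
    """Reduce a point or box prompt to a single 2D location."""
    if len(prompt) == 2:
        # A point prompt is already a location.
        return (float(prompt[0]), float(prompt[1]))
    if len(prompt) == 4:
        # A box prompt is reduced to its geometric center.
        x_min, y_min, x_max, y_max = prompt
        return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)
    raise ValueError("prompt must have 2 (point) or 4 (box) values")


# Illustrative values only: a box on the object, a point on the target.
sample = SPVTPSample(
    object_prompt=(10.0, 20.0, 30.0, 60.0),
    target_prompt=(100.0, 50.0),
    trajectory=[(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)],
)
object_xy = prompt_center(sample.object_prompt)
target_xy = prompt_center(sample.target_prompt)
```

Reducing both prompt types to a single location is only one possible design choice; a model could equally consume the full box extents, and the paper's own encoding may differ.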
