TL;DR
VISTA is an egocentric video anticipation model that combines object detection with temporal context to predict future human-object interactions. It placed first in the EgoVis 2026 Ego4D STA Challenge.
Contribution
VISTA introduces a V-JEPA-based approach to short-term object interaction anticipation in egocentric videos, pairing object-centric detection with temporal context modeling.
Findings
VISTA achieved first place in the EgoVis 2026 Ego4D STA Challenge.
The model fuses object proposals from the last observed frame with clip-level temporal features to anticipate the next interaction.
Ensembling predictions improved robustness and overall performance.
Abstract
We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal…
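The abstract names two fusion points, feature modulation and ROI-level context fusion, without giving their exact form in the text available here. The NumPy sketch below illustrates one plausible reading under stated assumptions: modulation is taken to be a FiLM-style per-channel scale and shift, and ROI-level fusion is taken to be concatenation of the clip vector onto each proposal feature. The function names, weight matrices, and tensor sizes are hypothetical and are not taken from the paper.

```python
import numpy as np


def film_modulate(feature_map, clip_embedding, w_gamma, w_beta):
    """Scale and shift each channel of a (C, H, W) map using a clip-level vector.

    Assumption: FiLM-style modulation; the paper may use a different form.
    """
    gamma = clip_embedding @ w_gamma  # per-channel scale offsets, shape (C,)
    beta = clip_embedding @ w_beta    # per-channel shifts, shape (C,)
    return feature_map * (1.0 + gamma)[:, None, None] + beta[:, None, None]


def fuse_roi_context(roi_features, clip_embedding):
    """Append the same clip-level vector to every per-proposal feature vector.

    Assumption: fusion by concatenation; the paper may use a different operator.
    """
    num_rois = roi_features.shape[0]
    context = np.broadcast_to(clip_embedding, (num_rois, clip_embedding.shape[0]))
    return np.concatenate([roi_features, context], axis=1)


# Hypothetical sizes: channels, height, width, clip dim, proposals, ROI feature dim.
C, H, W, D, N, R = 8, 4, 4, 16, 5, 32
rng = np.random.default_rng(0)

feature_map = rng.standard_normal((C, H, W))   # stand-in for a detector feature level
clip_embedding = rng.standard_normal(D)        # stand-in for the frozen temporal branch output
roi_features = rng.standard_normal((N, R))     # stand-in for pooled proposal features

# Zero-initialised projections make the modulation start as an identity mapping.
w_gamma = np.zeros((D, C))
w_beta = np.zeros((D, C))

modulated = film_modulate(feature_map, clip_embedding, w_gamma, w_beta)
fused = fuse_roi_context(roi_features, clip_embedding)
```

Zero-initialising the two projections is a common choice when attaching a new conditioning signal to a pretrained detector, since the detector's features pass through unchanged until training moves the weights. In a full model, each fused proposal vector would then feed heads for the noun, verb, box, and time-to-contact outputs listed in the task description.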
