VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

Qiaohui Chu; Haoyu Zhang; Yisen Feng; Meng Liu; Weili Guan; Dongmei Jiang; Liqiang Nie

arXiv:2605.20901·cs.CV·May 21, 2026

TL;DR

VISTA is an egocentric video anticipation model that combines object detection with temporal context to predict the next human-object interaction. It placed first in the EgoVis 2026 Ego4D STA Challenge.

Contribution

VISTA integrates a frozen V-JEPA 2.1 temporal branch into a StillFast-style detector for short-term object interaction anticipation in egocentric video, combining object detection with temporal context modeling.

Findings

01. VISTA achieved first place in the EgoVis 2026 Ego4D STA Challenge.

02. The model effectively combines object proposals with temporal features for accurate anticipation.

03. Ensembling predictions improved robustness and overall performance.
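
The summary does not say which ensembling scheme was used. As an illustration only, the sketch below pools predictions from several models and keeps the highest-scoring one among overlapping boxes of the same noun (greedy IoU suppression), which is one common way to merge detection-style outputs. The `Prediction` record, the `ensemble` function, the 0.5 IoU threshold, and the toy values are all hypothetical and are not taken from the VISTA code.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One anticipated interaction (hypothetical record, fields follow the STA task)."""
    box: tuple      # (x1, y1, x2, y2) in pixels
    noun: str
    verb: str
    ttc: float      # time-to-contact, assumed to be in seconds
    score: float    # confidence


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ensemble(model_outputs, iou_thr=0.5):
    """Pool predictions from all models, then drop lower-scoring duplicates.

    A candidate counts as a duplicate when an already kept prediction has the
    same noun and overlaps it with IoU >= iou_thr (an assumed rule).
    """
    pooled = sorted(
        (p for preds in model_outputs for p in preds),
        key=lambda p: p.score,
        reverse=True,
    )
    kept = []
    for cand in pooled:
        if not any(k.noun == cand.noun and iou(k.box, cand.box) >= iou_thr for k in kept):
            kept.append(cand)
    return kept


# Toy outputs from two hypothetical models for the same frame.
model_a = [Prediction((10, 10, 50, 50), "cup", "take", 0.8, 0.9)]
model_b = [
    Prediction((12, 12, 52, 52), "cup", "take", 0.7, 0.6),
    Prediction((100, 100, 140, 140), "knife", "cut", 1.2, 0.5),
]
merged = ensemble([model_a, model_b])
for p in merged:
    print(p.noun, p.verb, p.score)
```

In this toy case the two "cup" boxes overlap heavily, so only the higher-scoring one survives, while the "knife" prediction from the second model is kept because nothing overlaps it.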

Abstract

We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal…
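
The abstract names two injection points for the temporal representation, feature modulation and ROI-level context fusion, without giving their exact form here. The sketch below is a minimal pure-Python illustration under stated assumptions: modulation is taken to be a FiLM-style per-channel scale and shift, and ROI-level fusion is taken to be concatenation of the clip vector onto each proposal feature. The function names, weight matrices, and toy values are hypothetical and do not come from the VISTA repository.

```python
def linear(vec, weights):
    """Multiply a length-D vector by a D x C weight matrix given as a list of rows."""
    n_out = len(weights[0])
    return [sum(v * row[c] for v, row in zip(vec, weights)) for c in range(n_out)]


def modulate(feature_map, clip_vec, w_gamma, w_beta):
    """FiLM-style modulation (assumed form): scale and shift each channel.

    feature_map is a C x H x W nested list standing in for one detector
    feature level; clip_vec stands in for the frozen temporal branch output.
    The scale is 1 + projection, so zero weights leave the map unchanged.
    """
    gamma = [1.0 + g for g in linear(clip_vec, w_gamma)]
    beta = linear(clip_vec, w_beta)
    return [
        [[gamma[c] * x + beta[c] for x in row] for row in channel]
        for c, channel in enumerate(feature_map)
    ]


def fuse_roi_context(roi_features, clip_vec):
    """ROI-level fusion (assumed form): append the clip vector to every proposal feature."""
    return [list(roi) + list(clip_vec) for roi in roi_features]


# Toy inputs: 2 channels of size 1 x 2, a 2-d clip vector, and 2 proposals.
feature_map = [[[1.0, 2.0]], [[3.0, 4.0]]]
clip_vec = [1.0, -1.0]
w_gamma = [[0.5, 0.0], [0.0, 0.5]]
w_beta = [[0.0, 1.0], [0.0, 0.0]]
roi_features = [[0.1, 0.2], [0.3, 0.4]]

modulated = modulate(feature_map, clip_vec, w_gamma, w_beta)
fused = fuse_roi_context(roi_features, clip_vec)
print(modulated)
print(fused)
```

In a real detector the modulated map would feed proposal pooling, and the fused proposal features would feed the prediction heads for noun, verb, time-to-contact, and confidence. That wiring is left out because the summary does not describe it.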

Code & Models

Repositories

CorrineQiu/VISTA (GitHub)
