TL;DR
JFAA is a JEPA-based action anticipation method for egocentric video. It combines a frozen V-JEPA 2.1 encoder and predictor with a lightweight attentive probe and a field-aware ensemble, and placed first in the EgoVis 2026 EPIC-KITCHENS-100 Action Anticipation Challenge.
Contribution
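It pairs frozen V-JEPA 2.1 features (observed context plus near-future latent tokens) with a lightweight attentive probe that uses separate task queries for verb, noun, and action prediction, and adds a field-aware ensemble over selected epoch-level predictions to improve robustness.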
Findings
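JFAA placed first in the EgoVis 2026 EK-100 Action Anticipation Challenge, with results reported on the official challenge server.
The field-aware ensemble over selected epoch-level predictions is intended to improve robustness by letting each output field (verb, noun, action) draw on its most reliable candidates.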
Abstract
We propose JFAA, a JEPA-based Future Action Anticipation method for the EPIC-KITCHENS-100 (EK-100) Action Anticipation task. Inspired by the representation learning and future prediction ability of V-JEPA 2.1, JFAA uses a frozen encoder and predictor to extract observed context features and near-future latent tokens. A lightweight attentive probe is then trained to predict verb, noun, and action logits with separate task queries. To improve robustness, we further build a field-aware ensemble over selected epoch-level predictions, allowing each output field to benefit from its most reliable candidates. Experimental results on the official challenge server show that JFAA achieves first place in the EgoVis 2026 EK-100 Action Anticipation Challenge. Our code will be released at https://github.com/CorrineQiu/JFAA.
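The field-aware ensemble described in the abstract can be illustrated with a minimal sketch. The abstract does not say how candidate epochs are selected or how their predictions are fused, so everything below is an assumption: the function name `field_aware_ensemble`, the `val_scores` input, the top-k selection rule, and plain logit averaging are all hypothetical and are not taken from the JFAA code.

```python
from typing import Dict, List

Logits = List[List[float]]  # shape: [num_samples][num_classes]


def field_aware_ensemble(
    epoch_logits: Dict[int, Dict[str, Logits]],
    val_scores: Dict[int, Dict[str, float]],
    top_k: int = 2,
) -> Dict[str, Logits]:
    """Fuse per-epoch predictions separately for each output field.

    Assumption (not from the paper): for every field we keep the top_k
    epochs ranked by that field's validation score and average their logits.
    """
    fields = list(next(iter(epoch_logits.values())).keys())
    fused: Dict[str, Logits] = {}
    for field in fields:
        # Rank epochs by how well they did on this specific field.
        ranked = sorted(
            epoch_logits, key=lambda e: val_scores[e][field], reverse=True
        )
        chosen = [epoch_logits[e][field] for e in ranked[:top_k]]
        n_samples, n_classes = len(chosen[0]), len(chosen[0][0])
        # Element-wise mean of the chosen epochs' logits.
        fused[field] = [
            [sum(p[i][c] for p in chosen) / len(chosen) for c in range(n_classes)]
            for i in range(n_samples)
        ]
    return fused


# Toy data: three epochs, one sample, two classes per field.
epoch_logits = {
    1: {"verb": [[1.0, 0.0]], "noun": [[0.0, 2.0]]},
    2: {"verb": [[3.0, 0.0]], "noun": [[0.0, 4.0]]},
    3: {"verb": [[5.0, 2.0]], "noun": [[2.0, 0.0]]},
}
val_scores = {
    1: {"verb": 0.1, "noun": 0.9},
    2: {"verb": 0.5, "noun": 0.8},
    3: {"verb": 0.7, "noun": 0.2},
}
fused = field_aware_ensemble(epoch_logits, val_scores, top_k=2)
print(fused)
```

In this toy run the verb field averages epochs 3 and 2 while the noun field averages epochs 1 and 2, which is the sense in which each field uses its own most reliable candidates. A real pipeline would more likely operate on arrays of logits and might weight candidates rather than average them uniformly.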
