EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization

Xiaoqi Wang; Yi Wang; Lap-Pui Chau

arXiv:2506.14356 · cs.CV · June 18, 2025

TL;DR

EVA02-AT is a unified, efficient egocentric video-language model that combines novel spatial-temporal rotary embeddings with a symmetric optimization framework, achieving state-of-the-art results with fewer parameters.

Contribution

The paper presents a single-stage pretraining method, joint spatial-temporal rotary embeddings, and a symmetric multi-similarity loss for improved egocentric video-language understanding.

Findings

01. State-of-the-art performance on the Ego4D, EPIC-Kitchens-100, and Charades-Ego datasets.

02. Significant improvements in multi-instance retrieval tasks.

03. Efficient transfer from image-based CLIP to video models.

Abstract

Egocentric video-language understanding demands both high efficiency and accurate spatial-temporal modeling. Existing approaches face three key challenges: 1) excessive pre-training cost arising from multi-stage pre-training pipelines, 2) ineffective spatial-temporal encoding due to manually split 3D rotary positional embeddings that hinder feature interactions, and 3) imprecise learning objectives in soft-label multi-instance retrieval, which neglect negative pair correlations. In this paper, we introduce EVA02-AT, a suite of EVA02-based video-language foundation models tailored to egocentric video understanding tasks. First, EVA02-AT efficiently transfers an image-based CLIP model into a unified video encoder via single-stage pre-training. Second, instead of applying rotary positional embeddings to isolated dimensions, we introduce spatial-temporal rotary positional embeddings along…
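
The second challenge in the abstract refers to 3D rotary embeddings whose channels are manually split across axes. As a rough illustration of that baseline only (not EVA02-AT's own embedding, which the truncated abstract does not spell out), the sketch below applies a separate 1D rotary embedding to three equal channel groups, one each for time, height, and width. The function names, the equal three-way split, and the frequency base of 10000 are all assumptions made for this example.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    assert len(vec) % 2 == 0
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = pos * base ** (-i / half)  # per-pair frequency (assumed schedule)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def split_rope_3d(vec, t, h, w):
    """Hypothetical 'manually split' 3D RoPE: each channel group sees one axis only."""
    assert len(vec) % 6 == 0
    g = len(vec) // 3
    return (rope_1d(vec[:g], t)
            + rope_1d(vec[g:2 * g], h)
            + rope_1d(vec[2 * g:], w))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors with 12 channels (4 per axis).
q = [0.1 * (i + 1) for i in range(12)]
k = [0.05 * (12 - i) for i in range(12)]

# Both pairs of positions share the same relative offset (2, 1, 4),
# so the two attention scores should agree up to rounding error.
score_a = dot(split_rope_3d(q, 3, 2, 5), split_rope_3d(k, 1, 1, 1))
score_b = dot(split_rope_3d(q, 7, 4, 9), split_rope_3d(k, 5, 3, 5))
```

In this split layout, each channel pair carries the phase of exactly one axis, so no single pair encodes a joint temporal and spatial offset. That is presumably the limitation the abstract describes as hindering feature interactions.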

Code & Models

Repositories

xqwang14/eva02-at (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Speech and dialogue systems

Methods: Contrastive Language-Image Pre-training