EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization
Xiaoqi Wang, Yi Wang, Lap-Pui Chau

TL;DR
EVA02-AT introduces a unified, efficient egocentric video-language model with novel spatial-temporal rotary embeddings and a symmetric optimization framework, achieving state-of-the-art results with fewer parameters.
Contribution
The paper presents a single-stage pretraining method, joint spatial-temporal rotary embeddings, and a symmetric multi-similarity loss for improved egocentric video-language understanding.
Findings
State-of-the-art performance on Ego4D, EPIC-Kitchens-100, and Charades-Ego datasets.
Significant improvements in multi-instance retrieval tasks.
Efficient transfer from image-based CLIP to video models.
Abstract
Egocentric video-language understanding demands both high efficiency and accurate spatial-temporal modeling. Existing approaches face three key challenges: 1) excessive pretraining cost arising from multi-stage pretraining pipelines, 2) ineffective spatial-temporal encoding due to manually split 3D rotary positional embeddings that hinder feature interactions, and 3) imprecise learning objectives in soft-label multi-instance retrieval, which neglect negative pair correlations. In this paper, we introduce EVA02-AT, a suite of EVA02-based video-language foundation models tailored to egocentric video understanding tasks. First, EVA02-AT efficiently transfers an image-based CLIP model into a unified video encoder via single-stage pretraining. Second, instead of applying rotary positional embeddings to isolated dimensions, we introduce spatial-temporal rotary positional embeddings along…
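For readers unfamiliar with rotary positional embeddings, the following is a minimal sketch of the standard one-dimensional formulation only. It is not the joint spatial-temporal variant proposed in the paper, whose details are not given in the excerpt above. The function names `rope_angles` and `apply_rope`, the frequency base of 10000, and the 8-channel toy vectors are illustrative assumptions. The sketch shows the property that makes rotary embeddings attractive for attention: the query-key score depends only on the relative offset between positions.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angles, shape (len(positions), dim // 2). The base value is an assumption."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # one frequency per channel pair
    return np.outer(positions, inv_freq)

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive channel pairs of x (tokens x dim) by position-dependent angles."""
    angles = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Toy query and key vectors (hypothetical data, 8 channels each).
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# Same relative offset (2) at two different absolute positions.
score_a = apply_rope(q, np.array([3])) @ apply_rope(k, np.array([1])).T
score_b = apply_rope(q, np.array([10])) @ apply_rope(k, np.array([8])).T

# The attention score is unchanged because only the offset matters.
same_score = np.allclose(score_a, score_b)
```

Because each channel pair is rotated rather than shifted, the vector norm is preserved and the positional signal enters only through the angle between query and key. How the rotation angles are assigned across the time, height, and width axes of a video is the design question the abstract refers to.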
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Speech and dialogue systems
Methods: Contrastive Language-Image Pre-training
