SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video
Hector A. Valdez, Kyle Min, Subarna Tripathi

TL;DR
SViTT-Ego is a sparse transformer for egocentric video-text tasks that reduces pretraining memory usage through edge and node sparsification and improves accuracy without data augmentation beyond standard image augmentations.
Contribution
SViTT-Ego is the first sparse egocentric video-text transformer to combine edge and node sparsification. It is pretrained on EgoClip with the egocentric-specific EgoNCE objective, which improves both accuracy and memory efficiency.
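To make "node sparsification" concrete, here is a minimal sketch of one common way to prune tokens: keep only the patch tokens that receive the most attention from a summary ([CLS]) token. This is an illustration under stated assumptions, not the SViTT-Ego implementation; the function name `prune_tokens`, the `keep_ratio` parameter, and the use of [CLS] attention as the saliency score are all hypothetical choices made for this example.

```python
import numpy as np

def prune_tokens(tokens, cls_attention, keep_ratio=0.5):
    """Keep the patch tokens with the highest [CLS] attention (assumed criterion).

    tokens:        (num_tokens, dim) patch embeddings, excluding [CLS].
    cls_attention: (num_tokens,) attention weight from [CLS] to each patch.
    keep_ratio:    fraction of tokens to retain (hypothetical parameter).
    """
    num_keep = max(1, int(round(keep_ratio * tokens.shape[0])))
    # Take the top-scoring indices, then sort them so the kept tokens
    # stay in their original spatial/temporal order.
    keep = np.sort(np.argsort(cls_attention)[-num_keep:])
    return tokens[keep], keep

# Toy input: 6 tokens of dimension 2 with made-up attention scores.
tokens = np.arange(12, dtype=float).reshape(6, 2)
scores = np.array([0.05, 0.30, 0.10, 0.25, 0.02, 0.28])
kept, idx = prune_tokens(tokens, scores, keep_ratio=0.5)
```

Dropping tokens shrinks the sequence that later layers must attend over, which is where the memory saving comes from. Edge sparsification, by contrast, limits which query-key pairs are computed and is not shown here.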
Findings
Achieves a +2.8% gain in EgoMCQ (intra-video) accuracy over LAVILA large.
Can be pretrained on memory-limited devices, using no data augmentation beyond standard image augmentations.
Uses the egocentric-friendly EgoNCE objective in place of the more common InfoNCE.
Abstract
Pretraining egocentric vision-language models has become essential to improving downstream egocentric video-text tasks. These egocentric foundation models commonly use the transformer architecture, and their memory footprint during pretraining can be substantial. Therefore, we pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification. We pretrain on the EgoClip dataset and incorporate the egocentric-friendly objective EgoNCE instead of the frequently used InfoNCE. Most notably, SViTT-Ego obtains a +2.8% gain on EgoMCQ (intra-video) accuracy compared to LAVILA large, with no additional data augmentation techniques other than standard image augmentations, while remaining pretrainable on memory-limited devices.
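For readers unfamiliar with the InfoNCE baseline that the abstract mentions, the following is a minimal NumPy sketch of a symmetric video-text InfoNCE loss. It assumes a batch in which the i-th video and the i-th text form the only positive pair; the function names and the `temperature` value are illustrative assumptions, not taken from the paper. EgoNCE, the objective SViTT-Ego actually uses, is not reproduced here.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(video_emb, text_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch of paired embeddings.

    video_emb, text_emb: (batch, dim); row i of each is a positive pair.
    temperature: softmax temperature (illustrative value).
    """
    # Cosine similarity via L2-normalised embeddings.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature
    diag = np.arange(len(v))
    # Video-to-text: each row should pick out its own column.
    v2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-video: each column should pick out its own row.
    t2v = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (v2t + t2v)

# Toy check: perfectly aligned pairs versus deliberately mismatched pairs.
video = np.eye(4)
aligned_loss = info_nce(video, np.eye(4))
shuffled_loss = info_nce(video, np.roll(np.eye(4), 1, axis=0))
```

In this toy setup the aligned batch yields a much smaller loss than the mismatched one, which is the behaviour a contrastive objective is meant to reward.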
Taxonomy
Topics: Multimedia Communication and Technology · Video Analysis and Summarization · Narrative Theory and Analysis
Methods: InfoNCE
