SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video

Hector A. Valdez; Kyle Min; Subarna Tripathi

arXiv:2406.09462·cs.CV·June 17, 2024

TL;DR

SViTT-Ego is a sparse video-text transformer for egocentric video that reduces pretraining memory usage through edge and node sparsification. It gains +2.8% EgoMCQ (intra-video) accuracy over LAVILA large while using only standard image augmentations.

Contribution

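The paper introduces SViTT-Ego, the first sparse egocentric video-text transformer to integrate edge and node sparsification. The model is pretrained on EgoClip with the egocentric-friendly EgoNCE objective in place of the frequently used InfoNCE, and it can be pretrained on memory-limited devices.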

Findings

01 Achieves a +2.8% gain in EgoMCQ (intra-video) accuracy compared to LAVILA large.

02 Pretrainable on memory-limited devices, with no data augmentation beyond standard image augmentations.

03 Pretrains with the egocentric-friendly objective EgoNCE instead of the frequently used InfoNCE.

Abstract

Pretraining egocentric vision-language models has become essential to improving downstream egocentric video-text tasks. These egocentric foundation models commonly use the transformer architecture, and their memory footprint during pretraining can be substantial. Therefore, we pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification. We pretrain on the EgoClip dataset and incorporate the egocentric-friendly objective EgoNCE instead of the frequently used InfoNCE. Most notably, SViTT-Ego obtains a +2.8% gain on EgoMCQ (intra-video) accuracy compared to LAVILA large, with no additional data augmentation techniques other than standard image augmentations, while remaining pretrainable on memory-limited devices.
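For readers unfamiliar with the InfoNCE objective that the abstract says EgoNCE replaces, the following is a minimal sketch of the standard symmetric video-text InfoNCE loss. It is not the authors' code. The function names, the temperature of 0.05, and the toy embeddings are all assumptions made for illustration. EgoNCE itself is not shown, because its positive and negative sampling is not described on this page.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_diagonal_nll(sim):
    """Mean over rows i of -log softmax(sim[i])[i], computed stably."""
    total = 0.0
    for i, row in enumerate(sim):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_sum_exp - row[i]
    return total / len(sim)


def info_nce(video_emb, text_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch of paired video and text embeddings.

    Row i of video_emb is assumed to be paired with row i of text_emb;
    every other row in the batch acts as a negative.
    The temperature value is an illustrative assumption.
    """
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    # Cosine similarity between every video and every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    sim_transposed = [list(col) for col in zip(*sim)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (mean_diagonal_nll(sim) + mean_diagonal_nll(sim_transposed))


# Toy batch: two clips and two captions with hand-picked 2-D embeddings.
video = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(video, text_aligned)
loss_swapped = info_nce(video, text_swapped)
```

In this toy batch the correctly paired captions give a much lower loss than the swapped ones, which is the behavior a contrastive objective is meant to reward.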

Taxonomy

Topics: Multimedia Communication and Technology · Video Analysis and Summarization · Narrative Theory and Analysis

Methods: InfoNCE