Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation

Tz-Ying Wu; Kyle Min; Subarna Tripathi; Nuno Vasconcelos

arXiv:2407.19520 · cs.CV · February 28, 2025

Open Access

TL;DR

Ego-VPA introduces a parameter-efficient method for adapting egocentric video models to new tasks, using basis prompts to synthesize prompts and enable effective cross-modal transfer with minimal learnable parameters.

Contribution

The paper proposes Ego-VPA, a novel lightweight adaptation technique leveraging basis prompts for egocentric video understanding, reducing the need for extensive fine-tuning.

Findings

01

Ego-VPA achieves comparable performance to full fine-tuning.

02

Only 0.84% of the parameters are learnable during adaptation.

03

The method largely improves over baseline adaptation approaches on egocentric video tasks.

Abstract

Video understanding typically requires fine-tuning the large backbone when adapting to new domains. In this paper, we leverage the egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose a parameter-efficient adaptation for egocentric video tasks, namely Ego-VPA. It employs a local sparse approximation for each video frame/text feature using the basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, it models context fusion and cross-modal transfer in an efficient fashion. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), largely improving over baselines and reaching the performance of full fine-tuning.
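The basis-prompt idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity scoring, the top-k selection, the softmax weighting, and every name below (`cosine`, `synthesize_prompt`, `basis`, `k`) are assumptions chosen for illustration. The abstract states only that each frame/text feature is sparsely approximated with shared basis prompts and that the selected ones are used to synthesize a prompt.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synthesize_prompt(feature, basis, k=2):
    """Pick the k basis prompts closest to `feature` and blend them.

    Assumption: top-k cosine selection with softmax weights stands in for
    the paper's "local sparse approximation"; the actual rule may differ.
    """
    scores = [cosine(feature, b) for b in basis]
    top = sorted(range(len(basis)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only, so all other bases get weight 0.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    prompt = [
        sum(w * basis[i][d] for w, i in zip(weights, top))
        for d in range(len(feature))
    ]
    return top, prompt


random.seed(0)
dim, num_basis = 8, 6  # hypothetical sizes
# One shared pool of basis prompts; in a real model these would be learnable.
basis = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_basis)]
frame_feature = [random.gauss(0, 1) for _ in range(dim)]
text_feature = [random.gauss(0, 1) for _ in range(dim)]

# Both modalities query the same pool, which is what ties them together.
frame_idx, frame_prompt = synthesize_prompt(frame_feature, basis)
text_idx, text_prompt = synthesize_prompt(text_feature, basis)
```

Because the frame and text features draw from one shared pool, only the pool itself has to be trained, which is consistent with the small learnable-parameter budget the abstract reports. How the synthesized prompts are injected into the backbone is not described here and is left out of the sketch.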

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image Processing Techniques · Human Pose and Action Recognition