Comparing Learning Paradigms for Egocentric Video Summarization

Daniel Wen

arXiv:2506.21785 · cs.CV · June 30, 2025

Open Access

TL;DR

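This paper compares supervised, unsupervised, and prompt fine-tuning paradigms for egocentric video summarization. A prompt fine-tuned GPT-4o outperforms the specialized models Shotluck Holmes and TAC-SUM, but current state-of-the-art models still perform worse on first-person videos than on third-person videos, which points to the need for further advances in first-person video understanding.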

Contribution

It provides a comparative analysis of supervised, unsupervised, and prompt fine-tuning approaches for egocentric video summarization, demonstrating the potential of prompt-based models in this domain.

Findings

01

Prompt fine-tuned GPT-4o outperforms specialized models.

02

State-of-the-art models perform less effectively on first-person videos than on third-person videos.

03

The evaluation was conducted on a small subset of egocentric videos.

Abstract

In this study, we investigate various computer vision paradigms (supervised learning, unsupervised learning, and prompt fine-tuning) by assessing their ability to understand and interpret egocentric video data. Specifically, we examine Shotluck Holmes (state-of-the-art supervised learning), TAC-SUM (state-of-the-art unsupervised learning), and GPT-4o (a prompt fine-tuned pre-trained model), evaluating their effectiveness in video summarization. Our results demonstrate that current state-of-the-art models perform less effectively on first-person videos than on third-person videos, highlighting the need for further advancements in the egocentric video domain. Notably, a prompt fine-tuned general-purpose GPT-4o model outperforms these specialized models, emphasizing the limitations of existing approaches in adapting to the unique challenges of first-person perspectives. Although our…

Taxonomy

Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization