Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

Takehiko Ohkawa; Takuma Yagi; Taichi Nishimura; Ryosuke Furuta; Atsushi Hashimoto; Yoshitaka Ushiku; Yoichi Sato

arXiv:2311.16444 · cs.CV · December 10, 2024 · 2 cites

TL;DR

This paper introduces a new benchmark and dataset for dense video captioning of egocentric activities, leveraging transfer learning from web instructional videos and proposing a view-invariant adversarial training method to handle complex view changes.

Contribution

It presents a novel egocentric dataset (EgoYC2), a cross-view transfer learning framework, and a view-invariant adversarial training approach for dense video captioning.

Findings

01

Transfer learning improves captioning accuracy on egocentric videos.

02

View-invariant training effectively handles dynamic view changes.

03

Benchmark establishes a new task domain for egocentric dense video captioning.

Abstract

We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are restricted due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is demanded as a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes. The web videos contain shots showing either full-body or hand regions, while the egocentric view is constantly shifting. This necessitates the in-depth study of cross-view transfer under complex view changes. To this end, we first create a…
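To make the task definition in the abstract concrete (predicting time segments and their captions), here is a minimal sketch of what a dense video captioning output could look like, together with temporal IoU, a common way to compare a predicted segment against a reference one. The `CaptionedSegment` class, the `temporal_iou` helper, and the example captions are illustrative assumptions; they are not taken from the paper, its dataset, or its evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionedSegment:
    """Hypothetical container for one predicted or reference event."""
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    caption: str  # natural-language description of the step


def temporal_iou(a: CaptionedSegment, b: CaptionedSegment) -> float:
    """Intersection over union of two time intervals (0.0 if they do not overlap)."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


# Made-up example: a model's output for one video is a list of captioned segments.
predicted = [
    CaptionedSegment(0.0, 10.0, "crack the eggs into a bowl"),
    CaptionedSegment(10.0, 25.0, "whisk the eggs"),
]
reference = [
    CaptionedSegment(5.0, 15.0, "crack eggs into a bowl"),
    CaptionedSegment(40.0, 50.0, "pour the mixture into a pan"),
]

# Score every predicted segment against every reference segment.
overlaps = [[temporal_iou(p, r) for r in reference] for p in predicted]
```

In this sketch a prediction would typically count as localized when its overlap with some reference segment exceeds a chosen threshold; caption quality would be judged separately with a text-similarity metric.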

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization