Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos
Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura, Ryosuke Furuta, Atsushi Hashimoto, Yoshitaka Ushiku, Yoichi Sato

TL;DR
This paper introduces a new benchmark and dataset for dense video captioning of egocentric activities, leveraging transfer learning from web instructional videos and proposing a view-invariant adversarial training method to handle complex view changes.
Contribution
It presents a novel egocentric dataset (EgoYC2), a cross-view transfer learning framework, and a view-invariant adversarial training approach for dense video captioning.
Findings
Transfer learning improves captioning accuracy on egocentric videos.
View-invariant training effectively handles dynamic view changes.
Benchmark establishes a new task domain for egocentric dense video captioning.
Abstract
We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are restricted due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is demanded as a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes. The web videos contain shots showing either full-body or hand regions, while the egocentric view is constantly shifting. This necessitates the in-depth study of cross-view transfer under complex view changes. To this end, we first create a…
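To make the task definition concrete, the sketch below shows one plausible way to represent dense video captioning output (time segments paired with captions) and to compare a predicted segment against a reference one using temporal intersection-over-union. This is an illustration only: the class name CaptionedSegment, the function temporal_iou, and the example captions and timestamps are hypothetical and are not taken from the paper or from the EgoYC2 dataset.

```python
from dataclasses import dataclass


@dataclass
class CaptionedSegment:
    """One dense-captioning prediction: a time span plus its caption.

    Hypothetical structure for illustration; not the paper's data format.
    """
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    caption: str


def temporal_iou(a: CaptionedSegment, b: CaptionedSegment) -> float:
    """Temporal intersection-over-union of two segments, in [0, 1]."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


# Made-up predictions for a single cooking video.
predicted = [
    CaptionedSegment(0.0, 10.0, "chop the onion"),
    CaptionedSegment(12.0, 30.0, "fry the onion in a pan"),
]
# Made-up reference annotation.
reference = CaptionedSegment(5.0, 15.0, "slice the onion")

# Overlap of each predicted segment with the reference segment.
overlaps = [temporal_iou(p, reference) for p in predicted]
```

A model for this task is judged on both parts of each prediction: how well the predicted segments line up with the annotated ones, and how close the captions are to the reference text. The sketch covers only the segment-overlap part.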
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
