Narration Generation for Cartoon Videos
Nikos Papasarantopoulos, Shay B. Cohen

TL;DR
This paper introduces a novel task of generating contextual, storyline-enhancing narrations for videos, specifically focusing on animated series, and presents a new dataset and models for this purpose.
Contribution
The paper formalizes narration generation as two separate subtasks, timing and content generation, and provides a new dataset collected from Peppa Pig for training and evaluation.
Findings
Developed models for narration timing and content generation.
Collected and released a new dataset from Peppa Pig.
Demonstrated the feasibility of context-aware narration generation.
Abstract
Research on text generation from multimodal inputs has largely focused on static images, and less on video data. In this paper, we propose a new task, narration generation: complementing videos with narration texts that are interjected at several points. The narrations are part of the video and contribute to the storyline unfolding in it. Moreover, they are context-informed: they include information appropriate for the timeframe of video they cover, and they do not need to include every detail shown in the input scenes, as a caption would. We collect a new dataset from the animated television series Peppa Pig. Furthermore, we formalize the task of narration generation as including two separate tasks, timing and content generation, and present a set of models for the new task.
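The split into timing and content generation can be illustrated with a minimal sketch. This is not the authors' implementation. The names `Segment`, `Narration`, `generate_narrations`, `should_narrate`, and `write_narration` are hypothetical, and treating timing as a yes/no decision after each dialogue segment is an assumption made only for illustration; the abstract does not specify how either subtask is modeled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One stretch of video, reduced here to its timestamps and dialogue (assumed fields)."""
    start: float
    end: float
    dialogue: str


@dataclass
class Narration:
    """A narration interjected after the segment at index `after_segment`."""
    after_segment: int
    text: str


def generate_narrations(
    segments: List[Segment],
    should_narrate: Callable[[List[Segment]], bool],   # timing subtask (hypothetical interface)
    write_narration: Callable[[List[Segment]], str],   # content subtask (hypothetical interface)
) -> List[Narration]:
    """Walk through the video and interject narrations where the timing model says so."""
    narrations: List[Narration] = []
    for i in range(len(segments)):
        # Context is everything seen so far, including the current segment.
        context = segments[: i + 1]
        if should_narrate(context):
            narrations.append(Narration(after_segment=i, text=write_narration(context)))
    return narrations
```

In this sketch the two callables stand in for whatever trained models handle each subtask. Keeping them separate mirrors the formalization in the abstract: one component decides where a narration belongs, the other decides what it says.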
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Artificial Intelligence in Games
