Recognizing and Presenting the Storytelling Video Structure with Deep Multimodal Networks
Lorenzo Baraldi, Costantino Grana, Rita Cucchiara

TL;DR
This paper introduces a deep multimodal network approach for segmenting and understanding storytelling videos by combining perceptual, audio, and semantic cues to identify meaningful story boundaries and improve retrieval and summarization.
Contribution
It presents a novel deep network architecture that integrates multiple modalities for semantic video segmentation and proposes a retrieval strategy based on aesthetic and semantic value.
Findings
Effective segmentation of videos into meaningful stories.
Improved retrieval of significant video segments based on queries.
High agreement between automatic segmentation and human annotations.
Abstract
This paper presents a novel approach for temporal and semantic segmentation of edited videos into meaningful segments, from the point of view of the storytelling structure. The objective is to decompose a long video into more manageable sequences, which can in turn be used to retrieve its most significant parts given a textual query and to provide an effective summarization. Previous video decomposition methods mainly employed perceptual cues, tackling the problem either as a story change detection task or as a similarity grouping task, and the lack of semantics limited their ability to identify story boundaries. Our proposal connects perceptual, audio and semantic cues in a specialized deep network architecture, designed with a combination of CNNs that generate an appropriate embedding, and clusters shots into connected sequences of semantic scenes, i.e., stories. A retrieval…
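The abstract describes embedding each shot and then grouping shots into connected sequences (stories). The sketch below illustrates only the grouping idea, under loud assumptions: it is not the authors' network or clustering procedure. It takes precomputed per-shot embedding vectors and places a story boundary wherever the cosine distance between consecutive shots exceeds a threshold. The function names `cosine_distance` and `segment_shots` and the `threshold` parameter are hypothetical and chosen for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Treat a zero vector as maximally dissimilar (an assumption).
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def segment_shots(embeddings, threshold=0.5):
    """Group consecutive shots into contiguous segments.

    `embeddings` is a list of per-shot vectors in temporal order.
    `threshold` is a hypothetical cut-off, not a value from the paper.
    Returns a list of (first_shot, last_shot) index pairs, both inclusive.
    """
    if not embeddings:
        return []
    segments = []
    start = 0
    for i in range(1, len(embeddings)):
        # Start a new segment when adjacent shots are too dissimilar.
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(embeddings) - 1))
    return segments


if __name__ == "__main__":
    # Toy embeddings: two shots pointing one way, two pointing another.
    shots = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(segment_shots(shots))
```

Because segments are built only from adjacent shots, every resulting group is temporally contiguous, which matches the "connected sequences" constraint mentioned in the abstract. A learned embedding and a proper clustering objective would replace the fixed threshold in practice.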
Taxonomy
Topics: Video Analysis and Summarization · Music and Audio Processing · Advanced Image and Video Retrieval Techniques
