Contextually Customized Video Summaries via Natural Language
Jinsoo Choi, Tae-Hyun Oh, In So Kweon

TL;DR
This paper presents a method for generating personalized video summaries based on simple text descriptions by learning semantic embeddings and selecting relevant segments, outperforming some baseline methods.
Contribution
We introduce a novel approach to create customized video summaries from text, leveraging semantic embeddings learned through a deep architecture trained on image-caption data.
Findings
Our method produces semantically relevant video summaries based on user text.
It achieves comparable or better performance than baseline methods.
The approach generates diverse summaries using learned visual embeddings.
Abstract
The best summary of a long video differs among different people due to its highly subjective nature. Even for the same person, the best summary may change with time or mood. In this paper, we introduce the task of generating customized video summaries through simple text. First, we train a deep architecture to effectively learn semantic embeddings of video frames by leveraging the abundance of image-caption data via a progressive and residual manner. Given a user-specific text description, our algorithm is able to select semantically relevant video segments and produce a temporally aligned video summary. In order to evaluate our textually customized video summaries, we conduct experimental comparison with baseline methods that utilize ground-truth information. Despite the challenging baselines, our method still manages to show comparable or even exceeding performance. We also show that…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsVideo Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
