Video Summarization: Towards Entity-Aware Captions
Hammad A. Ayyubi, Tianqi Liu, Arsha Nagrani, Xudong Lin, Mingda Zhang, Anurag Arnab, Feng Han, Yukun Zhu, Jialu Liu, Shih-Fu Chang

TL;DR
This paper introduces the task of generating entity-aware captions for news videos, presents a large-scale dataset, and proposes a method that combines visual data with external knowledge to improve captioning accuracy.
Contribution
It defines a new task of entity-aware news video captioning, releases the VIEWS dataset, and proposes a knowledge-augmented captioning method that enhances existing models.
Findings
The proposed method improves caption quality on news videos.
The approach generalizes to news image caption datasets.
Experiments on three video captioning models validate the effectiveness of the method.
Abstract
Existing popular video captioning benchmarks and models deal with generic captions devoid of specific person, place or organization named entities. In contrast, news videos present a challenging setting where the caption requires such named entities for meaningful summarization. As such, we propose the task of summarizing news video directly to entity-aware captions. We also release a large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task. Further, we propose a method that augments visual information from videos with context retrieved from external world knowledge to generate entity-aware captions. We demonstrate the effectiveness of our approach on three video captioning models. We also show that our approach generalizes to an existing news image captioning dataset. With all the extensive experiments and insights, we believe we establish a solid basis for future research…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Natural Language Processing Techniques
