NewsStories: Illustrating articles with visual summaries
Reuben Tan, Bryan A. Plummer, Kate Saenko, JP Lewis, Avneesh Sud, Thomas Leung

TL;DR
This paper introduces a new large-scale dataset and method for learning visual-language representations that handle long, multi-image news articles with loose image-text correspondence, improving zero-shot image retrieval.
Contribution
It presents a novel setting for visual-language learning with long narratives and multiple images, along with a large dataset and a baseline method that outperforms existing approaches.
Findings
State-of-the-art methods struggle with long, multi-image narratives.
A new baseline improves zero-shot image retrieval by 10%.
The dataset contains over 31 million articles, 22 million images, and 1 million videos.
Abstract
Recent self-supervised approaches have used large-scale image-text datasets to learn powerful representations that transfer to many tasks without finetuning. These methods often assume that there is a one-to-one correspondence between images and their (short) captions. However, many tasks require reasoning about multiple images and long text narratives, such as describing news articles with visual summaries. Thus, we explore a novel setting where the goal is to learn a self-supervised visual-language representation that is robust to varying text length and number of images. In addition, unlike prior work, which assumed captions have a literal relation to the image, we assume images have only a loose, illustrative correspondence with the text. To explore this problem, we introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos. We show that…
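The retrieval setting described in the abstract can be illustrated with a minimal sketch. This is not the paper's baseline, which is not detailed in this summary. It assumes that an article and each image have already been embedded into a shared vector space by some encoder, and it scores a candidate image set by the mean cosine similarity between the article and the set's images, so that sets of different sizes remain comparable. The names `cosine`, `set_score`, and `rank_image_sets`, and the toy vectors, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def set_score(article_vec, image_vecs):
    """Mean article-to-image similarity; averaging is an assumption made here
    so that the score does not grow with the number of images in the set."""
    if not image_vecs:
        return float("-inf")  # an empty set can never be retrieved
    return sum(cosine(article_vec, img) for img in image_vecs) / len(image_vecs)


def rank_image_sets(article_vec, candidate_sets):
    """Return candidate set names ordered from best to worst match."""
    scores = {name: set_score(article_vec, imgs) for name, imgs in candidate_sets.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings standing in for encoder outputs (hypothetical values).
article = [1.0, 0.0, 0.5]
candidates = {
    "set_a": [[0.9, 0.1, 0.4], [1.0, 0.0, 0.6]],
    "set_b": [[0.0, 1.0, 0.0]],
    "set_c": [[0.5, 0.5, 0.5], [0.0, 0.2, 1.0], [1.0, 0.0, 0.0]],
}
ranking = rank_image_sets(article, candidates)
print(ranking)
```

Mean pooling is only one possible aggregation. Because the abstract stresses that images illustrate the text loosely rather than literally, other choices (for example, taking the best-matching image per set) may behave quite differently.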
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
