Focus! Relevant and Sufficient Context Selection for News Image Captioning
Mingyang Zhou, Grace Luo, Anna Rohrbach, Zhou Yu

TL;DR
This paper introduces a method for improving news image captioning by automatically selecting relevant context from news articles using CLIP and relation extraction, leading to state-of-the-art results.
Contribution
It proposes a context selection approach that combines CLIP with open relation extraction to improve the accuracy of news image captions.
Findings
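Combining fine-grained entity context with a global summary of the article significantly improves caption quality over previous models
Achieves a new state of the art on multiple benchmarks
CLIP localizes visually grounded entities in the article, and open relation extraction captures the non-visual ones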
Abstract
News Image Captioning requires describing an image by leveraging additional context from a news article. Previous works only coarsely leverage the article to extract the necessary context, which makes it challenging for models to identify relevant events and named entities. In our paper, we first demonstrate that by combining more fine-grained context that captures the key named entities (obtained via an oracle) and the global context that summarizes the news, we can dramatically improve the model's ability to generate accurate news captions. This begs the question, how to automatically extract such key entities from an image? We propose to use the pre-trained vision and language retrieval model CLIP to localize the visually grounded entities in the news article and then capture the non-visual entities via an open relation extraction model. Our experiments demonstrate that by simply…
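The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that the image and each article sentence have already been embedded into a shared space, as a CLIP-style model would do, and it simply keeps the sentences most similar to the image. The function names `cosine` and `select_context`, the parameter `k`, and the toy 3-dimensional vectors are all hypothetical and are not taken from the paper; the relation extraction step is not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_context(image_emb, sentence_embs, sentences, k=2):
    """Return the k sentences whose embeddings are closest to the image embedding."""
    scored = sorted(
        zip(sentences, sentence_embs),
        key=lambda pair: cosine(image_emb, pair[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in scored[:k]]


# Toy vectors standing in for CLIP image/text embeddings (hypothetical values).
image_emb = [0.9, 0.1, 0.0]
sentences = [
    "The senator spoke at the rally.",
    "Markets closed higher on Friday.",
    "Crowds gathered outside the capitol.",
]
sentence_embs = [
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.6, 0.5, 0.0],
]

selected = select_context(image_emb, sentence_embs, sentences, k=2)
print(selected)
```

In a real pipeline the toy vectors would be replaced by embeddings from a pre-trained model, and the selected sentences would be passed to the captioning model as fine-grained context alongside a global summary of the article.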
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
