Focus! Relevant and Sufficient Context Selection for News Image Captioning

Mingyang Zhou; Grace Luo; Anna Rohrbach; Zhou Yu

arXiv:2212.00843·cs.CV·December 5, 2022·1 cites

Open Access

TL;DR

This paper introduces a method for improving news image captioning by automatically selecting relevant context from news articles using CLIP and relation extraction, leading to state-of-the-art results.

Contribution

It proposes a context selection approach that uses CLIP to localize visually grounded entities in the news article and an open relation extraction model to capture non-visual entities, improving the accuracy of news image captions.

Findings

01 Significant performance improvement over previous models

02 Achieved new state-of-the-art on multiple benchmarks

03 Effective automatic key entity localization and extraction

Abstract

News Image Captioning requires describing an image by leveraging additional context from a news article. Previous works only coarsely leverage the article to extract the necessary context, which makes it challenging for models to identify relevant events and named entities. In our paper, we first demonstrate that by combining more fine-grained context that captures the key named entities (obtained via an oracle) and the global context that summarizes the news, we can dramatically improve the model's ability to generate accurate news captions. This raises the question: how can we automatically extract such key entities from an image? We propose to use the pre-trained vision and language retrieval model CLIP to localize the visually grounded entities in the news article and then capture the non-visual entities via an open relation extraction model. Our experiments demonstrate that by simply…
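The retrieval step described in the abstract can be illustrated with a minimal sketch, under loud assumptions: the vectors below are toy stand-ins for embeddings that a CLIP-style image and text encoder would produce, and the names `cosine`, `select_context`, and the cutoff `k` are hypothetical and not taken from the paper. The sketch only shows the general idea of ranking article sentences by similarity to the image and keeping the top few as context; it is not the authors' pipeline and omits the relation extraction step.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_context(
    image_vec: Sequence[float],
    sentence_vecs: Sequence[Sequence[float]],
    sentences: Sequence[str],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k sentences most similar to the image, highest score first."""
    scored = [
        (cosine(image_vec, vec), sentence)
        for vec, sentence in zip(sentence_vecs, sentences)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data (assumption): in practice these vectors would come from an
# image encoder and a text encoder that share one embedding space.
image_vec = [1.0, 0.0, 1.0]
sentences = [
    "The mayor spoke at the opening ceremony.",
    "Markets closed lower on Friday.",
    "Crowds gathered outside city hall.",
]
sentence_vecs = [
    [0.9, 0.1, 0.8],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5],
]

selected = select_context(image_vec, sentence_vecs, sentences, k=2)
for score, sentence in selected:
    print(f"{score:.3f}  {sentence}")
```

Keeping only the top-scoring sentences is one simple way to narrow an article down to visually grounded context; entities that cannot be seen in the image would need a separate mechanism, which the abstract attributes to an open relation extraction model.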

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training