Visually-Aware Context Modeling for News Image Captioning

Tingyu Qu; Tinne Tuytelaars; Marie-Francine Moens

arXiv:2308.08325 · cs.CV · March 22, 2024 · 1 cites

TL;DR

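This paper introduces a visually-aware framework for news image captioning that combines a face-naming module, CLIP-based retrieval of article sentences, and a training method called CoLaM (Contrasting with Language Model backbone). It improves CIDEr over the previous state of the art by 7.97 on GoodNews and 5.80 on NYTimes800k.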

Contribution

It proposes a framework that combines a face-naming module, CLIP-based sentence retrieval, and CoLaM training to improve news image captioning.

Findings

01 Outperforms previous state-of-the-art by 7.97 CIDEr on GoodNews

02 Outperforms previous state-of-the-art by 5.80 CIDEr on NYTimes800k

03 Demonstrates effectiveness of face-naming and semantic retrieval modules

Abstract

News Image Captioning aims to create captions from news articles and images, emphasizing the connection between textual context and visual elements. Recognizing the significance of human faces in news images and the face-name co-occurrence pattern in existing datasets, we propose a face-naming module for learning better name embeddings. Apart from names, which can be directly linked to an image area (faces), news image captions mostly contain context information that can only be found in the article. We design a retrieval strategy using CLIP to retrieve sentences that are semantically close to the image, mimicking the human thought process of linking articles to images. Furthermore, to tackle the problem of the imbalanced proportion of article context and image context in captions, we introduce a simple yet effective method, Contrasting with Language Model backbone (CoLaM), to the training…
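The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes the image and each article sentence have already been embedded into a shared vector space (in the paper this is done with CLIP; here the vectors are hand-written stand-ins), and it ranks sentences by cosine similarity to the image. The function names `cosine` and `retrieve_sentences`, the parameter `k`, and the toy vectors are hypothetical and are not taken from the authors' repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_sentences(image_emb, sentence_embs, k=2):
    """Return indices of the k sentences whose embeddings are closest to the image.

    Assumption: image_emb and sentence_embs live in the same embedding space,
    e.g. produced by the image and text encoders of a CLIP-style model.
    """
    scored = [(cosine(image_emb, s), i) for i, s in enumerate(sentence_embs)]
    # Highest similarity first; ties are broken by original sentence order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Toy stand-in embeddings (hypothetical, not real CLIP outputs).
image = [1.0, 0.0]
sentences = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5], [-1.0, 0.0]]
top = retrieve_sentences(image, sentences, k=2)
```

With real encoders, the selected sentences would be passed to the captioning model as article context; how many sentences to keep is a design choice this sketch leaves to `k`.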

Code & Models

Repositories

tingyu215/vacnic (PyTorch, official)

Videos

Visually-Aware Context Modeling for News Image Captioning · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training