Transform, Contrast and Tell: Coherent Entity-Aware Multi-Image Captioning

Jingqiang Chen

arXiv:2302.02124 · cs.CV · November 30, 2023 · 1 citation

TL;DR

This paper introduces a novel multi-image captioning model that leverages coherence relationships among neighboring images and captions, utilizing contrastive learning and a new dataset to improve caption coherence and entity consistency.

Contribution

It proposes a coherent entity-aware multi-image captioning model with contrastive coherence mechanisms and introduces the DM800K dataset for better multi-image captioning evaluation.

Findings

01. The model outperforms 7 baselines on multiple metrics.

02. Generated captions show higher coherence and entity consistency.

03. Proposed coherence metrics align well with human judgments.

Abstract

Coherent entity-aware multi-image captioning aims to generate coherent captions for neighboring images in a news document. There are coherence relationships among neighboring images because they often describe the same entities or events. These relationships are important for entity-aware multi-image captioning, but are neglected in entity-aware single-image captioning. Most existing work focuses on single-image captioning, while multi-image captioning has not been explored before. Hence, this paper proposes a coherent entity-aware multi-image captioning model that makes use of these coherence relationships. The model consists of a Transformer-based caption generation model and two types of contrastive learning-based coherence mechanisms. The generation model generates the caption by attending to the image and the accompanying text. The caption-caption coherence mechanism aims to render…
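The abstract is truncated before it specifies the contrastive objectives, so the paper's actual loss is not shown here. As a rough illustration of what a contrastive coherence signal between captions could look like, the following is a minimal sketch of a generic InfoNCE-style loss. The function names, the use of cosine similarity, the temperature value, and the toy embedding vectors are all assumptions made for illustration, not the author's method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (hypothetical helper, not from the paper).

    anchor:    embedding of one caption
    positive:  embedding assumed to be coherent with the anchor,
               e.g. the caption of a neighboring image
    negatives: embeddings assumed to be unrelated to the anchor
    """
    # Similarity scores scaled by temperature; the positive sits at index 0.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_sum - logits[0]


# Toy embeddings (made up): the neighbor caption points the same way as the anchor.
anchor = [1.0, 0.0]
neighbor_caption = [0.9, 0.1]
unrelated_captions = [[0.0, 1.0], [-1.0, 0.2]]
loss = info_nce(anchor, neighbor_caption, unrelated_captions)
```

Minimizing a loss of this shape pulls the anchor toward the coherent neighbor and pushes it away from the unrelated captions. How the paper actually selects positive and negative pairs is not stated in this excerpt.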

Code & Models

Repositories

jingqiangchen/concaps
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques