Transform, Contrast and Tell: Coherent Entity-Aware Multi-Image Captioning
Jingqiang Chen

TL;DR
This paper introduces a multi-image captioning model that exploits coherence relationships among neighboring images and captions, using contrastive learning and a new dataset to improve caption coherence and entity consistency.
Contribution
It proposes a coherent entity-aware multi-image captioning model with contrastive coherence mechanisms and introduces the DM800K dataset for better multi-image captioning evaluation.
Findings
The model outperforms 7 baselines on multiple metrics.
Generated captions show higher coherence and entity consistency.
Proposed coherence metrics align well with human judgments.
Abstract
Coherent entity-aware multi-image captioning aims to generate coherent captions for neighboring images in a news document. There are coherence relationships among neighboring images because they often describe the same entities or events. These relationships are important for entity-aware multi-image captioning, but are neglected in entity-aware single-image captioning. Most existing work focuses on single-image captioning, while multi-image captioning has not been explored before. Hence, this paper proposes a coherent entity-aware multi-image captioning model that makes use of coherence relationships. The model consists of a Transformer-based caption generation model and two types of contrastive learning-based coherence mechanisms. The generation model generates the caption by attending to the image and the accompanying text. The caption-caption coherence mechanism aims to render…
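The abstract names contrastive learning-based coherence mechanisms but is cut off before it describes them. As a rough illustration only, the sketch below uses an InfoNCE-style loss, a common contrastive objective, to show how a caption embedding could be pulled toward a coherent neighbor and pushed away from unrelated ones. The choice of InfoNCE, the function names `cosine` and `info_nce`, the temperature value, and the toy vectors are all assumptions and are not taken from the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    # InfoNCE-style loss (assumed here, not confirmed as the paper's loss):
    # small when the anchor is closer to the positive than to the negatives.
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Hypothetical 2-d embeddings: a caption, its neighboring caption,
# and two captions from unrelated images.
caption = [1.0, 0.0]
neighbor = [0.9, 0.1]
unrelated = [[0.0, 1.0], [-1.0, 0.0]]

coherent_loss = info_nce(caption, neighbor, unrelated)
incoherent_loss = info_nce(caption, unrelated[0], [neighbor, unrelated[1]])
print(coherent_loss < incoherent_loss)
```

In this toy setup the loss is lower when the neighboring caption is treated as the positive than when an unrelated caption takes its place, which is the behavior a coherence objective of this kind relies on.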
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
