Visual News: Benchmark and Challenges in News Image Captioning

Fuxiao Liu; Yinghan Wang; Tianlu Wang; Vicente Ordonez

arXiv:2010.03743 · cs.CV · September 15, 2021 · 5 cites

Open Access · 1 Repo · 2 Datasets

TL;DR

This paper introduces Visual News, a large-scale dataset and an entity-aware Transformer-based model for news image captioning, emphasizing the importance of entities and events, and demonstrating improved captioning performance with fewer parameters.

Contribution

The paper presents a new large-scale dataset and a novel entity-aware Transformer model with multi-modal fusion for news image captioning, addressing the unique challenges of this task.

Findings

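01. The proposed model achieves slightly better prediction results while using far fewer parameters.

02. The Visual News dataset contains more than one million news images, along with associated news articles, image captions, author information, and other metadata.

03. The dataset highlights remaining challenges in captioning complex news images.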

Abstract

We propose Visual News Captioner, an entity-aware model for the task of news image captioning. We also introduce Visual News, a large-scale benchmark consisting of more than one million news images along with associated news articles, image captions, author information, and other metadata. Unlike the standard image captioning task, news images depict situations where people, locations, and events are of paramount importance. Our proposed method can effectively combine visual and textual features to generate captions with richer information such as events and entities. More specifically, built upon the Transformer architecture, our model is further equipped with novel multi-modal feature fusion techniques and attention mechanisms, which are designed to generate named entities more accurately. Our method uses far fewer parameters while achieving slightly better prediction results…
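To make "combine visual and textual features" with attention concrete, here is a minimal sketch of scaled dot-product attention (the mechanism behind the Transformer listed under Methods) applied jointly to image and text feature vectors. It is an illustration only, not the paper's architecture: the function names `softmax`, `attend`, and `fuse`, the fusion-by-concatenation choice, and the toy feature values are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


def fuse(query, image_feats, text_feats):
    """Hypothetical fusion: attend over image and text features as one memory.

    Returns the context vector plus the total attention mass that landed
    on each modality.
    """
    memory = image_feats + text_feats  # list concatenation, not vector addition
    context, weights = attend(query, memory, memory)
    n_img = len(image_feats)
    return context, sum(weights[:n_img]), sum(weights[n_img:])


# Toy inputs (made up): two image-region vectors and one text-token vector.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 1.0]]
context, img_mass, txt_mass = fuse([1.0, 0.0], image_feats, text_feats)
```

The two returned masses always sum to one, so they give a quick read on how much a decoding step leaned on the image versus the article text. A real model would use learned projections and multiple heads rather than raw features.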

Code & Models

Repositories

FuxiaoLiu/VisualNews-Repository
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization · Softmax · Byte Pair Encoding · Residual Connection · Dense Connections