Enhanced Modality Transition for Image Captioning

Ziwei Wang; Yadan Luo; Zi Huang

arXiv:2102.11526·cs.CV·February 24, 2021

Open Access

TL;DR

This paper introduces a Modality Transition Module (MTM) that transforms visual features into semantic representations to improve image captioning, resulting in more detailed and contextually accurate captions.

Contribution

The paper proposes a novel MTM that explicitly bridges the modality gap in image captioning, enhancing caption quality over existing encoder-decoder models.

Findings

01. Improves captioning performance by 3.4% on MS-COCO.

02. Effectively transfers visual features into semantic space.

03. Enhances contextual and detailed caption generation.

Abstract

Image captioning is a cross-modality knowledge discovery task that aims to automatically describe an image with an informative and coherent sentence. To generate captions, previous encoder-decoder frameworks forward the visual vectors directly to the recurrent language model, forcing the recurrent units to generate a sentence based on the visual features. Although these sentences are generally readable, they still lack details and highlights because the substantial gap between the image and text modalities is not sufficiently addressed. In this work, we explicitly build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model. During the training phase, the modality transition network is optimised by the proposed modality loss, which compares the…
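The idea of mapping visual features into a semantic space and training that mapping with a dedicated loss can be illustrated with a very small sketch. The abstract is cut off before it says what the modality loss compares, so everything here is an assumption made for illustration: the single linear layer, the mean squared error, the target text vector, and all names (`make_linear`, `transition`, `modality_loss`) are hypothetical and are not the authors' implementation.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Hypothetical transition layer: small random weights and a zero bias."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def transition(visual, layer):
    """Map a visual feature vector into the (assumed) semantic space."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, visual)) + b
            for row, b in zip(weights, bias)]

def modality_loss(predicted, target):
    """Mean squared error, used here as a stand-in; the paper's loss may differ."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

# Toy inputs: both vectors are made-up placeholders, not real features.
visual_feature = [0.5, -1.0, 2.0, 0.0]   # stand-in for an image encoder output
target_text_vec = [0.1, 0.2, -0.3]       # assumed target in the text space

layer = make_linear(in_dim=4, out_dim=3)
semantic = transition(visual_feature, layer)
loss = modality_loss(semantic, target_text_vec)
```

In a trainable version, `loss` would be minimised with respect to the layer's weights, and `semantic` rather than the raw visual vector would be passed to the language model.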

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques