Enhanced Modality Transition for Image Captioning
Ziwei Wang, Yadan Luo, Zi Huang

TL;DR
This paper introduces a Modality Transition Module (MTM) that transforms visual features into semantic representations to improve image captioning, resulting in more detailed and contextually accurate captions.
Contribution
The paper proposes a novel MTM that explicitly bridges the modality gap in image captioning, enhancing caption quality over existing encoder-decoder models.
Findings
Improves captioning performance by 3.4% on MS-COCO.
Effectively transfers visual features into semantic space.
Enhances contextual and detailed caption generation.
Abstract
Image captioning is a cross-modality knowledge discovery task that aims to automatically describe an image with an informative and coherent sentence. To generate captions, previous encoder-decoder frameworks forward the visual vectors directly to the recurrent language model, forcing the recurrent units to generate a sentence based on the visual features. Although these sentences are generally readable, they still suffer from a lack of details and highlights, because the substantial gap between the image and text modalities is not sufficiently addressed. In this work, we explicitly build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model. During the training phase, the modality transition network is optimised by the proposed modality loss, which compares the…
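As a rough illustration of the idea of mapping visual features into a semantic space before they reach the language model, the sketch below passes a visual feature vector through a small two-layer network. Everything in it is an assumption made for illustration: the class name `ModalityTransitionSketch`, the two-layer tanh design, the toy dimensions, and the random weights are hypothetical and are not taken from the paper. It covers only a forward pass and includes no training objective.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def linear(weights, bias, vector):
    """Compute weights @ vector + bias using plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


class ModalityTransitionSketch:
    """Hypothetical stand-in for an MTM: visual vector -> semantic vector."""

    def __init__(self, visual_dim, hidden_dim, semantic_dim, seed=0):
        rng = random.Random(seed)  # fixed seed so the toy weights are reproducible
        self.w1 = make_matrix(hidden_dim, visual_dim, rng)
        self.b1 = [0.0] * hidden_dim
        self.w2 = make_matrix(semantic_dim, hidden_dim, rng)
        self.b2 = [0.0] * semantic_dim

    def forward(self, visual_features):
        # Assumed design: one tanh hidden layer, then a linear projection.
        hidden = [math.tanh(h) for h in linear(self.w1, self.b1, visual_features)]
        return linear(self.w2, self.b2, hidden)


# Toy dimensions only; real image features are usually far larger.
mtm = ModalityTransitionSketch(visual_dim=8, hidden_dim=6, semantic_dim=4)
visual_features = [0.1 * i for i in range(8)]
semantic_vector = mtm.forward(visual_features)
print(len(semantic_vector))
```

In a captioning pipeline of the kind the abstract describes, a vector like `semantic_vector` would be passed to the language model in place of the raw visual features.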
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
