Multi-modal reward for visual relationships-based image captioning

Ali Abedi; Hossein Karshenas; Peyman Adibi

arXiv:2303.10766 · cs.CV · March 22, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a novel image captioning approach that integrates visual relationship information from scene graphs with spatial features, employing a multi-modal reward for reinforcement learning to enhance caption quality.

Contribution

It proposes a deep neural network that fuses visual relationships with spatial features and uses a multi-modal reward in reinforcement learning, improving captioning performance.
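The idea of a multi-modal reward can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary here does not say which terms the reward combines, so the text-side score (`text_score`), the image and caption embeddings, the weight `alpha`, and every function name below are hypothetical. The self-critical baseline (sampled reward minus greedy reward) is a common choice in reinforcement learning for captioning and is swapped in here only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def multimodal_reward(text_score, image_emb, caption_emb, alpha=0.5):
    """Hypothetical reward: weighted sum of a language score and a visual score.

    text_score  -- assumed caption-vs-reference metric value, e.g. in [0, 1]
    image_emb   -- assumed image embedding in a shared image-text space
    caption_emb -- assumed caption embedding in the same space
    alpha       -- assumed mixing weight between the two modalities
    """
    visual_score = cosine_similarity(image_emb, caption_emb)
    return alpha * text_score + (1.0 - alpha) * visual_score


def self_critical_advantage(sampled_reward, greedy_reward):
    """Advantage of a sampled caption over the greedy-decoded baseline caption."""
    return sampled_reward - greedy_reward


# Usage: score a sampled caption and a greedy caption, then compare them.
sampled = multimodal_reward(0.8, [1.0, 0.0], [1.0, 0.0])
greedy = multimodal_reward(0.6, [1.0, 0.0], [0.0, 1.0])
advantage = self_critical_advantage(sampled, greedy)
```

A positive advantage would raise the probability of the sampled caption during policy-gradient training; a negative one would lower it. Mixing the two scores with a single weight is the simplest option, and other combinations are equally plausible.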

Findings

01. Outperforms state-of-the-art methods on the MSCOCO dataset

02. Utilizes lightweight features for effective captioning

03. Enhances model optimization with a multi-modal reward

Abstract

Deep neural networks have achieved promising results in automatic image captioning due to their effective representation learning and context-based content generation capabilities. As a prominent type of deep features used in many of the recent image captioning methods, the well-known bottom-up features provide a detailed representation of different objects of the image in comparison with the feature maps directly extracted from the raw image. However, the lack of high-level semantic information about the relationships between these objects is an important drawback of bottom-up features, despite their expensive and resource-demanding extraction procedure. To take advantage of visual relationships in caption generation, this paper proposes a deep neural network architecture for image captioning based on fusing the visual relationships information extracted from an image's scene graph with…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization