Multi-modal reward for visual relationships-based image captioning
Ali Abedi, Hossein Karshenas, Peyman Adibi

TL;DR
This paper introduces a novel image captioning approach that integrates visual relationship information from scene graphs with spatial features, employing a multi-modal reward for reinforcement learning to enhance caption quality.
Contribution
It proposes a deep neural network that fuses visual relationships with spatial features and uses a multi-modal reward in reinforcement learning, improving captioning performance.
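As a rough illustration of how a multi-modal reward might enter reinforcement learning, the sketch below mixes a language-side similarity with a vision-side similarity and subtracts a baseline reward. Everything here is an assumption made for illustration, not the paper's formulation: the function names, the use of cosine similarity over embedding vectors, the weight `alpha`, and the self-critical baseline (the reward of a greedily decoded caption) are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def multimodal_reward(caption_emb, reference_emb, image_emb, alpha=0.5):
    """Hypothetical reward: weighted mix of two similarities.

    caption_emb   -- embedding of the generated caption (assumed given)
    reference_emb -- embedding of a ground-truth caption (language side)
    image_emb     -- embedding of the image (vision side)
    alpha         -- assumed mixing weight in [0, 1]
    """
    language_score = cosine(caption_emb, reference_emb)
    vision_score = cosine(caption_emb, image_emb)
    return alpha * language_score + (1.0 - alpha) * vision_score


def advantage(sampled_reward, baseline_reward):
    """Self-critical style advantage: sampled caption reward minus baseline reward."""
    return sampled_reward - baseline_reward
```

In a policy-gradient setup, the advantage would scale the log-probability of the sampled caption, so captions that score above the baseline on both modalities are reinforced.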
Findings
Outperforms state-of-the-art methods on the MSCOCO dataset
Utilizes lightweight features for effective captioning
Enhances model optimization with multi-modal reward
Abstract
Deep neural networks have achieved promising results in automatic image captioning due to their effective representation learning and context-based content generation capabilities. As a prominent type of deep features used in many of the recent image captioning methods, the well-known bottom-up features provide a detailed representation of different objects of the image in comparison with the feature maps directly extracted from the raw image. However, the lack of high-level semantic information about the relationships between these objects is an important drawback of bottom-up features, despite their expensive and resource-demanding extraction procedure. To take advantage of visual relationships in caption generation, this paper proposes a deep neural network architecture for image captioning based on fusing the visual relationships information extracted from an image's scene graph with…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
