Enhancing image captioning with depth information using a Transformer-based framework
Aya Mahmoud Ahmed, Mohamed Yousef, Khaled F. Hussain, Yousef Bassyouni Mahdy

TL;DR
This paper introduces a Transformer-based framework that integrates depth information with RGB images to improve multi-sentence image captioning, demonstrating benefits on benchmark datasets after dataset cleaning.
Contribution
It proposes a novel multi-modal Transformer framework combining RGB and depth data for enhanced scene understanding in captioning tasks.
Findings
Depth information improves caption quality when the dataset is clean.
Dataset inconsistencies can hinder the benefits of depth integration.
The framework works with both ground truth and estimated depth maps.
Abstract
Captioning images is a challenging scene-understanding task that connects computer vision and natural language processing. While image captioning models have been successful in producing excellent descriptions, the field has primarily focused on generating a single sentence for 2D images. This paper investigates whether integrating depth information with RGB images can enhance the captioning task and generate better descriptions. For this purpose, we propose a Transformer-based encoder-decoder framework for generating a multi-sentence description of a 3D scene. The RGB image and its corresponding depth map are provided as inputs to our framework, which combines them to produce a better understanding of the input scene. Depth maps could be ground truth or estimated, which makes our framework widely applicable to any RGB captioning dataset. We explored different fusion approaches to fuse…
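To make the idea of fusing RGB and depth features concrete, here is a minimal sketch of three generic fusion strategies that are common in multi-modal models. The abstract is cut off before it names the approaches the authors explored, so these are illustrative assumptions rather than the paper's method. The function names and the `Tokens` type are hypothetical, and the tokens stand in for per-region features that a real system would obtain from an image encoder.

```python
from typing import List

# Hypothetical representation: one feature vector per image region.
Tokens = List[List[float]]


def fuse_concat_features(rgb: Tokens, depth: Tokens) -> Tokens:
    """Join each RGB vector with the depth vector of the same region."""
    if len(rgb) != len(depth):
        raise ValueError("RGB and depth must cover the same regions")
    return [r + d for r, d in zip(rgb, depth)]


def fuse_sum(rgb: Tokens, depth: Tokens) -> Tokens:
    """Add RGB and depth vectors element-wise, region by region."""
    if len(rgb) != len(depth):
        raise ValueError("RGB and depth must cover the same regions")
    fused = []
    for r, d in zip(rgb, depth):
        if len(r) != len(d):
            raise ValueError("feature sizes must match for summation")
        fused.append([a + b for a, b in zip(r, d)])
    return fused


def fuse_concat_sequence(rgb: Tokens, depth: Tokens) -> Tokens:
    """Append the depth tokens after the RGB tokens as one longer sequence."""
    return [list(v) for v in rgb] + [list(v) for v in depth]


if __name__ == "__main__":
    rgb_tokens = [[1.0, 2.0], [3.0, 4.0]]
    depth_tokens = [[0.5, 0.5], [1.0, 1.0]]
    print(fuse_concat_features(rgb_tokens, depth_tokens))
    print(fuse_sum(rgb_tokens, depth_tokens))
    print(fuse_concat_sequence(rgb_tokens, depth_tokens))
```

Feature concatenation and summation keep one token per region, while sequence concatenation doubles the number of tokens and leaves it to attention to relate the two modalities. Because none of these functions care where the depth tokens come from, the same code applies to ground-truth and estimated depth maps.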
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
