Image Captioning with Object Detection and Localization
Zhongliang Yang, Yu-Jin Zhang, Sadaqat ur Rehman, Yongfeng Huang

TL;DR
This paper introduces a neural network model for image captioning that combines object detection, localization, and an attention-based LSTM to generate descriptive sentences, aligning words with image objects.
Contribution
The novel integration of object detection, localization, and attention mechanisms in an LSTM-based model for improved image captioning performance.
Findings
Outperforms previous benchmark models on COCO dataset
Effectively aligns words with image objects during caption generation
Demonstrates the benefit of combining object detection with attention in image captioning
Abstract
Automatically generating a natural language description of an image is a task close to the heart of image understanding. In this paper, we present a multi-model neural network method closely related to the human visual system that automatically learns to describe the content of images. Our model consists of two sub-models: an object detection and localization model, which extract the information of objects and their spatial relationship in images respectively; Besides, a deep recurrent neural network (RNN) based on long short-term memory (LSTM) units with attention mechanism for sentences generation. Each word of the description will be automatically aligned to different objects of the input image when it is generated. This is similar to the attention mechanism of the human visual system. Experimental results on the COCO dataset showcase the merit of the proposed method, which…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
