Relationship-based Neural Baby Talk
Fan Fu, Tingting Xie, Ioannis Patras, Sepehr Jalali

TL;DR
This paper introduces a relationship-based neural model for image captioning that encodes spatial, semantic, and implicit object interactions using graph attention networks, leading to improved captioning performance.
Contribution
The paper presents a relationship-based neural baby talk (R-NBT) model that integrates spatial, semantic, and implicit object relationships via graph attention networks for image captioning.
Findings
Outperforms state-of-the-art models on COCO captioning tasks
Effectively models spatial, semantic, and implicit relationships
Improves caption quality by incorporating diverse object interactions
Abstract
Understanding interactions between objects in an image is an important element for generating captions. In this paper, we propose a relationship-based neural baby talk (R-NBT) model to comprehensively investigate several types of pairwise object interactions by encoding each image via three different relationship-based graph attention networks (GATs). We study three main relationships: spatial relationships to explore geometric interactions, semantic relationships to extract semantic interactions, and implicit relationships to capture hidden information that could not be modelled explicitly as above. We construct three relationship graphs with the objects in an image as nodes, and the mutual relationships of pairwise objects as edges. By exploring features of neighbouring regions individually via GATs, we integrate different types of relationships into visual…
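The graph construction described in the abstract (objects as nodes, pairwise relationships as edges, features aggregated over neighbours with attention) can be illustrated with a minimal single-head graph attention layer. This is a sketch under stated assumptions, not the authors' implementation: the function `gat_layer`, the toy adjacency matrices, the feature sizes, and the random weights are all hypothetical, and treating the implicit graph as fully connected is an assumption made here for illustration.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def gat_layer(features, adjacency, weight, attn_vec):
    """Single-head graph attention over object regions (illustrative only).

    features:  (N, D_in) region features, one row per detected object
    adjacency: (N, N) boolean, adjacency[i, j] is True if j is a neighbour of i
    weight:    (D_in, D_out) shared linear projection
    attn_vec:  (2 * D_out,) attention parameters
    """
    h = features @ weight                              # (N, D_out)
    d_out = h.shape[1]
    # Score e_ij = LeakyReLU(a^T [h_i || h_j]), split into the h_i and h_j parts.
    src = h @ attn_vec[:d_out]                         # (N,)
    dst = h @ attn_vec[d_out:]                         # (N,)
    scores = leaky_relu(src[:, None] + dst[None, :])   # (N, N)
    # Each node also attends to itself, so every row has a finite score.
    mask = adjacency | np.eye(len(h), dtype=bool)
    scores = np.where(mask, scores, -np.inf)
    # Row-wise softmax restricted to neighbours (non-edges get weight 0).
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ h, alpha


rng = np.random.default_rng(0)
num_objects, d_in, d_out = 4, 8, 6
features = rng.normal(size=(num_objects, d_in))

# Hypothetical relationship graphs over the same four objects.
graphs = {
    "spatial": np.array([[0, 1, 0, 0],
                         [1, 0, 1, 0],
                         [0, 1, 0, 1],
                         [0, 0, 1, 0]], dtype=bool),
    "semantic": np.array([[0, 0, 1, 0],
                          [0, 0, 0, 1],
                          [1, 0, 0, 0],
                          [0, 1, 0, 0]], dtype=bool),
    # Assumption: the implicit graph connects every pair of objects.
    "implicit": np.ones((num_objects, num_objects), dtype=bool),
}

encoded, attention = {}, {}
for name, adj in graphs.items():
    # Separate (random, untrained) parameters per relationship type.
    weight = rng.normal(size=(d_in, d_out))
    attn_vec = rng.normal(size=2 * d_out)
    encoded[name], attention[name] = gat_layer(features, adj, weight, attn_vec)
```

Each entry of `encoded` holds one relationship-aware feature matrix per graph. How the three encodings are combined before caption generation is not visible in the truncated abstract, so the sketch stops at the per-graph outputs.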
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
