Relationship-based Neural Baby Talk

Fan Fu; Tingting Xie; Ioannis Patras; Sepehr Jalali

arXiv:2103.04846·cs.CV·March 9, 2021

TL;DR

This paper introduces a relationship-based neural model for image captioning that encodes spatial, semantic, and implicit object interactions using graph attention networks, leading to improved captioning performance.

Contribution

The paper presents a novel R-NBT model that integrates multiple types of object relationships via graph attention networks for enhanced image captioning.

Findings

01 Outperforms state-of-the-art models on COCO captioning tasks

02 Effectively models spatial, semantic, and implicit relationships

03 Improves caption quality by incorporating diverse object interactions

Abstract

Understanding interactions between objects in an image is an important element for generating captions. In this paper, we propose a relationship-based neural baby talk (R-NBT) model to comprehensively investigate several types of pairwise object interactions by encoding each image via three different relationship-based graph attention networks (GATs). We study three main relationships: spatial relationships to explore geometric interactions, semantic relationships to extract semantic interactions, and implicit relationships to capture hidden information that could not be modelled explicitly as above. We construct three relationship graphs with the objects in an image as nodes, and the mutual relationships of pairwise objects as edges. By exploring features of neighbouring regions individually via GATs, we integrate different types of relationships into visual…
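
The graph construction and attention step described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the IoU-based edge rule, the function names `spatial_adjacency` and `graph_attention`, the feature sizes, and the random weights are all assumptions made for the example. The attention follows the commonly used single-head graph attention formulation, which the abstract does not confirm in detail.

```python
import numpy as np


def spatial_adjacency(boxes, iou_threshold=0.0):
    """Build a boolean adjacency matrix over detected objects.

    ASSUMPTION: two objects are connected when their boxes overlap
    (IoU > iou_threshold). The paper's actual edge rule may differ.
    boxes: (N, 4) array of [x1, y1, x2, y2] with positive area.
    """
    boxes = np.asarray(boxes, dtype=float)
    # Pairwise intersection rectangle via broadcasting.
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    adj = (inter / union) > iou_threshold
    np.fill_diagonal(adj, True)  # self-loops so every node has a neighbour
    return adj


def graph_attention(features, adj, W, a, slope=0.2):
    """Single-head graph attention over the neighbours given by adj.

    features: (N, D) region features, W: (D, D') projection,
    a: (2 * D',) attention vector (all hypothetical parameters here).
    Returns the attended features (N, D') and the weights (N, N).
    """
    h = features @ W
    d = h.shape[1]
    # a^T [h_i || h_j] splits into a source term plus a destination term.
    e = (h @ a[:d])[:, None] + (h @ a[d:])[None, :]
    e = np.where(e > 0, e, slope * e)       # LeakyReLU
    e = np.where(adj, e, -np.inf)           # mask out non-neighbours
    e = e - e.max(axis=1, keepdims=True)    # numerically stable softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ h, alpha


# Toy example: objects 0 and 1 overlap, object 2 is isolated.
rng = np.random.default_rng(0)
boxes = [[0, 0, 2, 2], [1, 1, 3, 3], [5, 5, 6, 6]]
features = rng.normal(size=(3, 8))
W = rng.normal(size=(8, 4))
a = rng.normal(size=(8,))

adj = spatial_adjacency(boxes)
out, alpha = graph_attention(features, adj, W, a)
```

In this toy setup the isolated object attends only to itself, while the two overlapping objects share attention between each other. A semantic or implicit graph would reuse `graph_attention` with a different adjacency matrix; how the paper builds those graphs is not specified in the text above.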

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization