Learning to Relate from Captions and Bounding Boxes

Sarthak Garg; Joel Ruben Antony Moniz; Anshu Aviral; Priyatham Bollimpalli

arXiv:1912.00311·cs.CV·December 3, 2019

TL;DR

This paper introduces a weakly supervised method for predicting relationships between entities in an image, using captions and object bounding boxes as the only supervision. It relies on an attention mechanism and the syntactic structure of the captions to align entities and relations.

Contribution

It presents a novel approach that leverages captions and bounding boxes with attention and syntax to train relation classifiers without explicit relation annotations.

Findings

01

Achieved 15% recall@50 and 25% recall@100 on Visual Genome relationships.

02

Successfully predicts relations not explicitly mentioned in captions.

03

Demonstrates effectiveness of weak supervision in relation prediction.

Abstract

In this work, we propose a novel approach that predicts the relationships between various entities in an image in a weakly supervised manner by relying on image captions and object bounding box annotations as the sole source of supervision. Our proposed approach uses a top-down attention mechanism to align entities in captions to objects in the image, and then leverages the syntactic structure of the captions to align the relations. We use these alignments to train a relation classification network, thereby obtaining both grounded captions and dense relationships. We demonstrate the effectiveness of our model on the Visual Genome dataset by achieving a recall@50 of 15% and recall@100 of 25% on the relationships present in the image. We also show that the model successfully predicts relations that are not present in the corresponding captions.
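
The recall@K figures quoted in the abstract can be illustrated with a minimal sketch. It assumes a simplified setting that is not taken from the paper: each relationship is a (subject, predicate, object) triple of labels, and a prediction counts as a hit only on an exact label match. Real Visual Genome evaluation usually also requires the predicted boxes to overlap the annotated ones, which is omitted here. The function name `recall_at_k` and the example data are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: (subject, predicate, object) labels only.
Triple = Tuple[str, str, str]


def recall_at_k(
    scored_predictions: List[Tuple[Triple, float]],
    ground_truth: List[Triple],
    k: int,
) -> float:
    """Fraction of ground-truth triples found among the k highest-scoring predictions."""
    unique_truth = set(ground_truth)
    if not unique_truth:
        return 0.0
    # Rank predictions by confidence, highest first, and keep the top k.
    ranked = sorted(scored_predictions, key=lambda pair: pair[1], reverse=True)
    top_k = {triple for triple, _score in ranked[:k]}
    hits = sum(1 for triple in unique_truth if triple in top_k)
    return hits / len(unique_truth)


# Made-up example data for one image (not from the paper).
preds = [
    (("man", "riding", "horse"), 0.9),
    (("man", "wearing", "hat"), 0.7),
    (("horse", "on", "grass"), 0.4),
    (("hat", "on", "horse"), 0.2),
]
truth = [
    ("man", "riding", "horse"),
    ("horse", "on", "grass"),
    ("man", "holding", "rope"),
]

r_at_2 = recall_at_k(preds, truth, k=2)
r_at_3 = recall_at_k(preds, truth, k=3)
```

In this toy example only one of the three annotated triples appears in the top two predictions, while two of them appear in the top three, so recall rises as K grows. A dataset-level score would average or pool such per-image results; which of the two the paper uses is not stated in the abstract.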
