Learning Object Detection from Captions via Textual Scene Attributes

Achiya Jerbi; Roei Herzig; Jonathan Berant; Gal Chechik; Amir Globerson

arXiv:2009.14558 · cs.CV · October 1, 2020 · 6 cites

Open Access

TL;DR

This paper introduces a method to leverage image captions, which contain rich scene attributes and relations, to train object detectors effectively, reducing the need for extensive bounding box annotations.

Contribution

The paper proposes using textual scene attributes extracted from captions as supervision signals for object detection, extending weakly supervised detection beyond category-only labels.

Findings

01. Achieves state-of-the-art results on multiple datasets.
02. Outperforms recent weak supervision approaches.
03. Effectively uses caption attributes for object detection.

Abstract

Object detection is a fundamental task in computer vision, requiring large annotated datasets that are difficult to collect, as annotators need to label objects and their bounding boxes. Thus, it is a significant challenge to use cheaper forms of supervision effectively. Recent work has begun to explore image captions as a source for weak supervision, but to date, in the context of object detection, captions have only been used to infer the categories of the objects in the image. In this work, we argue that captions contain much richer information about the image, including attributes of objects and their relations. Namely, the text represents a scene of the image, as described recently in the literature. We present a method that uses the attributes in this "textual scene graph" to train object detectors. We empirically demonstrate that the resulting model achieves state-of-the-art…
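
To make the idea of caption-derived supervision concrete, here is a minimal sketch of how object categories and their attributes might be read off a caption as image-level weak labels. It is not the authors' pipeline: the vocabularies, the function name `caption_to_labels`, and the rule that attaches attributes to the category word immediately following them are all assumptions made for illustration, standing in for a real textual scene graph parser.

```python
# Toy vocabularies; hypothetical, not taken from the paper.
CATEGORIES = {"dog", "ball", "car", "man"}
ATTRIBUTES = {"red", "brown", "small", "large"}


def caption_to_labels(caption):
    """Return (categories, (category, attribute) pairs) found in a caption.

    A run of attribute words is attached to the category word that
    immediately follows it. This is a crude stand-in for a parser.
    """
    tokens = [t.strip(".,!?").lower() for t in caption.split()]
    categories = set()
    pairs = set()
    pending = []  # attribute words seen directly before the current token
    for tok in tokens:
        if tok in ATTRIBUTES:
            pending.append(tok)
        elif tok in CATEGORIES:
            categories.add(tok)
            pairs.update((tok, attr) for attr in pending)
            pending = []
        else:
            # Any other word breaks the attribute run.
            pending = []
    return categories, pairs


cats, pairs = caption_to_labels("A small brown dog chases a red ball.")
print(sorted(cats))   # ['ball', 'dog']
print(sorted(pairs))  # [('ball', 'red'), ('dog', 'brown'), ('dog', 'small')]
```

Category-only weak supervision would keep just the first set; the second set is the extra signal that the abstract argues captions already carry.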

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning