Visual Relationship Detection with Language Priors
Cewu Lu, Ranjay Krishna, Michael Bernstein, Li Fei-Fei

TL;DR
This paper introduces a scalable visual relationship detection model that leverages language priors from word embeddings to predict numerous relationships in images, improving object localization and content-based image retrieval.
Contribution
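The authors train visual models for objects and predicates individually, then combine them with language priors from semantic word embeddings. Because objects and predicates each occur far more often than any single relationship, this lets the model predict thousands of relationship types from limited training data.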
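The combination described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy two-dimensional embeddings, the per-predicate weights, and the names `language_prior` and `rank_relationships` are all hypothetical, and multiplying the visual scores by the prior is only one plausible way to combine them.

```python
from itertools import product

# Made-up 2-d "word embeddings" (assumption); a real system would load pretrained vectors.
EMBED = {
    "man": (1.0, 0.0),
    "bicycle": (0.0, 1.0),
    "street": (0.5, 0.5),
}

# Hypothetical per-predicate linear weights and bias over the
# concatenated (subject, object) embedding.
PRIOR_W = {
    "riding": ((0.9, 0.0, 0.0, 0.9), 0.0),
    "pushing": ((0.4, 0.0, 0.0, 0.4), 0.0),
}


def language_prior(subj, pred, obj):
    """Score how plausible (subj, pred, obj) is from word embeddings alone."""
    weights, bias = PRIOR_W[pred]
    features = EMBED[subj] + EMBED[obj]  # tuple concatenation
    return sum(w * x for w, x in zip(weights, features)) + bias


def rank_relationships(subj_scores, pred_scores, obj_scores):
    """Rank every candidate triple by visual score times language prior."""
    ranked = []
    for s, p, o in product(subj_scores, pred_scores, obj_scores):
        # Object and predicate models are scored separately, then combined.
        visual = subj_scores[s] * pred_scores[p] * obj_scores[o]
        ranked.append(((s, p, o), visual * language_prior(s, p, o)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Illustrative confidences (not real detector output). The two predicates
# tie visually, so the language prior decides the ordering.
ranked = rank_relationships(
    subj_scores={"man": 0.9},
    pred_scores={"riding": 0.5, "pushing": 0.5},
    obj_scores={"bicycle": 0.8, "street": 0.2},
)
for triple, score in ranked:
    print(triple, round(score, 4))
```

With these toy numbers the two predicates receive identical visual scores, so the ordering of "man riding bicycle" over "man pushing bicycle" comes entirely from the prior. That is the role the summary attributes to the language component: supplying a preference where visual evidence is sparse or ambiguous.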
Findings
Outperforms previous models in relationship prediction accuracy
Can predict thousands of relationships using few examples
Enhances image retrieval through relationship understanding
Abstract
Visual relationships capture a wide variety of interactions between pairs of objects in images (e.g. "man riding bicycle" and "man pushing bicycle"). Consequently, the set of possible relationships is extremely large and it is difficult to obtain sufficient training examples for all possible relationships. Because of this limitation, previous work on visual relationship detection has concentrated on predicting only a handful of relationships. Though most relationships are infrequent, their objects (e.g. "man" and "bicycle") and predicates (e.g. "riding" and "pushing") independently occur more frequently. We propose a model that uses this insight to train visual models for objects and predicates individually and later combines them together to predict multiple relationships per image. We improve on prior work by leveraging language priors from semantic word embeddings to finetune the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
