Improving Visual Relationship Detection using Semantic Modeling of Scene Descriptions
Stephan Baier, Yunpu Ma, Volker Tresp

TL;DR
This paper enhances visual relationship detection by combining semantic link prediction models with CNN-based object detection, significantly improving accuracy and generalization on complex scene datasets.
Contribution
It introduces a novel integration of semantic link prediction with visual models, enabling better detection and generalization of unseen scene triples.
Findings
Semantic modeling improves detection accuracy.
Link prediction generalizes to unseen triples.
Outperforms previous state-of-the-art methods.
Abstract
Structured scene descriptions of images are useful for the automatic processing and querying of large image databases. We show how the combination of a semantic and a visual statistical model can improve on the task of mapping images to their associated scene description. In this paper we consider scene descriptions which are represented as a set of triples (subject, predicate, object), where each triple consists of a pair of visual objects, which appear in the image, and the relationship between them (e.g. man-riding-elephant, man-wearing-hat). We combine a standard visual model for object detection, based on convolutional neural networks, with a latent variable model for link prediction. We apply multiple state-of-the-art link prediction methods and compare their capability for visual relationship detection. One of the main advantages of link prediction methods is that they can also…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Graph Neural Networks
