Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation
Ruichi Yu, Ang Li, Vlad I. Morariu, Larry S. Davis

TL;DR
This paper distills linguistic knowledge from internal sources (training annotations) and external sources (e.g., Wikipedia) into a visual model, using linguistic statistics to regularize learning and improve visual relationship detection, especially for unseen relationships.
Contribution
It proposes a novel approach that distills linguistic knowledge from annotations and external text sources into a visual model to enhance generalization and zero-shot prediction capabilities.
Findings
Significant improvement in zero-shot recall on the VRD dataset.
Outperforms state-of-the-art methods in visual relationship detection.
Effective use of linguistic knowledge from Wikipedia and annotations.
Abstract
Understanding visual relationships involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the (subj,obj) pair (both semantically and spatially) to predict the predicates conditioned on the subjects and the objects. Modeling the three entities jointly more accurately reflects their relationships, but complicates learning since the semantic space of visual relationships is huge and the training data is limited, especially for the long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a…
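The linguistic statistic described in the abstract, the conditional probability of a predicate given a (subj,obj) pair, can be illustrated with a small counting sketch. This is only an assumed, simplified estimate from relative frequencies; the function name `predicate_given_pair` and the toy triplets are hypothetical and are not taken from the paper, which may compute and smooth these statistics differently.

```python
from collections import Counter, defaultdict


def predicate_given_pair(triplets):
    """Estimate P(pred | subj, obj) by relative frequency.

    `triplets` is an iterable of (subj, pred, obj) string tuples, e.g. mined
    from training annotations or from parsed external text (assumed input).
    """
    # For every (subj, obj) pair, count how often each predicate occurs.
    pair_counts = defaultdict(Counter)
    for subj, pred, obj in triplets:
        pair_counts[(subj, obj)][pred] += 1

    # Normalize the counts of each pair into a probability distribution.
    dist = {}
    for pair, counts in pair_counts.items():
        total = sum(counts.values())
        dist[pair] = {pred: n / total for pred, n in counts.items()}
    return dist


# Toy annotations, invented purely for illustration.
annotations = [
    ("person", "ride", "horse"),
    ("person", "ride", "horse"),
    ("person", "next to", "horse"),
    ("person", "wear", "hat"),
]

linguistic_prior = predicate_given_pair(annotations)
print(linguistic_prior[("person", "horse")])
```

A distribution of this kind could then act as the soft linguistic signal that regularizes the visual model. Pairs that never appear in the counted text have no entry here, which is one reason a larger external corpus can be useful alongside the training annotations.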
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
