CREPE: Learnable Prompting With CLIP Improves Visual Relationship Prediction
Rakshith Subramanyam, T. S. Jayram, Rushil Anirudh, Jayaraman J. Thiagarajan

TL;DR
This paper introduces CREPE, a novel approach leveraging CLIP's language priors and contrastive training within the UVTransE framework to significantly improve visual relationship prediction accuracy on the Visual Genome benchmark.
Contribution
CREPE is the first method to systematically incorporate CLIP representations into relation prediction, achieving state-of-the-art results with a simpler, contrastive training strategy.
Findings
Achieves mR@5 of 27.79 and mR@20 of 31.95 on Visual Genome.
Outperforms the recent state of the art by 15.3% at mR@20.
Demonstrates CLIP's effectiveness in object relation prediction.
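For readers unfamiliar with the mR@K (mean recall at K) figures above, the following is a minimal sketch of one common way to compute such a metric. It assumes a simplified per-pair setting in which each subject-object pair has one ground-truth predicate and the model outputs a score per predicate class. The function name `mean_recall_at_k` and the toy data are illustrative assumptions, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def mean_recall_at_k(scores, labels, k):
    """Simplified mR@K: recall@K per predicate class, averaged over classes.

    scores: (N, C) array of predicate scores, one row per subject-object pair.
    labels: (N,) array of ground-truth predicate ids.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Indices of the k highest-scoring predicates for each pair.
    topk = np.argsort(-scores, axis=1)[:, :k]
    # A pair counts as a hit if its true predicate is among the top k.
    hit = (topk == labels[:, None]).any(axis=1)
    # Recall per predicate class, then an unweighted mean over the classes present.
    recalls = [hit[labels == c].mean() for c in np.unique(labels)]
    return float(np.mean(recalls))

# Toy data (hypothetical): 4 pairs, 3 predicate classes.
scores = [[0.9, 0.1, 0.0],
          [0.8, 0.15, 0.05],
          [0.2, 0.3, 0.5],
          [0.6, 0.3, 0.1]]
labels = [0, 0, 2, 1]
print(mean_recall_at_k(scores, labels, k=1))
print(mean_recall_at_k(scores, labels, k=2))
```

Averaging per class rather than per instance keeps frequent predicates from dominating the score, which is why mean recall is often reported on long-tailed benchmarks such as Visual Genome.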
Abstract
In this paper, we explore the potential of Vision-Language Models (VLMs), specifically CLIP, in predicting visual object relationships, which involves interpreting visual features from images into language-based relations. Current state-of-the-art methods use complex graphical models that utilize language cues and visual features to address this challenge. We hypothesize that the strong language priors in CLIP embeddings can simplify these graphical models, paving the way for a simpler approach. We adopt the UVTransE relation prediction framework, which learns the relation as a translational embedding with subject, object, and union box embeddings from a scene. We systematically explore the design of CLIP-based subject, object, and union-box representations within the UVTransE framework and propose CREPE (CLIP Representation Enhanced Predicate Estimation). CREPE utilizes text-based…
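The translational-embedding idea mentioned in the abstract can be illustrated with a small sketch. It assumes the relation vector takes the form p ≈ u − s − o (union-box embedding minus subject and object embeddings), which is how UVTransE is commonly described; the abstract itself does not spell out the formula. The random vectors stand in for CLIP embeddings, and `predict_predicate`, the predicate bank, and the cosine-similarity scoring are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def predict_predicate(subj, obj, union, predicate_bank):
    """Score predicates against a translational relation vector.

    subj, obj, union: (D,) embeddings of the subject, object, and union box.
    predicate_bank:   (P, D) one embedding per candidate predicate.
    """
    # Assumed translational form: relation ~ union - subject - object.
    rel = union - subj - obj
    rel = rel / np.linalg.norm(rel)
    bank = predicate_bank / np.linalg.norm(predicate_bank, axis=1, keepdims=True)
    # Cosine similarity between the relation vector and each predicate embedding.
    scores = bank @ rel
    return int(np.argmax(scores)), scores

# Stand-in embeddings (hypothetical): random vectors instead of CLIP features.
rng = np.random.default_rng(0)
dim, num_predicates = 16, 4
bank = rng.normal(size=(num_predicates, dim))
s, o = rng.normal(size=dim), rng.normal(size=dim)
# Construct a union embedding that encodes predicate 2 exactly.
u = s + o + bank[2]

best, scores = predict_predicate(s, o, u, bank)
print(best)
```

In this toy setup the union embedding is built so that subtracting the subject and object recovers predicate 2, so the highest-scoring predicate is index 2. With real features the match is only approximate, and the embeddings would be learned.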
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Contrastive Language-Image Pre-training
