CREPE: Learnable Prompting With CLIP Improves Visual Relationship Prediction

Rakshith Subramanyam; T. S. Jayram; Rushil Anirudh; Jayaraman J. Thiagarajan

arXiv:2307.04838 · cs.CV · July 20, 2023

TL;DR

This paper introduces CREPE, a novel approach leveraging CLIP's language priors and contrastive training within the UVTransE framework to significantly improve visual relationship prediction accuracy on the Visual Genome benchmark.

Contribution

CREPE is the first method to systematically incorporate CLIP representations into relation prediction, achieving state-of-the-art results with a simpler, contrastive training strategy.

Findings

01 Achieves mR@5 of 27.79 and mR@20 of 31.95 on Visual Genome.

02 Outperforms recent state-of-the-art by 15.3% at mR@20.

03 Demonstrates CLIP's effectiveness in object relation prediction.
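
The findings above are reported as mR@K (mean recall at K). As a rough illustration of what that metric measures, the sketch below computes recall at K separately for each predicate class and then averages over classes, so rare predicates count as much as frequent ones. The function name, the toy scores, and the per-sample top-K protocol are assumptions made for illustration; the paper's exact evaluation protocol may differ.

```python
from collections import defaultdict

def mean_recall_at_k(scores, labels, k):
    """Average, over predicate classes, of the fraction of samples whose
    true predicate appears among the k highest-scoring predictions.

    scores: list of per-sample score lists (one score per predicate class)
    labels: list of true predicate indices, one per sample
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row, true_label in zip(scores, labels):
        # Indices of the k predicates with the highest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        totals[true_label] += 1
        if true_label in top_k:
            hits[true_label] += 1
    # Recall is computed per class first, then averaged across classes.
    per_class = [hits[c] / totals[c] for c in totals]
    return sum(per_class) / len(per_class)

# Toy data (made up): three predicate classes, four samples.
scores = [
    [0.7, 0.2, 0.1],  # true class 0, ranked first
    [0.6, 0.3, 0.1],  # true class 0, ranked first
    [0.5, 0.1, 0.4],  # true class 1, ranked last
    [0.2, 0.1, 0.7],  # true class 2, ranked first
]
labels = [0, 0, 1, 2]

mr_at_1 = mean_recall_at_k(scores, labels, k=1)
print(f"mR@1 = {mr_at_1:.3f}")
```

On this toy data, plain recall at 1 would be 3 of 4 samples, while mR@1 is 2/3 because the single sample of class 1 is missed and that class carries the same weight as the others.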

Abstract

In this paper, we explore the potential of Vision-Language Models (VLMs), specifically CLIP, in predicting visual object relationships, which involves interpreting visual features from images into language-based relations. Current state-of-the-art methods use complex graphical models that utilize language cues and visual features to address this challenge. We hypothesize that the strong language priors in CLIP embeddings can simplify these graphical models, paving the way for a simpler approach. We adopt the UVTransE relation prediction framework, which learns the relation as a translational embedding with subject, object, and union box embeddings from a scene. We systematically explore the design of CLIP-based subject, object, and union-box representations within the UVTransE framework and propose CREPE (CLIP Representation Enhanced Predicate Estimation). CREPE utilizes text-based…
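
The translational-embedding idea mentioned in the abstract can be sketched in a few lines. The example assumes the usual UVTransE formulation, in which the predicate embedding is the residual of the union-box embedding after subtracting the subject and object embeddings (p ≈ u - s - o). The vectors, the predicate prototypes, and the nearest-prototype classifier are hypothetical stand-ins: in CREPE the embeddings come from CLIP and the predicate classifier is learned, neither of which is reproduced here.

```python
# Toy sketch of a UVTransE-style translational embedding: p ≈ u - s - o.
# All vectors and predicate prototypes below are made-up illustrative values,
# not CLIP features and not the authors' trained model.

def predicate_embedding(subject, obj, union):
    """Return the element-wise residual union - subject - object."""
    return [u - s - o for s, o, u in zip(subject, obj, union)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (dot(a, a) ** 0.5 * dot(b, b) ** 0.5)

def classify(pred_vec, prototypes):
    """Pick the predicate whose (hypothetical) prototype is most similar."""
    return max(prototypes, key=lambda name: cosine(pred_vec, prototypes[name]))

# Stand-ins for the subject box, object box, and union box embeddings.
subject = [1.0, 0.0, 0.0]
obj = [0.0, 1.0, 0.0]
union = [1.0, 1.0, 1.0]

# Hypothetical predicate prototypes in the same embedding space.
prototypes = {"riding": [0.0, 0.0, 1.0], "next to": [1.0, 1.0, 0.0]}

p = predicate_embedding(subject, obj, union)
label = classify(p, prototypes)
print(p, label)
```

The subtraction removes what the subject and object embeddings already explain, so the residual is meant to carry the interaction between them; here it lands on the "riding" prototype only because the toy vectors were chosen that way.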

Code & Models

Repositories

llnl/crepe (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: Contrastive Language-Image Pre-training