ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in Situation Recognition

Debaditya Roy; Dhruv Verma; Basura Fernando

arXiv:2307.00586 · cs.CV · September 12, 2023 · 1 cites

TL;DR

This paper introduces ClipSitu, a model leveraging CLIP's visual-linguistic knowledge with enhanced MLP and cross-attention Transformer architectures to significantly improve situation recognition accuracy in images.

Contribution

It proposes a novel approach combining CLIP features with advanced MLP and cross-attention Transformer models for better situation recognition.

Findings

1. Outperforms state-of-the-art models by 14.1% in semantic role labeling accuracy.

2. Achieves superior situation localization performance.

3. Demonstrates the effectiveness of CLIP-based features in structured image understanding.

Abstract

Situation Recognition is the task of generating a structured summary of what is happening in an image using an activity verb and the semantic roles played by actors and objects. In this task, the same activity verb can describe a diverse set of situations, and the same actor or object category can play a diverse set of semantic roles depending on the situation depicted in the image. Hence, a situation recognition model needs to understand the context of the image and the visual-linguistic meaning of semantic roles. Therefore, we leverage the CLIP foundational model, which has learned the context of images via language descriptions. We show that deeper-and-wider multi-layer perceptron (MLP) blocks obtain noteworthy results for the situation recognition task by using CLIP image and text embedding features, and even outperform the state-of-the-art CoFormer, a Transformer-based model,…
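To make the idea of conditional prediction from CLIP features concrete, here is a minimal sketch of a residual MLP that scores candidate nouns for one semantic role, given an image embedding, a verb embedding, and a role embedding. It is not the authors' implementation: the embedding width, hidden width, depth, noun vocabulary size, the concatenation of the three inputs, and the function names are all assumptions made for illustration, and random vectors stand in for real CLIP embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical, chosen only for illustration.
EMBED_DIM = 512    # assumed width of a CLIP image/text embedding
HIDDEN_DIM = 1024  # assumed hidden width of the MLP
NUM_BLOCKS = 3     # assumed number of residual MLP blocks
NUM_NOUNS = 100    # assumed size of the noun vocabulary


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def init_params():
    """Randomly initialize the weights of the sketch model (biases omitted)."""
    scale = 0.02
    return {
        "w_in": rng.normal(0.0, scale, (3 * EMBED_DIM, HIDDEN_DIM)),
        "blocks": [
            rng.normal(0.0, scale, (HIDDEN_DIM, HIDDEN_DIM))
            for _ in range(NUM_BLOCKS)
        ],
        "w_out": rng.normal(0.0, scale, (HIDDEN_DIM, NUM_NOUNS)),
    }


def predict_noun_logits(params, image_emb, verb_emb, role_emb):
    """Score every noun for a role, conditioned on the image and the verb."""
    # Condition the prediction by concatenating the three embeddings.
    x = np.concatenate([image_emb, verb_emb, role_emb], axis=-1)
    h = np.maximum(x @ params["w_in"], 0.0)  # input projection + ReLU
    for w in params["blocks"]:
        # Residual block: normalize, project, ReLU, add back the input.
        h = h + np.maximum(layer_norm(h) @ w, 0.0)
    return h @ params["w_out"]


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Random stand-ins for CLIP embeddings of 2 images, their verbs, and one role each.
batch = 2
image_emb = rng.normal(size=(batch, EMBED_DIM))
verb_emb = rng.normal(size=(batch, EMBED_DIM))
role_emb = rng.normal(size=(batch, EMBED_DIM))

params = init_params()
logits = predict_noun_logits(params, image_emb, verb_emb, role_emb)
probs = softmax(logits)
predicted_noun_ids = probs.argmax(axis=-1)
```

In a real pipeline the three inputs would come from a CLIP image encoder and text encoder rather than a random generator, and the weights would be trained with a cross-entropy loss over annotated nouns. Running the model once per role of the predicted verb yields the structured summary the abstract describes.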


Code & Models

Repositories

LUNAProject22/CLIPSitu (PyTorch, official)

Videos

ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in Situation Recognition · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Absolute Position Encodings · Adam · Byte Pair Encoding · Linear Layer · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection