ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in Situation Recognition
Debaditya Roy, Dhruv Verma, Basura Fernando

TL;DR
This paper introduces ClipSitu, a model that combines CLIP's visual-linguistic knowledge with deeper-and-wider MLP blocks and a cross-attention Transformer to improve situation recognition accuracy in images.
Contribution
It proposes combining CLIP image and text features with deeper-and-wider MLP blocks and a cross-attention Transformer for better situation recognition.
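To make the cross-attention idea concrete, here is a minimal single-head sketch in NumPy. It is not the authors' implementation: the function names, the dimensions, and the random arrays standing in for CLIP patch tokens and role text embeddings are all assumptions made for illustration. It only shows the general mechanism, in which each textual role acts as a query that attends over the visual tokens of an image.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(role_queries, patch_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: roles (queries) attend to image patches.

    role_queries: (num_roles, dim)   hypothetical role text embeddings
    patch_tokens: (num_patches, dim) hypothetical visual tokens
    """
    q = role_queries @ w_q
    k = patch_tokens @ w_k
    v = patch_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (num_roles, num_patches)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

# Random stand-ins for real CLIP features (assumption, not real data).
rng = np.random.default_rng(0)
dim, num_roles, num_patches = 64, 3, 50
role_queries = rng.normal(size=(num_roles, dim))
patch_tokens = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

attended, weights = cross_attention(role_queries, patch_tokens, w_q, w_k, w_v)
```

Each row of `weights` says how strongly one role looks at each image patch, and `attended` holds one role-conditioned visual feature per role, which a classifier could then map to a noun.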
Findings
Outperforms state-of-the-art models by 14.1% in semantic role labeling accuracy.
Achieves superior situation localization performance.
Demonstrates the effectiveness of CLIP-based features in structured image understanding.
Abstract
Situation Recognition is the task of generating a structured summary of what is happening in an image using an activity verb and the semantic roles played by actors and objects. In this task, the same activity verb can describe a diverse set of situations, and the same actor or object category can play a diverse set of semantic roles depending on the situation depicted in the image. Hence, a situation recognition model needs to understand the context of the image and the visual-linguistic meaning of semantic roles. We therefore leverage the CLIP foundation model, which has learned the context of images via language descriptions. We show that deeper-and-wider multi-layer perceptron (MLP) blocks obtain noteworthy results for the situation recognition task by using CLIP image and text embedding features, and that they even outperform the state-of-the-art CoFormer, a Transformer-based model,…
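The MLP approach described in the abstract can be illustrated with a small NumPy sketch. Everything specific here is an assumption rather than the paper's design: the block layout (linear layer, layer normalization, ReLU, residual connection), the way the image, verb, and role embeddings are concatenated, the sizes, and the random arrays that stand in for real CLIP embeddings. The sketch only shows how a stack of MLP blocks could map CLIP-style features to scores over candidate nouns for one semantic role.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp_block(x, w, b):
    # Assumed block: linear -> layer norm -> ReLU, with a residual connection.
    h = layer_norm(x @ w + b)
    return x + np.maximum(h, 0.0)

def init_params(rng, emb_dim, hidden_dim, num_blocks, num_nouns):
    # Hypothetical parameter layout with small random weights.
    def dense(n_in, n_out):
        return rng.normal(scale=0.02, size=(n_in, n_out)), np.zeros(n_out)
    w_in, b_in = dense(3 * emb_dim, hidden_dim)
    w_out, b_out = dense(hidden_dim, num_nouns)
    blocks = [dense(hidden_dim, hidden_dim) for _ in range(num_blocks)]
    return {"w_in": w_in, "b_in": b_in, "blocks": blocks,
            "w_out": w_out, "b_out": b_out}

def predict_noun_logits(image_emb, verb_emb, role_emb, params):
    # Concatenate image, verb, and role embeddings, then apply the MLP stack.
    x = np.concatenate([image_emb, verb_emb, role_emb], axis=-1)
    x = x @ params["w_in"] + params["b_in"]
    for w, b in params["blocks"]:
        x = mlp_block(x, w, b)
    return x @ params["w_out"] + params["b_out"]

# Random stand-ins for CLIP image and text embeddings (assumption).
rng = np.random.default_rng(0)
batch, emb_dim, hidden_dim, num_nouns = 4, 32, 128, 10
image_emb = rng.normal(size=(batch, emb_dim))
verb_emb = rng.normal(size=(batch, emb_dim))
role_emb = rng.normal(size=(batch, emb_dim))

params = init_params(rng, emb_dim, hidden_dim, num_blocks=3, num_nouns=num_nouns)
logits = predict_noun_logits(image_emb, verb_emb, role_emb, params)
predicted_nouns = logits.argmax(axis=-1)
```

Making the stack "deeper" corresponds to raising `num_blocks` and "wider" to raising `hidden_dim`; in practice the weights would be trained and the inputs would come from a CLIP encoder rather than a random generator.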
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Absolute Position Encodings · Adam · Byte Pair Encoding · Linear Layer · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection
