Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer
Jiaming Lei, Lin Li, Chunping Wang, Jun Xiao, Long Chen

TL;DR
This paper introduces LEX, a zero-shot grounded situation recognition (GSR) method that uses language explainers to improve verb, role, and noun understanding, and reports better results than existing approaches on the SWiG dataset.
Contribution
The paper proposes a new zero-shot GSR approach with three explainers that enhance verb, role, and noun comprehension, addressing limitations of prior class-based prompt methods.
Findings
LEX significantly improves zero-shot GSR accuracy.
The method demonstrates strong generalization on the SWiG dataset.
Explainability aids in better scene understanding.
Abstract
Benefiting from strong generalization ability, pre-trained vision language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify the salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, accurately localize roles with fixed verb-centric template input, and achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this…
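The class-based prompt baseline that the abstract criticizes can be sketched roughly as follows. This is not the LEX method, only an illustration of scoring an image against one prompt per verb class by cosine similarity. The three-dimensional vectors are hand-made toy values standing in for CLIP image and text embeddings, and the function name `zero_shot_classify`, the example verbs, and the temperature value are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, class_embs, temperature=0.01):
    """Pick the class whose prompt embedding is most similar to the image.

    class_embs maps a class name to the embedding of its class-based prompt
    (hypothetically something like "a photo of <verb>"). Returns the best
    class and a softmax distribution over all classes.
    """
    names = list(class_embs)
    logits = [cosine(image_emb, class_embs[n]) / temperature for n in names]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = {n: e / total for n, e in zip(names, exps)}
    return max(probs, key=probs.get), probs

# Toy stand-ins for text embeddings of verb prompts (assumed values).
verb_prompts = {
    "riding": [0.9, 0.1, 0.0],
    "walking": [0.2, 0.9, 0.1],
    "feeding": [0.0, 0.2, 0.9],
}
# Toy stand-in for the image embedding (assumed values).
image_emb = [0.8, 0.3, 0.1]

verb, verb_probs = zero_shot_classify(image_emb, verb_prompts)
print(verb)
```

Because each class is represented only by its name in a fixed template, verbs whose prompts embed close together receive nearly equal scores, which is the ambiguity problem the abstract attributes to class-based prompts.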
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
