Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer

Jiaming Lei; Lin Li; Chunping Wang; Jun Xiao; Long Chen

arXiv:2404.15785·cs.CV·April 25, 2024

TL;DR

This paper introduces LEX, a zero-shot grounded situation recognition method that uses language explainers to improve verb, role, and noun understanding, outperforming existing approaches on the SWiG dataset.

Contribution

The paper proposes a new zero-shot GSR approach with three explainers that enhance verb, role, and noun comprehension, addressing limitations of prior class-based prompt methods.

Findings

01

LEX significantly improves zero-shot GSR accuracy.

02

The method demonstrates strong generalization on the SWiG dataset.

03

Explainability aids in better scene understanding.

Abstract

Benefiting from strong generalization ability, pre-trained vision language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify the salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, accurately localize roles with fixed verb-centric template input, and achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this…
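To make the class-based prompt baseline that the abstract criticizes more concrete, the following is a minimal sketch of zero-shot verb recognition by prompt matching. It is not the LEX method and not the authors' code. The embeddings are small hand-written toy vectors standing in for the outputs of a VLM's image and text encoders, and the prompt wording ("a photo of ..."), the function names, and the temperature value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, prompt_embs, temperature=0.01):
    """Score each class prompt against the image and return (best, probs).

    image_emb:   toy stand-in for an image-encoder output (assumption).
    prompt_embs: dict mapping a class prompt to a toy text embedding.
    temperature: hypothetical scaling factor applied before the softmax.
    """
    prompts = list(prompt_embs)
    logits = [cosine(image_emb, prompt_embs[p]) / temperature for p in prompts]
    # Numerically stable softmax over the prompt scores.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {p: e / total for p, e in zip(prompts, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Hand-written toy vectors; a real system would obtain these from a VLM.
image_emb = [0.9, 0.1, 0.3]
verb_prompt_embs = {
    "a photo of riding": [0.8, 0.2, 0.3],
    "a photo of jumping": [0.1, 0.9, 0.2],
    "a photo of sitting": [0.3, 0.3, 0.9],
}

best_verb, verb_probs = zero_shot_classify(image_emb, verb_prompt_embs)
print(best_verb)
```

Because each class is represented only by a short prompt built from its name, two verbs whose prompts embed close together receive nearly identical scores, which is one way to picture the ambiguity problem the abstract describes.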

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training