Rethinking the Two-Stage Framework for Grounded Situation Recognition

Meng Wei; Long Chen; Wei Ji; Xiaoyu Yue; Tat-Seng Chua

arXiv:2112.05375 · cs.CV · December 13, 2021


TL;DR

This paper introduces SituFormer, a two-stage approach to Grounded Situation Recognition (GSR) that pairs a coarse-to-fine verb model with a transformer-based role detector and achieves state-of-the-art results on the SWiG benchmark.

Contribution

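The paper proposes a new two-stage framework with a coarse-to-fine verb model and a transformer-based role detector. It targets two limitations of previous methods: verb classification that relies on cross-entropy loss alone, and autoregressive role detection that fails to model the relations between roles.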

Findings

01. Achieves new state-of-the-art performance on the SWiG benchmark.

02. Delivers significant improvements across various evaluation metrics.

03. Models semantic role dependencies effectively.

Abstract

Grounded Situation Recognition (GSR), i.e., recognizing the salient activity (or verb) category in an image (e.g., buying) and detecting all corresponding semantic roles (e.g., agent and goods), is an essential step towards "human-like" event understanding. Since each verb is associated with a specific set of semantic roles, all existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage. However, there are obvious drawbacks in both stages: 1) The widely-used cross-entropy (XE) loss for object recognition is insufficient in verb classification due to the large intra-class variation and high inter-class similarity among daily activities. 2) All semantic roles are detected in an autoregressive manner, which fails to model the complex semantic relations between different roles. To this end, we propose a…
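The two-stage control flow described in the abstract (predict the verb, then detect the roles tied to that verb) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `VERB_ROLES` table, the function names, the stub predictors, and the noun-plus-box output shape are all assumptions made for the example. Only the verb "buying" and its roles "agent" and "goods" come from the abstract. The stub detector handles each role independently, so it does not model any dependencies between roles.

```python
from typing import Callable, Dict, List

# Hypothetical verb-to-roles table. Only this one entry is taken from the
# abstract's example; a real dataset would define many more verbs.
VERB_ROLES: Dict[str, List[str]] = {"buying": ["agent", "goods"]}


def recognize_situation(
    image: object,
    predict_verb: Callable[[object], str],
    detect_role: Callable[[object, str, str], dict],
) -> dict:
    """Stage 1 picks the verb; stage 2 fills in that verb's roles."""
    verb = predict_verb(image)          # stage 1: verb classification
    roles = VERB_ROLES[verb]            # the verb fixes the role set
    return {
        "verb": verb,
        # stage 2: one detection per role (noun and box are assumed outputs)
        "roles": {role: detect_role(image, verb, role) for role in roles},
    }


# Stub predictors standing in for trained models (assumptions for the demo).
def stub_verb(image: object) -> str:
    return "buying"


def stub_role(image: object, verb: str, role: str) -> dict:
    return {"noun": f"<{role}-noun>", "box": (0, 0, 10, 10)}


result = recognize_situation(None, stub_verb, stub_role)
print(result["verb"], sorted(result["roles"]))
```

Passing the predictors in as callables keeps the sketch independent of any particular model: either stage can be swapped without touching the control flow.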


Code & Models

Repositories

kellyiss/situformer
PyTorch · Official

Videos

Rethinking the Two-Stage Framework for Grounded Situation Recognition · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Human Pose and Action Recognition

Methods: Triplet Loss
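For readers unfamiliar with the method tag above, the standard triplet margin loss can be sketched in a few lines. This is the generic textbook formulation, max(0, d(a, p) - d(a, n) + margin), written in plain Python; how the paper actually applies it is not stated on this page, so the function names, the Euclidean distance, and the margin value are assumptions for illustration only.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, chosen only for the example
) -> float:
    """Zero once the negative is at least `margin` farther than the positive."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# The negative is far from the anchor, so the constraint is already satisfied.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# The negative is as close as the positive, so the full margin is penalized.
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [1.0, 0.0])
print(easy, hard)
```

In an embedding setting, the anchor and positive would share a class and the negative would come from a different one, which pulls same-class samples together and pushes similar-looking classes apart.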