Rethinking the Two-Stage Framework for Grounded Situation Recognition
Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Tat-Seng Chua

TL;DR
This paper introduces SituFormer, a two-stage framework for Grounded Situation Recognition that pairs a coarse-to-fine verb model with a transformer-based role detector to improve verb classification and semantic role detection, achieving state-of-the-art results.
Contribution
The paper proposes a new two-stage framework with a coarse-to-fine verb model and a transformer-based role detector, addressing limitations of previous methods.
Findings
Achieves new state-of-the-art performance on the SWiG benchmark.
Significant improvements in various evaluation metrics.
Effective modeling of semantic role dependencies.
Abstract
Grounded Situation Recognition (GSR), i.e., recognizing the salient activity (or verb) category in an image (e.g., buying) and detecting all corresponding semantic roles (e.g., agent and goods), is an essential step towards "human-like" event understanding. Since each verb is associated with a specific set of semantic roles, all existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage. However, there are obvious drawbacks in both stages: 1) The widely-used cross-entropy (XE) loss for object recognition is insufficient in verb classification due to the large intra-class variation and high inter-class similarity among daily activities. 2) All semantic roles are detected in an autoregressive manner, which fails to model the complex semantic relations between different roles. To this end, we propose a…
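The two-stage data flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `VERB_ROLES`, `RolePrediction`, `recognize_situation`, and the stub predictors are hypothetical, and only the verb "buying" with roles "agent" and "goods" comes from the abstract's own example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed box format


@dataclass
class RolePrediction:
    role: str            # semantic role name, e.g. "agent"
    noun: str            # predicted noun filling the role
    box: Optional[Box]   # grounding box, or None if the role is not grounded


# Hypothetical verb-to-roles table; the single entry mirrors the abstract's example.
VERB_ROLES: Dict[str, List[str]] = {
    "buying": ["agent", "goods"],
}


def recognize_situation(
    image: object,
    predict_verb: Callable[[object], str],
    detect_roles: Callable[[object, str, List[str]], List[RolePrediction]],
) -> Tuple[str, List[RolePrediction]]:
    # Stage 1: predict the salient verb for the image.
    verb = predict_verb(image)
    # The verb fixes which semantic roles must be detected.
    roles = VERB_ROLES[verb]
    # Stage 2: detect a noun and a grounding box for each of those roles.
    return verb, detect_roles(image, verb, roles)


# Stub predictors standing in for trained models (assumptions for illustration only).
def stub_verb(image: object) -> str:
    return "buying"


def stub_roles(image: object, verb: str, roles: List[str]) -> List[RolePrediction]:
    return [RolePrediction(role=r, noun="unknown", box=None) for r in roles]


verb, preds = recognize_situation(None, stub_verb, stub_roles)
```

Passing the full role list to `detect_roles` in one call is a design choice of this sketch; it leaves the second stage free to predict roles one at a time or jointly.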
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Human Pose and Action Recognition
Methods: Triplet Loss
