TL;DR
GSRFormer is a two-stage transformer-based framework for Grounded Situation Recognition (GSR) that models bidirectional relations between verbs and semantic roles, improving accuracy over existing methods on the SWiG benchmark.
Contribution
The paper proposes a framework that postpones verb detection, learns intermediate role representations, and exploits the semantic relations between verbs and roles in both directions, outperforming prior approaches.
Findings
Outperforms state-of-the-art methods on SWiG benchmarks
Effectively models semantic relations between verbs and roles
Utilizes support images for improved learning
Abstract
Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for "human-like" event understanding. Specifically, the GSR task not only detects the salient activity verb (e.g., buying), but also predicts all corresponding semantic roles (e.g., agent and goods). Inspired by object detection and image captioning tasks, existing methods typically employ a two-stage framework: 1) detect the activity verb, and then 2) predict semantic roles based on the detected verb. Obviously, this illogical framework constitutes a huge obstacle to semantic understanding. First, pre-detecting verbs solely without semantic roles inevitably fails to distinguish many similar daily activities (e.g., offering and giving, buying and selling). Second, predicting semantic roles in a closed auto-regressive manner can hardly exploit the semantic relations among the verb and roles. To this…
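To make the "structured semantic summary" described in the abstract concrete, here is a minimal Python sketch of what a single GSR output might look like: one verb plus a set of semantic roles, each filled by a noun and, where the entity is visible, a bounding box. The class names (`Situation`, `RoleFiller`), the box convention, and all values below are hypothetical and chosen for illustration only; they are not taken from the paper or from the SWiG annotation format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class RoleFiller:
    """Hypothetical container for the entity that fills one semantic role."""
    noun: str                  # e.g. "woman"
    box: Optional[Box] = None  # None when the entity has no grounding


@dataclass
class Situation:
    """Hypothetical container for one structured summary of an image."""
    verb: str
    roles: Dict[str, RoleFiller] = field(default_factory=dict)

    def summary(self) -> str:
        # Render the frame as verb(role=noun, ...), in insertion order.
        parts = [f"{role}={filler.noun}" for role, filler in self.roles.items()]
        return f"{self.verb}(" + ", ".join(parts) + ")"


# Invented example values, not real annotations.
buying = Situation(
    verb="buying",
    roles={
        "agent": RoleFiller("woman", (34.0, 20.0, 180.0, 310.0)),
        "goods": RoleFiller("fruit", (150.0, 200.0, 240.0, 280.0)),
    },
)

print(buying.summary())
```

In a representation like this, two similar activities such as buying and selling could involve the same nouns and boxes and differ only in which role each entity fills, which is one way to read the abstract's point that verbs are hard to tell apart without considering their roles.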
