TL;DR
This paper introduces a novel model for compositional action recognition that explicitly reasons about object-agent spatial-temporal interactions, enabling better generalization to unseen object-action combinations.
Contribution
The paper proposes a new model that explicitly captures geometric relations in object-agent interactions and introduces a compositional recognition task with non-overlapping training and test verb-noun pairs.
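To make the non-overlapping constraint concrete, here is a minimal sketch of one way to build such a split. It is an illustration under assumptions, not the paper's published procedure: the function name `compositional_split`, the verb group `verbs_a`, the noun group `nouns_1`, and the toy labels are all hypothetical. Verbs and nouns are each divided into two groups, and a sample goes to training when its verb and noun fall in matching groups and to testing otherwise, so no verb-noun pair can appear on both sides.

```python
def compositional_split(samples, verbs_a, nouns_1):
    """Split (verb, noun) samples so train and test share no verb-noun pair.

    Hypothetical helper: verbs in `verbs_a` form group A (all others group B),
    nouns in `nouns_1` form group 1 (all others group 2).
    Train receives A+1 and B+2 pairs; test receives A+2 and B+1 pairs.
    """
    train, test = [], []
    for verb, noun in samples:
        in_a = verb in verbs_a
        in_1 = noun in nouns_1
        if in_a == in_1:
            train.append((verb, noun))   # matching groups: A+1 or B+2
        else:
            test.append((verb, noun))    # crossed groups: A+2 or B+1
    return train, test


# Toy labels, invented for illustration only.
samples = [
    ("push", "cup"),
    ("push", "book"),
    ("lift", "cup"),
    ("lift", "book"),
    ("push", "cup"),
]
train, test = compositional_split(samples, verbs_a={"push"}, nouns_1={"cup"})

# No verb-noun combination is shared between the two sets.
assert set(train).isdisjoint(set(test))
```

In this toy case every verb and every noun in the test set still occurs somewhere in training, just never in the same combination, which is what makes the task a test of compositional generalization rather than of recognizing entirely new verbs or objects.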
Findings
The model is effective on the proposed compositional action recognition task
It improves generalization in few-shot settings
Training uses dense object box annotations collected on the Something-Something dataset
Abstract
Human action is naturally compositional: humans can easily recognize and perform actions with objects that are different from those used in training demonstrations. In this paper, we study the compositionality of action by looking into the dynamics of subject-object interactions. We propose a novel model which can explicitly reason about the geometric relations between constituent objects and an agent performing an action. To train our model, we collect dense object box annotations on the Something-Something dataset. We propose a novel compositional action recognition task where the training combinations of verbs and nouns do not overlap with the test set. The novel aspects of our model are applicable to activities with prominent object interaction dynamics and to objects which can be tracked using state-of-the-art approaches; for activities without clearly defined spatial object-agent…
Videos
Something-Else: Compositional Action Recognition With Spatial-Temporal Interaction Networks (YouTube)
