TL;DR
ConsNet introduces a graph-based approach that leverages multi-level consistencies among objects, actions, and interactions to improve zero-shot and supervised human-object interaction detection.
Contribution
The paper proposes ConsNet, a framework that encodes the relations among HOI components (objects, actions, and interactions) into a graph and uses Graph Attention Networks (GATs) to propagate knowledge across it, improving detection especially for unseen categories.
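As a rough illustration of what "encoding relations among HOI components into a graph" could look like, the sketch below builds a small undirected graph from (action, object) label pairs. The function name `build_consistency_graph` and the edge rule are assumptions made for illustration: each interaction is linked to its action and its object, and the action and object are linked to each other. This summary does not spell out the paper's exact edge rules.

```python
def build_consistency_graph(hoi_labels):
    """Build an undirected graph over objects, actions, and interactions.

    hoi_labels: list of (action, object) string pairs.
    Returns (nodes, edges) where nodes maps (kind, name) -> integer id
    and edges is a set of (u, v) pairs with u < v.
    NOTE: the edge rule below is an illustrative assumption.
    """
    nodes = {}

    def node_id(kind, name):
        # Reuse the id if the node exists, otherwise assign the next one.
        return nodes.setdefault((kind, name), len(nodes))

    edges = set()
    for action, obj in hoi_labels:
        a = node_id("action", action)
        o = node_id("object", obj)
        i = node_id("interaction", f"{action} {obj}")
        # Link the interaction to its constituents, and the constituents
        # to each other; store each undirected edge once as (min, max).
        for u, v in ((i, a), (i, o), (a, o)):
            edges.add((min(u, v), max(u, v)))
    return nodes, edges


labels = [("ride", "bicycle"), ("ride", "horse"), ("feed", "horse")]
nodes, edges = build_consistency_graph(labels)
print(len(nodes), "nodes,", len(edges), "edges")
```

Because "ride" and "horse" each appear in more than one label, they become shared nodes. That sharing is what would let information flow from a frequent interaction such as "ride horse" to a rare or unseen one that reuses the same action or object.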
Findings
Outperforms state-of-the-art methods on the V-COCO and HICO-DET datasets.
Effective in zero-shot HOI detection scenarios.
Utilizes visual and semantic features for improved recognition.
Abstract
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images. Most existing works treat HOIs as individual interaction categories, and thus cannot handle the problems of long-tail distribution and polysemy of action labels. We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs. Leveraging the compositional and relational peculiarities of HOI labels, we propose ConsNet, a knowledge-aware framework that explicitly encodes the relations among objects, actions and interactions into an undirected graph called the consistency graph, and exploits Graph Attention Networks (GATs) to propagate knowledge among HOI categories as well as their constituents. Our model takes visual features…
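To make the idea of "propagating knowledge" with graph attention more concrete, here is a minimal sketch of one generic single-head GAT layer in plain Python. It follows the standard GAT formulation (project each node, score each neighbor with LeakyReLU, softmax over the neighborhood, take the weighted sum) and is not ConsNet's actual architecture: the function name `gat_layer`, the weights `W` and `a`, and the toy inputs are all hypothetical, and this summary does not state the number of layers, heads, or feature sizes the paper uses.

```python
import math


def gat_layer(h, edges, W, a, slope=0.2):
    """One single-head graph attention layer (standard GAT form).

    h:     list of N node feature vectors, each of length F_in
    edges: iterable of undirected (u, v) index pairs
    W:     F_out x F_in projection matrix (list of rows)
    a:     attention vector of length 2 * F_out
    slope: negative slope of the LeakyReLU applied to raw scores
    """
    n = len(h)
    # Linear projection: z_i = W h_i
    z = [[sum(w * x for w, x in zip(row, hi)) for row in W] for hi in h]

    # Neighborhoods, with a self-loop on every node.
    nbrs = [{i} for i in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    out = []
    for i in range(n):
        js = sorted(nbrs[i])
        # Raw score e_ij = LeakyReLU(a . [z_i || z_j])
        scores = []
        for j in js:
            e = sum(ak * zk for ak, zk in zip(a, z[i] + z[j]))
            scores.append(e if e > 0 else slope * e)
        # Numerically stable softmax over the neighborhood of i.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha = [x / total for x in exps]
        # New feature of i: attention-weighted sum of projected neighbors.
        out.append([sum(al * z[j][k] for al, j in zip(alpha, js))
                    for k in range(len(W))])
    return out


# Toy usage: three nodes with 2-d features, one edge between nodes 0 and 1.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]          # identity projection (hypothetical)
a = [0.5, -0.5, 0.25, 0.25]           # arbitrary attention vector
print(gat_layer(h, {(0, 1)}, W, a))
```

In a setup like the one the abstract describes, the node features would be label embeddings for objects, actions, and interactions, so that a rare interaction node can borrow information from the action and object nodes it is connected to.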
