TL;DR
RegFormer introduces a spatially grounded transformer module that enhances weakly-supervised human-object interaction detection by enabling efficient, accurate, and transferable instance-level reasoning from image-level annotations.
Contribution
The paper proposes Relational Grounding Transformer (RegFormer), a module that learns localized interaction cues under image-level supervision, improving the efficiency and accuracy of weakly-supervised HOI detection without extra training.
Findings
Achieves performance comparable to fully supervised models.
Operates with high efficiency due to localized reasoning.
Effectively transfers from image-level to instance-level reasoning.
Abstract
Weakly-supervised Human-Object Interaction (HOI) detection is essential for scalable scene understanding, as it learns interactions from only image-level annotations. Due to the lack of localization signals, prior works typically rely on an external object detector to generate candidate pairs and then infer their interactions through pairwise reasoning. However, this framework often struggles to scale due to the substantial computational cost incurred by enumerating numerous instance pairs. In addition, it suffers from false positives arising from non-interactive combinations, which hinder accurate instance-level HOI reasoning. To address these issues, we introduce Relational Grounding Transformer (RegFormer), a versatile interaction recognition module for efficient and accurate HOI reasoning. Under image-level supervision, RegFormer leverages spatially grounded signals as guidance for…
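The scaling concern raised in the abstract, that enumerating instance pairs from an external detector becomes expensive, can be illustrated with a small sketch. This is not the paper's code: the detection format (dicts with `label` and `box`) and the helper names `make_detections` and `enumerate_pairs` are hypothetical, and the pairing rule (every detected person paired with every other detection) is an assumption about how such pipelines are commonly set up.

```python
from itertools import product


def make_detections(num_humans, num_objects):
    """Build a hypothetical detector output: dicts with a label and a box."""
    humans = [{"label": "person", "box": (0, 0, 10, 10)} for _ in range(num_humans)]
    objects = [{"label": "object", "box": (5, 5, 20, 20)} for _ in range(num_objects)]
    return humans + objects


def enumerate_pairs(detections):
    """Pair every detected person with every other detection (assumed rule)."""
    humans = [d for d in detections if d["label"] == "person"]
    # A person is never paired with itself, but may be paired with another person.
    return [(h, o) for h, o in product(humans, detections) if h is not o]


for num_humans, num_objects in [(3, 5), (6, 10), (12, 20)]:
    pairs = enumerate_pairs(make_detections(num_humans, num_objects))
    print(num_humans + num_objects, "detections ->", len(pairs), "candidate pairs")
# 8 detections -> 21 candidate pairs
# 16 detections -> 90 candidate pairs
# 32 detections -> 372 candidate pairs
```

Under this assumed rule, H humans among N detections yield H × (N − 1) candidate pairs, so doubling both counts roughly quadruples the number of pairs that a pairwise reasoning stage must score, most of which are likely non-interactive.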
