TL;DR
LIGHTEN is a hierarchical, graph-based neural network for human-object interaction (HOI) detection in video. It captures spatio-temporal cues at multiple granularities from RGB input alone, without depth maps or 3D pose data, and reports state-of-the-art results.
Contribution
The paper proposes LIGHTEN, a hierarchical approach that learns visual features for HOI detection from RGB data only, avoiding reliance on ground-truth depth maps or 3D human pose and on hand-crafted spatial features.
Findings
Achieves 88.9% and 92.6% accuracy on HOI detection tasks
Outperforms existing methods on CAD-120 and V-COCO datasets
Sets a new benchmark for visual feature-based HOI detection
Abstract
Analyzing the interactions between humans and objects from a video includes identification of the relationships between humans and the objects present in the video. It can be thought of as a specialized version of Visual Relationship Detection, wherein one of the objects must be a human. While traditional methods formulate the problem as inference on a sequence of video segments, we present a hierarchical approach, LIGHTEN, to learn visual features to effectively capture spatio-temporal cues at multiple granularities in a video. Unlike current approaches, LIGHTEN avoids using ground truth data like depth maps or 3D human pose, thus increasing generalization across non-RGBD datasets as well. Furthermore, we achieve the same using only the visual features, instead of the commonly used hand-crafted spatial features. We achieve state-of-the-art results in human-object interaction detection…
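The hierarchy described in the abstract can be pictured with a small sketch. This is an illustrative toy under stated assumptions, not the authors' implementation: the function names (`spatial_step`, `segment_features`, `video_features`), the fully connected human-object graph, and the fixed averaging at every level are all hypothetical stand-ins for what would be learned layers in a real model. It only shows how features might flow from frame level to segment level to video level.

```python
# Toy sketch of multi-granularity feature aggregation for HOI in video.
# ASSUMPTION: every name and every pooling rule here is hypothetical and
# chosen for illustration; a real model would use learned layers instead.

def mean_vec(vectors):
    """Element-wise mean of equal-length feature vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def spatial_step(frame):
    """Frame level: one round of message passing on a fully connected graph.

    `frame` is a list of node features (node 0 = human, the rest = objects).
    Each node is averaged with the mean feature of all other nodes.
    """
    updated = []
    for i, node in enumerate(frame):
        others = [v for j, v in enumerate(frame) if j != i]
        message = mean_vec(others) if others else node
        updated.append([(a + b) / 2 for a, b in zip(node, message)])
    return updated

def segment_features(segment):
    """Segment level: pool each node's refined features over the frames."""
    refined = [spatial_step(frame) for frame in segment]
    n_nodes = len(refined[0])
    return [mean_vec([frame[k] for frame in refined]) for k in range(n_nodes)]

def video_features(segments):
    """Video level: concatenate each segment's node features with context
    pooled over all segments of the video."""
    per_segment = [segment_features(s) for s in segments]
    n_nodes = len(per_segment[0])
    context = [mean_vec([seg[k] for seg in per_segment]) for k in range(n_nodes)]
    return [[own + ctx for own, ctx in zip(seg, context)] for seg in per_segment]

# Two segments, two frames each, one human node and one object node per frame.
video = [
    [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]],
    [[[2.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 2.0]]],
]
features = video_features(video)
```

Each entry of `features` holds, per segment and per node, a vector twice the input size: the segment's own pooled feature followed by the video-wide context. A per-segment classifier over the human and object vectors would then predict the interaction label.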
