Funnel-HOI: Top-Down Perception for Zero-Shot HOI Detection
Sandipan Sarma, Agney Talwarr, Arijit Sur

TL;DR
Funnel-HOI introduces a top-down, encoder-focused approach with a novel co-attention mechanism for zero-shot human-object interaction detection, achieving state-of-the-art results on benchmark datasets.
Contribution
The paper proposes a top-down framework with an asymmetric co-attention mechanism and a novel loss for stronger scene understanding in human-object interaction detection (HOID), especially in zero-shot scenarios.
Findings
Achieves improvements of up to 12.4% on unseen and 8.4% on rare HOI categories.
Outperforms existing methods on the HICO-DET and V-COCO datasets.
Is effective in both fully-supervised and zero-shot settings.
Abstract
Human-object interaction detection (HOID) refers to localizing interactive human-object pairs in images and identifying the interactions. Since there could be an exponential number of object-action combinations, labeled data is limited, leading to a long-tail distribution problem. Recently, zero-shot learning emerged as a solution, with end-to-end transformer-based object detectors adapted for HOID becoming successful frameworks. However, their primary focus is designing improved decoders for learning entangled or disentangled interpretations of interactions. We advocate that HOI-specific cues must be anticipated at the encoder stage itself to obtain a stronger scene interpretation. Consequently, we build a top-down framework named Funnel-HOI inspired by the human tendency to grasp well-defined concepts first and then associate them with abstract concepts during scene understanding. We…
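The label-scarcity argument in the abstract can be made concrete with a small sketch. It is an illustration only, not code from the paper: the vocabulary sizes are the commonly reported HICO-DET figures (117 verbs, 80 object classes, 600 annotated HOI categories), and the toy verbs, objects, and helper names (`label_space`, `split_seen_unseen`) are hypothetical.

```python
from itertools import product

# Commonly reported HICO-DET sizes (assumed here for illustration).
NUM_VERBS = 117
NUM_OBJECTS = 80
NUM_ANNOTATED = 600

# Every verb-object pairing is a candidate HOI category.
possible = NUM_VERBS * NUM_OBJECTS
coverage = NUM_ANNOTATED / possible  # fraction of pairings that have labels


def label_space(verbs, objects):
    """Enumerate all candidate (verb, object) HOI categories."""
    return list(product(verbs, objects))


def split_seen_unseen(annotated, held_out):
    """Partition annotated HOI categories into seen (used for training)
    and unseen (withheld, evaluated zero-shot)."""
    held = set(held_out)
    seen = [pair for pair in annotated if pair not in held]
    unseen = [pair for pair in annotated if pair in held]
    return seen, unseen


# Toy vocabulary (hypothetical): 3 verbs x 3 objects = 9 candidate categories.
verbs = ["ride", "hold", "feed"]
objects = ["horse", "bicycle", "cup"]
annotated = [
    ("ride", "horse"),
    ("ride", "bicycle"),
    ("hold", "cup"),
    ("feed", "horse"),
    ("hold", "bicycle"),
]

# Withhold one combination whose verb and object each still appear in training.
seen, unseen = split_seen_unseen(annotated, [("ride", "bicycle")])

if __name__ == "__main__":
    print(f"candidate categories: {possible}, labeled fraction: {coverage:.3f}")
    print("seen:", seen)
    print("unseen:", unseen)
```

In this toy split the withheld pair ("ride", "bicycle") never appears in training, although "ride" and "bicycle" each appear in other pairs. A zero-shot detector is expected to recognize such a combination anyway.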
Taxonomy
Topics: Radiation Detection and Scintillator Technologies · Infrared Target Detection Methodologies
