Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection
Yupeng Hu, Changxing Ding, Chang Sun, Shaoli Huang, Xiangmin Xu

TL;DR
This paper introduces BC-HOI, a framework for open vocabulary human-object interaction detection in which the HOI detector and a large vision-language model collaborate so that the VLM produces fine-grained, instance-level interaction features rather than holistic ones.
Contribution
The paper proposes a Bilateral Collaboration framework with Attention Bias Guidance and LLM-based Supervision Guidance to improve fine-grained interaction detection in open vocabulary settings.
Findings
BC-HOI reports superior performance on the HICO-DET and V-COCO benchmarks.
The VLM is guided to produce fine-grained, instance-level interaction features.
The method outperforms existing approaches to open vocabulary HOI detection.
Abstract
Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image, even those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which contradicts the nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based…
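The abstract describes Attention Bias Guidance only at a high level, so the following is a minimal illustrative sketch of the general idea of biasing attention toward a detected region, not the authors' implementation. It assumes the bias is additive on pre-softmax attention scores and is derived from a detector bounding box; the function names `box_to_patch_bias` and `biased_attention`, the `outside_penalty` value, and the 4x4 patch grid are all hypothetical choices made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def box_to_patch_bias(box, grid, image_size, outside_penalty=-4.0):
    """Hypothetical helper: per-patch additive bias from a detector box.

    Patches overlapping the box get 0.0; all others get outside_penalty.
    Patches are listed in row-major order over a grid x grid layout.
    """
    x0, y0, x1, y1 = box
    cell = image_size / grid
    bias = []
    for row in range(grid):
        for col in range(grid):
            px0, py0 = col * cell, row * cell
            px1, py1 = px0 + cell, py0 + cell
            overlaps = px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0
            bias.append(0.0 if overlaps else outside_penalty)
    return bias

def biased_attention(query, keys, values, bias):
    """Single-query scaled dot-product attention with an additive bias."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale + b
        for key, b in zip(keys, bias)
    ]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights

# Toy setup (assumed values): 4x4 patch grid on a 224-pixel image,
# identical keys so that unbiased attention is uniform over patches.
grid, image_size = 4, 224
keys = [[1.0, 0.0] for _ in range(grid * grid)]
values = [[float(i), 1.0] for i in range(grid * grid)]
query = [1.0, 0.0]
box = (0, 0, 112, 112)  # hypothetical human-object union box, top-left quadrant

bias = box_to_patch_bias(box, grid, image_size)
inside = [i for i, b in enumerate(bias) if b == 0.0]

_, plain_weights = biased_attention(query, keys, values, [0.0] * len(keys))
_, biased_weights = biased_attention(query, keys, values, bias)

inside_mass_plain = sum(plain_weights[i] for i in inside)
inside_mass_biased = sum(biased_weights[i] for i in inside)
print(inside_mass_plain, inside_mass_biased)
```

With a negative penalty outside the box, the attention mass on patches inside the box always rises relative to the unbiased case, which is the sense in which a detector-supplied bias can steer a holistic encoder toward an instance-level region. A soft penalty is used here instead of a hard mask so that some surrounding context can still contribute; whether BC-HOI makes the same choice is not stated in the text above.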
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
