Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection

Yupeng Hu; Changxing Ding; Chang Sun; Shaoli Huang; Xiangmin Xu

arXiv:2507.06510 · cs.CV · July 10, 2025

TL;DR

This paper introduces BC-HOI, a framework for open vocabulary human-object interaction (HOI) detection in which an HOI detector and a large vision-language model (VLM) collaborate bilaterally so that the VLM produces fine-grained, instance-level interaction features instead of holistic ones.

Contribution

The paper proposes BC-HOI, a Bilateral Collaboration framework that combines Attention Bias Guidance (ABG) with LLM-based Supervision Guidance to improve fine-grained interaction detection in open vocabulary settings.

Findings

01. Achieves superior performance on HICO-DET and V-COCO benchmarks.

02. Effectively generates fine-grained interaction features.

03. Outperforms existing methods in open vocabulary detection.

Abstract

Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image, even those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which contradicts the nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based…
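The abstract says that ABG steers the VLM toward instance-level features using an attention bias from the HOI detector, but the excerpt here does not give the exact mechanism. The sketch below illustrates only the general idea of an additive attention bias, under assumptions that are not taken from the paper: the bias is added to the scaled dot-product logits before the softmax, and it is derived from a hypothetical box over a patch grid. The names `biased_attention`, `box_bias`, and `penalty` are illustrative and do not come from the BC-HOI code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits.

    q: (n_q, d), k: (n_k, d), v: (n_k, d_v), bias: (n_q, n_k).
    Assumption: the bias is simply added before the softmax.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) + bias
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def box_bias(grid_h, grid_w, box, penalty=-1e4):
    """Hypothetical bias: 0 inside a box of patches, `penalty` outside.

    box = (row0, col0, row1, col1), half-open, in patch coordinates.
    Returns an array of shape (1, grid_h * grid_w) for a single query.
    """
    r0, c0, r1, c1 = box
    bias = np.full((grid_h, grid_w), penalty, dtype=np.float64)
    bias[r0:r1, c0:c1] = 0.0
    return bias.reshape(1, -1)


# Toy example: one query attending over a 4x4 grid of patch tokens.
rng = np.random.default_rng(0)
grid_h, grid_w, d = 4, 4, 8
k = rng.normal(size=(grid_h * grid_w, d))
v = rng.normal(size=(grid_h * grid_w, d))
q = rng.normal(size=(1, d))

bias = box_bias(grid_h, grid_w, box=(1, 1, 3, 3))
features, weights = biased_attention(q, k, v, bias)

# Share of attention that falls inside the box.
inside_mass = weights.reshape(grid_h, grid_w)[1:3, 1:3].sum()
print(features.shape, round(float(inside_mass), 6))
```

With a large negative penalty, nearly all attention mass lands on the patches inside the box, so the pooled feature describes that region and not the whole image. A softer penalty would let some global context through; which trade-off BC-HOI makes is not stated in this excerpt.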

Code & Models

Repositories

mpi-lab/bc-hoi (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques