Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition
Shiyu Xuan, Dongkai Wang, Zechao Li, Jinhui Tang

TL;DR
This paper introduces a decoupled, zero-shot HOI detection framework using MLLMs that formulates interaction recognition as a deterministic visual question answering task, enabling flexible, training-free generalization to unseen interactions.
Contribution
It proposes a detector-agnostic, zero-shot HOI detection method leveraging MLLMs, with a deterministic IR formulation and a spatial-aware pooling module for improved performance.
Findings
Achieves superior zero-shot performance on HICO-DET and V-COCO datasets.
Demonstrates strong cross-dataset generalization.
Allows integration with any object detector without retraining.
Abstract
Zero-shot Human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. While advances in open-vocabulary object detection provide promising solutions for object localization, interaction recognition (IR) remains challenging due to the combinatorial diversity of interactions. Existing methods, including two-stage methods, tightly couple IR with a specific detector and rely on coarse-grained vision-language model (VLM) features, which limit generalization to unseen interactions. In this work, we propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR. We introduce a deterministic generation method that formulates IR as a visual question answering task and enforces deterministic outputs, enabling training-free zero-shot IR. To further enhance…
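The decoupling described in the abstract can be sketched as a pipeline in which any detector supplies boxes and a separate recognizer scores candidate interactions. This is a minimal illustration, not the authors' implementation: `Detection`, `pair_detections`, `detect_hoi`, the `verbs_for` lookup, and the 0.5 threshold are all hypothetical names and values chosen for the example. Keying the candidate list by object category follows the reviewers' description of the method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Detection:
    """Output of any object detector (hypothetical container)."""
    label: str
    box: Box
    score: float


# Hypothetical recognizer interface: given a human, an object, and candidate
# verbs, return one score per candidate. An MLLM would sit behind this call.
Recognizer = Callable[[Detection, Detection, List[str]], Dict[str, float]]


def pair_detections(dets: List[Detection]) -> List[Tuple[Detection, Detection]]:
    """Pair every detected person with every other detection."""
    humans = [d for d in dets if d.label == "person"]
    return [(h, o) for h in humans for o in dets if o is not h]


def detect_hoi(
    dets: List[Detection],
    recognizer: Recognizer,
    verbs_for: Dict[str, List[str]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Tuple[Box, Box, str, str, float]]:
    """Run interaction recognition on detector output without touching the detector."""
    results = []
    for human, obj in pair_detections(dets):
        # Candidate interactions depend only on the object category.
        candidates = verbs_for.get(obj.label, [])
        if not candidates:
            continue
        scores = recognizer(human, obj, candidates)
        for verb, score in scores.items():
            if score >= threshold:
                results.append((human.box, obj.box, verb, obj.label, score))
    return results
```

Because `detect_hoi` only consumes a list of `Detection` records, swapping the detector changes nothing downstream, which is the property the "detector-agnostic" claim refers to.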
Peer Reviews
Decision: ICLR 2026 Poster
1. Clear Motivation. The paper identifies the key limitations of existing HOI approaches, including coupling with a specific detector, poor generalization, and coarse-grained VLM features. In response, the paper proposes a method that explicitly decouples object detection from interaction recognition, supports flexible plug-and-play integration with any detector, and demonstrates strong generalization capability. 2. In contrast to prior work that relies on CLIP embeddings, this paper introduces
1. Limited Novelty. The proposed method appears relatively simple and relies heavily on the capabilities of open-vocabulary detectors and MLLMs, which also means it inherits their weaknesses, such as missed detections, redundant detections, and incorrect category predictions in detectors, and hallucination issues in MLLMs. 2. The method first defines the candidate interaction list based on the categories of detected objects, thereby reformulating the HOI task into a multi-label classificati
- This paper investigates an interesting question: how to use MLLMs to perform the HOI detection task.
- The presentation is clear and easy to follow.
- The proposed One-Pass Deterministic Matching is effective and interesting.
- Lack of Justification for the One-Pass Prompt Design. The proposed one-pass method, which appends the same special token <|hoi|> after each candidate interaction in the prompt (Line 286), requires further theoretical and empirical justification. This design raises a fundamental concern regarding the contextual representation of the <|hoi|> tokens within the Transformer architecture. Specifically, the representation of a particular <|hoi|> token (e.g., the one following candidate T_n) is compu
- The approach yields strong zero-shot results on HICO-DET, including a training-free variant (31.5 mAP), and improves further with LoRA fine-tuning and the proposed modules; it also transfers well to V-COCO (59.91 mAP).
- The paper demonstrates plug-and-play use with multiple detectors (ResNet-50 DETR, Grounding-DINO, YOLO-World) without retraining the IR head. This supports the “detector-agnostic” design goal.
- From both architectural and modular perspectives, the proposed method exhibits limited novelty. Architecturally, it essentially follows the conventional two-stage HOI detection paradigm, where the detection and interaction recognition stages are sequentially processed. The only modification lies in substituting these two components with more powerful counterparts, an open-vocabulary detector (e.g., Grounding-DINO) and an MLLM-based interaction recognizer, without introducing new mechanisms or
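The one-pass prompt design discussed in the reviews, where the same special token `<|hoi|>` is appended after each candidate interaction, can be illustrated with a small sketch. Only the `<|hoi|>` token and the idea of placing it after every candidate come from the review text; the prompt wording, the whitespace tokenizer, and all function names below are assumptions made for illustration and are not the paper's actual prompt or tokenizer.

```python
from typing import Dict, List

HOI_TOKEN = "<|hoi|>"  # special token named in the review


def build_one_pass_prompt(object_name: str, candidates: List[str]) -> str:
    """Append the same marker token after every candidate interaction.

    The question wording is hypothetical; only the marker placement
    follows the description in the review.
    """
    listed = " ".join(f"{c}{HOI_TOKEN}" for c in candidates)
    return (
        f"Which interactions occur between the person and the {object_name}? "
        f"Candidates: {listed}"
    )


def toy_tokenize(prompt: str) -> List[str]:
    """Whitespace tokenizer that keeps the marker as its own token (stand-in only)."""
    return prompt.replace(HOI_TOKEN, f" {HOI_TOKEN} ").split()


def marker_positions(tokens: List[str]) -> List[int]:
    """Indices of every marker token, in prompt order."""
    return [i for i, tok in enumerate(tokens) if tok == HOI_TOKEN]


def candidate_to_position(candidates: List[str], tokens: List[str]) -> Dict[str, int]:
    """Map each candidate to the position of the marker that follows it."""
    positions = marker_positions(tokens)
    if len(positions) != len(candidates):
        raise ValueError("expected exactly one marker per candidate")
    return dict(zip(candidates, positions))
```

Under this sketch, a single forward pass over the prompt would yield one marker position per candidate, and a per-candidate score could then be read from the model state at that position. How the scores are actually computed is not stated in the excerpted reviews, so that step is left out here.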
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
