QueryCraft: Transformer-Guided Query Initialization for Enhanced Human-Object Interaction Detection

Yuxiao Wang; Wolin Liang; Yu Lei; Weiying Xue; Nan Zhuang; Qi Liu

arXiv:2508.08590·cs.CV·August 13, 2025



TL;DR

QueryCraft introduces a transformer-guided query initialization framework for HOI detection, leveraging semantic priors and cross-modal attention to improve detection accuracy and interpretability.

Contribution

It proposes a novel transformer-based query initialization method incorporating semantic priors and language-guided attention for enhanced HOI detection.

Findings

01. Achieves state-of-the-art results on the HICO-DET and V-COCO benchmarks.

02. Demonstrates improved detection accuracy and generalization.

03. Provides more interpretable queries for human-object interaction detection.

Abstract

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions in images. Although DETR-based methods have recently emerged as the mainstream framework for HOI detection, they still suffer from a key limitation: randomly initialized queries lack explicit semantics, leading to suboptimal detection performance. To address this challenge, we propose QueryCraft, a novel plug-and-play HOI detection framework that incorporates semantic priors and guided feature learning through transformer-based query initialization. Central to our approach is ACTOR (Action-aware Cross-modal TransfORmer), a cross-modal Transformer encoder that jointly attends to visual regions and textual prompts to extract action-relevant features. Rather than merely aligning modalities, ACTOR leverages language-guided attention to…
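
As a rough illustration of the general idea of language-guided query initialization, the sketch below uses a single scaled dot-product cross-attention step in which text prompt embeddings attend over visual region features, and the attended features serve as initial decoder queries in place of random ones. This is not the paper's ACTOR module: the function name `init_queries_from_text`, the tensor shapes, and the single-head, projection-free attention are all assumptions made for illustration, and the random arrays stand in for real text and image encoders.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_queries_from_text(text_emb, visual_feats):
    """Hypothetical helper: build initial queries via cross-attention.

    text_emb:     (num_queries, d) embeddings of textual prompts (assumed given)
    visual_feats: (num_regions, d) features of visual regions (assumed given)
    Returns the initial queries (num_queries, d) and the attention weights
    (num_queries, num_regions).
    """
    d = text_emb.shape[-1]
    # Each text prompt scores every visual region (scaled dot product).
    scores = text_emb @ visual_feats.T / np.sqrt(d)
    attn = softmax(scores, axis=-1)
    # Each query is a text-weighted mixture of visual region features.
    queries = attn @ visual_feats
    return queries, attn


# Placeholder inputs; a real system would use learned encoders here.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 16))       # 4 prompts, 16-dim embeddings
visual_feats = rng.normal(size=(10, 16))  # 10 regions, 16-dim features
queries, attn = init_queries_from_text(text_emb, visual_feats)
```

Because each row of `attn` is a distribution over regions, inspecting it shows which regions a given prompt emphasizes, which is one simple way queries initialized like this can be easier to interpret than random ones.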



Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection