Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection
Xubin Zhong, Changxing Ding, Zijian Li, and Shaoli Huang

TL;DR
This paper improves DETR-based human-object interaction (HOI) detection by mining hard-positive queries, making the detector more robust to changes in object location and achieving state-of-the-art results on HICO-DET, V-COCO, and HOI-A.
Contribution
It proposes explicit and implicit hard-positive query mining, in which queries are forced to make correct predictions from partial visual cues, to improve DETR's robustness in HOI detection.
Findings
Achieves state-of-the-art performance on HICO-DET, V-COCO, and HOI-A benchmarks.
Enhances DETR's robustness to object location changes.
Widely applicable to existing DETR-based HOI detectors.
Abstract
Human-Object Interaction (HOI) detection is a core task for high-level image understanding. Recently, Detection Transformer (DETR)-based HOI detectors have become popular due to their superior performance and efficient structure. However, these approaches typically adopt fixed HOI queries for all testing images, which makes them vulnerable to changes in object location within a specific image. Accordingly, in this paper, we propose to enhance DETR's robustness by mining hard-positive queries, which are forced to make correct predictions using partial visual cues. First, we explicitly compose hard-positive queries according to the ground-truth (GT) position of labeled human-object pairs for each training image. Specifically, we shift the GT bounding boxes of each labeled human-object pair so that the shifted boxes cover only a certain portion of the GT ones. We encode the coordinates of the…
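The box-shifting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `coverage` and `shift_box`, the coverage range `lo`/`hi`, the rejection-sampling strategy, and the `(x1, y1, x2, y2)` box format are all assumptions made for illustration. The sketch translates a GT box by a random offset and keeps the result only if it covers a chosen fraction of the GT box.

```python
import random


def coverage(gt, box):
    """Fraction of the GT box area that `box` overlaps (boxes are x1, y1, x2, y2)."""
    ix1, iy1 = max(gt[0], box[0]), max(gt[1], box[1])
    ix2, iy2 = min(gt[2], box[2]), min(gt[3], box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / gt_area


def shift_box(gt, lo=0.4, hi=0.6, rng=None, max_tries=1000):
    """Translate `gt` so the shifted box covers between `lo` and `hi` of it.

    `lo`, `hi`, and `max_tries` are hypothetical values chosen for this sketch.
    """
    rng = rng or random.Random()
    w, h = gt[2] - gt[0], gt[3] - gt[1]
    for _ in range(max_tries):
        # Sample an offset of at most one box width/height in each direction.
        dx, dy = rng.uniform(-w, w), rng.uniform(-h, h)
        cand = (gt[0] + dx, gt[1] + dy, gt[2] + dx, gt[3] + dy)
        if lo <= coverage(gt, cand) <= hi:
            return cand
    # Deterministic fallback: a purely horizontal shift that yields the
    # midpoint of the requested coverage range.
    dx = (1.0 - (lo + hi) / 2.0) * w
    return (gt[0] + dx, gt[1], gt[2] + dx, gt[3])


# Shift both boxes of one labeled human-object pair (made-up coordinates).
rng = random.Random(0)
human_gt, object_gt = (50.0, 40.0, 150.0, 300.0), (120.0, 200.0, 220.0, 280.0)
human_shifted = shift_box(human_gt, rng=rng)
object_shifted = shift_box(object_gt, rng=rng)
```

Under these assumptions, each shifted box sees only part of the labeled instance, which matches the idea of forcing a query to predict from partial visual cues. How the shifted coordinates are then turned into queries is not covered here, since the abstract is truncated at that point.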
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Adam
