Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

Hangjie Yuan; Mang Wang; Dong Ni; Liangpeng Xu

arXiv:2202.00259·cs.CV·February 2, 2022

TL;DR

This paper introduces an object-guided cross-modal calibration network that leverages semantic priors and cross-modal features to improve human-object interaction detection, achieving state-of-the-art results.

Contribution

It proposes an object-guided hierarchical approach that combines a Verb Semantic Model (VSM) with semantic aggregation, a Similarity KL (SKL) loss, and Cross-Modal Calibration (CMC) to enhance end-to-end HOI detection models.

Findings

01 Achieves state-of-the-art performance on HOI benchmarks.

02 Semantic priors significantly improve verb prediction accuracy.

03 Cross-modal features enhance the robustness of HOI detection.

Abstract

Human-Object Interaction (HOI) detection is an essential task to understand human-centric images from a fine-grained perspective. Although end-to-end HOI detection models thrive, their paradigm of parallel human/object detection and verb class prediction loses two-stage methods' merit: object-guided hierarchy. The object in one HOI triplet gives direct clues to the verb to be predicted. In this paper, we aim to boost end-to-end models with object-guided statistical priors. Specifically, we propose to utilize a Verb Semantic Model (VSM) and use semantic aggregation to profit from this object-guided hierarchy. Similarity KL (SKL) loss is proposed to optimize VSM to align with the HOI dataset's priors. To overcome the static semantic embedding problem, we propose to generate cross-modality-aware visual and semantic features by Cross-Modal Calibration (CMC). The above modules combined…
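The idea that the object in a triplet hints at the verb can be illustrated with a simple co-occurrence statistic. The sketch below is not the paper's VSM or SKL loss. It only assumes that an object-guided prior can be approximated by the empirical frequency of each verb given an object, and the function name `verb_given_object_prior` and the toy annotations are hypothetical.

```python
from collections import Counter, defaultdict


def verb_given_object_prior(pairs):
    """Estimate an empirical P(verb | object) from (verb, object) pairs.

    Illustrative assumption only: the paper's actual prior and its
    Verb Semantic Model are not reproduced here.
    """
    counts = defaultdict(Counter)
    for verb, obj in pairs:
        counts[obj][verb] += 1

    prior = {}
    for obj, verb_counts in counts.items():
        total = sum(verb_counts.values())
        # Normalise counts so the verbs seen with each object sum to 1.
        prior[obj] = {verb: n / total for verb, n in verb_counts.items()}
    return prior


# Hypothetical toy annotations: (verb, object) parts of HOI triplets.
toy_annotations = [
    ("ride", "bicycle"),
    ("ride", "bicycle"),
    ("hold", "bicycle"),
    ("repair", "bicycle"),
    ("hold", "cup"),
    ("drink_with", "cup"),
]

prior = verb_given_object_prior(toy_annotations)
for obj, dist in prior.items():
    print(obj, dist)
```

In this toy data, "ride" is the most frequent verb for "bicycle" and never appears with "cup", which is the kind of object-conditioned clue the abstract describes.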

Code & Models

Repositories

jacobyuan7/ocn-hoi-benchmark (PyTorch, official)

Videos

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection