Human-Object Interaction Detection via Weak Supervision

Mert Kilickaya; Arnold Smeulders

arXiv:2112.00492·cs.CV·December 2, 2021

TL;DR

This paper introduces Align-Former, a transformer-based model that detects human-object interactions using only image-level supervision, outperforming previous image-level supervised detectors on HICO-DET.

Contribution

The paper presents a novel weakly supervised approach for HO-I detection using image-level labels, with a new transformer-based model and an HO-I align layer for target selection.

Findings

01

Align-Former outperforms existing image-level supervised HO-I detectors by 4.71 percentage points of mAP on HICO-DET.

02

The method achieves 20.85% mAP on HICO-DET, a substantial improvement over the previous 16.14%.

03

The approach reduces reliance on expensive alignment supervision annotations.
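
The two mAP figures reported above can be related with a short calculation. The sketch below only re-derives the absolute gain from the quoted numbers and, as an added illustration not stated in the source, the corresponding relative gain; the variable names are hypothetical.

```python
# mAP values quoted in the findings above (percent, HICO-DET).
previous_map = 16.14
align_former_map = 20.85

# Absolute gain in percentage points of mAP.
absolute_gain = round(align_former_map - previous_map, 2)

# Relative gain over the previous result (illustrative; not reported in the source).
relative_gain = 100.0 * (align_former_map - previous_map) / previous_map

print(f"absolute gain: {absolute_gain:.2f} points")
print(f"relative gain: {relative_gain:.1f}%")
```

Reading the gain as absolute percentage points is what makes the 4.71 figure consistent with the two reported mAP values.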

Abstract

The goal of this paper is Human-object Interaction (HO-I) detection. HO-I detection aims to find interacting human-object regions and classify their interaction from an image. Researchers have obtained significant improvements in recent years by relying on strong HO-I alignment supervision from [5]. HO-I alignment supervision pairs humans with the objects they interact with, and then aligns each human-object pair with its interaction category. Since collecting such annotation is expensive, in this paper we propose to detect HO-I without alignment supervision. We instead rely on image-level supervision, which only enumerates the interactions present in the image without pointing to where they happen. Our paper makes three contributions: i) We propose Align-Former, a visual-transformer based CNN that can detect HO-I with only image-level supervision. ii) Align-Former is equipped with an HO-I align layer,…
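
The difference between the two supervision regimes described in the abstract can be pictured as a difference in annotation records. The following is a minimal sketch under stated assumptions: the class names, field names, and box format are hypothetical and are not taken from the paper or from the HICO-DET annotation files.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels. Hypothetical, for illustration only.
Box = Tuple[float, float, float, float]


@dataclass
class AlignedHOI:
    """One human-object pair aligned with its interaction category."""
    human_box: Box
    object_box: Box
    interaction: str  # e.g. "ride bicycle"


@dataclass
class AlignmentSupervisedSample:
    """Strong supervision: every interaction is tied to a localized pair."""
    image_id: str
    pairs: List[AlignedHOI]


@dataclass
class ImageLevelSample:
    """Weak supervision: only the interactions present, with no locations."""
    image_id: str
    interactions: List[str]


def to_image_level(sample: AlignmentSupervisedSample) -> ImageLevelSample:
    """Discard boxes and pairings, keeping the unique interaction labels."""
    labels = sorted({pair.interaction for pair in sample.pairs})
    return ImageLevelSample(image_id=sample.image_id, interactions=labels)
```

Converting from the strong record to the weak one is lossy: the boxes and the pairing cannot be recovered from the label list, which is the information a weakly supervised detector has to infer during training.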

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Human Pose and Action Recognition