What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions

A S M Iftekhar; Hao Chen; Kaustav Kundu; Xinyu Li; Joseph Tighe; Davide Modolo

arXiv:2204.00746·cs.CV·May 27, 2022·1 cites

Open Access

TL;DR

This paper introduces SSRT, a Transformer-based model that improves human-object interaction detection by refining object-action pair selection and query representations using semantic and spatial features, achieving state-of-the-art results.

Contribution

The paper presents a novel one-stage Transformer model with modules for selecting relevant pairs and refining queries, advancing HOI detection methods.

Findings

01. Achieves state-of-the-art performance on V-COCO and HICO-DET benchmarks.

02. Introduces modules for pair selection and query refinement using semantic and spatial features.

03. Outperforms previous Transformer-based HOI approaches.

Abstract

We propose a novel one-stage Transformer-based model, the semantic and spatial refined transformer (SSRT), for Human-Object Interaction (HOI) detection, a task that requires localizing humans and objects and predicting their interactions. Unlike previous Transformer-based HOI approaches, which mostly focus on improving the design of the decoder outputs for the final detection, SSRT introduces two new modules that help select the most relevant object-action pairs within an image and refine the queries' representation using rich semantic and spatial features. These enhancements lead to state-of-the-art results on the two most popular HOI benchmarks: V-COCO and HICO-DET.
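
To make the task definition in the abstract concrete, the sketch below shows one plausible way to represent a single HOI detection (a human box, an object box, and an interaction label) and to compare a prediction against a ground-truth annotation. It is not taken from the paper: the names `HOIDetection`, `iou`, and `matches`, the corner-format boxes, and the 0.5 overlap threshold are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) corner coordinates in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class HOIDetection:
    """Hypothetical container for one human-object interaction."""
    human_box: Box
    object_box: Box
    object_label: str
    action_label: str
    score: float = 1.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches(pred: HOIDetection, gt: HOIDetection,
            iou_threshold: float = 0.5) -> bool:
    """True if labels agree and both boxes overlap enough.

    The threshold value is an assumption for illustration only.
    """
    return (
        pred.object_label == gt.object_label
        and pred.action_label == gt.action_label
        and iou(pred.human_box, gt.human_box) >= iou_threshold
        and iou(pred.object_box, gt.object_box) >= iou_threshold
    )
```

Under this representation, a prediction counts as correct only when both the localization (two boxes) and the interaction (object and action labels) agree with the annotation, which mirrors the two parts of the task named in the abstract.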

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Context-Aware Activity Recognition Systems