What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions
A S M Iftekhar, Hao Chen, Kaustav Kundu, Xinyu Li, Joseph Tighe, Davide Modolo

TL;DR
This paper introduces SSRT, a Transformer-based model that improves human-object interaction detection by refining object-action pair selection and query representations using semantic and spatial features, achieving state-of-the-art results.
Contribution
The paper presents a novel one-stage Transformer model with modules for selecting relevant pairs and refining queries, advancing HOI detection methods.
Findings
Achieves state-of-the-art performance on V-COCO and HICO-DET benchmarks.
Introduces modules for pair selection and query refinement using semantic and spatial features.
Outperforms previous Transformer-based HOI approaches.
Abstract
We propose a novel one-stage Transformer-based semantic and spatial refined transformer (SSRT) to solve the Human-Object Interaction (HOI) detection task, which requires localizing humans and objects and predicting their interactions. Unlike previous Transformer-based HOI approaches, which mostly focus on improving the design of the decoder outputs for the final detection, SSRT introduces two new modules that help select the most relevant object-action pairs within an image and refine the queries' representation using rich semantic and spatial features. These enhancements lead to state-of-the-art results on the two most popular HOI benchmarks: V-COCO and HICO-DET.
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Context-Aware Activity Recognition Systems
