Spatial-Temporal Human-Object Interaction Detection

Xu Sun; Yunqing He; Tongwei Ren; Gangshan Wu

arXiv:2508.17270 · cs.CV · August 26, 2025

TL;DR

This paper introduces a new task called ST-HOID for detecting fine-grained human-object interactions in videos, along with a novel method and a dataset for evaluation, advancing human-centric video understanding.

Contribution

The paper constructs VidOR-HOID, the first dataset for ST-HOID evaluation, and proposes a method that combines an object trajectory detection module with an interaction reasoning module.

Findings

01

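The proposed method outperforms baselines built from state-of-the-art methods for related tasks, including image human-object interaction detection and video visual relation detection.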

02

The VidOR-HOID dataset contains 10,831 spatial-temporal HOI instances and is the first dataset built for ST-HOID evaluation.

03

Experimental results show significant improvement over state-of-the-art methods.

Abstract

In this paper, we propose a new instance-level human-object interaction detection task on videos called ST-HOID, which aims to distinguish fine-grained human-object interactions (HOIs) and the trajectories of subjects and objects. It is motivated by the fact that HOI is crucial for human-centric video content understanding. To solve ST-HOID, we propose a novel method consisting of an object trajectory detection module and an interaction reasoning module. Furthermore, we construct the first dataset named VidOR-HOID for ST-HOID evaluation, which contains 10,831 spatial-temporal HOI instances. We conduct extensive experiments to evaluate the effectiveness of our method. The experimental results demonstrate that our method outperforms the baselines generated by the state-of-the-art methods of image human-object interaction detection, video visual relation detection and video human-object…
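To make the task output concrete, the following is a minimal sketch of how a spatial-temporal HOI instance (a subject trajectory, an object trajectory, and an interaction label) might be represented, together with a volume IoU between two trajectories. All names here (`Trajectory`, `STHOIInstance`, `trajectory_viou`) are hypothetical, and volume IoU is a matching criterion commonly used in video relation detection; the abstract does not state which representation or criterion the authors use.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Axis-aligned box as (x1, y1, x2, y2); an assumed convention.
Box = Tuple[float, float, float, float]


@dataclass
class Trajectory:
    """A per-frame sequence of boxes for one entity (hypothetical structure)."""
    category: str
    boxes: Dict[int, Box]  # frame index -> box


@dataclass
class STHOIInstance:
    """One spatial-temporal HOI: subject and object trajectories plus a label."""
    subject: Trajectory
    obj: Trajectory
    interaction: str


def box_area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def box_intersection(a: Box, b: Box) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def trajectory_viou(a: Trajectory, b: Trajectory) -> float:
    """Volume IoU: summed per-frame intersection over summed per-frame union.

    Frames where only one trajectory exists add that box's area to the union.
    """
    inter = 0.0
    union = 0.0
    for frame in set(a.boxes) | set(b.boxes):
        box_a = a.boxes.get(frame)
        box_b = b.boxes.get(frame)
        if box_a is not None and box_b is not None:
            i = box_intersection(box_a, box_b)
            inter += i
            union += box_area(box_a) + box_area(box_b) - i
        elif box_a is not None:
            union += box_area(box_a)
        else:
            union += box_area(box_b)
    return inter / union if union > 0 else 0.0


# Illustrative data only; the labels and coordinates are made up.
person = Trajectory("person", {0: (0, 0, 2, 2), 1: (0, 0, 2, 2)})
person_shifted = Trajectory("person", {1: (0, 0, 2, 2), 2: (0, 0, 2, 2)})
ball = Trajectory("ball", {0: (1, 1, 2, 2), 1: (1, 1, 2, 2)})
instance = STHOIInstance(subject=person, obj=ball, interaction="hold")
```

Under this sketch, a predicted instance would count as matching a ground-truth instance only when the interaction label agrees and both the subject and object trajectories overlap sufficiently; the overlap threshold is a further assumption left to the evaluator.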
