Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation

Yuanhao Luo; Di Wen; Kunyu Peng; Ruiping Liu; Junwei Zheng; Yufan Chen; Jiale Wei; Rainer Stiefelhagen

arXiv:2604.10397·cs.CV·April 14, 2026

TL;DR

This paper introduces a temporally corrected benchmark and a joint framework for detecting and anticipating human-object interactions in videos, and argues that learning the two tasks together improves anticipation, particularly at longer horizons.

Contribution

The paper proposes DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for multi-horizon evaluation, and HOI-DA, a pair-centric model that jointly performs subject-object localization, present HOI detection, and future anticipation.

Findings

01 Learning detection and anticipation jointly improves performance, especially at longer horizons.

02 The proposed methods outperform existing approaches in both detection and anticipation tasks.

03 Anticipation benefits significantly from being learned as a structural constraint alongside detection.

Abstract

Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can temporally misalign nominal future labels with actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current…
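
The misalignment described in the abstract can be illustrated with a small sketch. It is not the paper's correction procedure. It assumes, hypothetically, that a future label is taken from the annotated keyframe nearest to the intended future time, and it measures how far that keyframe sits from the intended time. The function names, the 30-frame keyframe spacing, and the nearest-keyframe rule are all assumptions made for illustration.

```python
from bisect import bisect_left


def nearest_keyframe(keyframes, t):
    """Return the annotated frame index closest to t.

    `keyframes` must be sorted. Ties go to the earlier keyframe.
    (Hypothetical helper; the nearest-keyframe rule is an assumption.)
    """
    i = bisect_left(keyframes, t)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = keyframes[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda k: abs(k - t))


def label_offset(keyframes, t_now, horizon):
    """Signed gap (in frames) between the keyframe that supplies the
    future label and the intended future time t_now + horizon."""
    target = t_now + horizon
    return nearest_keyframe(keyframes, target) - target


# Hypothetical annotation: one keyframe every 30 frames.
keyframes = [0, 30, 60, 90]

# Observing at frame 12 and anticipating 30 frames ahead targets frame 42,
# but the nearest annotated label comes from frame 30.
offset = label_offset(keyframes, t_now=12, horizon=30)
print(offset)
```

Under these assumptions the label for a nominal 30-frame horizon actually describes a moment 12 frames earlier than intended, which is the kind of gap between nominal and actual future time that a temporally corrected benchmark would need to account for.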
