Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation
Yuanhao Luo, Di Wen, Kunyu Peng, Ruiping Liu, Junwei Zheng, Yufan Chen, Jiale Wei, Rainer Stiefelhagen

TL;DR
This paper introduces a new benchmark and a joint framework for detecting and anticipating human-object interactions in videos, and argues that learning the two tasks together improves prediction at longer horizons.
Contribution
It proposes DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for multi-horizon evaluation, and HOI-DA, a pair-centric model that jointly localizes subject-object pairs, detects present interactions, and anticipates future ones in videos.
Findings
Learning detection and anticipation jointly improves performance, especially at longer anticipation horizons.
The proposed method outperforms existing approaches on both detection and anticipation.
Anticipation benefits significantly from being learned as a structural constraint alongside detection.
Abstract
Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, which limits joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can temporally misalign nominal future labels with actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current…
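The abstract is cut off before it says what the residual transitions are applied to, so the following is only a minimal sketch of the general idea, not the HOI-DA implementation. It assumes that each subject-object pair has present-time interaction logits, that each anticipation horizon has its own residual vector, and that scores are multi-label sigmoids. The function name `anticipate`, the three interaction classes, the horizons, and all numbers are hypothetical.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def anticipate(
    present_logits: List[float],
    residuals: Dict[int, List[float]],
) -> Dict[int, List[float]]:
    """Hypothetical residual anticipation for one subject-object pair.

    For every horizon, the future logit of each interaction class is the
    present logit plus a horizon-specific residual. A zero residual means
    "the interaction state does not change".
    """
    future = {}
    for horizon, delta in residuals.items():
        if len(delta) != len(present_logits):
            raise ValueError("residual and present logits must have equal length")
        future[horizon] = [sigmoid(p + d) for p, d in zip(present_logits, delta)]
    return future


# Assumed classes for illustration: [hold, look_at, release].
present = [2.0, -1.0, -3.0]
# Assumed residuals per horizon (in seconds); larger horizons drift further.
residuals = {
    1: [0.0, 0.0, 0.0],
    3: [-1.0, 0.0, 2.0],
    5: [-3.0, 0.5, 4.0],
}
future = anticipate(present, residuals)

for horizon in sorted(future):
    print(horizon, [round(s, 3) for s in future[horizon]])
```

With a zero residual the anticipated scores equal the present detection scores, which is one way to read the claim that anticipation can act as a structural constraint on detection: the future is expressed as a change relative to what is detected now rather than predicted from scratch.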
