ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection
Minh Anh Nguyen, Quang Huy Tran, Bao Ngoc Le, SuiYang Guang, Tuan Kiet Pham, Linh Chi Vo

TL;DR
ScriptHOI introduces a structured approach to open-vocabulary human-object interaction detection by decomposing interactions into script-based state transitions, improving recognition of rare and unseen interactions.
Contribution
The paper proposes a structured framework that models each interaction as a scripted state transition, combining visual state parsing with partial-label learning to improve open-vocabulary HOI detection.
Findings
Improves recognition of rare and unseen interactions.
Reduces false positives caused by affordance conflicts.
Enhances detection performance on multiple benchmarks.
Abstract
Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict "cut cake" from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose ScriptHOI, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses…
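
To make the slot decomposition concrete, here is a minimal sketch of how a phrase could be stored as a set of soft slots and checked against a parsed visual state. It is an illustration only, not the authors' implementation: the `Script` class, the slot values (such as `tool_on_target`), the example weights, and the scoring rule (mean weighted overlap per slot) are all assumptions. Only the six slot names come from the abstract.

```python
from dataclasses import dataclass

# The six slot types named in the abstract.
SLOTS = ("body_role", "contact", "geometry", "affordance", "motion", "object_state")


@dataclass
class Script:
    """Hypothetical container: a phrase plus soft expectations per slot."""
    phrase: str
    slots: dict  # slot name -> {value: weight in [0, 1]}


def slot_support(expected, observed):
    """Weighted overlap between expected and observed soft values for one slot."""
    return sum(weight * observed.get(value, 0.0) for value, weight in expected.items())


def script_score(script, visual_state):
    """Mean per-slot support (an assumed rule); unconstrained slots are skipped."""
    scores = [
        slot_support(script.slots[name], visual_state.get(name, {}))
        for name in SLOTS
        if name in script.slots
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical script for "cut cake": four of the six slots are constrained.
cut_cake = Script(
    phrase="cut cake",
    slots={
        "body_role": {"hand": 1.0},
        "contact": {"tool_on_target": 1.0},
        "affordance": {"cuttable": 1.0},
        "object_state": {"sliced": 1.0},
    },
)

# Knife and cake are both present, but there is no contact and no state change.
knife_nearby = {
    "body_role": {"hand": 0.9},
    "contact": {"tool_on_target": 0.1},
    "affordance": {"cuttable": 1.0},
    "object_state": {"sliced": 0.0},
}

# The knife is on the cake and the cake is visibly sliced.
actually_cutting = {
    "body_role": {"hand": 0.9},
    "contact": {"tool_on_target": 0.9},
    "affordance": {"cuttable": 1.0},
    "object_state": {"sliced": 0.8},
}

score_nearby = script_score(cut_cake, knife_nearby)
score_cutting = script_score(cut_cake, actually_cutting)
```

In this toy setup the affordance slot is fully satisfied in both scenes, so a matcher driven by affordance alone could not separate them. Scoring the contact and object-state slots as well is what pulls the "knife nearby" scene below the "actually cutting" one.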
