ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

Minh Anh Nguyen; Quang Huy Tran; Bao Ngoc Le; SuiYang Guang; Tuan Kiet Pham; Linh Chi Vo

arXiv:2605.05057 · cs.CV · May 13, 2026

TL;DR

ScriptHOI introduces a structured approach to open-vocabulary human-object interaction detection by decomposing interactions into script-based state transitions, improving recognition of rare and unseen interactions.

Contribution

The paper proposes a structured framework that models each interaction phrase as a scripted state transition, combining visual state parsing with partial-label learning to improve open-vocabulary HOI detection.

Findings

01. Improves recognition of rare and unseen interactions.

02. Reduces false positives caused by affordance conflicts.

03. Enhances detection performance on multiple benchmarks.

Abstract

Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict "cut cake" from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose ScriptHOI, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses…
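To make the slot decomposition concrete, here is a minimal illustrative sketch in Python. Only the six slot names come from the abstract. The `InteractionScript` class, the `script_score` function, the geometric-mean aggregation, and all numeric values are assumptions made for illustration, not the paper's actual model or results.

```python
import math
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class InteractionScript:
    """Hypothetical container for one phrase; slot names follow the abstract."""
    body_role: str
    contact: str
    geometry: str
    affordance: str
    motion: str
    object_state: str


SLOT_NAMES = [f.name for f in fields(InteractionScript)]


def script_score(support: dict[str, float]) -> float:
    """Assumed aggregation: geometric mean of per-slot support in [0, 1].

    A single weakly supported slot pulls the whole score down, so strong
    affordance evidence alone cannot carry a prediction.
    """
    eps = 1e-6  # avoids log(0) when a slot has no support at all
    log_sum = sum(math.log(max(support.get(name, 0.0), eps)) for name in SLOT_NAMES)
    return math.exp(log_sum / len(SLOT_NAMES))


# Illustrative script for the abstract's "cut cake" example (slot values made up).
cut_cake = InteractionScript(
    body_role="hand holds tool",
    contact="blade touches cake",
    geometry="tool above target",
    affordance="knife can cut",
    motion="downward slicing",
    object_state="cake whole -> sliced",
)

# A knife and a cake are visible, but contact and state change are not observed.
affordance_only = {
    "affordance": 0.95, "geometry": 0.6, "body_role": 0.1,
    "contact": 0.05, "motion": 0.05, "object_state": 0.05,
}
# Every slot is visually supported.
full_support = {name: 0.9 for name in SLOT_NAMES}

print(round(script_score(affordance_only), 3))
print(round(script_score(full_support), 3))
```

Under this assumed scoring rule, the affordance-only case scores far lower than the fully supported case, which mirrors the abstract's point that the presence of a knife and a cake should not be enough to predict "cut cake".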
