TL;DR
This paper explores using Multimodal Large Language Models for unconstrained human-object interaction detection, removing the need for predefined interaction vocabularies and enabling in-the-wild analysis.
Contribution
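It defines the Unconstrained HOI (U-HOI) task, which removes the predefined list of interactions at both training and inference, and introduces a pipeline that combines test-time inference with language-to-graph conversion to extract structured interactions from free-form text.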
Findings
MLLMs show promise for flexible HOI detection in unconstrained environments.
Current HOI detectors depend on a fixed interaction vocabulary at training and inference time, which limits them in open-vocabulary settings.
The proposed pipeline effectively extracts structured interactions from free-form language.
Abstract
Human-Object Interaction (HOI) detection is a longstanding computer vision problem concerned with predicting the interaction between humans and objects. Current HOI models rely on a vocabulary of interactions at training and inference time, limiting their applicability to static environments. With the advent of Multimodal Large Language Models (MLLMs), it has become feasible to explore more flexible paradigms for interaction recognition. In this work, we revisit HOI detection through the lens of MLLMs and apply them to in-the-wild HOI detection. We define the Unconstrained HOI (U-HOI) task, a novel HOI domain that removes the requirement for a predefined list of interactions at both training and inference. We evaluate a range of MLLMs on this setting and introduce a pipeline that includes test-time inference and language-to-graph conversion to extract structured interactions from…
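To make the idea of language-to-graph conversion concrete, here is a toy sketch that turns free-form sentences into (subject, verb, object) edges. It is not the paper's pipeline: the regular expression, the `Interaction` type, and the `text_to_graph` function are hypothetical names chosen for illustration, and the rule only handles simple "a person is verb-ing an object" sentences. A real system would need a far more robust parser.

```python
import re
from typing import NamedTuple, Set


class Interaction(NamedTuple):
    """One edge of the interaction graph (hypothetical structure)."""
    subject: str
    verb: str
    obj: str


# Assumption: the description uses sentences shaped like
# "<a|an|the> <human noun> is <verb>ing [preposition] <a|an|the> <object>".
_PATTERN = re.compile(
    r"\b(?:a|an|the)\s+(?P<subject>person|man|woman|child)\s+is\s+"
    r"(?P<verb>\w+ing(?:\s+(?:on|at|with|in))?)\s+"
    r"(?:a|an|the)\s+(?P<obj>\w+)",
    re.IGNORECASE,
)


def text_to_graph(text: str) -> Set[Interaction]:
    """Extract a set of (subject, verb, object) edges from free-form text."""
    return {
        Interaction(m["subject"].lower(), m["verb"].lower(), m["obj"].lower())
        for m in _PATTERN.finditer(text)
    }


if __name__ == "__main__":
    caption = "A man is riding a bicycle. The woman is sitting on a bench."
    for edge in sorted(text_to_graph(caption)):
        print(edge)
```

Because the verb and object are captured as open strings rather than matched against a fixed label list, the output is not tied to a predefined interaction vocabulary, which is the property the U-HOI setting asks for.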
