Towards Unconstrained Human-Object Interaction

Francesco Tonini; Alessandro Conti; Lorenzo Vaquero; Cigdem Beyan; Elisa Ricci

arXiv:2604.14069·cs.CV·April 16, 2026

TL;DR

This paper explores using Multimodal Large Language Models for unconstrained human-object interaction detection, removing the need for predefined interaction vocabularies and enabling in-the-wild analysis.

Contribution

The paper introduces the U-HOI task, a new paradigm for HOI detection that leverages MLLMs, together with a pipeline for extracting structured interactions from free-form text.

Findings

01. MLLMs show promise for flexible HOI detection in unconstrained environments.

02. Current HOI detectors have significant limitations in open-vocabulary settings.

03. The proposed pipeline effectively extracts structured interactions from free-form language.

Abstract

Human-Object Interaction (HOI) detection is a longstanding computer vision problem concerned with predicting the interaction between humans and objects. Current HOI models rely on a vocabulary of interactions at training and inference time, limiting their applicability to static environments. With the advent of Multimodal Large Language Models (MLLMs), it has become feasible to explore more flexible paradigms for interaction recognition. In this work, we revisit HOI detection through the lens of MLLMs and apply them to in-the-wild HOI detection. We define the Unconstrained HOI (U-HOI) task, a novel HOI domain that removes the requirement for a predefined list of interactions at both training and inference. We evaluate a range of MLLMs on this setting and introduce a pipeline that includes test-time inference and language-to-graph conversion to extract structured interactions from…
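
To make the idea of language-to-graph conversion concrete, here is a minimal sketch of turning a free-form description into structured (human, interaction, object) triplets and grouping them into a small graph. It is a toy stand-in based on a regular expression, not the paper's pipeline: the pattern, the function names `text_to_triplets` and `triplets_to_graph`, and the short list of subject words are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical toy pattern (an assumption, not from the paper):
# "<subject> is/are <verb>ing [preposition] [article] <object>"
_PATTERN = re.compile(
    r"(?P<subj>person|man|woman|child)\s+(?:is|are)\s+"
    r"(?P<verb>\w+ing(?:\s+(?:on|at|with|in|to))?)\s+"
    r"(?:(?:a|an|the)\s+)?(?P<obj>\w+)",
    re.IGNORECASE,
)


def text_to_triplets(text):
    """Extract (subject, interaction, object) triplets from free-form text."""
    return [
        (m.group("subj").lower(), m.group("verb").lower(), m.group("obj").lower())
        for m in _PATTERN.finditer(text)
    ]


def triplets_to_graph(triplets):
    """Group triplets into a mapping: subject -> list of (interaction, object) edges."""
    graph = defaultdict(list)
    for subj, verb, obj in triplets:
        graph[subj].append((verb, obj))
    return dict(graph)


description = "A person is riding a bicycle. The woman is sitting on the bench."
triplets = text_to_triplets(description)
graph = triplets_to_graph(triplets)
print(triplets)
print(graph)
```

A regular expression like this only covers a narrow sentence shape; it is meant to show the input and output of such a conversion step, where free-form model text goes in and interaction triplets with no predefined vocabulary come out.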

Code & Models

Repositories

francescotonini/anyhoi (GitHub)
