What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation

Zhenlong Yuan; Yue Wang; Dapeng Zhang; Kejin Cui; Rui Chen; Jing Tang; Lei Sun; Hongwei Yu; Chengxuan Qian; Xiangxiang Chu; Shuo Li; Yuyin Zhou

arXiv:2602.11499 · cs.CV · May 21, 2026

TL;DR

This paper introduces ImagineAgent, a novel framework that enhances open-vocabulary human-object interaction understanding by integrating cognitive mapping, tool-augmented reinforcement learning, and generative world modeling to reduce hallucinations and improve viewpoint robustness.

Contribution

The paper presents a new multimodal framework with a specialized dataset, dynamic tool integration, and viewpoint imagination to advance OV-HOI comprehension beyond existing methods.

Findings

01. Achieves state-of-the-art performance on the SWIG-HOI and HICO-DET datasets.

02. Requires only 36.7% of the training data used by previous methods.

03. Reduces hallucinations and improves viewpoint robustness.

Abstract

Multimodal Large Language Models (MLLMs) have shown promising capabilities in bridging visual and textual reasoning, yet their performance on Open-Vocabulary Human-Object Interaction (OV-HOI) is constrained by cross-modal hallucinations and the limited viewpoints of images. To address this, we propose ImagineAgent, an agentic framework that integrates cognitive mapping, tool-augmented reinforcement learning (RL), and generative world modeling for robust OV-HOI understanding. Specifically, we first propose a chain-of-thought (CoT) dataset named hicodet-6K for supervised fine-tuning (SFT), which bridges the perception-to-cognition gap by structuring perceived entities into interaction pairs for comprehensive predictions. Subsequently, we develop a multimodal tool library integrating online retrieval, image cropping, and generative modeling, enabling the agent to dynamically augment…

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Topic Modeling