What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation
Zhenlong Yuan, Yue Wang, Dapeng Zhang, Kejin Cui, Rui Chen, Jing Tang, Lei Sun, Hongwei Yu, Chengxuan Qian, Xiangxiang Chu, Shuo Li, Yuyin Zhou

TL;DR
This paper introduces ImagineAgent, a novel framework that enhances open-vocabulary human-object interaction understanding by integrating cognitive mapping, tool-augmented reinforcement learning, and generative world modeling to reduce hallucinations and improve viewpoint robustness.
Contribution
The paper presents a multimodal agentic framework that combines a chain-of-thought fine-tuning dataset (hicodet-6K), dynamic tool integration, and viewpoint imagination to advance open-vocabulary human-object interaction (OV-HOI) comprehension.
Findings
Achieves state-of-the-art performance on SWIG-HOI and HICO-DET datasets.
Requires only 36.7% of the training data used by previous methods.
Effectively reduces hallucinations and improves viewpoint robustness.
Abstract
Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning in Open-Vocabulary Human-Object Interaction (OV-HOI) is constrained by cross-modal hallucinations and the limited viewpoints of images. To address this, we propose ImagineAgent, an agentic framework that integrates cognitive mapping, tool-augmented reinforcement learning (RL), and generative world modeling for robust OV-HOI understanding. Specifically, we first propose an innovative CoT dataset named hicodet-6K for supervised fine-tuning (SFT), which effectively bridges the perception-to-cognition gap by structuring perceived entities into interaction pairs for comprehensive predictions. Subsequently, we develop a multimodal tool library integrating online retrieval, image cropping, and generative modeling, enabling the agent to dynamically augment…
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Topic Modeling
