Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

Youliang Zhang; Zhengguang Zhou; Zhentao Yu; Ziyao Huang; Teng Hu; Sen Liang; Guozhen Zhang; Ziqiao Peng; Shunkai Li; Yi Chen; Zixiang Zhou; Yuan Zhou; Qinglin Lu; Xiu Li

arXiv:2602.01538·cs.CV·February 4, 2026



TL;DR

This paper introduces InteractAvatar, a dual-stream framework that enables controllable, text-driven human-object interactions in talking avatars by decoupling perception, planning, and video synthesis, and establishes a new benchmark for evaluation.

Contribution

The paper presents a novel dual-stream framework with perception and interaction modules for grounded human-object interaction in talking avatars, addressing control-quality challenges and establishing a new evaluation benchmark.

Findings

01

Effective generation of grounded human-object interactions in talking avatars.

02

The proposed method outperforms existing approaches in realism and interaction accuracy.

03

The benchmark GroundedInter facilitates standardized evaluation of GHOI video generation.

Abstract

Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars…
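The decoupling described in the abstract (perceive the scene, plan an interaction motion, then synthesize video conditioned on that motion and the audio) can be pictured as a three-stage data flow. The following is only a minimal sketch of that flow, not the authors' implementation: every name here (`perceive`, `plan_interaction`, `synthesize`, `MotionPlan`, the per-frame hand-target representation) is hypothetical, the detector returns a fixed object, and linear interpolation stands in for whatever motion generation PIM actually performs.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in normalized image coordinates
Point = Tuple[float, float]


@dataclass
class DetectedObject:
    label: str
    box: Box


@dataclass
class MotionPlan:
    # Hypothetical motion representation: one 2D hand target per frame.
    hand_targets: List[Point]


def perceive(reference_image) -> List[DetectedObject]:
    """Stand-in for the detection step; returns one fixed object for illustration."""
    return [DetectedObject("cup", (0.6, 0.5, 0.7, 0.65))]


def plan_interaction(text: str, objects: List[DetectedObject],
                     num_frames: int, start: Point = (0.5, 0.9)) -> MotionPlan:
    """Stand-in for PIM: move the hand linearly toward the object named in the text."""
    target = next((o for o in objects if o.label in text.lower()), None)
    if target is None:
        # No grounded object mentioned: keep the hand at its starting position.
        return MotionPlan([start] * num_frames)
    cx = (target.box[0] + target.box[2]) / 2
    cy = (target.box[1] + target.box[3]) / 2
    steps = max(num_frames - 1, 1)
    targets = [(start[0] + (cx - start[0]) * t / steps,
                start[1] + (cy - start[1]) * t / steps)
               for t in range(num_frames)]
    return MotionPlan(targets)


def synthesize(plan: MotionPlan, audio_features: List[float]) -> List[dict]:
    """Stand-in for AIM: pair each frame's motion with its audio feature.

    A real generator would render video frames; this only shows the conditioning inputs.
    """
    return [{"hand": h, "audio": a} for h, a in zip(plan.hand_targets, audio_features)]


def generate(reference_image, text: str, audio_features: List[float]) -> List[dict]:
    # Perception and planning run first; synthesis only consumes their output.
    objects = perceive(reference_image)
    plan = plan_interaction(text, objects, len(audio_features))
    return synthesize(plan, audio_features)
```

Calling `generate(None, "Pick up the cup", [0.1, 0.2, 0.3])` yields three per-frame records whose hand target moves from the start point to the center of the detected cup. The point of the sketch is the separation of concerns: the synthesis stage never sees the text or the detections directly, only the planned motion and the audio.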


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Multimodal Machine Learning Applications