Active Intelligence in Video Avatars via Closed-loop World Modeling
Xuanhua He, Tianyu Yang, Ke Cao, Ruiqi Wu, Cheng Meng, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Qifeng Chen

TL;DR
This paper introduces ORCA, a novel framework for active, goal-directed video avatars that use a closed-loop world model to adaptively interact with their environment, enabling autonomous multi-step task completion.
Contribution
We propose ORCA, the first framework integrating internal world modeling with a hierarchical dual-system architecture for active, goal-oriented video avatars in stochastic environments.
Findings
ORCA outperforms open-loop baselines in task success rate.
It achieves higher behavioral coherence in complex scenarios.
The framework demonstrates effective continuous belief updating and outcome verification.
Abstract
Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state…
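The closed-loop idea in the abstract (predict an outcome, generate, then verify the prediction against what was actually generated) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: `ToyGenerativeEnv` is a hypothetical stand-in for a stochastic video generator, the set-of-completed-steps state is a simplification, and `run_closed_loop` and `run_open_loop` are hypothetical names, not ORCA's implementation.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyGenerativeEnv:
    """Hypothetical stand-in for a stochastic generator (not the paper's model)."""
    success_prob: float
    rng: random.Random
    done: set = field(default_factory=set)

    def generate(self, step: str) -> None:
        # A requested step only takes effect with probability success_prob.
        if self.rng.random() < self.success_prob:
            self.done.add(step)

    def observe(self) -> frozenset:
        return frozenset(self.done)


def run_closed_loop(env: ToyGenerativeEnv, plan: list, max_steps: int = 50) -> bool:
    """Observe-Think-Act-Reflect: the belief advances only on verified outcomes."""
    belief = set()
    for _ in range(max_steps):
        pending = [s for s in plan if s not in belief]
        if not pending:
            return True
        before = env.observe()          # Observe: read the current state
        step = pending[0]               # Think: choose the next sub-goal
        predicted = before | {step}     # ...and predict the resulting state
        env.generate(step)              # Act: request the generation
        actual = env.observe()          # Reflect: compare prediction with reality
        if predicted <= actual:
            belief.add(step)            # verified, so update the belief
        # otherwise the same step is retried on the next iteration
    return all(s in belief for s in plan)


def run_open_loop(env: ToyGenerativeEnv, plan: list) -> bool:
    """Baseline: issue every step once and never check the outcome."""
    for step in plan:
        env.generate(step)
    return set(plan) <= env.observe()


if __name__ == "__main__":
    plan = ["pick up cup", "pour water", "hand over cup"]  # made-up task
    for name, runner in [("closed-loop", run_closed_loop), ("open-loop", run_open_loop)]:
        env = ToyGenerativeEnv(success_prob=0.5, rng=random.Random(0))
        print(name, "completed plan:", runner(env, plan))
```

In this toy setting the open-loop runner has no way to notice a failed generation, while the closed-loop runner retries until the observed state matches its prediction or the step budget runs out. The sketch makes no claim about the results reported for ORCA.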
Taxonomy
Topics: Social Robot Interaction and HRI · Multimodal Machine Learning Applications · Artificial Intelligence in Games
