Active Intelligence in Video Avatars via Closed-loop World Modeling

Xuanhua He; Tianyu Yang; Ke Cao; Ruiqi Wu; Cheng Meng; Yong Zhang; Zhuoliang Kang; Xiaoming Wei; Qifeng Chen

arXiv:2512.20615·cs.CV·December 24, 2025

TL;DR

This paper introduces ORCA, a novel framework for active, goal-directed video avatars that use a closed-loop world model to adaptively interact with their environment, enabling autonomous multi-step task completion.

Contribution

We propose ORCA, the first framework integrating internal world modeling with a hierarchical dual-system architecture for active, goal-oriented video avatars in stochastic environments.

Findings

1. ORCA outperforms open-loop baselines in task success rate.

2. It achieves higher behavioral coherence in complex scenarios.

3. The framework demonstrates effective continuous belief updating and outcome verification.

Abstract

Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state…
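The closed-loop idea in the abstract (predict an outcome, generate, then verify the prediction against what was actually generated) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the paper's implementation: `toy_generate`, `run_episode`, the `verify` flag, the failure model, and the sub-goal names are all hypothetical, and the "world" is a plain list of achieved sub-goals rather than video.

```python
import random


def toy_generate(world, action, rng, fail_rate):
    """Hypothetical stand-in for a stochastic video generator.

    With probability fail_rate the generated clip does not realise the
    requested action, so the world state stays unchanged.
    """
    if rng.random() < fail_rate:
        return list(world)
    return list(world) + [action]


def run_episode(plan, fail_rate=0.3, verify=True, max_steps=50, seed=0):
    """Run a toy Observe-Think-Act-Reflect loop over a fixed plan.

    verify=True  -> closed loop: belief advances only if the outcome matches.
    verify=False -> open loop: belief advances regardless of the outcome.
    Returns (belief, world, steps).
    """
    rng = random.Random(seed)
    world = []    # ground truth: sub-goals actually achieved
    belief = []   # agent's internal estimate of achieved sub-goals
    steps = 0
    while len(belief) < len(plan) and steps < max_steps:
        steps += 1
        observed = list(world)                                # Observe
        action = plan[len(belief)]                            # Think
        predicted = observed + [action]                       # expected outcome
        world = toy_generate(world, action, rng, fail_rate)   # Act
        if not verify or world == predicted:                  # Reflect
            belief.append(action)
        # On a mismatch with verify=True the belief is left unchanged,
        # so the same sub-goal is retried on the next iteration.
    return belief, world, steps


if __name__ == "__main__":
    plan = ["pick_up_cup", "pour_water", "hand_over_cup"]  # hypothetical sub-goals
    print(run_episode(plan, fail_rate=0.5, verify=True))
    print(run_episode(plan, fail_rate=0.5, verify=False))
```

In this toy setting, the closed-loop variant keeps the belief equal to the true world state at every step, whereas the open-loop variant can report the plan as finished while the world lags behind. That divergence is the kind of failure that outcome verification is meant to prevent.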

Code & Models

Datasets

Alexhe101/L-IVA
dataset · 160 downloads

Taxonomy

Topics: Social Robot Interaction and HRI · Multimodal Machine Learning Applications · Artificial Intelligence in Games