SARAH: Spatially Aware Real-time Agentic Humans

Evonne Ng; Siwei Zhang; Zhang Chen; Michael Zollhoefer; Alexander Richard

arXiv:2602.18432 · cs.CV · February 23, 2026

TL;DR

SARAH introduces a real-time, causal method for creating spatially-aware conversational agents in VR, enabling natural gestures, gaze, and orientation aligned with user movement and speech, suitable for live deployment.

Contribution

The paper presents the first fully causal, streaming VR-compatible approach for spatially-aware conversational motion using a novel transformer-based architecture.

Findings

01 Achieves state-of-the-art motion quality at over 300 FPS
02 Outperforms non-causal baselines in natural spatial dynamics
03 Validated on a live VR system for real-time deployment

Abstract

As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control:…
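The abstract pairs a flow matching model with classifier-free guidance. As a rough illustration of how those two pieces commonly fit together, the sketch below shows the generic guidance formula applied inside a simple Euler sampler. It is not the authors' implementation: the names `guided_velocity`, `sample`, and `toy_velocity`, the `cond=None` convention for the unconditional branch, and the step count are all assumptions made for this example, and the toy velocity function stands in for a trained network.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical signature: velocity(x, t, cond) -> dx/dt.
# Passing cond=None is this sketch's convention for the unconditional branch.
VelocityFn = Callable[[Vector, float, Optional[Vector]], Vector]


def guided_velocity(v: VelocityFn, x: Vector, t: float,
                    cond: Vector, scale: float) -> Vector:
    """Generic classifier-free guidance: v_u + scale * (v_c - v_u)."""
    v_c = v(x, t, cond)   # conditional prediction
    v_u = v(x, t, None)   # unconditional prediction
    return [u + scale * (c - u) for c, u in zip(v_c, v_u)]


def sample(v: VelocityFn, x0: Vector, cond: Vector,
           scale: float, steps: int = 10) -> Vector:
    """Euler integration of the guided velocity field from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        dx = guided_velocity(v, x, t, cond, scale)
        x = [xi + dt * di for xi, di in zip(x, dx)]
    return x


def toy_velocity(x: Vector, t: float, cond: Optional[Vector]) -> Vector:
    """Toy stand-in for a trained model (assumption): constant velocity
    equal to the condition, or zero when no condition is given."""
    return list(cond) if cond is not None else [0.0] * len(x)


if __name__ == "__main__":
    # scale=1.0 reproduces the conditional model; larger values extrapolate
    # away from the unconditional prediction.
    print(sample(toy_velocity, [0.0, 0.0], [1.0, -2.0], scale=1.0, steps=4))
    print(sample(toy_velocity, [0.0, 0.0], [1.0, -2.0], scale=2.0, steps=4))
```

In this form the guidance scale is a sampling-time knob, so the strength of a conditioning signal can be adjusted without retraining, which is in the spirit of the decoupling of learning from control that the abstract mentions.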

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Multimodal Machine Learning Applications · Social Robot Interaction and HRI