SARAH: Spatially Aware Real-time Agentic Humans
Evonne Ng, Siwei Zhang, Zhang Chen, Michael Zollhoefer, Alexander Richard

TL;DR
SARAH introduces a real-time, causal method for creating spatially-aware conversational agents in VR, enabling natural gestures, gaze, and orientation aligned with user movement and speech, suitable for live deployment.
Contribution
The paper presents the first fully causal, streaming VR-compatible approach for spatially-aware conversational motion using a novel transformer-based architecture.
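To make "fully causal" concrete, the sketch below shows generic causal self-attention in NumPy: each frame attends only to itself and earlier frames, so outputs already emitted do not change when new frames stream in. This is a minimal illustration of the general technique, not the paper's architecture; the function name `causal_attention` and the toy dimensions are assumptions made for this example.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    Hypothetical helper for illustration; q, k, v have shape (frames, dim).
    """
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Frame i may only attend to frames j <= i (lower-triangular mask).
    mask = np.tril(np.ones((t, t), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over the allowed (past) frames.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                      # 6 frames, 8 features (toy sizes)
full = causal_attention(x, x, x)                 # all 6 frames available
prefix = causal_attention(x[:4], x[:4], x[:4])   # only the first 4 frames seen
```

Because of the mask, the first four rows of `full` match `prefix`: appending later frames leaves earlier outputs untouched, which is the property streaming inference relies on.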
Findings
Achieves state-of-the-art motion quality at over 300 FPS
Outperforms non-causal baselines in natural spatial dynamics
Validated on a live VR system for real-time deployment
Abstract
As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control:…
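The abstract pairs a flow matching model with classifier-free guidance. As a rough illustration of how those two pieces generally fit together at sampling time, the sketch below integrates a velocity field with Euler steps and blends conditional and unconditional predictions using a guidance scale. It is a generic sketch under stated assumptions, not the authors' implementation: `toy_velocity` stands in for a trained network, and `cond`, `null_cond`, and `guidance_scale` are hypothetical names chosen for this example.

```python
import numpy as np

def guided_velocity(velocity_fn, x, t, cond, null_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    v_uncond = velocity_fn(x, t, null_cond)
    v_cond = velocity_fn(x, t, cond)
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def sample(velocity_fn, x0, cond, null_cond, guidance_scale=1.0, steps=10):
    """Euler integration of the guided velocity field from t=0 to t=1."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * guided_velocity(velocity_fn, x, t, cond, null_cond, guidance_scale)
    return x

def toy_velocity(x, t, cond):
    """Toy stand-in for a learned model (assumption): the velocity of a
    straight-line path that reaches the conditioning target at t=1."""
    return (cond - x) / (1.0 - t)

x0 = np.zeros(3)                        # starting sample (toy)
cond = np.array([1.0, 2.0, 3.0])        # hypothetical conditioning target
null_cond = np.array([0.5, 0.5, 0.5])   # hypothetical "no condition" target

fully_conditioned = sample(toy_velocity, x0, cond, null_cond, guidance_scale=1.0)
unconditioned = sample(toy_velocity, x0, cond, null_cond, guidance_scale=0.0)
```

A scale of 0 follows only the unconditional field and a scale of 1 follows only the conditional one; values in between or above 1 trade off how strongly the condition steers the result. This is the general sense in which guidance lets a control signal be adjusted at inference time without retraining.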
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Multimodal Machine Learning Applications · Social Robot Interaction and HRI
