ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Juze Zhang; Changan Chen; Xin Chen; Heng Yu; Tiange Xiang; Ali Sartaz Khan; Shrinidhi K. Lakshmikanth; Ehsan Adeli

arXiv:2512.14234·cs.CV·April 16, 2026

TL;DR

ViBES is a multimodal conversational agent that jointly plans language and body movements, enabling socially competent 3D interactions with controllable, dialogue-conditioned behaviors.

Contribution

It introduces a multimodal transformer model with modality-specific experts for synchronized speech, facial expressions, and body motion in dialogue.

Findings

01 Achieves better dialogue-motion alignment than baselines.

02 Supports mixed-initiative interaction with controllable behaviors.

03 Demonstrates socially competent 3D virtual body interactions.

Abstract

Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task (co-speech gesture or text-to-motion) that maps a fixed utterance to motion clips, without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body…
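To make the idea of modality-partitioned experts concrete, here is a minimal sketch of hard routing by modality tag. It is not the ViBES implementation: the abstract is truncated before it describes the architecture in detail, so everything below is an assumption for illustration. The names `make_expert`, `route_by_modality`, and `EXPERTS` are hypothetical. Each "expert" is a toy scale-and-bias function standing in for a transformer block. The modality tags cover only the three modalities the abstract names.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Token = Tuple[str, Vector]  # (modality tag, feature vector)


def make_expert(scale: float, bias: float) -> Callable[[Vector], Vector]:
    """Hypothetical stand-in for a modality-specific expert block."""
    def expert(x: Vector) -> Vector:
        return [scale * v + bias for v in x]
    return expert


# One toy expert per modality named in the abstract; parameters are arbitrary.
EXPERTS: Dict[str, Callable[[Vector], Vector]] = {
    "speech": make_expert(2.0, 0.0),
    "face": make_expert(1.0, 1.0),
    "body": make_expert(0.5, -1.0),
}


def route_by_modality(
    tokens: List[Token],
    experts: Dict[str, Callable[[Vector], Vector]],
) -> List[Token]:
    """Send each token to the expert that owns its modality (no learned gate)."""
    routed: List[Token] = []
    for modality, features in tokens:
        if modality not in experts:
            raise KeyError(f"no expert registered for modality {modality!r}")
        routed.append((modality, experts[modality](features)))
    return routed


# An interleaved sequence of speech, face, and body tokens.
sequence: List[Token] = [
    ("speech", [1.0, 2.0]),
    ("face", [0.0, 3.0]),
    ("body", [4.0, 2.0]),
]
routed = route_by_modality(sequence, EXPERTS)
print(routed)
```

The point of the sketch is that routing is decided by the token's modality rather than by a learned gate, so each expert only ever sees tokens of its own type. How ViBES shares information across modalities is not stated in the visible part of the abstract and is left out here.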
