SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

Jianping Jiang; Weiye Xiao; Zhengyu Lin; Huaizhong Zhang; Tianxiang Ren; Yang Gao; Zhiqian Lin; Zhongang Cai; Lei Yang; Ziwei Liu

arXiv:2412.00174·cs.CV·December 3, 2024

Open Access

TL;DR

SOLAMI introduces an end-to-end framework for immersive social interaction with 3D autonomous characters, integrating multimodal perception, response generation, and a VR interface to make interactions more natural and responsive.

Contribution

The paper presents the first comprehensive social VLA modeling framework for 3D characters, including a new dataset and immersive VR interaction system.

Findings

01. More natural and precise character responses
02. Lower latency in social interactions
03. Enhanced user engagement and satisfaction

Abstract

Human beings are social animals. How to equip 3D autonomous characters with similar social intelligence that can perceive, understand and interact with humans remains an open yet fundamental problem. In this paper, we introduce SOLAMI, the first end-to-end Social vision-Language-Action (VLA) Modeling framework for Immersive interaction with 3D autonomous characters. Specifically, SOLAMI builds 3D autonomous characters from three aspects: (1) Social VLA Architecture: We propose a unified social VLA framework to generate multimodal response (speech and motion) based on the user's multimodal input to drive the character for social interaction. (2) Interactive Multimodal Data: We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline using only existing motion datasets to address the issue of data scarcity. (3) Immersive VR Interface: We…

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Multimodal Machine Learning Applications