MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang Gan

TL;DR
MultiPLY is a novel multisensory embodied large language model that actively interacts with 3D environments, integrating visual, audio, tactile, and thermal data to improve performance on various embodied tasks.
Contribution
The paper introduces MultiPLY, a multisensory embodied LLM with a large-scale interaction dataset and a new encoding method for 3D scenes, enabling active environment interaction and multisensory data integration.
Findings
MultiPLY significantly outperforms baselines on embodied tasks.
The Multisensory Universe dataset contains 500k multisensory interaction data points.
Active interaction with multisensory data enhances task performance.
Abstract
Human beings possess the capability to multiply a melange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal large language models, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied large language model that could incorporate multisensory interactive data, including visual, audio, tactile, and thermal information into large language models, thereby establishing the correlation among words, actions, and percepts. To this end, we first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data by deploying an LLM-powered embodied agent to engage with the 3D environment. To perform…
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition
Methods: Sparse Evolutionary Training
