PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI

Yingchen He; Christian D. Weilbach; Martyna E. Wojciechowska; Yuxuan Zhang; Frank Wood

arXiv:2505.12707 · cs.LG · February 19, 2026

TL;DR

PLAICraft introduces a large-scale, multi-modal dataset of multiplayer Minecraft interactions with millisecond precision, enabling research in embodied AI for real-time, socially interactive agents.

Contribution

The paper presents PLAICraft, a comprehensive dataset and platform capturing synchronized vision, audio, and action data from multiplayer Minecraft gameplay, facilitating embodied AI research.

Findings

1. Over 10,000 hours of gameplay data collected from more than 10,000 participants.
2. An evaluation suite for benchmarking object recognition, spatial awareness, language grounding, and memory.
3. The dataset enables training and testing of real-time, socially interactive embodied agents.

Abstract

Advances in deep generative modeling have made it increasingly plausible to train human-level embodied agents. Yet progress has been limited by the absence of large-scale, real-time, multi-modal, and socially interactive datasets that reflect the sensory-motor complexity of natural environments. To address this, we present PLAICraft, a novel data collection platform and dataset capturing multiplayer Minecraft interactions across five time-aligned modalities: video, game output audio, microphone input audio, mouse, and keyboard actions. Each modality is logged with millisecond time precision, enabling the study of synchronous, embodied behaviour in a rich, open-ended world. The dataset comprises over 10,000 hours of gameplay from more than 10,000 global participants. Alongside the dataset, we provide an evaluation suite for benchmarking model capabilities in object recognition, spatial…
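The abstract describes five streams, each logged with millisecond timestamps. As a rough illustration of what that alignment makes possible, the sketch below assigns keyboard and mouse events to the video frame whose time interval contains them. The `Event` record, its fields, the modality labels, and `align_to_frames` are all hypothetical. They are not PLAICraft's actual file format or tooling, which this page does not describe.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical input event: a millisecond timestamp plus a free-form payload."""
    t_ms: int
    modality: str  # assumed labels such as "keyboard" or "mouse"
    payload: str


def align_to_frames(frame_ts_ms: list[int], events: list[Event]) -> list[list[Event]]:
    """Assign each event to the frame i whose interval [t_i, t_{i+1}) contains it.

    Assumes frame_ts_ms is sorted ascending. Events earlier than the first
    frame are dropped; events at or after the last frame timestamp go to the
    last frame. Within a bucket, events keep their input order.
    """
    buckets: list[list[Event]] = [[] for _ in frame_ts_ms]
    for ev in events:
        # bisect_right returns the count of frame timestamps <= ev.t_ms,
        # so subtracting one gives the frame that started at or before the event.
        i = bisect_right(frame_ts_ms, ev.t_ms) - 1
        if i >= 0:
            buckets[i].append(ev)
    return buckets


if __name__ == "__main__":
    frames = [0, 50, 100]  # hypothetical 20 fps frame timestamps in ms
    events = [
        Event(10, "keyboard", "W down"),
        Event(49, "mouse", "dx=3"),
        Event(50, "keyboard", "W up"),
        Event(130, "mouse", "click"),
    ]
    for i, bucket in enumerate(align_to_frames(frames, events)):
        print(i, [e.payload for e in bucket])
```

Binary search keeps each lookup at O(log n) in the number of frames, so unsorted event lists still work. Audio streams could plausibly be sliced the same way by converting frame timestamps to sample indices, though that is likewise an assumption here.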

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization