Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis

Kunyu Feng; Yue Ma; Xinhua Zhang; Boshi Liu; Yikuang Yuluo; Yinhan Zhang; Runtao Liu; Hongyu Liu; Zhiyuan Qin; Shanhui Mo; Qifeng Chen; Zeyu Wang

arXiv:2508.05580 · cs.CV · August 8, 2025

TL;DR

Follow-Your-Instruction is a multimodal LLM-based framework that automatically synthesizes high-quality 2D, 3D, and 4D data, improving the performance of baseline generative models and easing the scalability limits of manual data collection.

Contribution

It introduces a comprehensive MLLM-driven pipeline for automatic multi-dimensional data synthesis, reducing reliance on manual scene construction and improving scalability and quality.

Findings

01. Synthetic data boosts baseline model performance.

02. The framework generates 2D, 3D, and 4D data.

03. The generated data is scalable and of high quality.

Abstract

With the growing demand for AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, hindering the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, limiting their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Follow-Your-Instruction first collects assets and their associated descriptions through multimodal inputs using the MLLM-Collector. It then constructs 3D layouts and leverages Vision-Language Models (VLMs) for semantic refinement through…
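The three stages named in the abstract (asset collection, 3D layout construction, semantic refinement) can be pictured as a simple function pipeline. The sketch below is only an illustration of that staging, not the authors' implementation: every name in it (`Asset`, `Placement`, `collect_assets`, `construct_layout`, `refine_layout`, `synthesize`) is hypothetical, and the MLLM and VLM calls are replaced by trivial rule-based stand-ins so the code runs on its own.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Asset:
    """Hypothetical record for one collected asset and its description."""
    name: str
    description: str


@dataclass
class Placement:
    """Hypothetical record placing an asset at an (x, y, z) position."""
    asset: Asset
    position: Tuple[float, float, float]


def collect_assets(instruction: str) -> List[Asset]:
    # Stand-in for the MLLM-Collector (assumption): here we just split a
    # comma-separated instruction into one asset per item.
    names = [part.strip() for part in instruction.split(",") if part.strip()]
    return [Asset(name=n, description=f"a {n}") for n in names]


def construct_layout(assets: List[Asset], spacing: float = 1.0) -> List[Placement]:
    # Stand-in for 3D layout construction (assumption): place the assets
    # on a line along the x axis, `spacing` units apart.
    return [Placement(a, (i * spacing, 0.0, 0.0)) for i, a in enumerate(assets)]


def refine_layout(layout: List[Placement], min_gap: float = 1.5) -> List[Placement]:
    # Stand-in for VLM-based semantic refinement (assumption): push objects
    # apart so consecutive assets are at least `min_gap` apart along x.
    refined: List[Placement] = []
    for placement in sorted(layout, key=lambda p: p.position[0]):
        x, y, z = placement.position
        if refined:
            x = max(x, refined[-1].position[0] + min_gap)
        refined.append(Placement(placement.asset, (x, y, z)))
    return refined


def synthesize(instruction: str) -> List[Placement]:
    # Chain the three stages in the order the abstract lists them.
    return refine_layout(construct_layout(collect_assets(instruction)))


if __name__ == "__main__":
    for p in synthesize("table, chair, lamp"):
        print(p.asset.name, p.position)
```

In a real system each stand-in would call a model and the refined layout would be handed to a renderer; the point of the sketch is only that each stage consumes the previous stage's output.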
