DataArc-SynData-Toolkit: A Unified Closed-Loop Framework for Multi-Path, Multimodal, and Multilingual Data Synthesis
Zhichao Shi, Cehao Yang, Hao Zhou, Xiaojun Wu, Huajie Li, Xuhui Jiang, Chengjin Xu, Yuanzhuo Wang, Jian Guo

TL;DR
DataArc-SynData-Toolkit is an open-source, user-friendly framework that streamlines multi-path, multimodal, and multilingual synthetic data generation for large language models, enhancing scalability and usability.
Contribution
It introduces a unified, configurable, and modular toolkit with an intuitive interface that simplifies synthetic data creation across diverse modalities and languages.
Findings
Achieves a balance between generation efficiency and data quality.
Reduces technical barriers with an end-to-end visual pipeline.
Demonstrates effectiveness in multiple application scenarios.
Abstract
Synthetic data has emerged as a crucial solution to the data scarcity bottleneck in large language models (LLMs), particularly for specialized domains and low-resource languages. However, the broader adoption of existing synthetic data tools is severely hindered by convoluted workflows, fragmented data standards, and limited scalability across modalities. To address these limitations, we develop DataArc-SynData-Toolkit, an open-source framework featuring: (1) a configuration-driven, end-to-end pipeline equipped with an intuitive visual interface and simplified CLI for exceptional usability; (2) a unified, quality-controllable synthesis paradigm that standardizes multi-source data generation to ensure high reusability; and (3) a highly modular architecture designed for seamless multimodal, multilingual, and multi-task adaptation. We apply the toolkit in multiple application scenarios.…
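As a rough illustration of what a configuration-driven, modular synthesis pipeline with a quality gate might look like, here is a minimal sketch. It is not the toolkit's actual API: `GENERATORS`, `register`, `PipelineConfig`, `run_pipeline`, and the length-based quality check are all hypothetical names and simplifications introduced for this example, and the template generator stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical registry mapping a generator name (as it would appear in a
# config file) to a function. New synthesis paths plug in by registering.
GENERATORS: Dict[str, Callable[[str, int], List[str]]] = {}


def register(name: str):
    """Decorator that adds a generator function to the registry."""
    def wrap(fn: Callable[[str, int], List[str]]):
        GENERATORS[name] = fn
        return fn
    return wrap


@register("template")
def template_generator(topic: str, n: int) -> List[str]:
    # Stand-in for an LLM-backed generator (assumption for illustration).
    return [f"Question {i + 1} about {topic}?" for i in range(n)]


@register("echo")
def echo_generator(topic: str, n: int) -> List[str]:
    # Trivial second path, showing that generators are interchangeable.
    return [topic] * n


@dataclass
class PipelineConfig:
    generator: str    # which registered synthesis path to use
    topic: str        # seed topic or task description
    num_samples: int  # how many candidates to generate
    min_length: int   # toy quality threshold: minimum sample length


def run_pipeline(cfg: PipelineConfig) -> List[str]:
    """Generate candidates via the configured path, then apply the quality gate."""
    candidates = GENERATORS[cfg.generator](cfg.topic, cfg.num_samples)
    return [s for s in candidates if len(s) >= cfg.min_length]


if __name__ == "__main__":
    cfg = PipelineConfig("template", "low-resource translation", 3, 10)
    for sample in run_pipeline(cfg):
        print(sample)
```

The point of the sketch is the separation of concerns: the configuration selects the synthesis path and the quality threshold, so swapping generators or tightening the filter requires no change to the pipeline code itself.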
