DataArc-SynData-Toolkit: A Unified Closed-Loop Framework for Multi-Path, Multimodal, and Multilingual Data Synthesis

Zhichao Shi; Cehao Yang; Hao Zhou; Xiaojun Wu; Huajie Li; Xuhui Jiang; Chengjin Xu; Yuanzhuo Wang; Jian Guo

arXiv:2605.08138·cs.LG·May 12, 2026

TL;DR

DataArc-SynData-Toolkit is an open-source, user-friendly framework that streamlines multi-path, multimodal, and multilingual synthetic data generation for large language models, enhancing scalability and usability.

Contribution

It introduces a unified, configurable, and modular toolkit with an intuitive interface that simplifies synthetic data creation across diverse modalities and languages.

Findings

01. Achieves a balance between generation efficiency and data quality.

02. Reduces technical barriers with an end-to-end visual pipeline.

03. Demonstrates effectiveness in multiple application scenarios.

Abstract

Synthetic data has emerged as a crucial solution to the data scarcity bottleneck in large language models (LLMs), particularly for specialized domains and low-resource languages. However, the broader adoption of existing synthetic data tools is severely hindered by convoluted workflows, fragmented data standards, and limited scalability across modalities. To address these limitations, we develop DataArc-SynData-Toolkit, an open-source framework featuring: (1) a configuration-driven, end-to-end pipeline equipped with an intuitive visual interface and simplified CLI for exceptional usability; (2) a unified, quality-controllable synthesis paradigm that standardizes multi-source data generation to ensure high reusability; and (3) a highly modular architecture designed for seamless multimodal, multilingual, and multi-task adaptation. We apply the toolkit in multiple application scenarios.…
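
As a rough illustration of what a configuration-driven pipeline with a unified record format and a quality gate can look like, here is a minimal Python sketch. It is not the toolkit's actual API: every name in it (`Record`, `from_seed_examples`, `from_documents`, `PATHS`, `run_pipeline`, and the config keys) is hypothetical, and the string templates and fixed quality scores are stand-ins for real LLM generation and scoring.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Record:
    """Hypothetical unified schema shared by every synthesis path."""
    source: str
    language: str
    text: str
    quality: float


def from_seed_examples(cfg: dict) -> Iterable[Record]:
    # Stand-in for LLM-based expansion of seed examples.
    for seed in cfg["seeds"]:
        yield Record("seed", cfg["language"], f"Paraphrase of: {seed}", 0.9)


def from_documents(cfg: dict) -> Iterable[Record]:
    # Stand-in for document-grounded generation; very short documents
    # get a low placeholder score so the quality gate has work to do.
    for doc in cfg["documents"]:
        score = 0.4 if len(doc) < 10 else 0.8
        yield Record("document", cfg["language"], f"Question about: {doc}", score)


# Registry mapping a config "type" to its generator (assumed design).
PATHS: Dict[str, Callable[[dict], Iterable[Record]]] = {
    "seed": from_seed_examples,
    "document": from_documents,
}


def run_pipeline(config: dict) -> List[Record]:
    """Run every configured path, then keep records above the threshold."""
    records: List[Record] = []
    for path_cfg in config["paths"]:
        records.extend(PATHS[path_cfg["type"]](path_cfg))
    return [r for r in records if r.quality >= config["min_quality"]]


CONFIG = {
    "min_quality": 0.5,
    "paths": [
        {"type": "seed", "language": "en", "seeds": ["a", "b"]},
        {"type": "document", "language": "sw",
         "documents": ["short", "a much longer document"]},
    ],
}

kept = run_pipeline(CONFIG)
```

Because every path emits the same `Record` type, a single filter can serve all sources, and supporting a new source or language only requires registering another generator and adding a config entry.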
