Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models
Daoyuan Chen, Yilun Huang, Xuchen Pan, Nana Jiang, Haibin Wang, Yilei Zhang, Ce Ge, Yushuo Chen, Wenhao Zhang, Zhijian Ma, Jun Huang, Wei Lin, Yaliang Li, Bolin Ding, Jingren Zhou

TL;DR
Data-Juicer 2.0 is a scalable, versatile data processing system designed for multimodal datasets, enhancing foundation model workflows with improved efficiency, usability, and compatibility across large-scale environments.
Contribution
It introduces a comprehensive, adaptive data processing framework supporting multimodal data and integrates seamlessly with popular data hubs and computing engines, advancing prior systems in scalability and usability.
Findings
Efficiently processes TB-level data on 10k+ CPU cores.
Supports diverse data modalities including text, image, video, and audio.
Widely adopted in research and industry, including Alibaba Cloud PAI.
Abstract
Foundation models demand advanced data processing for their vast, multimodal datasets. However, traditional frameworks struggle with the unique complexities of multimodal data. In response, we present Data-Juicer 2.0, a data processing system backed by 100+ data processing operators spanning text, image, video, and audio modalities, supporting more critical tasks including data analysis, synthesis, annotation, and foundation model post-training. With seamless compatibility and dedicated optimization for popular dataset hubs like Hugging Face and computing engines like Ray, it improves upon its predecessor in terms of usability, efficiency, and programmability. It features an easily accessible user interface layer that supports decoupled Python interactions, RESTful APIs, and conversational commands. Its new runtime layer offers adaptive execution across diverse scales and environments,…
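The operator-based design described in the abstract can be illustrated with a minimal sketch. This is not Data-Juicer's actual API: the class names Mapper and Filter, the function run_pipeline, and the sample layout (a dict with "text" and "images" keys) are all assumptions made for illustration. The sketch only shows the general idea of chaining small, composable operators over a stream of multimodal samples.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List

# Assumed sample layout: one record per dict, e.g. {"text": ..., "images": [...]}.
Sample = dict


@dataclass
class Mapper:
    """Hypothetical operator that transforms every sample."""
    fn: Callable[[Sample], Sample]

    def run(self, samples: Iterable[Sample]) -> Iterator[Sample]:
        for sample in samples:
            yield self.fn(sample)


@dataclass
class Filter:
    """Hypothetical operator that keeps only samples passing a predicate."""
    keep: Callable[[Sample], bool]

    def run(self, samples: Iterable[Sample]) -> Iterator[Sample]:
        for sample in samples:
            if self.keep(sample):
                yield sample


def run_pipeline(samples: Iterable[Sample], operators: list) -> List[Sample]:
    """Chain operators lazily, then materialize the final stream."""
    stream: Iterable[Sample] = iter(samples)
    for op in operators:
        stream = op.run(stream)
    return list(stream)


def strip_text(sample: Sample) -> Sample:
    """Return a copy of the sample with surrounding whitespace removed from its text."""
    return {**sample, "text": sample["text"].strip()}


# Illustrative pipeline: clean the text, then drop samples with very short text.
demo_operators = [
    Mapper(strip_text),
    Filter(lambda s: len(s["text"]) >= 5),
]

demo_samples = [
    {"text": "  a photo of a cat  ", "images": ["cat.jpg"]},
    {"text": " hi ", "images": []},
]

demo_result = run_pipeline(demo_samples, demo_operators)
```

In this sketch each operator consumes and yields a stream, so the same pipeline definition could in principle be handed to a different executor (for example a distributed engine) without changing the operators themselves; how Data-Juicer 2.0 actually does this is described in the paper, not here.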
Taxonomy
Topics: Advanced Computational Techniques and Applications · Distributed and Parallel Computing Systems · Advanced Database Systems and Queries
