Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development
Daoyuan Chen, Haibin Wang, Yilun Huang, Ce Ge, Yaliang Li, Bolin Ding, Jingren Zhou

TL;DR
The paper introduces Data-Juicer Sandbox, a feedback-driven platform for integrated data and model co-development in multimodal AI, enabling efficient iteration, performance improvements, and insights into data-model interactions.
Contribution
It presents a novel sandbox suite with a 'Probe-Analyze-Refine' workflow for co-developing multimodal models, validated through practical use cases and extensive experiments.
Findings
Performance boosts on multimodal tasks, including topping the VBench leaderboard.
Demonstrated usability and extensibility through over 100 experiments.
Insights into data quality, diversity, and computational costs affecting model behavior.
Abstract
The emergence of multimodal large models has advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging due to historically isolated paths of model-centric and data-centric developments, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a new sandbox suite tailored for integrated data-model co-development. This sandbox provides a feedback-driven experimental platform, enabling cost-effective iteration and guided refinement of both data and models. Our proposed 'Probe-Analyze-Refine' workflow, validated through practical use cases on multimodal tasks such as image-text pre-training with CLIP, image-to-text generation with LLaVA-like models, and text-to-video generation with DiT-based models, yields transferable and notable performance boosts, such as…
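The feedback loop named in the abstract can be pictured with a toy sketch. This is not the Data-Juicer Sandbox API and not the authors' implementation: every name below (`evaluate`, `probe`, `analyze`, `refine`, the filters, the `quality` field) is hypothetical, and the "model feedback" is replaced by a trivial averaging function so the example stays self-contained. It only illustrates the general idea of probing candidate data operations one at a time, ranking them by feedback, and combining the helpful ones.

```python
from typing import Callable, Dict, List

Sample = Dict[str, float]
Filter = Callable[[Sample], bool]


def evaluate(pool: List[Sample]) -> float:
    """Hypothetical stand-in for 'train a small model and benchmark it'.

    Here the feedback signal is just the mean of a made-up 'quality' field.
    """
    if not pool:
        return 0.0
    return sum(s["quality"] for s in pool) / len(pool)


def probe(data: List[Sample], filters: Dict[str, Filter]) -> Dict[str, float]:
    """Apply each candidate filter on its own and record the feedback score."""
    return {name: evaluate([s for s in data if f(s)]) for name, f in filters.items()}


def analyze(scores: Dict[str, float], baseline: float) -> List[str]:
    """Rank filters by gain over the unfiltered baseline; keep positive gains."""
    gains = {name: score - baseline for name, score in scores.items()}
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, gain in ranked if gain > 0]


def refine(data: List[Sample], filters: Dict[str, Filter], selected: List[str]) -> List[Sample]:
    """Combine the selected filters into one recipe and apply it to the data."""
    return [s for s in data if all(filters[name](s) for name in selected)]


# Toy dataset: 'length' and 'quality' are invented attributes for illustration.
data = [
    {"length": 5, "quality": 0.2},
    {"length": 20, "quality": 0.8},
    {"length": 40, "quality": 0.9},
    {"length": 200, "quality": 0.3},
]

# Hypothetical candidate data operations.
filters = {
    "min_length": lambda s: s["length"] >= 10,
    "max_length": lambda s: s["length"] <= 100,
    "keep_all": lambda s: True,
}

baseline = evaluate(data)
scores = probe(data, filters)
selected = analyze(scores, baseline)
refined = refine(data, filters, selected)

print("selected filters:", selected)
print("baseline score:", baseline, "refined score:", evaluate(refined))
```

In a real setting the `evaluate` step would be the expensive part (training and benchmarking a model on each data pool), which is why probing on small pools before committing to a full-scale recipe is attractive.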
Taxonomy
Topics: Service-Oriented Architecture and Web Services
