MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines
Zhaojing Luo, Sai Ho Yeung, Meihui Zhang, Kaiping Zheng, Lei Zhu, Gang Chen, Feiyi Fan, Qian Lin, Kee Yuan Ngiam, Beng Chin Ooi

TL;DR
MLCask introduces a version-controlled system for managing evolving machine learning pipelines in collaborative environments, enabling efficient merging, storage savings, and performance optimization.
Contribution
It presents a novel versioning approach with Git-like branching and merging tailored for ML pipelines, including optimized merge operations and prioritized search.
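As a rough illustration of why Git-like versioning of pipeline components can save both storage and recomputation, the sketch below keys each component's output by a hash of its upstream version, its name, and its own version. It is not MLCask's implementation: `digest`, `Store`, `run_pipeline`, and the component names are all hypothetical, and the merge optimization and prioritized search mentioned above are not modeled.

```python
import hashlib
from dataclasses import dataclass, field


def digest(*parts: str) -> str:
    """Content hash used as a version identifier (hypothetical scheme)."""
    return hashlib.sha256("\x00".join(parts).encode()).hexdigest()[:12]


@dataclass
class Store:
    """Content-addressed store: identical component outputs are kept once."""
    blobs: dict = field(default_factory=dict)

    def put(self, key: str, value: str) -> str:
        self.blobs.setdefault(key, value)
        return key


def run_pipeline(components, store, source="raw-data"):
    """Run components in order, reusing any output already in the store."""
    upstream = digest(source)
    executed = []
    for name, version in components:
        # The key depends on everything upstream, so a change in an early
        # component invalidates all later outputs, and nothing else.
        key = digest(upstream, name, version)
        if key not in store.blobs:
            store.put(key, f"output of {name}@{version}")
            executed.append(name)
        upstream = key
    return upstream, executed


# Two branches that differ only in the final component (assumed example).
main_branch = [("clean", "v1"), ("features", "v1"), ("model", "v1")]
dev_branch = [("clean", "v1"), ("features", "v1"), ("model", "v2")]

store = Store()
head_main, ran_main = run_pipeline(main_branch, store)
head_dev, ran_dev = run_pipeline(dev_branch, store)
```

In this toy setup the second branch re-executes only the changed `model` component, and the store holds four outputs instead of six, because the shared pre-processing results are stored once.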
Findings
Merge operation up to 7.8x faster
Storage savings up to 11.9x
Effective in real-world deployment cases
Abstract
With the ever-increasing adoption of machine learning for data analytics, maintaining a machine learning pipeline is becoming more complex as both the datasets and trained models evolve with time. In a collaborative environment, the changes and updates due to pipeline evolution often cause cumbersome coordination and maintenance work, raising the costs and making it hard to use. Existing solutions, unfortunately, do not address the version evolution problem, especially in a collaborative environment where non-linear version control semantics are necessary to isolate operations made by different user roles. The lack of version control semantics also incurs unnecessary storage consumption and lowers efficiency due to data duplication and repeated data pre-processing, which are avoidable. In this paper, we identify two main challenges that arise during the deployment of machine learning…
Taxonomy
Topics: Advanced Data Storage Technologies · Data Quality and Management · Scientific Computing and Data Management
Methods: Pruning
