Data Virtualization for Machine Learning
Saiful Khan, Joyraj Chakraborty, Philip Beaucamp, Niraj Bhujel, Min Chen

TL;DR
This paper presents a data virtualization service designed to support multiple machine learning workflows, enabling efficient data management and scalability across various applications.
Contribution
It introduces a novel data virtualization architecture tailored for ML workflows, enhancing data accessibility and organizational efficiency.
Findings
Supports six ML applications with multiple workflows
Enables scalable growth for future applications
Improves data management efficiency in ML pipelines
Abstract
Nowadays, machine learning (ML) teams have multiple concurrent ML workflows for different applications. Each workflow typically involves many experiments, iterations, and collaborative activities and commonly takes months and sometimes years from initial data wrangling to model deployment. Organizationally, there is a large amount of intermediate data to be stored, processed, and maintained. Data virtualization becomes a critical technology in an infrastructure to serve ML workflows. In this paper, we present the design and implementation of a data virtualization service, focusing on its service architecture and service operations. The infrastructure currently supports six ML applications, each with more than one ML workflow. The data virtualization service allows the number of applications and workflows to grow in the coming years.
