Data Virtualization for Machine Learning

Saiful Khan; Joyraj Chakraborty; Philip Beaucamp; Niraj Bhujel; Min Chen

arXiv:2507.17293 · cs.SE · September 19, 2025

TL;DR

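This paper presents the design and implementation of a data virtualization service for machine learning teams that run several workflows concurrently. The service currently supports six ML applications, each with more than one workflow, and is intended to accommodate more applications and workflows over time.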

Contribution

It describes the service architecture and service operations of a data virtualization service built to store, process, and maintain the large amount of intermediate data that ML workflows produce.

Findings

01 Supports six ML applications with multiple workflows
02 Enables scalable growth for future applications
03 Improves data management efficiency in ML pipelines

Abstract

Nowadays, machine learning (ML) teams have multiple concurrent ML workflows for different applications. Each workflow typically involves many experiments, iterations, and collaborative activities and commonly takes months and sometimes years from initial data wrangling to model deployment. Organizationally, there is a large amount of intermediate data to be stored, processed, and maintained. Data virtualization becomes a critical technology in an infrastructure to serve ML workflows. In this paper, we present the design and implementation of a data virtualization service, focusing on its service architecture and service operations. The infrastructure currently supports six ML applications, each with more than one ML workflow. The data virtualization service allows the number of applications and workflows to grow in the coming years.
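
The abstract does not describe the service's interface, so the following is only a minimal sketch of the general idea behind data virtualization, not the authors' design. It assumes a catalog that maps a logical dataset name, scoped by application and workflow, to a physical storage location, so that workflow code never hard-codes where data lives. The class `VirtualCatalog`, its methods, and all application names and paths are hypothetical.

```python
from typing import Dict, List, Tuple

# Logical key: (application, workflow, dataset name). All names are illustrative.
Key = Tuple[str, str, str]


class VirtualCatalog:
    """Hypothetical catalog mapping logical dataset names to physical locations."""

    def __init__(self) -> None:
        self._entries: Dict[Key, str] = {}

    def register(self, application: str, workflow: str, name: str, location: str) -> None:
        """Record where a logical dataset is physically stored."""
        self._entries[(application, workflow, name)] = location

    def resolve(self, application: str, workflow: str, name: str) -> str:
        """Return the physical location for a logical dataset name."""
        key = (application, workflow, name)
        if key not in self._entries:
            raise KeyError(f"no dataset registered for {key}")
        return self._entries[key]

    def relocate(self, application: str, workflow: str, name: str, new_location: str) -> None:
        """Move a dataset to new storage without changing its logical name."""
        self.resolve(application, workflow, name)  # raises KeyError if unknown
        self._entries[(application, workflow, name)] = new_location

    def workflows(self, application: str) -> List[str]:
        """List the workflows that have registered data for an application."""
        return sorted({w for (a, w, _n) in self._entries if a == application})


# Usage: two workflows of one made-up application share the catalog.
catalog = VirtualCatalog()
catalog.register("app-a", "wrangling", "clean", "file:///data/app-a/clean.parquet")
catalog.register("app-a", "training", "features", "file:///data/app-a/features.parquet")

# The physical location changes, the logical name used by workflow code does not.
catalog.relocate("app-a", "training", "features", "s3://example-bucket/app-a/features.parquet")
location = catalog.resolve("app-a", "training", "features")
```

The point of the indirection is that adding an application or workflow only adds catalog entries; existing workflow code keeps resolving the same logical names even when storage is reorganized.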
