Dataset Lifecycle Framework and its applications in Bioinformatics
Yiannis Gkoufas (1), David Yu Yuan (2) ((1) IBM Research - Ireland, (2) Technology, Science Integration, European Bioinformatics Institute, European Molecular Biology Laboratory)

TL;DR
This paper introduces the Dataset, a native Kubernetes resource backed by the Dataset Lifecycle Framework, to improve data management, scalability, and performance in bioinformatics pipelines, for both ML and non-ML workloads on Kubeflow.
Contribution
The paper presents a novel Dataset resource and lifecycle framework for Kubernetes, enabling efficient data access, caching, and management in bioinformatics workflows.
Findings
Enhanced data scalability for ML training without local downloads
Improved durability of training metadata using datasets
Significant performance gains from pluggable caching mechanisms
Abstract
Bioinformatics pipelines depend on shared POSIX filesystems for their input, output and intermediate data storage. Containerization makes it more difficult for workloads to access these shared filesystems. In our previous study, we were able to run both ML and non-ML pipelines on Kubeflow successfully. However, the storage solutions were complex and suboptimal, because there is no established resource type to represent the concept of a data source on Kubernetes. As more and more applications run on Kubernetes for batch processing, end users are burdened with configuring and optimising data access, as we experienced ourselves. In this article, we introduce a new concept of Dataset and its corresponding resource as a native Kubernetes object. We have leveraged the Dataset Lifecycle Framework which takes care of all the low-level details about data…
Taxonomy
Topics: Gene expression and cancer classification · Scientific Computing and Data Management · Genetics, Bioinformatics, and Biomedical Research
