Mapping Datasets to Object Storage System
Xiaowei (Aaron) Chu, Jeff LeFevre, Aldrin Montana, Dana Robinson, Quincey Koziol, Peter Alvaro, Carlos Maltzahn

TL;DR
This paper proposes a distributed dataset mapping infrastructure using Ceph to enhance access libraries for large datasets and modern storage devices, enabling scalable, flexible, and optimized data access.
Contribution
It introduces a novel approach to integrate existing access libraries with distributed storage systems like Ceph without re-implementation, supporting new storage devices and optimizing server-local operations.
Findings
Offloading access library operations to storage servers
Enabling independent evolution of access libraries and storage systems
Supporting storage server-local optimizations for new devices
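To make the offloading idea concrete, the sketch below shows one way a slicing request against a chunked dataset could be translated into per-object sub-requests, so that each storage server only has to slice the objects it holds. This is an illustration under stated assumptions, not the paper's implementation: the function `chunks_for_slice`, the fixed chunk grid, and the `dataset.chunk.<i>.<j>` object naming scheme are all hypothetical, and no Ceph API is used.

```python
from itertools import product


def chunks_for_slice(shape, chunk, start, stop):
    """Map a hyper-rectangular selection to per-object local ranges.

    shape, chunk, start, stop are equal-length tuples; stop is exclusive.
    Returns {object_name: ((lo, hi), ...)} with ranges relative to each chunk.
    The object naming scheme is hypothetical.
    """
    for dim, (n, lo, hi) in enumerate(zip(shape, start, stop)):
        if not (0 <= lo < hi <= n):
            raise ValueError(f"selection out of bounds in dimension {dim}")

    # Range of chunk indices touched by the selection in each dimension.
    touched = [
        range(lo // c, (hi - 1) // c + 1)
        for c, lo, hi in zip(chunk, start, stop)
    ]

    mapping = {}
    for idx in product(*touched):
        local = []
        for i, c, lo, hi in zip(idx, chunk, start, stop):
            base = i * c  # global coordinate where this chunk begins
            local.append((max(lo, base) - base, min(hi, base + c) - base))
        name = "dataset.chunk." + ".".join(str(i) for i in idx)
        mapping[name] = tuple(local)
    return mapping


if __name__ == "__main__":
    # Rows 40..59 and columns 10..19 of a 100x100 dataset in 50x50 chunks.
    plan = chunks_for_slice((100, 100), (50, 50), (40, 10), (60, 20))
    for obj, ranges in plan.items():
        print(obj, ranges)
```

In this sketch the client computes only the plan. In an offloaded design, each entry would be sent to the server holding that object, which would run the slice locally and return just the selected region.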
Abstract
Access libraries such as ROOT and HDF5 allow users to interact with datasets using high-level abstractions, like coordinate systems and associated slicing operations. Unfortunately, the implementations of access libraries are based on outdated assumptions about storage system interfaces and are generally unable to fully benefit from modern fast storage devices. The situation is getting worse with rapidly evolving storage devices such as non-volatile memory and ever larger datasets. This project explores distributed dataset mapping infrastructures that can integrate and scale out existing access libraries using Ceph's extensible object model, avoiding re-implementation or even modification of these access libraries as much as possible. These programmable storage extensions, coupled with our distributed dataset mapping techniques, enable: 1) access library operations to be offloaded to…
Taxonomy
Topics: Advanced Data Storage Technologies · Cloud Computing and Resource Management · Distributed systems and fault tolerance
