Towards an Arrow-native Storage System
Jayjeet Chakraborty, Ivo Jimenez, Sebastiaan Alvarez Rodriguez, Alexandru Uta, Jeff LeFevre, Carlos Maltzahn

TL;DR
This paper proposes a new design paradigm for extending programmable object storage systems to embed data processing frameworks, enabling efficient data filtering and processing at the storage layer without extensive modifications.
Contribution
It introduces a flexible approach allowing independent evolution of storage systems and data processing frameworks, demonstrated with a Ceph, Apache Arrow, and Parquet implementation.
Findings
Embedded data processing frameworks into the storage system with minimal modifications
Enabled offloading of data processing tasks to the storage layer
Demonstrated promising performance improvements
Abstract
With ever-increasing dataset sizes, several file formats like Parquet, ORC, and Avro have been developed to store data efficiently and to save network and interconnect bandwidth at the price of additional CPU utilization. However, with the advent of networks supporting 25-100 Gb/s and storage devices delivering 1,000,000 reqs/sec, the CPU has become the bottleneck, trying to keep up feeding data in and out of these fast devices. The result is that data access libraries executed on single clients are often CPU-bound and cannot utilize the scale-out benefits of distributed storage systems. One attractive solution to this problem is to offload data-reducing processing and filtering tasks to the storage layer. However, modifying legacy storage systems to support compute offloading is often tedious and requires extensive understanding of the internals. Previous approaches re-implemented…
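The offloading idea in the abstract can be illustrated with a toy model. The sketch below is not the paper's Ceph/Arrow implementation: `StorageNode`, `client_side_scan`, and `offloaded_scan` are hypothetical names, JSON stands in for a real wire format, and rows are plain dictionaries. It only shows why filtering and projecting on the storage side can reduce the bytes a client has to receive and decode.

```python
import json
from dataclasses import dataclass


@dataclass
class StorageNode:
    """Hypothetical storage node holding one fragment of a dataset."""
    rows: list

    def read_all(self) -> bytes:
        # Baseline path: ship every row and every column to the client.
        return json.dumps(self.rows).encode()

    def scan(self, columns, predicate) -> bytes:
        # Offloaded path: filter and project on the node, ship only the result.
        kept = [{c: r[c] for c in columns} for r in self.rows if predicate(r)]
        return json.dumps(kept).encode()


def client_side_scan(nodes, columns, predicate):
    """Fetch everything, then filter and project on the client."""
    result, transferred = [], 0
    for node in nodes:
        payload = node.read_all()
        transferred += len(payload)
        for row in json.loads(payload):
            if predicate(row):
                result.append({c: row[c] for c in columns})
    return result, transferred


def offloaded_scan(nodes, columns, predicate):
    """Ask each node to filter and project before sending."""
    result, transferred = [], 0
    for node in nodes:
        payload = node.scan(columns, predicate)
        transferred += len(payload)
        result.extend(json.loads(payload))
    return result, transferred


# Four nodes, 1000 rows each; the "payload" column is never requested.
nodes = [
    StorageNode([{"id": i, "value": i % 100, "payload": "x" * 32}
                 for i in range(k * 1000, (k + 1) * 1000)])
    for k in range(4)
]
columns = ["id", "value"]
selective = lambda row: row["value"] < 5  # keeps roughly 5% of the rows

rows_client, bytes_client = client_side_scan(nodes, columns, selective)
rows_offload, bytes_offload = offloaded_scan(nodes, columns, selective)

print("client-side bytes transferred:", bytes_client)
print("offloaded bytes transferred:  ", bytes_offload)
```

Both paths return the same rows. The difference is where the filtering work happens and how much data crosses the network, which is the trade-off the abstract describes.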
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
