Skyhook: Towards an Arrow-Native Storage System
Jayjeet Chakraborty, Ivo Jimenez, Sebastiaan Alvarez Rodriguez, Alexandru Uta, Jeff LeFevre, Carlos Maltzahn

TL;DR
Skyhook introduces a novel design paradigm enabling existing data processing frameworks to be embedded directly into storage systems like Ceph, enhancing data access efficiency without modifying legacy storage or processing software.
Contribution
It proposes a flexible, scalable approach for integrating data processing into storage layers, avoiding duplication and enabling independent evolution of storage and processing frameworks.
Findings
Skyhook achieves improved data processing efficiency.
It allows seamless integration of frameworks into storage systems.
Performance evaluation shows promising results.
Abstract
With ever-increasing dataset sizes, several file formats such as Parquet, ORC, and Avro have been developed to store data efficiently and save network and interconnect bandwidth at the price of additional CPU utilization. However, with the advent of networks supporting 25-100 Gb/s and storage devices delivering 1,000,000 reqs/sec, the CPU has become the bottleneck, struggling to keep up with feeding data in and out of these fast devices. The result is that data access libraries executed on single clients are often CPU-bound and cannot utilize the scale-out benefits of distributed storage systems. One attractive solution to this problem is to offload data-reducing processing and filtering tasks to the storage layer. However, modifying legacy storage systems to support compute offloading is often tedious and requires an extensive understanding of the system internals. Previous approaches…
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
