Stocator: A High Performance Object Store Connector for Spark
Gil Vernik, Michael Factor, Elliot K. Kolodner, Pietro Michiardi, Effi Ofer, Francesco Pace

TL;DR
Stocator is a high-performance Spark connector that leverages object store semantics to significantly improve write speeds and reduce operational costs compared to traditional connectors.
Contribution
It introduces a novel approach that avoids costly rename operations by exploiting atomic object creation, enhancing performance and simplicity.
Findings
Up to 18x faster write performance
30x fewer object store operations
Reduced costs for clients and providers
Abstract
We present Stocator, a high performance object store connector for Apache Spark, that takes advantage of object store semantics. Previous connectors have assumed file system semantics, in particular, achieving fault tolerance and allowing speculative execution by creating temporary files to avoid interference between worker threads executing the same task and then renaming these files. Rename is not a native object store operation; not only is it not atomic, but it is implemented using a costly copy operation and a delete. Instead, our connector leverages the inherent atomicity of object creation, and by avoiding the rename paradigm it greatly decreases the number of operations on the object store and enables a much simpler approach to dealing with the eventually consistent semantics typical of object stores. We have implemented Stocator and shared it in open source. Performance…
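The difference between the two write paths described in the abstract can be illustrated with a toy model. The sketch below is not Stocator's code: `CountingStore`, `write_with_rename`, `write_direct`, and the object naming scheme are all hypothetical, and the store is an in-memory dictionary that only counts operations. It assumes that a rename costs one copy plus one delete, as the abstract states, and that a direct writer can make names unique by embedding an attempt identifier.

```python
class CountingStore:
    """Toy in-memory object store (hypothetical) that counts operations."""

    def __init__(self):
        self.objects = {}
        self.ops = 0

    def put(self, key, data):
        # Creating an object is a single operation.
        self.ops += 1
        self.objects[key] = data

    def copy(self, src, dst):
        self.ops += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.ops += 1
        del self.objects[key]


def rename(store, src, dst):
    # Object stores have no native rename: emulate it with copy + delete.
    store.copy(src, dst)
    store.delete(src)


def write_with_rename(store, parts):
    # File-system style: write to a temporary name, then rename on commit.
    for i, data in enumerate(parts):
        tmp = f"out/_temporary/attempt_{i}/part-{i}"
        store.put(tmp, data)
        rename(store, tmp, f"out/part-{i}")


def write_direct(store, parts):
    # Object-store style: write once to a final name that embeds the
    # attempt id (an assumed naming scheme), so attempts never collide.
    for i, data in enumerate(parts):
        store.put(f"out/part-{i}-attempt_{i}", data)


parts = [b"a", b"b", b"c", b"d"]
renaming, direct = CountingStore(), CountingStore()
write_with_rename(renaming, parts)
write_direct(direct, parts)
print(renaming.ops, direct.ops)
```

In this model each part costs three operations on the rename path (put, copy, delete) and one on the direct path. Real connectors also issue listing and metadata requests around each rename, so the toy ratio should not be read as a reproduction of the figures reported in the paper.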
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Scientific Computing and Data Management
