A milestone for FaaS pipelines; object storage vs VM-driven data exchange
Germán T. Eizaguirre, Marc Sánchez-Artigas, Pedro García-López

TL;DR
This paper compares object storage and VM-driven data exchange in serverless function workflows. It shows that object storage can be an effective data passing method when shuffle stages use an appropriate number of functions, which challenges conventional assumptions.
Contribution
The paper provides an empirical evaluation of object storage performance in serverless data workflows, based on a genomics pipeline, and highlights scenarios where object storage outperforms VM-based approaches.
Findings
Object storage can outperform VM-based shuffle stages in serverless workflows.
Performance depends on the number of functions used in shuffling stages.
Object storage is a viable data passing method in genomics pipelines.
Abstract
Serverless functions provide high levels of parallelism, short startup times, and "pay-as-you-go" billing. These attributes make them a natural substrate for data analytics workflows. However, the impossibility of direct communication between functions makes the execution of workflows challenging. The current practice to share intermediate data among functions is through remote object storage (e.g., IBM COS). Contrary to conventional wisdom, the performance of object storage is not well understood. For instance, object storage can even be superior to other simpler approaches like the execution of shuffle stages (e.g., GroupBy) inside powerful VMs to avoid all-to-all transfers between functions. Leveraging a genomics pipeline, we show that object storage is a reasonable choice for data passing when the appropriate number of functions is used in shuffling stages.
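To make the all-to-all exchange concrete, here is a minimal sketch of a GroupBy shuffle routed through object storage. It is an illustration only, not the authors' implementation: a plain Python dict stands in for the remote object store (such as IBM COS), and the names `partition_of`, `map_task`, `reduce_task`, `shuffle_groupby`, and the `shuffle/m{i}/r{j}` key layout are all hypothetical.

```python
from collections import defaultdict
import zlib


def partition_of(key: str, num_reducers: int) -> int:
    # Deterministic hash so every mapper routes a given key to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_task(mapper_id, records, num_reducers, store):
    # Each mapper function splits its records into one bucket per reducer.
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_of(key, num_reducers)].append((key, value))
    # One object is written per (mapper, reducer) pair, even when empty.
    # `store` is a dict standing in for the object store (assumption).
    for r in range(num_reducers):
        store[f"shuffle/m{mapper_id}/r{r}"] = buckets.get(r, [])


def reduce_task(reducer_id, num_mappers, store):
    # Each reducer function reads its partition from every mapper's output.
    groups = defaultdict(list)
    for m in range(num_mappers):
        for key, value in store[f"shuffle/m{m}/r{reducer_id}"]:
            groups[key].append(value)
    return dict(groups)


def shuffle_groupby(chunks, num_reducers):
    # Runs the map and reduce stages sequentially; real functions run in parallel.
    store = {}
    for m, chunk in enumerate(chunks):
        map_task(m, chunk, num_reducers, store)
    result = {}
    for r in range(num_reducers):
        result.update(reduce_task(r, len(chunks), store))
    # Also return the number of intermediate objects that were written.
    return result, len(store)
```

In this sketch, M mappers and R reducers produce M × R intermediate objects and the same number of reads, so the request count grows with the product of the two function counts. That growth is one reason the number of functions chosen for a shuffle stage matters for performance.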
