DySkew: Dynamic Data Redistribution for Skew-Resilient Snowpark UDF Execution
Chenwei Xie, Urjeet Shrestha, Corbin McElhanney, Lukas Lorimer, Gopal V, Zihao Ye, Yi Pan, Nic Crouch, Elliott Brossard, Florian Funke, Yuxiong He

TL;DR
DySkew is a data-skew-aware execution strategy for Snowpark UDFs that dynamically redistributes data to mitigate skew-related performance issues, improving efficiency and reducing latency.
Contribution
This paper introduces DySkew, a novel adaptive data redistribution mechanism tailored for Snowpark UDFs to handle data skew effectively.
Findings
Significant reduction in execution time for skewed workloads
Improved resource utilization and load balancing
Effective handling of large rows with the Row Size Model
Abstract
Snowflake revolutionized data warehousing with an elastic architecture that decouples compute and storage, enabling scalable solutions for diverse data analytics needs. Building on this foundation, Snowflake has advanced its AI Data Cloud vision by introducing Snowpark, a managed turnkey solution that supports data engineering and AI/ML workloads using Python and other programming languages. While Snowpark's User-Defined Function (UDF) execution model offers high throughput, it is highly vulnerable to performance degradation from data skew, where uneven data partitioning causes straggler tasks and unpredictable latency. The non-uniform computational cost of arbitrary user code further exacerbates this classic challenge. This paper presents DySkew, a novel, data-skew-aware execution strategy for Snowpark UDFs. Built upon Snowflake's new generalized skew handling solution, an adaptive…
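To make the straggler effect described in the abstract concrete, the following minimal sketch compares a static key-based assignment of rows to workers against a greedy least-loaded assignment. It is an illustration only, not DySkew's actual mechanism: the function names, the unit per-row cost, the four-worker setup, and the key distribution are all assumptions chosen for the example.

```python
import heapq


def static_assign(keys, row_costs, num_workers):
    """Assign each row to worker (key % num_workers); return per-worker load."""
    loads = [0] * num_workers
    for key, cost in zip(keys, row_costs):
        loads[key % num_workers] += cost
    return loads


def least_loaded_assign(row_costs, num_workers):
    """Greedy redistribution: send each row to the currently least-loaded worker."""
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    loads = [0] * num_workers
    for cost in row_costs:
        load, worker = heapq.heappop(heap)
        loads[worker] = load + cost
        heapq.heappush(heap, (loads[worker], worker))
    return loads


# Hypothetical skewed input: 90 of 100 rows share key 0 (assumed for illustration).
keys = [0] * 90 + [1, 2, 3] * 3 + [1]
row_costs = [1] * len(keys)  # assume every row costs one unit of UDF work

static_loads = static_assign(keys, row_costs, num_workers=4)
balanced_loads = least_loaded_assign(row_costs, num_workers=4)

# The slowest worker determines overall latency (the "straggler").
print("static max load:  ", max(static_loads))
print("balanced max load:", max(balanced_loads))
```

In this toy setting the total work is identical under both strategies; only the maximum per-worker load changes, which is the quantity a straggler task inflates. Real UDF rows have non-uniform cost, so a practical scheme would need a cost estimate per row rather than the unit cost assumed here.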
