SODA: A Semantics-Aware Optimization Framework for Data-Intensive Applications Using Hybrid Program Analysis
Bingbing Rao, Zixia Liu, Hong Zhang, Siyang Lu, Liqiang Wang

TL;DR
SODA is a semantics-aware optimization framework for data-intensive applications. It combines static and dynamic program analysis to improve the performance of applications built on frameworks such as Spark, with reported speedups of up to 60%.
Contribution
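The paper introduces a two-phase hybrid analysis approach: an offline static analysis of the code, informed by performance profiling data collected during the online phase of prior executions. Together, the two phases help programmers tune performance by exposing both code semantics and runtime behavior.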
Findings
Achieves up to 60% speedup on real-world Spark applications.
Effective optimization strategies include cache management, operation reordering, and element pruning.
Demonstrates the benefit of combined static and dynamic analysis for performance tuning.
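To make the operation-reordering strategy concrete, here is a minimal sketch in plain Python. It is not SODA's implementation and does not use Spark; the record shape and the names `expensive_udf`, `key_filter`, `map_then_filter`, and `filter_then_map` are hypothetical. It assumes the filter reads only a field that the map leaves unchanged, which is the kind of semantic fact a static analysis would need to establish before moving the filter ahead of the map.

```python
from typing import List, Tuple

Record = Tuple[str, int]  # hypothetical (key, value) record shape


def expensive_udf(rec: Record) -> Record:
    """Stand-in for a costly user-defined function; it never changes the key."""
    key, value = rec
    return (key, value * value)


def key_filter(rec: Record) -> bool:
    """Predicate that reads only the key, so it commutes with expensive_udf."""
    return rec[0].startswith("a")


def map_then_filter(records: List[Record], calls: List[int]) -> List[Record]:
    """Original order: the UDF runs on every record, then most results are dropped."""
    out = []
    for rec in records:
        calls[0] += 1  # count UDF invocations
        mapped = expensive_udf(rec)
        if key_filter(mapped):
            out.append(mapped)
    return out


def filter_then_map(records: List[Record], calls: List[int]) -> List[Record]:
    """Reordered: filter first, so the UDF runs only on surviving records."""
    out = []
    for rec in records:
        if key_filter(rec):
            calls[0] += 1  # count UDF invocations
            out.append(expensive_udf(rec))
    return out


records = [("a1", 2), ("b1", 3), ("a2", 4), ("c1", 5)]
calls_before = [0]
calls_after = [0]
baseline = map_then_filter(records, calls_before)
reordered = filter_then_map(records, calls_after)
```

Both orderings produce the same result, but the reordered pipeline invokes the UDF only for records that pass the filter. The rewrite is valid only because the predicate does not depend on anything the UDF modifies; if it did, the two pipelines could disagree.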
Abstract
In the era of data explosion, a growing number of data-intensive computing frameworks, such as Apache Hadoop and Spark, have been proposed to handle the massive volume of unstructured data in parallel. Since the programming models provided by these frameworks allow users to specify complex and diversified user-defined functions (UDFs) with predefined operations, tuning overall system performance becomes a grand challenge when programmers do not fully understand the semantics of code, data, and runtime systems. In this paper, we design a holistic semantics-aware optimization for data-intensive applications using hybrid program analysis (SODA) to help programmers tune performance. SODA is a two-phase framework: the offline phase is a static analysis that analyzes code and performance profiling data from the online phase of prior executions to generate a parameterized and…
Taxonomy
Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Software System Performance and Reliability
