LRC: Dependency-Aware Cache Management for Data Analytics Clusters
Yinghao Yu, Wei Wang, Jun Zhang, Khaled Ben Letaief

TL;DR
This paper introduces LRC, a cache management policy that leverages data dependency DAGs in data analytics clusters to improve cache efficiency and significantly speed up applications compared to traditional LRU.
Contribution
The paper presents LRC, a novel cache replacement policy that uses data dependency information to optimize cache management in data-parallel systems.
Findings
LRC improves cache hit ratio over LRU.
LRC speeds up applications by 60% in Spark.
LRC effectively utilizes DAG information for cache management.
Abstract
Memory caches are being aggressively used in today's data-parallel systems such as Spark, Tez, and Piccolo. However, prevalent systems employ rather simple cache management policies--notably the Least Recently Used (LRU) policy--that are oblivious to the application semantics of data dependency, expressed as a directed acyclic graph (DAG). Without this knowledge, memory caching can at best be performed by "guessing" the future data access patterns based on historical information (e.g., the access recency and/or frequency), which frequently results in inefficient, erroneous caching with low hit ratio and a long response time. In this paper, we propose a novel cache replacement policy, Least Reference Count (LRC), which exploits the application-specific DAG information to optimize the cache management. LRC evicts the cached data blocks whose reference count is the smallest. The reference…
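The eviction rule described in the abstract can be illustrated with a minimal sketch. The abstract is cut off before it defines the reference count, so the sketch assumes it is the number of dependent (child) blocks in the DAG that have not yet been computed. The class `LRCCache` and its methods are hypothetical names chosen for illustration; this is neither the authors' implementation nor Spark's API.

```python
from collections import defaultdict


class LRCCache:
    """Toy cache that evicts the cached block with the smallest reference count."""

    def __init__(self, capacity, dag):
        # dag maps each block id to the list of parent block ids it is computed from.
        self.capacity = capacity
        self.parents = dag
        self.cached = set()
        # ASSUMPTION: a block's reference count is the number of child blocks
        # that depend on it and have not been computed yet.
        self.ref_count = defaultdict(int)
        for parents in dag.values():
            for parent in parents:
                self.ref_count[parent] += 1

    def insert(self, block):
        """Cache a block; return the evicted block id, or None if nothing was evicted."""
        if block in self.cached:
            return None
        evicted = None
        if len(self.cached) >= self.capacity:
            # Sorting first makes ties break deterministically by block id.
            evicted = min(sorted(self.cached), key=lambda b: self.ref_count[b])
            self.cached.remove(evicted)
        self.cached.add(block)
        return evicted

    def mark_computed(self, child):
        """Once `child` is computed, each of its parents loses one pending reference."""
        for parent in self.parents.get(child, []):
            self.ref_count[parent] -= 1


# Block A feeds three children, block B feeds one.
dag = {"C": ["A"], "D": ["A", "B"], "E": ["A"]}
cache = LRCCache(capacity=2, dag=dag)
cache.insert("A")
cache.insert("B")
victim = cache.insert("C")  # B has fewer pending references than A
```

In this toy run, LRU would evict A because it was inserted first, even though three blocks still depend on it. The reference-count rule keeps A and evicts B instead.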
Taxonomy
Topics: Advanced Data Storage Technologies · Cloud Computing and Resource Management · Distributed systems and fault tolerance
