LRC: Dependency-Aware Cache Management for Data Analytics Clusters

Yinghao Yu; Wei Wang; Jun Zhang; Khaled Ben Letaief

arXiv:1703.08280 · cs.DC · March 27, 2017 · 5 cites

Open Access

TL;DR

This paper introduces LRC, a cache management policy that leverages data dependency DAGs in data analytics clusters to improve cache efficiency and significantly speed up applications compared to traditional LRU.

Contribution

The paper presents LRC, a novel cache replacement policy that uses data dependency information to optimize cache management in data-parallel systems.

Findings

01. LRC improves cache hit ratio over LRU.

02. LRC speeds up applications by 60% in Spark.

03. LRC effectively utilizes DAG information for cache management.

Abstract

Memory caches are being aggressively used in today's data-parallel systems such as Spark, Tez, and Piccolo. However, prevalent systems employ rather simple cache management policies--notably the Least Recently Used (LRU) policy--that are oblivious to the application semantics of data dependency, expressed as a directed acyclic graph (DAG). Without this knowledge, memory caching can at best be performed by "guessing" the future data access patterns based on historical information (e.g., the access recency and/or frequency), which frequently results in inefficient, erroneous caching with low hit ratio and a long response time. In this paper, we propose a novel cache replacement policy, Least Reference Count (LRC), which exploits the application-specific DAG information to optimize the cache management. LRC evicts the cached data blocks whose reference count is the smallest. The reference…
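The eviction rule described in the abstract can be illustrated with a toy sketch. The abstract is cut off before it defines the reference count, so the sketch assumes it is the number of child blocks in the DAG that depend on a block and have not yet been computed. The class `LRCCache`, its methods `on_computed` and `put`, and the four-block DAG are hypothetical names for illustration; this is not the authors' Spark implementation.

```python
from collections import defaultdict


class LRCCache:
    """Toy Least Reference Count cache (illustrative sketch only)."""

    def __init__(self, capacity, dag):
        # dag maps each block to the list of child blocks computed from it.
        self.capacity = capacity
        self.cached = set()
        # Assumed definition: reference count = children not yet computed.
        self.ref_count = {b: len(children) for b, children in dag.items()}
        self.parents = defaultdict(list)
        for block, children in dag.items():
            for child in children:
                self.parents[child].append(block)
                self.ref_count.setdefault(child, 0)

    def on_computed(self, block):
        # A computed child no longer needs its parents, so each parent
        # loses one pending reference.
        for parent in self.parents[block]:
            self.ref_count[parent] -= 1

    def put(self, block):
        # Insert a block, evicting the cached block with the smallest
        # reference count when the cache is full (ties broken by name).
        if block in self.cached:
            return None
        evicted = None
        if len(self.cached) >= self.capacity:
            evicted = min(
                self.cached, key=lambda b: (self.ref_count.get(b, 0), b)
            )
            self.cached.remove(evicted)
        self.cached.add(block)
        return evicted


# Hypothetical DAG: C is computed from A and B; D is computed from B.
dag = {"A": ["C"], "B": ["C", "D"], "C": [], "D": []}
cache = LRCCache(capacity=2, dag=dag)
cache.put("B")
cache.put("A")
cache.on_computed("C")     # A now has 0 pending children, B still has 1 (D)
evicted = cache.put("C")   # cache is full, so one block must go
```

In this example A is evicted because no remaining computation depends on it, while B stays cached because D still needs it. A policy based only on access history has no way to see that distinction.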

Taxonomy

Topics: Advanced Data Storage Technologies · Cloud Computing and Resource Management · Distributed systems and fault tolerance