Data Placement and Replica Selection for Improving Co-location in Distributed Environments
K. Ashwin Kumar, Amol Deshpande, Samir Khuller

TL;DR
This paper proposes workload-driven data placement and replica selection algorithms that reduce the average number of machines involved in queries, thereby decreasing resource consumption in distributed data management systems.
Contribution
It models the query workload as a hypergraph and introduces workload-driven replica placement and replica selection algorithms that minimize query span, the number of machines involved in executing a query.
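To make the objective concrete, the following is a minimal Python sketch of how average query span could be measured for a given placement. It is not the authors' algorithm. The workload is treated as a hypergraph whose hyperedges are the sets of data items each query touches, and a placement maps each machine to the items (or replicas) it stores. Choosing the fewest machines for a query is a set-cover problem, so the sketch substitutes a simple greedy heuristic. All names here (`query_span`, `average_span`, the machine and item labels) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def query_span(query_items: Set[str], placement: Dict[str, Set[str]]) -> int:
    """Greedy set-cover estimate of how many machines a query must touch.

    Assumption: this greedy heuristic stands in for replica selection;
    it is an illustrative substitute, not the paper's method.
    """
    uncovered = set(query_items)
    machines_used = 0
    while uncovered:
        # Pick the machine holding the most still-uncovered items.
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(placement), key=lambda m: len(placement[m] & uncovered))
        gained = placement[best] & uncovered
        if not gained:
            raise ValueError(f"items not stored on any machine: {sorted(uncovered)}")
        uncovered -= gained
        machines_used += 1
    return machines_used


def average_span(workload: List[Set[str]], placement: Dict[str, Set[str]]) -> float:
    """Average query span over all queries (hyperedges) in the workload."""
    return sum(query_span(q, placement) for q in workload) / len(workload)


# Hypothetical workload: each query is a hyperedge over data items.
workload = [{"a", "b"}, {"b", "c"}, {"a", "c", "d"}]

# Placement 1: every item on its own machine (maximum spreading).
spread = {"m1": {"a"}, "m2": {"b"}, "m3": {"c"}, "m4": {"d"}}

# Placement 2: items co-located, with "a" and "c" replicated on both machines.
colocated = {"m1": {"a", "b", "c"}, "m2": {"a", "c", "d"}}

print("spread:", average_span(workload, spread))
print("co-located:", average_span(workload, colocated))
```

In this toy example the co-located placement with replicas lets every query be answered by a single machine, while the fully spread placement forces each query onto two or three machines.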
Findings
Significant reduction in resource consumption through optimized data placement.
Effective algorithms for replica selection based on workload hypergraph modeling.
Validation on synthetic and real workloads demonstrating practical benefits.
Abstract
The increasing need for large-scale data analytics in a number of application domains has led to a dramatic rise in the number of distributed data management systems, both parallel relational databases and systems that support alternative frameworks like MapReduce. There is thus increasing contention for scarce data center resources like network bandwidth; further, the energy requirements for powering the computing equipment are also growing dramatically. As we show empirically, increasing the execution parallelism by spreading out data across a large number of machines may achieve the intended goal of decreasing query latencies but, in most cases, may significantly increase the total resource and energy consumption. For many analytical workloads, however, minimizing query latencies is often not critical; in such scenarios, we argue that we should instead focus on minimizing the average…
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed systems and fault tolerance · Caching and Content Delivery
