A Survey on Geographically Distributed Big-Data Processing using MapReduce
Shlomi Dolev, Patricia Florissi, Ehud Gudes, Shantanu Sharma, Ido Singer

TL;DR
This survey reviews the challenges and advancements in geographically distributed big-data processing frameworks, focusing on MapReduce, Spark, and SQL-style systems, highlighting their limitations and future directions.
Contribution
It provides a comprehensive classification and analysis of geo-distributed big-data processing frameworks, discussing their challenges, requirements, and overhead issues.
Findings
Identifies key challenges in geo-distributed data processing.
Classifies existing frameworks into batch, stream, and SQL-style systems.
Highlights the need for new architectures to process data locally without moving raw datasets.
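The last finding, processing data where it lives instead of shipping raw datasets, can be illustrated with a minimal sketch. This is not code from the survey or from any framework it covers; the site names, records, and the helpers `local_word_count` and `global_merge` are hypothetical, and plain in-memory Python stands in for real data centers. Each site runs the map and a local reduce, and only the small per-site summaries cross the wide-area link.

```python
from collections import Counter

# Hypothetical per-site datasets; in a real deployment each list would
# stay inside its own data center and never be copied elsewhere.
SITES = {
    "us-east": ["error timeout", "ok", "error disk"],
    "eu-west": ["ok", "error timeout"],
}


def local_word_count(records):
    """Map and locally reduce at one site, returning only a compact summary."""
    counts = Counter()
    for record in records:
        counts.update(record.split())
    return counts


def global_merge(partials):
    """Combine the per-site summaries at a single coordinating site."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Only the partial counts are transferred between sites, not the raw records.
partials = [local_word_count(records) for records in SITES.values()]
result = global_merge(partials)
```

This works for word counting because the reduce step is associative and commutative, so partial results can be merged in any order. Jobs without that property (for example, some joins) may still need to move part of the raw data.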
Abstract
Hadoop and Spark are widely used distributed processing frameworks for large-scale data processing in an efficient and fault-tolerant manner on private or public clouds. These big-data processing systems are extensively used by many industries, e.g., Google, Facebook, and Amazon, for solving a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern matching, and social network analysis. However, all these popular systems have a major drawback in terms of locally distributed computations, which prevents them from implementing geographically distributed data processing. The increasing amount of geographically distributed massive data is pushing industries and academia to rethink the current big-data processing systems. The novel frameworks, which will be beyond state-of-the-art architectures and technologies involved…
