Towards Geo-Distributed Machine Learning
Ignacio Cano, Markus Weimer, Dhruv Mahajan, Carlo Curino, Giovanni Matteo Fiumara

TL;DR
This paper introduces Geo-Distributed Machine Learning (GDML), an approach that trains models across multiple data centers without first centralizing the data. It targets scarce, expensive cross-data center bandwidth and growing data sovereignty concerns, and it is evaluated on real datasets.
Contribution
It proposes a novel geo-distributed training system that avoids data centralization, improving privacy and regulatory compliance while maintaining performance.
Findings
GDML reduces data transfer costs compared to centralized methods.
GDML maintains high model accuracy across multiple datasets.
Geo-distributed training is feasible and advantageous in real-world scenarios.
Abstract
Latency to end-users and regulatory requirements push large companies to build data centers all around the world. The resulting data is "born" geographically distributed. On the other hand, many machine learning applications require a global view of such data in order to achieve the best results. These types of applications form a new class of learning problems, which we call Geo-Distributed Machine Learning (GDML). Such applications need to cope with: 1) scarce and expensive cross-data center bandwidth, and 2) growing privacy concerns that are pushing for stricter data sovereignty regulations. Current solutions to learning from geo-distributed data sources revolve around the idea of first centralizing the data in one data center, and then training locally. As machine learning algorithms are communication-intensive, the cost of centralizing the data is thought to be offset by the lower…
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Cloud Computing and Resource Management · Scientific Computing and Data Management
