Scaling-up Distributed Processing of Data Streams for Machine Learning
Matthew Nokleby, Haroon Raja, and Waheed U. Bajwa

TL;DR
This paper reviews recent methods for large-scale distributed stochastic optimization over streaming data, with emphasis on convergence guarantees under communication and computation constraints.
Contribution
It provides a comprehensive review of recent advances in distributed stochastic optimization methods tailored for high-rate streaming data, including convergence guarantees and algorithmic designs.
Findings
The reviewed methods achieve order-optimal learning rates in distributed streaming settings.
The convergence analyses explicitly account for computation-communication mismatch.
The review focuses on convex problems and distributed principal component analysis.
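To make the setting concrete, here is a minimal sketch of distributed mini-batch stochastic gradient descent on a stream. It is an illustration only, not an algorithm taken from the paper: the least-squares objective, the synthetic data, the step-size schedule, and all names (`distributed_minibatch_sgd`, `num_nodes`, `batch_per_node`) are assumptions. Averaging across nodes is assumed to be exact and free, which ignores the communication constraints the review emphasizes.

```python
import numpy as np

def distributed_minibatch_sgd(num_nodes=4, batch_per_node=8, dim=5,
                              rounds=300, seed=0):
    """Toy simulation: each node draws a fresh mini-batch from its stream,
    computes a local gradient, and the nodes average their gradients."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=dim)   # hypothetical ground-truth model
    w = np.zeros(dim)               # shared model estimate

    for t in range(1, rounds + 1):
        local_grads = []
        for _ in range(num_nodes):
            # New samples arrive at this node; nothing is stored or revisited.
            X = rng.normal(size=(batch_per_node, dim))
            y = X @ w_true + 0.1 * rng.normal(size=batch_per_node)
            # Gradient of the local least-squares loss on this mini-batch.
            local_grads.append(X.T @ (X @ w - y) / batch_per_node)

        # Assumed exact averaging of the local gradients across all nodes.
        avg_grad = np.mean(local_grads, axis=0)

        # Decaying step size (an illustrative choice, not a tuned schedule).
        w -= (0.5 / np.sqrt(t)) * avg_grad

    return w, w_true

w_hat, w_star = distributed_minibatch_sgd()
print("estimation error:", np.linalg.norm(w_hat - w_star))
```

Each round consumes `num_nodes * batch_per_node` fresh samples, so adding nodes lets the system absorb a faster stream per update, provided the averaging step can keep up.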
Abstract
Emerging applications of machine learning in numerous areas involve continuous gathering of and learning from streams of data. Real-time incorporation of streaming data into the learned models is essential for improved inference in these applications. Further, these applications often involve data that are either inherently gathered at geographically distributed entities or that are intentionally distributed across multiple machines for memory, computational, and/or privacy reasons. Training of models in this distributed, streaming setting requires solving stochastic optimization problems in a collaborative manner over communication links between the physical entities. When the streaming data rate is high compared to the processing capabilities of compute nodes and/or the rate of the communications links, this poses a challenging question: how can one best leverage the incoming data for…
