
TL;DR
This paper introduces a scalable distributed algorithm for large-scale network embedding using Apache Spark, enabling efficient processing of graphs with billions of edges and improving performance on real-world applications.
Contribution
The paper presents a novel distributed graph partitioning and embedding algorithm in Spark that significantly enhances scalability and speed for large graph analysis.
Findings
Handles graphs with billions of edges within hours
At least 4 times faster than existing methods
Improves link prediction and node classification accuracy
Abstract
Network embedding has been widely used in social recommendation and network analysis, for example in recommendation systems and graph-based anomaly detection. However, most previous approaches cannot handle large graphs efficiently, because (i) computation on graphs is often costly and (ii) the graph or the intermediate vector results can be prohibitively large, making them difficult to process on a single machine. In this paper, we propose an efficient and effective distributed algorithm for network embedding on large graphs using Apache Spark. The algorithm recursively partitions a graph into several small subgraphs to capture the internal and external structural information of nodes, and then computes the network embedding for each subgraph in parallel. Finally, by aggregating the outputs on all subgraphs, we obtain the node embeddings at linear cost. After…
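The partition, embed-in-parallel, and aggregate pipeline described in the abstract can be illustrated with a minimal single-machine sketch. This is not the paper's algorithm and does not use Spark: the function names (`partition`, `embed_subgraph`, `embed_graph`) are hypothetical, the partitioner simply halves a sorted node list, a thread pool stands in for Spark workers, and the "embedding" is a toy pair of counts (neighbors inside the subgraph, neighbors outside it) chosen only to show how internal and external structure might both be recorded.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(nodes, max_size):
    """Recursively halve a node list until every part has at most max_size nodes.

    Placeholder partitioner (assumption): it ignores edges entirely.
    """
    if len(nodes) <= max_size:
        return [nodes]
    mid = len(nodes) // 2
    return partition(nodes[:mid], max_size) + partition(nodes[mid:], max_size)


def embed_subgraph(part, adj):
    """Toy per-subgraph embedding: (internal degree, external degree) per node."""
    members = set(part)
    vectors = {}
    for u in part:
        internal = sum(1 for v in adj[u] if v in members)
        vectors[u] = (internal, len(adj[u]) - internal)
    return vectors


def embed_graph(edges, max_size=2):
    """Partition the graph, embed each part in parallel, then merge the results."""
    # Build an undirected adjacency map from the edge list.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    parts = partition(sorted(adj), max_size)

    # A thread pool stands in for distributed workers in this sketch.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda p: embed_subgraph(p, adj), parts))

    # Aggregation touches each node once, so it is linear in the node count.
    merged = {}
    for vectors in results:
        merged.update(vectors)
    return merged


if __name__ == "__main__":
    # A 4-cycle split into two parts of two nodes each.
    print(embed_graph([(1, 2), (2, 3), (3, 4), (4, 1)], max_size=2))
```

In a real distributed setting the partitioner would need to account for edge structure, and the per-subgraph step would be an actual embedding method; both are left abstract here because the excerpt does not specify them.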
