Large-Scale Network Embedding in Apache Spark

Wenqing Lin

arXiv:2106.10620·cs.SI·October 30, 2025

TL;DR

This paper introduces a scalable distributed algorithm for network embedding in Apache Spark that processes graphs with billions of edges within hours and improves link prediction and node classification accuracy.

Contribution

The paper presents a distributed algorithm in Spark that recursively partitions a graph into small subgraphs, computes embeddings for the subgraphs in parallel, and aggregates the results at linear cost.

Findings

01 Handles graphs with billions of edges within hours

02 At least 4 times faster than existing methods

03 Improves link prediction and node classification accuracy

Abstract

Network embedding has been widely used in social recommendation and network analysis, such as recommendation systems and anomaly detection with graphs. However, most previous approaches cannot handle large graphs efficiently, because (i) computation on graphs is often costly and (ii) the graph or the intermediate vector results can be prohibitively large, making them difficult to process on a single machine. In this paper, we propose an efficient and effective distributed algorithm for network embedding on large graphs using Apache Spark, which recursively partitions a graph into several small subgraphs to capture the internal and external structural information of nodes, and then computes the network embedding for each subgraph in parallel. Finally, by aggregating the outputs on all subgraphs, we obtain the embeddings of nodes at linear cost. After…
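
The partition, embed-in-parallel, and aggregate flow described in the abstract can be illustrated with a small single-machine sketch. This is not the paper's algorithm and does not use Spark: the bisection partitioner, the two-number "embedding" (internal degree, external degree), and the names partition, embed_subgraph, and embed_graph are all hypothetical stand-ins chosen to keep the example short. A thread pool plays the role of the parallel workers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(nodes, max_size):
    """Recursively bisect a node list until every part has at most max_size nodes.

    Placeholder partitioner (assumption): it splits by position only and
    ignores graph structure, unlike a real graph partitioner.
    """
    if len(nodes) <= max_size:
        return [nodes]
    mid = len(nodes) // 2
    return partition(nodes[:mid], max_size) + partition(nodes[mid:], max_size)


def embed_subgraph(part, adjacency):
    """Toy per-node embedding (assumption): (internal degree, external degree).

    Internal edges stay inside the part; external edges cross to other parts.
    """
    members = set(part)
    result = {}
    for node in part:
        internal = sum(1 for nb in adjacency[node] if nb in members)
        external = len(adjacency[node]) - internal
        result[node] = (internal, external)
    return result


def embed_graph(edges, max_size=2, workers=4):
    """Partition the graph, embed each part in parallel, then merge the outputs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    parts = partition(sorted(adjacency), max_size)

    # Each part is embedded independently, so the work can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda p: embed_subgraph(p, adjacency), parts))

    # Aggregation touches each node once, so it is linear in the node count.
    embeddings = {}
    for partial in partials:
        embeddings.update(partial)
    return embeddings


if __name__ == "__main__":
    # A 4-cycle split into two parts of two nodes each.
    print(embed_graph([(1, 2), (2, 3), (3, 4), (4, 1)]))
```

Because the parts are disjoint, the merge step never has to reconcile conflicting entries, which is what keeps aggregation linear in this sketch. In a Spark setting the per-part work would more plausibly be expressed as a map over partitions, but that mapping is an assumption and is not shown here.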
