Multi-Dimensional Balanced Graph Partitioning via Projected Gradient Descent

Dmitrii Avdiukhin; Sergey Pupyrev; Grigory Yaroslavtsev

arXiv:1902.03522·cs.DS·February 19, 2019

TL;DR

This paper introduces a scalable multi-dimensional balanced graph partitioning method using projected gradient descent, improving distributed graph processing performance on large-scale social networks.

Contribution

It presents a novel scalable algorithm for multi-dimensional balanced graph partitioning based on randomized projected gradient descent for non-convex relaxations.

Findings

01

Outperforms state-of-the-art methods on large social networks

02

Efficient implementation of the algorithm in practice

03

Demonstrates importance of multi-dimensional balance for performance

Abstract

Motivated by performance optimization of large-scale graph processing systems that distribute the graph across multiple machines, we consider the balanced graph partitioning problem. Compared to the previous work, we study the multi-dimensional variant when balance according to multiple weight functions is required. As we demonstrate by experimental evaluation, such multi-dimensional balance is important for achieving performance improvements for typical distributed graph processing workloads. We propose a new scalable technique for the multidimensional balanced graph partitioning problem. The method is based on applying randomized projected gradient descent to a non-convex continuous relaxation of the objective. We show how to implement the new algorithm efficiently in both theory and practice utilizing various approaches for projection. Experiments with large-scale social networks…

Tables

Table 1: Theoretical properties of projection methods.

Method	$d$	Output	Time required
Alternating	any	$\mathbf{x} \in K$	Until convergence
Dykstra’s	any	projection	Until convergence
Exact (ours)	$d \leq 2$	projection	$\mathcal{O}(n \log^{d-1} n)$

Table 2: Impact of partitioning policy on the running time and the amount of sent messages across 128 Giraph workers for the Page Rank application applied on the FB-400B graph. The numbers are average values over 30 iterations.

Partitioning	Runtime, sec			Communication, GB
	mean	max	stdev	mean	max	stdev
Hash	$95$	$102$	$27$	$69.5$	$69.6$	$2.4$
vertex	$93$	$143$	$25$	$18.6$	$47.6$	$6.8$
edge	$82$	$120$	$22$	$25.7$	$38.2$	$5.9$
vertex-edge	$84$	$88$	$21$	$29.1$	$30.6$	$2.8$

Table 3: Comparison of GD with METIS for multidimensional experiments. The results show that for high-dimensional balanced partitioning METIS can’t guarantee balance. Better results shown in bold. In most cases GD outperforms METIS in edge locality, imbalance, memory usage and/or time.

		LiveJournal		orkut		sx-stackoverflow
		GD	METIS	GD	METIS	GD	METIS
$d = 2$: balance on vertices and degrees	Locality, %	$91.71$	$93.74$	$88.36$	$86.52$	$75.82$	$80.41$
	$\max$ imbalance, %	$0.04$	$0.5$	$0.02$	$0.7$	$0.04$	$0.6$
	Memory, MB	$\mathbf{2635}$	$4085$	$\mathbf{4673}$	$10259$	$\mathbf{1587}$	$4113$
	Time, s	$117$	$\mathbf{44}$	$203$	$\mathbf{92}$	$68$	$\mathbf{55}$
$d = 3$: balance on vertices, degrees and sum of neighbor degrees	Locality, %	$88.74$	$73.36$	$89.55$	$62.1$	$76.8$	$60.09$
	$\max$ imbalance, %	$0.05$	$30$	$0.02$	$1.6$	$0.1$	$6.5$
	Memory, MB	$\mathbf{2711}$	$4802$	$\mathbf{4697}$	$12271$	$\mathbf{1627}$	$4985$
	Time, s	$140$	$\mathbf{66}$	$\mathbf{196}$	$303$	$\mathbf{76}$	$131$
$d = 4$: balance on vertices, degrees, sum of neighbor degrees and pagerank	Locality, %	$87.93$	$74.36$	$75.58$	$65.08$	$77.04$	$78.54$
	$\max$ imbalance, %	$0.5$	$38$	$2.7$	$20$	$0.4$	$3.8$
	Memory, MB	$\mathbf{2939}$	$4839$	$\mathbf{4896}$	$12294$	$\mathbf{1754}$	$5013$
	Time, s	$227$	$\mathbf{66}$	$\mathbf{240}$	$297$	$\mathbf{88}$	$142$

Full text

Dmitrii Avdiukhin, Indiana University, Bloomington, IN

Sergey Pupyrev, Facebook, Menlo Park, CA

Grigory Yaroslavtsev, Indiana University, Bloomington, IN

Abstract

Motivated by performance optimization of large-scale graph processing systems that distribute the graph across multiple machines, we consider the balanced graph partitioning problem. Compared to most of the previous work, we study the multi-dimensional variant when balance according to multiple weight functions is required. As we demonstrate by experimental evaluation, such multi-dimensional balance is essential for achieving performance improvements for typical distributed graph processing workloads.

We propose a new scalable technique for the multi-dimensional balanced graph partitioning problem. The method is based on applying randomized projected gradient descent to a non-convex continuous relaxation of the objective. We show how to implement the new algorithm efficiently in both theory and practice utilizing various approaches for the projection step. Experiments with large-scale graphs with up to 800B edges indicate that our algorithm has superior performance compared with the state-of-the-art approaches.

1 Introduction

Distributed graph processing systems have been widely adopted in recent years to enable analysis and knowledge extraction from large-scale graphs. Systems such as Giraph [6], GraphX [19], GraphLab [30], and PowerGraph [18] allow users to use a vertex-centric model for applications which can be executed on a cluster of worker nodes. In this setting, each worker node operates on a subset of the input graph and communicates with other workers by sending messages. The process of splitting the input graph into these subsets, also known as graph partitioning, is essential for optimizing performance of such systems [18, 43, 20, 3].

Created partitions have a significant impact on the communication between different workers and the resource usage of individual workers. In order to maximize the processing speed, the partitions should largely be independent to minimize communication. At the same time, computation executed on each partition should take approximately the same amount of processing time, as the overall performance depends on the slowest worker. These constraints give rise to the Balanced Graph Partitioning problem whose goal is to divide the vertices of a graph into a given number of (approximately) equal size components while minimizing the resulting edge cut. Balanced Graph Partitioning is a classic and thoroughly studied problem from both theoretical and practical points of view [9, 12]. In the context of distributed graph processing, the problem is typically studied in two variants.

In the vertex partitioning model, each worker machine is assigned an equal number of vertices with the goal of minimizing the number of cross-machine edges. Since messages are usually sent between adjacent vertices, storing tightly connected subgraphs on the same worker can reduce communication and hence running times of jobs. It has however been observed that this strategy does not lead to equally loaded partitions for real-world graphs with power law degree distribution [18]. Graph partitioning algorithms tend to colocate high-degree vertices and corresponding partitions take much longer to process, resulting in longer execution time overall.

The edge partitioning model has been suggested to alleviate the above imbalance problem [18, 29]. In this model the goal is to partition the graph so that the number of edges in every component is the same, while the number of incident edges across different components is minimized. Good partitions according to this model typically result in better balance across workers and reduced computation time in comparison to the trivial hash-based assignment of vertices to worker machines. However, edge-based graph partitioning can still result in performance regressions [3, 40].

To analyze the source of regressions, we performed a simple experiment of running a PageRank algorithm implemented on top of Giraph utilizing various graph partitioning methods. Figure 1 illustrates the histograms of running times for individual workers processing a graph with $800M$ vertices and $80B$ edges. As discussed above, partitions according to the vertex partitioning model suffer from unequal distribution of edges across workers. A single overloaded partition can contain $1.92\times$ more edges than an average one, which results in $1.5\times$ longer execution time. We also observe a high correlation ($\rho=0.79$) between the number of edges assigned to a partition and the corresponding processing time in this experiment. Partitioning according to the edge partitioning model yields a $1.08\times$ running time improvement over the baseline, though there is still a noticeable imbalance between the fastest and the slowest worker machines. This can be explained by uneven distribution of vertices among workers. Machines with more vertices have higher operational overhead, such as serialization of sent messages, whose number is proportional to the number of vertices on a worker. Here we observe a $1.33\times$ imbalance in the number of vertices and a moderate correlation ($\rho=0.62$) between the running time and the vertex count on the workers.

In order to mitigate the issues described above we introduce a new strategy, vertex-edge partitioning, which is designed to balance the number of vertices and edges across workers simultaneously. As shown in Figure 1, this is done at the cost of a lower edge locality (percentage of edges with both endpoints on the same machine), and thus, higher communication volume. The resulting assignment yields a $1.17\times$ speedup over the hash-based model. Motivated by the above experiment and a number of earlier studies [43, 20, 3, 40], we formalize a new model for graph partitioning which is suitable for real-world distributed graph processing systems.

We now formally describe the model in the most general setting which allows one to require balance according to $d$ different unrelated weight functions. Let $G(V,E)$ be a graph with $d$ vertex weight functions $w^{(1)},\dots,w^{(d)}:V\rightarrow\mathbb{R}^{+}$, each assigning a positive weight to every vertex in the graph. Let $w^{(j)}(V)=\sum_{v\in V}w^{(j)}(v)$ be the sum of weights of all vertices in the graph according to the $j$-th weight function. Given an integer $k$ and a parameter $\varepsilon>0$, the goal is to find a partition of the vertex set $V$ into $k$ sets $V_{1},\dots,V_{k}$ such that for each weight function $w^{(j)}$ and each part of the partition $V_{i}$ the sum of weights in $V_{i}$ is approximately the same and close to the average, i.e. $\sum_{v\in V_{i}}w^{(j)}(v)=(1\pm\varepsilon)\frac{w^{(j)}(V)}{k}$. We call such partitions $\varepsilon$-balanced. Finally, among all such $\varepsilon$-balanced partitions the goal is to find one that maximizes the number of edges whose both endpoints are contained within some part of the partition and hence minimizes the size of the cut. This problem is referred to as Multi-Dimensional Balanced Graph Partitioning (MDBGP).

The simplest example of MDBGP is the classic balanced graph partitioning problem, which is equivalent to the vertex partitioning strategy described above and can be expressed using a single weight function $w^{(1)}(v)=1$. Since $w^{(1)}(V)=|V|$ this requires that we maximize edge locality while ensuring that $|V_{i}|\approx\frac{|V|}{k}$. Using two weight functions $w^{(1)}(v)=1$ and $w^{(2)}(v)=\deg(v)$ corresponds to requiring balance on the number of vertices and edges in the parts of the partition and hence corresponds to the vertex-edge partitioning approach described above. Indeed, $w^{(2)}(V)=2|E|$ and hence in addition to balance on the number of vertices this requires that $\sum_{v\in V_{i}}\deg(v)\approx\frac{2|E|}{k}$. However, the model is not restricted to vertex- and edge-balance (as in the aforementioned vertex-edge partitioning) but can take arbitrary user-specified weights. In particular, when partitioning the vertices of the graph between the workers for load balancing, various weights modeling expected vertex activity can be used (historical data on individual vertex load, proxy values for the load such as PageRank, etc.).
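To make the two-weight example concrete, the following minimal Python sketch builds $w^{(1)}(v)=1$ and $w^{(2)}(v)=\deg(v)$ for a small graph and checks whether a given assignment of vertices to parts is $\varepsilon$-balanced. The function names (`degree_weights`, `is_balanced`, `edge_locality`), the 0-based vertex ids, and the toy graph are illustrative assumptions and do not come from the paper.

```python
def degree_weights(n, edges):
    """Return [w1, w2] with w1(v) = 1 and w2(v) = deg(v) for an undirected edge list."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return [[1.0] * n, [float(d) for d in deg]]


def is_balanced(weights, part, k, eps):
    """Check |w(V_i) - w(V)/k| <= eps * w(V)/k for every weight function and every part."""
    for w in weights:
        per_part = [0.0] * k
        for v, p in enumerate(part):
            per_part[p] += w[v]
        avg = sum(w) / k
        if any(abs(s - avg) > eps * avg + 1e-12 for s in per_part):
            return False
    return True


def edge_locality(edges, part):
    """Fraction of edges whose two endpoints are assigned to the same part."""
    return sum(1 for u, v in edges if part[u] == part[v]) / len(edges)


# Toy graph (an assumption for illustration): two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
weights = degree_weights(6, edges)
good_part = [0, 0, 0, 1, 1, 1]   # one triangle per part
skewed_part = [0, 0, 0, 0, 1, 1]  # four vertices against two
```

Here `good_part` is balanced on both vertices and degrees while cutting only the bridge edge, whereas `skewed_part` violates the vertex balance for small $\varepsilon$.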

While a large body of work exists offering practical solutions for the one-dimensional version of the problem [23, 13, 42, 41, 7, 14, 33, 22, 12], as well as on theoretical foundations of graph partitioning [26, 4, 32], literature on principled and scalable approaches for the multi-dimensional case is quite sparse [24, 37, 36, 35]. In particular, if the weight functions are unrelated to each other, one can easily construct examples when no feasible solution exists that satisfies all balance constraints even for two weight functions. However, it is empirically observed that instances coming from applications often allow balanced solutions for several weight functions of interest simultaneously. For classical local search based algorithms such as [25] handling of multiple unrelated weight functions is challenging since imposing one balance constraint might violate another and hence finding a good local move becomes computationally hard. We overcome this difficulty by using a continuous relaxation of the problem, which allows more flexibility for achieving balance in the search space. In order to obtain an integral solution, in the end we apply randomized rounding which preserves balance with high probability.

1.1 Our Contributions

We present a scalable algorithmic framework for the problem of balanced partitioning of large graphs according to multiple user-specified weight functions while maximizing the number of edges inside the resulting components. Our framework consists of applying the projected gradient descent on a standard relaxation with a suitably chosen projection method. The relaxation is to maximize a non-convex quadratic function $f(\mathbf{x})=\frac{1}{2}\mathbf{x}^{T}A\mathbf{x}$ for $\mathbf{x}\in\mathbb{R}^{n}$ , where $A$ is the adjacency matrix, subject to a constraint $\mathbf{x}\in K$ for a certain convex body $K$ defined by the weight functions. Section 2 provides the exact description of the relaxation. Note that the gradient descent step only uses a matrix-vector multiplication since $\nabla f=A\mathbf{x}$ , and thus, the algorithm allows a straightforward distributed implementation.

While applying projected gradient descent to solve non-convex optimization problems subject to convex constraints is a well-studied approach in non-linear optimization (Section 2.3, [8]) and machine learning (Section 6.6, [21]), one has to overcome two technical challenges to make it applicable to the multi-dimensional graph partitioning problem: 1) the projection step is computationally expensive, and 2) the existence of points with small gradient (saddle points) slows down convergence.

We show how to address the first challenge by designing efficient projection step algorithms tailored to the standard relaxation of MDBGP. While convergence to the projection point can be achieved using various alternating projections methods [15], for $d\leq 2$ we give one-shot exact solutions with almost linear running time.

Theorem 1.1

Running time of the projected gradient descent step is ${\mathcal{O}}(|E|+|V|\log^{d-1}|V|)$ for $d\leq 2$ and scales as ${\mathcal{O}}(|E|/m+|V|\log^{d-1}|V|)$ when distributed between $m$ machines.

In order to address the second challenge, we use small perturbations to get out of saddle points, where the perturbation vectors are sampled from a scaled $n$ -dimensional Gaussian distribution. We refer to the resulting algorithm as GD, see Algorithm 1. Convergence analysis of GD remains an open problem. While noisy gradient descent is known to have fast convergence to a local optimum for non-convex optimization subject to equality constraints, if inequality constraints are allowed convergence analysis is unknown [16].

Our experimental results show that GD scales to graphs with up to several billions of vertices and up to $10^{12}$ edges. We conducted an experimental evaluation of various graph partitioning strategies for optimizing several real-world Giraph workloads. The results demonstrate that multi-dimensional balancing is a suitable objective for achieving performance improvements, providing speedups in the order of $10\%-30\%$ over the state-of-the-art one-dimensional partitioning strategies. Compared to existing scalable graph partitioners, such as Social Hash Partitioner [22], Spinner [33], and Balanced Label Propagation [42, 34], the algorithm is conceptually simple and obtains close-to-perfect balanced partitions across multiple dimensions.

1.2 Previous Work

While one-dimensional balanced graph partitioning has been studied extensively and a number of tools exist [23, 13, 42, 41, 7, 14, 33, 22] (see also surveys by Bichot and Siarry [9] and by Buluç et al. [12]), to the best of our knowledge none of the practical algorithms for this problem have been previously based on running gradient descent on a continuous relaxation. Existing approaches are inherently discrete and are based on combinations of various discrete algorithms: greedy heuristics (METIS [23], Fennel [41]), branch-and-bound [13], label propagation and local search (balanced label propagation [42], Social Hash Partitioner [22], Spinner [33]), as well as hybrid approaches (linear embedding method combined with various optimizations [7]). Due to the combinatorial nature of these algorithms, their generalizations to the multi-dimensional case appear to be non-straightforward without substantial losses in performance, while our continuous relaxation handles multiple balance constraints uniformly. Compared to the one-dimensional version, existing literature on the multi-dimensional version is rather sparse [24, 37, 36, 35] and the main publicly available tool for the problem is currently METIS [24, 37].

Vast literature exists on optimization of non-convex functions and the interest in this topic lately has been particularly high. However, in the constrained case when the optimization has to be performed over a convex body, fairly little is known; see classic optimization literature [8, 44, 11]. Recent results on the non-convex optimization problem subject to convex constraints and its special cases include [16, 39, 5, 17, 21]. Closest to our work in terms of techniques is [27] who use projected gradient method to solve convex programs involving the max-norm and show how to solve large semidefinite programming relaxations of Max-Cut. Their results are quite different from ours as we consider a balanced version of graph partitioning and expect our algorithms to be scalable; the largest instances handled by [27] have $|V|=20K$ and $|E|=40K$ . Since we require that our algorithms scale to graphs with billions of edges, using existing general purpose software for constrained quadratic programming is also infeasible.

2 Projected Gradient Descent

For an integer $t$ we use notation $[t]$ to denote the set $\{1,\dots,t\}$ . The weighted $d$ -dimensional balanced graph partitioning problem is defined by a collection of $d$ weight functions $w^{(1)},\dots,w^{(d)}$ , where $w^{(j)}\colon V\to\mathbb{R}^{+}$ . For a set $S\subseteq V$ we use notation $w^{(j)}(S)\equiv\sum_{v\in S}w^{(j)}_{v}$ .

Definition 2.1 (MDBGP)

Given a graph $G(V,E)$ , an integer $k$ and a parameter $\varepsilon>0$ , the Multi-Dimensional $\varepsilon$ -Balanced Graph $k$ -Partitioning problem is to find a partition of the vertex set $V$ into $k$ sets $V_{1},\dots,V_{k}$ such that for each $j\in[d]$ , it holds that $w^{(j)}(V_{i})=(1\pm\varepsilon)\frac{w^{(j)}(V)}{k}$ for all $i\in[k]$ . Among all such partitions the goal is to find one that maximizes the number of edges whose both endpoints are contained within some part of the partition.

In this paper we focus on the $2$ -partitioning problem; for the general variant of $k$ -partitioning, we apply the algorithm recursively. For $k=2$ MDBGP is equivalent to the following integer quadratic program:

Maximize: $\displaystyle\frac{1}{2}\sum_{(i_{1},i_{2})\in E}(x_{i_{1}}x_{i_{2}}+1)$

Subject to: $\displaystyle\left|\sum_{i=1}^{n}w^{(j)}_{i}x_{i}\right|\leq\varepsilon\sum_{i=1}^{n}w^{(j)}_{i}$ $\displaystyle\forall j\in[d]$

$\displaystyle x_{i}\in\{-1,1\}$ $\displaystyle\forall i\in V$

The interpretation of the $x_{i}$ variables is that if $x_{i}=1$ then $i\in V_{1}$ and if $x_{i}=-1$ then $i\in V_{2}$. The objective is then the same as in MDBGP and counts the number of edges whose both endpoints are contained in some part of the partition. Indeed, an edge $(i_{1},i_{2})$ contributes $1$ to the objective when $x_{i_{1}}=x_{i_{2}}$ (and hence $x_{i_{1}}x_{i_{2}}=1$) and $0$ otherwise (since $x_{i_{1}}x_{i_{2}}=-1$). The constraints are equivalent to $-\varepsilon w^{(j)}(V)\leq w^{(j)}(V_{1})-w^{(j)}(V_{2})\leq\varepsilon w^{(j)}(V)$. Since $w^{(j)}(V_{1})+w^{(j)}(V_{2})=w^{(j)}(V)$, adding or subtracting $w^{(j)}(V)$ and dividing by $2$ gives $w^{(j)}(V_{i})=(1\pm\varepsilon)\frac{w^{(j)}(V)}{2}$ as required in MDBGP.
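As a sanity check on the integer quadratic program, the sketch below solves it by exhaustive enumeration on a tiny graph. Brute force is only an illustration of the formulation and is not the method proposed in the paper; the names `objective`, `feasible`, and `brute_force_bipartition`, as well as the toy graph, are assumptions.

```python
from itertools import product


def objective(edges, x):
    """1/2 * sum over edges of (x_u * x_v + 1): uncut edges add 1, cut edges add 0."""
    return sum(x[u] * x[v] + 1 for u, v in edges) / 2


def feasible(weights, x, eps):
    """Check |sum_i w_i x_i| <= eps * sum_i w_i for every weight function."""
    return all(
        abs(sum(wi * xi for wi, xi in zip(w, x))) <= eps * sum(w) + 1e-12
        for w in weights
    )


def brute_force_bipartition(n, edges, weights, eps):
    """Enumerate all x in {-1, 1}^n; only practical for very small n."""
    candidates = [x for x in product((-1, 1), repeat=n) if feasible(weights, x, eps)]
    best = max(candidates, key=lambda x: objective(edges, x))
    return best, objective(edges, best)


# Toy graph (an assumption): two triangles joined by one edge,
# balanced on vertex count and on degree.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
weights = [[1, 1, 1, 1, 1, 1], [2, 2, 3, 3, 2, 2]]
best_x, best_value = brute_force_bipartition(6, edges, weights, eps=0.0)
```

On this graph the optimum keeps each triangle on one side and cuts only the bridge, so six of the seven edges stay inside a part.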

After dropping the additive term the objective can be expressed as $f(\mathbf{x})=\frac{1}{2}\mathbf{x}^{T}A\mathbf{x}$ and has gradient $\nabla f(\mathbf{x})=A\mathbf{x}$ and Hessian $\nabla^{2}(f)=A$ . Finally, we use a continuous relaxation of the above problem where we replace the integrality constraints with $x_{i}\in[-1,1]$ for all $i\in V$ . A solution to this continuous relaxation can be converted into an integral solution using randomized rounding. Using independent random variables $X_{i}$ for each vertex such that $\Pr[X_{i}=1]=\frac{1+x_{i}}{2}$ and $\Pr[X_{i}=-1]=\frac{1-x_{i}}{2}$ the expected value of the objective on the rounded solution $(X_{1},\dots,X_{|V|})$ is the same as on the initial fractional solution $(x_{1},\dots,x_{|V|})$ while all balance constraints are still approximately preserved with high probability by concentration bounds.
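The rounding step can be sketched in a few lines of Python. The helper names (`round_solution`, `mean_rounded_objective`) and the use of a seeded `random.Random` are assumptions made for illustration. The second helper estimates the expected objective empirically, which should approach the fractional objective when the graph has no self-loops.

```python
import random


def objective(edges, x):
    """1/2 * sum over edges of (x_u * x_v + 1); also defined for fractional x."""
    return sum(x[u] * x[v] + 1 for u, v in edges) / 2


def round_solution(x, rng):
    """Independently set X_i = 1 with probability (1 + x_i) / 2, otherwise X_i = -1."""
    return [1 if rng.random() < (1 + xi) / 2 else -1 for xi in x]


def mean_rounded_objective(edges, x, trials, seed=0):
    """Average the objective over repeated independent roundings of x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += objective(edges, round_solution(x, rng))
    return total / trials


# Toy input (an assumption): a triangle with a fractional solution.
triangle = [(0, 1), (1, 2), (0, 2)]
x_frac = [0.5, -0.2, 0.8]
```

Coordinates that are already $\pm 1$ are left unchanged by the rounding, since the corresponding probability is exactly $1$ or $0$.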

2.1 Overview

We propose the following algorithm for the multi-dimensional balanced graph partitioning problem based on the continuous relaxation described above. The algorithm is referred to as Gradient Descent (GD), see Algorithm 1. It computes a sequence of vectors $\{\mathbf{x}^{(t)}\}$, where $x^{(t)}_{i}\in[-1,1]$ for all $i\in V$ and $t$. Here $\mathbf{x}^{(0)}$ is initialized with the zero vector, and $\mathbf{x}^{(t+1)}$ is computed by applying a projected gradient descent iteration to $\mathbf{x}^{(t)}$. Each iteration consists of three steps.

Step 1: Adding noise. We add Gaussian noise to $\mathbf{x}^{(t)}$, obtaining a noisy vector $\mathbf{z}^{(t)}$. The noise is drawn from the $n$-dimensional Gaussian distribution $N_{n}(0,\eta_{t})$ with zero mean and variance $\eta_{t}$ in each coordinate. The addition of noise to $\mathbf{x}^{(t)}$ allows the iterate to escape from saddle points, e.g. $\mathbf{x}^{(0)}=0$.

Step 2: Gradient descent. We obtain $\mathbf{y}^{(t)}$ from the noisy vector $\mathbf{z}^{(t)}$ via a gradient descent step with step size $\gamma_{t}$ . Note that the gradient at $\mathbf{z}^{(t)}$ is given as $A\mathbf{z}^{(t)}$ hence this step can be expressed as $\mathbf{y}^{(t)}=(I+\gamma_{t}A)\mathbf{z}^{(t)}$ .

Step 3: Projection. The resulting vector $\mathbf{y}^{(t)}$ is then projected on the feasible space $\mathcal{B}_{\infty}\cap\bigcap_{j=1}^{d}\mathcal{S}^{j}_{\varepsilon}$ , where:

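$\mathcal{B}_{\infty}=\{\mathbf{x}\in\mathbb{R}^{n}\mid\forall i\colon x_{i}\in[-1,1]\}$,

$\mathcal{S}^{j}_{\varepsilon}=\left\{\mathbf{x}\in\mathbb{R}^{n}\mid\left|\sum_{i=1}^{n}w^{(j)}_{i}x_{i}\right|\leq\varepsilon\sum_{i=1}^{n}w^{(j)}_{i}\right\}$ for $j\in[d]$,

that is, $\mathcal{B}_{\infty}$ satisfies that $\|\mathbf{x}\|_{\infty}\leq 1$ and $\mathcal{S}^{j}_{\varepsilon}$ corresponds to the constraints imposed by the balance of weights according to the $j$-th weight function.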

The final solution is obtained by rounding the last $\mathbf{x}^{(t)}$: each vertex $i$ is assigned to part $V_{1}$ with probability $\frac{x^{(t)}_{i}+1}{2}$. Note that this ensures that the expected number of edges whose endpoints belong to the same part after this rounding is given as $\frac{1}{2}\sum_{(i_{1},i_{2})\in E}(x^{(t)}_{i_{1}}x^{(t)}_{i_{2}}+1)$.

The algorithm uses parameters $\eta_{t},\gamma_{t}$ , and $I$ , where $t$ is the iteration index. Here $\eta_{t}$ controls the magnitude of noise, $\gamma_{t}$ is the step size, and $I$ is the number of iterations. We discuss the selection of parameters in the experimental Section 4.
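The three steps and the final rounding can be put together in a minimal, single-machine Python sketch. Several simplifications here are assumptions rather than the paper's choices: the noise variance and step size are held constant instead of depending on $t$, the graph is stored as adjacency lists, and the projection is passed in as a function. The `clip_to_box` placeholder projects only onto $\mathcal{B}_{\infty}$ and ignores the balance constraints, so it merely demonstrates the control flow.

```python
import random


def clip_to_box(y):
    """Placeholder projection onto the box [-1, 1]^n only (no balance constraints)."""
    return [min(1.0, max(-1.0, yi)) for yi in y]


def gd_partition(n, edges, project, iterations=50, step=0.1, noise_var=0.01, seed=0):
    """Noisy projected gradient iteration followed by randomized rounding."""
    rng = random.Random(seed)
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    x = [0.0] * n  # x^(0) is the zero vector
    sigma = noise_var ** 0.5
    for _ in range(iterations):
        # Step 1: add zero-mean Gaussian noise with variance noise_var per coordinate.
        z = [xi + rng.gauss(0.0, sigma) for xi in x]
        # Step 2: y = (I + step * A) z, computed with adjacency lists.
        y = [z[i] + step * sum(z[j] for j in adj[i]) for i in range(n)]
        # Step 3: project back onto the feasible set.
        x = project(y)

    # Rounding: vertex i goes to V_1 (value 1) with probability (x_i + 1) / 2.
    rounded = [1 if rng.random() < (xi + 1) / 2 else -1 for xi in x]
    return x, rounded


# Toy graph (an assumption): two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
x_final, assignment = gd_partition(6, edges, clip_to_box)
```

Replacing `clip_to_box` with a projection onto the full feasible set $\mathcal{B}_{\infty}\cap\bigcap_{j=1}^{d}\mathcal{S}^{j}_{\varepsilon}$ would be needed to enforce balance.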

2.2 Projection

In the projection step of GD (Line 1) we need to find $\operatorname*{\arg\!\min}_{\mathbf{x}\in K}\|\mathbf{y}^{(t+1)}-\mathbf{x}\|_{2}$ , where $K=\mathcal{B}_{\infty}\cap\bigcap_{j=1}^{d}\mathcal{S}_{\varepsilon}^{j}$ . Denoting $\mathbf{y}^{(t+1)}$ as $\mathbf{y}$ we formulate this step as an optimization problem:

Minimize: $\displaystyle f(\mathbf{x})=\|\mathbf{x}-\mathbf{y}\|_{2}^{2}$

Subject to: $\displaystyle g_{i}=x_{i}^{2}-1\leq 0$ $\displaystyle\forall i\in[n]$

$\displaystyle h^{(j)}_{+}=\sum_{i=1}^{n}w^{(j)}_{i}x_{i}-\varepsilon\leq 0$ $\displaystyle\forall j\in[d]$

$\displaystyle h^{(j)}_{-}=-\sum_{i=1}^{n}w^{(j)}_{i}x_{i}-\varepsilon\leq 0$ $\displaystyle\forall j\in[d]$

The optimum solution to the optimization problem has to satisfy KKT conditions:

Stationarity:

$\displaystyle\mathbf{y}-\mathbf{x}=\sum_{i=1}^{n}\mu_{i}x_{i}\mathbf{e}_{i}+\sum_{j=1}^{d}(\mu^{(j)}_{+}-\mu^{(j)}_{-})\sum_{i=1}^{n}w^{(j)}_{i}\mathbf{e}_{i}$

Complementary slackness 1:

$\displaystyle\mu_{i}(x_{i}^{2}-1)=0,$ $\displaystyle\forall i\in[n]$

Complementary slackness 2:

$\displaystyle\mu^{(j)}_{+}\left(\sum_{i=1}^{n}w^{(j)}_{i}x_{i}-\varepsilon\right)=0,$ $\displaystyle\forall j\in[d]$

$\displaystyle\mu^{(j)}_{-}\left(\sum_{i=1}^{n}w^{(j)}_{i}x_{i}+\varepsilon\right)=0,$ $\displaystyle\forall j\in[d]$

Here $\mu_{i},\mu^{(j)}_{+},\mu^{(j)}_{-}\geq 0$ are the dual variables and $\mathbf{e}_{i}$ is the $i$-th standard unit vector. It is a standard fact (see [11], Chapter 5.5.3) that for convex optimization subject to linear constraints Stationarity, Complementary slackness and Primal/Dual feasibility are necessary and sufficient conditions for the optimum solution. Thus we just focus on satisfying these conditions below.

Let $\gamma_{i}=\sum_{j=1}^{d}(\mu^{(j)}_{+}-\mu^{(j)}_{-})w^{(j)}_{i}$ . Then by Stationarity for each $i$ we have $y_{i}-x_{i}=\mu_{i}x_{i}+\gamma_{i}$ . Consider the following three cases:

Case 1. $(y_{i}>1+\gamma_{i})$ . If $\mu_{i}=0$ then by Stationarity $x_{i}=y_{i}-\gamma_{i}>1$ which violates primal feasibility conditions. Therefore $\mu_{i}>0$ and $x_{i}^{2}=1$ by Complementary slackness 1. Among the two roots $x_{i}=1$ and $x_{i}=-1$ the second root can be ruled out and hence $x_{i}=1$ . Indeed, if $x_{i}=-1$ then by Stationarity $y_{i}+1=-\mu_{i}+\gamma_{i}$ which contradicts $\mu_{i}>0$ and $y_{i}>1+\gamma_{i}$ .

Case 2. $(y_{i}<-1+\gamma_{i})$ . This case is symmetric to the previous one and thus $x_{i}=-1$ in this case.

Case 3. $y_{i}\in[-1+\gamma_{i},1+\gamma_{i}]$ . First we show that $\mu_{i}=0$ . Indeed, assume that $\mu_{i}>0$ . Then $x_{i}=\pm 1$ by Complementary slackness 1. Both cases lead to contradiction:

1. $(x_{i}=1)$. By Stationarity $y_{i}-1=\mu_{i}+\gamma_{i}$, which contradicts $y_{i}\leq 1+\gamma_{i}$ and $\mu_{i}>0$.

2. $(x_{i}=-1)$. Similarly to the above, by Stationarity we have $y_{i}+1=-\mu_{i}+\gamma_{i}$, which contradicts $y_{i}\geq-1+\gamma_{i}$ and $\mu_{i}>0$.

Therefore in this case we have $\mu_{i}=0$ and hence by Stationarity $x_{i}=y_{i}-\gamma_{i}$ .

Let $\lambda_{j}=\mu^{(j)}_{+}-\mu^{(j)}_{-}$ and assume that these values are known to the algorithm. For $z\in\mathbb{R}$ we use the notation $[z]=\min(1,\max(-1,z))$ for the truncated linear function. Using the analysis above, the projection step is simply $x_{i}=[y_{i}-\sum_{j=1}^{d}\lambda_{j}w^{(j)}_{i}]$. It remains to show how to find $\{\lambda_{j}\}$.
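The truncation formula can be illustrated with a minimal sketch, assuming the multipliers $\lambda_{j}$ are already known. The function names `truncate` and `project_given_lambda` and the list-of-lists layout for the weights are illustrative choices, not taken from the paper.

```python
def truncate(z):
    """Truncated linear function [z] = min(1, max(-1, z))."""
    return min(1.0, max(-1.0, z))


def project_given_lambda(y, weights, lam):
    """Return x with x_i = [y_i - sum_j lam[j] * weights[j][i]].

    y       : list of n floats, the point to project
    weights : list of d lists, weights[j][i] = w_i^(j)
    lam     : list of d multipliers, assumed to be known here
    """
    n = len(y)
    return [
        truncate(y[i] - sum(l * w[i] for l, w in zip(lam, weights)))
        for i in range(n)
    ]


# One balance dimension with unit weights and lambda = 0.5.
x = project_given_lambda([2.0, 0.3, -1.5], [[1.0, 1.0, 1.0]], [0.5])
print(x)  # first and last coordinates are clipped to +1 and -1
```

Coordinates whose shifted value leaves $[-1,1]$ are clipped (Cases 1 and 2 above), while the remaining ones are only shifted (Case 3).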

Note that from Complementary slackness 2 it follows that either $\mu^{(j)}_{+}=0$ or $\mu^{(j)}_{-}=0$ since both of these values being positive leads to a contradiction. This leads to three cases: 1) $\mu^{(j)}_{+}=0,\mu^{(j)}_{-}>0$ , 2) $\mu^{(j)}_{-}=0,\mu^{(j)}_{+}>0$ and 3) $\mu^{(j)}_{+}=\mu^{(j)}_{-}=0$ which correspond to the three possibilities for $sign(\lambda_{j})$ . For each of the $d$ dimensions we can try all three choices. For a fixed guess of the signs let $S_{+}=\{j\colon\lambda_{j}>0\}$ , $S_{0}=\{j\colon\lambda_{j}=0\}$ and $S_{-}=\{j\colon\lambda_{j}<0\}$ . Assuming a correct guess of $sign(\lambda_{j})$ for each of the dimensions the optimization problem above reduces to the following:

Proposition 2.1

For the correct guess of $sign(\lambda_{j})$ for all $j\in[d]$ it suffices to find the optimum of the above optimization problem without the constraints for $j\in S_{0}$ . This optimum is unique.

The proof is given in Appendix B. Using Proposition 2.1 and trying all guesses for $sign(\lambda_{j})$ we can reduce the projection step to $3^{d}$ instances of the following optimization problem:

Minimize:

$\displaystyle f(\mathbf{x})=\|\mathbf{y}-\mathbf{x}\|_{2}^{2}$

Subject to:

$\displaystyle g_{i}=x_{i}^{2}-1\leq 0$ $\displaystyle\forall i\in[n]$

$\displaystyle\sum_{i=1}^{n}w^{(j)}_{i}x_{i}=\varepsilon,$ $\displaystyle\forall j\in S_{+};$ $\displaystyle\sum_{i=1}^{n}w^{(j)}_{i}x_{i}=-\varepsilon,$ $\displaystyle\forall j\in S_{-}$

which can be done by finding numbers $\lambda_{j}>0$ for $j\in S_{+}$ and $\lambda_{j}<0$ for $j\in S_{-}$ and setting $x_{i}=[y_{i}-\sum_{j\in S_{+}\cup S_{-}}\lambda_{j}w^{(j)}_{i}]$. The choice of the $\lambda_{j}$’s has to satisfy the constraints $\sum_{i=1}^{n}w^{(j)}_{i}x_{i}=\varepsilon$ for all $j\in S_{+}$ and $\sum_{i=1}^{n}w^{(j)}_{i}x_{i}=-\varepsilon$ for all $j\in S_{-}$. In the analysis below we assume that $d=|S_{+}\cup S_{-}|$ corresponds to the “effective dimension” of the problem.

2.3 Exact Projection Algorithms

Projection for $d=1$

As a warm up, we first show how to perform exact projection for $d=1$ in ${\mathcal{O}}(n\log n)$ time, proving Theorem 1.1 for $d=1$ . This can be further improved to ${\mathcal{O}}(n)$ using a more careful approach [31]. However, to the best of our knowledge, no fast algorithm is known for $d>1$ which is the main focus of our work. Dropping the second index to simplify presentation (that is, $w_{i}=w^{(1)}_{i}$ ) and using the fact that $x_{i}=[y_{i}-\lambda w_{i}]$ we have:

[TABLE]

We introduce notation $h_{i}(\lambda)$ where each $h_{i}$ is the following piecewise linear function:

[TABLE]
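One way to write $h_{i}$ explicitly, derived here from $h_{i}(\lambda)=w_{i}x_{i}$ and $x_{i}=[y_{i}-\lambda w_{i}]$ under the assumption $w_{i}>0$ (for $w_{i}=0$ the function is identically zero), is the following:

```latex
h_{i}(\lambda) = w_{i}\,[y_{i}-\lambda w_{i}] =
\begin{cases}
  w_{i}, & \lambda \leq \frac{y_{i}-1}{w_{i}}, \\[2pt]
  w_{i}\,(y_{i}-\lambda w_{i}), & \frac{y_{i}-1}{w_{i}} < \lambda < \frac{y_{i}+1}{w_{i}}, \\[2pt]
  -w_{i}, & \lambda \geq \frac{y_{i}+1}{w_{i}}.
\end{cases}
```

Each $h_{i}$ is therefore constant outside the interval between its two breakpoints and linear with slope $-w_{i}^{2}$ inside it.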

Thus $\sum_{i=1}^{n}w_{i}x_{i}=\sum_{i=1}^{n}h_{i}(\lambda)$ and the problem reduces to finding $\lambda^{*}$ such that $\sum_{i}h_{i}(\lambda^{*})=\pm\varepsilon$ where the sign depends on whether our dimension is in $S_{+}$ or $S_{-}$ . Since $w_{i}\geq 0$ for all $i$ each $h_{i}$ is monotone in $\lambda$ and so the function $\sum_{i}h_{i}$ is a monotone piecewise linear function. The value of $\lambda^{*}$ can be found in ${\mathcal{O}}(\log n)$ iterations of binary search where each iteration requires ${\mathcal{O}}(n)$ time to evaluate the sum. This gives the overall running time of ${\mathcal{O}}(n\log n)$ . See Figure 2 for an illustration.
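The binary search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it sorts the at most $2n$ breakpoints $(y_{i}\pm 1)/w_{i}$, binary-searches for the two adjacent breakpoints that bracket the target value, and interpolates linearly between them. The names `h`, `project_1d` and `target` (standing for $\pm\varepsilon$) are assumptions made for the example.

```python
def h(lam, y, w):
    """Evaluate sum_i w_i * [y_i - lam * w_i] in O(n) time."""
    return sum(wi * min(1.0, max(-1.0, yi - lam * wi)) for yi, wi in zip(y, w))


def project_1d(y, w, target):
    """Find lam with h(lam) = target and return (lam, x).

    Assumes w_i >= 0, at least one w_i > 0, and |target| <= sum_i w_i.
    """
    # Breakpoints where some h_i changes slope.
    bps = sorted({(yi + s) / wi for yi, wi in zip(y, w) if wi > 0
                  for s in (-1.0, 1.0)})
    if not bps or not (h(bps[-1], y, w) <= target <= h(bps[0], y, w)):
        raise ValueError("target is not attainable")

    # h is non-increasing, so keep h(bps[lo]) >= target >= h(bps[hi]).
    lo, hi = 0, len(bps) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if h(bps[mid], y, w) >= target:
            lo = mid
        else:
            hi = mid

    # h is linear between adjacent breakpoints: interpolate.
    h_lo, h_hi = h(bps[lo], y, w), h(bps[hi], y, w)
    if h_lo == h_hi:
        lam = bps[lo]
    else:
        lam = bps[lo] + (h_lo - target) * (bps[hi] - bps[lo]) / (h_lo - h_hi)
    x = [min(1.0, max(-1.0, yi - lam * wi)) for yi, wi in zip(y, w)]
    return lam, x


lam, x = project_1d([2.0, 0.3, -1.5], [1.0, 1.0, 1.0], target=0.1)
print(lam, x)  # lam is close to 0.2 and the weighted sum of x is close to 0.1
```

Sorting costs ${\mathcal{O}}(n\log n)$ and each of the ${\mathcal{O}}(\log n)$ search steps evaluates the sum in ${\mathcal{O}}(n)$ time, matching the bound stated above.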

Projection for $d=2$

For $d=2$ we need to find $(\lambda_{1},\lambda_{2})$ such that $\sum_{i=1}^{n}h^{(j)}_{i}(\lambda_{1},\lambda_{2})=\pm\varepsilon$ for $j=1,2$ , where $h^{(j)}_{i}(\lambda_{1},\lambda_{2})$ is defined below.

[TABLE]

where $\sigma_{i}=\lambda_{1}w^{(1)}_{i}+\lambda_{2}w^{(2)}_{i}$ . The projection process is shown in Figure 3. In Appendix A.2, we prove Theorem 1.1 for $d=2$ showing that nested binary search can be used to solve this problem in ${\mathcal{O}}(n\log n)$ time.
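For reference, the two-dimensional analogue of the one-dimensional piecewise linear function can be written out as below. This form is derived from $h^{(j)}_{i}=w^{(j)}_{i}x_{i}$ and $x_{i}=[y_{i}-\sigma_{i}]$, which is how $h^{(j)}$ is used in Appendix A.1; it is a sketch consistent with those definitions.

```latex
h^{(j)}_{i}(\lambda_{1},\lambda_{2}) = w^{(j)}_{i}\,[y_{i}-\sigma_{i}] =
\begin{cases}
  w^{(j)}_{i}, & \sigma_{i} \leq y_{i}-1, \\[2pt]
  w^{(j)}_{i}\,(y_{i}-\sigma_{i}), & y_{i}-1 < \sigma_{i} < y_{i}+1, \\[2pt]
  -w^{(j)}_{i}, & \sigma_{i} \geq y_{i}+1,
\end{cases}
\qquad \sigma_{i}=\lambda_{1}w^{(1)}_{i}+\lambda_{2}w^{(2)}_{i}.
```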

3 Implementation

3.1 Projection algorithms

We considered the following three methods for the projection step (Algorithm 1, Line 1). Their theoretical properties are summarized in Table 1.

•

Alternating projections: A standard approach for projecting onto the intersection of convex sets is the alternating projections method (see [10]). It is easy to implement projections on $\mathcal{B}_{\infty}$ and $\cap_{j=1}^{d}S^{j}_{\varepsilon}$ separately. Since both are convex bodies, alternating projections on each of them guarantees convergence to a point in the intersection, but there is no guarantee that this point is the actual projection. In practice, we achieve slightly better balance by modifying this approach and projecting on $S^{j}_{0}$ instead of $S^{j}_{\varepsilon}$. This still ensures that we get a point in the intersection in the end.

•

Dykstra’s projection: [1] We also considered Dykstra’s projection algorithm [15]. This is a modification of the alternating projections method which is guaranteed to converge to the projection.

•

Exact projection for $d\leq 2$: This is the algorithm presented in Section 2.3. In our experiments Dykstra’s algorithm and exact projection give similar results, since they find approximately the same projection point.
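The difference between the first two methods can be seen in a small sketch. It assumes that $\mathcal{B}_{\infty}$ is the cube $[-1,1]^{n}$ and that $S^{j}_{\varepsilon}$ is the slab $\{\mathbf{x}:|\sum_{i}w^{(j)}_{i}x_{i}|\leq\varepsilon\}$, as in the balance constraints; the function names and the fixed round counts are illustrative.

```python
def project_box(x):
    """Projection onto the cube [-1, 1]^n."""
    return [min(1.0, max(-1.0, xi)) for xi in x]


def project_slab(x, w, eps):
    """Projection onto the slab {z : |<w, z>| <= eps}."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    excess = s - eps if s > eps else s + eps if s < -eps else 0.0
    if excess == 0.0:
        return list(x)
    norm_sq = sum(wi * wi for wi in w)
    return [xi - excess * wi / norm_sq for xi, wi in zip(x, w)]


def alternating_projections(y, weights, eps, max_rounds=1000, tol=1e-9):
    """Cycle through the slabs and the cube until the iterate stops moving."""
    x = list(y)
    for _ in range(max_rounds):
        prev = x
        for w in weights:
            x = project_slab(x, w, eps)
        x = project_box(x)
        if max(abs(a - b) for a, b in zip(x, prev)) < tol:
            break
    return x


def dykstra(y, projections, rounds=200):
    """Dykstra's algorithm: cyclic projections with one correction term per set."""
    x = list(y)
    increments = [[0.0] * len(y) for _ in projections]
    for _ in range(rounds):
        for k, proj in enumerate(projections):
            z = [xi + ci for xi, ci in zip(x, increments[k])]
            x = proj(z)
            increments[k] = [zi - xi for zi, xi in zip(z, x)]
    return x


y, w, eps = [2.0, 0.3, -1.5], [1.0, 1.0, 1.0], 0.1
x_alt = alternating_projections(y, [w], eps)
x_dyk = dykstra(y, [lambda z: project_slab(z, w, eps), project_box])
print(x_alt, x_dyk)
```

On this input alternating projections stops at approximately $(1,\,0.067,\,-1)$, which is feasible but not the closest feasible point, while Dykstra's iterates approach $(1,\,0.1,\,-1)$, the point obtained by the exact one-dimensional projection with $\lambda=0.2$.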

In Section 4.3 we study how the quality of partitions produced by GD depends on the choice of projection method. Since the exact projection algorithm is computationally the most expensive, in our experiments we mostly use the alternating projections method. Moreover, since in practice each iteration of alternating projection is computationally expensive, in the intermediate iterations we project on each plane and the cube only once, while in the last iterations we run the alternating projections method until convergence. We refer to this choice as “one-shot” alternating projection below.

3.2 Adaptive Step Size

Recall that Algorithm 1 has the following parameters: Gaussian noise variances for each step $\{\eta_{t}\}$ and step size parameters $\{\gamma_{t}\}$. Due to the spectral properties of the adjacency matrix, in our experiments the algorithm doesn’t encounter any saddle points other than the initial point $\mathbf{x}=0$. Therefore it suffices to add Gaussian noise only at the first iteration (that is, $\eta_{t}=0$ for $t\neq 0$).

The simplest choice of the step size parameters $\{\gamma_{t}\}$ is constant, but it gives suboptimal results in our experiments. Carefully chosen step size parameters for different iterations not only give better performance but can also be used to ensure that convergence is reached in a fixed number of steps. In Section 4.3 we discuss how to choose the step size to achieve good performance on a wide range of graphs.

The choice of step size parameters is complicated by the projection step. The change in the objective function and the progress towards an integral solution can both be related to the progress in Euclidean distance $\|\mathbf{x}_{t}-\mathbf{x}_{t+1}\|$ between the iterations. While consistent progress in Euclidean distance can be ensured by multiplying the gradient by an appropriate amount, after the projection the actual progress can be much smaller.

Another important implementation detail is our handling of vertices which are close to integral. When the number of such vertices becomes large the progress of the algorithm can slow down. This is due to the fact that while the gradient vector is still large all of its large components correspond to already integral vertices and point to the outside of the feasible region. These large components can then dominate in the computation of the projection step which leads to slow convergence. In order to avoid this issue we “fix” such vertices so that they become integral and no longer participate in the gradient update and the projection step. As we show in Section 4.3 this results in noticeable improvements in the quality of the resulting partitions.
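A minimal sketch of the vertex fixing idea is shown below. The threshold `0.99` for "close to integral" and both function names are assumptions made for illustration; the text above does not state the criterion used in the actual implementation.

```python
def fix_near_integral(x, fixed, threshold=0.99):
    """Round coordinates with |x_i| >= threshold to +-1 and mark them fixed.

    Modifies x and fixed in place and returns the indices that remain free.
    The threshold value is an illustrative assumption.
    """
    for i, xi in enumerate(x):
        if not fixed[i] and abs(xi) >= threshold:
            x[i] = 1.0 if xi > 0 else -1.0
            fixed[i] = True
    return [i for i, is_fixed in enumerate(fixed) if not is_fixed]


def masked_gradient(grad, fixed):
    """Zero out gradient components of fixed vertices so they stop moving."""
    return [0.0 if is_fixed else g for g, is_fixed in zip(grad, fixed)]


x = [0.995, -0.2, -0.999, 0.5]
fixed = [False] * len(x)
free = fix_near_integral(x, fixed)
print(x, free, masked_gradient([5.0, 1.0, -7.0, 2.0], fixed))
```

Once vertices are fixed, the projection only needs to act on the free coordinates, with each balance target reduced by the weight already committed by the fixed ones.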

3.3 Partitioning Into k Buckets

For partitioning into more than two buckets two main approaches are typically considered. We use the second approach due to its higher efficiency.

Problem relaxation for $k$ buckets: For each vertex $i$ and bucket $j$ we can introduce a variable $p_{ij}$ corresponding to whether $i$ belongs to bucket $j$ and then adjust the relaxation accordingly. Our algorithm GD can then be modified to handle such relaxation. The main drawback of this approach is that it requires ${\mathcal{O}}(k\cdot|E|)$ communication per iteration, which makes it infeasible for partitioning large graphs into many buckets.

Recursive partitioning: The graph is partitioned recursively $\lceil\log_{2}k\rceil$ times into two parts. While there are cases when recursive partitioning can result in a suboptimal partition regardless of the underlying algorithm, this approach requires ${\mathcal{O}}(|E|)$ memory, ${\mathcal{O}}(|E|)$ operations per iteration and ${\mathcal{O}}(\log k)$ runs of GD, which makes it applicable to very large graphs. For simplicity we only show results for $k$ being powers of two but the algorithm can be modified to handle any $k$ by changing the coefficients in the balance constraints.
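The recursive scheme can be sketched as follows for $k$ a power of two. The `bisect` argument is a hypothetical stand-in for one balanced two-way split (for example, one run of GD with two buckets); here it is replaced by a trivial halving function only so that the example runs.

```python
def recursive_partition(vertices, k, bisect):
    """Split `vertices` into k buckets (k a power of two) by repeated bisection.

    `bisect` takes a list of vertices and returns two lists; it is a
    placeholder for a balanced two-way partitioner.
    """
    if k == 1:
        return [list(vertices)]
    left, right = bisect(vertices)
    return (recursive_partition(left, k // 2, bisect)
            + recursive_partition(right, k // 2, bisect))


def halve(vertices):
    """Toy bisection used for the demonstration only."""
    mid = len(vertices) // 2
    return vertices[:mid], vertices[mid:]


parts = recursive_partition(list(range(16)), 4, halve)
print(parts)
```

The recursion has $\log_{2}k$ levels, and at each level the subproblems together cover every vertex exactly once.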

4 Experiments

We design our experiments to understand how well the new partitioning algorithm, GD, behaves on real-world datasets and how it affects the performance of distributed graph processing. As pointed out in Section 1, we are not aware of an alternative scalable approach for solving the multi-dimensional balanced partitioning. However, some of the existing techniques for one-dimensional partitioning can be adapted for the multi-dimensional case. Next we discuss several such techniques, which are evaluated together with the newly proposed algorithm.

Hash is the simplest partitioning strategy that assigns vertices to worker machines by hashing the vertex identifiers. Hashing is stateless, extremely fast in practice, and requires no preprocessing of the graph, which made it the default strategy in Giraph. The main disadvantage is that the majority of sent messages are non-local and may result in significant communication.

Spinner is a graph partitioning algorithm that can be applied to process large-scale graphs in a distributed environment [33]. The algorithm is based on the label propagation technique in which vertices exchange their labels, each trying to pick the most frequent label among its neighbors. This process guarantees a high number of adjacent vertices having the same label, which are then assigned to the same worker. Spinner does not enforce a strict balance across partitions but integrates score functions that penalize imbalanced solutions.

BLP is another approach based on balanced label propagation that combines the ideas of Ugander and Backstrom [42] and Meyerhenke et al. [34]. In the first step, the method creates a size-constrained clustering of the input graph using significantly more clusters than the number of available machines, $k$. In our implementation, we construct $c\times k$ clusters for $c=1024$ and forbid a cluster to contain more than $\frac{|V|}{c\times k}$ vertices and $\frac{|E|}{c\times k}$ edges. In the second step, we randomly merge the clusters into $k$ partitions, which results in the multi-dimensional balance even if the original clusters have different sizes.

SHP is a distributed graph partitioner [38, 22] that is based on a classical local search heuristic [25]. Although SHP does not provide balancing on multiple dimensions, it supports a mode with several dimensions whose final balance is not guaranteed. The algorithm works by balancing on a new dimension, which is a combination of the specified dimensions. We configure SHP to find solutions having the same number of edges (with a higher coefficient in the combination) and the same number of vertices (with a lower coefficient) in every partition.

We implemented the algorithms and extensively experimented with the Giraph framework, which is used as the primary tool for large-scale graph analytics at Facebook [6, 2]. Although the evaluation is performed with a single distributed graph processing system, we believe that our main conclusions are valid for other frameworks relying on the vertex-centric programming model. For our experiments, we use four large social networks that are publicly available [28]. LiveJournal, Orkut, Twitter, and Friendster are undirected graphs containing $4.8$, $3.1$, $41$, and $65$ million vertices and $0.04$, $0.12$, $1.2$, and $1.8$ billion edges, respectively. In addition, we experiment with several large subgraphs of the Facebook friendship graph that serve to demonstrate scalability of our approach and its performance on real-world data. We denote the graphs by FB-X, where X indicates the (approximate) number of billions of edges; this data is anonymized before processing.

Next we analyze the quality of the solutions produced by the algorithms on our dataset (Section 4.1) and evaluate various graph partitioning strategies for speeding up distributed graph processing for real-world workloads (Section 4.2). Section 4.3 investigates various parameters of GD.

4.1 Multi-Dimensional Partitioning

Our initial experiments (see Figure 1) and earlier works [18, 29, 33] indicate that two important dimensions for the performance of Giraph jobs are the number of vertices and the number of edges. For this reason, we specify two weights for the vertices, $w^{(1)}_{v}=1$ and $w^{(2)}_{v}=\deg(v)$ for all $v\in V$ . Recall that our primary goal is to guarantee almost perfect balance for the two dimensions, as even a single overloaded partition affects the job performance. Figure 4 illustrates the resulting vertex and edge imbalance of the solutions on the public networks for three algorithms, Spinner, BLP, and SHP, using $k=2$ and $k=8$ partitions. The imbalance is defined as $\left(\frac{\max_{i}w(V_{i})}{\operatorname*{avg}_{i}w(V_{i})}-1\right)$ , where the maximum and the average are taken over the total weight of all $k$ constructed partitions. We do not include the results for Hash and GD, as the corresponding values are below $0.01$ for the instances.
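The imbalance measure can be computed directly from a partition assignment. The sketch below uses a dictionary from vertex to partition index and a dictionary of vertex weights; these data structures and the function name are assumptions made for the example.

```python
def imbalance(assignment, weight, k):
    """Return max_i w(V_i) / avg_i w(V_i) - 1 over the k partitions.

    assignment : dict mapping vertex -> partition index in range(k)
    weight     : dict mapping vertex -> weight (1 for vertex balance,
                 deg(v) for edge balance)
    """
    totals = [0.0] * k
    for v, part in assignment.items():
        totals[part] += weight[v]
    avg = sum(totals) / k
    return max(totals) / avg - 1.0


assignment = {0: 0, 1: 0, 2: 1, 3: 1}
vertex_weight = {v: 1 for v in assignment}
degree_weight = {0: 3, 1: 1, 2: 1, 3: 1}
print(imbalance(assignment, vertex_weight, 2))  # perfectly balanced on vertices
print(imbalance(assignment, degree_weight, 2))  # imbalanced on degrees
```

The toy assignment is balanced on vertex count but not on degree, which is the situation the two weight functions are meant to detect.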

We observe that two algorithms, Spinner and SHP, are not suitable for the multi-dimensional variant of the problem. For dense graphs with a highly skewed degree distribution (as in Twitter), the algorithms cannot simultaneously provide balance on the two dimensions. With the default setting, these two algorithms generate solutions in which some of the partitions contain $1.5$-$2\times$ more vertices than the average one. We tried to modify the techniques by adjusting relative weights of their penalty functions for vertex and degree counts in resulting partitions. However, we were not able to design universal penalty weights that work for all instances. A similar behavior regarding the resulting balance is observed for our internal graphs, FB-3B, FB-80B, and FB-400B. In contrast, Hash, GD, and BLP produced nearly-balanced (that is, having $\varepsilon\leq 0.05$ both for vertex and edge counts) solutions for all the instances. With this in mind, we exclude Spinner and SHP from further experiments.

Next we compare the quality of our algorithm as measured by the resulting edge locality, that is, the percentage of uncut edges with both endpoints in the same partition. The metric represents the fraction of local messages in Giraph jobs and corresponds to a possible reduction in communication between the worker machines. Figure 5 reports the results of Hash, GD, and BLP on the public dataset. Unsurprisingly, GD and BLP outperform the Hash algorithm in the experiment, as the latter keeps only $\frac{1}{k}$ of all the edges in the same partition. The resulting edge localities of GD and BLP are close for the three graphs, though GD typically achieves a higher locality by $2\%-5\%$.
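Edge locality is equally simple to compute from an edge list and an assignment. The following sketch assumes an undirected edge list of vertex pairs; the function name is illustrative.

```python
def edge_locality(edges, assignment):
    """Fraction of edges whose endpoints lie in the same partition."""
    local = sum(1 for u, v in edges if assignment[u] == assignment[v])
    return local / len(edges)


# A 4-cycle split into two halves: two edges stay local, two are cut.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
assignment = {0: 0, 1: 0, 2: 1, 3: 1}
print(edge_locality(edges, assignment))
```

Under a uniformly random assignment into $k$ partitions, each edge is local with probability $\frac{1}{k}$, which is the baseline attributed to Hash above.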

Figure 6 shows the experiments on the Facebook friendship graphs. Here we use a larger number of partitions, $k$, which more accurately represents the real-world Giraph use case. Again, Hash produces solutions having the lowest edge localities. In fact, over $99\%$ of the edges are cut using the partitioning strategy for an instance with a hundred partitions. This is in agreement with our measurements of the typical percentage of cross-worker Giraph messages in the production environment.

On the other hand, we observe a bigger advantage of GD over BLP; the locality difference is around $10\%-20\%$ for $k=16$ and $5\%-10\%$ for $k=128$ . The balanced label propagation algorithm, BLP, could be configured to produce better results by decreasing its cluster size threshold, $c$ . However, this results in an imbalanced solution with $\varepsilon>0.05$ for the largest instance with $k=128$ . Hence, we keep the value of $c=1024$ for all the experiments.

The main difference between FB graphs and publicly available graphs is the number of edges. The main reason why on FB graphs GD performs better compared to other algorithms is poor performance of existing local-search based methods on large graphs in the multi-dimensional case. This is most obvious in Figure 6 for $k=128$ as one can see that GD is gaining a larger advantage over BLP as the size of the graph grows (3B $\rightarrow$ 80B $\rightarrow$ 400B).

Overall we conclude that GD generates solutions of higher quality than BLP and Hash on all examined instances. Therefore, we utilize the algorithm to experiment with distributed graph processing in the next section. We present results for $3$ - and $4$ -dimensional experiments in Appendix C.

4.2 Distributed Graph Processing

In this section we conduct an experimental evaluation of various graph partitioning strategies for speeding up distributed graph processing. Here we argue and experimentally demonstrate that multi-dimensional balancing is a suitable objective for the application. We experiment with four graph algorithms implemented in Giraph. Two of them, Page Rank and Connected Components, are popular benchmarks for verifying the performance of distributed systems. Page Rank iteratively propagates vertex ranks through adjacent edges; our implementation performs $30$ iterations for the algorithm. For the Connected Components algorithm, we use a simple label propagation technique in which vertices iteratively update their labels based on the minimum label of their neighbors; for our graphs, the process converges after at most $50$ rounds. The other two algorithms, Hypergraph Clustering and Mutual Friends, are production applications for large-scale graph analytics at Facebook. The former is used to find a certain clustering of the input graph by converting it to a hypergraph. The latter builds a set of features for friend recommendation on Facebook. Both applications extensively exchange messages between adjacent vertices, which adds a significant communication overhead.

Figure 7 depicts the results of our experiment. Since we are interested in the impact of various partitioning policies on the performance of Giraph, we report the relative differences to the baseline policy, Hash. Here we measure the total runtime of an application using GD as the partitioning strategy in three modes, vertex partitioning (one-dimensional balance on vertex count), edge partitioning (balance on edge count), and vertex-edge partitioning (two-dimensional balance both on vertex and edge counts). Every algorithm is applied in two configurations, small and large. The first one uses the FB-80B graph and a cluster with $16$ worker machines, while the second one processes FB-400B using $128$ workers.

The key finding is that one-dimensional partitioning cannot provide consistent benefits across all the Giraph applications. In fact, we observe performance regression for some instances, in particular, when the number of utilized worker machines is large, that is, $k=128$. In this scenario, we notice a few workers whose running time is significantly larger than the average; see Figure 1. Since in Giraph (and other vertex-centric systems) the computation is split into a number of supersteps that end with a global synchronization barrier, the performance is determined by the slowest worker. Notice that a similar phenomenon regarding vertex partitioning has been observed in earlier works [18, 3, 40, 20]. In contrast, the two-dimensional partitioning always results in a speedup over the default Hash strategy. The improvement is in the order of $10\%-30\%$ for the examined applications.

To get a deeper understanding of the source of performance differences, we analyze the detailed logs for the Page Rank application using a cluster with $128$ worker machines. Table 2 shows the measurements of the mean, maximum, and standard deviation of the time to compute a superstep by all the workers. The results indicate that with Hash partitioning the workers are idling on average for $7$ seconds per superstep waiting for the slowest one to complete the work. With one-dimensional partitioning the idling time is much longer, $50$ seconds for vertex-based partitioning and $38$ seconds for edge-based one, which is the primary reason for the performance regression. The two-dimensional partitioning results in a more even load across the workers, delivering a $13.2\%$ speedup. Table 2 also indicates a significant communication reduction over the baseline partitioning, as measured by the total size of messages sent between the workers via network. For the Page Rank application, the average reduction is correlated with the edge locality of the corresponding partitioning. However, an unbalanced partitioning causes some workers to use more memory resources and become a bottleneck for graph processing.

Finally, we emphasize that the timings analyzed in the section exclude the running times of the partitioner itself. This is realistic for our use case in which the same friendship graph is expected to be processed multiple times for various analytics tasks. Thus, the extra overhead incurred by a partitioning strategy is amortized among several runs.

4.3 Parameters of GD

In this section we perform an experimental comparison of various choices of the projection step algorithm in GD and study its convergence properties. Unless specified otherwise, we use two-dimensional GD in the following setting:

balance is required with respect to the number of vertices and their degrees,
in the projection step we use “one-shot” alternating projection (see Section 3.1),
we use adaptive step size and vertex fixing as described in Section 3.2.

Since behavior of gradient descent algorithms can depend on selection of the step size parameters, we used experiments to establish convergence of GD with different choices of these parameters. In particular, our implementation aims to ensure that the step length $\|\mathbf{x}^{(t)}-\mathbf{x}^{(t+1)}\|_{2}$ remains close to constant between iterations. A natural scaling parameter for the step length is $\sqrt{n}$ as it corresponds to the distance between the initial solution $\mathbf{x}_{0}=0$ and any integral solution of the form $\{-1,1\}^{n}$ . As we show in Figure 8 for various graphs a good choice of step size turns out to be $2\frac{\sqrt{n}}{100}$ , where $100$ is the limit we set on the number of iterations due to the constraints on the runtime during the execution.
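One simple way to keep the step length before projection fixed at $2\frac{\sqrt{n}}{100}$ is to rescale the gradient to that length, as sketched below. The function name, the `factor` and `max_iterations` parameters, and the choice to normalize before projecting are assumptions for illustration; as noted in Section 3.2, the distance actually travelled after the projection can be smaller.

```python
import math


def scaled_step(grad, max_iterations=100, factor=2.0):
    """Rescale `grad` so that its Euclidean length is factor * sqrt(n) / max_iterations."""
    n = len(grad)
    target = factor * math.sqrt(n) / max_iterations
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * n
    return [g * target / norm for g in grad]


step = scaled_step([3.0, 4.0, 0.0, 0.0])
print(step, math.sqrt(sum(s * s for s in step)))  # length 2 * sqrt(4) / 100 = 0.04
```

With this scaling, $100$ unobstructed steps cover a total distance of $2\sqrt{n}$, twice the distance from $\mathbf{x}_{0}=0$ to any point of $\{-1,1\}^{n}$.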

In Figure 9 we show how adaptive step size and vertex fixing described in Section 3 affect the performance of the algorithm. Note that compared with other methods vertex fixing not only improves quality but also preserves almost perfect balance even when simple “one-shot” alternating projection is used. Finally, in Figure 10 we show analysis of performance of the algorithm under different choices of the projection method. The results show that the exact projection algorithm with sufficiently large allowed imbalance leads to the best performance. Larger imbalance permits more partitions, possibly including ones with better locality, allowing the overall algorithm to find partitions with better locality. However, the alternating projections algorithm can often be used to achieve similar performance. This is most likely due to the fact that the alternating projections algorithm despite not computing the projection outputs a point close enough to it.

4.4 Performance Analysis

Finally, we analyze scalability of our algorithm. Our results are obtained on a Hadoop cluster of $128$ workers; each of the machines is a dual-node 2.4 GHz Intel Xeon E5-2680 with 256GB RAM. Figure 11 reports the running time of GD in machine-hours on FB-X graphs of various size with balance on two dimensions. We observe a near-linear growth of the running time with the size of the input graph. In comparison, the running time of the SHP algorithm exceeds these values by a factor of $1.5$-$2$ on the same cluster configuration. Despite the fact that our implementation is not specifically optimized for performance, GD processes huge graphs within a few hours in the distributed setting.

5 Conclusion

We introduced a new Multi-Dimensional Balanced Graph Partitioning algorithm which produces balanced partitions according to multiple user-specified weight functions while maintaining high edge locality. Our results show that this algorithm is scalable and for large graphs with small allowed vertex and edge imbalance outperforms existing solutions. Resulting partitions allow one to achieve substantial speedups in computational time for various computational tasks. This is in contrast with balancing on just one dimension (for example, vertex or edge count, separately), which can sometimes result in worse performance. We state several open problems below.

One of the most interesting directions for future work is incorporating a wider range of balancing requirements, for example, those that can depend on the resulting partitioning itself such as the number of local edges and the maximum number of edges going between any pair of parts in the resulting partition. For example, the latter quantity can substantially affect performance of distributed computation tasks in Giraph-like systems as communication between different machines depends on the number of edges between them. Note that our proposed algorithm can’t directly handle such solution-dependent weight functions as they can’t be specified through an a priori fixed collection of weight functions.

Another open problem is a scalable algorithm for solving multi-dimensional balanced partitioning into $k$ parts without using recursive partitioning. As discussed in Section 3.3, applying a similar algorithm to the straightforward problem relaxation results in ${\mathcal{O}}(k\cdot|E|)$ communication, which comes from the inherently continuous nature of the algorithm compared to discrete ones. In discrete algorithms a vertex can occupy only one bucket, but in our algorithm it can occupy all buckets with some probabilities. Since all these probabilities may change, $\Theta(k)$ information can be sent to neighbors.

An interesting theoretical question is finding a fast algorithm for exact projection for $d>2$. As we show in Appendix A.1, it is possible to use nested binary search to find $\{\lambda_{j}\}$ (and therefore the projection) with arbitrary precision. Unfortunately, the running time of the suggested algorithm is unknown, because it is unclear how to estimate left and right bounds for binary search. Determining these bounds gives an algorithm with running time ${\mathcal{O}}(n\cdot\prod_{j=1}^{d}\log\frac{r_{j}-l_{j}}{\delta})$, where $l_{j}$ and $r_{j}$ are bounds for $\lambda_{j}$ and $\delta$ is the required precision.

Another interesting theoretical question is understanding the convergence properties of our algorithm (or a similar gradient descent based method) under some assumption about the spectral properties of the graph. We see this as a challenging open problem – while noisy gradient descent is known to have fast convergence for non-convex optimization subject to equality constraints, if inequality constraints are allowed convergence analysis is unknown [16].

Appendix A Multidimensional projection

In this section we consider the projection problem in the multidimensional case. In Section 2.2 we reduced projection to the following optimization problem.

Minimize: $\displaystyle f(\mathbf{x})=\|\mathbf{y}-\mathbf{x}\|_{2}^{2}$

Subject to: $\displaystyle g_{i}=x_{i}^{2}-1\leq 0$ $\displaystyle\forall i\in[n]$

$\displaystyle\sum_{i=1}^{n}w^{(j)}_{i}x_{i}=\varepsilon$ $\displaystyle\forall j\in S_{+};$

$\displaystyle\sum_{i=1}^{n}w^{(j)}_{i}x_{i}=-\varepsilon$ $\displaystyle\forall j\in S_{-}$

The KKT conditions for this problem were then further reduced to the following problem. Given $\mathbf{y}\in\mathbb{R}^{n}$ we need to find its projection $\mathbf{x}$ whose coordinates are given as $x_{i}(\lambda_{1},\ldots,\lambda_{d})=[y_{i}-\sum_{j}w^{(j)}_{i}\lambda_{j}]$ by selecting the values $(\lambda_{1},\ldots,\lambda_{d})$ so as to satisfy the balance constraints, i.e. $\sum_{i=1}^{n}w^{(j)}_{i}x_{i}=\varepsilon$ for $j\in S_{+}$ and $\sum_{i=1}^{n}w^{(j)}_{i}x_{i}=-\varepsilon$ for $j\in S_{-}$. We consider more general constraints: $\sum_{i=1}^{n}w^{(j)}_{i}x_{i}=c_{j}$ for $j\in[d]$, where $\{c_{j}\}$ are some constants. Let $\bm{\mathbf{\lambda}}=(\lambda_{1},\ldots,\lambda_{d})$. Since $\mathbf{x}$ can be computed based on $\bm{\mathbf{\lambda}}$, it remains to show how to find $\bm{\mathbf{\lambda}}$ satisfying these constraints.

The contents of this section are the following:

•

We show that it’s possible to find $\bm{\mathbf{\lambda}}$ (and therefore $\mathbf{x}$) with arbitrary precision using nested binary search.

•

We describe an ${\mathcal{O}}(n\log n)$-time algorithm finding the exact values of $\bm{\mathbf{\lambda}}$ in the $2$-dimensional case.

A.1 Nested binary search

Recall that $h^{(j)}(\bm{\mathbf{\lambda}})=\sum_{i}w^{(j)}_{i}x_{i}$. As shown in Section 2.2, $h^{(j)}(\bm{\mathbf{\lambda}})=\sum\limits_{i=1}^{n}h^{(j)}_{i}(\bm{\mathbf{\lambda}})$, where:

[TABLE]

We want to find $\bm{\mathbf{\lambda}}^{*}$ such that $h^{(j)}(\bm{\mathbf{\lambda}}^{*})=c_{j}$ for all $j\in[d]$ .

Lemma A.1 (Uniqueness)

There is at most one point $\mathbf{x}$ for which there exists $\bm{\mathbf{\lambda}}^{*}$ such that $(\mathbf{x},\bm{\mathbf{\lambda}}^{*})$ satisfy KKT conditions.

Proof A.2.

Our optimization problem is convex, since $L_{2}$ -norm is a convex function, cube and planes are convex sets and their intersection is also a convex set. As follows from [11], any pair $(\mathbf{x},\bm{\mathbf{\lambda}}^{*})$ satisfying KKT conditions is a solution (i.e. $\mathbf{x}$ is the projection). By strict convexity of $L_{2}$ -norm the projection is unique, and therefore there is at most one $\mathbf{x}$ satisfying KKT conditions.

Note that there can be several $\bm{\mathbf{\lambda}}^{*}$ corresponding to the same $\mathbf{x}$ . In the rest of the section we show that it is possible to find $\bm{\mathbf{\lambda}}^{*}$ using nested binary search. For that purpose we define auxiliary functions $\Delta_{1},\ldots,\Delta_{d}$ in the following way.

For any value of $\lambda_{1}$ we would like to find $\lambda_{2},\ldots,\lambda_{d}$ such that constraints $h^{(2)}(\bm{\mathbf{\lambda}})=c_{2},\ldots,h^{(d)}(\bm{\mathbf{\lambda}})=c_{d}$ are satisfied. We define $\Delta_{1}(\lambda_{1})$ as $h^{(1)}(\bm{\mathbf{\lambda}})$ . We will show that $\Delta_{1}$ is well-defined (when the feasible space is not empty) and monotone. Therefore, we can use binary search to find $\lambda_{1}$ for which $h^{(1)}(\bm{\mathbf{\lambda}})=c_{1}$ is satisfied.

Consider the nested problem. Assume that $\lambda_{1}$ is fixed. Then for any value of $\lambda_{2}$ we would like to find $\lambda_{3},\ldots,\lambda_{d}$ such that the constraints $h^{(3)}(\bm{\mathbf{\lambda}})=c_{3},\ldots,h^{(d)}(\bm{\mathbf{\lambda}})=c_{d}$ are satisfied. Similarly to $\Delta_{1}$, we define $\Delta_{2}(\lambda_{1},\lambda_{2})$ as $h^{(2)}(\bm{\mathbf{\lambda}})$ and we will show that $\Delta_{2}$ is well-defined and monotone in $\lambda_{2}$. Therefore, again, we can use binary search to find $\lambda_{2}$. We define $\Delta_{t}(\lambda_{1},\ldots,\lambda_{t})$ for all $t$ and show that $\Delta_{t}$ is monotone in $\lambda_{t}$.

Definition A.3.

Consider $t\in[d]$ . Let $\bm{\mathbf{\lambda}}=(\lambda_{1},\dots,\lambda_{d})$ and assume that constraints $h^{(j)}(\bm{\mathbf{\lambda}})=c_{j}$ are satisfied for all $j>t$ . Then we define $\Delta_{t}(\lambda_{1},\ldots,\lambda_{t})\triangleq h^{(t)}(\bm{\mathbf{\lambda}})$ and call $\lambda_{t+1},\ldots,\lambda_{d}$ suitable for $\lambda_{1},\ldots,\lambda_{t}$ .

Note that $\Delta_{t}$ is a function of the first $t$ coordinates.

Lemma A.4 ( $\Delta$ is well-defined).

For fixed $\lambda_{1},\ldots,\lambda_{t}$ different suitable $\lambda_{t+1},\ldots,\lambda_{d}$ produce the same $\mathbf{x}(\bm{\mathbf{\lambda}})$ . Therefore, $\Delta_{t}(\lambda_{1},\ldots,\lambda_{t})$ is the same for different suitable $\lambda_{t+1},\ldots,\lambda_{d}$ . If the feasible space is not empty, then for fixed $\lambda_{1},\ldots,\lambda_{t}$ there exist suitable $\lambda_{t+1},\ldots,\lambda_{d}$ .

Proof A.5.

Fix $\lambda_{1},\ldots,\lambda_{t}$ . Denote $y_{i}^{\prime}=y_{i}-\sum\limits_{j\leq t}\lambda_{j}w^{(j)}_{i}$ . Then we obtain the following problem: find $\lambda_{t+1},\ldots,\lambda_{d}$ such that $\mathbf{x}=[\mathbf{y}^{\prime}-\sum_{j>t}\lambda_{j}w^{(j)}]$ and $\sum_{i=1}^{n}w^{(j)}_{i}x_{i}=c_{j}$ for all $j>t$ . Therefore, we have reduced the problem to a $(d-t)$ -dimensional problem of the same form, and by the Uniqueness Lemma there is exactly one $\mathbf{x}$ satisfying all constraints whenever the feasible space is not empty.

Lemma A.6 (Solution convexity).

The set of $\bm{\mathbf{\lambda}}$ such that $(\mathbf{x},\bm{\mathbf{\lambda}})$ is KKT solution is convex.

Proof A.7.

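By the Uniqueness Lemma there is at most one $\mathbf{x}$ satisfying KKT. Consider two KKT solutions $(\mathbf{x},\bm{\mathbf{\lambda}})$ and $(\mathbf{x},\bm{\mathbf{\lambda}}^{\prime})$ . Therefore

$$\mathbf{x}=\Big[\mathbf{y}-\sum_{j}\lambda_{j}w^{(j)}\Big]=\Big[\mathbf{y}-\sum_{j}\lambda_{j}^{\prime}w^{(j)}\Big].$$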

We will show that $(\mathbf{x},\ \alpha\bm{\mathbf{\lambda}}+(1-\alpha)\bm{\mathbf{\lambda}}^{\prime})$ is also a solution for any $\alpha\in[0;1]$ . For each $x_{i}$ consider $3$ cases depending on rounding of $x_{i}$ :

1. $x_{i}=1$ . Then $\sum_{j}w^{(j)}_{i}\lambda_{j}\leq y_{i}-1$ and $\sum_{j}w^{(j)}_{i}\lambda_{j}^{\prime}\leq y_{i}-1$ . By multiplying the first inequality by $\alpha$ and the second one by $(1-\alpha)$ and then summing them up we obtain

$$\sum_{j}w^{(j)}_{i}\left(\alpha\lambda_{j}+(1-\alpha)\lambda_{j}^{\prime}\right)\leq y_{i}-1.$$

2. $x_{i}=-1$ . Similar to the first case.

3. $x_{i}\in(-1;1)$ . Then $\sum_{j}w^{(j)}_{i}\lambda_{j}=y_{i}-x_{i}$ and $\sum_{j}w^{(j)}_{i}\lambda_{j}^{\prime}=y_{i}-x_{i}$ . Therefore,

$$\sum_{j}w^{(j)}_{i}\left(\alpha\lambda_{j}+(1-\alpha)\lambda_{j}^{\prime}\right)=y_{i}-x_{i}.$$
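The convexity of the multiplier set can be checked numerically on a toy instance. The following is a minimal sketch, assuming $[\cdot]$ denotes coordinate-wise clipping to $[-1,1]$; the data `y`, `W` and the helper name `x_of_lambda` are hypothetical and chosen so that every coordinate is clipped (cases 1 and 2 of the proof).

```python
import numpy as np

def x_of_lambda(y, W, lam):
    """x(lambda) = [y - sum_j lambda_j w^(j)], with [.] clipping to [-1, 1]."""
    return np.clip(y - W.T @ lam, -1.0, 1.0)

# Hypothetical toy data: d = 2 weight vectors (rows of W) over n = 3 coordinates.
y = np.array([5.0, -5.0, 4.0])
W = np.array([[1.0, 1.0, 1.0],     # w^(1)
              [1.0, -1.0, 0.5]])   # w^(2)

# Two different multiplier vectors that produce the same point x.
lam_a = np.array([0.0, 0.0])
lam_b = np.array([1.0, 2.0])
x_a = x_of_lambda(y, W, lam_a)
x_b = x_of_lambda(y, W, lam_b)

# Every convex combination of the multipliers yields the same x as well.
combos = [x_of_lambda(y, W, a * lam_a + (1.0 - a) * lam_b)
          for a in np.linspace(0.0, 1.0, 11)]
```

Since $\mathbf{x}$ is unchanged along the segment, the constraint values $h^{(j)}$ are unchanged too, which is what the lemma needs.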

Lemma A.8.

$\Delta_{t}$ is continuous.

Proof A.9.

This follows from the fact that the projection is a continuous function of the original point. For small enough $\varepsilon_{j}$ the projection of $\mathbf{y}-\sum_{j\leq t}\lambda_{j}w^{(j)}$ is close to the projection of $\mathbf{y}-\sum_{j\leq t}(\lambda_{j}+\varepsilon_{j})w^{(j)}$ , and so are their values of $h^{(j)}$ , $j>t$ .

Theorem A.10 ( $\Delta_{t}$ monotonicity).

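Consider two points $(\lambda_{1},\ldots,\lambda_{t-1},\lambda_{t}^{\prime})$ and $(\lambda_{1},\ldots,\lambda_{t-1},\lambda_{t}^{\prime\prime})$ such that

$$\Delta_{t}(\lambda_{1},\ldots,\lambda_{t-1},\lambda_{t}^{\prime})=\Delta_{t}(\lambda_{1},\ldots,\lambda_{t-1},\lambda_{t}^{\prime\prime}).$$

Then for any $\alpha\in[0;1]$

$$\Delta_{t}(\lambda_{1},\ldots,\lambda_{t-1},\alpha\lambda_{t}^{\prime}+(1-\alpha)\lambda_{t}^{\prime\prime})=\Delta_{t}(\lambda_{1},\ldots,\lambda_{t-1},\lambda_{t}^{\prime}).$$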

Since $\Delta_{t}$ is continuous, $\Delta_{t}$ is monotone on $\lambda_{t}$ .

Proof A.11.

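Since

$$\Delta_{t}(\lambda_{1},\ldots,\lambda_{t-1},\lambda_{t}^{\prime})=\Delta_{t}(\lambda_{1},\ldots,\lambda_{t-1},\lambda_{t}^{\prime\prime}),$$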

there exist $\lambda_{t+1}^{\prime},\ldots,\lambda_{d}^{\prime}$ and $\lambda_{t+1}^{\prime\prime},\ldots,\lambda_{d}^{\prime\prime}$ such that

[TABLE]

where $\bm{\mathbf{\lambda}}^{\prime}=(\lambda_{1},\ldots,\lambda_{t-1},\lambda_{t}^{\prime},\ldots,\lambda_{d}^{\prime})$ and $\bm{\mathbf{\lambda}}^{\prime\prime}=(\lambda_{1},\ldots,\lambda_{t-1},\lambda_{t}^{\prime\prime},\ldots,\lambda_{d}^{\prime\prime})$ .

Denote $y_{i}^{\prime}=y_{i}-\sum\limits_{j<t}w^{(j)}_{i}\lambda_{j}$ . Consider the following problem: find $\lambda_{t+1},\ldots,\lambda_{d}$ , such that

[TABLE]

We obtain a $(d-t+1)$ -dimensional problem. Both points are solutions to this problem, and by the Convexity Lemma (Lemma A.6) the set of its solutions is convex.

As follows from Theorem A.10, if the projection exists then it’s possible to find $\bm{\mathbf{\lambda}}^{*}$ with arbitrary precision using nested binary search on each coordinate. Unfortunately, it’s unclear how to estimate binary search bounds. While it’s possible to find them by expanding the bounds until they contain the solution, the resulting running time becomes unknown.
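The nested binary search with expanding bounds can be sketched as follows. This is an illustrative sketch rather than the paper's implementation: it assumes $h^{(j)}(\bm{\mathbf{\lambda}})=\sum_{i}w^{(j)}_{i}x_{i}(\bm{\mathbf{\lambda}})$ with $[\cdot]$ clipping to $[-1,1]$, it assumes each $\Delta_{t}$ is non-increasing in $\lambda_{t}$, and it finds brackets by doubling, so no running-time bound is claimed. The names `bisect_nonincreasing`, `nested_search` and the random instance are hypothetical.

```python
import numpy as np

def h(y, W, lam):
    """Constraint values h^(j)(lambda) = sum_i w_i^(j) * [y_i - sum_k lambda_k w_i^(k)]."""
    x = np.clip(y - W.T @ lam, -1.0, 1.0)   # [.] clips to the cube [-1, 1]^n
    return W @ x

def bisect_nonincreasing(f, target, iters=60):
    """Bisection for f(t) = target with f non-increasing; brackets found by doubling."""
    lo, hi = -1.0, 1.0
    for _ in range(60):                     # expand left until f(lo) >= target
        if f(lo) >= target:
            break
        lo *= 2.0
    for _ in range(60):                     # expand right until f(hi) <= target
        if f(hi) <= target:
            break
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def nested_search(y, W, c, prefix=()):
    """Extend `prefix` to a full lambda with h^(j) = c_j for all j >= len(prefix) (0-indexed)."""
    t, d = len(prefix), len(c)
    if t == d:
        return np.array(prefix, dtype=float)
    # Delta as a function of the next coordinate: fix it, solve the inner
    # problem recursively, then read off the constraint value h^(t).
    def delta(lam_t):
        return h(y, W, nested_search(y, W, c, prefix + (lam_t,)))[t]
    lam_t = bisect_nonincreasing(delta, c[t])
    return nested_search(y, W, c, prefix + (lam_t,))

# Hypothetical instance with d = 2; c is built from a feasible point so that
# the feasible space is guaranteed to be non-empty.
rng = np.random.default_rng(0)
n, d = 40, 2
y = 2.0 * rng.normal(size=n)
W = rng.uniform(0.5, 1.5, size=(d, n))
x_feasible = rng.uniform(-0.5, 0.5, size=n)
c = W @ x_feasible
lam = nested_search(y, W, c)
x = np.clip(y - W.T @ lam, -1.0, 1.0)
```

The cost grows as the number of bisection steps raised to the power $d$, which is why this approach is only practical for small $d$.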

A.2 Projection for $d=2$

In this section we introduce a randomized $O(n\log n)$ -time algorithm for finding projection for $d=2$ . Recall from Section 2.2 that for $\mathbf{y}\in\mathbb{R}^{n}$ we need to find $\bm{\mathbf{\lambda}}^{*}=(\lambda_{1}^{*},\lambda_{2}^{*})$ such that $h^{(1)}(\bm{\mathbf{\lambda}}^{*})=c_{1}$ and $h^{(2)}(\bm{\mathbf{\lambda}}^{*})=c_{2}$ . For $\bm{\mathbf{\lambda}}=(\lambda_{1},\lambda_{2})$ we define $h^{(j)}(\bm{\mathbf{\lambda}})=\sum\limits_{i=1}^{n}h^{(j)}_{i}(\bm{\mathbf{\lambda}})$ for $j\in\{1,2\}$ , where

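$$h^{(j)}_{i}(\bm{\mathbf{\lambda}})=\begin{cases}w^{(j)}_{i}, & y_{i}-w^{(1)}_{i}\lambda_{1}-w^{(2)}_{i}\lambda_{2}\geq 1,\\ -w^{(j)}_{i}, & y_{i}-w^{(1)}_{i}\lambda_{1}-w^{(2)}_{i}\lambda_{2}\leq -1,\\ w^{(j)}_{i}\left(y_{i}-w^{(1)}_{i}\lambda_{1}-w^{(2)}_{i}\lambda_{2}\right), & \text{otherwise.}\end{cases}$$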

Once we find $(\lambda_{1}^{*},\lambda_{2}^{*})$ we can compute the coordinates of $\mathbf{x}$ as $x_{i}=[y_{i}-w^{(1)}_{i}\lambda_{1}^{*}-w^{(2)}_{i}\lambda_{2}^{*}]$ . We introduce an auxiliary function $\Delta$ (corresponding to $\Delta_{1}$ from the previous section) which we use to solve the above problem using binary search:

Definition A.12.

Suppose that $\lambda_{1}$ is such that there exists $\lambda_{2}$ for which the constraint $h^{(2)}(\lambda_{1},\lambda_{2})=c_{2}$ is satisfied. Then we define $\Delta(\lambda_{1})\triangleq h^{(1)}(\lambda_{1},\lambda_{2})$ .

We now describe an ${\mathcal{O}}(n\log n)$ -time algorithm for finding $(\lambda_{1}^{*},\lambda_{2}^{*})$ . The algorithm is shown as Algorithm 2. It takes as a parameter a Boolean variable $\Delta^{+}$ indicating whether $\Delta$ is an increasing or decreasing function. We run the algorithm under both assumptions and select a solution satisfying the constraints.

We outline the main ideas behind Algorithm 2 below. Consider the $(\lambda_{1},\lambda_{2})$ plane partitioned by the following lines (which we call boundary lines):

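$$y_{i}-w^{(1)}_{i}\lambda_{1}-w^{(2)}_{i}\lambda_{2}=1\quad\text{and}\quad y_{i}-w^{(1)}_{i}\lambda_{1}-w^{(2)}_{i}\lambda_{2}=-1$$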

for all $i$ . Let $L$ be the set of boundary lines (line 2). We refer to the subsets of the plane resulting from its partition by the boundary lines as regions (see Figure 13 where the regions are referred to as $\{\mathtt{T_{i}}\}$ ). Boundary lines separate the plane into half-planes corresponding to the different cases in the definitions of the corresponding $h^{(j)}_{i}$ . Therefore, inside each region all $h^{(j)}_{i}$ are linear and hence $h^{(j)}$ are also linear.
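The linearity inside a region can be illustrated numerically. The sketch below assumes that the boundary lines are the loci where the clipping argument $y_{i}-w^{(1)}_{i}\lambda_{1}-w^{(2)}_{i}\lambda_{2}$ equals $\pm 1$, so a region is identified by the case ($+1$, $-1$ or linear) of every coordinate. The helper names `cases` and `h` and the toy data are hypothetical.

```python
import numpy as np

def cases(y, w1, w2, lam):
    """Per-coordinate case: +1 (clipped above), -1 (clipped below), 0 (linear part)."""
    z = y - w1 * lam[0] - w2 * lam[1]
    return tuple(1 if v >= 1.0 else -1 if v <= -1.0 else 0 for v in z)

def h(y, w1, w2, lam):
    """Constraint values (h^(1), h^(2)) at lam = (lambda_1, lambda_2)."""
    x = np.clip(y - w1 * lam[0] - w2 * lam[1], -1.0, 1.0)
    return np.array([w1 @ x, w2 @ x])

# Hypothetical toy data with n = 3.
y = np.array([0.0, 3.0, -3.0])
w1 = np.array([1.0, 1.0, 1.0])
w2 = np.array([1.0, 2.0, -1.0])

# Two points of the (lambda_1, lambda_2) plane lying in the same region.
p = np.array([0.1, 0.1])
q = np.array([0.3, -0.2])
mid = 0.5 * (p + q)

same_region = cases(y, w1, w2, p) == cases(y, w1, w2, q) == cases(y, w1, w2, mid)
# For a function that is linear on the region, the midpoint value is the average.
midpoint_gap = h(y, w1, w2, mid) - 0.5 * (h(y, w1, w2, p) + h(y, w1, w2, q))
```

A region is an intersection of half-planes and hence convex, so the midpoint of two points with the same case vector has that case vector too.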

The intuition behind the algorithm is then as follows (in order to achieve the best performance the exact details differ slightly from this simplified presentation). Suppose we could find a region that contains some solution $\bm{\mathbf{\lambda}}^{*}$ . Then, since the constraint functions are linear inside the region, in order to find $\bm{\mathbf{\lambda}}^{*}$ we could solve a system of linear equations over $\lambda_{1}$ and $\lambda_{2}$ . We identify such a region with binary search over $\lambda_{1}$ , using the monotonicity of $\Delta$ . We consider only a finite set of values: the $\lambda_{1}$ -coordinates of intersections of boundary lines. Since there are ${\mathcal{O}}(n)$ boundaries, there are ${\mathcal{O}}(n^{2})$ intersections (e.g., in Figure 13 we consider only points $a$ , $b$ , $c$ and $d$ ). Hence ${\mathcal{O}}(\log n)$ iterations of binary search suffice. The only difference between Algorithm 2 and the above approach is that after the binary search on $\lambda_{1}$ we still have to try ${\mathcal{O}}(n)$ regions to identify the exact region which contains $\bm{\mathbf{\lambda}}^{*}$ (see Algorithm 2 for the details).

Now consider one iteration of the binary search. Let $\lambda_{1}^{l}$ and $\lambda_{1}^{r}$ be its current boundaries. Let $\Lambda$ be the set of all intersection points $(\lambda_{1},\lambda_{2})$ such that $\lambda_{1}\in(\lambda_{1}^{l},\lambda_{1}^{r})$ . Since $\Delta$ is monotone, for any $\lambda_{1}^{\prime}$ we can use binary search by checking whether $\lambda_{1}^{*}$ is greater or less than $\lambda_{1}^{\prime}$ through a comparison of $\Delta(\lambda_{1}^{\prime})$ and $c_{1}$ (lines 2-2). Computing $\Delta(\lambda_{1}^{\prime})$ requires solving the one-dimensional problem over $\lambda_{2}$ discussed in Section 2.3 and thus can be done in ${\mathcal{O}}(n)$ time.

In order to have binary search run in ${\mathcal{O}}(\log n)$ iterations it suffices to be able to find a value $\lambda_{1}^{\prime}\in(\lambda_{1}^{l},\lambda_{1}^{r})$ which with constant probability splits $\Lambda$ into two subsets of points, those with $\lambda_{1}>\lambda_{1}^{\prime}$ and those with $\lambda_{1}<\lambda_{1}^{\prime}$ , of size at most $\frac{2}{3}|\Lambda|$ each. In particular, it suffices to sample a uniformly random point $(\lambda_{1}^{\prime},\lambda_{2}^{\prime})$ from $\Lambda$ . The following lemma bounds the overall running time of these sampling steps.

Lemma A.13.

The overall time required for sampling random points from $\Lambda$ in line 2 of Algorithm 2 is ${\mathcal{O}}(n\log n)$ .

Proof A.14.

Consider three cases:

1. $|\Lambda|>n\log n$ . In this case we sample ${\mathcal{O}}(n)$ uniformly random pairs of lines from $L$ and find the intersection of each pair (we assume there are no parallel lines; these can be handled separately). Since the number of lines is ${\mathcal{O}}(n)$ , w.h.p. we sample at least one intersection which lies in $\Lambda$ . The last condition can be checked in ${\mathcal{O}}(n)$ time, and if it doesn’t hold then we conclude that w.h.p. $|\Lambda|\leq n\log n$ . We then compute $S$ , the set of all points in $\Lambda$ , in ${\mathcal{O}}(n\log n)$ time as described below and proceed to the second case.

To find $\Lambda$ we first find the intersections of all lines from $L$ with the lines $\lambda_{1}=\lambda_{1}^{l}$ and $\lambda_{1}=\lambda_{1}^{r}$ . We call the $\lambda_{2}$ -coordinates of the intersection points events. Each line $\ell\in L$ creates two events: $\ell_{open}$ corresponds to the smaller $\lambda_{2}$ and $\ell_{close}$ to the larger one.

Consider two lines $a$ and $b$ such that $a_{open}\geq b_{open}$ . These lines intersect in one of two cases. If they are opened on different sides (i.e., one on $\lambda_{1}^{l}$ and the other on $\lambda_{1}^{r}$ ), then $b_{close}$ should be greater than $a_{open}$ , i.e., $b$ is still open when $a$ opens, as shown in Figure 12(a). If they are opened on the same side, then it should be $b_{close}\geq a_{close}$ , i.e., $[a_{open},a_{close}]\subseteq[b_{open},b_{close}]$ , as shown in Figure 12(b).

We process all events in increasing order and for each side we maintain the set of lines opened on this side. We sort the lines in these sets by their closing events. When event $\ell_{open}$ arrives, we find the intersections of $\ell$ with the opened lines in the following way. To handle the first case, we intersect $\ell$ with all lines opened on the other side. To handle the second case, we intersect $\ell$ with all lines opened on the same side and closing after $\ell_{close}$ .

2. $n\leq|\Lambda|\leq n\log n$ . Note that in this case $\Lambda=\{(\lambda_{1},\lambda_{2})\in S\mid\lambda_{1}\in(\lambda_{1}^{l};\lambda_{1}^{r})\}$ , where $S$ is as defined above. We sample ${\mathcal{O}}(n)$ random points from $S$ so that w.h.p. we get at least one point from $\Lambda$ . As before, if this doesn’t happen, we conclude that w.h.p. $|\Lambda|<n$ and proceed to the last case.

3. $|\Lambda|<n$ . In this case we maintain $\Lambda$ directly. When we sample a random point $(\lambda_{1}^{\prime},\lambda_{2}^{\prime})\in\Lambda$ , we remove from $\Lambda$ all points on one side of $\lambda_{1}^{\prime}$ as directed by the binary search.

In each of the cases above one iteration can be implemented in ${\mathcal{O}}(n)$ time and pre-/post-processing between the cases takes ${\mathcal{O}}(n\log n)$ time. Since there are ${\mathcal{O}}(\log n)$ iterations, sampling takes ${\mathcal{O}}(n\log n)$ time overall.
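The last case, where $\Lambda$ is maintained explicitly, amounts to a randomized binary search over a finite candidate set. The sketch below is an illustration under simplifying assumptions: candidates are plain numbers (the $\lambda_{1}$ -coordinates of intersection points), and the comparison of $\Delta(\lambda_{1}^{\prime})$ with $c_{1}$ is abstracted into a hypothetical monotone predicate `solution_is_right_of`.

```python
import random

def narrow_bracket(candidates, solution_is_right_of, rng):
    """Shrink (lo, hi) until no candidate lies strictly between lo and hi.

    `solution_is_right_of(v)` must be monotone: True for all v below the
    sought value and False for all v at or above it.
    """
    lo, hi = float("-inf"), float("inf")
    active = list(candidates)
    steps = 0
    while active:
        pivot = rng.choice(active)          # uniformly random remaining candidate
        steps += 1
        if solution_is_right_of(pivot):
            lo = pivot
            active = [v for v in active if v > pivot]   # drop pivot and the left side
        else:
            hi = pivot
            active = [v for v in active if v < pivot]   # drop pivot and the right side
    return lo, hi, steps

# Hypothetical example: candidates 0, 3, 6, ..., 999 and a solution at 500.5.
candidates = list(range(0, 1000, 3))
target = 500.5
lo, hi, steps = narrow_bracket(candidates, lambda v: v < target, random.Random(0))
```

A uniformly random pivot leaves at most two thirds of the candidates on either side with constant probability, so the expected number of rounds is logarithmic in the number of candidates.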

Using the above algorithm we can find $\lambda_{1}^{l}$ and $\lambda_{1}^{r}$ such that there are no intersection points between them. Since there are ${\mathcal{O}}(\log n)$ iterations and each of them requires ${\mathcal{O}}(n)$ time on average, the total running time is ${\mathcal{O}}(n\log n)$ . This completes a proof of the following theorem (corresponding to lines 2-2 of the algorithm).

Theorem A.15.

There exists an ${\mathcal{O}}(n\log n)$ -time randomized algorithm returning $\lambda_{1}^{l}$ and $\lambda_{1}^{r}$ such that:

1. There are no intersections of boundary lines in $[\lambda_{1}^{l},\lambda_{1}^{r}]$ .

2. There exists a solution $(\lambda_{1}^{\dagger},\lambda_{2}^{\dagger})$ such that $\lambda_{1}^{\dagger}\in[\lambda_{1}^{l},\lambda_{1}^{r}]$ .

After we find $\lambda_{1}^{l}$ and $\lambda_{1}^{r}$ as in Theorem A.15 we show that there are only ${\mathcal{O}}(n)$ regions which can contain a solution and we can check them in ${\mathcal{O}}(n\log n)$ time. The following theorem completes the proof of Theorem 1.1 for $d=2$ :

Theorem A.16.

If there exists a solution $\bm{\mathbf{\lambda}}^{\dagger}$ such that $\lambda_{1}^{\dagger}\in(\lambda_{1}^{l};\lambda_{1}^{r})$ and no intersection points have $\lambda_{1}\in(\lambda_{1}^{l};\lambda_{1}^{r})$ , then $\bm{\mathbf{\lambda}}^{*}$ can be found in ${\mathcal{O}}(n\log n)$ time.

Proof A.17.

We show how to find $\bm{\mathbf{\lambda}}^{*}$ in lines 2-2 of the algorithm. Consider the set $S=(\lambda_{1}^{l},\lambda_{1}^{r})\times\mathbb{R}$ . Let $\{R_{t}\}_{t=1}^{T}$ be the partition of $S$ into parts lying between the boundary lines. Since $S$ doesn’t contain boundary intersections and there are ${\mathcal{O}}(n)$ boundaries, the number of parts in the partition is ${\mathcal{O}}(n)$ . For each $R_{t}$ we solve the following system of equations over $\lambda_{1}$ and $\lambda_{2}$ :

$$\sum_{i=1}^{n}h^{(1)}_{i}(\lambda_{1},\lambda_{2})=c_{1},\qquad\sum_{i=1}^{n}h^{(2)}_{i}(\lambda_{1},\lambda_{2})=c_{2}.$$

Since no boundary line crosses $R_{t}$ , it is a subset of some region. Therefore, $h^{(1)}_{i}$ and $h^{(2)}_{i}$ are linear inside $R_{t}$ , meaning that the above system becomes a system of linear equations. If the solution to the system belongs to $R_{t}$ , then we can take it as $\bm{\mathbf{\lambda}}^{*}$ . Thus it only remains to show how to find coefficients for the system in ${\mathcal{O}}(n\log n)$ total time.

Recall that in Algorithm 2 we assume that $\{R_{t}\}$ are sorted from bottom to top. For $R_{1}$ we find the linear system coefficients in ${\mathcal{O}}(n)$ time. Assume that the coefficients for $R_{t}$ are already computed. To find the coefficients for the next set $R_{t+1}$ , notice that $R_{t}$ and $R_{t+1}$ are separated by some boundary line. This line corresponds to some $h^{(j)}_{i}$ and therefore crossing it changes the contribution of only this coordinate $i$ , so the coefficients can be recomputed in ${\mathcal{O}}(1)$ time. Since there are ${\mathcal{O}}(n)$ boundary lines, the overall time for recomputation is also ${\mathcal{O}}(n)$ . Taking the sorting of $\{R_{t}\}$ into account, the total running time is ${\mathcal{O}}(n\log n)$ .
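The constant-time coefficient update can be sketched as follows. This is an illustration, not the paper's Algorithm 2: it assumes that inside a region the constraints take the form $h(\bm{\mathbf{\lambda}})=\mathbf{b}-A\bm{\mathbf{\lambda}}$ , where clipped coordinates contribute $\pm w_{i}$ to $\mathbf{b}$ and unclipped coordinates contribute $w_{i}y_{i}$ to $\mathbf{b}$ and $w_{i}w_{i}^{\top}$ to $A$ . The class name `RegionSystem`, its methods and the toy data are hypothetical.

```python
import numpy as np

class RegionSystem:
    """Coefficients of h(lambda) = b - A @ lambda inside one region (d = 2)."""

    def __init__(self, y, W, sig):
        # sig[i] is +1 / -1 if coordinate i is clipped above / below, 0 otherwise.
        self.y, self.W, self.sig = y, W, list(sig)
        self.A = np.zeros((2, 2))
        self.b = np.zeros(2)
        for i, s in enumerate(self.sig):
            self._add(i, s, +1.0)

    def _add(self, i, s, sign):
        """Add (sign=+1) or remove (sign=-1) the contribution of coordinate i in case s."""
        w = self.W[:, i]
        if s == 0:                       # linear part: w_i * (y_i - w_i . lambda)
            self.A += sign * np.outer(w, w)
            self.b += sign * w * self.y[i]
        else:                            # clipped part: constant s * w_i
            self.b += sign * s * w

    def cross(self, i, new_case):
        """O(1) update when the boundary line of coordinate i is crossed."""
        self._add(i, self.sig[i], -1.0)
        self._add(i, new_case, +1.0)
        self.sig[i] = new_case

    def solve(self, c):
        """Solve b - A @ lambda = c; raises numpy.linalg.LinAlgError if A is singular."""
        return np.linalg.solve(self.A, self.b - c)

# Hypothetical toy data with n = 4; rows of W are w^(1) and w^(2).
y = np.array([0.0, 3.0, -3.0, 0.5])
W = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, 2.0, -1.0, -1.0]])
system = RegionSystem(y, W, (0, 1, -1, 0))
```

A candidate returned by `solve` still has to be checked for membership in the region whose coefficients were used, as in the proof above.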

Appendix B Missing proofs from Section 2.2

Proof B.1 (of Proposition 2.1).

The constraints corresponding to $j\in S_{0}$ are not tight for the correct guess, otherwise consider a guess which has appropriate signs corresponding to the tight constraints in the optimum solution. Let $\mathbf{x}^{*}$ be the optimum without constraints for $j\in S_{0}$ and let $\mathbf{x}^{*}_{0}$ be the optimum with these constraints. If these two optima are different then we can improve the optimum $\mathbf{x}^{*}_{0}$ with the inequality constraints as follows. Consider vector $\mathbf{z}=(1-\alpha)\mathbf{x}^{*}_{0}+\alpha\mathbf{x}^{*}$ for some small $\alpha>0$ . Because the constraints corresponding to $j\in S_{0}$ are not tight none of these constraints will be violated by this vector for small enough $\alpha$ . All other constraints will be satisfied by convexity. However, we have $\|\mathbf{z}-\mathbf{y}\|<\|\mathbf{x}^{*}_{0}-\mathbf{y}\|$ , a contradiction with the optimality of $\mathbf{x}^{*}_{0}$ .

Uniqueness of the optimum follows from the uniqueness of projection on a convex body.
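The strict inequality used in the proof above can be spelled out as follows. This is a short worked derivation, assuming $\mathbf{x}^{*}\neq\mathbf{x}^{*}_{0}$ and using that $\mathbf{x}^{*}_{0}$ is feasible for the problem without the constraints for $j\in S_{0}$ , whose optimum $\mathbf{x}^{*}$ is unique.

```latex
\begin{align*}
\|\mathbf{x}^{*}-\mathbf{y}\| &< \|\mathbf{x}^{*}_{0}-\mathbf{y}\|
  && \text{(uniqueness of the optimum $\mathbf{x}^{*}$ of the relaxed problem)}\\
\|\mathbf{z}-\mathbf{y}\|
  &= \|(1-\alpha)(\mathbf{x}^{*}_{0}-\mathbf{y})+\alpha(\mathbf{x}^{*}-\mathbf{y})\|\\
  &\le (1-\alpha)\,\|\mathbf{x}^{*}_{0}-\mathbf{y}\|+\alpha\,\|\mathbf{x}^{*}-\mathbf{y}\|
  && \text{(triangle inequality)}\\
  &< \|\mathbf{x}^{*}_{0}-\mathbf{y}\|
  && \text{for every } \alpha\in(0,1].
\end{align*}
```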

Appendix C Additional experiments

In this section we show experiments for $d>2$ and compare performance of GD with METIS. We also show experiments on dataset sx-stackoverflow – the largest SNAP graph which is not a social network.

C.1 Multi-dimensional experiments

We performed experiments for $d=3$ and $d=4$ to illustrate the performance of our algorithms in the multi-dimensional case. We remark that our algorithm can handle higher dimensions as well, but public weight data for large enough graphs is hard to find. For these multidimensional experiments in addition to balancing on the number of vertices and edges we also balance based on the following additional vertex weights:

• Pagerank. We use Pagerank to model the activity level of a node. High Pagerank likely means that the vertex is accessed often, and therefore balancing on Pagerank can be beneficial for load balancing purposes.

• Sum of neighbor degrees. We also use the sum of degrees over the neighbors of a vertex as a weight function. We choose the sum of neighbor degrees as a proxy for the size of the 2-hop neighborhood of a vertex, which is computationally expensive to compute for very large graphs.
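The weight functions above can be computed with a few lines of code. The sketch below is illustrative only: it treats the graph as undirected, uses the vertex degree as the weight for edge balance, and computes Pagerank by plain power iteration with damping factor $0.85$ ; all of these choices, as well as the function name `vertex_weights`, are assumptions and not taken from the paper's implementation.

```python
def vertex_weights(n, edges, damping=0.85, iters=100):
    """Return per-vertex weights for an undirected graph on vertices 0..n-1."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    degree = [len(a) for a in adj]
    # Proxy for the 2-hop neighborhood size: sum of degrees over the neighbors.
    neighbor_degree_sum = [sum(degree[v] for v in adj[u]) for u in range(n)]

    # Pagerank by power iteration; isolated vertices spread their mass uniformly.
    rank = [1.0 / n] * n
    for _ in range(iters):
        dangling = sum(rank[u] for u in range(n) if degree[u] == 0)
        nxt = [(1.0 - damping) / n + damping * dangling / n] * n
        for u in range(n):
            if degree[u] > 0:
                share = damping * rank[u] / degree[u]
                for v in adj[u]:
                    nxt[v] += share
        rank = nxt

    return {
        "unit": [1.0] * n,
        "degree": degree,
        "pagerank": rank,
        "neighbor_degree_sum": neighbor_degree_sum,
    }

# Usage on a path graph 0 - 1 - 2.
weights = vertex_weights(3, [(0, 1), (1, 2)])
```

Each list in the returned dictionary would play the role of one weight vector $w^{(j)}$ in the balance constraints.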

The results are presented in Table 3. They indicate that METIS achieves poor balance for multiple constraints and that GD outperforms METIS on almost all parameters in most cases (better results shown in bold). METIS was given an allowed imbalance of $0.5\%$ .

C.2 Experiments on Q&A data

In this section we present experimental results on SNAP graph sx-stackoverflow, containing $2\,601\,977$ vertices and $28\,183\,518$ edges after removing duplicate edges. Unlike other graphs presented in this paper, this one is not a social network. The experiments show that performance of GD on this graph is similar to other social network graphs included in the paper.


Bibliography

[1] Dykstra’s projection algorithm. https://en.wikipedia.org/wiki/Dykstra%27s_projection_algorithm. Accessed: 2019-02-13.
[2] Apache Giraph. http://giraph.apache.org/.
[3] Z. Abbas, V. Kalavri, P. Carbone, and V. Vlassov. Streaming graph partitioning: An experimental study. Proceedings of the VLDB Endowment, 11(11):1590–1603, 2018.
[4] A. Amir, J. Ficler, R. Krauthgamer, L. Roditty, and O. S. Shalom. Multiply balanced k-partitioning. In LATIN 2014: Theoretical Informatics - 11th Latin American Symposium, Montevideo, Uruguay, March 31 - April 4, 2014. Proceedings, pages 586–597, 2014.
[5] A. Anandkumar and R. Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pages 81–102, 2016.
[6] C. Avery. Giraph: Large-scale graph processing infrastructure on Hadoop. Proceedings of the Hadoop Summit. Santa Clara, 11(3):5–9, 2011.
[7] K. Aydin, M. Bateni, and V. S. Mirrokni. Distributed balanced partitioning via linear embedding. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, San Francisco, CA, USA, February 22-25, 2016, pages 387–396, 2016.
[8] D. P. Bertsekas. Nonlinear programming. Athena Scientific, Belmont, 1999.
