Using Colors and Sketches to Count Subgraphs in a Streaming Graph
Shirin Handjani, Douglas Jungreis, Mark Tiefenbruck

Shirin Handjani, IDA Center for Communications Research, La Jolla
Douglas Jungreis, IDA Center for Communications Research, La Jolla
Mark Tiefenbruck, IDA Center for Communications Research, La Jolla
Abstract
Suppose we wish to estimate , the number of copies of some small graph in a large streaming graph . There are many algorithms for this task when is a triangle, but just a few that apply to arbitrary . Here we focus on one such algorithm, which was introduced by Kane, Mehlhorn, Sauerwald, and Sun. The storage and update time per edge for their algorithm are both , where is the number of edges in , and is the number of edges in . Here, we propose three modifications to their algorithm that can dramatically reduce both the storage and update time. Suppose that has no leaves and that has maximum degree , where . Define . Then in our version of the algorithm, the update time per edge is , and the storage is approximately reduced by a factor of , where is the number of vertices in ; in particular, the storage is .
1 Introduction
Suppose that a large simple graph is presented as a stream of edge insertions and deletions, and suppose that is a very small graph (e.g., a small clique or cycle). Our goal is to estimate , the number of copies of that appear in , where we are permitted a single pass through the stream. This problem has received a great deal of attention, particularly in the case where is a triangle; however, there are only a few known techniques that apply to arbitrary . Here we focus on the technique that was developed in [22, 27], which we refer to as the [KMSS]-algorithm.
The [KMSS]-algorithm, which uses complex-valued linear sketches, has many strengths: it applies to arbitrary ; it can be used in distributed settings; it allows edge deletions; and it is extremely efficient in a variety of situations, such as when is a star graph. However, there are many situations where the algorithm is not practical. Suppose has edges, and suppose has edges and vertices. When the [KMSS]-algorithm produces a single estimate of , that estimate has variance , so it is necessary to produce estimates and average them. The storage and update time per edge are proportional to the number of estimates produced, and are therefore both .
In this paper, we describe three modifications to the [KMSS]-algorithm that greatly reduce both the storage and update time per edge. Suppose that is a connected graph with no leaves. Suppose also that the maximum degree of any vertex in is , where , and define . Then the storage required by our algorithm is , i.e., it has been reduced approximately by a factor of . The update time per edge is .
The problem of counting copies of a small graph in a large graph has been studied extensively. It has many applications, as diverse as community detection, information retrieval, and motifs in bioinformatics; see for instance [5, 13, 15, 26, 32]. Here we restrict to the case where is given as a data stream, and our goal is merely to estimate , as opposed to computing exactly. Most work on this problem has addressed the case where is a triangle [4, 7, 9, 10, 11, 14, 16, 17, 18, 19, 20, 23, 24, 25, 28, 29, 30]. A few authors have addressed other specific subgraphs, such as butterflies [31] and cycles [27]. We are only aware of a few algorithms that apply to arbitrary subgraphs [6, 8, 22, 21]. Two of these, [8] and [6], require multiple passes through the stream, which we do not allow here. The third, [22], presents the [KMSS]-algorithm, which is the focus of this paper. The last, [21], presents a vertex-sampling algorithm which, in some situations, is extremely efficient, requiring storage , where is the fractional vertex cover number of . However, this bound requires a strong assumption on : it either requires that have bounded degree, or it requires that the maximum degree in is and that some optimal fractional vertex cover of can place non-zero degree on every vertex.
In order to explain our contribution to this problem, we first need to briefly review the [KMSS]-algorithm. Consider a fixed . Many independent estimates are made for , and they are then averaged. To get a single estimate, the first step is to arbitrarily assign directions to the edges of . We refer to the resulting digraph as and its edges as . Also each edge of is replaced by two directed edges and . We refer to the resulting directed version of as . For any graph or digraph , we refer to its vertices and edges as and . Now we define functions , one for each edge ; each maps edges of to complex roots of unity. These functions are defined in such a way that they can “recognize” whether a -tuple of edges in forms a copy of with each mapping to . In particular, if does form such a copy, then the expected value (over all permissible choices of the maps ) of is a non-zero constant; otherwise, the expected value is zero. Then, as the edges stream by, the values are computed. Finally, when the stream ends, the estimate of is given by multiplied by an appropriate constant.
The key to the algorithm is how to define the functions so that they can recognize when forms a copy of . Each of these functions has two parts: one part is meant to recognize when forms a homomorphic image of , and the other part is meant to recognize when the vertices of this homomorphic image are distinct. In this paper, we do not use the second part; we use a different method to ensure that the vertices are distinct. We therefore omit the second part from our description, keeping in mind that this description differs somewhat from the one in [22]. For each vertex , we define a hash function , which maps vertices of to complex roots of unity, where is the degree of in . Then is defined to be . It is not difficult to see that has expected value 1 if forms a homomorphic image of with each mapping to ; otherwise, it has expected value 0.
We can now describe our contributions to this problem. We present three modifications to the [KMSS]-algorithm, which can be used separately or together to reduce the storage and update time per edge. First, we introduce a different method for ensuring that we count only those homomorphic images of that have distinct vertices. We do this by assigning colors to the vertices of . Assuming there are colors, we subdivide each sum into different sums, one for each pair of colors. For instance, there might be a red-blue sum
[TABLE]
There might also be analogous blue-green sums and green-red sums, and if we were counting triangles, then
[TABLE]
would give an estimate for the number of triangles whose three vertices were respectively red, blue, and green. This allows us to count only homomorphic images whose vertices all have different colors, which in turn ensures that the vertices are all distinct. However, making sure the vertices are distinct is not the primary reason we use colors. The primary reason is that it dramatically reduces the variance.
For our second modification, rather than defining one hash function for each vertex of , we define one for each half-edge of , with the condition that for any vertex of and of , the product , where the product is taken over all half-edges in that are incident to . This too reduces the variance of each estimate.
For the third modification, rather than using hash functions that map vertices to roots of unity, we use hash functions that map vertices to diagonal -by- matrices. Each position along the diagonal of the matrix more-or-less gives a separate estimate of , so in some sense, this is almost equivalent to making independent estimates. The difference is that, when an edge streams by, instead of updating each for different estimates, we only have to update each for one matrix of estimates. This lets us reduce the update time per edge approximately by a factor of .
This paper is organized as follows. In Section 2, we describe our modified version of the [KMSS]-algorithm and prove that it gives an unbiased estimate of . In Section 3, we bound the variance of our estimate. In Section 4, we compare the storage and update time of our algorithm to that of the original algorithm.
The authors would like to thank Kyle Hofmann, Anthony Gamst, and Eric Price for many helpful conversations.
2 Description of Algorithm
In this section, we describe our algorithm and show that it gives an unbiased estimate of . We only explain how to use the algorithm to produce a single estimate of , but in order to get a more accurate estimate of , we would compute many such estimates and take their average.
Fix some small graph . We assume throughout the paper that is connected and has no leaves. Let and respectively denote the number of vertices and edges in . Arbitrarily assign directions to the edges of , and call the resulting directed graph . We assume that the vertices of are labeled , and the edges are , where each . has half-edges, which we call , where and are respectively the two halves of . In particular, each is incident to . For define . In other words, tells which half-edges are incident to . Figure 1 illustrates an example where and .
For each vertex , select an arbitrary element , and call the distinguished half-edge at . Observe that there are half-edges in , of which are distinguished and are not.
2.1 The Functions
The [KMSS]-algorithm uses hash functions that map vertices of to complex roots of unity. Here we define similar functions, but there are two differences. First, instead of having one function for each vertex of , we have one for each half-edge of . Second, we allow the more general setting where the co-domain of each is a group of diagonal matrices.
Let be any finite group of diagonal matrices with the property that the average of the elements of (i.e., ) is the zero matrix. Note that since consists of diagonal matrices, it is abelian. We use to denote the dimension of the matrices in . We are primarily interested in two types of groups . In the first type, , and the elements of are the complex roots of unity, for some . In that case, the matrices can be viewed as complex numbers and are therefore equivalent to what’s used in the [KMSS]-algorithm. For the second type of , . Let , and let be the square diagonal matrix that has along the diagonal. Then is the group generated by and (where is the -dimensional identity matrix); thus has elements: . In this paper, we focus on those two types of , but we remark that there are other that satisfy the given conditions; e.g., diagonal matrices whose diagonal entries are all . The entire discussion in this section applies to any such ; in particular, our algorithm gives an unbiased estimate of for any such . However, the discussion of the variance in the next section applies only to these two specific choices of .
Fix any such group , and for each , define a hash function . If is a non-distinguished half-edge of , then for each , the value is a random element of , and the functions for non-distinguished are chosen independently and uniformly from a family of -wise independent hash functions. If is the distinguished half-edge at , then is defined by
[TABLE]
If is the only element of , then . Observe that this definition of ensures that for any vertex of and any ,
[TABLE]
**Lemma 1.**
Let be any vertex of , and suppose its degree is . Suppose ; i.e., are the half-edges of incident to . Let be any not-necessarily-distinct vertices of . Then is equal to if , and otherwise it is a uniformly random element of .
**Proof:** If , then the result is clearly true, so assume . Assume without loss of generality that the distinguished half-edge at is . Then by definition,
[TABLE]
so
[TABLE]
If , then (1) is equal to . Now assume that some . Then for that , is the quotient of two independent uniformly random elements of , and is thus a uniformly random element of . Also, none of are distinguished, so
[TABLE]
are independent for all , and the rest are . Since at least one is uniformly random, their product is as well.
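Lemma 1 can be checked exhaustively on a toy instance. The sketch below is illustrative only and rests on stated assumptions: the group is the cyclic group of third roots of unity, stored as exponents mod `R`; the distinguished hash at a vertex is taken to be the inverse of the product of the other hashes at that vertex, evaluated at the same stream vertex; and fully random tables stand in for the limited-independence hash family. The names `vertex_hashes`, `hash_product`, and `product_distribution` are hypothetical.

```python
from collections import Counter
from itertools import product

R = 3        # exponents mod R stand for the R-th roots of unity (assumed group)
N = 2        # number of vertices in the toy stream graph
DEGREE = 3   # degree of the vertex of the small graph under consideration

def vertex_hashes(free_tables):
    """Hashes for the half-edges at one vertex; index 0 is the distinguished one."""
    # The distinguished table cancels the others at every stream vertex v.
    distinguished = [(-sum(t[v] for t in free_tables)) % R for v in range(N)]
    return [distinguished, *free_tables]

def hash_product(hashes, vertices):
    """Product of the hash values, written additively as an exponent mod R."""
    return sum(h[v] for h, v in zip(hashes, vertices)) % R

def product_distribution(vertices):
    """Exact distribution over every choice of the non-distinguished tables."""
    tables = list(product(range(R), repeat=N))
    counts = Counter()
    for free in product(tables, repeat=DEGREE - 1):
        counts[hash_product(vertex_hashes(list(free)), vertices)] += 1
    return dict(counts)

# Equal stream vertices always give the identity (exponent 0).
print(product_distribution((1, 1, 1)))
# Unequal stream vertices give every group element equally often.
print(product_distribution((0, 1, 1)))
```

Enumerating every table, rather than sampling, makes the "uniformly random" half of the lemma an exact count instead of a statistical observation.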
2.2 The Functions
Let be the directed graph obtained by replacing each edge of by two directed edges, and . Each time an edge of streams by, treat it as two directed edges of . From now on, we use to refer to the number of edges in . Arguably, we should use ; however, will be more convenient, and the factor of 2 will be irrelevant to all of our conclusions, which use notation.
For each edge of , define a function by
[TABLE]
For any -tuple of (not necessarily distinct) edges in , define
[TABLE]
and for each vertex , define
[TABLE]
Since every half-edge of is in exactly one of the sets we have
[TABLE]
The function will in a sense “test” whether forms a copy of .
**Lemma 2.**
Let be any -tuple of edges of . Suppose sends to for each . If induces a homomorphism from to , then . If does not induce such a homomorphism, then is a uniformly random element of .
**Proof:** Suppose induces a homomorphism from to . Let be any vertex of , and suppose the homomorphism sends to . Suppose ; i.e., are the half-edges of that are incident to . Then must all be equal to . By Lemma 1, . Equivalently, . This is true for every , so
[TABLE]
Now suppose does not induce such a homomorphism. Then there must be some vertex of such that, if , then the vertices are not all equal. Thus by Lemma 1, is a uniformly random element of , i.e., is a uniformly random element. is independent of for any other , so is also a uniformly random element; i.e., is a uniformly random element.
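The behavior described in Lemma 2 can be illustrated on a directed triangle. This is a minimal sketch under assumptions, not the authors' code: the small graph is a directed triangle, the group is the third roots of unity stored as exponents mod `R`, the per-edge function is assumed to be the product of the tail half-edge hash at the tail vertex and the head half-edge hash at the head vertex, and the first half-edge met at each vertex is arbitrarily chosen as distinguished. All names are hypothetical.

```python
from collections import Counter
from itertools import product

R = 3                               # exponents mod R stand for R-th roots of unity
N = 3                               # vertices of the toy stream graph
H_EDGES = [(0, 1), (1, 2), (2, 0)]  # directed triangle (illustrative choice)
H_VERTICES = {x for e in H_EDGES for x in e}
N_FREE = 2 * len(H_EDGES) - len(H_VERTICES)  # non-distinguished half-edges

def build_hashes(free_tables):
    """Return hashes[(i, end)], with end 0 for the tail and 1 for the head of edge i."""
    incident = {}
    for i, (tail, head) in enumerate(H_EDGES):
        incident.setdefault(tail, []).append((i, 0))
        incident.setdefault(head, []).append((i, 1))
    hashes, free_iter = {}, iter(free_tables)
    for half_edges in incident.values():
        distinguished, *others = half_edges
        for he in others:
            hashes[he] = next(free_iter)
        # The distinguished half-edge cancels the others at the same stream vertex.
        hashes[distinguished] = [
            (-sum(hashes[he][v] for he in others)) % R for v in range(N)
        ]
    return hashes

def q_value(hashes, edge_tuple):
    """Exponent of the product of the per-edge values over the whole tuple."""
    total = 0
    for i, (u, v) in enumerate(edge_tuple):
        total += hashes[(i, 0)][u] + hashes[(i, 1)][v]
    return total % R

def q_distribution(edge_tuple):
    """Exact distribution of q_value over all non-distinguished hash tables."""
    tables = list(product(range(R), repeat=N))
    counts = Counter()
    for free in product(tables, repeat=N_FREE):
        counts[q_value(build_hashes(free), edge_tuple)] += 1
    return dict(counts)

# A closed directed walk 0 -> 1 -> 2 -> 0 is a homomorphic image of the triangle.
print(q_distribution(((0, 1), (1, 2), (2, 0))))
# Replacing the last edge by (1, 0) breaks the homomorphism at one vertex.
print(q_distribution(((0, 1), (1, 2), (1, 0))))
```

In the first case every hash choice yields the identity; in the second the three group elements occur equally often, so the contribution averages to zero.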
2.3 Coloring Vertices
Fix some number of colors . For the purposes of bounding the variance, we will later assume that the maximum degree of any vertex of is and then set ; however, here may take any value . Define a hash function that assigns a color to each vertex of . For each vertex , is a uniformly random color, and is chosen uniformly at random from a family of -wise independent hash functions.
Consider functions . There are such functions, but we want to find only the ones that map isomorphically onto its image. Suppose that maps the edges to the edges respectively. Then for any vertex , all of the vertices are equal to ; i.e., they’re all the same vertex. Therefore, a necessary condition for to induce an isomorphism is that all the vertices are the same vertex. In particular, a necessary condition is that all the vertices have the same color. Thus we say that either the map or the -tuple of edges is color-compatible if for every , all the vertices have the same color. More specifically, for any ordered -tuple of colors , we say that is -compatible if for every , all the vertices have color , or equivalently, if for every . Thus is color-compatible if there exists a -tuple such that is -compatible. Furthermore, if is -compatible and the colors are distinct, then we will say that is distinctly color-compatible.
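The definitions of color-compatible and distinctly color-compatible can be sketched as a small check. The representation is an assumption made for illustration: the small graph is a list of directed edges, a candidate tuple lists one stream edge per edge of the small graph, and `color` is a dictionary from stream vertices to colors. The function names are hypothetical.

```python
def compatible_colors(h_edges, edge_tuple, color):
    """Return the color forced at each vertex of the small graph, or None.

    None means that two half-edges at the same vertex of the small graph
    land on stream vertices with different colors.
    """
    assigned = {}
    for (tail, head), (u, v) in zip(h_edges, edge_tuple):
        for h_vertex, g_vertex in ((tail, u), (head, v)):
            c = color[g_vertex]
            if assigned.setdefault(h_vertex, c) != c:
                return None
    return assigned

def is_distinctly_color_compatible(h_edges, edge_tuple, color):
    """True if the tuple is compatible and all forced colors are different."""
    assigned = compatible_colors(h_edges, edge_tuple, color)
    return assigned is not None and len(set(assigned.values())) == len(assigned)

triangle = [(0, 1), (1, 2), (2, 0)]
color = {0: "red", 1: "blue", 2: "green", 3: "red"}
# A genuine triangle with three different colors passes the check.
print(is_distinctly_color_compatible(triangle, [(0, 1), (1, 2), (2, 0)], color))
# This tuple is not a triangle (it ends at vertex 3), yet it still passes,
# because vertices 0 and 3 share a color: the condition is necessary only.
print(is_distinctly_color_compatible(triangle, [(0, 1), (1, 2), (2, 3)], color))
```

The second call shows why the color test only prunes the sum: tuples that survive it are still filtered in expectation by the hash products.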
As we saw in Lemma 2, is equal to if forms a homomorphic image of , and otherwise is a uniformly random element of . The strategy in [22] is basically to compute the sum of over all . The sum then has terms and therefore tends to have high variance. Here, rather than summing over all , we will only sum over distinctly color-compatible . The resulting sum will then have far fewer terms and therefore tend to have far lower variance.
For colors and , define
[TABLE]
Thus there are such sums, and is the sum of over all edges for which the color of is and the color of is . Also, define
[TABLE]
We use to denote expected value (not to be confused with , which refers to the edge-set). We use to denote the trace of a matrix.
**Lemma 3.**
For distinct, is equal to the number of -compatible maps that induce injective homomorphisms from to .
**Proof:** From the definitions of and , we have
[TABLE]
In that last sum, there is one term for every -compatible map . Consider any one such term. By Lemma 2, if does not induce a homomorphism from to , then that term is a uniformly random element of , and, by our assumption on , its trace therefore has expected value 0. Thus those terms do not contribute to . If does induce such a homomorphism, then by Lemma 2, that term is equal to , so it contributes to the trace of . Thus is equal to the number of -compatible maps that induce homomorphisms from to . Since the colors were assumed to be distinct, any such homomorphism sends the vertices of to vertices of with different colors and is therefore injective.
Define
[TABLE]
**Theorem 1.**
[TABLE]
where is the number of automorphisms of
**Proof:** By Lemma 3, if are distinct colors, then gives an unbiased estimate of the number of -compatible maps that induce injective homomorphisms from to , i.e., the number of injective homomorphic images of in whose vertices have colors respectively. Summing over distinct , we see that gives an unbiased estimate of the number of injective homomorphic images whose vertices have distinct colors. The probability that a randomly colored injective homomorphic image of has distinct colors is
[TABLE]
so we divide by this expression. Finally, each copy of gets counted as different injective homomorphic images, so we divide by .
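The two divisions in this proof can be made concrete with exact arithmetic. The sketch below assumes the standard count: when each of `s` distinct vertices receives one of `num_colors` colors uniformly and independently, all colors differ with probability `num_colors * (num_colors - 1) * ... * (num_colors - s + 1) / num_colors**s`. The extra `dim` factor is also an assumption: it accounts for the trace of an identity matrix equaling its dimension. The function names are hypothetical.

```python
from fractions import Fraction
from math import perm

def distinct_color_probability(num_colors, s):
    """Chance that s distinct vertices receive s different uniform colors."""
    return Fraction(perm(num_colors, s), num_colors ** s)

def final_scale(num_colors, s, automorphisms, dim=1):
    """Factor applied to the trace of the summed products (assumed form)."""
    return 1 / (dim * distinct_color_probability(num_colors, s) * automorphisms)

# Triangles (3 vertices, 6 automorphisms) with 3 colors and 1-by-1 matrices.
print(distinct_color_probability(3, 3))  # 2/9
print(final_scale(3, 3, 6))              # 3/4
```

With three colors only two ninths of triangles are rainbow-colored, so the raw sum must be scaled up accordingly before dividing by the automorphism count.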
Theorem 1 provides the method for counting copies of . As the edges stream by, we compute the sums . In particular, if the edge streams by, then for each , we compute and add it to the sum . (For an edge-deletion, we subtract from .) Once the data-stream has ended, for every -tuple of distinct colors , we compute the product using Equation (3). Finally, we sum these values to get , take the trace, and multiply by
[TABLE]
to get the final estimate. We refer to this as Algorithm 1 and summarize the steps in Table 1. Observe that after the data-stream ends, we do a potentially large computation, which could involve computing roughly values . There are often, but not always, ways to do this computation with less than work. This is discussed further in Section 4.
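The steps of Algorithm 1 can be sketched for the special case of counting triangles. This is a hedged illustration rather than the authors' implementation: it assumes 1-by-1 matrices (roots of unity of order `r`), a directed triangle as the small graph, explicit hash tables in place of a limited-independence family, and a brute-force loop over ordered color triples at the end. The names `make_hashes`, `TriangleSketch`, and `random_sketch` are hypothetical.

```python
import cmath
import random
from itertools import permutations
from math import perm

H_EDGES = [(0, 1), (1, 2), (2, 0)]  # directed triangle (illustrative choice)
S, AUT = 3, 6                        # its vertex count and automorphism count

def make_hashes(free_tables, n, r):
    """Hash tables (exponents mod r) keyed by (edge index, 0=tail / 1=head)."""
    incident = {}
    for i, (tail, head) in enumerate(H_EDGES):
        incident.setdefault(tail, []).append((i, 0))
        incident.setdefault(head, []).append((i, 1))
    hashes, free_iter = {}, iter(free_tables)
    for half_edges in incident.values():
        distinguished, *others = half_edges
        for he in others:
            hashes[he] = next(free_iter)
        # The distinguished half-edge cancels the others at the same stream vertex.
        hashes[distinguished] = [(-sum(hashes[he][v] for he in others)) % r
                                 for v in range(n)]
    return hashes

class TriangleSketch:
    def __init__(self, hashes, colors, num_colors, r):
        self.hashes, self.colors = hashes, colors
        self.num_colors, self.r = num_colors, r
        self.z = {}  # (edge index, tail color, head color) -> running sum

    def update(self, u, v, sign=1):
        """Insert (sign=1) or delete (sign=-1) the undirected edge {u, v}."""
        for a, b in ((u, v), (v, u)):  # treat it as two directed edges
            for i in range(len(H_EDGES)):
                e = (self.hashes[(i, 0)][a] + self.hashes[(i, 1)][b]) % self.r
                key = (i, self.colors[a], self.colors[b])
                value = cmath.exp(2j * cmath.pi * e / self.r)
                self.z[key] = self.z.get(key, 0) + sign * value

    def estimate(self):
        """Sum products over ordered triples of distinct colors, then rescale."""
        total = 0
        for tup in permutations(range(self.num_colors), S):
            term = 1
            for i, (tail, head) in enumerate(H_EDGES):
                term *= self.z.get((i, tup[tail], tup[head]), 0)
            total += term
        scale = self.num_colors ** S / (perm(self.num_colors, S) * AUT)
        return (total * scale).real

def random_sketch(n, num_colors, r, seed=0):
    """Build one sketch with random tables and a random coloring."""
    rng = random.Random(seed)
    n_free = 2 * len(H_EDGES) - S
    free = [[rng.randrange(r) for _ in range(n)] for _ in range(n_free)]
    colors = [rng.randrange(num_colors) for _ in range(n)]
    return TriangleSketch(make_hashes(free, n, r), colors, num_colors, r)
```

A single sketch gives a noisy value; in practice one would average `estimate()` over many independently seeded sketches, each fed the same stream through `update`.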
In the case where with , a very slight modification to Algorithm 1 reduces the update time per edge by roughly a factor of . In this modified algorithm, which we call Algorithm 2, we do not compute the sums until after the data stream has ended. Instead, we keep counts of how many times each would have contributed to . Thus we have a count for each , which we call . Suppose that when some edge streams by, we compute and find that it is equal to . Rather than immediately adding to , we add 1 to . (If is an edge-deletion or if is equal to , then we instead subtract 1 from the count.) Thus, rather than updating diagonal entries, we update one count, saving a factor of in update time. The storage does not change much: for each , rather than storing the values of diagonal entries, we store counts. After the data stream ends, we compute each
[TABLE]
Note that Equation (5) can be evaluated using a fast Fourier transform, though this is unlikely to have much effect on the overall run time. The steps of Algorithm 2 are summarized in Table 2.
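The saving in Algorithm 2 can be illustrated by comparing the two update styles directly. The sketch assumes a specific generator that the extracted text does not spell out: a diagonal matrix whose entries are the successive powers of a primitive root of unity of order `d`. Each update is written as a pair `(power, sign)`, where the sign folds together deletions and the negated group elements. All names are hypothetical.

```python
import cmath

def accumulate_diagonal(updates, d):
    """Algorithm-1 style: every update touches all d diagonal entries."""
    diag = [0j] * d
    for power, sign in updates:
        for l in range(d):
            diag[l] += sign * cmath.exp(2j * cmath.pi * power * l / d)
    return diag

def accumulate_counts(updates, d):
    """Algorithm-2 style: every update touches a single counter."""
    counts = [0] * d
    for power, sign in updates:
        counts[power % d] += sign
    return counts

def counts_to_diagonal(counts):
    """Recover the diagonal after the stream ends (a plain, unnormalized DFT)."""
    d = len(counts)
    return [
        sum(n * cmath.exp(2j * cmath.pi * power * l / d)
            for power, n in enumerate(counts))
        for l in range(d)
    ]

updates = [(1, 1), (3, 1), (1, -1), (2, 1), (3, 1), (0, -1)]
direct = accumulate_diagonal(updates, 4)
via_counts = counts_to_diagonal(accumulate_counts(updates, 4))
print(all(abs(x - y) < 1e-9 for x, y in zip(direct, via_counts)))
```

The quadratic-time loop in `counts_to_diagonal` is where a fast Fourier transform routine could be substituted; it is written out here only to keep the example dependency-free.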
3 The Variance
In this section, we bound the variance of the estimate given by our algorithm. Note that the variance is the same whether we use Algorithm 1 or Algorithm 2, since they produce the same estimate, so we do not distinguish between the two. The variance does however depend on the choice of , and our proof only applies when is either the group of roots of unity or the group . In either case, the variance is a large sum, but most terms in the sum are zero. In Section 3.1, we give conditions that classify which terms contribute non-trivially to the sum when is the group of roots of unity. In Section 3.2, we do the same when is the group . In Section 3.3, we bound the number of terms that satisfy those conditions, obtaining our bound.
Our estimate of (which is given in Theorem 1) has variance
[TABLE]
where denotes the complex conjugate of . We thus wish to understand the term .
From Equation (4),
[TABLE]
so
[TABLE]
Thus is a sum of terms of the form
[TABLE]
In particular, there is one term for every -tuple of edges for which is distinctly color-compatible and is distinctly color-compatible. In contrast, for the [KMSS]-algorithm, the analogous expression for the variance has a term for each -tuple of edges regardless of color-compatibility.
For most -tuples of edges , the product (8) has expected value 0 and therefore does not contribute to the variance. Here we classify the -tuples that do contribute to the variance. Consider some -tuple of edges , and consider any vertex . We consider three conditions that the -tuple may or may not satisfy at :
Condition 1:
The vertices are all the same, and the vertices are all the same.
Condition 2:
for all .
Condition 3:
There are vertices such that for every , either and , or and .
Note that Condition 1 is a special case of Condition 3. In general, when Condition 1 is satisfied at every vertex of , each of and forms a homomorphic image of . In general, when Condition 2 is satisfied at every vertex of , is an arbitrary collection of edges, and .
The following lemma turns Conditions 1–3 into conditions on and . Those conditions will later let us characterize which contribute to the variance.
**Lemma 4.**
Suppose is any -tuple of edges of .
- A. If satisfies Condition 1 at , then .
- B. If satisfies Condition 2 at but not Condition 1, then , and each is a uniformly random element of .
- C. If satisfies Condition 3 at but not Condition 1, then , and each is a uniformly random element of .
- D. If does not satisfy Condition 1, 2, or 3 at , then either or is a uniformly random element of and is independent of the other.
**Proof:** Suppose that and . If satisfies Condition 1 at , then by Lemma 1, and .
Now suppose that Condition 1 is not satisfied at . Let be the distinguished half-edge at . Then
[TABLE]
so
[TABLE]
Similarly,
[TABLE]
Since Condition 1 is not satisfied, either some or some . Assume it is the former. Then is a uniformly random element of , and it is independent of for all , since then neither nor is distinguished. Thus is a uniformly random element of . Similarly, if , then is a uniformly random element of .
If satisfies Condition 2, then for each , , so .
If satisfies Condition 3, then for each , either and , or and . Either way, is the inverse of , so .
Suppose then that does not satisfy any of the three conditions. Suppose also that for some , one of , , , and differs from the other three. Suppose the one that differs is either or . Then is a uniformly random element of , and it is independent of . It is also independent of and for all . Thus is a uniformly random element of and is independent of . Similarly, if or was the one that differed from the other three, then would be uniformly random and independent of . Suppose then that for each , none of , , , and is different from the other three. If , then Condition 2 must hold; whereas if , then Condition 3 must hold.
3.1 Variance When Consists of Roots of Unity
At this point, the discussion splits into two cases depending on whether is a group of roots of unity or a group of matrices. Here we consider the former. Therefore we fix some integer and let be the group of 1-by-1 matrices whose entries are roots of unity. Since the matrices are 1-by-1, we treat all matrices as complex numbers rather than matrices. Also, since the trace of a 1-by-1 matrix is equal to its entry, we simply remove “” from any equations. Thus the expression (6) for variance becomes
[TABLE]
Since is a sum of terms of the form , the next theorem classifies which pairs contribute to .
**Theorem 2.**
Let be a -tuple of edges of . If either of the following holds:
- satisfies Condition 1 or 2 for every , or
- , and satisfies Condition 1, 2, or 3 for every ,
then . Otherwise,
[TABLE]
**Proof:** We can write
[TABLE]
as
[TABLE]
If satisfies Condition 1 or 2 at some , then by Lemma 4, , so
[TABLE]
If and satisfies Condition 3 at some , then , so
[TABLE]
Thus if either of these two conditions holds at every , then
[TABLE]
Suppose now that at some , Conditions 1 and 2 don’t hold. If Condition 3 holds and , then by Lemma 4, , and each is a uniformly random element of . Then , and since we have
[TABLE]
Thus
[TABLE]
If instead Condition 3 does not hold, then by Lemma 4, one of and is a uniformly random root of unity and is independent of the other, so again
[TABLE]
In either case, is independent of for , so
[TABLE]
3.2 Variance When
Now we consider the variance of our estimate in the case where is a group of matrices. In particular, fix a dimension and let consist of the matrices , where is the diagonal matrix with entries , and .
The variance of our estimate for is given by Expression (6). Note, however, that the trace of every element of is real, so we can dispense with complex conjugation. Thus the variance becomes
[TABLE]
We thus wish to understand the term . Since is a sum of terms of the form , the next theorem classifies how much each pair contributes to .
**Theorem 3.**
Suppose is a -tuple of edges of .
- If satisfies Condition 1 for every , then .
- If satisfies either Condition 1, 2, or 3 at every but not always Condition 1, then
[TABLE]
- Otherwise,
[TABLE]
**Proof:** Suppose that Condition 1, 2, or 3 holds at every . Recall that , and . Let denote the product of over all where Condition 1 holds. Let denote the same product at all where Condition 2 holds, but not Condition 1. Let denote the same product over all where Condition 3 holds, but not Condition 1. (In each case, if the given conditions are not satisfied at any , then define to be .) Thus . By Lemma 4-A, , so . By Lemmas 4-B and 4-C, . Furthermore, if there is at least one where Condition 2 (resp. 3) holds but not Condition 1, then (resp. ) is uniformly random. Finally, since and involve different vertices of , they are independent.
If satisfies Condition 1 at every , then , so , and .
If satisfies Condition 1 or 2 at every but not always Condition 1, then , and . With probability , , in which case . If is not , then , so . Thus
[TABLE]
If satisfies Condition 1 or 3 at every but not always Condition 1, then , and . With probability , , in which case . If is not , then , so . Thus
[TABLE]
Next, suppose satisfies Condition 1, 2, or 3 at every but not always Condition 1 or 2, and not always Condition 1 or 3. Then and . If either of or is not , then it has trace 0, in which case . Thus we only need to consider the cases where and are both ; or equivalently, the case where and . This happens with probability if is odd and if is even. Thus
[TABLE]
is equal to 1 if is odd, and 2 if is even.
Finally, suppose that does not satisfy any of Conditions 1, 2, or 3 at some vertex . Then by Lemma 4-D, one of and is a uniformly random element of , and is independent of the other. Thus
[TABLE]
3.3 Bounding the Variance
As we saw in Sections 3.1 and 3.2, a -tuple of edges only contributes to the variance if it is distinctly color-compatible and satisfies Condition 1, 2, or 3 at every vertex of . Now we bound the number of -tuples with these properties to get a bound on the variance.
Throughout this section, will denote a -tuple of edges of , where and . We continue to refer to the edges of as . In much of this section, edge-directions will be irrelevant and will often be ignored. We refer to the two halves of the edge as the “half-edge at ” and the “half-edge at ,” and similarly for the two halves of and . Let denote the undirected subgraph of consisting of the edges of , ignoring edge-directions. If (so ), then we say that and lie over . Thus, for instance, Condition 1 is satisfied at some if and only if all that lie over are equal and all that lie over are equal. If and satisfies Condition 1 (resp. 2 or 3) at , then we’ll say also that satisfies Condition 1 (resp. 2 or 3) at and at .
Suppose and are distinct vertices of , and suppose and . The vertices and need not be distinct; however, if they are not distinct, then cannot be distinctly color-compatible (because that would require that and get different colors). And similarly for , , and . Thus if we are given but we are not yet given the colors of the vertices, then we will say that is distinctly colorable if for all distinct vertices , and for all and , the vertices and are distinct, as are the vertices and (though and are not required to be distinct). If is not distinctly colorable, then no matter how colors are assigned to vertices, will not be distinctly color-compatible and therefore will not contribute to the variance.
We begin with some lemmas.
**Lemma 5.**
Suppose and , where is some vertex of . If is distinctly colorable and Condition 2 or 3 is satisfied at , but not Condition 1, then neither nor can be equal to either or .
**Proof:** If Condition 2 holds at , then . By the definition of “distinctly colorable,” and , so neither nor can equal .
If instead Condition 3 holds at , then there are two vertices and that lie over in such that for all , either and or vice versa. We may assume that and . But since Condition 1 does not hold at , there must also be some such that and . Since , it follows from the definition of “distinctly colorable” that cannot be equal to either or , and similarly for .
Normally, we refer to the edges of as ; however, in the next lemma, we will not be concerned with the directions of the edges, so we will refer to the edges as , with the understanding that for some , either and , or vice versa.
**Lemma 6.**
Suppose is a walk in the undirected graph , and suppose that Condition 1 or Condition 3 holds at every internal vertex of the walk (i.e., Condition 1 or 3 holds at each vertex for ). Then there is a walk in from to either or , and similarly for .
**Proof:** Let denote the edge of ; in other words, . Then has two “lifts” in , namely, and . We will show that each lift of is adjacent to a lift of , so we will be able to piece together lifts of the ’s to get a lift of the entire walk.
We use induction on the length of . The proof is the same for and , so we present the proof just for . If , then is the required walk. If , then by induction, there is a walk in from to either or . Since is a walk, the edges and are adjacent; in particular, the vertices and are equal. Equivalently, there is some vertex of such that . By assumption, Condition 1 or 3 holds at , so either and , or and . Either way, we can append either the edge or the edge to , obtaining a walk from to either or .
Although is connected, need not be. For instance, might consist of two isomorphic copies of . In that case, each connected component of contains a lift of every edge of . However, it can also happen that a connected component of contains lifts of only some edges of . The next lemmas involve the connected components of . We generally use to denote a connected component of and use to denote the subgraph of that lies “below” . Note that is connected, so is not generally a connected component of .
**Lemma 7.**
Suppose that satisfies Condition 1, 2, or 3 at each vertex of , and suppose it satisfies Condition 2 at some vertex. Then for every , the two vertices and are in the same connected component of . Furthermore, that component also contains some vertex at which Condition 2 is satisfied.
**Proof:** is connected, so there is a walk that starts with the half-edge at and ends at a vertex where Condition 2 holds. We can choose a minimal such walk, in which case it has no internal vertices where Condition 2 holds. Suppose the walk ends with the half-edge at . By Lemma 6, there is a walk in from to either or , and also a walk from to either or . But Condition 2 holds at , so . Thus there is a walk from to , and one from to . Concatenating them gives a walk from to . Thus and are in the same component, and are also in the same component as the vertex , at which Condition 2 holds.
**Lemma 8.**
Suppose satisfies Condition 1, 2, or 3 at every vertex of and satisfies Condition 2 at some vertex of . Let be any connected component of , and define to be the subgraph of consisting of all edges for which either or is in . Then must contain either
- at least two vertices where satisfies Condition 2;
- a vertex with degree at least 2 in and where satisfies Condition 2;
- a vertex with degree at least 3 in and where satisfies Condition 1 or 3.
**Proof:** We assumed there is some vertex of where Condition 2 is satisfied, so by Lemma 7, contains such a vertex, and so then does . If contains two such vertices, then we are done, so assume there is just one. If that one vertex has degree at least 2 in , then again we are done, so assume it has degree 1. By the Handshaking Lemma, there must be another vertex with odd degree in ; and must satisfy either Condition 1 or 3 at that vertex.
If is any vertex in , then for some , the half-edge at is in , so either the half-edge at or the half-edge at is in . If in addition satisfies either Condition 1 or 3 at , then (by the definition of Conditions 1 and 3), for every , either the half-edge at or the half-edge at is in . Therefore, for every , the half-edge at is in . In other words, has the same degree in as in . We saw in the previous paragraph that some vertex satisfies either Condition 1 or 3 and has odd degree in . It has the same degree in . But we assumed that has no leaves, so it must have degree at least 3 in .
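For reference, the parity step in the proof above rests on the Handshaking Lemma. Stated for an arbitrary finite undirected graph, with $V$ and $E$ used here as generic names for its vertex and edge sets (they are not the paper's notation), it reads:

```latex
\[
  \sum_{v \in V} \deg(v) \;=\; 2\,|E| .
\]
% The right-hand side is even, so the number of vertices of odd degree is even.
% Hence a graph with one vertex of degree 1 must contain a second odd-degree vertex.
```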
**Lemma 9.**
Suppose satisfies Condition 1, 2, or 3 at every vertex of and satisfies Condition 2 at some vertex of . Let denote the degree of in . For each connected component of , define as in the previous lemma. Suppose there are components of that satisfy:
- has at most one vertex where Condition 2 holds, and
- has degree at least two in .
Then at most distinct vertices of lie over .
**Proof:** Suppose , so the vertices are all equal to . The vertices that lie above in are and . Since Condition 2 holds at , for each , so in fact the vertices that lie above in are just . Consider any as defined in the lemma. Since has degree at least two in , at least two of the half-edges at are in . Assume without loss of generality that the half-edges at and are in . Then contains the half-edge at either or and the half-edge at either or . Since and , contains both and . But has at most one vertex where Condition 2 holds, so and must be the same vertex. Thus each of contains two of that are equal, so there can be at most that are distinct.
**Lemma 10.**
Suppose , where , and suppose . Then and .
**Proof:** The first inequality follows from . For the second inequality, if , then both and are at most , so ; if instead , then .
**Theorem 4.**
Suppose that the maximum degree of any vertex in is at most , where , and assume . Then the expected number of distinctly color-compatible -tuples of edges of that satisfy either Condition 1, 2, or 3 at every vertex of , and satisfy Condition 2 at some vertex of is .
**Proof:** Let and be as defined above. There are possibilities for the isomorphism class of (i.e., which of the vertices are the same), so it suffices to prove the theorem for an arbitrary isomorphism class. Consider then any one such class. We may assume that it is distinctly colorable.
The expected number of possibilities for can be computed in two steps: first count the number of ways to select the vertices of where colors are ignored, and then find the probability that when colors are assigned, becomes distinctly color-compatible. (When we say “select the vertices of ,” we mean, choose a vertex of for each and so that the resulting has the assumed isomorphism class.) To count the number of ways to select the vertices for , we consider one connected component of at a time. Let be some connected component of . We can arbitrarily designate any one edge of to be the “first edge.” Once we designate the first edge, there are at most ways to select its two endpoints (since has edges). There are then at most ways to select each subsequent vertex of , for a total of . Equivalently, we could have arbitrarily designated any two (not necessarily adjacent) vertices of to be the “first two vertices;” we could have then pretended that there were at most ways to select each of those two vertices and at most ways to select each other vertex of .
For a component , we use the following method to decide which will be its first two vertices. Let be the subgraph of consisting of all edges for which either or is in (as in Lemma 8). By Lemma 7, has at least one vertex where Condition 2 is satisfied. We’ll designate that as one of the first two vertices of . If there is a second such vertex, then we’ll designate it as the other. If not, if some vertex of lies above a vertex of that has degree at least 3, and where Condition 1 or 3 is satisfied, then we’ll designate that as the other. Otherwise, we’ll designate any vertex as the other.
We now show that the result is at most . We consider one vertex of at a time, and compute the factor that the vertices that lie above contribute to the result. Recall that if some vertex above was designated as one of the first two vertices in its component, then it contributes a factor of , and otherwise contributes a factor of . Furthermore, if Condition 2 holds at , and if there are vertices that lie above , then they must all receive the same color, which introduces a factor of . Note also that by Lemma 5, if and are two vertices of where Condition 2 holds, then all the vertices that lie above are distinct from all the vertices that lie above , and so these factors of are all independent. We consider four cases for . In the first case, Condition 2 holds at . In the other three cases, Condition 1 or 3 holds at , but we subdivide these cases based on whether some vertex that lies above was designated as a first vertex of its component, and whether has degree .
First consider the case where Condition 2 holds at . Let denote the number of vertices of that lie above . There are at most ways to select each of these vertices, and they must all receive the same color, so these vertices contribute at most a factor of to the count. By Lemma 9, is at most , where is the degree of in , and is the number of connected components of that satisfy: has at most one vertex where Condition 2 holds, and has degree at least two in . Thus the contribution of to the overall expected value is at most a factor of
[TABLE]
For the next three cases, suppose that either Condition 1 or 3 holds at . Then at most two vertices of lie above , and by Lemma 7, they lie in the same component of . In the case where neither was designated as one of the first two vertices of that component, their contribution to the count is at most a factor of
[TABLE]
(We used Lemma 10 in the first inequality.)
For the last two cases, suppose that one of the vertices that lie above was designated as one of the first two vertices of the component. First assume . Then the contribution of the vertices that lie above to the overall expected value is at most a factor of
[TABLE]
(We used Lemma 10 in the first inequality.)
Finally, suppose that one of the vertices that lie above was designated as one of the first two vertices of the component, and (which means that ). Then the contribution of the vertices that lie above to the overall expected value is at most a factor of
[TABLE]
(We used Lemma 10 in the last inequality.)
Observe that in all four cases (Equations (11), (12), (13), and (14)), the vertex contributed a factor of , except that in (11) and (14), there are additional factors of or . We first show that there are at least as many factors of as . For the remainder of the proof, we use rather than to denote the degree of , since will no longer be clear from context. There is one factor of in Equation (14) for each vertex and component such that:
- ,
- satisfies Condition 1 or 3, and
- a vertex that lies above was designated as one of the first two vertices of .
Observe that can have at most one vertex that satisfies Condition 1 or 3 and was designated as one of the first two vertices, so cannot contribute a factor of for any vertex besides . In other words, contributes at most one factor of overall. Now we’ll show that also contributes a factor of to (11). Since was designated as one of the first two vertices of , we know that has only one vertex where Condition 2 holds (which, by Lemma 5, implies that has only one vertex where Condition 2 holds), and cannot have a vertex that lies above a vertex with degree at least 3 in and where Condition 1 or 3 holds. Thus by Lemma 8, must have a vertex with degree at least 2 (in ) where Condition 2 holds. Then for that vertex, contributes a factor of to (11). Thus there must be at least as many factors of in (11) as there are factors of in (14). We can therefore ignore all such factors; this can only increase the product. When we ignore these factors, each vertex of contributes a factor of at most . Taking the product over gives
[TABLE]
Next, we consider what happens when no vertex of satisfies Condition 2.
**Lemma 11.**
Suppose satisfies Condition 1 or 3 at every vertex of . Then has at most two connected components.
**Proof:** is connected, so for any , there is a walk from to . Since Condition 1 or 3 holds at every vertex along the walk, we can apply Lemma 6 to deduce that there is a walk in from to either or and also a walk from to either or . Thus every vertex of is in the same connected component as either or .
**Lemma 12.**
Suppose satisfies Condition 1 or 3 at every vertex of , but not always Condition 1. Then at least one of the following must hold:
- has more edges than vertices;
- there are at least two vertices of where does not satisfy Condition 1;
- is connected (as an undirected graph).
**Proof:** We assumed that is connected and has no leaves. Suppose the first condition above does not hold (i.e., has as many vertices as edges). Then must be a cycle. Now suppose that the second condition above also does not hold, i.e., there is exactly one vertex of where does not satisfy Condition 1 (and therefore satisfies Condition 3). We will assume that the edges of going around the cycle in order are . Note that there is no loss of generality in this assumption, because edge-directions are irrelevant to this lemma. We can also assume that the vertex is the vertex where Condition 3 is satisfied, but not Condition 1. Then , and . Since Condition 1 holds everywhere else, we have and for all . Thus is a path that visits every vertex of , so is connected.
**Theorem 5.**
Suppose that the maximum degree of any vertex in is at most , where , and assume . Then the expected number of distinctly color-compatible -tuples of edges of that satisfy either Condition 1 or 3 at every vertex of , but do not satisfy Condition 1 at every vertex of , is .
**Proof:** There are again possibilities for the isomorphism class of (i.e., which of the vertices are the same), so it suffices to prove the theorem for an arbitrary isomorphism class. Assume then that we are given the isomorphism class of . We can assume that it is distinctly colorable.
We first count the number of ways to select the vertices of . By Lemma 11, has at most two components. As in the proof of Theorem 4, in each component, we can arbitrarily designate any one edge to be the “first edge.” There are at most ways to select its two endpoints (since has edges), and there are at most ways to select each subsequent vertex in the component. Equivalently, we can arbitrarily designate any two (not necessarily adjacent) vertices of the component to be the “first two vertices;” we can then pretend that there are at most ways to select each of these two vertices and at most ways to select each subsequent vertex. Since has at most vertices (because at most two vertices lie above each vertex of ), there is a total of at most possibilities for the vertices of if has one component, and possibilities if has two. Note that the number is greater in the two-component case.
Once the vertices of are chosen, colors must be assigned in such a way that the -tuple of edges is distinctly color-compatible. Consider any vertex of where Condition 3 (but not Condition 1) holds. That means that exactly two vertices and lie above in , and they must be distinct (or else Condition 1 would hold). Color-compatibility requires that and be assigned the same color, which happens with probability . Furthermore, if there are two vertices and where Condition 3 (but not Condition 1) holds, and if , , , and are the corresponding vertices that lie above and , then by Lemma 5, , , , and are all distinct. Thus every vertex where Condition 3 (but not Condition 1) holds contributes an independent factor of to the count.
Suppose now that has more edges than vertices (i.e., ). There are at most ways to select the vertices of and a probability of at most that the result is distinctly color-compatible, so (using Lemma 10) the expected number of -tuples of edges is at most
[TABLE]
Next suppose instead that has at least two vertices where Condition 3 (but not Condition 1) holds. Then there are at most ways to select the vertices of and a probability of at most that the result is distinctly color-compatible, so the expected number of -tuples of edges is at most
[TABLE]
The only remaining case is where does not have more edges than vertices and has only one vertex where Condition 3 (but not Condition 1) holds. By Lemma 12, is connected, i.e., has only one component. Then there are at most ways to select the vertices of and a probability of at most that the result is distinctly color-compatible, so the expected number of -tuples of edges is at most
[TABLE]
**Lemma 13.**
If the -tuple of edges is distinctly colorable and satisfies Condition 1 at every vertex of , then and are each isomorphic to .
**Proof:** Suppose and are (not necessarily distinct) vertices of , and suppose and . Since Condition 1 holds everywhere, if , then . Since is distinctly colorable, if , then . In other words, if and only if . Then the edge map that sends each to induces an isomorphism between and . The proof for is analogous.
**Theorem 6.**
Suppose is either the group of roots of unity (in which case ) or the group . Suppose that the maximum degree of any vertex in is at most , where , and assume . Then the estimate for given by Theorem 1 has variance that is .
**Proof:** The variance is given by
[TABLE]
As discussed earlier, is a sum of terms of the form , where and are each distinctly color-compatible. By Theorems 2 and 3, such a term contributes to only if satisfies Condition 1, 2, or 3 at every vertex of .
First consider the that satisfy Condition 1 at every vertex of . By Lemma 13, if satisfies Condition 1 at every vertex of and is distinctly colorable, then and are each isomorphic to . The number of such is then . By Theorems 2 and 3, each such contributes to and therefore contributes at most
[TABLE]
to the variance. This term is , so the contribution of these to the variance is .
Next consider the that satisfy Condition 1, 2, or 3 at every vertex of , but not always Condition 1. By Theorems 4 and 5, the number of such that are distinctly color-compatible is . By Theorems 2 and 3, each such contributes at most to . Thus these terms contribute to the variance.
4 Discussion of Algorithm
In this section, we discuss how our version of the algorithm compares to the original in terms of storage, update time per edge, and a one-time calculation.
First consider the case where is the group of roots of unity. As we showed in Theorem 6, the variance of a single instance of our algorithm is , so the number of instances needed to attain a variance of is
[TABLE]
Each instance of our algorithm requires storage, so the storage needed is . Assuming our goal is to minimize storage, if the first term in this expression is larger than the second, then we want to choose a smaller value of to ensure that
[TABLE]
i.e.,
[TABLE]
Thus, although we proved Theorem 6 for any , the best choice of is . In that case, the number of instances of our algorithm that we need to perform is , so the update time per edge is also , and the storage is . We thus save a factor of roughly in storage and in update time over the original algorithm.
Of the two terms in (15), 1 and , if the first is larger, then we are doing instances of the algorithm, so the update time per edge is . If the second is larger, then we can reduce the update time per edge by instead letting be the group and setting , but performing times as many instances of the algorithm. By Theorem 6, the variance remains . The storage requirement also does not change, since we do times as many instances of the algorithm, but each instance requires times the storage. However, now we are performing instances of the algorithm, so the update time is .
There is one drawback of our version of the algorithm: when the stream ends, a potentially large calculation is required. In particular, we must compute
[TABLE]
This could potentially involve work, although for most , we can use inclusion-exclusion to perform the calculation more efficiently. For instance, if is a 4-cycle with vertices 1,2,3,4 and edges , then we can loop through colors and for vertices 1 and 3. For each such pair of colors, we can loop through colors for vertex 2, computing
[TABLE]
Separately, we can loop through colors for vertex 4, computing
[TABLE]
We can multiply those two sums and then subtract the terms where :
[TABLE]
We thus do the computation with work rather than work. In fact, it is possible to do slightly better: each of the three sums above can be computed for all and by performing a matrix multiplication, which can be done using less than work. It would be unusual for this computation to be a significant issue, but if it is, then we might want to choose a smaller value of , in which case we would not realize the full reduction in storage.
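The inclusion-exclusion idea for the 4-cycle can be illustrated with a minimal sketch. It assumes that each edge of the 4-cycle carries a table indexed by the colors of its two endpoints, and that the final quantity is the sum, over all assignments of pairwise distinct colors to vertices 1, 2, 3, 4, of the product of the four table entries. The tables `A`, `B`, `D`, `E` (for edges 1-2, 2-3, 3-4, 4-1) and both function names are hypothetical, and small integers stand in for the complex- or matrix-valued sketch entries so that the two results can be compared exactly.

```python
from itertools import permutations


def sum_brute_force(A, B, D, E):
    """Sum over all 4-tuples of pairwise distinct colors; about C^4 work."""
    C = len(A)
    total = 0
    for c1, c2, c3, c4 in permutations(range(C), 4):
        total += A[c1][c2] * B[c2][c3] * D[c3][c4] * E[c4][c1]
    return total


def sum_inclusion_exclusion(A, B, D, E):
    """Same sum with about C^3 work.

    For each pair of distinct colors (c1, c3), form the sum over c2, the sum
    over c4, multiply them, and subtract the terms where c2 == c4.
    """
    C = len(A)
    total = 0
    for c1 in range(C):
        for c3 in range(C):
            if c1 == c3:
                continue
            s2 = 0      # sum over c2 of A[c1][c2] * B[c2][c3]
            s4 = 0      # sum over c4 of D[c3][c4] * E[c4][c1]
            s_same = 0  # the terms of s2 * s4 in which c2 == c4
            for c in range(C):
                if c == c1 or c == c3:
                    continue
                p2 = A[c1][c] * B[c][c3]
                p4 = D[c3][c] * E[c][c1]
                s2 += p2
                s4 += p4
                s_same += p2 * p4
            total += s2 * s4 - s_same
    return total


if __name__ == "__main__":
    C = 5
    # Deterministic small integer tables, one per edge of the 4-cycle.
    tables = [
        [[((3 * i + 5 * j + k) % 7) - 3 for j in range(C)] for i in range(C)]
        for k in range(4)
    ]
    print(sum_brute_force(*tables) == sum_inclusion_exclusion(*tables))
```

The inner loop computes all three sums in a single pass over the remaining colors; replacing that loop by matrix products over all pairs of colors for vertices 1 and 3 at once is the further speed-up mentioned in the text.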
5 Conclusion
We have described three modifications to the [KMSS]-algorithm: we define one hash function for each half-edge of rather than one for each vertex of ; we assign colors to the vertices of and restrict to distinctly color-compatible ; and we allow matrix-valued hash functions as an alternative to complex-valued hash functions. The first two modifications reduce the variance in each instance of the algorithm, and therefore reduce the number of instances needed. This in turn reduces the required storage and update time per edge. The third modification reduces only the update time per edge.
Suppose that the maximum degree of any vertex in is at most , where , and suppose . For the original [KMSS]-algorithm, both the storage and update time per edge are . For our algorithm, we have shown that the update time per edge is , and the storage is , i.e., the storage has been reduced approximately by a factor of .
References
- [1] N. Ahmed, N. Duffield, J. Neville, and R. Kompella. Graph sample and hold: a framework for big-graph analytics. KDD 2014, pp. 1446-1455, 2014.
- [2] N. Ahmed, N. Duffield, T. Willke, and R. Rossi. On sampling from massive graph streams. In Proc. VLDB, pp. 1430-1441, 2017.
- [3] N. Ahmed and R. Rossi. The Network Data Repository with Interactive Graph Analytics and Visualization. http://networkrepository.com, 2015.
- [4] K. Ahn, S. Guha, and A. McGregor. Graph sketches: Sparsification, spanners, and subgraphs. In Proceedings of the Symposium on Principles of Database Systems (PODS), pp. 5-14, 2012.
- [5] U. Alon, D. Chklovskii, S. Itzkovitz, N. Kashtan, R. Milo, and S. Shen-Orr. Network motifs: simple building blocks of complex networks. Science 298, no. 5594, pp. 824-827, 2002.
- [6] S. Assadi, M. Kapralov, and S. Khanna. A simple sublinear-time algorithm for counting arbitrary subgraphs via edge sampling. In ITCS, volume 124 of LIPIcs, pp. 6:1-6:20. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2019.
- [7] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms with an application to counting triangles in graphs. SODA, pp. 623-632, 2002.
- [8] S. Bera and A. Chakrabarti. Towards tighter space bounds for counting triangles and other substructures in graph streams. In 34th Symposium on Theoretical Aspects of Computer Science (STACS 2017), pp. 11:1-11:14, 2017.
