TL;DR
This paper introduces a Markov chain-based network interpolation method that generates plausible sequences of graphs between snapshots, with analytical estimates and validation on synthetic and real data showing its effectiveness.
Contribution
The paper presents a novel Markov chain model for network interpolation that analytically estimates transition properties and outperforms common growth models.
Findings
The model accurately interpolates between network snapshots.
It provides analytical estimates of hitting times and long-term behavior.
It outperforms preferential attachment and triadic closure models.
Network Interpolation
Submitted to the editors on June 28, 2019.
Funding: This research was supported by NSF Award DMS-1830274, ARO Award W911NF19-1-0057, and ARO MURI.
Thomas Reeves, Center for Applied Mathematics, Cornell University, Ithaca, NY 14853.
Anil Damle, Department of Computer Science, Cornell University, Ithaca, NY 14853.
Austin R. Benson, Department of Computer Science, Cornell University, Ithaca, NY 14853.
Abstract
Given a set of snapshots from a temporal network we develop, analyze, and experimentally validate a so-called network interpolation scheme. Our method allows us to build a plausible, albeit random, sequence of graphs that transition between any two given graphs. Importantly, our model is well characterized by a Markov chain, and we leverage this representation to analytically estimate the hitting time (to a predefined distance to the target graph) and long term behavior of our model. These observations also serve to provide interpretation and justification for a rate parameter in our model. Lastly, through a mix of synthetic and real-world data experiments we demonstrate that our model builds reasonable graph trajectories between snapshots, as measured through various graph statistics. In these experiments, we find that our interpolation scheme compares favorably to common network growth models, such as preferential attachment and triadic closure.
Keywords: network science, interpolation, dynamic networks
AMS subject classifications: 90B15, 05C82, 68R10
1 Introduction
Dynamic networks for temporal interactions of complex systems are a pervasive model throughout the sciences [21]. They are used to analyze, for example, interactions in social networks [13, 30], communication systems [27, 29], digital currency transactions [26, 40], and protein-protein interactions [18, 25]. Often, these dynamic datasets are recorded as a sequence of “snapshots” (also called “slices” [36] or “layers” [24]), where the snapshot represents the network at a single point in time or an aggregation of data over a period of time. A sequence of snapshots is often the fundamental type of data used to derive methods for dynamic network analysis [3, 11, 17, 19].
Even if dynamic interactions occur in real time, there are a number of reasons why one may only have access to a sequence of snapshots. A principal reason is that data may only be collected at regular intervals. Specifically, such scenarios are common in survey data. For example, sociological studies record social networks of groups at different points in time [9, 14, 34], and U.S. Census data such as the Survey of Income and Program Participation (SIPP) records job movement by surveying households at regular intervals. An online analog to offline surveys is Web crawling. In this scenario, sequences of snapshot networks are recorded from a sequence of crawls that collect network data and are subsequently analyzed for their dynamic structure [4, 31, 35]. In other cases, temporal network data is aggregated upon public release. This happens due to privacy concerns in biomedical data [5, 7] or because the interaction is associated with a regularly scheduled event, such as coauthorship of a computer science conference paper in a given year [30, 32].
Independent of the manner in which a sequence of snapshots is obtained, a natural problem is inferring what happens in the network between the snapshots, when the underlying true data is not available. Such reverse engineering would enable better real-time data analysis, localization of structural changes in the graph, and understanding of social, financial, or biological dynamics. More generally, we would like to generate a plausible sequence of graphs that takes us from one snapshot to the next. Beyond enabling exploration of the network between snapshots, such a process is valuable in the development of synthetic data for streaming algorithms by providing test sets (sequences of graphs) anchored to the real data.
A natural first approach to this problem is to appeal to network growth models such as preferential attachment [1, 28], triadic closure [22, 23], mixture models [38], or stochastic actor oriented models [43]. In this setup, one would start with a snapshot and use a network growth model, perhaps with parameters determined by the structure of the next snapshot, as a guess at how the network evolves. However, there is no guarantee that after an appropriate number of steps such growth models will be close in structure to the next snapshot, and indeed we see that this is the case in our experiments outlined in Section 5. Across all of our experiments, the behavior we observe with these models is that when initiated with one snapshot they do not have the same structure of the network at the time of the next snapshot.
Colloquially, employing a growth model looks like using an extrapolation method to try and move from one snapshot to the next—the only considerations are the choice of the parameters in a growth model and the starting snapshot. In contrast, because we have a sequence of snapshots a more apt analogy is to try and build an interpolation strategy. Here we will use both the starting and ending snapshots and explicitly generate a path between them. To illustrate the subsequent discussion, Figure 1 shows four graphs produced in an instance of our network interpolation model.
We propose a temporal network model that, given two sequential snapshots, provides a path of graphs that begins at one snapshot and ends exactly at the next snapshot—i.e., it exactly interpolates the snapshots. More specifically, the model provides a feasible sequence of additions and deletions of edges that transitions between the two snapshots. We stress that a key benefit of our model is that it does not simply interpolate one-dimensional statistics of the snapshots (such as the clustering coefficient). It actually provides a sequence of graphs whose statistics interpolate those of the snapshots.
Of course, absent sufficient additional side information, there could be an infinite number of potential sequences from one graph to the next. Thus, our model is determined by the process through which we build the interpolation. More specifically, our model is based on a Markov chain on graphs. Given a starting graph and a target graph, our model makes a sequence of edits to the starting graph; at each step, a biased coin determines whether we increase or decrease the edit distance between the current graph and the target graph by one, where the bias depends on the current edit distance.
While our model is a Markov chain on the space of all graphs of appropriate size, we show how to analyze its structure using a much simpler Markov chain on the set of edit distances between the starting and target graphs. This reduces the dimension of the chain from exponential in the number of nodes to quadratic in the size of the starting and target graph. By assuming a flexible bias parameter in the type of transition (increasing or decreasing edit distance) we are able to provide an analytic expression for the expected hitting time of the Markov chain, thereby characterizing the expected number of edge additions and deletions needed to reach the target graph as a function of a rate parameter. We also theoretically show that the limiting distribution is concentrated around the target graph. Our theory is also validated through numerical experiments.
A further benefit of our model is that it may be adapted beyond just the interpolation setting. Specifically, rather than being asked to explicitly hit the target graph it can also evolve to a graph within a given edit distance of the target graph and then fluctuate around that point. This is useful in several settings. For example, one may assume a given snapshot is noisy and matching it exactly may not be desirable. Alternatively, this process may be used to construct long term trajectories of graphs that fluctuate around a given target. These trajectories, anchored by real-world graphs, may then be used in the development and testing of streaming algorithms.
Experimentally, we use our model to analyze a number of synthetic and real-world network snapshot sequences. We show that our model can interpolate between snapshots, while common extrapolatory growth models deviate substantially from the exhibited structure even when we fit growth parameters to the snapshots. One key example is that our network interpolation scheme often maintains clustering between snapshots (as measured by the mean and global clustering coefficients), whereas growth models (even ones based on triadic closure) fail to reasonably represent the clustering between snapshots in communication, social, and collaboration networks.
2 Related work
The most relevant models in the literature are snapshot models [15, 44], which consist of probabilistically generated graphs whose parameters (such as edge connection probabilities and vertex labels) vary randomly over a series of discrete time steps. Typically, the number of snapshots is very small compared to the size of the entire graph, whereas the granularity of our method is at a per-edge level. The main goal of existing snapshot studies is statistical inference of the model parameters. For example, Bhattacharjee et al. consider a series of independently drawn graphs with a given community structure that exhibits a sudden change and study the statistical estimation of the time of the change [8]. We perform a similar change-point detection experiment in Section 4.
Importantly, snapshot models do not capture the fine-grain changes that take place in a dynamic network. Network data is becoming increasingly available through fine-grain measurements of temporal interactions [12, 21, 33, 41], and snapshot graphs are designed to capture coarse-grain structure. Also, snapshot models do not yet fully harness temporal structure to identify the emergence of structure in an evolutionary framework. Aldecoa and Marín head toward this direction by proposing a rewiring process from one graph to another as a benchmark for evaluating community detection algorithms [2]. In order to tackle the problem of understanding structure, we need fine-grain models for network evolution.
Another relevant model is the stochastic actor oriented model [43]. Nodes are conceptualized as actors who control their ties to the other actors in the network. Each actor can change ties at time points determined by a Poisson rate function, and the probability of the actor toggling a tie given snapshots is estimated [9]. By contrast, our model processes edges one by one, does not characterize the individual nodes in the network, and operates on a global level.
Finally, another related idea to our research is so-called “network archaeology,” which reconstructs network histories given a present day snapshot [37, 45]. These methods infer growth model parameters by running growth models in reverse to find likely ancestors of a graph. Our model instead interpolates from one snapshot to the next.
3 Network interpolation model
We now concretely present and analyze our model for network interpolation based on given starting and target graphs. All code and data used for this paper are available at https://github.com/tr-maker/networkinterpolation.
3.1 Model description
We initialize our model with a starting graph and a target graph. For example, we could be given two snapshots of a social network and want to create a sequence of graphs interpolating between them. At each step, our model makes an update to the starting graph so that it represents the current graph. (The target graph remains fixed throughout the interpolation.) Our graph evolution model is in terms of the graph edit distance between the current graph and the target graph, which is the minimum number of moves (edge or vertex additions or deletions) needed to go from the former graph to the latter. Without loss of generality, we assume that the current and target graphs have the same number of vertices (this is not particularly limiting, as there is no restriction on either graph having multiple connected components or nodes with no edges) and that moves consist only of edge additions and deletions. At each step, we change the current graph so that
- with a given probability (the advancing probability), we make some advancing move that decreases the edit distance by one, or
- with the complementary probability, we make some regressing move that increases the edit distance by one.
We have freedom in how the advancing and regressing moves are chosen. (Because we are only modeling changes in the graph, we omit the possibility for the graph to remain the same. This removes any notion of time from our model, though it could be introduced by further modeling when the edits occur.) Advancing moves (1) add an edge that is in the target graph but not in the current graph or (2) delete an edge that is in the current graph but not in the target graph. Regressing moves (1) delete an edge in the current graph that is also in the target graph or (2) add an edge to the current graph that is not in the target graph. The simplest model to analyze is when all these possibilities are allowed, and each advancing (resp. regressing) move is chosen uniformly at random from the set of all possible advancing (resp. regressing) moves. However, we may for example wish to disallow adding an edge to the current graph that is not in the target graph. In such a model, we may have outdated edges (edges from the initial graph that are not in the target graph) but never false edges (edges in neither the initial graph nor the target graph).
Another parameter we can control is how the advancing probability varies as a function of the current edit distance to the target. Here we choose it as a sigmoid function of the current edit distance that is centered at a given distance. In the long term this process generates graphs that fluctuate around graphs that are at this edit distance from the target graph, and we call this distance the target edit distance. We also need to deal with boundary conditions at edit distance zero and at the maximum possible edit distance. More formally, the advancing probability is specified as follows:
- The advancing probability is a monotonically increasing function of the current edit distance.
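- The advancing probability equals 0 at edit distance zero, 1/2 at the target edit distance, and 1 at the maximum possible edit distance.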
These properties are realized if we define, for ,
[TABLE]
Here the definition uses a sigmoid function together with a rate parameter; the rate parameter controls the hitting time to the target edit distance and the spread of distances around the target edit distance in the time-averaged limiting distribution. The standard logistic function makes our model analytically tractable, and we use this function throughout the paper.
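As a minimal sketch of an advancing probability with the properties listed above, the following Python function uses the standard logistic function. The names `advancing_probability`, `d_target`, `d_max`, and `rate` are hypothetical, and the choice to divide the centered distance by the rate is an assumption made for illustration; the exact parameterization in Eq. 1 may differ.

```python
import math


def advancing_probability(d, d_target, d_max, rate):
    """Probability of an advancing move at edit distance d (illustrative form)."""
    if d <= 0:
        return 0.0  # already at the target graph: only regressing moves exist
    if d >= d_max:
        return 1.0  # every possible edge disagrees: only advancing moves exist
    # Assumption: the rate parameter divides the centered distance.
    z = (d - d_target) / rate
    # Numerically stable evaluation of the standard logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

Under this assumed form, a larger rate flattens the sigmoid, so the walk drifts more slowly toward the target edit distance and spreads more widely around it.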
Once we have chosen the advancing probability, Algorithm 1 compactly summarizes our model when we choose to run it for a fixed number of steps. (It is easily adaptable to settings where, e.g., we wish to terminate when the target distance is first reached; setting this target distance parameter to zero yields an interpolation scheme.) We note that if we want to store the dynamic graph at each step, we can save space by storing a sequence of edge updates to the initial graph. In addition, we do not need to store the set of regressing moves explicitly since they can be inferred from the advancing moves.
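The procedure described above can be sketched in a few lines of Python. This is a naive illustration for small graphs, not the paper's Algorithm 1 or Algorithm 2: the function name `interpolate`, its parameters, the edge representation as tuples `(i, j)` with `i < j`, and the logistic form of the advancing probability (rate dividing the centered distance) are all assumptions. Each step recomputes the set of disagreeing pairs, so it does not have the constant-time steps discussed later.

```python
import itertools
import math
import random


def _logistic(z):
    """Numerically stable standard logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def interpolate(start_edges, target_edges, n, d_target=0, rate=1.0,
                max_steps=10_000, seed=0):
    """Run the edit process until the target distance is first reached.

    Returns the final edge set and the list of edit distances visited,
    starting with the initial distance.
    """
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(range(n), 2))
    d_max = len(all_pairs)
    current, target = set(start_edges), set(target_edges)
    distances = [len(current ^ target)]
    for _ in range(max_steps):
        if distances[-1] == d_target:
            break
        diff = current ^ target  # pairs on which current and target disagree
        d = len(diff)
        if d == 0:
            p = 0.0
        elif d == d_max:
            p = 1.0
        else:
            # Assumed logistic form; the rate divides the centered distance.
            p = _logistic((d - d_target) / rate)
        if rng.random() < p:
            # Advancing move: fix one disagreeing pair, chosen uniformly.
            edge = rng.choice(sorted(diff))
        else:
            # Regressing move: break one agreeing pair, chosen uniformly.
            edge = rng.choice([e for e in all_pairs if e not in diff])
        current ^= {edge}  # toggle the chosen edge
        distances.append(len(current ^ target))
    return current, distances
```

With `d_target=0` the run stops the first time the current graph equals the target graph, which corresponds to the interpolation setting; the recorded distances change by exactly one at every step.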
Algorithm 2 describes in more detail an efficient implementation of Algorithm 1 for undirected graphs given the adjacency matrices of the starting and target graphs. We take the strictly upper triangular parts of the adjacency matrices of the starting graph and of the target graph; the difference of these two matrices, updated as the current graph changes, keeps track of the advancing and regressing moves. (If instead the starting and target graphs are directed, then we use the entire adjacency matrices in place of their strictly upper triangular parts.) A nonzero entry of one sign in the difference matrix indicates an edge that is not in the current graph but is in the target graph; a nonzero entry of the opposite sign indicates an edge that is in the current graph but not in the target graph; and a zero entry indicates an edge that is in both the current graph and the target graph, or in neither. Therefore, the number of nonzeros in the difference matrix is the number of advancing moves (or the edit distance between the current graph and the target graph), and the number of zeros among the entries that correspond to possible edges is the number of regressing moves.
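The bookkeeping described above can be illustrated with a small Python sketch for undirected graphs stored as dense 0/1 lists of lists. The helper names are hypothetical, and the sign convention (target minus current, so that +1 marks an edge to add and -1 an edge to delete) is an assumption.

```python
def strict_upper(adj):
    """Return the strictly upper triangular part of a square 0/1 matrix."""
    n = len(adj)
    return [[adj[i][j] if j > i else 0 for j in range(n)] for i in range(n)]


def difference_matrix(current_adj, target_adj):
    """Entrywise target minus current on the strictly upper triangle.

    Assumed sign convention: +1 marks an edge to add, -1 an edge to delete.
    """
    cur, tgt = strict_upper(current_adj), strict_upper(target_adj)
    n = len(cur)
    return [[tgt[i][j] - cur[i][j] for j in range(n)] for i in range(n)]


def count_moves(diff):
    """Return (number of advancing moves, number of regressing moves)."""
    n = len(diff)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    advancing = sum(1 for i, j in pairs if diff[i][j] != 0)
    return advancing, len(pairs) - advancing


# Example: the current graph has edge 0-1, the target graph has edge 1-2.
current = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
target = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]
diff = difference_matrix(current, target)
```

In this example the edit distance is 2 (delete edge 0-1, add edge 1-2), and the single remaining vertex pair 0-2 is the only possible regressing move.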
Our algorithm scales well with respect to space and time complexity. Before giving a more formal analysis, we first give a sense of the practical running time based on one of our larger experiments in Section 5. There, we interpolate between 9 snapshots of a coauthorship graph with several hundred thousand nodes and edges in each snapshot. With the rate parameter and the target edit distance fixed, we sequentially ran our algorithm by using consecutive snapshots as the respective starting and target snapshots until the target snapshot was reached. The entire sequence of 8 consecutive interpolations covered 2,862,784 steps and took 49.2 seconds in total, or 17.2 microseconds per interpolation step, on an early 2015 MacBook Pro with a 2.7 GHz processor and 8 GB of RAM. (Our implementation is not highly optimized, and one possible approach for improving performance in practice would be to make the advancing and regressing moves in batches.)
We now analyze the space and time complexity of our algorithm theoretically. If the adjacency matrices of the starting and target graphs are sparse, then the difference matrix that tracks the moves is sparse, since it contains at most the combined number of nonzeros of the two adjacency matrices. As the algorithm progresses, the difference matrix becomes sparser with each advancing move and denser with each regressing move. Due to our assumptions on the advancing probability, with high probability the sparsity of the difference matrix will be on the order of the sparsity of the starting and target adjacency matrices throughout the algorithm. In other words, the total storage required for the algorithm is on the order of the storage required to store the starting and target graphs.
As for the running time of the algorithm, initializing the difference matrix takes time linear in the number of nonzeros of the starting and target adjacency matrices. After that, each step of the algorithm depends on the complexity of the following three operations: 1) sampling a random index from the nonzero entries of the difference matrix (for an advancing move), 2) sampling a random index from the zero entries of the difference matrix (for a regressing move), and 3) updating an entry of the difference matrix (and similarly, updating an entry of the current adjacency matrix). In the case of uniform sampling and a sparse difference matrix, the three operations can be made to take amortized constant time. By storing the entries of the difference matrix in a dynamically growing array and a hash map, where the entries of the array consist of the signed edges (positive or negative) in the difference matrix and the hash map maps each signed edge to its index in the array, the operations of insertion, deletion, search, and sampling a uniformly random element can be performed in amortized constant time. In particular, this takes care of operations 1) and 3). To implement operation 2), we can use rejection sampling, drawing uniformly random edges from the set of all possible edges until we draw an edge whose corresponding entry in the difference matrix is zero. If the number of nonzeros in the difference matrix is linear in the number of vertices, then this process takes a constant number of iterations with high probability. Altogether, each step of the algorithm takes amortized constant time. Therefore, the total running time is on the order of the number of nonzeros of the starting and target adjacency matrices, plus the order of the number of steps.
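The array-plus-hash-map structure and the rejection sampling step described above might look as follows in Python. This is a sketch under simplifying assumptions: the class name `SampleableSet` and the function `sample_regressing_pair` are hypothetical, and elements are keyed by unsigned vertex pairs `(i, j)` with `i < j` rather than by signed edges.

```python
import random


class SampleableSet:
    """Set with expected O(1) add, remove, membership, and uniform sampling."""

    def __init__(self):
        self._items = []   # dynamically growing array of elements
        self._index = {}   # hash map: element -> its position in the array

    def __len__(self):
        return len(self._items)

    def __contains__(self, item):
        return item in self._index

    def add(self, item):
        if item not in self._index:
            self._index[item] = len(self._items)
            self._items.append(item)

    def remove(self, item):
        # Move the last array element into the vacated slot, then shrink.
        pos = self._index.pop(item)
        last = self._items.pop()
        if pos < len(self._items):  # the removed item was not the last one
            self._items[pos] = last
            self._index[last] = pos

    def sample(self, rng):
        return self._items[rng.randrange(len(self._items))]


def sample_regressing_pair(diff, n, rng):
    """Rejection-sample a vertex pair on which current and target agree.

    Assumes at least one such pair exists; otherwise the loop never ends.
    """
    while True:
        i, j = sorted(rng.sample(range(n), 2))
        if (i, j) not in diff:
            return (i, j)
```

An advancing move then amounts to `diff.sample(rng)` followed by `diff.remove(...)`, and a regressing move to `sample_regressing_pair(...)` followed by `diff.add(...)`. When the set of disagreeing pairs is small relative to the number of all pairs, the rejection loop rarely repeats.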
Figure 2 shows how various choices of in Eq. 1 affect the series of edit distances from the target graph. For Figure 2 and the remainder of the figures analyzed in Section 3.2, the starting graph is a 50-node Erdős-Rényi random graph with edge connection probability 0.5, the target graph is a 50-node stochastic block model divided into two equally sized clusters with in-cluster edge connection probability 0.9 and out-of-cluster edge connection probability 0.1, and the target edit distance is equal to .
3.2 Model analysis
From Algorithm 1, our model is a Markov chain on graphs. However, many interesting properties of our model can be gleaned by looking at just the edit distances at each step of the interpolation. Our model naturally describes a stochastic process on edit distances; this process is in fact Markovian provided that an advancing move and a regressing move are always possible at every edit distance strictly between zero and the maximum. This very mild assumption holds exactly in the simplest case where all possible advancing and regressing moves are allowed, and our theoretical estimates under this assumption closely match the experiments. (It is not strictly true in the case that we disallow false edges, i.e., disallow adding an edge to the current graph that is not in the target graph, since every time an edge that is not in the target graph is removed from the current graph, the edit distance for which a regressing move is impossible decreases by one.)
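With the Markov assumption, the advancing probability at a given edit distance is the transition probability from that edit distance to the next smaller one, and its complement is the transition probability to the next larger one. With the states ordered by increasing edit distance, the transition probability matrix therefore has the regressing probabilities above the diagonal and the advancing probabilities below the diagonal: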
[TABLE]
With this, we analyze two important properties of our model: the limiting distribution of edit distances from the target graph (Proposition 3.1) and the hitting time to a given distance from the target graph (Proposition 3.3). For these two quantities, we give formulae that can be computed explicitly in terms of the rate parameter. These propositions are therefore of practical importance for choosing the rate parameter in modeling. We then consider the properties of a random walk on the space of all graphs on a fixed number of vertices (Proposition 3.5). For these theoretical results, we assume that the advancing probability is of the form in Eq. 1 with the standard logistic function. To experimentally validate the propositions, we use the simplest version of the model where each advancing (resp. regressing) move is chosen uniformly at random from the set of all possible advancing (resp. regressing) moves.
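The tridiagonal transition matrix of the edit-distance chain described above can be assembled as in the following sketch. The function name `transition_matrix` and the convention that states are ordered by increasing edit distance, with rows summing to one, are assumptions; the advancing probability is passed in as a callable `p`.

```python
def transition_matrix(p, d_max):
    """Row-stochastic matrix of the edit-distance chain on states 0..d_max.

    `p(d)` is the advancing probability at distance d, with p(0) == 0 and
    p(d_max) == 1 so that no probability leaves the state space.
    """
    size = d_max + 1
    P = [[0.0] * size for _ in range(size)]
    for d in range(size):
        if d > 0:
            P[d][d - 1] = p(d)        # advancing move: below the diagonal
        if d < d_max:
            P[d][d + 1] = 1.0 - p(d)  # regressing move: above the diagonal
    return P


# Example with a constant interior advancing probability of 0.6.
p = lambda d: 0.0 if d == 0 else 1.0 if d == 3 else 0.6
P = transition_matrix(p, 3)
```

The diagonal is zero because the model never leaves the graph unchanged, which is also why the chain has period 2.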
3.2.1 Time-averaged limiting distribution of edit distances
The Markov chain as described has period 2: the state at any given time step is either an even or an odd distance away from the initial edit distance. Nevertheless, the Markov chain has a time-averaged limiting distribution in the sense that, for each state, the proportion of time spent in that state over the first steps of the chain converges as the number of steps tends to infinity [42].
The left eigenvector corresponding to the eigenvalue 1 of the transition probability matrix is represented in unnormalized form as follows:
[TABLE]
The time-averaged limiting distribution of the Markov chain is this eigenvector normalized by the sum of the entries. Although it is possible to do this eigenvector computation directly, we have the following simpler formula for approximately computing entries of the eigenvector around the target edit distance.
Proposition 3.1.
Assume that . There exists a constant such that for every and , if , each component of the time-averaged limiting distribution satisfies the following approximation for :
[TABLE]
Proof 3.2.
We can write the entries of recursively:
[TABLE]
From Eq. 5, we make two observations. First, if for all , then
[TABLE]
Second, if we iterate Eq. 5 times, we get
[TABLE]
so in particular
[TABLE]
Now assuming that is the standard logistic function, combining Eq. 6 and Eq. 7 gives that
[TABLE]
If , , and is small enough relative to , we can approximate the time-averaged limiting distribution by assuming that is supported on . Then Eq. 8 gives
[TABLE]
where the error
[TABLE]
for some universal constant . So
[TABLE]
and so . Going back to Eq. 8, we get
[TABLE]
*The parenthesized term in the denominator is greater than 1, independent of and . So for (4) to hold, it suffices that is at most some constant times , where the constant is also independent of and . The condition for in the proposition follows from (10). *
Figure 3 shows that the analytic formula reasonably approximates the limiting distribution of edit distances if the distribution of edit distances is not too spread out so that a symmetric approximation is reasonable.
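For small chains, the time-averaged limiting distribution of Section 3.2.1 can also be computed exactly. The sketch below uses the standard balance relation for a chain that only moves between neighboring states (the probability flow from distance d to d+1 equals the flow back), which is a textbook fact rather than the approximation of Proposition 3.1. The function names and the logistic form with the rate dividing the centered distance are assumptions.

```python
import math


def _logistic(z):
    """Numerically stable standard logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def advancing_probability(d, d_target, d_max, rate):
    """Assumed form: logistic in (d - d_target) / rate with fixed endpoints."""
    if d <= 0:
        return 0.0
    if d >= d_max:
        return 1.0
    return _logistic((d - d_target) / rate)


def limiting_distribution(d_target, d_max, rate):
    """Time-averaged limiting distribution over edit distances 0..d_max."""
    p = [advancing_probability(d, d_target, d_max, rate)
         for d in range(d_max + 1)]
    weights = [1.0]
    for d in range(d_max):
        # Flow balance between neighboring distances d and d + 1:
        # weights[d] * (1 - p[d]) == weights[d + 1] * p[d + 1].
        weights.append(weights[d] * (1.0 - p[d]) / p[d + 1])
    total = sum(weights)
    return [w / total for w in weights]
```

For long chains or very small rates the unnormalized weights can overflow or underflow, so a log-space variant would be preferable there; the version here is meant only for small illustrative examples.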
3.2.2 Expected hitting time to the target edit distance
The next proposition gives an analytic formula for the expected hitting time.
Proposition 3.3.
The expected hitting time to the target edit distance, starting at the initial edit distance to the target graph, is
[TABLE]
In particular, the hitting time is monotonic in the rate parameter.
Proof 3.4.
Consider the expected time to hit the target edit distance as a function of the starting edit distance. Then we have the linear recurrence
[TABLE]
Let be the starting edit distance. We are interested in . From Eq. 13, adding to both sides, subtracting from both sides, and dividing by gives
[TABLE]
which is a recurrence in terms of the successive differences of hitting times. When is the standard logistic function, then the above equation simplifies to
[TABLE]
with initial condition . Rewriting, we get
[TABLE]
*Now sum Eq. 15 from to . The left-hand side is then the telescoping sum , and we obtain *
[TABLE]
*To find the value of , we sum Eq. 15 from to . The left-hand side is the telescoping sum . Plugging in Eq. 16 for the value of gives *
[TABLE]
Figure 4 shows that the analytic expression for the hitting time is very close to the empirical average hitting time if enough terms are included in the sum in Eq. 12. This again shows that the Markov model approximates the actual dynamics of edit distances. For the analytic hitting times, we kept terms in the sum larger than machine precision; this gave 9 terms for one of the two rates considered and 27 terms for the other. Figure 4 also shows how differently our model behaves for different rate parameters in terms of the spread of actual hitting times. Figure 5 quantitatively shows how the expected hitting time grows as a function of the rate parameter and how the number of terms needed in Eq. 12 to reach machine precision grows as a function of the rate parameter. If the rate parameter is relatively small, the terms in Eq. 12 decay rapidly and the hitting time can be accurately approximated with a summation of a few terms, regardless of the target or initial edit distance.
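As a numerical cross-check on an expected hitting time, the first-step recurrence for the edit-distance chain can be solved directly through successive differences, in the spirit of the proof above. The sketch below is not the closed-form sum of Eq. 12; the function name and arguments are hypothetical, and it assumes the starting distance is at least the target distance and that the advancing probability equals 1 at the maximum distance.

```python
def expected_hitting_time(p, d_start, d_target, d_max):
    """Expected number of steps to first reach d_target from d_start.

    `p(d)` is the advancing probability, assumed positive for d > d_target
    and equal to 1 at d_max. Requires d_target <= d_start <= d_max.
    """
    # delta[d] = h(d) - h(d - 1), where h(d) is the expected hitting time
    # from distance d. Rearranging the first-step recurrence
    #     h(d) = 1 + p(d) * h(d - 1) + (1 - p(d)) * h(d + 1)
    # gives delta[d] = (1 + (1 - p(d)) * delta[d + 1]) / p(d),
    # with delta[d_max] = 1 because the only move at d_max is advancing.
    delta = {d_max: 1.0}
    for d in range(d_max - 1, d_target, -1):
        delta[d] = (1.0 + (1.0 - p(d)) * delta[d + 1]) / p(d)
    # h(d_target) = 0, so h(d_start) is the sum of the differences.
    return sum(delta[d] for d in range(d_target + 1, d_start + 1))
```

For instance, if every move is advancing, the function returns the difference between the starting and target distances, as expected.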
3.2.3 The Markov chain on the space of graphs
The previous two results concerned a Markov chain on the space of edit distances, but our model is actually a Markov chain on the space of all possible graphs with a fixed number of vertices. Here we analyze the set of graphs that the model traverses. The exact transition probabilities between the graphs depend on how the advancing and regressing moves are chosen at each time step, but we can make some general statements. A graph on a fixed set of vertices can be viewed as a boolean vector indexed by pairs of nodes, where each entry of the vector indicates whether the graph contains the corresponding edge of the complete graph. Thus the set of graphs can be viewed as the vertices of a hypercube whose dimension is the number of node pairs, and the Markov chain on graphs can be viewed as a (biased) random walk on this hypercube. The edit distance between two graphs is the Hamming distance between their boolean vector representations, and we say that two graphs are adjacent if the edit distance between them is 1. Let each layer of the hypercube be the set of all graphs at a given edit distance from the target graph: the target graph forms layer 0, the graphs formed by deleting an edge or adding an edge to the target graph form layer 1, and so on. The farthest layer from the target graph is the layer at the maximum possible edit distance, which consists of a single graph whose edges are precisely the ones that do not exist in the target graph.
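The hypercube view described above can be made concrete with a short Python sketch. The function names are hypothetical; graphs are given as collections of vertex pairs on vertices `0, ..., n-1`.

```python
import itertools
import math


def graph_to_vector(edges, n):
    """Boolean vector with one entry per vertex pair (i, j), i < j."""
    edge_set = {tuple(sorted(e)) for e in edges}
    return [pair in edge_set for pair in itertools.combinations(range(n), 2)]


def edit_distance(u, v):
    """Hamming distance between two boolean vectors of equal length."""
    return sum(a != b for a, b in zip(u, v))


def layer_size(k, n):
    """Number of graphs on n vertices at edit distance k from a fixed graph."""
    return math.comb(math.comb(n, 2), k)
```

For example, on 4 vertices there are 6 vertex pairs, so the hypercube has 64 vertices; the complement of the target graph is the unique graph at distance 6.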
Proposition 3.5.
Suppose that each advancing (resp. regressing) move is chosen uniformly at random from the set of all possible advancing (resp. regressing) moves. Then the time-averaged limiting distribution is uniform on all graphs in the same layer.
Suppose that each advancing (resp. regressing) move is chosen uniformly at random from the set of all possible advancing (resp. regressing) moves, except that no false edges are allowed. Then the time-averaged limiting distribution is supported on the subgraphs of the target graph and is uniform on all such graphs that are in the same layer.
Proof 3.6.
Write $N$ for the number of node pairs, so that the hypercube has dimension $N$ and the layers are numbered $0, 1, \ldots, N$. Note that a graph in layer $k$ is adjacent to $k$ graphs in layer $k-1$ and $N-k$ graphs in layer $k+1$. Thus, if $p$ is the probability that a graph in layer $k$ advances to a graph in the next lowest layer, then the Markov chain described in the first part of the statement can be described as follows: for $0 < k < N$, each graph in layer $k$ has a probability $p/k$ of transitioning to each adjacent graph in layer $k-1$ and probability $(1-p)/(N-k)$ of transitioning to each adjacent graph in layer $k+1$. (The graph in layer $0$ has probability $1/N$ of transitioning to each graph in layer $1$, and the graph in layer $N$ has a probability $1/N$ of transitioning to each graph in layer $N-1$.) The conclusion of the first part of the proposition holds because there is an automorphism of the Markov chain that sends a given vertex in a layer to any vertex in the same layer. (Any permutation of the vertices in layer 1 defines a permutation of the indices of each boolean vector and thus defines an automorphism of the hypercube graph.) For the second part, the graphs that contain edges not in the target graph are transient states in the Markov chain, and so in the long term the Markov chain looks like the chain described in the first part but supported on the subgraphs of the target graph.
4 Synthetic experiments
We now investigate examples of our model where the target graph is structurally different from the starting graph. This experiment models forensic graph analysis—given structurally distinct starting and target graphs, can we build a sequence of graphs between them that captures the change and subsequently analyze the behavior of that sequence to detect the structural change? We use synthetic graphs in this section as they have especially clear structure, but we consider real-data graphs in Section 5.
We consider the stochastic block model (SBM) [20], a widely used model for planted graph clustering. In this model vertices are divided into clusters, and two vertices are independently connected by an edge with one probability if they are in the same cluster and with a smaller probability if they are in different clusters. We analyze two scenarios where a starting 2-block graph evolves into a target 3-block graph. This allows us to explore how community structure affects the graph transition. In the first scenario, one block in the starting 2-block graph splits into 2 equally sized blocks of half the size. In the second scenario, the 3 blocks in the target graph are independently chosen and have no a priori relation to the 2 blocks in the starting graph.
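A stochastic block model graph of the kind described above can be sampled with the following sketch. The function name `sample_sbm` and its parameters are hypothetical, and the block sizes and probabilities in any call are placeholders rather than the values used in the experiments.

```python
import itertools
import random


def sample_sbm(block_sizes, p_in, p_out, seed=0):
    """Sample an undirected SBM graph.

    Returns (labels, edges): labels[v] is the block of vertex v, and edges
    is a set of pairs (i, j) with i < j.
    """
    rng = random.Random(seed)
    # Vertices are numbered consecutively block by block.
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    edges = set()
    for i, j in itertools.combinations(range(len(labels)), 2):
        prob = p_in if labels[i] == labels[j] else p_out
        if rng.random() < prob:
            edges.add((i, j))
    return labels, edges
```

Because vertices are numbered block by block, a starting graph with block sizes such as `[4, 4]` and a target graph with block sizes `[4, 2, 2]` corresponds to the first scenario, in which one block splits into two equal halves.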
For the experiment, we took and set and for all the block structures. We ran both evolutions with and target distance [math] and stopped the model as soon as it reached the target graph. (The target distance was [math] and so we stopped when the edit distance was first [math].) We measured the recovery rate (the fraction of nodes correctly classified) over time for the 3 clusters in the target graph using a spectral clustering algorithm [10]. We also measured the subspace distance between the space spanned by eigenvectors associated with the three largest eigenvalues of the symmetrically normalized adjacency matrix at each step and the analogous subspace of the symmetrically normalized adjacency matrix of the final target graph. The subspace distance between two subspaces represented by matrices and with orthonormal columns is [16]. This measures how well the dominant invariant subspaces of the current and target graph align and therefore if we expect a spectral algorithm to be able to pick up on the structure of the target graph.
Figure 6 shows the results of this experiment, and both scenarios display striking behavior. There is a point at which the subspace distance sharply declines, causing the recovery rate to sharply increase. The first scenario requires noticeably less time to detect the graph transition, presumably because the community structure of the target graph is related to the starting graph. Although the experiment is based on synthetic data, it suggests the capabilities of our model to transition from one structure to another, and the availability of algorithms to detect structural changes as they occur.
From the same experiment, we looked at the spectrum of the symmetric normalized adjacency matrix of the interpolating graphs in both scenarios and compared it to the spectrum of an algebraic interpolation between the normalized adjacency matrices of the starting and target graphs, where we ignore the graphs and simply interpolate linearly at equally spaced intervals between the two matrices. Figure 7 shows the resulting spectra for the first scenario, and Figure 8 shows the resulting spectra for the second scenario. In both figures, the spectra from the graph interpolation and linear matrix interpolation are strikingly similar, suggesting that our model can produce plausible spectral information while having the advantage of providing a sequence of graphs producing that spectral information. This is valuable since eigenvalue interpolation is in general known to be a difficult problem, especially in situations where the eigenvalues cross as in the second scenario.
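The linear matrix interpolation used as the baseline can be sketched as follows. The function name and the number of interpolation points are hypothetical; the inputs are assumed to be the symmetric normalized adjacency matrices of the starting and target graphs.

```python
import numpy as np

def linear_interpolation_spectra(N_start, N_target, num_points):
    """Eigenvalues of (1 - t) * N_start + t * N_target at equally spaced t in [0, 1].

    Returns an array of shape (num_points, n); each row is sorted in ascending order.
    """
    spectra = []
    for t in np.linspace(0.0, 1.0, num_points):
        M = (1.0 - t) * N_start + t * N_target
        spectra.append(np.linalg.eigvalsh(M))
    return np.array(spectra)
```

Because each row is sorted, eigenvalues that cross along the path are not tracked individually, which is one way to see why interpolating eigenvalues directly is delicate. Note also that the intermediate matrices are generally not normalized adjacency matrices of any unweighted graph.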
5 Real-data experiments
We now interpolate between snapshots of several real-world networks, demonstrating that our model can provide a sensible graph interpolation where graphs and their properties vary “smoothly,” even in the absence of any data between the snapshots. To interpolate between two graphs, we set the first graph to be the starting graph and the second graph to be the target graph. We then set the target edit distance to be [math] and ran our interpolation with a specified rate parameter until the target graph is reached. To interpolate between multiple graphs, we do this procedure for the first and second graphs, then the second and third graphs, and so on. We then compare our interpolation with extrapolation methods (growth models) as described below. Table 1 summarizes the datasets we use.
In our first type of example, we consider a strictly growing network and compare our interpolation with various extrapolations available from network growth models. For this purpose, we used the first two datasets shown in Table 1. The first one is based on private messages on a college messaging network [39]. The data consists of timestamped edges indicating communication between two individuals. From this, we create a growing dynamic graph based on the accumulation of these messages over a period of time. We took two snapshots: the first snapshot is the empty graph and the second snapshot is the aggregated graph of all communications within the first 30 days of the dataset.
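Building such accumulated snapshots from timestamped edges can be sketched as below. The input format, a list of `(u, v, t)` triples with `t` in seconds, and the function name are assumptions; time is measured from the first event in the data.

```python
def accumulated_snapshots(timed_edges, day_cutoffs, seconds_per_day=86400):
    """Cumulative edge sets at the given numbers of days after the first event.

    timed_edges: iterable of (u, v, t), with t assumed to be in seconds.
    Returns one set of undirected edges (min, max) per cutoff, in increasing
    cutoff order. Self-loops and repeated messages are collapsed.
    """
    events = sorted(timed_edges, key=lambda e: e[2])
    if not events:
        return [set() for _ in day_cutoffs]
    t0 = events[0][2]
    snapshots, current, i = [], set(), 0
    for days in sorted(day_cutoffs):
        limit = t0 + days * seconds_per_day
        # Add every event strictly before the cutoff time.
        while i < len(events) and events[i][2] < limit:
            u, v, _ = events[i]
            if u != v:
                current.add((min(u, v), max(u, v)))
            i += 1
        snapshots.append(set(current))
    return snapshots
```

For example, `accumulated_snapshots(edges, [0, 30])` yields the empty graph followed by the graph of all communications in the first 30 days.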
Figure 9 shows the change in the mean and global clustering coefficient in the actual data over time compared with our graph model interpolations and three extrapolation methods. The mean clustering coefficient of a graph is the mean local clustering coefficient across all nodes, where the local clustering coefficient of a node is the number of connected pairs of neighbors of the node divided by the total number of pairs of neighbors (and is set to 0 if the node has degree 0 or 1). The global clustering coefficient of a graph is the number of closed length-2 paths divided by the total number of length-2 paths in the entire graph.
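Both coefficients follow directly from these definitions. The sketch below assumes the graph is stored as a dictionary mapping each node to the set of its neighbors; the function names are hypothetical.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Connected pairs of neighbors of v divided by all pairs of neighbors."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention for nodes of degree 0 or 1
    linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return linked / (k * (k - 1) / 2)

def mean_clustering(adj):
    """Average of the local clustering coefficient over all nodes."""
    if not adj:
        return 0.0
    return sum(local_clustering(adj, v) for v in adj) / len(adj)

def global_clustering(adj):
    """Closed length-2 paths divided by all length-2 paths."""
    closed = 0
    total = 0
    for v in adj:
        # Each pair of neighbors of v is a length-2 path centered at v.
        for a, b in combinations(adj[v], 2):
            total += 1
            if b in adj[a]:
                closed += 1
    return closed / total if total else 0.0
```

As a worked example, take a triangle on nodes 0, 1, 2 with a pendant node 3 attached to node 0. The local coefficients are 1/3, 1, 1, and 0, so the mean clustering coefficient is 7/12, while 3 of the 5 length-2 paths are closed, giving a global clustering coefficient of 3/5.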
The interpolations in Figure 9 use rates and ; the latter was chosen to match the real data most closely. Specifically, we know that the real dynamic network took 21812 steps between the snapshots, so using Proposition 3.3 we took to be the multiple of such that the expected hitting time was closest to the number of steps taken. In general, a higher rate parameter increases the expected number of steps in the interpolation, so this parameter can be tuned if an estimate of the number of edits actually taken from one graph to another is known.
The extrapolations are from three network growth models: uniform attachment, Barabási-Albert preferential attachment [6], and Jackson-Rogers triangle closing [22]. These growth models all start with a small initial graph. In each step, a new node appears and connects to a predefined number of nodes (on average) in the existing network according to a probability distribution on the nodes. For uniform attachment, we start with a clique on some vertices, and each new node connects to nodes chosen uniformly at random in the existing graph. For preferential attachment, we start with a clique on some vertices, and each new node connects to samples of random nodes in the existing graph, chosen with probability proportional to their degree. For Jackson-Rogers triangle closing, we start with a clique on some vertices. At each step, we pick nodes in the existing graph uniformly at random and call them “parent nodes.” The new node connects to each parent node independently with probability . We also pick nodes uniformly at random from the parents’ immediate neighbors and connect to each of them independently with probability .
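A single step of each growth model can be sketched as follows. This is an illustrative reading of the descriptions above rather than the code used in the experiments. The parameter names (`m`, `m_r`, `p_r`, `m_n`, `p_n`) are hypothetical stand-ins, nodes are assumed to be labeled 0, 1, 2, ... in order of arrival, and excluding the parents themselves from the pool of parents' neighbors is an assumption.

```python
import random

def clique(m0):
    """Adjacency dict (node -> set of neighbors) for a clique on 0..m0-1."""
    return {u: {v for v in range(m0) if v != u} for u in range(m0)}

def _attach(adj, new, targets):
    """Add node `new` and connect it to every node in `targets`."""
    adj[new] = set()
    for t in targets:
        adj[new].add(t)
        adj[t].add(new)

def uniform_step(adj, m, rng):
    """New node connects to m distinct nodes chosen uniformly at random."""
    nodes = list(adj)
    _attach(adj, len(adj), rng.sample(nodes, min(m, len(nodes))))

def preferential_step(adj, m, rng):
    """New node connects to m degree-weighted samples (drawn with replacement).

    Duplicate samples collapse to a single edge. Requires at least one node
    with positive degree, which holds when starting from a clique on >= 2 nodes.
    """
    nodes = list(adj)
    weights = [len(adj[v]) for v in nodes]
    _attach(adj, len(adj), set(rng.choices(nodes, weights=weights, k=m)))

def jackson_rogers_step(adj, m_r, p_r, m_n, p_n, rng):
    """Triangle closing: link to random parents and to neighbors of parents."""
    nodes = list(adj)
    parents = rng.sample(nodes, min(m_r, len(nodes)))
    # Assumption: parents are excluded from the pool of parents' neighbors.
    pool = sorted({w for u in parents for w in adj[u]} - set(parents))
    nbrs = rng.sample(pool, min(m_n, len(pool)))
    targets = [u for u in parents if rng.random() < p_r]
    targets += [w for w in nbrs if rng.random() < p_n]
    _attach(adj, len(adj), targets)
```

For instance, starting from `clique(3)` and calling `uniform_step(g, 2, rng)` five times gives a graph on 8 nodes with 3 + 5 * 2 = 13 edges.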
We set the extrapolation parameters to match approximately the average degree of the target graph. For uniform and preferential attachment, the parameter is the number of nodes that each new node connects to, and so we set to be the nearest integer to the average degree of the target graph. For the college message network in Figure 9, as can be seen from Table 1. The triangle closing model has parameters , , , and . The average degree, in expectation, of a graph produced by this model is , so we matched this with the average degree in the dataset. To set a reasonable choice of parameters, we set and experimentally adjusted relative to until the mean clustering coefficient of the graph produced by the model approximately matched that of the target graph. Here we set and (we found that and produced a final mean clustering coefficient that was as low as the uniform attachment model.)
Figure 9 shows that the extrapolation methods do not yield the correct final mean clustering coefficient, nor even the correct qualitative behavior for the global clustering coefficient, instead decreasing rapidly from 1 (corresponding to the starting clique with isolated vertices). These existing methods are extrapolating from a starting graph, which is not appropriate for our task. This further motivates the use of our interpolation model.
We repeated our experiments using a dataset of emails at a European research institution [40] with timestamped edges indicating emails exchanged between individuals at the institution. As before, we created a growing dynamic graph based on the accumulation of the email exchanges. We did two experiments: one using just 2 snapshots (the empty graph and the final graph accumulating all the email exchanges) and one using 10 snapshots (details below). For the first experiment, Figure 10 shows the change in mean and global clustering coefficient for our graph model interpolation with rate and for the three extrapolation methods just described. For uniform and preferential attachment, we set , and for triangle closing, we set , , and . Here we were better able to fit the triangle closing model to match the final mean clustering coefficient of the target. However, for the global clustering coefficient, all of the extrapolation methods miss the mark.
Figure 11 shows our experiments on the same dataset, but where we took a snapshot of the growing graph roughly every 100 days to get 10 snapshots in total. (Because the interval between timestamps is far from uniform, a number of snapshots are clustered near the end of the dataset when the timestamps are ordered chronologically.) For each interpolation, we used Proposition 3.3 to choose the best rate parameter up to a multiple of 50 since we know the edit distance between snapshots and the number of steps taken between them. For the extrapolations, we used the same models as before but added edges one at a time. As shown in Figure 11, only our interpolation produces a plausible result with roughly the correct overall number of steps, and it also best captures the statistical trends of the data over time. One apparent inaccuracy is the dipping behavior near the beginning of each interpolation. This may be due to the assumption of uniformly random edge edits, which degrade the clustering of the graph as edges that get added near the beginning of the interpolation tend not to have any significant connection to the existing edges. This behavior is magnified in a later experiment (see Figure 14).
Our second type of real-world data example further shows the potential of our model to produce a feasible interpolation between a series of snapshots. The datasets we used here are the last two in Table 1. The first dataset consists of 7 snapshots of self-reported friendships between 32 university freshmen in a survey-based experiment [9]. Figure 12 displays the mean and global clustering coefficient over time for our interpolation with rate 1 and compares it with the same three extrapolation models. Although it is less natural to use extrapolation for non-growing networks, we included a “decaying” component as follows. If the number of edges of the graph increased between snapshots, then we proceed as before with the usual extrapolation one edge at a time until the number of edges in the extrapolation equals the number of edges in the target snapshot. If, on the other hand, the number of edges of the graph decreased between snapshots, then in each step of the extrapolation we select a “new node” uniformly at random. From this “new node,” we choose a neighbor to sever connections with. For uniform attachment, we delete a uniformly random neighbor; for preferential attachment, we delete a random neighbor inversely weighted by degree; and for triangle closing, we delete a random neighbor inversely weighted by the number of triangles in which the neighbor participates.
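One step of the "decaying" component can be sketched as follows. The function names are hypothetical, and two details are assumptions not specified above: a step is skipped when the selected node has no neighbors, and the triangle weight uses 1 / (1 + number of triangles) to avoid division by zero.

```python
import random

def triangle_count(adj, v):
    """Number of triangles containing v (adj maps node -> set of neighbors)."""
    nbrs = sorted(adj[v])
    return sum(1 for i, a in enumerate(nbrs) for b in nbrs[i + 1:] if b in adj[a])

def decay_step(adj, rule, rng):
    """Remove one edge incident to a uniformly random node.

    Returns the removed edge (node, neighbor), or None if the selected node
    is isolated (skipping in that case is an assumption).
    """
    node = rng.choice(sorted(adj))
    nbrs = sorted(adj[node])
    if not nbrs:
        return None
    if rule == "uniform":
        weights = [1.0] * len(nbrs)
    elif rule == "preferential":
        # Inversely weighted by degree; every neighbor has degree >= 1.
        weights = [1.0 / len(adj[w]) for w in nbrs]
    elif rule == "triangle":
        # Inversely weighted by triangle participation; the +1 is an assumption.
        weights = [1.0 / (1 + triangle_count(adj, w)) for w in nbrs]
    else:
        raise ValueError(f"unknown rule: {rule}")
    victim = rng.choices(nbrs, weights=weights, k=1)[0]
    adj[node].discard(victim)
    adj[victim].discard(node)
    return (node, victim)
```

Repeating `decay_step` until the edge count matches the target snapshot mirrors the one-edge-at-a-time procedure used for growth.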
The interpolation and extrapolations in Figure 12 are shown for only one run, but the qualitative behavior across runs is similar. Notably, we observe that our interpolation scheme viably interpolates the snapshots while the growth models fail to do so. For the purposes of verifying consistency across different instances of our interpolation scheme, Figure 13 provides a quartile box plot of the global clustering coefficient as a function of edit distance for two pairs of sequential snapshots: one where the beginning and ending clustering coefficients are similar, and one where they are not. The mean clustering coefficient has a comparable level of variability.
We repeated the comparison of interpolation and extrapolation using the DBLP coauthorship dataset [32], which is a series of 18 graphs detailing the coauthor relationships among the conference proceedings papers on DBLP each year from 2000–2017. To prevent the clustering coefficients from being unduly influenced by outliers, we ignored papers from the dataset with more than 10 authors, and we aggregated the coauthorship data every 2 years (thus producing 9 snapshots). The comparison is shown in Figure 14.
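The preprocessing just described can be sketched as follows. The input format, a list of `(year, authors)` pairs, and the function name are assumptions, as is the choice to build each snapshot from only the papers in its own 2-year window rather than cumulatively.

```python
from itertools import combinations

def coauthor_snapshots(papers, first_year=2000, last_year=2017,
                       window=2, max_authors=10):
    """One coauthorship edge set per window of `window` consecutive years.

    papers: iterable of (year, list_of_author_names).
    Papers with more than `max_authors` authors are ignored.
    """
    num_years = last_year - first_year + 1
    num_snapshots = -(-num_years // window)  # ceiling division
    snapshots = [set() for _ in range(num_snapshots)]
    for year, authors in papers:
        if not (first_year <= year <= last_year):
            continue
        if len(authors) > max_authors:
            continue
        idx = (year - first_year) // window
        # Every pair of coauthors on a paper forms an edge (a clique per paper).
        for a, b in combinations(sorted(set(authors)), 2):
            snapshots[idx].add((a, b))
    return snapshots
```

With the default arguments, the 18 years from 2000 to 2017 yield 9 snapshots, and each paper contributes a clique on its authors to the snapshot for its window.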
In this case, the clustering coefficients for the interpolation dip substantially instead of staying roughly linear from one snapshot to another. This is because the edits in the interpolation are done uniformly at random, which presumably does not reflect the dynamics of the actual coauthorship network, where each paper's coauthors form a clique and clustering structure is therefore preserved. This coauthorship example shows that uniform edits are not a perfect model for describing changes in a network, and more correlated edit rules may be necessary to preserve such a high clustering coefficient. We leave this for future work.
6 Conclusions
We developed a model for network interpolation, analyzed quantities of interest, and motivated it using snapshots from experiments. In doing so, we demonstrated substantial improvements over extrapolation methods in reconstructing important network statistics of dynamic graphs. We also provided a conceptual understanding of trajectories between graphs in terms of edits and edit distances and opened an area of investigation into the rules that should govern such trajectories. Although we analyzed the effect of the rate parameter in our model, further work remains to be done on the effect of sigmoid functions other than the standard logistic function and more complex edge edit rules beyond uniformly random edits. For example, edge edit rules that favor well-connectedness may help better preserve certain graph quantities of interest. Such insights would enhance the adaptability of our model to real-world situations.
We emphasize that our model has broad scope: it can be applied to any set of snapshots—real or synthetic—to generate random, plausible sequences of graphs to move from a starting graph to a target graph in a wide variety of domains. This makes our model applicable to generating synthetic streaming datasets with planted structure (including planted partition, planted clique, and planted coloring) as well as other types of emerging structure in real-world networks. It also has the potential to be used for on-demand or “live” data mining platforms. Furthermore, our model is amenable to theoretical analysis and opens up new theoretical directions in temporal network analysis.
References
- [1] R. Albert and A.-L. Barabási, Emergence of scaling in random networks, Science, 286 (1999), pp. 509–512.
- [2] R. Aldecoa and I. Marín, Exploring the limits of community detection strategies in complex networks, Scientific Reports, 3 (2018).
- [3] M. Araujo, S. Papadimitriou, S. Günnemann, C. Faloutsos, P. Basu, A. Swami, E. E. Papalexakis, and D. Koutra, Com2: Fast automatic discovery of temporal (‘comet’) communities, in Advances in Knowledge Discovery and Data Mining, Springer International Publishing, 2014, pp. 271–283, https://doi.org/10.1007/978-3-319-06605-9_23.
- [4] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan, Group formation in large social networks: membership, growth, and evolution, in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2006, pp. 44–54.
- [5] J. K. Ball, Drug abuse warning network, 2006: National estimates of drug-related emergency department visits, tech. report, Substance Abuse and Mental Health Services Administration, Office of Applied Studies, 2007.
- [6] A.-L. Barabási and R. Albert, Emergence of scaling in random networks, Science, 286 (1999), pp. 509–512.
- [7] A. R. Benson, R. Abebe, M. T. Schaub, A. Jadbabaie, and J. Kleinberg, Simplicial closure and higher-order link prediction, Proceedings of the National Academy of Sciences, 115 (2018), pp. E11221–E11230.
- [8] M. Bhattacharjee, M. Banerjee, and G. Michailidis, Change point estimation in a dynamic stochastic block model, (2018).
