Using Compressed Suffix-Arrays for a Compact Representation of Temporal-Graphs
Nieves R. Brisaboa, Diego Caro, Antonio Fariña, M. Andrea Rodriguez

TL;DR
This paper introduces TGCSA, a novel compact data structure based on the Compressed Suffix Array, for efficiently representing and querying large temporal graphs with overlapping contacts, outperforming existing methods in space and expressiveness.
Contribution
The paper presents TGCSA, a new compact data structure that improves space efficiency and query capabilities for temporal graphs, especially with overlapping contacts, compared to state-of-the-art methods.
Findings
TGCSA achieves a good space-time trade-off.
TGCSA is efficient for complex temporal queries.
TGCSA can represent overlapping contacts in temporal graphs.
| Dataset | Vertexes | Edges (e) | Lifetime | Contacts (c) | c/ | e/ | c/e | Size_u32 (MiB) | Size_b (MiB) |
|---|---|---|---|---|---|---|---|---|---|
| I.Comm.Net | 10 | 15,940 | 10 | 19,061 | 1.2 | 1594.1 | 1.2 | 291 | 127 |
| Flickr-Data | 6,204 | 71,345 | 167,943 | 71,345 | 1.0 | 11.5 | 1.0 | 1,089 | 868 |
| Powerlaw | 1,000 | 31,979 | 1 | 32,280 | 1.0 | 32.0 | 1.0 | 493 | 231 |
| Wikipedia-Links | 22,608 | 564,224 | 414,347 | 731,468 | 1.3 | 25.0 | 1.3 | 11,161 | 9,417 |
| ba100k10u1000 | 100 | 941 | 100 | 941,408 | 1000.0 | 9.4 | 1000.0 | 14,365 | 7,631 |
| ba1M10p12 | 1,000 | 9,735 | 1,000 | 50,177 | 5.2 | 9.7 | 5.2 | 766 | 479 |
| ba1M10u5 | 1,000 | 9,735 | 1,000 | 48,679 | 5.0 | 9.7 | 5.0 | 743 | 464 |
| ba1M10u50 | 1,000 | 9,735 | 1,000 | 486,792 | 50.0 | 9.7 | 50.0 | 7,428 | 4,642 |

| Dataset | TGCSA | | | TGCSA-VB | | | CET | EdgeLog | plain (bit-wise) | gzip (def) |
|---|---|---|---|---|---|---|---|---|---|---|
| I.Comm.Net | 69.36 | 61.17 | 59.17 | 91.68 | 77.17 | 73.34 | 52.28 | 82.48 | 56.00 | 66.13 |
| Flickr-Data | 82.90 | 77.60 | 76.29 | 139.12 | 132.80 | 131.34 | 49.71 | 187.39 | 79.00 | – |
| | 89.65 | 81.01 | 78.84 | – | – | – | – | – | 102.00 | 97.89 |
| Powerlaw | 81.66 | 73.85 | 71.92 | 103.88 | 90.69 | 87.50 | 67.97 | 129.88 | 60.00 | 70.01 |
| Wikipedia-Links | 78.02 | 67.73 | 65.14 | 104.69 | 94.51 | 92.34 | 57.75 | 137.08 | 108.00 | 50.67 |
| ba100k10u1000 | 74.62 | 64.47 | 61.93 | 96.52 | 79.65 | 75.18 | 43.63 | 18.22 | 68.00 | 49.88 |
| ba1M10p12 | 87.42 | 79.51 | 77.54 | 109.03 | 93.96 | 92.04 | 56.35 | 65.77 | 80.00 | 69.63 |
| ba1M10u5 | 92.74 | 85.10 | 83.22 | 115.70 | 100.68 | 98.82 | 61.37 | 67.98 | 80.00 | 72.34 |
| ba1M10u50 | 89.20 | 80.18 | 77.98 | 112.41 | 95.97 | 91.51 | 56.56 | 37.26 | 80.00 | 68.24 |
| Timeline | 0% | 25% | 50% | 75% | 100% |
|---|---|---|---|---|---|
| I.Comm.Net | 19,997 | 19,991 | 19,997 | 19,999 | 19,996 |
| Flickr-Data | 2 | 17,428 | 2,313,193 | 17,586,575 | 71,345,977 |
| Powerlaw | 2,914,527 | 2,925,980 | 2,931,495 | 2,934,810 | 2,931,023 |
| Wikipedia-Links | 1 | 5,360,597 | 80,291,698 | 206,020,758 | 307,690,159 |
| ba100k10u1000 | 18,847 | 470,948 | 470,824 | 18,786 | 470,061 |
| ba1M10p12 | 90 | 4,121,832 | 4,866,245 | 4,121,871 | 95 |
| ba1M10u5 | 90 | 4,864,776 | 4,866,275 | 4,863,160 | 94 |
| ba1M10u50 | 988 | 4,866,241 | 4,866,821 | 4,866,351 | 937 |
Using Compressed Suffix-Arrays for a Compact Representation of Temporal-Graphs
Funded in part by European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 690941 (project BIRDS). D. Caro is partially funded by the Chilean government initiative CORFO 13CEE2-21592 (2013-21592-1-INNOVA PRODUCCION). M. A. Rodríguez is partially funded by Fondecyt [1170497] and the Complex Engineering Systems Institute (CONICYT: FBO16). N. R. Brisaboa and A. Fariña were partially funded by Xunta de Galicia/FEDER-UE [CSI: ED431G/01 and GRC: ED431C 2017/58]; by MINECO-AEI/FEDER-UE [Datos 4.0: TIN2016-78011-C4-1-R and ETOME-RDFD3: TIN2015-69951-R]; and by MINECO-CDTI/FEDER-UE [CIEN: LPS-BIGGER IDI-20141259 and INNTERCONECTA: uForest ITC-20161074]. An early partial version of this article appeared in Proc. SPIRE’14 [3].
Nieves R. Brisaboa
Diego Caro
Antonio Fariña
M. Andrea Rodriguez
Data Science Institute, Faculty of Engineering, Universidad del Desarrollo, Chile.
Telefónica I+D Fellow, Chile
Database Laboratory, University of A Coruña, Spain.
Department of Computer Science, University of Concepción, Chile.
Abstract
Temporal graphs represent binary relationships that change along time. They can model the dynamism of, for example, social and communication networks. Temporal graphs are defined as sets of contacts that are edges tagged with the temporal intervals when they are active. This work explores the use of the Compressed Suffix Array (CSA), a well-known compact and self-indexed data structure in the area of text indexing, to represent large temporal graphs. The new structure, called Temporal Graph CSA (TGCSA), is experimentally compared with the most competitive compact data structures in the state-of-the-art, namely, EdgeLog and CET. The experimental results show that TGCSA obtains a good space-time trade-off. It uses a reasonable space and is efficient for solving complex temporal queries. Furthermore, TGCSA has wider expressive capabilities than EdgeLog and CET, because it is able to represent temporal graphs where contacts on an edge can temporally overlap.
Keywords: Temporal Graphs, Compressed Suffix Array, Self-index
Journal: Information Sciences
1 Introduction
The main assumption of static graphs is that the relationship between two vertexes is always available. However, this is not true in many real-world situations. For example, consider how friendship relations evolve in an online social network, or how the connectivity in a communication network changes when users, with their mobile devices, move in a city. Temporal graphs deal with the time-dependence of relationships between vertexes by representing these relationships as a set of contacts [36]. Each contact represents an edge (i.e., two vertexes) tagged with the time interval when the edge was active. For example, in a communication network, a contact may represent a call between users made from 4:00 pm to 4:05 pm.
The temporal dimension of edges adds a new constraint to the relationship between vertexes not found in static graphs: two vertexes can communicate only if there is a time-respecting path (also called a journey [36]) between them [36, 46, 50, 47, 19]. For example, in Figure 1.b (corresponding to the time aggregation of the edges in the temporal graph of Figure 1.a), two vertexes can be connected by two different paths, each through a different intermediate vertex, and yet no such path exists between them once the temporal availability of the edges is considered. Likewise, some vertexes are reachable only from one particular vertex because the other edges reaching them are not available. Therefore, taking into account the temporal dynamism of graphs allows us to exploit information about temporal correlations and causality, which would be unfeasible through a classical analysis of static graphs [36, 19, 32].
A direct approach to represent temporal graphs could be a time-ordered sequence of snapshots (Figure 1.a), one for each time instant, showing the state of the temporal graph at a time instant as a static graph. Several centralized and distributed processing systems follow this approach (e.g. Pregel [29], Giraph (http://giraph.apache.org/), Neo4J (http://neo4j.org/), Trinity [44]), but without specific support for temporal extensions [24].
In temporal graphs where contacts are active during long time intervals (as in a social network), consecutive snapshots tend to become very similar. Thus, strategies based on a sequence of snapshots are space consuming because edges are duplicated in each snapshot. An alternative change-based approach represents the temporal graph by the differences between snapshots; that is, by the set of edges that appear/disappear along time. These differences can be calculated with respect to consecutive snapshots [15], or with respect to a derived graph that diminishes the number of stored edges [38, 23, 26, 42].
The change-based approach has also been used for pre-computing reachability queries [43, 42], as some paths may remain available for several time instants [2]. Although these works improve the time performance of complex algorithms, they overlook the space cost, which becomes crucial for large temporal graphs. In this context, a compact representation can keep larger sections or even the whole temporal graph in memory and, in consequence, queries could become much more efficient by avoiding disk transfers.
Recently, some compact approaches to represent temporal graphs have been proposed [7, 8]. The work in [7] presents the $ck^d$-tree, a tree-shaped compact data structure based on the Quadtree [40], which represents the contacts of a temporal graph as points in a four-dimensional space. This data structure was designed to reduce space usage at the expense of access time in sparse temporal graphs. EdgeLog (Time Interval Log per Edge) [8] uses a compressed inverted index, which also provides fast answers to different types of queries, in particular, when solving adjacency queries involving the recovery of active neighbors of a vertex at a specific time instant. CET (Compact Events ordered by Time) [8] uses a wavelet tree [34, 16] to represent temporal graphs and is the best alternative in the state-of-the-art to answer queries related to time-instant events that change the state of an edge.
Both EdgeLog and CET overcome the overhead of storing a snapshot for each time instant by representing the temporal graph as a log of events. These events indicate when edges become active or inactive. Then, the activation state of a given edge at a time instant can be recovered by counting how many events occurred on that edge from the beginning of the lifetime up to that instant. If the number of events is even, every activation has been followed by a deactivation, so the edge is inactive. Conversely, if the number of events is odd, the last event was an activation, so the edge is active. A detailed explanation of these data structures is available in Section 2.
A main drawback of the log-based structures, such as EdgeLog and CET, is that they do not allow the representation of time-overlapping contacts of an edge. For example, if a contact represents the data communication between two machines during a time interval, it is impossible to represent a second contact between the same two machines during an overlapping time interval. This limitation arises because in these structures the event that represents the activation of the second contact would be interpreted as the deactivation event of the first contact.
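The limitation can be reproduced with a small sketch of the event-counting idea. This is only an illustration under stated assumptions: contacts are 4-tuples whose interval is taken as half-open, and `build_event_log` and `is_active_by_parity` are hypothetical helper names, not the EdgeLog or CET implementations.

```python
from collections import defaultdict

def build_event_log(contacts):
    """Turn contacts (u, v, ts, te) into a per-edge sorted list of event times."""
    log = defaultdict(list)
    for u, v, ts, te in contacts:
        log[(u, v)].append(ts)  # activation event
        log[(u, v)].append(te)  # deactivation event
    for events in log.values():
        events.sort()
    return log

def is_active_by_parity(log, u, v, t):
    """Report the edge as active iff an odd number of its events occurred up to t."""
    count = sum(1 for e in log.get((u, v), []) if e <= t)
    return count % 2 == 1

# Two contacts of the same edge that do not overlap: parity works.
log_ok = build_event_log([("a", "b", 1, 4), ("a", "b", 6, 9)])

# Two contacts of the same edge that overlap in [4, 6): parity fails.
log_overlap = build_event_log([("a", "b", 1, 6), ("a", "b", 4, 9)])

print(is_active_by_parity(log_ok, "a", "b", 7))       # inside the second contact
print(is_active_by_parity(log_overlap, "a", "b", 5))  # covered by both contacts
```

In the overlapping case the activation at time 4 is counted as the second event of the edge, so the edge is reported as inactive at time 5 even though both contacts cover that instant.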
The work in this paper presents and evaluates a data structure named Temporal Graph CSA (TGCSA). The TGCSA is a compact and self-indexed structure based on a modification of the well-known Compressed Suffix Array (CSA)[39], extensively used for text indexing. We focus on algorithms to process temporal-adjacency queries that recover the set of active neighbors of a vertex at a given time instant. These queries are basic blocks to solve time-respecting paths [32], which can be useful in the context of moving-object data [30, 25], and also when analyzing activity patterns as temporally ordered sequences of actions occurring at specific time instances or time intervals [27, 28].
We also present algorithms for answering queries that recover the snapshot of the graph at a time instant, as well as queries to recover the state of single edges. In addition, we include a complete experimental evaluation with real and synthetic data that compares TGCSA with EdgeLog and CET in terms of both space and time usage. The results of this evaluation show that TGCSA opens new opportunities for the application of suffix arrays [31, 39] in the context of graphs in general, and of temporal graphs in particular.
As discussed above, there are different fields where the application of our TGCSA, or other compact existing alternatives from the state of the art such as EdgeLog or CET, can be of interest. Among others, we can mention [45, 21]: (i) Social networks, where friendships establish connections between nodes that can vary along time. (ii) Biological networks, where functional brain connections are dynamic. (iii) Communication networks, where nodes are connected while they exchange information. This applies to person-to-person and machine-to-machine communication. (iv) Transportation networks, where the connectivity between nodes can change due to scheduling and traffic conditions. In this context, one could also model movements on a network by considering that two nodes are connected if there exists an object that moves from one node to the other during a time interval.
The structure of this paper is as follows. Section 2 presents preliminary concepts about temporal graphs and relevant queries on them. To make the paper self-contained, Sections 2.2 and 2.3 provide a brief overview of both EdgeLog and CET. These are the state-of-the-art techniques we compare TGCSA with. Section 3 introduces TGCSA by showing how to modify a traditional CSA to create TGCSA. It also describes how TGCSA solves relevant queries for temporal graphs and provides pseudocode for such operations. Finally, this section presents a new representation of the array $\Psi$ from CSA [17, 14], which increases the query performance of TGCSA. Section 4 provides the experimental evaluation that uses real and synthetic data. Final conclusions and future research directions are given in Section 5.
2 Preliminary concepts
In this section we introduce temporal graphs and a classification of the relevant basic queries that could be of interest for most applications. We also review previous compact representations of temporal graphs.
2.1 Temporal graph definition
Formally, a temporal graph is a set $\mathcal{C}$ of contacts that connect pairs of vertexes in a set $V$ during a time interval defined over the set $\mathcal{T}$ that represents the lifetime of the graph. A contact in $\mathcal{C}$ of an edge $(u,v) \in E \subseteq V \times V$ is a 4-tuple $c = (u, v, t_s, t_e)$, where $[t_s, t_e)$ is the time interval when the edge is active [36]. We say that an edge $(u,v)$ is active at time $t$ if there exists a contact $(u, v, t_s, t_e) \in \mathcal{C}$ such that $t \in [t_s, t_e)$. Note that this definition applies to directed graphs, as we consider ordered pairs of vertexes.
We classify operations on temporal graphs into two categories: queries for checking the connectivity between vertexes and queries for retrieving the changes on the connectivity that occurred along time. For the first category of queries, we define four operations: (1) the active-edge operation checks if an edge is active; (2) the direct-neighbors operation returns the active direct neighbors of a vertex; (3) the reverse-neighbors operation gives the active reverse neighbors of a vertex; (4) the snapshot operation returns all the active edges. For example, in the temporal graph of Figure 2.a, each of these operations can be answered for a given time instant: whether a particular edge is active, the sets of direct and reverse neighbors of a vertex, and the snapshot, that is, the set of all active edges.
For queries retrieving the changes on connectivity, we define two operations: (1) the activated-edges operation returns the set of edges that were activated; (2) the deactivated-edges operation returns the set of edges that were deactivated. For example, given Figure 2.a and a time instant, these operations report the edges that were activated and the edges that were deactivated at that instant.
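As a reference point, the six operations can be written as naive scans over the set of contacts. The sketch below is not one of the compact structures discussed in this paper; the function names, the toy contact set, and the half-open reading of the interval of a contact are assumptions made for illustration.

```python
# Hypothetical toy temporal graph: contacts are (u, v, ts, te) with interval [ts, te).
contacts = {(1, 2, 0, 5), (1, 3, 2, 4), (2, 3, 1, 3), (1, 2, 6, 8)}

def active_edge(C, u, v, t):
    """True if some contact of edge (u, v) covers time t."""
    return any(a == u and b == v and ts <= t < te for a, b, ts, te in C)

def direct_neighbors(C, u, t):
    """Targets of the edges leaving u that are active at time t."""
    return {b for a, b, ts, te in C if a == u and ts <= t < te}

def reverse_neighbors(C, v, t):
    """Sources of the edges reaching v that are active at time t."""
    return {a for a, b, ts, te in C if b == v and ts <= t < te}

def snapshot(C, t):
    """All edges active at time t."""
    return {(a, b) for a, b, ts, te in C if ts <= t < te}

def activated_edges(C, t):
    """Edges with a contact starting exactly at time t."""
    return {(a, b) for a, b, ts, te in C if ts == t}

def deactivated_edges(C, t):
    """Edges with a contact ending exactly at time t."""
    return {(a, b) for a, b, ts, te in C if te == t}

print(sorted(snapshot(contacts, 2)))
print(sorted(direct_neighbors(contacts, 1, 2)))
```

Every function scans all contacts, which is exactly the cost that compact indexed representations try to avoid.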
Note that all previous queries have a time-instant or a time-interval version. In what follows, we concentrate on time-instant queries, which can be easily extended to answer time-interval queries, and they also serve as the building blocks for more complex temporal measures that are based on recovering time-respecting paths [32].
2.2 EdgeLog: Baseline representation
A simple temporal graph representation [6] stores the aggregated graph (the static graph including all the edges that were active at any time during the lifetime of the temporal graph) as adjacency lists, one per vertex, with a sorted list of time intervals attached to each neighboring vertex indicating when that edge is/was active. Figure 2.b shows a conceptual example.
To check if an edge $(u,v)$ was active at time $t$, we first check if $v$ appears within the adjacency list of vertex $u$. If $v$ is found, then we need to check if $t$ falls into one of the time intervals related to $v$ that are represented in the time-interval list of that edge. Direct neighbors of vertex $u$ at time $t$ are recovered similarly. For each neighbor $v$ in the adjacency list of $u$, we check if $t$ is within the time intervals of the edge $(u,v)$.
A simple representation of the aggregated graph and the temporal labels attached to vertexes has two main drawbacks: (1) it uses much space; and (2) the reverse-neighbors operation requires traversing all the adjacency lists. The data structure EdgeLog [8] addresses these weaknesses. On the one hand, since both the adjacency list and the time-interval list are sorted (i.e., they are of the form $\langle x_1, x_2, \dots, x_k \rangle$, with $x_i < x_{i+1}$), they can be represented as d-gaps $\langle x_1, x_2 - x_1, \dots, x_k - x_{k-1} \rangle$, and those differences can be compressed using a variable-length encoding (e.g., [52], [51], [49]). On the other hand, to avoid traversing all the adjacency lists in reverse-neighbors queries, EdgeLog stores a reverse aggregated graph containing an adjacency list with all the reverse neighbors of each vertex. Therefore, to get the reverse neighbors of vertex $v$ at time $t$, we first use the reverse adjacency list to obtain the candidate reverse neighbors of $v$. Then, for each candidate reverse neighbor $u$, we search for $v$ in its adjacency list and, finally, check if the edge $(u,v)$ is active at time $t$ (using the time-interval list of the edge).
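The d-gap idea can be sketched as follows. The byte-oriented code used here (a VByte-style code where the high bit marks the last byte of a number) is only one possible variable-length encoding, chosen for illustration; it is an assumption and not necessarily the encoder used by EdgeLog, and all function names are hypothetical.

```python
def to_dgaps(sorted_values):
    """Replace each value by its difference with the previous one."""
    gaps, prev = [], 0
    for x in sorted_values:
        gaps.append(x - prev)
        prev = x
    return gaps

def from_dgaps(gaps):
    """Rebuild the original sorted values by accumulating the gaps."""
    values, total = [], 0
    for g in gaps:
        total += g
        values.append(total)
    return values

def vbyte_encode(numbers):
    """Encode non-negative integers with 7 payload bits per byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)      # high bit set marks the last byte
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

neighbors = [3, 10, 11, 300, 1000]  # a sorted adjacency list
gaps = to_dgaps(neighbors)          # small numbers when the list is dense
encoded = vbyte_encode(gaps)
print(gaps, len(encoded))
```

Dense lists produce small gaps that fit in a single byte each, which is why sparse lists with few entries compress worse under this scheme.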
2.2.1 Strengths and weaknesses of EdgeLog
EdgeLog is a simple structure using well-known technology, and it is expected to be extremely space-efficient when the temporal graph has a low number of edges per vertex and a large number of contacts per edge. Conversely, a low number of contacts per edge will have a negative impact on the compression obtained by EdgeLog (as d-gaps become large). Note also that, even with the reverse aggregated graph to find reverse neighbors, the performance is expected to be poor if the number of edges per vertex is high because all their adjacency lists will have to be checked.
EdgeLog was designed to be efficient for active-edge, direct-neighbors, and reverse-neighbors queries, but it cannot efficiently answer queries such as: “Find all the edges that have active contacts at time $t$” or “Find all the edges that have been active only once”. This is because in such operations, all the adjacency lists must be processed. Also, the applicability of EdgeLog is limited to temporal graphs whose contacts do not temporally overlap; that is, it assumes that a contact of an edge ends before another contact of the same edge starts.
2.3 CET: Compact Events ordered by Time
In CET a temporal graph is a sequence of symbol pairs that represent the changes on the connectivity between vertexes. Each pair represents either the activation or deactivation of an edge along time. Note that a contact of the form $(u, v, t_s, t_e)$ generates two changes: an activation of the edge $(u,v)$ at time $t_s$, and a deactivation at time instant $t_e$. The sequence of pairs is composed of the changes on the connectivity of edges (i.e., activations or deactivations produced by all the contacts in the temporal graph) grouped by time instant in increasing order. Figure 3.a shows how the sequence of changes of the temporal graph from Figure 2.a is built: the entries are listed time instant by time instant, so that edges that change at the same instant occupy consecutive entries, and an edge that is deactivated and later activated again contributes one entry for each of those changes.
The activation state of an edge $(u,v)$ at time instant $t$ is computed by counting how many times the pair encoding the edge appears in the subsequence of changes that spans from the beginning of the lifetime up to $t$ (a closed time interval). As we assume that all edges are inactive at the beginning of the lifetime, the first occurrence of the pair means that the edge becomes active, the second occurrence means that the edge becomes inactive, and so on. In consequence, if the pair appears an odd number of times, it means that the state of the edge is active; otherwise, it is inactive. For example, Figure 3.a contains an edge whose pair occurs three times within the queried interval, so that edge is active at the queried time instant. The direct neighbors of a vertex $u$ at time $t$ are also recovered using the counting strategy, but checking the frequency of the pairs of the form $(u, \cdot)$, i.e., the pairs whose first component is $u$. Similarly, the reverse neighbors of $v$ are obtained by counting the pairs that end with $v$.
The sequence of pairs is represented in an Interleaved Wavelet Tree (IWT) [8], a variant of the Wavelet Tree [16, 18] capable of counting the number of occurrences of multidimensional symbols in logarithmic time, while keeping a reduced space. The Wavelet Tree is a balanced binary tree, whose leaves are labeled with symbols in an alphabet $\Sigma$, and whose internal nodes handle a range of the alphabet. Each node of the Wavelet Tree represents the sequence as a bitmap with 0s and 1s, depending on the binary code used to represent each symbol in the alphabet $\Sigma$. Figure 3.b shows the IWT representation for the sequence of changes of the temporal graph in Figure 2.a. (For more details on the Wavelet Tree and its applications, refer to [34]).
In the IWT, the pairs of symbols in are represented by an interleaved code that is the result of interleaving the bits (Morton Code [41]) of the codes corresponding to the source and target vertexes of each pair. Figure 3.c shows the interleaved bits for the pairs (corresponding to the edges) of the temporal graph in Figure 2.a. Note that the symbols in pair ad are given the codes 00 and 11 respectively. Therefore, the interleaved code for pair ad is 0101, and those four bits are represented along the wavelet tree by starting in the root node with the first 0. Because that bit is a zero, we move to the left child in the next level where we use the second bit of such code. This second bit is 1 and appears at the first position in the bitmap. Subsequently, we move to the right child in the next level, and use the third bit of the code, which is the 0 at the first position of the bitmap. Finally, we move again to the left child of the node and reach the last level where we set the last bit of the code of ad, which is 1.
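The interleaving step can be sketched in a few lines. The helper names `interleave` and `deinterleave` are hypothetical, and the only vertex codes used are the two stated in the text (00 for a, 11 for d); this is an illustration of bit interleaving, not the IWT code itself.

```python
def interleave(src_code, dst_code, bits):
    """Interleave the bits of two codes, source bit first (Morton order)."""
    code = 0
    for i in range(bits - 1, -1, -1):
        code = (code << 1) | ((src_code >> i) & 1)
        code = (code << 1) | ((dst_code >> i) & 1)
    return code

def deinterleave(code, bits):
    """Recover the (source, target) codes from an interleaved code."""
    src = dst = 0
    for i in range(bits - 1, -1, -1):
        src = (src << 1) | ((code >> (2 * i + 1)) & 1)
        dst = (dst << 1) | ((code >> (2 * i)) & 1)
    return src, dst

# Pair "ad" from the text: a = 00, d = 11.
code_ad = interleave(0b00, 0b11, bits=2)
print(format(code_ad, "04b"))
```

Because source and target bits alternate, fixing only the source bits (or only the target bits) selects a regular set of branches at every other level, which is what lets both neighbor directions be handled symmetrically.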
The counting operation of a symbol $s$ in a prefix $S[1,i]$ of the sequence (for simplicity, we use the notation $S[i,j]$ to refer to the sequence of elements $S[i], S[i+1], \dots, S[j]$) is translated into counting operations over the bitmaps in the path of the symbol $s$. In order to show how the counting algorithm works, let us use the operation $rank_b(B, i)$, which, given a bitmap $B$, computes the number of occurrences of bit $b$ in $B[1,i]$. The algorithm works as follows. At the root node, if the first bit of symbol $s$ is 0 (1) we descend through the left (right) child of the node. At the child node, the position $i$ is updated to $rank_0(B,i)$ ($rank_1(B,i)$), if the first bit of the symbol is 0 (1). This process is recursively repeated with the following bits of the symbol until we reach a leaf node. At the leaf node, the number of occurrences of the symbol corresponds to the updated value of $i$. In total, this counting strategy requires answering one $rank$ operation per level over the bitmaps in the path of a symbol. Figure 3.b shows, with a darker background, the bitmaps used to count how many times a symbol appears until the fifth position of the sequence.
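The descent described above can be sketched with a small wavelet tree over integer symbols. This is an uncompressed illustration: bitmaps are plain Python lists, rank is answered from a precomputed prefix-sum array, and the class name `WaveletTree` and its layout are assumptions, not the IWT of [8].

```python
class WaveletTree:
    """Each node stores prefix sums of its bitmap plus its two children."""

    def __init__(self, seq, bits):
        self.bits = bits
        self.root = self._build(seq, bits - 1)

    def _build(self, seq, level):
        if level < 0 or not seq:
            return None
        bitmap = [(s >> level) & 1 for s in seq]
        prefix = [0]  # prefix[i] = number of 1s among the first i bits
        for b in bitmap:
            prefix.append(prefix[-1] + b)
        left = self._build([s for s in seq if ((s >> level) & 1) == 0], level - 1)
        right = self._build([s for s in seq if ((s >> level) & 1) == 1], level - 1)
        return (prefix, left, right)

    def count(self, symbol, i):
        """Occurrences of symbol among the first i elements of the sequence."""
        node, level = self.root, self.bits - 1
        while node is not None and level >= 0 and i > 0:
            prefix, left, right = node
            if (symbol >> level) & 1:
                i = prefix[i]      # rank_1 over this node's bitmap
                node = right
            else:
                i = i - prefix[i]  # rank_0 over this node's bitmap
                node = left
            level -= 1
        return i

seq = [5, 2, 5, 7, 5, 0]
wt = WaveletTree(seq, bits=3)
print(wt.count(5, 5))
```

One rank is answered per level of the tree, so the cost of a count depends on the code length of the symbol and not on how many times the symbol occurs.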
2.3.1 Strengths and weaknesses of CET
One advantage of CET is its ability to retrieve reverse neighbors with the same time performance as direct neighbors, due to the bi-dimensional representation used for storing the events of activation/deactivation of edges. Indeed, we just need to update the retrieval range to the pairs of the form $(\cdot, v)$ to obtain the frequency of neighboring changes of the edges whose target vertex is $v$.
Another advantage is that the time performance in operations about vertexes and edges is independent of the number of contacts per edge in the graph. This is because IWT allows the counting of events in logarithmic time with respect to the number of edges (instead of a sequential counting on the history of events). Due to the temporal arrangement of events of activation/deactivation of edges, operations regarding events on edges are easily obtained by extracting the subsequence related to the time instant of the query. For example, to obtain the edges that change their state at time instant $t$, we just need to recover the pairs of vertexes in the section related to events that occurred at time $t$.
Despite the advantages of CET, its main weakness is related to the counting strategy used to recover the states of edges when contacts are active for short time intervals. For example, if we want to retrieve a snapshot at a time instant $t$ in a graph where all the edges were activated and deactivated before $t$, we are forced to retrieve the frequency of all the edges (i.e., visiting each node of the IWT), although only a small fraction of them will actually be in the output. In addition, the frequency counting does not allow the representation of temporal graphs with overlapping contacts. This is because a symbol representing an overlapping contact will be interpreted as a symbol denoting the deactivation of the contact.
2.4 Improved representations of EdgeLog and CET
In the previous sections, the descriptions of EdgeLog and CET are given for temporal graphs where edges can freely appear and disappear along time, with no restrictions on the number of contacts per edge. The representation of these data structures can be improved by taking into account properties of the graph being represented, in particular, properties such as the duration and the dynamism of contacts [19].
When all contacts last only one time instant, both EdgeLog and CET can be modified to only store the event that activates an edge because, by definition, all edges will only remain active for one time instant. This small modification invalidates the strategy used to check if the edge is active (i.e., the counting strategy in CET, and the check of the interval in EdgeLog). However, it enables a new strategy to check if an edge is active. For example, in EdgeLog, the list of time intervals per edge is replaced by a list of time instants when an edge was active. Thus, checking the activation state of an edge at time $t$ reduces to verifying if the new list of time instants contains $t$. Similarly in CET, checking the activation state of an edge reduces to checking if the edge appears in the subsequence related to the events that occurred at time instant $t$.
The data structures were also specialized for temporal graphs where each edge has only one contact, and once activated, this contact remains active until the end of the lifetime. In the literature, these graphs are called incremental graphs [13]. With this kind of temporal graphs, the modification is straightforward. As all contacts end at the same time instant (i.e., at the end of the lifetime), it is not necessary to explicitly store the events that deactivate the edges. Caro et al. [8] also used this strategy to improve the space cost of both EdgeLog and CET data structures, without the need of updating the query algorithms. Nevertheless, its usefulness depends on how many contacts effectively end at the last time instant of the graph.
3 CSA for Temporal graphs (TGCSA)
The Compressed Suffix Array for Temporal Graphs (TGCSA) is a new data structure adapted from Sadakane’s Compressed Suffix Array (CSA) [39] to represent temporal graphs. Unlike EdgeLog and CET, it can represent contacts of the same edge that temporally overlap, which makes TGCSA a more general representation for temporal graphs.
Below we provide a brief presentation of the CSA. Then, we include a detailed description of TGCSA where we show how to create a TGCSA and we present a modification of the main structure ($\Psi$) of TGCSA (Section 3.4) that aims to improve its efficiency. Finally, we also show how it solves the most relevant temporal queries.
3.1 Sadakane’s Compressed Suffix Array (CSA)
Given a sequence $S[1,n]$ built over an alphabet $\Sigma$ of size $\sigma$, the suffix array $A[1,n]$ built on $S$ is a permutation of $[1,n]$ of all the suffixes $S[i,n]$ such that $S[A[i],n] \prec S[A[i+1],n]$ for all $1 \le i < n$, $\prec$ being the lexicographic ordering [31]. In Figure 4.a, we show the suffix array $A$ for the text "abracadabra" (the \$ at the end of $S$ is a terminator symbol).
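The definition can be checked directly on the example text with a naive construction. The sketch below sorts suffix start positions by the suffix they denote, which is quadratic in the worst case and only meant for illustration; the function name `suffix_array` and the 1-based positions are choices made here.

```python
def suffix_array(text):
    """Return the 1-based start positions of all suffixes in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

S = "abracadabra$"  # '$' sorts before every lowercase letter in ASCII
A = suffix_array(S)

for rank, pos in enumerate(A, start=1):
    print(rank, pos, S[pos - 1:])
```

The printed suffixes appear in increasing lexicographic order, and all suffixes starting with the same symbol occupy a contiguous range of the array.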
Because $A$ contains all the suffixes of $S$ in lexicographic order, this structure permits searching for any pattern $P[1,m]$ in time $O(m \log n)$ with a simple binary search of the range $A[left, right]$ that contains pointers to all the positions in $S$ where $P$ occurs. The $m$ term of the cost appears because, at each step of the binary search, one could need to compare up to $m$ symbols from $P$ with those in the suffix $S[A[i], n]$. Unfortunately, the space needs of $A$ are high.
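The binary search can be sketched as two bound searches over a plain suffix array. This illustration uses 0-based positions and a half-open result range, and the helper name `sa_range` is hypothetical; each comparison looks at no more than as many symbols as the pattern has.

```python
def sa_range(text, sa, pattern):
    """Return (left, right) such that sa[left:right] holds all occurrences."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-symbol prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-symbol prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo

text = "abracadabra$"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array

left, right = sa_range(text, sa, "abra")
print(right - left, sorted(sa[left:right]))
```

The size of the returned range is the number of occurrences, and the entries of the suffix array inside the range are their positions in the text.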
To reduce the space needs, CSA [39] uses another permutation $\Psi[1,n]$ defined in [17]. For each position $j$ in $S$ pointed by $A[i] = j$, $\Psi(i)$ gives the position $z$ such that $A[z]$ points to $j + 1 = A[i] + 1$. There is a special case when $A[i] = n$, in which case $\Psi(i)$ gives the position $z$ such that $A[z] = 1$. In addition, two other structures are needed, a vocabulary array $V[1,\sigma]$ with all the different symbols that appear in $S$, and a bitmap $D[1,n]$ aligned to $A$ so that $D[i] = 1$ if $i = 1$ or if $S[A[i]] \neq S[A[i-1]]$ ($D[i] = 0$ otherwise). Basically, a $1$ in $D$ marks the beginning of a range of suffixes pointed from $A$ such that the first symbol of these suffixes coincides. Therefore, if the $i$-th and $(i+1)$-th ones in $D$ occur in $D[l]$ and $D[r]$, respectively, that is, if $select_1(D, i) = l$ and $select_1(D, i+1) = r$, it means that all the suffixes $S[A[l],n]$, $S[A[l+1],n]$, …, $S[A[r-1],n]$ pointed from the entries $A[l, r-1]$ start by the same $i$-th symbol of the vocabulary. The bitmap $D$ is used to index the vocabulary array. Note that $V[rank_1(D, i)] = S[A[i]]$. Recall that $rank_1(D, i)$ returns the number of 1s in $D[1,i]$ and can be computed in constant time using $o(n)$ extra bits [22, 33], whereas $select_1(D, i)$ returns the position of the $i$-th $1$ in $D$. In Figure 4.b, we show the components of the CSA for the text "abracadabra".
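The three components can be derived from a plain suffix array as in the sketch below. The names `psi`, `D` and `V` are labels chosen here for the successor permutation, the bitmap and the vocabulary array; positions are 0-based, the construction is naive, and nothing is compressed.

```python
def build_csa_components(text):
    """Return (sa, psi, D, V) for a text that ends with a unique smallest symbol."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive suffix array
    inv = [0] * n                                  # inverse permutation of sa
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    # psi[i]: rank of the suffix starting one position after suffix i.
    # The suffix at the last text position wraps around to text position 0.
    psi = [inv[(sa[i] + 1) % n] for i in range(n)]
    # D[i] = 1 when suffix i starts with a different symbol than suffix i - 1.
    D = [1 if i == 0 or text[sa[i]] != text[sa[i - 1]] else 0 for i in range(n)]
    V = sorted(set(text))                          # vocabulary in sorted order
    return sa, psi, D, V

text = "abracadabra$"
sa, psi, D, V = build_csa_components(text)
print(psi)
print(D)
print(V)
```

Within each run of suffixes that share a first symbol, the values of `psi` are increasing, which is the property the compressed representation exploits.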
By using $\Psi$, $D$, and $V$, it is possible to perform binary search without the need of accessing $A$ or $S$. Note that the symbol $S[A[i]]$ pointed by $A[i]$ can be obtained by $V[rank_1(D, i)]$, symbol $S[A[i]+1]$ can be obtained by $V[rank_1(D, \Psi(i))]$, symbol $S[A[i]+2]$ can be obtained by $V[rank_1(D, \Psi(\Psi(i)))]$, and so on. Recall that $\Psi(i)$ basically indicates the position in $A$ that points to the symbol $S[A[i]+1]$. Therefore, by using $\Psi$, $D$, and $V$ we can obtain the symbols $S[A[i], A[i]+m-1]$ that we could need to compare with $P[1,m]$ in each step of the binary search.
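A minimal sketch of this idea follows: after construction, the text and the suffix array are discarded, and both symbol extraction and pattern counting use only the successor permutation, the bitmap and the vocabulary. The names `psi`, `D`, `V`, `extract` and `count` are assumptions for illustration, positions are 0-based, and rank is answered from a plain prefix-sum list rather than a succinct structure.

```python
from itertools import accumulate

def build_components(text):
    """Build (psi, D, V); the suffix array and the text are not returned."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    inv = {pos: rank for rank, pos in enumerate(sa)}
    psi = [inv[(pos + 1) % n] for pos in sa]
    D = [int(i == 0 or text[sa[i]] != text[sa[i - 1]]) for i in range(n)]
    V = sorted(set(text))
    return psi, D, V

def extract(psi, rank1, V, i, m):
    """First m symbols of the i-th smallest suffix, following psi."""
    out = []
    for _ in range(m):
        out.append(V[rank1[i] - 1])  # first symbol of suffix i
        i = psi[i]                   # move to the next text position
    return "".join(out)

def count(psi, D, V, pattern):
    """Number of occurrences of pattern, by binary search over suffix ranks."""
    rank1 = list(accumulate(D))      # rank1[i] = number of 1s in D[0..i]
    n, m = len(psi), len(pattern)
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if extract(psi, rank1, V, mid, m) < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, n
    while lo < hi:
        mid = (lo + hi) // 2
        if extract(psi, rank1, V, mid, m) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - left

psi, D, V = build_components("abracadabra$")
print(count(psi, D, V, "abra"))
print(extract(psi, list(accumulate(D)), V, 3, 5))
```

Extraction past the terminator wraps around to the start of the text; since the terminator is unique and smaller than every other symbol, this does not affect comparisons against patterns that do not contain it.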
In principle, $\Psi$ would have the same space requirements as $A$. Fortunately, $\Psi$ is highly compressible. It was shown to be formed by $\sigma$ subsequences of increasing values [17] and, therefore, it can be compressed to around the zero-order entropy of $S$ [39], and by using $\delta$-codes to represent the differential values, a space cost of $nH_0 + O(n \log\log \sigma)$ bits is obtained. Note that, in Figure 4.b, the arrows under $\Psi$ denote the $\sigma$ subsequences of increasing values in $\Psi$. In [35], they showed that $\Psi$ can be split into $nH_k + \sigma^k$ (for any $k$) runs of consecutive values so that the differences within those runs are always $1$. This permitted them to combine $\delta$-coding of gaps with run-length encoding (of 1-runs) yielding higher-order compression of $\Psi$. In addition, to maintain fast random access to $\Psi$, absolute samples at regular intervals are kept.
In [14], the authors adapted the CSA to deal with large (integer-based) alphabets and created the integer-based CSA (iCSA). They also showed that, in this scenario, the best compression of $\Psi$ was obtained by combining differential encoding of runs with Huffman [20] and run-length encoding.
As said before, , , and are enough to simulate the binary search for the interval where pattern occurs without keeping and (). Being the number of occurrences of in , this permits to solve the well-known count operation. However, if one is interested in locating those occurrences in , is still needed. In addition, to be able to extract the subsequence , we also need to keep so that we know the position in that points to . In practice, only sampled values of and are stored. Non-sampled values can be retrieved by applying -times until a sampled position is reached (then ). Similarly, sampled values of can be obtained by applying k-times from the previous sample (starting with ). In this case, . From this point, the CSA is a self-index built on that replaces (as any substring could be extracted) and does not need anymore to perform searches.
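The role of $\Psi$, $D$, and $V$ described in this section can be illustrated with a minimal Python sketch. It is not the iCSA implementation of [14]: the suffix array is built naively, positions are 0-based, a `$` terminator is assumed, `rank1` is a linear scan rather than a constant-time structure, and every function name is hypothetical.

```python
def build_csa(text):
    """Naively build (A, psi, D, V) for text + '$' using 0-based positions."""
    t = text + "$"                       # assumed terminator, smaller than any letter
    n = len(t)
    A = sorted(range(n), key=lambda i: t[i:])   # suffix array
    inv = [0] * n                        # inverse suffix array
    for rank, pos in enumerate(A):
        inv[pos] = rank
    # psi[i] is the entry of A that points to the text position following A[i]
    psi = [inv[(A[i] + 1) % n] for i in range(n)]
    # D marks the first suffix of each run sharing the same initial symbol
    D = [1 if i == 0 or t[A[i]] != t[A[i - 1]] else 0 for i in range(n)]
    V = sorted(set(t))                   # vocabulary
    return A, psi, D, V

def rank1(D, i):
    """Number of 1s in D[0..i] (linear scan, for illustration only)."""
    return sum(D[: i + 1])

def extract(psi, D, V, i, length):
    """Symbols of the suffix pointed to by entry i, without using A or the text."""
    out = []
    for _ in range(length):
        out.append(V[rank1(D, i) - 1])
        i = psi[i]
    return "".join(out)

def count(psi, D, V, pattern):
    """Number of occurrences of pattern, by binary search over psi, D and V only."""
    m, n = len(pattern), len(psi)

    def first_index(pred):
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(extract(psi, D, V, mid, m)):
                hi = mid
            else:
                lo = mid + 1
        return lo

    left = first_index(lambda s: s >= pattern)
    right = first_index(lambda s: s > pattern)
    return right - left

A, psi, D, V = build_csa("abracadabra")
print(extract(psi, D, V, A.index(0), 11))   # abracadabra
print(count(psi, D, V, "abra"))             # 2
```

In this sketch `A` is only used to find the entry where the text starts; a real CSA would keep a sampled inverse suffix array for that purpose, as explained above.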
3.2 Modifying CSA to represent Temporal Graphs
Recall that a temporal graph is a set of contacts of the form , where and are vertexes () and a link or edge between them is active during a time interval . Also , with being the time instants representing the lifetime of the graph. In Example 1, we include a set of five contacts that we will use in our discussion below.
Example 1
Let us consider the temporal graph in Figure 5 with vertexes numbered and time instants numbered . This graph contains the following five contacts: , , , , and . \qed
Aiming to use a CSA to obtain a self-indexed representation of a set of contacts (i.e. all their terms regarded as a unique sequence), we discuss in this section the two adaptations that we performed. The first one, using disjoint alphabets, consists in assigning ids from disjoint alphabets to vertexes and time instants. Then, when we perform a query for a given id (or a sequence of ids) within the CSA, that id will correspond unambiguously to either a source vertex, a target vertex, a starting time instant, or an ending time instant. The second modification consists in making $\Psi$ cyclical on the elements of the 4-tuple representing a contact. This will permit us to use the regular binary search procedure of the CSA to efficiently search for (and retrieve) those contacts matching some constraints on their terms.
3.2.1 Using disjoint alphabets
Given a set of contacts, such as the one in Example 1, our procedure to create TGCSA starts by creating an ordered list of the contacts, so that they are sorted by their first term, then (if they have the same first term) by the second term, and so on. After that, these sorted contacts are regarded as a sequence with elements (), and a suffix array is built over it. This is depicted in Figure 6.
If were made up of text, and (or a CSA built on ) would be enough to perform searches for any word or text substring . In such case, if we looked for the occurrences of symbol (i.e ), would indicate that there are occurrences of symbol . They occur at , , and respectively. However, in our scenario, when we search for symbol (i.e. ) we have to be able to distinguish among the source vertex , the target vertex , the starting time instant and the ending time instant . This would require accessing all the entries , , and checking the positions in they are pointing to. In practice, if then points to a source vertex; otherwise, if then it points to a target vertex, and so on. However, this procedure would ruin the search time that would now become , where is the number of occurrences of the query pattern in .
A simple workaround to the problem above consists in using disjoint alphabets for the four terms in a contact. In our case, we use four alphabets (one each for source vertexes, target vertexes, starting time instants, and ending time instants) such that every symbol of one alphabet is lexicographically smaller than every symbol of the next. Note that we can always replace vertexes and time instants in the original set of contacts by new ids satisfying this property. For example, in Figure 7, we have created a new sequence where: (i) the ids of the source vertexes have been kept as they were initially; (ii) the ids of the target vertexes have been increased by a fixed offset; (iii) the ids of the starting time instants have been increased by a larger offset; and (iv) the ids of the ending time instants have been increased by a still larger offset. Now, when we build the suffix array for the new sequence, we search for the id of a term plus the offset of its type, depending on whether we want to find the occurrences of the term that corresponds to a source vertex, target vertex, starting time, or ending time, respectively. For example, we can see in the figure that when we are searching for the starting time used as an example there, we can simply add the corresponding offset to its id and use the suffix array (or the CSA) to look for the resulting symbol, obtaining its two occurrences. Likewise, to search for a target vertex we would add the target-vertex offset to its id and find that a single entry points to its unique occurrence. In any case, we retain the original search time as expected.
An interesting by-product that arises from the use of disjoint alphabets is that, since values from are smaller than those from (), the first quarter of entries in () will point to the first terms of all the contacts (), the next entries in () to the second terms (), and so on. Consequently, the first quarter of entries of () will point to a position in the range , because in the indexed sequence each symbol is followed by a symbol , and so on. In this way, each entry in the last quarter of will point to a position in the range , corresponding to the first quarter of entries in .
In our example, recall we have contacts. We can see that the entries in the four quarters of discussed above match that: ; ; ; and . In addition, in Figure 7, we have also included the structure that arises when we build the corresponding CSA. In this case, we can also verify that it holds that: ; ; ; and . This property will be of interest in the following section.
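The quarter property discussed above can be checked with a small Python sketch. The five contacts below are hypothetical (they are not claimed to be those of Example 1), the offsets are one possible way of making the four alphabets disjoint, positions are 0-based, and the function names are illustrative only.

```python
from bisect import bisect_left, bisect_right

def to_disjoint(contacts, n_vertexes, n_instants):
    """Concatenate the sorted contacts, shifting each term type by its own offset."""
    offsets = (0, n_vertexes, 2 * n_vertexes, 2 * n_vertexes + n_instants)
    seq = [term + off for c in sorted(contacts) for term, off in zip(c, offsets)]
    return seq, offsets

def suffix_array(seq):
    """Naive suffix array over a sequence of integers."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])

def occurrences(seq, A, symbol):
    """Positions of `symbol` in seq, found by binary search on the suffix array."""
    firsts = [seq[p] for p in A]          # first symbol of each sorted suffix
    return A[bisect_left(firsts, symbol):bisect_right(firsts, symbol)]

# Hypothetical contacts (source, target, start, end) over 5 vertexes and 8 instants.
contacts = [(1, 3, 1, 8), (1, 4, 5, 7), (2, 1, 1, 6), (4, 2, 7, 8), (4, 5, 1, 3)]
seq, offsets = to_disjoint(contacts, n_vertexes=5, n_instants=8)
A = suffix_array(seq)
n = len(contacts)

# Each quarter of A points only to terms of one type (positions k, k+4, k+8, ...).
quarters = [A[k * n:(k + 1) * n] for k in range(4)]
print(all(p % 4 == k for k, q in enumerate(quarters) for p in q))

# Searching for starting time 1 means searching for symbol 1 + offsets[2].
print(sorted(occurrences(seq, A, 1 + offsets[2])))
```

Because the search symbol already encodes the term type, no post-filtering of the occurrences is needed.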
3.2.2 Modifying $\Psi$ to make it cyclical on the terms of each contact
Recall that in a regular CSA, once we know that the entry in the underlying suffix array points to a position of the source sequence , we can recover the entries from the original sequence as , the next symbol as , the next symbol as , and so on. Therefore, as shown in Section 3.1, by using , , and , we can binary search for any pattern obtaining the range so that points to the positions in where can be found. Then, from those positions on, we could recover the source data of the suffixes that start with . Unfortunately, this mechanism allows us to recover the source data only forward-wise (not backwards), and this is not enough in our scenario because we typically want to search for the contacts that match a given constraint and then we want to retrieve all their terms.
To clarify the issue above, consider, for example, when we look for the contacts whose target vertex is (), then we obtain its unique occurrence at the position (). Consequently, to retrieve the terms of that contact , we would compute: ; ; . However, would not recover the first term of the current contact, but the first term of the next contact in . As in a regular CSA, to retrieve , we would have to access to know that the target vertex occurs at position , and consequently the source vertex should be retrieved from . Now, because is not actually kept in the CSA, to extract , we have to know the entry in such that . We can use that . (Recall that $A^{-1}[j]$ indicates which position of $A$ points to the $j$-th entry of the indexed sequence.) Finally, by using we have fully recovered the contact we were searching for. To sum up, the previous procedure would make it necessary to use not only $\Psi$, $D$, and $V$, but also $A$ and $A^{-1}$, as explained in Section 3.1. Fortunately, we can modify $\Psi$ in such a way that it allows us to move circularly from one term to the next term within a given contact.
Recall that, due to our disjoint alphabets, if points to the last term of the contact, then would store the position in pointing to the first term of the following contact (), which would be in the range . For TGCSA, we modified these pointers in the last quarter of in such a way that, instead of pointing to the position corresponding to the first term of the following contact, they point to the first term of the same contact; that is, or if . The modified quarter of is depicted as in Figure 7. In this way, starting at any entry in , and following the pointers , , and , all the elements of the current contact can be retrieved, but no entry from any other tuple will be reached. Due to this modification, in the example above, we can recover , and and are no longer needed.
Note that it is no longer possible to traverse the whole CSA by just using $\Psi$, because consecutive applications of $\Psi$ will cyclically obtain the four elements of the corresponding contact. However, this small change in $\Psi$, which makes it cyclical on the terms of each contact, brings additional interesting searching capabilities that we will exploit in Section 3.5.
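A minimal sketch of the cyclical $\Psi$ follows, under stated assumptions: the contacts are hypothetical, positions are 0-based, no dummy contact or terminator is used, and the cyclical pointers are computed directly from a naive suffix array instead of by adjusting the last quarter of a regular $\Psi$. The array `first` stands in for the symbol lookup that a CSA performs through $D$ and $V$.

```python
def build_cyclic_psi(contacts, n_vertexes, n_instants):
    """Return (psi, first, offsets) where psi cycles within each contact."""
    offsets = (0, n_vertexes, 2 * n_vertexes, 2 * n_vertexes + n_instants)
    seq = [term + off for c in sorted(contacts) for term, off in zip(c, offsets)]
    A = sorted(range(len(seq)), key=lambda i: seq[i:])
    inv = [0] * len(seq)
    for rank, pos in enumerate(A):
        inv[pos] = rank
    # Last term of a contact (pos % 4 == 3) points back to the first term of
    # the same contact; every other term points to the next term as usual.
    psi = [inv[p - 3] if p % 4 == 3 else inv[p + 1] for p in A]
    first = [seq[p] for p in A]           # first symbol of each sorted suffix
    return psi, first, offsets

def recover_contact(psi, first, offsets, n, i):
    """Follow psi four times from any entry i and rebuild the whole contact."""
    terms = [0] * 4
    for _ in range(4):
        term_type = i // n                # quarter of the entry gives the term type
        terms[term_type] = first[i] - offsets[term_type]
        i = psi[i]
    return tuple(terms)

# Hypothetical contacts (source, target, start, end) over 5 vertexes and 8 instants.
contacts = [(1, 3, 1, 8), (1, 4, 5, 7), (2, 1, 1, 6), (4, 2, 7, 8), (4, 5, 1, 3)]
psi, first, offsets = build_cyclic_psi(contacts, 5, 8)
n = len(contacts)

entry = first.index(1 + offsets[1])       # unique entry for target vertex 1
print(recover_contact(psi, first, offsets, n, entry))   # (2, 1, 1, 6)
```

Starting from a target-vertex entry, the source vertex is reached on the third application of `psi`, so neither the suffix array nor its inverse is consulted at query time.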
3.3 Detailed construction of TGCSA
Having explained the need for disjoint alphabets and the reason why we use a modified $\Psi$, in this section we explain the actual procedure to build our TGCSA. In Figure 8, we depict all the structures involved in the creation of a TGCSA representing the temporal graph in Example 1.
As indicated above, the first step to build a TGCSA is to create a sequence with the ordered contacts. Hence we obtain, . (Note that the ordering is not relevant because we have a set of contacts. Therefore, we will assume that contacts are sorted by the first term, then by the second one, and so on.)
The second step involves defining a reversible mapping that enables us to use disjoint alphabets. Let us assume we have different vertexes and time instants. It is possible to define a reversible mapping function that maps the terms of any original contact to . To achieve this, we define an array and a set with elements . This mapping defines four ranges of entries in an alphabet for both vertexes and time instants such that . Note that vertex is mapped to either the integer or depending on whether it is the source or target vertex of an edge. Similarly, the time instant is mapped to either or . This allows us to distinguish between starting/ending vertexes/time instants by simply checking the range where their value falls into.
Observe that even though vertex always exists in the temporal graph, either source vertex or target vertex could actually not be used. Similarly, a time instant could not occur either as an initial or as an ending time of a contact, yet we could be interested in retrieving all the edges that were active at that time .
To overcome the existence of holes in the alphabet , a bitmap $B$ is used. We set $B[i]=1$ if the symbol $i$ from occurs in a contact, and $B[i]=0$ otherwise. Therefore, each of the four terms within a contact will correspond to a $1$ in $B$. Then an alphabet of size is created containing the positions in $B$ where a $1$ occurs. (Recall that $rank_1(B,i)$ returns the number of 1s in $B[1,i]$.) For each symbol , a mapID() function assigns an integer to , so that mapID() = if , and mapID if . The reverse mapping function is provided via unmapID. (Recall that $select_1(B,i)$ computes the position of the $i$-th 1 in $B$.)
At this point, a sequence of ids can be created by setting mapID. Indeed, being , respectively, the types of source vertexes, target vertexes, starting time instants, and ending time instants from the original sequence , we can map any source symbol from into by getmap. Similarly, the reverse mapping obtains getunmap.
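The mapping just described can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: vertexes and time instants are numbered from 1, the offsets tuple plays the role of the array of range starts, absent symbols are mapped to 0, and rank/select are emulated with a sorted list of the positions of the 1s instead of a compressed bitmap. The class and attribute names are hypothetical; `getmap` and `getunmap` follow the names used in the text.

```python
from bisect import bisect_right

class TermMapper:
    """Maps (term, type) pairs to dense ids over disjoint alphabets and back."""

    def __init__(self, contacts, n_vertexes, n_instants):
        # Start of the range reserved for each term type:
        # 0 = source, 1 = target, 2 = starting time, 3 = ending time.
        self.offsets = (0, n_vertexes, 2 * n_vertexes, 2 * n_vertexes + n_instants)
        size = 2 * (n_vertexes + n_instants)
        self.B = [0] * (size + 1)          # 1-based bitmap, index 0 unused
        for contact in contacts:
            for term_type, term in enumerate(contact):
                self.B[term + self.offsets[term_type]] = 1
        self.ones = [p for p, bit in enumerate(self.B) if bit]

    def rank1(self, i):
        """Number of 1s in B[1..i]."""
        return bisect_right(self.ones, i)

    def select1(self, k):
        """Position of the k-th 1 in B (k >= 1)."""
        return self.ones[k - 1]

    def getmap(self, term, term_type):
        pos = term + self.offsets[term_type]
        return self.rank1(pos) if self.B[pos] else 0   # 0: symbol never used

    def getunmap(self, ident, term_type):
        return self.select1(ident) - self.offsets[term_type]

# Hypothetical contacts (source, target, start, end) over 5 vertexes and 8 instants.
contacts = [(1, 3, 1, 8), (1, 4, 5, 7), (2, 1, 1, 6), (4, 2, 7, 8), (4, 5, 1, 3)]
mapper = TermMapper(contacts, n_vertexes=5, n_instants=8)
print(mapper.getmap(4, 0), mapper.getmap(3, 0), mapper.getmap(1, 2))   # 3 0 9
print(mapper.getunmap(9, 2))                                            # 1
```

Because rank is monotone, the dense ids preserve the order of the four ranges, which is what the binary searches of the following sections rely on.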
Once we have made up our indexable sequence , an iCSA is built over it. (We actually added four integers that make up a dummy contact at the beginning of the sequence. This is required to avoid limit-checks at query time.) Then, as discussed in Section 3.2.2, we modified the array $\Psi$ in our TGCSA to allow moving circularly from one term to the next one within the same contact. To do this, we simply have to modify the last quarter of the regular array $\Psi$ so that each of its entries points to the first term of its own contact instead of the first term of the following contact. This small change brings an interesting property that allows us to perform a query for any term of a contact in the same way. We use the iCSA to binary search for a term of a contact(s), obtaining a range , and then by circularly applying $\Psi$ up to three times, we can retrieve the other terms of the contact(s).
To sum up, TGCSA consists of a bitmap $B$, and the structures $D$ and $\Psi$ of the iCSA. In practice, $B$ is compressed using the strategy of Raman et al. [37], which supports both $rank$ and $select$ in $O(1)$ time and requires $nH_0 + o(n)$ bits for a bitmap of length $n$, and for $D$ we used a faster bitmap representation [14]. For the representation of $\Psi$ we also used the best option from [14], which samples $\Psi$ at regular intervals and then differentially encodes the remaining values. Yet, we also created an alternative representation for $\Psi$ that is discussed in Section 3.4.
3.4 A more suitable representation of for temporal graphs: strategy
The regular representation of $\Psi$ is based on sampling the array at regular intervals (one sample every entries) and then differentially encoding the remaining values between two samples. In [14], they studied different alternative encodings for the non-sampled values, and showed that the best space/time trade-off in a text-indexing scenario was reported by coupling run-length encoding of 1-runs (sequences of values) with bit-oriented Huffman ( approach). In practice, they used Huffman codes to indicate the presence of -runs of length . They also reserved Huffman codes to represent short gaps (where is a parameter typically set to ). Finally, being the machine word size, additional Huffman codes are used as escape codes to mark the number of bits needed to either represent a large positive gap () or a negative gap (). In both cases, such an escape code is followed by represented with bits.
In this paper, we present a new strategy to represent , that we called , where we try to speed up the access performance at the cost of using a little more space. An example of the structure for the resulting representation is shown in Figure 9. We also use sampling and differentially encode non-sampled values. Yet, we made some changes with respect to the traditional representations (i.e., ), which are summarized as follows:
1. We used vbyte (byte-aligned) codes [48] rather than bit-oriented Huffman codes to differentially encode non-sampled values. This should result in around one order of magnitude improvement in decoding speed when sequential values of $\Psi$ are to be retrieved. Note that in the bottom part of Figure 9, we include a sequence of byte-oriented codewords (either 1 or 2-byte codewords in our example) that are used to represent the gaps from the original structure. It can also contain a pair of codewords for the pair to encode a 1-run of length . Of course, using byte-aligned rather than bit-oriented codes will imply a loss in compression effectiveness.
2. We do not sample at regular intervals. Instead, we keep samples aligned with the ones in bitmap $D$, that is, there is a sample at the beginning of the interval in $\Psi$ corresponding to each symbol . This modification brings three main advantages:
(i) We ensure that is always sampled, whereas with the traditional representation of $\Psi$ the previous sampled position could be in the range . Therefore, was sampled with probability . Note that, in TGCSA, a typical access pattern to $\Psi$ during searches (see Section 3.5) consists in traversing all the values once we know the interval corresponding to a given symbol . This requires decoding gaps from the previous sample to in to obtain synchronization at value , and sequentially decoding gaps from there on. Since is always sampled in , we avoid that synchronization cost.
(ii) While in the traditional representation of $\Psi$ the differential sequence () could contain up to negative values (when belongs to a symbol and to symbol ) [17], the representation does not deal with negative values because is always a sampled position.
(iii) We do not break 1-runs. Recall that 1-runs could occur mainly within the range corresponding to a given symbol . Because our first-level sampling stores only a sample at position , 1-runs are no longer split. This is interesting for both space and access time because a unique codeword can be used to represent a large 1-run sequence. In our example, we can see that the codewords in $\mathsf{vbytegaps}[10,11]$ represent the 1-run of length within . That is, we do not break the 1-run every values.
In Figure 9, we can see that samples consist of a triple of values that are aligned with the ones in : indicates the absolute value, is a pointer to sequence, and indicates the index of the sampled position. In practice, these values are set in three arrays , , and , respectively, such that if is sampled, we set , , and .
Note that the absolute values are kept explicitly in and are not represented within the sequence (exactly as in ). For example, is stored at the first entry of , and the first codeword in represents value , which corresponds to the gap . Hence, no codeword in is associated with the sampled value . Note also that is the position in that we have to access to recover values . In our example, we can see that can be recovered by accessing the previous sampled value , then accessing sequence at position to obtain the gap by . Finally, we recover . As an important remark, observe that given a symbol , we will use to obtain the starting sampled position for the range . We could skip storing array as we can compute . This introduces a space/time trade-off that we discuss in the next section.
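The first-level sampling described above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: 1-runs are not run-length encoded, the second-level sampling is omitted, the vbyte variant shown (high bit set on the last byte) is one common convention and is an assumption here, and the names `samples` and `ptrs` are hypothetical stand-ins for the arrays of Figure 9. The input `psi` and `D` are those of the text "abracadabra" with a `$` terminator and 0-based positions.

```python
def vbyte_encode(x):
    """Encode a non-negative integer in 7-bit groups; the last byte has its high bit set."""
    out = bytearray()
    while x >= 128:
        out.append(x & 0x7F)
        x >>= 7
    out.append(x | 0x80)
    return bytes(out)

def vbyte_decode(buf, pos):
    """Decode one integer starting at buf[pos]; return (value, next position)."""
    value = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        if b & 0x80:
            return value | ((b & 0x7F) << shift), pos
        value |= b << shift
        shift += 7

def compress_psi(psi, D):
    """Keep one absolute sample per symbol (aligned with the 1s in D) plus vbyte gaps."""
    samples, ptrs, stream = [], [], bytearray()
    for i, value in enumerate(psi):
        if D[i]:
            samples.append(value)          # absolute value, not stored in the stream
            ptrs.append(len(stream))       # where the gaps of this symbol start
        else:
            stream += vbyte_encode(value - psi[i - 1])   # positive within a symbol
    return samples, ptrs, bytes(stream)

def decode_symbol_range(samples, ptrs, vbytegaps, k, length):
    """Sequentially decode the `length` values of psi for the k-th symbol (0-based)."""
    values = [samples[k]]
    pos = ptrs[k]
    for _ in range(length - 1):
        gap, pos = vbyte_decode(vbytegaps, pos)
        values.append(values[-1] + gap)
    return values

psi = [3, 0, 6, 7, 8, 9, 10, 11, 5, 2, 1, 4]
D = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0]
samples, ptrs, vbytegaps = compress_psi(psi, D)
print(decode_symbol_range(samples, ptrs, vbytegaps, 1, 5))   # [0, 6, 7, 8, 9]
```

Since every symbol range starts at a sample, decoding a whole range needs no synchronization step and never meets a negative gap.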
Despite the advantages of the sampling structures described above, our representation has also a main drawback: we cannot parameterize the number of samples we want to use. Thus, we can be using a rather too dense sampling for infrequent symbols (consequently, we expect that compression will suffer in datasets with very large vocabularies ()), or we can be using a very sparse sampling for frequent symbols , as they will have only one sample at the beginning of the corresponding interval . This fact could slow down the access to an individual position , with . To overcome this, we added a second-level sampling where we sample the positions ( is again the sampling interval). We use a bitmap (see Figure 9) to mark the positions of these samples in , and, aligned with the ones in , arrays , , and keep the sampling data ( is the number of ones in ). This second-level sampling works exactly like the first-level one with the exception that sampled values are also retained in the sequence. This redundant data is kept to allow us to sequentially decode the whole values belonging to a given symbol without the need to access the second-level sampling data. This is of interest when we want to retrieve a range of consecutive values from instead of simply recovering an individual value.
3.4.1 Comparing the Space/time trade-off of with .
We run experiments to compare the space/time trade-off obtained by against and (the latter is the variant of where arrays and are not stored). We tuned these representations using four different sampling values for . In particular, we used values (from sparser to denser sampling, respectively). In addition, we include in the comparison a non-compressed baseline representation for (we refer to it as ) that represents each entry of with bits and provides direct access to any position.
In Figures 10 and 11, we compare the space (shown as the number of bits needed to represent each entry in ) and time (in per entry reported) required to access all the values in for three different scenarios. In the plots labeled by [B1] and [B2], we assume that the ranges for all the symbols are known and we perform a buffered access to retrieve the values for all these symbols. In scenario [B2], we only retrieve those values for symbols occurring at least times (hence ). In these buffered scenarios, synchronization is done once to obtain (except in that has direct access and does not require synchronization at all) and from there on, we apply sequential decoding of subsequent values. In the last scenario (plot labeled [S1]), we show the cost of accessing at individual positions (hence synchronization, for the compressed variants, is required for each access to ). We access sequentially all the positions in , .
We have run tests for all the datasets in Table 2 (described in Section 4) and show results here for four datasets: I.Comm.Net, Powerlaw, Flickr-Data, and Wikipedia-Links. We do not show plots for the ba* datasets because their curves have nearly the same shapes as those for I.Comm.Net (yet with a slightly different x-axis).
We can see that the cost of the synchronization required by and the slower decoding of bit-Huffman in comparison with vbyte make more than 5 times slower than when decoding all the entries of corresponding to a given symbol . In Section 3.5, we will see that this particular operation appears in most TGCSA query algorithms (a for loop after a binary search that returns the range of values for a given symbol). The shortcoming of this speed up at recovering values is that the overall size of increases by around -%. As we expected, it can be seen that in the Flickr-Data dataset, due to the large vocabulary size of this dataset in comparison with the number of contacts, the representation becomes unsuccessful because a plain representation of would even be smaller. We also include results for the counterpart. In this case, we do not explicitly store arrays and , and we require operations to know the position in corresponding to the -th sample. In general, when the number of synchronization operations is small (this occurs when is small), offers an interesting space/time trade-off. In particular, we can see that it typically yields the same performance of baseline representation while requiring -% less space.
Unfortunately, not all the accesses to $\Psi$ performed at query time will follow a sequential pattern in TGCSA. In that case, the previous buffered retrieval of values is not applicable, and we need to perform many random accesses to positions within $\Psi$. Accessing random positions implies that each access to must initially check if is a sampled position. This is accomplished by checking if in or if in (where the access operation returns the value of the bit at a given position of a bitmap). In that case or , respectively. Yet, in we could still have a sampled value if and we would obtain the sampled value by .
In Figure 11, we can see that when we access individual positions of $\Psi$, and its two-level sampling approach is still able to improve the access time of . In general, using (very dense setup) obtains values similar to those of with (a relatively sparse setup). Yet, in we still have room to decrease access time at the cost of using a denser tuning. As expected, in this scenario, becomes unsuccessful, and is unbeatable due to its direct access capabilities.
3.5 Performing queries in TGCSA
We can take advantage of the iCSA capabilities at search time to solve all the typical queries in a temporal graph regarding direct and reverse vertexes from contacts that are active at a given time instant (direct-neighbor and reverse-neighbor queries, respectively). Basically, we binary search the range of entries for the given source or target vertex, and for each position in that range, we apply $\Psi$ circularly up to the third or fourth ranges, where we can check whether or not the starting-time and ending-time constraints hold. In Figure 12, we include the pseudocode of the algorithms to answer both queries. Note that they are almost identical with the difference that, in the former, the search begins in the range corresponding to the source vertex, whereas in the latter the starting range corresponds to the target vertex being searched for.
Note that the accesses to $\Psi$ in the for loop in line 8 traverse consecutive positions (or for reverse neighbors). Recall that we do not have direct access to all the values of $\Psi$, but only to sampled positions, and the remaining values require accessing the previous sample (to gain synchronization on either the Huffman-compressed or vbyte-compressed stream of gaps) and sequentially decoding gaps from there on up to the desired position (see Section 3.4 for more details). Therefore, although it is not stated in the pseudocode, we have boosted the access to consecutive positions in $\Psi$ (i.e. ) by implementing a buffered access method to $\Psi$. By using this buffered access method to recover , we only access the sample before position , then we synchronize at value (recall that, in , this value is always sampled and no synchronization costs are involved), and from there on, we sequentially decompress the remaining values in . The other accesses to $\Psi$ (i.e., and in ) are completely random and there is no room for optimization there. We will also apply this buffered access to $\Psi$ in the loops of the following algorithms.
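The neighbor queries described above can be sketched in Python over a toy index. The sketch is not the pseudocode of Figure 12: it assumes that a contact is active at $t$ when its starting time is at most $t$ and its ending time is greater than $t$, it uses hypothetical contacts and function names, it keeps $\Psi$ uncompressed (so no buffered access is needed), and the array `first` stands in for the symbol lookup through $D$ and the alphabet mapping.

```python
from bisect import bisect_left, bisect_right

def build_toy_tgcsa(contacts, n_vertexes, n_instants):
    """Return (psi, first, off): cyclical psi, first symbols, per-type offsets."""
    off = (0, n_vertexes, 2 * n_vertexes, 2 * n_vertexes + n_instants)
    seq = [t + o for c in sorted(contacts) for t, o in zip(c, off)]
    A = sorted(range(len(seq)), key=lambda i: seq[i:])
    inv = [0] * len(seq)
    for rank, pos in enumerate(A):
        inv[pos] = rank
    psi = [inv[p - 3] if p % 4 == 3 else inv[p + 1] for p in A]
    first = [seq[p] for p in A]
    return psi, first, off

def time_limits(first, off, t):
    """Entries below r_ts start at or before t; entries from r_te on end after t."""
    return bisect_right(first, t + off[2]), bisect_right(first, t + off[3])

def direct_neighbors(index, u, t):
    psi, first, off = index
    lo, hi = bisect_left(first, u + off[0]), bisect_right(first, u + off[0])
    r_ts, r_te = time_limits(first, off, t)
    out = []
    for i in range(lo, hi):               # consecutive entries of psi
        x = psi[i]                        # target-vertex entry
        y = psi[x]                        # starting-time entry
        z = psi[y]                        # ending-time entry
        if y < r_ts and z >= r_te:
            out.append(first[x] - off[1])
    return out

def reverse_neighbors(index, v, t):
    psi, first, off = index
    lo, hi = bisect_left(first, v + off[1]), bisect_right(first, v + off[1])
    r_ts, r_te = time_limits(first, off, t)
    out = []
    for i in range(lo, hi):
        y = psi[i]                        # starting-time entry
        z = psi[y]                        # ending-time entry
        if y < r_ts and z >= r_te:
            out.append(first[psi[z]] - off[0])   # psi[z] is the source entry
    return out

# Hypothetical contacts (source, target, start, end) over 5 vertexes and 8 instants.
contacts = [(1, 3, 1, 8), (1, 4, 5, 7), (2, 1, 1, 6), (4, 2, 7, 8), (4, 5, 1, 3)]
index = build_toy_tgcsa(contacts, 5, 8)
print(direct_neighbors(index, 1, 5))      # [3, 4]
print(reverse_neighbors(index, 1, 5))     # [2]
```

The temporal check compares positions in the third and fourth quarters against two precomputed limits, so no time instant has to be decoded inside the loop.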
When comparing queries, is expected to be faster than because we can binary search for a phrase rather than by a unique vertex , hence returning a much shorter initial range. The pseudocode for solving the operation at a given time instant is included in Figure 13.
To solve queries given a time instant , which return the set of active contacts such that , we can binary search the starting and ending-time intervals: CSA_binSearchgetmap and CSA_binSearchgetmap. All the contacts pointed by hold and those in hold . Therefore, , if , we recover the source and target vertexes by and , respectively. The original values are obtained via getunmap(). Figure 14 includes the pseudocode to solve queries.
Queries regarding activation/deactivation events at a given time instant in the graph can be solved very efficiently. A unique binary search allows TGCSA to find all the contacts that have an event at time . In the case of the operation, the binary search looks for the range corresponding to contacts () where , whereas for the operation we obtain an interval corresponding to those contacts where . From these intervals, we apply circularly (twice or three times, respectively) up to reaching the values and corresponding to the source and target vertex of these contacts. In Figure 15, we include the pseudocode for the operation. Note that the operation would be similar but the loop would traverse positions with in line 5.
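The snapshot and activation/deactivation queries can be sketched in the same spirit. Again this is an illustration under assumptions rather than the pseudocode of Figures 14 and 15: contacts are taken as active from their starting time up to, but excluding, their ending time, no dummy contact is added, $\Psi$ is uncompressed, and all names and contacts are hypothetical.

```python
from bisect import bisect_left, bisect_right

def build_toy_tgcsa(contacts, n_vertexes, n_instants):
    """Return (psi, first, off, n) for a toy index with a cyclical psi."""
    off = (0, n_vertexes, 2 * n_vertexes, 2 * n_vertexes + n_instants)
    seq = [t + o for c in sorted(contacts) for t, o in zip(c, off)]
    A = sorted(range(len(seq)), key=lambda i: seq[i:])
    inv = [0] * len(seq)
    for rank, pos in enumerate(A):
        inv[pos] = rank
    psi = [inv[p - 3] if p % 4 == 3 else inv[p + 1] for p in A]
    first = [seq[p] for p in A]
    return psi, first, off, len(contacts)

def edge_from_end_entry(index, z):
    """From an ending-time entry, psi reaches the source and then the target."""
    psi, first, off, _ = index
    u = psi[z]
    v = psi[u]
    return first[u] - off[0], first[v] - off[1]

def snapshot(index, t):
    psi, first, off, n = index
    r_ts = bisect_right(first, t + off[2])     # starting-time entries with ts <= t
    r_te = bisect_right(first, t + off[3])     # ending-time entries with te > t
    return [edge_from_end_entry(index, psi[y])
            for y in range(2 * n, r_ts) if psi[y] >= r_te]

def activated(index, t):
    psi, first, off, _ = index
    s = t + off[2]
    return [edge_from_end_entry(index, psi[y])
            for y in range(bisect_left(first, s), bisect_right(first, s))]

def deactivated(index, t):
    _, first, off, _ = index
    s = t + off[3]
    return [edge_from_end_entry(index, z)
            for z in range(bisect_left(first, s), bisect_right(first, s))]

# Hypothetical contacts (source, target, start, end) over 5 vertexes and 8 instants.
contacts = [(1, 3, 1, 8), (1, 4, 5, 7), (2, 1, 1, 6), (4, 2, 7, 8), (4, 5, 1, 3)]
index = build_toy_tgcsa(contacts, 5, 8)
print(sorted(snapshot(index, 5)))      # [(1, 3), (1, 4), (2, 1)]
print(sorted(activated(index, 1)))     # [(1, 3), (2, 1), (4, 5)]
print(sorted(deactivated(index, 8)))   # [(1, 3), (4, 2)]
```

Only the snapshot has to scan a range whose length is unrelated to the output size, which matches the exception noted for that operation in the cost analysis.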
Taking a look at the pseudocodes presented for TGCSA query operations, we can see that we are using the following operations during searches: (i) getmap and getunmap calls that imply performing rank and select over and can be solved in time. (ii) A call to CSA_binSearch() that requires time and returns a range containing the occurrences of a pattern . Up to two additional calls to CSA_binSearch() could be needed depending on the query, also requiring time. (iii) A loop traversing the entries in that involves only operations, typically getunmap and accesses to . The exception is the operation that traverses always entries. To sum up, the temporal queries in TGCSA can be solved in time .
Dealing with interval queries
As indicated in Section 2, we have shown how TGCSA handles , , , , , and queries at a given time instant . Yet, these operations could be easily extended in TGCSA to time intervals. In queries that refer to checking the connectivity between vertexes (the first three ones), one would be interested in contacts occurring not only at a given time instant , but during a whole time interval ; that is, (this is called strong semantics for intervals in the literature). A different option (referred to as weak semantics) consists in reporting those contacts occurring at least at some point of ; that is, such that it holds . Note that for queries retrieving the changes on connectivity ( and ), it makes no sense to distinguish between weak and strong semantics, and we would be interested in simply checking if the connectivity changed at some point of the interval .
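The difference between the two semantics can be made concrete with a small sketch. The exact inequalities are an assumption here: a contact is taken as active from its starting time up to, but excluding, its ending time, and the query interval `[t1, t2]` is taken as closed over integer time instants. The function names and the sample contact are hypothetical.

```python
def active_at(contact, t):
    """A contact (u, v, ts, te) is assumed active at every instant t with ts <= t < te."""
    _, _, ts, te = contact
    return ts <= t < te

def matches_strong(contact, t1, t2):
    """Strong semantics: active during the whole query interval [t1, t2]."""
    _, _, ts, te = contact
    return ts <= t1 and t2 < te

def matches_weak(contact, t1, t2):
    """Weak semantics: active at some instant of the query interval [t1, t2]."""
    _, _, ts, te = contact
    return ts <= t2 and t1 < te

contact = (1, 4, 5, 7)                   # hypothetical contact, active at instants 5 and 6
print(matches_strong(contact, 4, 6))     # False: not yet active at instant 4
print(matches_weak(contact, 4, 6))       # True: active at instants 5 and 6
print(matches_strong(contact, 5, 6))     # True
```

Both predicates reduce to two comparisons on the starting and ending times, which is why the interval variants only change the limits computed before the main loop of each query.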
If we focus on queries constrained to an interval under strong semantics, to solve queries, we should only adapt the temporal constraint so that contacts match . Yet, in this case, and must be the right hand of the ranges and corresponding to and , respectively. Therefore, we should modify line 4 in the pseudocode of Figure 12 to set getmap and getmap; instead of getmap and getmap. Algorithms (in Figure 12) and (in Figure 13) could be adapted by simply modifying their line 4 in the same way.
Although not considered in previous works, we could also think of defining a operation to recover the contacts that were active during the interval . Under strong semantics, this interval-wise could be defined such that it would retrieve the contacts that were activated before and deactivated after . Therefore, we could see this operation as the union of the results of at a given time , . This case would only require modifying line 2 from Figure 14, to again set getmap.
For queries at time interval (see Figure 15), we would have to replace lines by the following: First, we map both and values to the ending times and ; that is, getmap and getmap. Then, we binary search for the corresponding intervals in TGCSA: CSA_binSearch() and CSA_binSearch(). And finally, all the ending time instants between and correspond to contacts deactivated within . Therefore, we have to traverse the entries in that range, that is, we would iterate (line 4) for to . A similar adaptation is possible for queries.
We can also deal with weak semantics in TGCSA. As an example, we show how to adapt queries to this scenario. The rest of the operations can be adapted similarly. Now, a query for a given vertex constrained to an interval must retrieve any vertex from a contact that was active at some time instant within . Therefore, these contacts must match the time constraint . Focusing on Figure 12, because we need to compare the starting time instant of the contacts () with , and their ending time instant () with , we would have to replace line to set getmap and getmap. Finally, the statements in lines in the for-loop must be changed to modify the temporal condition. In practice, we replace them by:
[TABLE]
3.6 Strengths and weaknesses of TGCSA
The strong expressive power of TGCSA is probably its main advantage with respect to other state-of-the-art representations such as EdgeLog and CET ([12, 8]). Recall TGCSA can really represent any set of contacts, including contacts of a given edge that temporally overlap.
Another important property is that it can answer queries over any term of a contact in the same way; that is, searching for all the contacts of a source node is performed exactly with the same algorithm as searching for all the contacts starting in a specific time instant : first a binary search is performed over one of the four sectors of the array , depending on the term of the contact that is searched for (i.e., bounded in the query), to locate the area devoted to that value, and then, for each of the entries in that area, is applied three times to recover the other components of each contact. The overall search time is , where is the length of the range reported by the initial binary search (with the exception of the operation). Although other data structures are more efficient for some types of queries, TGCSA has a more regular behavior over all types of queries. Table 1 compares the cost of the query operations in TGCSA with those of the most representative state-of-the-art counterparts: CET and EdgeLog. Furthermore, for graphs whose contacts last for only one time instant (Point-contact Temporal Graphs), the behavior of TGCSA improves because the suffix array only has three sections and has only to be applied twice to recover each contact.
Observe that within the section devoted to any symbol, in each of the four quarters of , all the pointers are always growing, which is a property that allows good compression. However, this property is also the main drawback of this representation. When there are few occurrences of the symbols in the vocabulary; that is, when the vocabulary is huge and there are few occurrences of each symbol, will not be very compressible. As shown in the experimental results, the compression in some synthetic collections is poor when the relative number of contacts per time instant is low or when the number of edges per node is low. In these cases, the increasing areas of are small. Therefore, the differences between pointer values are rather big, and consequently, not very compressible.
4 Experimental results
We ran several experiments with real and synthetic temporal graphs. Table 2 gives the main characteristics of these graphs including: the name of each dataset, the numbers of their vertexes, edges, and contacts, and the length of the graphs’ lifetime. In addition, we show the numbers of contacts per vertex, edges per vertex, and contacts per edge, respectively. Finally, we show the space of a plain representation of the original datasets (in MiB) assuming that each contact was represented with four 32-bit integers (), or with bits ().
The dataset I.Comm.Net is a synthetic dataset where short communications between random vertexes are simulated. The dataset Powerlaw is also synthetic; it simulates a power-law degree graph, where few vertexes have many more connections than the other vertexes (following a power-law distribution), but with a short lifetime. Flickr-Data is a real dataset that consists of an incremental temporal graph that indicates the time instant in which two people became friends in the Flickr social network, with a temporal granularity given in seconds, and a lifetime that starts with the creation of Flickr and ends in April 2008. The dataset Wikipedia-Links contains the history of links between articles from the English version of the Wikipedia, with a time granularity also given in seconds. This dataset corresponds to a history dump of the Wikipedia (available at http://dumps.wikimedia.org/enwiki/) downloaded on 2014-03-04. Other synthetic datasets were built by first setting a given degree distribution on the aggregated graph, and then assigning a number of contacts to each edge that follows a given distribution. The time interval of each edge was selected uniformly over the lifetime. We used the Barabási-Albert model [1] (see datasets ba* below) to generate a power-law degree distribution. Then we used a uniform () and a Pareto () distribution to assign the number of contacts per edge. Pareto distributions were generated with , whereas for the uniform distributions, we created graphs with , and contacts per edge.
Even though TGCSA allows us to deal with datasets where contacts could have overlapping times, in order to allow the comparison with EdgeLog and CET, the datasets above have contacts with no time overlapping. Yet, these datasets still allow us to show the behavior of TGCSA.
Our tests were run on a machine with two Intel(R) Xeon(R) E5620 CPUs @ 2.40 GHz. Together they provide eight cores (sixteen hardware threads), yet our experiments ran on a single core. The system has 64 GB of DDR3 RAM @ 1066 MHz. The operating system was Ubuntu 12.04 (Linux kernel version 3.2.0-79-generic), and the compiler used was gcc 4.6.3 (option -O3). Time measures refer to CPU user time.
In the following sections, we include experiments to compare both the space and time performance of CET, EdgeLog, and TGCSA. In particular, we compare the time performance for the following queries: , , , , and at a given time instant.
For EdgeLog and CET we used the same source code as in [8]. Therefore, EdgeLog uses an implementation in C of from the PolyIRTK project (available at http://code.google.com/p/poly-ir-toolkit/), and the best space was obtained by tuning block-size to (rather than the usual value). In addition, when the number of elements to compress is smaller than the block size, is replaced either by the word-wise coding [51], when , or by [49] when (both are also available in the PolyIRTK project).
The Interleaved Wavelet Tree in CET is implemented as a Wavelet Matrix [11], which keeps a good space/time trade-off for sequences with large alphabets. Compressed bitmaps [37, 10] included in CET can be found in the Compact Data Structures Library (libcds, available at https://github.com/fclaude/libcds).
The implementation of TGCSA is an adaptation of the implementation of iCSA [14] (available at http://vios.dc.fi.udc.es/indexing/wsi). The bitmap representation used by is exactly the same as in iCSA, whereas bitmap uses the same libcds implementation of Raman et al. [37] as in CET. In addition, TGCSA uses strategy to represent . We will show results including three different configurations by setting the sampling parameter on to values . Note that ( in advance) corresponds to the densest sampling and to the sparsest one. We have also included results for TGCSA-VB, the variant of TGCSA that uses the strategy to represent . Again, we set for the second-level sampling in TGCSA-VB.
A further detail is related to the Flickr-Data dataset. In this case, the ending time of all the contacts is set to the same value (the last time instant in the timeline). Therefore, we could avoid representing this value explicitly. We have adapted TGCSA, and also used adapted versions of CET and EdgeLog [8], in order to index only the first three elements of the contacts. This reduces (rather slightly) the size of the resulting structures, and also improves their overall performance. We will include both the regular TGCSA and the TGCSA built over 3-element contacts (TGCSA-3R) when showing time performance on the Flickr-Data dataset.
4.1 Space comparison
Table 3 shows the comparison of TGCSA and TGCSA-VB against CET, EdgeLog, and a plain baseline representation using bits. Finally, we also include gzip in that table (run over the source plain-text-wise datasets) because this will allow us to compare the compressibility obtained by iCSA in our datasets with that originally obtained when dealing with text [14]. Note that for the Flickr-Data dataset we include two rows. The first one refers to the space obtained by the structures when we assume contacts containing only three elements, hence excluding the final time instant (the plain baseline uses only = ). In the case of TGCSA, this corresponds to the variant TGCSA-3R. The space needs are shown as the number of bits needed to represent each contact ().
Even though an iCSA-based self-index built on English text typically reached the compression of gzip [14], the compressibility of temporal graphs is not as good. Actually, the large number of -runs that appear in when dealing with text is now much smaller in the TGCSA, and we are not able to reach the compression levels of gzip in most cases. As expected, taking into account the experiments regarding the representation of that we showed in Section 3.4, we typically obtain that TGCSA-VB requires around -% more space than TGCSA. With the Flickr-Data dataset, the space usage of TGCSA-VB is huge due to the non-parameterizable first-level sampling and the large vocabulary in that dataset.
Focusing on EdgeLog, we see that it is also unsuccessful when the number of contacts per edge is very small. However, when there are few edges and the number of contacts per edge grows, it becomes very interesting because its inverted lists become highly compressible. TGCSA shows a more stable behavior, with reasonable space needs in most cases. It does not require as much space as EdgeLog when the number of contacts per edge is small, but it cannot cope with many contacts per edge because is irregular, as discussed above.
With respect to CET, we can see that it always obtains a more compact representation than TGCSA, and becomes the best overall alternative if one aims at a small space cost (with the exception of the ba100k10u1000 and ba1M10u50 datasets). Yet, in the following sections we will show that TGCSA typically performs faster.
4.2 Time comparison: Direct and Reverse neighbors operations
This section presents the evaluation of the time performance to retrieve the set of direct and reverse neighbors that were active at a given time instant. To evaluate these operations, we generated queries by randomly choosing contacts from each graph dataset. For each selected contact , we took the pairs and () to create the query patterns to use for and , respectively. The time performance is measured in per contact reported and the space usage in bits per contact (as in Table 3).
Figures 16 and 17 show the results. Despite the fact that TGCSA always uses more space than CET to represent our temporal graphs, we can see that both techniques have similar performance at solving queries when the number of contacts per vertex is small. The only exception is the synthetic dataset ba100k10u1000 where there are direct neighbors for each vertex, which forces TGCSA to sequentially check a lot of probably unsuccessful direct neighbors. We can see that in the Powerlaw and Flickr-Data datasets, TGCSA clearly outperforms CET. TGCSA-VB is typically faster (around 3-5 times) than TGCSA when using the densest sampling setup. Yet, assuming that we could tune TGCSA-VB and TGCSA to use similar space, TGCSA-VB would always be slower than TGCSA because it would use a very sparse sampling.
Finally, in the plot corresponding to the Flickr-Data dataset, we show the gain in both space and time that TGCSA-3R obtains with respect to TGCSA. As shown, for incremental graphs it pays off not to represent the fourth component (ending time) of the contacts explicitly. When comparing TGCSA-3R with EdgeLog, results show that solving queries is indeed one of the main strengths of EdgeLog, because EdgeLog only needs to traverse the corresponding adjacency list.
With respect to queries, we can see similar results as for queries when comparing CET with TGCSA. Yet, now we can see that TGCSA (and TGCSA-VB) are clearly faster to solve reverse- instead of direct-neighbors operations, whereas the results of CET are very similar for both types of operations.
It is easy to understand why TGCSA is faster at reverse-neighbors queries than at direct-neighbors operations. Note that the time instants are the third and fourth elements of the contacts, and the source vertex and target vertex are, respectively, the first and second elements. Therefore, in the case of direct-neighbors operations TGCSA must traverse a range of source vertexes and it has to apply $\Psi^2$ and $\Psi^3$, respectively, to reach the starting and ending time instants (in order to either accept or discard the contact due to the time constraints). In the case of reverse-neighbors operations, the traversal starts in the range of the target vertexes, and we save one application of $\Psi$ to reach the time components of the contact (we apply $\Psi$ and $\Psi^2$, respectively, to reach the starting and ending time instants of the contact). Recall that in these operations, the first application of $\Psi$ to obtain $\Psi(i)$ is performed over a range of consecutive positions $i$, which benefits from the buffered access to $\Psi$. From there on, obtaining $\Psi^2(i)$ or $\Psi^3(i)$ requires, respectively, one or two (slower) additional random accesses to $\Psi$.
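The difference in the number of successor steps can be made concrete with a toy, uncompressed model. This is a minimal sketch, not the TGCSA code: the class `ToyTGCSA` and its methods are hypothetical, the successor array (the CSA function usually written Ψ) is a plain list, a contact is assumed to be active at `t` when `start <= t < end`, and a candidate is discarded as soon as one time constraint fails.

```python
from bisect import bisect_left, bisect_right

class ToyTGCSA:
    """Toy index over distinct (source, target, start, end) tuples."""

    def __init__(self, contacts):
        self.n = len(contacts)
        # Four sectors (one per term); entries ordered by rotated contact.
        entries = sorted((k, c[k:] + c[:k], cid)
                         for cid, c in enumerate(contacts) for k in range(4))
        pos = {(k, cid): i for i, (k, _, cid) in enumerate(entries)}
        self.keys = [rot[0] for _, rot, _ in entries]
        self.psi = [pos[((k + 1) % 4, cid)] for k, _, cid in entries]
        self.psi_calls = 0                      # successor steps performed

    def _next(self, i):
        self.psi_calls += 1
        return self.psi[i]

    def _range(self, term, value):
        lo, hi = term * self.n, (term + 1) * self.n
        return range(bisect_left(self.keys, value, lo, hi),
                     bisect_right(self.keys, value, lo, hi))

    def direct_neighbors(self, u, t):
        out = []
        for i in self._range(0, u):             # sector of source vertexes
            tgt = self._next(i)                 # 1st step: target
            start = self._next(tgt)             # 2nd step: starting time
            if self.keys[start] > t:
                continue
            end = self._next(start)             # 3rd step: ending time
            if t < self.keys[end]:
                out.append(self.keys[tgt])
        return out

    def reverse_neighbors(self, v, t):
        out = []
        for i in self._range(1, v):             # sector of target vertexes
            start = self._next(i)               # 1st step: starting time
            if self.keys[start] > t:
                continue
            end = self._next(start)             # 2nd step: ending time
            if t < self.keys[end]:
                out.append(self.keys[self._next(end)])  # wrap to the source
        return out

contacts = [(1, 2, 0, 5), (1, 3, 2, 4), (2, 3, 2, 6), (4, 3, 7, 9)]
g = ToyTGCSA(contacts)
print(g.direct_neighbors(1, 3), g.psi_calls)
g.psi_calls = 0
print(g.reverse_neighbors(3, 3), g.psi_calls)
```

In this model a rejected candidate costs at least two steps in a direct-neighbors query but can be rejected after a single step in a reverse-neighbors query, because the time components sit closer to the target sector than to the source sector.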
As expected, EdgeLog performance drastically worsens in queries. Yet, the use of the reverse aggregated graph still allows a good performance in most cases. The exception is in the I.Comm.Net graph, where the number of edges per vertex is high. In the other cases, the number of edges per vertex is relatively small (from to ) and the time performance does not degrade in excess.
4.3 Time comparison: Activation and deactivation at a given time instant
This section shows the performance of and queries; that is, retrieving the set of edges that have been either activated or deactivated at a given time instant. For the evaluation, we generated random time instants, uniformly distributed over the lifetime of the corresponding graph. Again, time measures are shown as the average time in per contact reported.
Figures 18 and 19 show the results. We can see that these types of operations are probably the best scenario for TGCSA because they are solved by a single binary search to find the given time instant. For example, in the case of queries at time , the binary search returns an interval corresponding to all the contacts that are deactivated at time . Therefore, for each , we apply circularly to recover the corresponding source vertex () and target vertex (). Similarly, for queries at time instant , we apply circularly from a starting interval within the third part of the suffix array in TGCSA.
Note that the time per contact reported of TGCSA for these operations is much better than for the and operations because now the traversal of the starting range and the application of always recover one contact. For the and operations, however, many checks (that implied applying to reach a starting or ending time instant) could discard a candidate contact and, consequently, TGCSA was doing unsuccessful work that increases the reported time per occurrence.
As expected, TGCSA reports the best time performance for and operations. With the densest configuration, TGCSA slightly outperforms TGCSA-VB (being 0-40% faster). Yet, when we set , TGCSA-VB becomes around times faster than TGCSA.
CET still achieves good results, yet it is clearly outperformed by TGCSA. We can also see that EdgeLog is by far the slowest technique. Finally, it is interesting to note that in the Flickr-Data graph, TGCSA-3R improves the times of TGCSA by around one third in queries. This is to be expected because TGCSA has to apply $\Psi$ three times to recover the source and target vertexes of the edge, whereas TGCSA-3R requires only two applications.
4.4 Time comparison: Snapshot operation
We studied the performance obtained when retrieving the set of all the active edges at a certain time instant ( operation). We compared the average retrieval time at five instants of the lifetime of the temporal graphs: the first and last ones, and those at the 25%, 50%, and 75% of the lifetime in each graph. Table 4 provides the average number of active edges per time instant, that is, the expected output size.
Note that EdgeLog computes the operations with the application of queries over all the vertexes in the graph. CET computes this operation as a operation in the underlying Wavelet Matrix [11] and its cost is logarithmic with respect to the total number of edges in the graph. TGCSA, instead, must check which contacts match the time constraints of the query for all the candidate contacts. As shown, this is done with a binary search to find the ranges within the suffix array with possible both valid starting and ending time instants. That is followed by a traversal of the valid starting times (buffered access to ) to check if the end-time constraint is matched. In that case, we recover the source and target vertexes with one and two applications of , respectively.
Figure 20 shows the results. The time measures are shown in per edge reported. Overall, the results show that TGCSA outperforms CET in most cases and, in particular, in the non-synthetic datasets. TGCSA-VB also achieves very good performance for snapshot operations and, as expected, it excels in the ba100k10u1000 dataset due to its small vocabulary (few vertexes and short lifetime). This allows TGCSA-VB to exploit the faster sequential decoding of when compared with the that is used in TGCSA. Note that, in this particular dataset, where CET clearly outperforms TGCSA, TGCSA-VB is able to reach the same performance as CET.
For these types of queries, EdgeLog has a fast decoding of posting lists based on the use of , but it must traverse all these lists for each source vertex. This leads to a very fast performance when the number of retrieved contacts is high, but it becomes very slow when we recover only a few contacts.
5 Conclusions and future work
We presented TGCSA, a new representation for temporal graphs based on the well-known CSA. We showed how we can adapt the temporal graph so that it can be indexed with an iCSA self-index. Then, we proposed a modification of the regular structure in iCSA in such a way that it allows us to move circularly from one term to the other within each contact. This modification solves queries using the CSA mechanism to search for one or more terms of the contacts. This is both fast and flexible.
In addition, we explored a new way to increase the performance of iCSA based on replacing its traditional compressed representation of by a new representation that we called . To improve access to values, our new technique uses byte-aligned codewords instead of bit-oriented Huffman (other traditional representations used delta and gamma codes, see [14] for more details). We also avoided sampling at regular intervals, as is done in traditional compressed representations of . In our case, since many operations in TGCSA imply recovering a sequence of consecutive values related to a given symbol , we sampled the starting positions of () for all the different symbols . We ran experiments that verified that our new representation is typically much faster than when we want to retrieve a buffer with consecutive values from . Yet, it is not so advantageous when accessing values at random positions. We created a variant of TGCSA, named TGCSA-VB, that uses the approach to represent . TGCSA-VB is up to times faster than TGCSA in some operations; however, it uses around -% more space. Finally, we also adapted TGCSA to the particular case of temporal graphs where contacts have only three terms (an edge is never deactivated). This is the particular case of the Flickr-Data dataset. The resulting variant (referred to as TGCSA-3R) improved the results of TGCSA in both space and time.
The experimental results showed that TGCSA behaves reasonably well in space. In general, space needs are between - bits per contact. With respect to time performance, TGCSA is very successful for queries that can filter out many contacts from the dataset with an initial binary search in the TGCSA. This avoids the need for sequentially checking a large number of contacts.
We compared TGCSA with CET and EdgeLog. In and queries, EdgeLog is a tough competitor because it is an inverted index designed to answer queries in a very efficient way and it also uses a reverse aggregated graph to support queries efficiently. However, even in this case, TGCSA solves most queries in less than 1 millisecond per contact reported. For queries about events (i.e., or ), in contrast, EdgeLog performs poorly and TGCSA is clearly the fastest alternative. With respect to CET, we have shown that, even though CET typically uses less space than TGCSA, it is also usually slower. In particular, in and queries CET is around one order of magnitude slower than TGCSA.
An important feature of TGCSA is its expressive power. We can use it to represent any set of contacts without any limitation. For example, we could deal with contacts of an edge with overlapping time intervals. Also, as it was indicated above, the indexing capabilities of the CSA allow us to perform most operations following the same structure: (i) performing an initial binary search in CSA to obtain one range (or more) corresponding either to the vertexes or the times in the contacts, and (ii) for all the entries in such range (each one corresponding to a different contact), we can apply circularly to either recover the other terms of the contacts, or to check a constraint about them.
As future work, there are two interesting lines we would like to explore in the scope of temporal graphs. On the one hand, our new allows us to improve the performance of previous representations [14], but it requires a large amount of extra space. Likewise, the variant uses less space but it also proves to be slower. Since is the most important structure in TGCSA (it uses around 80-90% of its space, and it is accessed profusely during searches), we still want to try other ways to represent . On the other hand, we are also interested in studying the applicability of other self-indexes to the scope of this paper.
Finally, the variant of CSA shown in this paper is not only of interest in the field of temporal graphs, but it has also opened new opportunities for the application of suffix arrays in other fields. For example, it has obtained very good results when representing RDF datasets [4, 9]. In the future we are also planning to study its applicability to represent other types of networks. For example, we have obtained promising results when using a CSA-based approach to represent trajectories of moving objects constrained to a network [5]. We would expect that the flexibility of our approach could make it successful in other contexts.
References
- [1] Albert, R., Barabási, A.-L., 2002. Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–97.
- [2] Bannister, M. J., Du Bois, C., Eppstein, D., Smyth, P., 2013. Windows into Relational Events: Data Structures for Contiguous Subsequences of Edges. In: Proc. Symposium on Discrete Algorithms (SODA). pp. 856–864.
- [3] Brisaboa, N. R., Caro, D., Fariña, A., Rodríguez, M. A., 2014. A compressed suffix-array strategy for temporal-graph indexing. In: Proc. 21st International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 8799. pp. 77–88.
- [4] Brisaboa, N. R., Cerdeira, A., Fariña, A., Navarro, G., 2015. A compact RDF store using suffix arrays. In: Proc. 22nd International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 9309. pp. 103–115.
- [5] Brisaboa, N. R., Fariña, A., Galaktionov, D., Rodríguez, M. A., 2016. Compact trip representation over networks. In: Proc. 23rd International Symposium on String Processing and Information Retrieval (SPIRE). LNCS 9954. pp. 240–253.
- [6] Buin-Xuan, B.-M., Ferreira, A., Jarry, A., 2003. Computing shortest, fastest, and foremost journeys in dynamic networks. Int. J. Found. Comput. Sci. 14 (02), 267–285.
- [7] Caro, D., Rodríguez, A., Brisaboa, N. R., Fariña, A., 2016. Compressed $k^{d}$-tree for temporal graphs. Knowl. Inf. Syst. 49 (2), 553–595.
- [8] Caro, D., Rodríguez, M. A., Brisaboa, N. R., 2015. Data structures for temporal graphs based on compact sequence representations. Inf. Syst. 51, 1–26.
