A Compact Representation for Trips over Networks built on self-indexes
Nieves R. Brisaboa, Antonio Fariña, Daniil Galaktionov, M. Andrea Rodriguez

TL;DR
This paper introduces a new compact data structure called CTR for efficiently representing and querying large sets of trips over transportation networks, combining spatial and temporal indexing for improved performance.
Contribution
The paper presents the CTR, a novel compact trip representation that integrates CSA-based spatial indexing with wavelet-based temporal indexing for efficient spatio-temporal queries.
Findings
CTR needs around 50-70% of the space required by a compact non-indexed baseline.
Most queries are answered within 1-1000 microseconds.
CTR effectively supports various spatial, temporal, and spatio-temporal queries.
Abstract
Representing the movements of objects (trips) over a network in a compact way while retaining the capability of exploiting such data effectively is an important challenge of real applications. We present a new Compact Trip Representation (CTR) that handles the spatio-temporal data associated with users' trips over transportation networks. Depending on the network and types of queries, nodes in the network can represent intersections, stops, or even street segments. CTR represents separately sequences of nodes and the time instants when users traverse these nodes. The spatial component is handled with a data structure based on the well-known Compressed Suffix Array (CSA), which provides both a compact representation and interesting indexing capabilities. The temporal component is self-indexed with either a Hu-Tucker-shaped Wavelet-tree or a Wavelet Matrix that solve range-interval…
| | 32 | 128 | 512 |
|---|---|---|---|
| Madrid | 41.32% | 26.80% | 23.06% |
| Porto | 23.66% | 15.49% | 13.37% |

| | Type of bitvector in / | | | |
|---|---|---|---|---|
| Madrid (, 5-min) | 91.33% | 80.89% | 76.90% | 74.90% |
| Madrid (, 5-min) | 103.13% | 86.03% | 80.61% | 77.88% |
| Madrid (, 30-min) | 92.30% | 78.90% | 74.66% | 72.52% |
| Madrid (, 30-min) | 103.14% | 83.32% | 77.90% | 75.18% |
| Porto (, 5-min) | 93.52% | 102.61% | 98.27% | 96.11% |
| Porto (, 5-min) | 103.13% | 106.88% | 101.41% | 98.66% |
| Porto (, 30-min) | 96.00% | 103.78% | 99.08% | 96.74% |
| Porto (, 30-min) | 103.12% | 107.00% | 101.50% | 98.75% |

| | Type of bitvector in / | | | | Type of bitvector in / | | | |
|---|---|---|---|---|---|---|---|---|
| Madrid (, 5-min) | 69.90% | 63.93% | 61.65% | 60.51% | 62.07% | 56.10% | 53.82% | 52.68% |
| Madrid (, 5-min) | 76.64% | 66.87% | 63.77% | 62.21% | 68.81% | 59.04% | 55.94% | 54.38% |
| Madrid (, 30-min) | 66.81% | 60.11% | 57.99% | 56.92% | 57.68% | 50.98% | 48.86% | 47.79% |
| Madrid (, 30-min) | 72.23% | 62.32% | 59.61% | 58.25% | 63.10% | 53.19% | 50.48% | 49.12% |
| Porto (, 5-min) | 48.81% | 52.08% | 50.52% | 49.74% | 42.22% | 45.49% | 43.93% | 43.15% |
| Porto (, 5-min) | 52.27% | 53.62% | 51.65% | 50.66% | 45.68% | 47.03% | 45.06% | 44.07% |
| Porto (, 30-min) | 43.39% | 45.51% | 44.23% | 43.59% | 35.91% | 38.03% | 36.75% | 36.11% |
| Porto (, 30-min) | 45.33% | 46.39% | 44.89% | 44.14% | 37.85% | 38.91% | 37.41% | 36.66% |
Funded in part by the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 690941 (project BIRDS). The Spanish group is also partially funded by Xunta de Galicia/FEDER-UE [CSI: ED431G/01 and GRC: ED431C 2017/58]; by MINECO-AEI/FEDER-UE [Datos 4.0: TIN2016-78011-C4-1-R; Velocity: TIN2016-77158-C4-3-R; and ETOME-RDFD3: TIN2015-69951-R]; and by MINECO-CDTI/FEDER-UE [CIEN: LPS-BIGGER IDI-20141259 and INNTERCONECTA: uForest ITC-20161074]. M. A. Rodríguez is partially funded by Fondecyt [1170497] and by the Millennium Institute for Foundational Research on Data.
Nieves R. Brisaboa
Antonio Fariña
Daniil Galaktionov
M. Andrea Rodriguez
University of A Coruña, Database Laboratory, Spain.
University of Concepción, Department of Computer Science, Chile.
Millennium Institute for Foundational Research on Data, Chile
Abstract
Representing the movements of objects (trips) over a network in a compact way while retaining the capability of exploiting such data effectively is an important challenge of real applications. We present a new Compact Trip Representation (CTR) that handles the spatio-temporal data associated with users’ trips over transportation networks. Depending on the network and types of queries, nodes in the network can represent intersections, stops, or even street segments.
CTR represents separately sequences of nodes and the time instants when users traverse these nodes. The spatial component is handled with a data structure based on the well-known Compressed Suffix Array (CSA), which provides both a compact representation and interesting indexing capabilities. The temporal component is self-indexed with either a Hu-Tucker-shaped Wavelet-tree or a Wavelet Matrix that solve range-interval queries efficiently. We show how CTR can solve relevant counting-based spatial, temporal, and spatio-temporal queries over large sets of trips. Experimental results show the space requirements (around 50-70% of the space needed by a compact non-indexed baseline) and query efficiency (most queries are solved in the range of 1-1000 microseconds) of CTR.
Keywords: trips on networks, counting queries, self-index, compression
Journal: Information Systems
An early partial version of this article appeared in Proc. SPIRE’16 [1].
1 Introduction
Due to current advances in sensor networks, wireless technologies, and RFID-enabled ubiquitous computing, data about moving objects (also referred to as trajectories) is an example of massive data relevant in many real applications. Consider the notion of Smart Cities, where the implementation of new technologies in public transportation systems has become widespread all around the world in recent decades. For instance, many cities, from London to Santiago, now provide the users of public transportation with smartcards that make paying for access to buses or subways easier. Even though smartcards may only collect data when users enter the transportation system, it is possible to derive the users’ trips (when they enter and leave the system) using historical data and transportation models [2]. With this data available, counting or aggregate queries over trajectories become useful tools for traffic monitoring, road planning, and road navigation systems.
New technologies and devices generate a huge amount of highly detailed, real-time data. Considerable research exists on moving-object databases (MODs) [3, 4, 5, 6] and indexing structures [7, 8, 9, 10, 11, 12, 13]. These works, however, address typical spatio-temporal queries such as time-slice or time-interval queries, which retrieve trajectories or objects that were in a spatial region at a time instant or during a time interval. They were not specifically designed to answer queries that are based on counting, such as the number of distinct trips, which are more meaningful queries for public-transportation or traffic administrators. This problem was recently highlighted in [14], where the authors describe approximate query processing of aggregate queries that count the number of distinct trajectories within a region. In this work, we concentrate on counting-based queries on a network, which include the number of trips starting or ending at some time instant in specific stops (nodes) or the top-k most used stops of a network during a given time interval.
This paper proposes a new structure named Compact Trip Representation (CTR) that answers counting-based queries and uses compact self-indexed data structures to represent a large number of trips in compact space. CTR combines two well-known data structures. The first one, initially designed for the representation of strings, is Sadakane’s Compressed Suffix Array (CSA) [15]. The second one is the Wavelet Tree (WT) [16]. To make the use of the CSA possible, we define a trip or trajectory of a moving object over a network as the temporally-ordered sequence of the nodes the trip traverses. An integer identifier (id) is assigned to each node, such that a trip is a string with the ids of the nodes. Note that this representation avoids the cost of storing coordinates to represent the locations users pass through during a trip. It is enough to identify the stops or nodes and, when necessary, to map these nodes to geographic locations. Then a CSA, with some adaptations for this context, is built over the concatenation of these strings (trips). In addition, we discretize time into periods of fixed duration (e.g., the timeline is split into 5-minute intervals) and each time segment is identified by an integer. In this way, it is possible to store the times when trips reach each node by associating the corresponding time identifier with each node in each trip. The sequence of times for all the nodes within a trip is self-indexed with a WT to efficiently answer temporal and spatio-temporal queries.
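As an illustration of the encoding just described, the following minimal Python sketch maps a trip to a sequence of node ids and a parallel sequence of time-interval ids. The names `encode_trip`, `node_ids`, and `INTERVAL_SECONDS`, and the use of seconds as the raw time unit, are assumptions made for this example and do not come from the paper.

```python
INTERVAL_SECONDS = 5 * 60  # assumed bucket width: 5-minute intervals


def encode_trip(trip, node_ids, t0=0):
    """Split a trip [(node_name, seconds), ...] into node ids and time-interval ids.

    `node_ids` is a hypothetical dictionary assigning an integer id to each node;
    `t0` is the (assumed) origin of the timeline, in seconds.
    """
    nodes = [node_ids[name] for name, _ in trip]
    times = [(ts - t0) // INTERVAL_SECONDS for _, ts in trip]
    return nodes, times


# Hypothetical three-stop network and one trip over it.
node_ids = {"A": 1, "B": 2, "C": 3}
trip = [("A", 0), ("C", 290), ("B", 610)]
nodes, times = encode_trip(trip, node_ids)
print(nodes, times)
```

The two output sequences are kept aligned position by position, which is what lets a spatial index over the node ids and a temporal index over the interval ids be queried together.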
We experimentally tested our proposal using two sets of data representing trips over two different real public transportation systems. Our results are promising because the representation uses around % of its original size and answers most of our spatial, temporal, and spatio-temporal queries within microseconds. In addition, since CTR implicitly keeps all the original trajectories in a compact and self-indexed way, it would permit us to extend its functionality with additional operations that could benefit from the indexed access provided by both the underlying CSA and WT structures. No experimental comparisons with classical spatial or spatio-temporal indexing structures were possible, because none of them were designed to answer the types of queries in this work. Our approach can be considered a proof of concept that opens new application domains for well-known compact data structures such as the CSA and the WT, creating a new strategy for exploiting trajectories represented in a self-indexed way.
The organization of this paper is as follows. Section 2 reviews previous works on trip representations. It also presents the CSA and the WT, upon which we develop our proposal. We pay special attention to showing the internals of those structures and also discuss their properties and functionality. Then, in Section 2.3.2, we also present the wavelet matrix (WM) and show how to create a Hu-Tucker-shaped WT (HTWT). These are the two variants of the WT we use to represent temporal data. In Section 3, we present the main counting-based queries that are of interest for a transportation network. In Section 4, we present CTR and show how to reorganize a dataset of trips to allow a CSA to handle the spatial data and a WT-based structure to manage the temporal data. Section 5 shows how CTR represents the spatial component and how spatial queries are dealt with. In Section 6, we focus on how to represent the temporal component of trips and how to answer temporal queries. We also include a brief comparison of the space/time trade-off of the WM and the HTWT. In Section 7, we show how spatio-temporal queries are solved by CTR, and Section 8 includes our experimental results. Finally, conclusions and future work are discussed in Section 9.
2 Previous Work
2.1 Models of trajectory and types of queries
There is a large amount of work on data models for moving-object data [17, 18, 19, 3, 4, 5, 6, 20]. Basically, a moving-object data model represents the continuous change of the location of an object over time, what is called the trajectory of the object.
Moving-object data is an example of big data that differ in the representation of location, the contextual or environmental information where the movement takes place, the time dimension, which can be continuous or discrete, and the level of abstraction or granularity at which the trajectories are described [21]. A common classification of trajectories distinguishes free from network-based trajectories. Free trajectories or Euclidean trajectories are a sequence of GPS points represented by an ad-hoc data type of moving points [17, 18, 19]. Network-based trajectories are a temporally ordered sequence of locations on networks. This trajectory model includes a data type for representing networks and for representing the relative location of static and moving points on the network [22]. In a recent work, network-matched trajectories are defined to avoid the need for a mobile map at the moving-object side [23].
The definition of trajectories at an abstract level must be materialized in an internal representation with access methods for query processing. An early and broad classification of spatio-temporal queries for historical positions of moving objects [8] identifies coordinate- and trajectory-based queries. Coordinate-based queries include the well-known time-slice, time-interval and nearest-neighbor queries. Examples are: find objects or trajectories in a region at a particular time instant or during some time interval. Another important example of range-based queries is: find the k-closest objects with respect to a given point at a given time instant. Trajectory-based queries involve the topology of trajectories (e.g., overlap and disjoint) and information (e.g., speed, area, and heading) that can be derived from the combination of time and space. An example of such queries would be: find objects or trajectories that satisfy a spatial predicate (e.g., leave or enter a region) at a particular time instant or time interval. There also exist combined queries addressing information of particular objects: where was object X at a particular time instant or time interval? In all the previous queries, the results are individual trajectories that satisfy the query constraints.
When dealing with large datasets of trajectories, we can find scenarios where answering counting-based or aggregate queries is the main concern. This is, for example, the case of network management applications, applications for mobility analysis, or situations where privacy issues prevent us from revealing the original individual trajectories. In this context, we can further distinguish range- from trajectory-based queries. Range queries impose constraints in terms of a spatial location and a temporal interval. Examples of these queries are: retrieve the number of distinct trajectories that intersect a spatial region or spatial location (stop) at a given time instant or time interval; retrieve the number of distinct trajectories that start at a particular location (stop) or in a region and/or end in another particular location or region; retrieve the number of trajectories that follow a path; and retrieve the top-k locations (stops) or regions with the largest number of trajectories that intersect them at a given time instant or time interval. Trajectory-based queries require using not only the spatio-temporal points of trajectories but also the sequence of these points. Examples of such queries are: find the number of trajectories that are heading to (not necessarily ending at) a spatial location during a time interval; find the destination of trajectories that are passing through a region during a time interval; and find the number of starting locations of trajectories that go or pass through a region during a time interval.
2.2 Trajectory indexing
Many data structures have been proposed to support efficient query capabilities on collections of trajectories. We refer to [7, Chapter 4] for a comprehensive and up-to-date survey on data management techniques for trajectories of moving objects. We can broadly classify these data structures into two groups: those that index trajectories in free space and those that index trajectories that are constrained to a network.
In free space, it is common to see spatial indexes that extend the R-Tree index [24] beyond a simple 3D R-Tree where the time is one of the dimensions. Two examples of such indexes can be found in [8] where the authors present two fundamental variations of the R-Tree: the STR-Tree and the TB-Tree. Both indexes modify the classical construction algorithm for the R-Tree, where the nodes are not only grouped by the spatial distance among the indexed objects, but also by the trajectories they belong to. In the MV3R-tree [9], the construction takes into account temporal information of the moving objects, adapting ideas from the Historical R-Tree [25]. Another interesting approach is described in [26], where the authors split trajectories of moving objects across partitions of space, indexing each partition separately. This improves query efficiency, as only the partitions that intersect a query region are accessed.
R-Tree adaptations can also be useful when the trajectories are constrained to a network. They exploit the constraints imposed by the topology of the network to optimize the data structure. This is the case of the FNR-tree [10], which consists of a 2D R-Tree to index the segments of the trajectories over the network, and a forest of 1D R-Trees used to index the time interval when each trajectory is moving through each segment of the network. The MON-Tree [11] can be seen as an improvement over the FNR-Tree, saving considerable space by indexing MBRs of larger network elements (edge segments or entire roads) and reducing the number of disk accesses at query time. Both indexes are outperformed in query time by the TMN-Tree [27], which indexes whole trajectories of moving objects with a 2D R-Tree and the temporal component with a B-Tree, which proves to be more efficient for that application than the R-Tree.
PARINET is another interesting alternative to represent trajectories constrained to a network [12]. It partitions trajectories into segments from an underlying road network using a complex cost model to minimize the number of disk accesses at query time. It takes into account the spatial relations among the indexed network elements, as well as some statistics of the data to index. Then it adds a temporal B-tree to index the trajectory segments from each road. Those indexes permit PARINET to filter out candidate trajectory segments matching time constraints at query time. The same ideas were used in TRIFL [28], where the cost model is adapted for flash storage.
All previous data structures were designed to answer spatio-temporal queries, where space, in particular geographic coordinates, and time are the main filtering criteria. Examples of such queries are: retrieve trajectories that crossed a region within a time interval, retrieve trajectories that intersect, or retrieve the $k$-best connected trajectories (i.e., the most similar trajectories in terms of a distance function). Yet, they could not easily support queries such as the number of trips starting in X and ending at Y. A recent work in [14] proposes a method to compute the approximate number of distinct trajectories that cross a region. Note that computing aggregate queries of trajectories in the hierarchical structure of classical spatio-temporal indices is usually done by aggregating the information maintained in index nodes at the higher levels to avoid accessing the raw spatio-temporal data. However, for a trajectory aggregate query, maintaining the statistical trajectory information on index nodes does not work because what matters for these queries is to determine the number of distinct trajectories in a spatio-temporal query region.
The application of data compression techniques has been explored in the context of massive data about trajectories. The work by Meratnia and de By [29] adapts a classical simplification algorithm by Douglas and Peucker to reduce the number of points in a curve and, in consequence, the space used to represent trajectories. Potamias et al. [30] use concepts, such as speed and orientation, to improve compression. It is also possible [31] to compress a trajectory in a way that the maximum error at query time is deterministic, although the method greatly depends on the distance function to be used.
The works in [32, 33, 34] focus mainly on how to represent trajectories constrained to networks, and on how to gather the location of one or more given moving objects from those trajectories. Yet, these works are out of our scope as they would poorly support queries oriented to exploit the data about the network usage, such as those that compute the number of trips with a specific spatio-temporal pattern (e.g., count the trips starting at stop $X$ and ending at stop $Y$ on working days between 7:00 and 9:00).
A recent work [35] proposed an indexing structure called NETTRA to answer strict and approximate path queries that can be implemented in standard SQL using B+-trees and self-JOIN operations. For each trajectory, NETTRA represents the sequence of adjacent network edges touched by the trajectory as entries in a table with four columns: id, entering and leaving time, and a hash value of the entire path up to and including the edge itself. Using the hash value for the first and last edge on a query path, NETTRA determines whether the trajectory followed a specific path between these edges. Also for strict path queries, Koide et al. [36] proposed a spatio-temporal index structure called SNT-Index that is based on the integration of an FM-index [37] to store spatial information with a forest of B+trees that stores temporal information. To the best of our knowledge, this is the first technique using compact data structures to handle spatial data in this scenario. Yet, in our opinion, strict path queries have little interest in the context of exploiting data to analyze the usage of a transportation network.
Unlike previous works, we designed an in-memory representation that targets counting-based queries and is completely based on compact data structures (discussed in the next section), so that it is efficient not only in time but also in space. Since CTR keeps data in a compressed way, it can handle larger sets of trajectories entirely in memory and consequently avoid costly disk accesses.
2.3 Underlying Compact Structures of CTR
CTR relies on two components: one to handle the spatial information and another to represent temporal information. The spatial component is based on the well-known Compressed Suffix Array (CSA) [15]. The temporal component can be implemented with either a Wavelet Tree (WT) [16] or a Wavelet Matrix (WM) [38]. The latter is a variant of the WT that performs better when representing sequences built on a large alphabet, as we see below.
2.3.1 Sadakane’s Compressed Suffix Array (CSA)
Given a sequence $S[1,n]$ built over an alphabet $\Sigma$ of length $\sigma$, the suffix array $A[1,n]$ built on $S$ [39] is a permutation of $[1,n]$ of all the suffixes $S[i,n]$ so that $S[A[i],n] \prec S[A[i+1],n]$ for all $1 \le i < n$, where $\prec$ is the lexicographic ordering. Because $A$ contains all the suffixes of $S$ in lexicographic order, this structure permits searching for any pattern $P[1,m]$ in time $O(m \log n)$ by simply binary searching the range $A[l,r]$ that contains pointers to all the positions in $S$ where $P$ occurs.
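The binary search over a suffix array can be sketched as follows. This is a minimal illustration, not the construction used in the paper: `suffix_array` sorts suffixes naively, positions are 0-based, and the function names are chosen for this example only.

```python
def suffix_array(s):
    """Naive construction: sort the (0-based) start positions of all suffixes."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def sa_range(s, sa, p):
    """Return the half-open range [lo, hi) of sa whose suffixes start with p."""
    m = len(p)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-symbol prefix is >= p
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-symbol prefix is > p
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


text = "abracadabra"
sa = suffix_array(text)
lo, hi = sa_range(text, sa, "abra")
print(hi - lo, sorted(sa[lo:hi]))  # number of occurrences and where they start
```

Each of the two binary searches performs $O(\log n)$ comparisons of at most $m$ symbols, which matches the $O(m \log n)$ bound stated above. The same code works on lists of integer node ids, since Python slices and compares lists lexicographically.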
To reduce the space needs of $A$, Sadakane’s CSA [15] uses another permutation $\Psi[1,n]$ defined in [40]. For each position $j$ in $S$ pointed from $A[i] = j$, $\Psi[i]$ gives the position $z$ such that $A[z]$ points to $j+1 = A[i]+1$. There is a special case when $A[i] = n$; in this case $\Psi[i]$ gives the position $z$ such that $A[z] = 1$. In addition, we could set up a vocabulary array $V[1,\sigma]$ with all the different symbols from $\Sigma$ that appear in $S$, and a bitmap $D[1,n]$ aligned to $A$ so that $D[i] = 1$ if $i = 1$ or if $S[A[i]] \neq S[A[i-1]]$ ($D[i] = 0$ otherwise). Basically, a $1$ in $D$ marks the beginning of a range of suffixes pointed from $A$ such that the first symbol of those suffixes coincides. That is, if the $i$-th and $(i+1)$-th one in $D$ occur in $D[l]$ and $D[r]$ respectively, we will have $S[A[x]] = V[i]$ for all $l \le x < r$. Note that $rank_1(D,i)$ returns the number of 1s in $D[1,i]$ and can be computed in constant time using $o(n)$ extra bits [41, 42].
By using $\Psi$, $D$, and $V$, it is possible to simulate a binary search for the interval $A[l,r]$ where a given pattern $P$ occurs without keeping $A$ nor $S$. Note that the symbol $S[A[i]]$ pointed by $A[i]$ can be obtained as $V[rank_1(D,i)]$, and we can obtain the following symbol from the source sequence, $S[A[i]+1]$, as $V[rank_1(D,\Psi[i])]$; $S[A[i]+2]$ can be obtained as $V[rank_1(D,\Psi[\Psi[i]])]$, and so on. Therefore, $\Psi$, $D$, and $V$ replace $S$, and $A$ is not needed anymore to perform searches.
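A small sketch may help to see how the source symbols can be recovered from $\Psi$, $D$, and $V$ alone. It uses 0-based positions, builds the suffix array naively, and computes `rank1` by a linear scan instead of a constant-time structure; `build_csa` and `extract` are names assumed for this illustration.

```python
def build_csa(s):
    """Build (sa, psi, d, vocab) for s; sa is returned only to locate a start slot."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])
    inv = [0] * n
    for slot, pos in enumerate(sa):
        inv[pos] = slot
    # psi[i]: slot of the suffix that starts one position after sa[i] (wrapping at the end)
    psi = [inv[(sa[i] + 1) % n] for i in range(n)]
    vocab = sorted(set(s))
    # d[i] = 1 where the first symbol of the suffix in slot i differs from the previous slot
    d = [1 if i == 0 or s[sa[i]] != s[sa[i - 1]] else 0 for i in range(n)]
    return sa, psi, d, vocab


def rank1(d, i):
    """Number of 1s in d[0..i] (inclusive); linear scan for simplicity."""
    return sum(d[: i + 1])


def extract(psi, d, vocab, slot, length):
    """Decode `length` symbols starting at the suffix in `slot`, without sa or s."""
    out = []
    for _ in range(length):
        out.append(vocab[rank1(d, slot) - 1])
        slot = psi[slot]
    return out


s = "abracadabra"
sa, psi, d, vocab = build_csa(s)
start = sa.index(0)  # slot of the suffix that begins the sequence
print("".join(extract(psi, d, vocab, start, len(s))))
```

Following `psi` repeatedly walks the sequence left to right, which is why the original text does not need to be stored once `psi`, `d`, and `vocab` are available.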
However, in principle, $\Psi$ would have the same space requirements as $A$. Fortunately, $\Psi$ is highly compressible. It was shown to be formed by $\sigma$ subsequences of increasing values [40] so that it can be compressed to around the zero-order entropy of $S$ [15] and, by using $\delta$-codes to represent the differential values, its space needs are $nH_0 + O(n \log\log \sigma)$ bits. In [43], it was shown that $\Psi$ can be split into $nH_k + \sigma^k$ (for any $k$) runs of consecutive values so that the differences within those runs are always $1$. This made it possible to combine $\delta$-coding of gaps with run-length encoding (of $1$-runs), yielding higher-order compression of $\Psi$. In addition, to maintain fast random access to $\Psi$, absolute samples at regular intervals (every $t_\Psi$ entries) are kept. Parameter $t_\Psi$ implies a space/time trade-off. Larger values lead to better compression of $\Psi$ but slow down access time to non-sampled values.
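The sampling trade-off can be illustrated with a simplified sketch that stores absolute samples every `t` entries and plain integer differences in between. It deliberately omits the actual bit-level coding of the gaps (such as $\delta$-codes, Huffman, or run-length encoding); the function names and the layout of `gaps` are assumptions for this example.

```python
def sample_and_diff(psi, t):
    """Keep absolute values at positions 0, t, 2t, ... and differences elsewhere."""
    samples = psi[::t]
    gaps = [psi[i] - psi[i - 1] for i in range(len(psi)) if i % t != 0]
    return samples, gaps


def psi_access(samples, gaps, t, i):
    """Recover psi[i] from the nearest preceding sample plus at most t-1 gaps."""
    block, off = divmod(i, t)
    value = samples[block]
    base = block * (t - 1)  # each full block contributes t-1 gaps
    for g in gaps[base: base + off]:
        value += g
    return value


# psi of the sequence "abracadabra" (0-based), used here as sample data
psi = [2, 5, 6, 7, 8, 9, 10, 4, 1, 0, 3]
samples, gaps = sample_and_diff(psi, 4)
print(samples, gaps)
print([psi_access(samples, gaps, 4, i) for i in range(len(psi))])
```

With a larger `t` fewer absolute samples are stored, but recovering a non-sampled entry requires adding up more gaps, which is the space/time trade-off mentioned above. Runs of gaps equal to 1 are the ones that run-length encoding would collapse.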
In [44], the authors adapted Sadakane’s CSA to deal with large (integer-based) alphabets and created the integer-based CSA (iCSA). They also showed that, in their scenario (natural language text indexing), the best compression of $\Psi$ was obtained by combining differential encoding of runs with Huffman and run-length encoding.
2.3.2 The Wavelet Tree (WT)
Given a sequence $S[1,n]$ built on an alphabet $\Sigma$ with $\sigma$ symbols that are encoded with a fixed-length binary code, a WT [16] built over $S$ is a balanced binary tree where leaves are labeled with the different symbols in $\Sigma$, and each internal node $v$ contains a bitvector $B_v$. The bitvector in the root node contains the first bit from the codes of all the symbols in $S$. Then symbols whose code starts with a 0 are assigned to the left child, and those with codes starting with a 1 are assigned to the right child. In the second level, the bitvectors contain the second bits of the codes of their assigned symbols. This applies recursively for every node, until a leaf node is reached. Leaf nodes can only contain one kind of symbol. The height of the tree is $\lceil \log \sigma \rceil$, and since the bitvectors of each level contain $n$ bits, the overall size of all the bitvectors is $n \lceil \log \sigma \rceil$ bits. To calculate the total size of the WT we also need to take into account the space needed to store pointers from each symbol to its corresponding tree node, which is $O(\sigma \log n)$ bits. In addition, as we see below, the WT reduces the general problem of solving $access$, $rank$, and $select$ operations to the problem of computing $access$, $rank$, and $select$ on the bitvectors. Therefore, additional structures to efficiently support those operations add up to $o(n \log \sigma)$ space. The overall size of the WT is $n \lceil \log \sigma \rceil (1 + o(1))$ + $O(\sigma \log n)$ bits. Figure 1.(left) shows a WT built on an example sequence $S$, assuming we use a 3-bit binary encoding for the symbols in $\Sigma$. Shaded areas are not included in the WT but help us to see the subsequences handled by the children of a given node.
Among others, the WT supports the following queries in $O(\log \sigma)$ time:
- $access(S, i)$ returns $S[i]$.
- $rank_c(S, i)$ returns the number of occurrences of symbol $c$ in $S[1,i]$.
- $select_c(S, i)$ returns the position of the $i$-th occurrence of symbol $c$ in $S$.
- $count(S, i, j, \alpha, \beta)$, described in [45], returns the number of occurrences in $S[i,j]$ of the symbols between $\alpha$ and $\beta$.
To solve $rank$ and $access$ operations we traverse the WT from the root until we reach a leaf. In the case of $rank_c(S,i)$ we descend the tree taking into account the encoding of $c$ in each level. Being $B$ the bitmap in the root node, if the first bit of the encoding of $c$ is $0$ we move to the left child and set $i \leftarrow rank_0(B,i)$; otherwise we move to the right child and set $i \leftarrow rank_1(B,i)$. We proceed recursively until we reach a leaf, where we return $i$. $access(S,i)$ is solved similarly, but at each level we move either left or right depending on whether $B[i] = 0$ or $B[i] = 1$, respectively. The leaf where we arrive corresponds to the symbol $S[i]$, which is returned. To solve $select_c(S,i)$ we traverse the tree from the leaf corresponding to symbol $c$ up to the root. At level $l$, we look at the value of the $l$-th bit of the encoding of $c$ ($c_l$). If $c_l = 0$ we set $i \leftarrow select_0(B,i)$ (where $B$ is the bitmap of the parent of the current node); otherwise we set $i \leftarrow select_1(B,i)$. Then we move to level $l-1$ and proceed recursively until the root, where the final value of $i$ is returned.
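The three traversals can be sketched with a small pointer-based wavelet tree in Python. This is an illustrative sketch under simplifying assumptions: symbols are integers in a known range `[lo, hi]`, nodes split that range at its midpoint (which coincides with a fixed-length binary code when the range size is a power of two), positions are 0-based, and bit-level `rank`/`select` use linear scans rather than constant-time structures. The class and method names are chosen for this example.

```python
class WaveletTree:
    """Pointer-based wavelet tree over integers in [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi, self.n = lo, hi, len(seq)
        self.left = self.right = None
        self.bits = []
        if lo == hi or not seq:
            return  # leaf (single symbol) or empty subtree
        mid = (lo + hi) // 2
        self.bits = [0 if x <= mid else 1 for x in seq]
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def _rank_bit(self, b, i):
        """Occurrences of bit b in bits[0:i]."""
        return sum(1 for x in self.bits[:i] if x == b)

    def access(self, i):
        """Symbol stored at 0-based position i."""
        node = self
        while node.left is not None:
            b = node.bits[i]
            i = node._rank_bit(b, i)
            node = node.right if b else node.left
        return node.lo

    def rank(self, c, i):
        """Occurrences of symbol c in seq[0:i]."""
        if not (self.lo <= c <= self.hi):
            return 0
        node = self
        while node.left is not None:
            b = 0 if c <= (node.lo + node.hi) // 2 else 1
            i = node._rank_bit(b, i)
            node = node.right if b else node.left
        return i

    def select(self, c, k):
        """0-based position of the k-th (k >= 1) occurrence of c, or -1 if absent."""
        if self.left is None:
            return k - 1 if (self.lo == c == self.hi and k <= self.n) else -1
        b = 0 if c <= (self.lo + self.hi) // 2 else 1
        pos = (self.right if b else self.left).select(c, k)
        if pos < 0:
            return -1
        seen = -1  # map the child position back: find the (pos+1)-th bit equal to b
        for j, x in enumerate(self.bits):
            if x == b:
                seen += 1
                if seen == pos:
                    return j
        return -1


seq = [3, 1, 4, 1, 5, 2, 6, 5, 3, 5]
wt = WaveletTree(seq, 0, 7)
print(wt.access(4), wt.rank(5, 8), wt.select(5, 2))
```

Here `select` recurses down to the leaf first and maps positions back on the way up, which mirrors the leaf-to-root traversal described above without needing parent pointers.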
In this work we are also interested in the $count(S, i, j, \alpha, \beta)$ operation, which allows us to count the number of occurrences of all the symbols in $[\alpha, \beta]$ within $S[i,j]$. Assuming the encodings of the symbols also form a contiguous range, this can be solved in $O(\log \sigma)$ time [45, 38]. The idea is to traverse the tree from the root and descend through the nodes that cover the leaves in $[\alpha, \beta]$. At each node $v$ (whose bitmap is $B_v$ and where the range $B_v[i_v, j_v]$ is considered) that covers symbols in range $[\alpha_v, \beta_v]$, we check whether $[\alpha_v, \beta_v] \subseteq [\alpha, \beta]$. In that case we sum $j_v - i_v + 1$ occurrences. If both ranges are disjoint, we found a node not covering the range and stop the traversal. Similarly, if range $[i_v, j_v]$ becomes empty, traversal stops on that branch. Otherwise, we recursively descend from node $v$ to its children $v_l$ and $v_r$, where we map the interval $[i_v, j_v]$ into $[i_{v_l}, j_{v_l}]$ and $[i_{v_r}, j_{v_r}]$ with the $rank$ operation. In practice, $i_{v_l} = rank_0(B_v, i_v - 1) + 1$, $j_{v_l} = rank_0(B_v, j_v)$ and $i_{v_r} = rank_1(B_v, i_v - 1) + 1$, $j_{v_r} = rank_1(B_v, j_v)$. For more details and pseudocode see [46, 45, 38]. In Figure 1.(left) we can see the nodes (, , and ) that must be traversed, and the ranges within the bitmaps in those nodes, to solve . Therefore, we want to compute the number of occurrences of symbols between and that occur within . We start in the root node , where contains zeroes and ones. We compute , and . At this point we could move to , but we can see that all the encodings of the symbols start by and they are covered by . Therefore, we report occurrences of symbols in range and no further processing is done in the subtree whose root is . However, we descend to since in could belong to any symbol in range , and we have to track only occurrences of symbol . We check and compute , and . Since the second bit of the encoding of symbol is a (as for symbol ), we can discard descending on the left child of and move only to its right child where we are interested in the range . Since covers both symbols and , and the third bit of the encoding of is a zero whereas it is a one for , we do only need to count the number of ones in . After computing , and , we report occurrence of symbol . Therefore, we conclude .
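The range-counting traversal can be sketched as follows. This is a minimal illustration with assumptions of its own: symbols are integers in `[lo, hi]`, nodes split the symbol range at its midpoint, positions are 0-based with half-open intervals `[i, j)`, and each node stores prefix counts of zeros so that `rank_0` is a table lookup. The tuple layout and the function names `build` and `count` are hypothetical.

```python
from itertools import accumulate


def build(seq, lo, hi):
    """Node = (lo, hi, zero_prefix_counts, left, right); leaves carry no children."""
    if lo == hi or not seq:
        return (lo, hi, None, None, None)
    mid = (lo + hi) // 2
    # zeros[k] = number of symbols <= mid among the first k positions (rank_0)
    zeros = [0] + list(accumulate(1 if x <= mid else 0 for x in seq))
    left = build([x for x in seq if x <= mid], lo, mid)
    right = build([x for x in seq if x > mid], mid + 1, hi)
    return (lo, hi, zeros, left, right)


def count(node, i, j, a, b):
    """Occurrences of symbols in [a, b] within positions [i, j) of the node's sequence."""
    lo, hi, zeros, left, right = node
    if i >= j or b < lo or hi < a:
        return 0  # empty interval, or symbol ranges are disjoint
    if a <= lo and hi <= b:
        return j - i  # node fully covered: every position in the interval counts
    zi, zj = zeros[i], zeros[j]  # rank_0 at both ends maps the interval to the left child
    return count(left, zi, zj, a, b) + count(right, i - zi, j - zj, a, b)


seq = [3, 1, 4, 1, 5, 2, 6, 5, 3, 5]
root = build(seq, 0, 7)
print(count(root, 2, 9, 2, 5))  # symbols between 2 and 5 in seq[2:9]
```

Fully covered nodes are answered without descending further, so only the nodes on the two boundary paths of the symbol range are expanded, which is where the logarithmic bound comes from.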
One way of reducing the space needs of a WT consists in compressing its bitvectors [47]. Among others (e.g., Golynski et al. [48], which is better theoretically), the technique of Raman et al. [49] (RRR) is, in practice, one of the best choices. The overall size of the WT becomes $nH_0(S) + o(n \log \sigma) + O(\sigma \log n)$ bits, whereas operations still require $O(\log \sigma)$ time.
Another way of compressing a WT is to use a prefix-free variable-length encoding for the symbols. For example, Huffman [50] code can be used to build a Huffman-shaped WT [51], where the tree is not balanced anymore. The size reduces to $n(H_0(S)+1)(1+o(1)) + O(\sigma \log n)$ bits (the $O(\sigma \log n)$ term includes both the tree pointers and the size of the Huffman model), and average time becomes $O(1 + H_0(S))$ for $access$, $rank$, and $select$ (worst-case time is still $O(\log \sigma)$ [52]). By using compressed bitvectors [38] space can be reduced even further to $nH_0(S) + o(n(H_0(S)+1)) + O(\sigma \log n)$ bits. Unfortunately, the Huffman codes given to adjacent symbols are no longer contiguous, and it is not possible to give a bound for $count$ anymore, even if the code is canonical. Hu-Tucker codes [53] can be used instead. (Hu-Tucker [53] is an optimal prefix code that preserves the order of the input vocabulary. This means that the lexicographic order of the output variable-length binary codes is the same as the order of the input symbols.) Compression degrades slightly with respect to using Huffman coding (being $L_{H}$ and $L_{HT}$ the average codeword length of Huffman coding and Hu-Tucker codes respectively, it holds: $H_0(S) \le L_{H} < H_0(S)+1$ and $H_0(S) \le L_{HT} < H_0(S)+2$; see [54] (pages 122-123), or [55, 56]), but the codes for adjacent symbols are lexicographically contiguous. This permits solving $count$ efficiently. The size of a Hu-Tucker-shaped WT (HTWT) can be bounded to $n(H_0(S)+2)(1+o(1)) + O(\sigma \log n)$ bits and can be reduced to $nH_0(S) + o(n(H_0(S)+2)) + O(\sigma \log n)$ bits by using compressed bitvectors as well.
The Wavelet Matrix (WM)
For large alphabets, the size of the WT is affected by the term $O(\sigma \log n)$. A pointerless WT [57] permits removing that term by concatenating all the bitmaps level-wise and computing the values of the pointers during the traversals (in a pointerless Huffman-shaped WT a term $O(\sigma \log n)$ still remains due to the need of storing the canonical Huffman model). The operations on a pointerless WT have the same time complexity but become slower in practice.
By reorganizing the nodes in each level of a pointerless WT, the Wavelet Matrix (WM) [38] obtains the same space requirements ($n \lceil \log \sigma \rceil (1 + o(1))$ bits), yet its performance is very close to that of the regular WT with pointers. Figure 1.(right) shows an example.
As in the WT, the $l$-th level stores the $l$-th bits of the encoded symbols. A single bitvector $B_l$ is kept for each level. In the first level, $B_1$ stores the $1$-st bit of the encoding of the symbols in the order of the original sequence $S$. From there on, at level $l$, symbols are reordered according to the $(l-1)$-th bit of their encoding; that is, according to the bit they had in the previous level. Those symbols whose encoding had a zero at position $l-1$ must be arranged before those that had a one. After that, the relative order from the previous level is maintained. That is, if a symbol $x$ occurred before another symbol $y$, and the $(l-1)$-th bit of their encoding coincides, then $x$ will precede $y$ at level $l$.
If we simply keep the number of zeros at each level (), we can easily see that the -th zero at level is mapped at position within , whereas the -th one at level is mapped at position within . This avoids the need for pointers and permits to retain the same time complexity of the operations. For implementation details see [38, 58]. For example to solve , we see that and . We move to the next level where we check position ; we see that and . We move to next level and check position , where we finally see . Therefore, we have decoded the bits that correspond to symbol .
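The level-by-level mapping described above can be illustrated with a small sketch. It is a naive illustration, not the implementation from [38]: rank is a linear scan instead of a constant-time structure, positions are 0-based, ranges are half-open, and the class and method names (`WaveletMatrix`, `access`, `count`) are chosen here for illustration only.

```python
class WaveletMatrix:
    """Naive wavelet matrix over integers in [0, 2**bits); rank is a linear scan."""

    def __init__(self, seq, bits):
        self.bits = bits
        self.levels = []  # one bit list per level, most significant bit first
        self.zeros = []   # number of zeros stored at each level
        cur = list(seq)
        for level in range(bits):
            shift = bits - 1 - level
            bv = [(v >> shift) & 1 for v in cur]
            self.levels.append(bv)
            self.zeros.append(bv.count(0))
            # Stable partition for the next level: zeros first, then ones.
            cur = [v for v, b in zip(cur, bv) if b == 0] + \
                  [v for v, b in zip(cur, bv) if b == 1]

    def _rank(self, level, bit, pos):
        """Occurrences of `bit` in levels[level][0:pos]."""
        return self.levels[level][:pos].count(bit)

    def access(self, pos):
        """Symbol stored at 0-based position `pos` of the original sequence."""
        value = 0
        for level in range(self.bits):
            bit = self.levels[level][pos]
            value = (value << 1) | bit
            if bit == 0:
                pos = self._rank(level, 0, pos)
            else:
                pos = self.zeros[level] + self._rank(level, 1, pos)
        return value

    def _count_less(self, l, r, x):
        """Number of values strictly below x in seq[l:r]."""
        if x <= 0:
            return 0
        if x >= 1 << self.bits:
            return r - l
        total = 0
        for level in range(self.bits):
            bit = (x >> (self.bits - 1 - level)) & 1
            l0, r0 = self._rank(level, 0, l), self._rank(level, 0, r)
            if bit == 1:
                total += r0 - l0  # symbols with a 0 at this level are smaller than x
                l = self.zeros[level] + (l - l0)
                r = self.zeros[level] + (r - r0)
            else:
                l, r = l0, r0
        return total

    def count(self, l, r, lo, hi):
        """Number of values within [lo, hi] in seq[l:r]."""
        return self._count_less(l, r, hi + 1) - self._count_less(l, r, lo)


seq = [3, 1, 4, 1, 5, 2, 6, 5]
wm = WaveletMatrix(seq, bits=3)
decoded = [wm.access(i) for i in range(len(seq))]
in_range = wm.count(2, 6, 1, 4)  # values of seq[2:6] that fall in [1, 4]
```

Only the per-level zero counts are needed to move between levels, which is why no pointers are stored.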
To reduce the space needs of the WM we could use compressed bitvectors as for WTs. Space needs become bits. Yet, compressing the WM by giving it either a Huffman or Hu-Tucker shape is not possible, as the reordering of the WM could lead to holes in the structure that would ruin the process of tracking symbols during traversals. To overcome this issue, an optimal Huffman-based coding was specifically developed for wavelet matrices [38, 59]. This yields space similar to that of a pointerless Huffman-shaped WT but faster , , and operations. Unfortunately, since the encodings of consecutive symbols do not form a contiguous range, count is no longer supported in time and computing is required for each in .
As indicated before, since in CTR we need efficient support for the count operation, we will try (see Section 6) the Hu-Tucker-shaped WT as well as the uncompressed WM. In addition, we will couple them with both uncompressed and RRR-compressed bitvectors.
3 Counting-based queries
In transportation systems, new technologies such as automatic fare collection (e.g., smartcards) and automatic passenger counting have made it possible to generate a huge amount of highly detailed, real-time data that is useful for defining measures that characterize a transportation network. This data is particularly useful because it consists of real trips, implicitly combining the service offered by a public transportation system with the demand for that system. With such data, what matters for traffic monitoring and road planning tasks is not the individual trajectories but measures of the use of the network. Examples of useful measures are accessibility and centrality indicators, which capture how easy it is to reach locations or how important certain stops are within a network [60, 61, 62, 63]. All these measures are based on some kind of counting query that determines the number of distinct trips that occur within a spatial and/or temporal window.
Among other types of queries, in this work we focus on the following counting queries, which to the best of our knowledge have not been addressed by previous proposals. We define two general queries, number-of-trips queries and top-k queries, to which we apply spatial, temporal, or spatio-temporal constraints when useful.
- (a)
Number-of-trips queries. This general type of query counts the number of distinct trips. When spatial, temporal, or spatio-temporal constraints are applied, it can be specialized into the following queries:
1. Pure spatial queries:
- Number of trips starting at node $X$ (starts-with-x).
- Number of trips ending at node $X$ (ends-with-x).
- Number of trips starting at $X$ and ending at $Y$ (from-x-to-y).
- Number of trips using or passing through node $X$ (uses-x).
2. Spatio-temporal queries:
- Number of trips starting at node $X$ during time interval $[t_1,t_2]$ (starts-with-x).
- Number of trips ending at node $X$ during time interval $[t_1,t_2]$ (ends-with-x).
- Number of trips starting at $X$ and ending at $Y$ occurring during time interval $[t_1,t_2]$ (from-x-to-y). This type of query is further classified into: (i) from-x-to-y with strong semantics (from-x-to-y-strong), which considers trips that completely occur within interval $[t_1,t_2]$; (ii) from-x-to-y with weak semantics (from-x-to-y-weak), which considers trips whose lifetime overlaps $[t_1,t_2]$.
- Number of trips using node $X$ during time interval $[t_1,t_2]$ (uses-x).
3. Pure temporal queries:
- Number of trips starting during time interval $[t_1,t_2]$ (starts-t).
- Total usage of network stops during time interval $[t_1,t_2]$ (uses-t).
- Number of trips performed during time interval $[t_1,t_2]$ (trips-t).
- (b)
Top-k queries. In this type of query we want to retrieve the $k$ nodes with the highest number of trips. Depending on whether there is a temporal constraint, we include the following queries:
1. Pure spatial Top-k queries:
- Top-k most used nodes (top-k), which returns the $k$ nodes with the largest number of trips passing through them.
- Top-k most used nodes to start a trip (top-k-starts), which returns the $k$ nodes with the largest number of trips that start at that node.
2. Spatio-temporal Top-k queries:
- Top-k most used nodes during time interval $[t_1,t_2]$ (top-k), which returns the $k$ nodes with the largest number of trips passing through them within $[t_1,t_2]$.
- Top-k most used nodes to start a trip during time interval $[t_1,t_2]$ (top-k-starts), which returns the $k$ nodes with the largest number of trips starting there within $[t_1,t_2]$.
4 Compact Trip Representation ()
If we consider a network with nodes, we can see a dataset of trips over as a set of trips, where for each trip , we represent a list with the temporally ordered nodes it traverses and the corresponding timestamps: , , , and . Note that every node in the network can be identified with an integer ID () and that, if we are interested in analyzing the usage patterns of the network, we will probably want to discretize time into time intervals (e.g. 5-min or 30-min intervals). Therefore, we will have different time intervals that can also be identified with an integer ID ().
The size of the time interval is a parameter for the time-discretizing process that can be adjusted to fit the required precision in each domain. For example, in a public transportation network where we could have data including five years of trips, one possibility would be to divide that five-year period into 10-minute intervals, hence obtaining a vocabulary of roughly different intervals. Another possibility would be to use cyclical annual 10-minute periods, resulting in . However, in public transportation networks, queries such as “Number of trips using the stop X on May 10 between 9:15 and 10:00” may not be as useful as queries such as “Number of trips using stop X on Sundays between 9:15 and 10:00”. For this reason, CTR can adapt how the time component is encoded depending on the queries that the system must answer.
Example 4.1.
Figure 2 shows a network that contains nodes numbered from to . Over that network we have six trips (), and, for each of them, we indicate the sequence of nodes it traverses and the time when the trip goes through those nodes. If we discretize time into 5-minute intervals, starting at 08:00h and ending at 09:20h, we will have 16 different time intervals. Any timestamp within interval will be assigned time-code [math], those within code , and so on until times within that are given time-code . Therefore, our dataset of trips will be: : \{$$\langle(\mathbf{1,2,3}), , , , , , , , , , , (\mathit{12,14,15})\rangle$$\}, where bold numbers indicate node IDs and slanted ones indicate times. ∎
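The discretization step can be sketched as follows. This is a minimal illustration under stated assumptions: time-codes are taken to be 1-based, the function names `time_code` and `weekly_time_code` are hypothetical, and the weekly cycle is only one possible cyclic encoding of the kind discussed above.

```python
from datetime import datetime, timedelta


def time_code(ts, origin, width_minutes):
    """1-based ID of the interval of `width_minutes` (counted from `origin`) containing `ts`."""
    return (ts - origin) // timedelta(minutes=width_minutes) + 1


def weekly_time_code(ts, width_minutes):
    """1-based interval ID within a cyclic week whose origin is Monday 00:00."""
    minutes = ts.weekday() * 24 * 60 + ts.hour * 60 + ts.minute
    return minutes // width_minutes + 1


origin = datetime(2020, 1, 6, 8, 0)  # an arbitrary Monday at 08:00, used only for this example
first = time_code(datetime(2020, 1, 6, 8, 3), origin, 5)
last = time_code(datetime(2020, 1, 6, 9, 17), origin, 5)
sunday_slot = weekly_time_code(datetime(2020, 1, 12, 23, 59), 30)
```

With 5-minute intervals between 08:00 and 09:20, `first` and `last` fall into the first and the sixteenth interval respectively; with the weekly encoding, every Sunday maps to the same range of codes regardless of the date.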
In CTR we represent both the spatial and the temporal component of the trips using well-known self-indexing structures in order to provide both a compact representation and the ability to perform fast indexed searches at query time. In Section 5 we focus on the spatial component and discuss how we adapted the CSA to deal with trips. We also show how we support spatial queries. Then, in Section 6 we show that the times, which are kept aligned with the spatial component of the trips, can be handled with a WT-based representation. Specifically, we study two alternatives (a WTHT and a WM) and show how temporal and spatio-temporal (Section 7) queries are supported by CTR.
5 Spatial component of
We use a CSA to represent the spatial component of our dataset of trips within CTR. Yet, we perform some preprocessing on before building a CSA on it. Initially, we sort the trips by their first node (), then by the last node (), then by the starting time (), and finally, by the second node (), third node (), and successive nodes (). Note that the start time () of the trip does not belong to the spatial component, but it is nevertheless used for the sorting. (This initial sorting of the trips will allow us to answer some useful queries very efficiently, e.g., counting trips starting at $X$ and ending at $Y$.)
Following with Example 4.1, after sorting the trips in with the criteria above, our sorted dataset would look like: : \{$$\langle(\mathbf{1,2,3}), , , , , , , , , , , (\mathit{12,14,15})\rangle$$\}. Note that appears before because during the sorting process we compare with ; that is, we compare the starting nodes ( and ) and then the ending nodes ( and ). If needed (not in this example) we would have also compared the slanted values ( and ) that are the starting times of the trips, and finally the rest of nodes ( and ). Similarly, the two trips containing nodes are sorted by the starting times ( and ).
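The sorting criteria can be expressed as a composite key. The sketch below uses hypothetical toy trips (they are not the trips of Example 4.1) and assumes each trip is a pair of tuples: node IDs and time-codes.

```python
def trip_sort_key(trip):
    """Order by first node, last node, starting time, and then the remaining nodes."""
    nodes, times = trip
    return (nodes[0], nodes[-1], times[0], tuple(nodes[1:]))


# Hypothetical trips: (node IDs, time-codes).
trips = [
    ((1, 2, 3), (5, 6, 7)),
    ((2, 3), (1, 2)),
    ((1, 4, 3), (2, 4, 5)),
    ((1, 2), (9, 10)),
]
sorted_trips = sorted(trips, key=trip_sort_key)
sorted_nodes = [nodes for nodes, _ in sorted_trips]
```

Here the two trips from node 1 to node 3 are ordered by their starting times (2 before 5), even though (1, 2, 3) would precede (1, 4, 3) in a purely spatial comparison.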
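In a second step, we enlarge all the trips with a fictitious terminator-node $\$_i$ (with $\$_i\prec\$_j,\ \forall i<j$). Terminator $\$_i$ is appended to the $i$-th sorted trip $\mathcal{T}^{s}_{i}$, which becomes $\mathcal{T}^{\prime}_{i}=\langle(s^{i}_{1},s^{i}_{2},\dots,s^{i}_{l_{i}},\mathbf{\$_{i}}),(t^{i}_{1},t^{i}_{2},\dots,t^{i}_{l_{i}},\mathbf{t^{i}_{1}})\rangle$. Note that the terminator is paired with the starting time $t^{i}_{1}$ of the trip.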
The next step involves concatenating the spatial components of all the enlarged trips and adding an extra trailing terminator $\$_0$ to obtain a sequence $S[1,n]$ (terminator $\$_0$ ends $S$, and $\$_0\prec\$_i\ \forall i\in[1,z]$). In the top part of Figure 3, we can see array $S$ together with the time sequences $Icode$ and $I$ ($I$ shows the original times).
Finally, we build a CSA on top of $S$ to obtain a self-indexed representation of the spatial component in CTR. Figure 3 depicts the structures $\Psi$ and $D$ used by the CSA built over $S$. There is also a vocabulary $V$ containing a \$ symbol and the different node IDs in lexicographic order.
Note that the use of different values \{i}S[i,n]S[j,n]A\mathsf{CSA}${i}$$V$.
Although they are not needed in , we show also suffix array and ’ for clarity reasons in Figure 3. contains the first entries of from a regular , whereas we introduced a small variation in for entries . For example, points to the first node of the first trip . and point to the second node. and point to the third node. and point to the ending \_{1}\mathsf{CSA}\Psi^{\prime}[2]=9A[9]=5\mathsf{CTR}\Psi[2]=8A[8]=1\PsiS$.
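Another interesting property arises from the use of a cyclical $\Psi$ on trips, and from using trip terminators. Since the first $z+1$ entries in $\Psi$ correspond to the \$ symbols that mark the end of each trip in $S$ ($\Psi[1]$ corresponds to $\$_0$), the $j^{th}$ node of the $i^{th}$ trip can be obtained as $V[rank_{1}(D,\Psi^{j}[i+1])]$, where $\Psi^{j}$ denotes $j$ successive applications of $\Psi$ (e.g. $\Psi^{3}[x]=\Psi[\Psi[\Psi[x]]]$). For example, in Figure 3 the trip terminators occupy $\Psi[2,7]$; terminator $\$_4$ is the $5^{th}$ entry because $\$_0$ (at $S[28]$) comes first. The first node of the fourth trip is therefore $V[rank_{1}(D,\Psi[4+1])]$: since $\Psi[5]=12$ and $rank_{1}(D,12)=3$, we obtain $V[3]=\mathbf{2}$ (that is, $S[A[12]]$). Applying $\Psi$ again, $\Psi[12]=16$ and $rank_{1}(D,16)=4$, so the second node is $V[4]=\mathbf{3}$, i.e. $V[rank_{1}(D,\Psi[\Psi[4+1]])]=\mathbf{3}$, and so on.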
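To make the cyclical construction concrete, the following sketch builds a suffix array, a cyclic version of Ψ, and the D and V structures for a toy dataset, and then walks Ψ to recover the nodes of a trip. It is an illustration under several assumptions: the input trips are hypothetical and already sorted, suffix sorting is naive, rows are 0-based (so terminator $_i sits at row i), node IDs are positive integers, and the function names are chosen here rather than taken from the original implementation.

```python
def build_spatial_index(sorted_trips):
    """Naive construction of Psi, D and V for trips given as tuples of positive node IDs."""
    S, first_pos, term_pos = [], [], []
    for i, nodes in enumerate(sorted_trips, start=1):
        first_pos.append(len(S))
        S.extend((1, v) for v in nodes)  # (1, v) encodes node v
        term_pos.append(len(S))
        S.append((0, i))                 # (0, i) encodes terminator $_i
    S.append((0, 0))                     # trailing terminator $_0
    n = len(S)
    A = sorted(range(n), key=lambda p: S[p:])  # quadratic suffix sorting, fine for a toy input
    inv = [0] * n
    for row, p in enumerate(A):
        inv[p] = row
    psi = [inv[(A[row] + 1) % n] for row in range(n)]
    for fp, tp in zip(first_pos, term_pos):
        psi[inv[tp]] = inv[fp]           # cyclic variation: $_i points to the first node of trip i
    sym = [0 if S[p][0] == 0 else S[p][1] for p in A]  # 0 stands for '$'
    V = sorted(set(sym))
    D = [1 if row == 0 or sym[row] != sym[row - 1] else 0 for row in range(n)]
    return psi, D, V


def rank1(D, row):
    """Number of ones in D[0..row] (inclusive)."""
    return sum(D[:row + 1])


def trip_node(psi, D, V, i, j):
    """j-th node (1-based) of the i-th trip (1-based), applying Psi j times from $_i."""
    row = i  # $_i sits at 0-based row i because $_0 occupies row 0
    for _ in range(j):
        row = psi[row]
    return V[rank1(D, row) - 1]


def starts_with(psi, D, V, z, x):
    """Trips starting at node x: terminator rows whose Psi value falls in the range of x.
    A linear scan is used here; the real structure would binary search instead."""
    c = V.index(x)
    ones = [row for row, bit in enumerate(D) if bit] + [len(D)]
    lo, hi = ones[c], ones[c + 1] - 1
    return sum(1 for row in range(1, z + 1) if lo <= psi[row] <= hi)


trips = [(1, 2, 3), (1, 4, 3), (2, 3)]  # hypothetical, already sorted
psi, D, V = build_spatial_index(trips)
second_trip = [trip_node(psi, D, V, 2, j) for j in (1, 2, 3)]
```

Because every terminator points back to the first node of its own trip, a trip can be traversed, and a pattern can wrap around its terminator, without storing S or A.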
Regarding the space requirements of the CSA in CTR, we can expect good compressibility due to the structure of the network and the fact that trips that start in a given node, or simply those going through that node, will probably share the same sequence of “next” nodes. This will lead to many runs in $\Psi$ [43] and, consequently, good compression.
5.1 Dealing with Spatial Queries
With the structure described for representing the spatial component of the trips, the following queries can be solved.
- •
Number of trips starting at node $X$ (starts-with-x).
Because $\Psi$ was cyclically built in such a way that every \$ symbol is followed by the first node of its trip, this query is solved by $[l,r]\leftarrow bsearch(\$X)$ over the $\mathsf{CSA}$. The resulting range for $\$X$ lies within $\Psi[2,z+1]$, and $r-l+1$ is the number of trips starting at $X$.
- •
Number of trips ending at node $X$ (ends-with-x). In a similar way to the previous query, this one can be answered with $bsearch(X\$)$.
- •
Number of trips starting at $X$ and ending at $Y$ (from-x-to-y). Combining both ideas from above, and thanks to the cyclical construction of $\Psi$, this query is solved using $bsearch(Y\$X)$.
- •
Number of trips using node (uses-x). Even though we could solve this query with , it is more efficient to solve it by directly operating on . Assuming that is at position in the vocabulary of (), its total frequency is obtained by . If is the last entry in , we set .
- •
Top-k most used nodes (top-k). We provide two possible solutions for this query named: sequential and binary-partition approaches.
- –
To return the most used nodes using sequential approach (top-k-seq). The idea is to apply operation sequentially for every node from to to compute the frequency of each node and to return the nodes with highest frequency. We use a min-heap that is initialized with the first nodes, and for every node from to , we compare its frequency with that of the minimum node (the root) from the heap. In case the frequency of is higher, the root of the heap is replaced by and then moved down to comply with the heap ordering. At the end of the process, the heap will contain the top-k most used nodes , which can be sorted with the heapsort algorithm if needed. Finally, we return . Note that this approach always performs operations on .
- –
The binary-partition (top-k-bin) approach takes advantage of a skewed distribution of frequency of the nodes that trips traverse. Working over and , we recursively split into two segments after each iteration. If possible, we leave the same number of different nodes in each side of the partition. Initially, we start considering the range in which corresponds to the nodes that appear in from positions to .888We skip the \$$ at the first entry of VDD[1,select_{1}(D,2)-1]Q\leftarrow(\langle i,j\rangle,\langle l,r\rangle)m=i+\frac{j-i+1}{2}q=select_{1}(D,m)D[l,q-1]D[q,r]V[i,m-1]V[m,j]Q$. The pseudocode can be found in Figure 4.
The priority of each segment in is directly the size of its range in (). When a segment extracted from represents the instance of only one node (, with ), that node is returned as a result of the top-k algorithm (we return ). The algorithm stops when the first nodes are found.
For example, when searching for the top-1 most used nodes in the example from Figure 3, is initialized with the segment , corresponding to nodes from 1 to 10 (positions from 2 to 11 in ). Note that the entries of from 1 to 7 and represent the \$$ symbol. Since it is not an actual node, it must be skipped. Then [8,28][8,20]V[2,6][21,28]V[7,11](\langle 3,3\rangle,\langle 14,18\rangle)[14,18]4V\mathbf{3}=V[4]5=18-14$.
- •
Top-k most used nodes to start a trip (top-k-starts). Both top-k approaches above can be adapted for answering top-k-starts. However, unlike its simpler variant, it requires performing bsearch(\X)\PsiselectD$) at each iteration, hence increasing the temporal complexity of the operation.
The implementation of the linear approach is straightforward. The binary-partition approach differs slightly from the algorithm in Figure 4: in (l.2) we insert into , and we replace (l.11) with [x,y]~{}\leftarrow~{}bsearch(\V[m])q\leftarrow x$.
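The two top-k strategies can be sketched over a toy D and V as follows. This is an illustration, not the pseudocode of Figure 4: select on D is simulated with a precomputed list of the positions of its ones (plus a sentinel), indices are 0-based, the frequencies are hypothetical, and both functions return (frequency, node) pairs.

```python
import heapq

# Hypothetical vocabulary with the number of occurrences of each symbol ('$' first).
counts = [("$", 2), (1, 2), (2, 5), (3, 3), (4, 1)]
V = [symbol for symbol, _ in counts]
D = [bit for _, freq in counts for bit in [1] + [0] * (freq - 1)]


def boundaries(D):
    """bounds[c] is the row where the c-th vocabulary entry starts (a stand-in for select)."""
    return [row for row, bit in enumerate(D) if bit] + [len(D)]


def top_k_seq(D, V, k):
    """Sequential approach: one frequency lookup per node, keeping the k best in a min-heap."""
    bounds = boundaries(D)
    heap = []
    for c in range(1, len(V)):  # skip '$' at V[0]
        freq = bounds[c + 1] - bounds[c]
        if len(heap) < k:
            heapq.heappush(heap, (freq, V[c]))
        elif freq > heap[0][0]:
            heapq.heapreplace(heap, (freq, V[c]))
    return sorted(heap, reverse=True)


def top_k_bin(D, V, k):
    """Binary-partition approach: always split the segment covering the most rows."""
    bounds = boundaries(D)
    # Each queue entry is (-size, i, j): vocabulary entries i..j and the rows they cover.
    queue = [(-(bounds[len(V)] - bounds[1]), 1, len(V) - 1)]
    result = []
    while queue and len(result) < k:
        neg_size, i, j = heapq.heappop(queue)
        if i == j:  # a single node: no remaining segment can hold a more frequent one
            result.append((-neg_size, V[i]))
            continue
        m = i + (j - i + 1) // 2
        heapq.heappush(queue, (-(bounds[m] - bounds[i]), i, m - 1))
        heapq.heappush(queue, (-(bounds[j + 1] - bounds[m]), m, j))
    return result


seq_result = top_k_seq(D, V, 2)
bin_result = top_k_bin(D, V, 2)
```

With a skewed frequency distribution, the binary-partition variant can stop after inspecting only a few segments, whereas the sequential variant always visits every node.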
5.2 Implementation details
In our implementation of CTR, we used the CSA from [44] (http://vios.dc.fi.udc.es/indexing), briefly discussed in Section 2.3.1. Yet, we introduced some small modifications:
- •
The construction of the Suffix Array is done with the SA-IS algorithm [64] (https://sites.google.com/site/yuta256/sais). In comparison with the qsufsort algorithm [65] (http://www.larsson.dogma.net/research.html) used in the original CSA, it achieves linear-time construction and lower extra working space.
- •
In , they used a plain representation for bitvector and additional structures to support in constant time using ( bits). With that structure, they could solve in time (yet they did not actually need to solve in ). In our , we have used the SDArray from [66] to represent . It provides very good compression for sparse bitvectors, as well as constant-time operation.
- •
In [44], operation was implemented with a simple binary search over rather than using the backward-search optimization proposed in the original [15]. In our experiments, we used backward search since it led to a much lower performance degradation at query time when a sparse sampling of was used.
6 Temporal component of
In this section we focus on the temporal component associated with each node of the enlarged trips in our dataset. Recall that in Figure 3, sequence contains the time associated with each node in a trip, and a possible encoding of times. In we focus on the values in , yet, since is not kept anymore in , we reorganize the values in to keep them aligned with rather than with . Those values are represented within array in Figure 3. For example, we can see that corresponds with , corresponds with , and so on.
Aiming at having a compact representation of while permitting fast access and resolution of range-based queries (that we could use to search for trips within a given time interval), we have considered two -based alternatives from the ones presented in Section 2.3.2:
- •
A Wavelet Tree [16] using variable-length Hu-Tucker codes [53] (WTHT). Recall this is the variant that permits compressing the original symbols with variable-length codes and still supports the count operation in time. Since Hu-Tucker coding assigns shorter codes to the most frequent symbols, the compression of our WTHT is highly dependent on the distribution of frequencies of the values in . Yet, if our trips represent movements of single users in a transportation network, we could expect to observe two or more periods corresponding to rush hours within a single day. This would lead to a skewed distribution of the frequencies of the symbols in , and consequently, we could expect better compression than if we used a balanced WT. The expected number of bits of our WTHT is .
- •
A balanced Wavelet Matrix () [38]. As we showed in Section 2.3.2 the is typically the most compact uncompressed variant of and it is faster than a pointerless . This is the reason why we chose a balanced instead of a balanced as this second alternative. Recall that, contains symbols, and each symbol can be encoded with bits, hence the balanced will be a matrix of bits.
In Figure 5, we show both the and the built on top of from Figure 3. The binary code-assignment to the source symbols in and that obtained after applying Hu-Tucker encoding algorithm [53] are also included in the figure.
6.1 Dealing with Temporal queries
With either one of the described alternatives ( or ) to represent time intervals we can answer the following pure temporal queries:
- •
Number of trips starting during the time interval (starts-t). Since we keep the starting time of each trip within , we can efficiently solve this query by simply computing .
- •
Total usage of network stops during the time interval (uses-t). This query can be seen as the sum of the number of trips that traversed each network node during . We can solve this query by computing .
- •
Number of trips performed during the time interval (trips-t). This is also an interesting query that permits to know the actual network usage during a time interval. To solve this query we could compute trips-t by subtracting the number of trips that started after (starts-t) and the number of trips that ended before (ends-t) from the total number of trips (). However, recall that has the starting time of each trip, but we do not keep their ending time. We could solve ends-t by taking the first node () of each trip starting before , then applying until reaching the ending node (), and finally getting the ending time of that trip associated to node . However, this would be rather inefficient. A possible solution to efficiently solve ends-t, would require to increment our temporal component, in parallel with , with another -based representation of the ending times for our trips. This would permit to report the number of trips ending before as , but would increase the overall size of . Yet, note that even without keeping ending-times, we could provide rather accurate estimations of trips-t for a system administrator. For example, using uses-t to compute the number of times each trip went through any node during the time interval , and dividing that value by the average nodes per trip. Another good estimation can also be obtained with starts-t.
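The temporal queries above reduce to range counts over the time sequence aligned with Ψ. The sketch below is only illustrative: `count` is a naive linear scan standing in for the WTHT or WM operation, positions are 0-based, the layout (one entry for $_0, then the z starting times, then the node times) is an assumption based on the description of the $ area, and the data is hypothetical.

```python
def count(seq, l, r, t1, t2):
    """Naive stand-in for the WT/WM count: entries of seq[l..r] (inclusive) inside [t1, t2]."""
    return sum(1 for t in seq[l:r + 1] if t1 <= t <= t2)


def starts_t(icode_psi, z, t1, t2):
    """Trips starting within [t1, t2]: count over the z entries that hold starting times."""
    return count(icode_psi, 1, z, t1, t2)


def uses_t(icode_psi, z, t1, t2):
    """Node visits within [t1, t2]: count over the entries aligned with actual nodes."""
    return count(icode_psi, z + 1, len(icode_psi) - 1, t1, t2)


def estimate_trips_t(icode_psi, z, t1, t2):
    """Rough trips-t estimate: node visits divided by the average number of nodes per trip."""
    nodes_per_trip = (len(icode_psi) - (z + 1)) / z
    return uses_t(icode_psi, z, t1, t2) / nodes_per_trip


# Hypothetical layout: [placeholder for $_0] + starting times of 3 trips + times of their 8 nodes.
icode_psi = [0, 1, 2, 6, 1, 2, 3, 2, 4, 5, 6, 7]
z = 3
started = starts_t(icode_psi, z, 1, 2)
visits = uses_t(icode_psi, z, 2, 4)
estimate = estimate_trips_t(icode_psi, z, 2, 4)
```

The estimate is deliberately coarse: it assumes that all trips contribute an average number of node visits to the interval, which is the approximation suggested in the text when ending times are not stored.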
6.2 Implementation details
We include here details regarding how we tune our WTHT and WM. As we discussed in Section 2.3.2, both WTHT and WM are built over bitvectors that require support for rank and select operations. In our implementations we included two alternative bitvector representations available in the libcds library (https://github.com/fclaude/libcds):
- •
A plain bitvector based on [42] named RG with additional structures to support in constant time ( in logarithmic time). RG includes a sampling parameter () that we set to value . In this case, our bitvector RG uses bits. That is, we tune RG to use a sparse sampling.
- •
A compressed RRR bitvector [49]. The RRR implementation includes a sampling parameter that we tune to values , , and . Higher sampling values achieve better compression.
From here on, when presenting results for WTHT and WM, we will consider the four bitvector configurations above. Regarding our implementations of WTHT and WM, note that we reused the implementation of WM from [38], and we created our custom WTHT implementation, paying special attention to solving count efficiently.
6.3 Comparing the space/time trade-off of and
In order to compare the efficiency of our WTHT (that uses variable-length codes and supports count efficiently) with a balanced WM alternative under different time distributions (recall that the WM is invariant to the time distribution), we ran some experiments that evaluate the average time to execute the count operation on both representations.
We used a dataset of generated trips (Refer to Section 8.1 for the details about Madrid dataset) and we generated three kinds of time distributions for our evaluation. We refer to them as: uniform, skewed and very skewed. They are shown in Figure 6. According to the total number of passengers in a day, in the uniform distribution passengers use the network for each 5-minute interval. We also generated a skewed distribution for the time interval frequencies in an effort to model the usage of a public transportation network in a regular working day, where the starting time of a trip is generated according to the following rules:
- •
With 30% of probability, a trip occurs during a morning rush hour.
- •
With 45% of probability, a trip occurs in an evening rush hour.
- •
With 5% of probability, a trip occurs during lunch rush hour.
- •
The remaining 20% of probability is associated to unclassified trips, starting at a random hour of the day, which may also fall into one of the three previous periods discussed.
In the very skewed distribution we increase the rush-hour probabilities with 40% for the morning rush hour, 50% for the evening rush hour, 8% for lunch period and only 2% of random movements.
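A generator following these rules could look like the sketch below. The probabilities are the ones listed above; the rush-hour windows are not given in the text, so the hours in `WINDOWS` are purely hypothetical placeholders, as are the function and variable names.

```python
import random

# Hypothetical rush-hour windows as (start hour, end hour); the text gives only the probabilities.
WINDOWS = {"morning": (7, 9), "evening": (17, 20), "lunch": (13, 15)}

SKEWED = [("morning", 0.30), ("evening", 0.45), ("lunch", 0.05), ("any", 0.20)]
VERY_SKEWED = [("morning", 0.40), ("evening", 0.50), ("lunch", 0.08), ("any", 0.02)]


def sample_start_minute(mix, rng):
    """Minute of the day (0..1439) at which a synthetic trip starts."""
    labels = [name for name, _ in mix]
    weights = [prob for _, prob in mix]
    label = rng.choices(labels, weights=weights, k=1)[0]
    # Unclassified trips ("any") may start at any time, including inside a rush-hour window.
    start_hour, end_hour = WINDOWS.get(label, (0, 24))
    return rng.randrange(start_hour * 60, end_hour * 60)


rng = random.Random(7)
samples = [sample_start_minute(SKEWED, rng) for _ in range(1000)]
```

Dividing each sampled minute by the interval width (5 or 30) then yields the time-interval ID used by the temporal structures.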
Then we built the and the considering two different granularities for the discretization of times: five-minute and thirty-minute intervals. Then, we generated random intervals of times over the whole time sequence of the dataset considering interval widths of five minutes, one hour, and six hours. Finally, we run queries (we show average times) from each query set over the six configurations of and (2 different granularities for the time discretization and 3 datasets).
In Figure 7, we show the results of our experiments. In the upper part of the figure, we include the results for WTHT and WM built over the times assuming a uniform frequency distribution. In the middle part we assume times follow the skewed distribution, and in the bottom of the figure we show results when considering a very skewed distribution. Moreover, plots in the left column show results for our structures when a 5-minute granularity is chosen for the discretization of times, whereas plots in the right column assume a time granularity of 30 minutes. For each scenario we include plots wtht:5-min, wtht:1-hour, and wtht:6-hour for WTHT (the range width for count is respectively 5 minutes, 1 hour, and 6 hours). We also present those plots for WM (wm:5-min, wm:1-hour, and wm:6-hour).
The baseline used for the space usage (x-axis) is the size of an array of fixed-length time-interval IDs represented with the least number of bits needed (12 bits and 9 bits respectively for 5-minute and 30-minute granularity, see Section 8.1).
When times are uniformly distributed, our can only exploit the redundancy introduced by the \$$ symbols. This fact permits \mathsf{WTHT}96-98%RG\mathsf{WM}104%RRR_{128}RRR_{64}RRR_{32}RGRRR\mathsf{WM}\mathsf{WM}\mathsf{WTHT}RRR$ clearly slows down queries.
A skewed distribution favors compression for a statistical coder like Hu-Tucker, which explains the higher compression obtained. However, it also slightly increases the query times, especially in the wider one-hour and six-hour query sets. This happens because the probability of having a query that forces a complete descent down to the leaves of the WTHT increases.
For a very skewed distribution, the gap in compression between and increases clearly (around percentage points), whereas query times remain similar to those in the previous scenario.
As a conclusion of the experiments discussed in this section, we have shown that the distribution of the sequence of times can be exploited by our WTHT to achieve better compression, and even better query times, than its balanced counterpart.
7 Dealing with Spatio-temporal queries
Apart from the pure spatial and temporal queries discussed in the previous sections, we can combine both the self-indexed spatial and temporal components from to answer spatio-temporal queries. The idea is to restrict spatial queries to a time interval . An example of this type of query is to return the number of trips starting at node that occurred between and , which we can solve by first finding the range in the of the trips starting in and then relying on the operation in the (or ). The following spatio-temporal queries can be solved by :
- •
Number of trips starting at node during time interval (starts-with-x).
Recall that in the time sequence we also included timestamps associated with the area of \$$-symbols in \Psi$[l,r]{}\leftarrow{}bsearch($X)\Psi[l,r][l,r]\subseteq[2,z+1]$XIcode^{\Psi}\mathsf{WTHT}\mathsf{WM}\Psi[t_{1},t_{2}]count(l,r,t_{1},t_{2})$. In Figure 8 (steps ① and ②) we can see the steps involved.
- •
*Number of trips ending at node during the time interval (ends-with-x). * As above, we initially perform the spatial query [l,r]~{}\leftarrow~{}bsearch(X\)\PsiX$Xcount(l,r,t_{1},t_{2})$ operation to count how many of those trips match the temporal constraint. See steps ③ and ④ in Figure 8.
- •
Number of trips using node during the time interval (uses-x). As in the corresponding spatial query, the range in is obtained with two operations on . Finally, finds the occurrences within the time interval . See steps ⑤ and ⑥ in Figure 8.
- •
Number of trips starting at $X$ and ending at $Y$ occurring during time interval $[t_1,t_2]$ (from-x-to-y). We consider two different semantics. A query with strong semantics obtains trips that start and end within $[t_1,t_2]$, whereas a query with weak semantics obtains trips whose time intervals overlap $[t_1,t_2]$ and which, therefore, could actually start before $t_1$ or end after $t_2$.
In Figure 9, we show the step-by-step process to solve this type of queries. As in a spatial query, we start by searching for the range [l,r]\leftarrow bsearch(Y\X)\PsiYXY$X\Psi[l,r][\alpha,\beta]$XY\Psi\alpha\leftarrow\Psi[l],\beta\leftarrow\alpha+r-l$XYY$X$.
At this point, since was aligned with , we could check ending-time constraints within and starting-time constraints within (recall we keep starting times associated with the corresponding \$$ of each trip). Note also that, due to our sorting (by starting-node, ending-node, starting-time,\dotsIcode^{\Psi}[\alpha,\beta]XY[\alpha^{\prime},\beta^{\prime}]\subseteq[\alpha,\beta][t_{1},t_{2}]countLRin Figure [9](#S7.F9). Thus, assuming thatIcode^{\Psi}[\alpha,\beta][\alpha^{\prime},\beta^{\prime}]\leftarrow countLR(\alpha,\beta,t_{1},t_{2})[\alpha^{\prime},\beta^{\prime}]\subseteq[\alpha,\beta]\alpha^{\prime}=argmin_{x}(Icode^{\Psi}[x]\geq t_{1})\beta^{\prime}=argmax_{x}(Icode^{\Psi}[x]\leq t_{2})$.
Using a , a simple way to implement consists in performing two binary searches within to find , where at each step we would use operation. This would cost . Yet, we could also regard on operation to obtain a more efficient and also rather straightforward implementation of so that we set and . It costs .
- –
Strong semantics (from-x-to-y-strong). Note that the subrange (containing trips starting within ) has a matching subrange (step-④), where some of the ending times of these trips will fall inside (this allows us to check the ending time constraint). By performing we get the final result (step-⑤). To sum up, answering this query requires: one over (to find ), one to to obtain (), one to find , and one (to count the valid ending times in ).
- –
Weak semantics (from-x-to-y-weak). The size of is already a partial answer. To get the final result, we need to add also the occurrences of those trips starting before that end at or later. To do so, if , we need to compute . This gives us the number of time instants in the range of that fall inside . That is, ending times equal or after .
- •
Top-k most used nodes during time interval (top-k). Both the sequential and binary-partition approaches discussed in Section 5.1 can easily be extended to support this query. The idea is that, when we add a node either to the min-heap or priority-queue respectively, we compute its frequency within time interval (using operation) rather than using its overall frequency.
- –
In the sequential approach (top-k-seq), given a node whose corresponding range in is , we compute its frequency using instead of simply using . The rest of the process is exactly as discussed for the pure spatial top-k-seq query.
- –
In the binary-partition approach (top-k-bin), we have to consider the priority of a given segment as the number of trips covered by that segment that occurred during . Again, given a segment in we compute that priority as instead of . Apart from that, the only modifications that we must consider over the pure spatial top-k-bin algorithm in Figure 4 are: we replace (l.2) by ; , and we replace (l.12) and (l.13) respectively by and .
- •
Top-k most used nodes to start a trip during time interval (top-k-starts). Following the same guidelines discussed above for top-k, adapting the sequential and binary-partition solutions for the spatial top-k-starts to include temporal constraints is straightforward.
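The role of countLR and the two semantics of from-x-to-y can be illustrated on plain arrays. This is a simplified sketch, not the CTR procedure itself: it assumes the starting and ending time-codes of the trips from X to Y are available as two aligned Python lists sorted by starting time, uses binary search from the standard `bisect` module in place of operations on the WTHT or WM, and uses hypothetical data and function names.

```python
from bisect import bisect_left, bisect_right


def count_lr(starts, alpha, beta, t1, t2):
    """Sub-range [a, b] of starts[alpha..beta] (sorted) with values in [t1, t2]; b < a if empty."""
    a = bisect_left(starts, t1, alpha, beta + 1)
    b = bisect_right(starts, t2, alpha, beta + 1) - 1
    return a, b


def from_x_to_y(starts, ends, t1, t2):
    """Return (strong, weak) counts for trips given by aligned start and end time-codes."""
    a, b = count_lr(starts, 0, len(starts) - 1, t1, t2)
    # Strong semantics: the trip starts and ends inside [t1, t2].
    strong = sum(1 for e in ends[a:b + 1] if t1 <= e <= t2)
    # Weak semantics: trips starting inside [t1, t2], plus earlier trips still active at t1.
    weak = (b - a + 1) + sum(1 for e in ends[:a] if e >= t1)
    return strong, weak


# Hypothetical trips from X to Y, sorted by starting time.
starts = [1, 3, 5, 5, 8, 12]
ends = [4, 6, 7, 11, 9, 14]
sub_range = count_lr(starts, 0, len(starts) - 1, 5, 9)
strong, weak = from_x_to_y(starts, ends, 5, 9)
```

The binary searches are valid only because the trips sharing the same first and last node are sorted by starting time, which is exactly what the initial sorting of the dataset guarantees.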
8 Experimental evaluation
We have run experiments to evaluate both the space requirements and the query-time performance of CTR when dealing with spatial, temporal and spatio-temporal queries over two different datasets (Porto and Madrid) that are described in Section 8.1.
We have used several configurations of CTR by tuning both its spatial and temporal components. In the spatial part, we set the sampling parameter () to the values . For the temporal component, we have tested both the balanced WM and the Hu-Tucker-shaped WT (WTHT) using the same bitvector configurations discussed in Section 6.3. That is, using either a plain bitvector with a sparse sampling ($RG$), or an RRR bitvector with sampling parameter 32, 64, or 128 ($RRR_{32}$, $RRR_{64}$, and $RRR_{128}$).
8.1 Experimental datasets
We used two different datasets of trips in our experiments:
- •
Madrid dataset: Using GTFS data from the public transportation network of Madrid, we generated a dataset of synthetic trips combining the subway network with the Spanish commuter rail system (called cercanías). GTFS is a well-known specification for representing an urban transportation network (see https://developers.google.com/transit/gtfs/reference?hl=en); the data come from the EMT corporation at https://www.emtmadrid.es/movilidad20/googlet.html. In total, there are different stations/nodes from lines.
We generated million trips with lengths varying from to nodes traversed. Those lengths follow a binomial distribution. The average length of the trips is nodes.
In the generation of a trip of length , we randomly choose a starting node from a line, and the starting direction. Then, we follow that line until we reach a switching node. At this node, we decide whether to follow the current line or to switch to a new line. We allow only up to four line switches for a given trip, and use fixed probability values to decide whether to switch line or not. Such probability is , , , and respectively for the first, second, third, and fourth line switch in a trip. We also avoid revisiting nodes in the same trip. The generation process ends when nodes have been added to the trip, or a dead end is reached.
As a baseline, the plain representation of the generated trips using a -bit integer () for every node-ID (and $ separator) would require 137.47 MiB.
We also generated synthetic times for those trips following the same rules used to create the time distribution named skewed in Figure 6, so most of the trip timestamps belong to rush hours. Yet, instead of using only regular working days, we distinguished four kinds of days in a week: regular working days; Fridays and holiday eves; Saturdays; and Sundays and holidays. We also assume that there are two kinds of weeks related to high and low season periods. Therefore, a time interval may belong to eight types of day. When discretized at five-minute intervals we obtain distinct time intervals, while when we use thirty-minute intervals we obtain . In the former case, our baseline for the generated times using bits per time-ID would occupy MiB. In the latter one, each time-ID requires bits and the temporal baseline requires MiB.
- •
Porto dataset: We downloaded a collection of trajectories from the city of Porto corresponding to taxi trips during a full year (from July 1, 2013 to June 30, 2014), provided by [67] (description at http://www.geolink.pt/ecmlpkdd2015-challenge/dataset.html; download at https://archive.ics.uci.edu/ml/machine-learning-databases/00339/train.csv.zip). Among other fields, those data include, for each taxi ride, a list of GPS coordinates and times gathered every seconds of the trip. We adapted such data to our needs by using a map matching algorithm provided by the Graphhopper library (https://github.com/graphhopper/map-matching) and OpenStreetMap cartography (http://www.openstreetmap.org/). This permitted us to figure out the streets that trips were passing through. Finally, trips were encoded as a sequence of identifiers corresponding to adjacent stretches of street (that is, basic street segments with no intersections) the trip traversed, each one of them tagged with a timestamp.
After filtering incomplete matches, trips, built over distinct street segments, were used for the dataset. Due to the nature of the network and the trips, the average number of street segments per trip is ; that is, the trips are longer than in the Madrid dataset. Since we needed bits to represent each segment in a trip, the total size of our plain spatial baseline is MiB.
For the temporal part, we considered only one kind of day. Therefore, when we sample those hours into five-minute intervals, we obtain distinct time intervals that are given a -bit time-ID. Consequently the overall size of the temporal baseline becomes MiB. However, if we split those hours into thirty-minute intervals, only time intervals arise. In this case, each time-ID needs only bits and the total size of the temporal baseline is MiB. The average number of daily passengers for each time interval is shown in Figure 10.
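The synthetic trip generation described for the Madrid dataset can be sketched as follows. This is a minimal sketch under stated assumptions: the network model (a dictionary of lines, each an ordered list of node IDs), the toy network, and the values in `SWITCH_PROB` are placeholders chosen for illustration and are not the values used in the experiments.

```python
import random

# Placeholder probabilities for the 1st..4th line switch (assumed values).
SWITCH_PROB = [0.5, 0.4, 0.3, 0.2]

def generate_trip(lines, length, rng):
    """lines: dict line_id -> list of node IDs in line order (assumed model)."""
    # Map every node to the (line, position) pairs where it appears.
    where = {}
    for line_id, nodes in lines.items():
        for pos, node in enumerate(nodes):
            where.setdefault(node, []).append((line_id, pos))

    line_id = rng.choice(sorted(lines))       # random starting line
    pos = rng.randrange(len(lines[line_id]))  # random starting node on that line
    step = rng.choice((-1, 1))                # random starting direction
    trip, visited, switches = [], set(), 0

    while len(trip) < length:
        node = lines[line_id][pos]
        if node in visited:
            break                             # revisits are not allowed: dead end
        trip.append(node)
        visited.add(node)
        # A switching node belongs to more than one line.
        options = [(l, p) for (l, p) in where[node] if l != line_id]
        if (options and switches < len(SWITCH_PROB)
                and rng.random() < SWITCH_PROB[switches]):
            line_id, pos = rng.choice(options)
            step = rng.choice((-1, 1))
            switches += 1
        pos += step
        if not 0 <= pos < len(lines[line_id]):
            break                             # end of the line: dead end
    return trip

# Toy network (assumed): nodes 3, 4 and 7 are switching nodes.
toy_lines = {"L1": [1, 2, 3, 4, 5], "L2": [6, 3, 7, 8], "L3": [9, 7, 10, 4]}
print(generate_trip(toy_lines, 6, random.Random(0)))
```

Capping the number of switches at `len(SWITCH_PROB)` mirrors the limit of four line switches per trip, and the `visited` set enforces that no node is revisited.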
8.2 Space Requirements of
We show the compression obtained by when built on our two test datasets. Compression is shown as the percentage of the size of the plain baselines discussed above. Using different configurations of , we will show the compression of the spatial component (), that of the temporal component ( and ), and finally the overall compression of .
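The arithmetic behind the plain baselines and the reported percentages can be sketched as follows. The sketch assumes that IDs are packed with the minimum fixed number of bits and that a day spans a full 24 hours; the function names and the example figures are illustrative, not taken from the experiments.

```python
def bits_per_id(num_distinct):
    """Fixed-width bits needed to distinguish num_distinct IDs (assumes bit packing)."""
    return max(1, (num_distinct - 1).bit_length())

def baseline_mib(num_entries, num_distinct):
    """Size in MiB of a plain array with one fixed-width ID per entry."""
    return num_entries * bits_per_id(num_distinct) / 8 / 2**20

def time_id(day_type, seconds_since_midnight, minutes_per_interval):
    """Discretized time-ID; day_type is 0-based (e.g. 0..7 for eight kinds of day)."""
    intervals_per_day = 24 * 60 // minutes_per_interval
    return day_type * intervals_per_day + seconds_since_midnight // (60 * minutes_per_interval)

def compression_pct(compressed_bytes, baseline_bytes):
    """Compressed size as a percentage of the plain baseline."""
    return 100.0 * compressed_bytes / baseline_bytes

# One kind of day at 5-minute granularity gives 288 intervals, hence 9 bits per time-ID.
print(bits_per_id(24 * 60 // 5))
# Eight kinds of day at 30-minute granularity give 384 intervals, hence 9 bits as well.
print(bits_per_id(8 * 24 * 60 // 30))
```

With this convention a value below 100% means that the structure is smaller than the plain baseline, and a value above 100% means that it is larger.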
Results regarding the compression obtained by are given in Table 1. The compression ratio is calculated over a plain spatial-only (stop-IDs or street-segment-IDs in each case) representation. In a rather dense configuration of with we obtain compression ratios around % and % for Madrid and Porto datasets respectively. Those results are notable given that the baseline representations were only using respectively -bits per node (Madrid) and -bits per segment (Porto). As expected, compression improves as we increase the sampling parameter . We show that by tuning in a more sparse setup we can almost halve the space needs of using . Yet, the resulting would become much slower as we will see in the next section. In general, we can see that obtains better compression in Porto than in Madrid. This is probably due to the longer and more predictable trips: it is not common to arrive at an intersection having more than two valid street links to navigate to.
In Table 2, we focus on the space needed by the temporal component of . In this case we show the compression ratios obtained by and considering that time is either discretized into -min or -min intervals. Recall that the size of the plain baseline representations differs depending on the discretization period. Both and were tuned by using bitvector representations , , , and .
It is interesting to see that in the synthetic dataset from Madrid, bitvectors always lead to a better compression than the plain , while in the real dataset from Porto that is never the case. In some cases does not compress the times at all. Consequently, for Porto dataset, the faster plain bitvectors are probably the best choice. In Madrid dataset, we can see an actual space/time trade-off: obtains better compression but will be slower (as we will see in the next section).
To understand why is much more effective in Madrid than in Porto, recall that the values in are aligned to the suffix Array (). Recall also that, within the range in corresponding to each node , suffixes are sorted by ending node, then by starting time, and finally by the remaining nodes. Therefore, at least all the trips that start in node and end in the same node are sorted by time, and consequently, the corresponding range in keeps non-decreasing values. We have measured the number of times that the “starting-time” component was used during the suffix-sort step from the construction of . We obtained that, for Madrid, it was used times, while for Porto, it was only used times (note that this is not the number of repeated trips from to , but the number of times the “starting-times” were actually compared during suffix-sort). Since the number of entries in is rather similar in both datasets (around in Madrid, and in Porto) we could expect that the sequence of is much more regular in Madrid than in Porto. Consequently, this could explain why performs much better in Madrid than in Porto.
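The effect of the sort order on the regularity of the time sequence can be illustrated with a toy example. This is only a sketch: the tuples `(start_node, end_node, start_time)` and the run-counting measure are assumptions introduced here as a rough proxy for regularity, not the measurement used in the experiments.

```python
def count_runs(seq):
    """Number of maximal non-decreasing runs in seq (fewer runs = more regular)."""
    if not seq:
        return 0
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if b < a)

# Toy trips (assumed): all start at node X and end at A or B.
trips = [("X", "A", 30), ("X", "B", 10), ("X", "A", 5), ("X", "B", 20), ("X", "A", 12)]

# Within the range of a node, order by ending node and then by starting time.
ordered = sorted(trips, key=lambda t: (t[0], t[1], t[2]))

times_unsorted = [t[2] for t in trips]
times_sorted = [t[2] for t in ordered]
print(count_runs(times_unsorted), count_runs(times_sorted))
```

In the sorted order each (start, end) group contributes a single non-decreasing run, so the more often the starting time takes part in the ordering, the longer the runs tend to be.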
Finally, in Table 3, we show the overall compression ratios of . We use the same configurations for and as in Table 2, and both the most dense and sparse tuning of ( and respectively). For Madrid dataset, the pair (node,timestamp) is represented with bits in our baseline representation when time is discretized into -minute intervals, and with when we use -minute intervals. In the case of Porto dataset, when using -minute intervals, each pair (node,timestamp) from the baseline requires bits. If discretization considers -minute intervals, the baseline requires bits. We can see that the overall compression of in Madrid dataset ranges between and . Also we show that Porto dataset is much more compressible, obtaining compression ratios from around to .
8.3 Performance at query time
Through this section, we evaluate the time performance of when solving spatial, temporal, and spatio-temporal queries. We have randomly generated query patterns from our two datasets for each type of query. Each time measurement presented below is the average execution time of runs using the corresponding query patterns, except for the top-k queries where we perform runs of the top-k algorithms with .
Our test machine has an Intel(R) Core(tm) [email protected] CPU (4 cores/4 siblings) and 8GB of DDR3 RAM. It runs Ubuntu Linux 16.04 (Kernel 4.4.0-21-generic). The compiler used was g++ version 5.4.0 and we set compiler optimization flags to . All our experiments run in a single core and time measures refer to CPU user-time.
During the generation of query patterns, for those queries involving only one node from the network, we have randomly chosen times from the available network nodes. This is the case of the query patterns used both for the spatial queries starts-with-x, ends-with-x, and uses-x and for the spatio-temporal starts-with-x, ends-with-x, and uses-x. In the case of the spatial from-x-to-y and the spatio-temporal from-x-to-y-strong and from-x-to-y-weak, the pairs of network nodes that compose our query patterns were generated by randomly choosing trips and then extracting the initial and ending nodes of those trips.
Moreover, we also generated the time intervals required for the spatio-temporal queries. Considering the different available time-IDs, we chose a random starting instant and then randomly generated the width of that interval from five minutes to two hours. Note that if we discretized time into -minute intervals and - minutes, our time interval would contain exactly time IDs (). However, if time was discretized into -minute intervals, would contain only time IDs (). We followed the same procedure to gather the query patterns used for the pure temporal queries uses-t and starts-t.
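The generation of random query intervals can be sketched as follows. The way a width in minutes is converted to a number of time-IDs (rounding up, clamping at the last ID) is an assumption made for this sketch, as are the function and parameter names.

```python
import random

def random_time_interval(num_time_ids, minutes_per_id, rng):
    """Return (first_id, last_id) of a random interval 5 to 120 minutes wide."""
    width_minutes = rng.randint(5, 120)
    # Number of time-IDs needed to cover the width (ceiling division, assumed).
    width_ids = max(1, -(-width_minutes // minutes_per_id))
    start = rng.randrange(num_time_ids)
    end = min(start + width_ids - 1, num_time_ids - 1)  # clamp at the last ID
    return start, end

rng = random.Random(42)
print(random_time_interval(288, 5, rng))   # 5-minute discretization
print(random_time_interval(48, 30, rng))   # 30-minute discretization
```

Under these assumptions an interval covers at most 24 time-IDs with 5-minute discretization and at most 4 with 30-minute discretization.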
8.3.1 Space/time trade-off when dealing with spatial queries
In Figures 11 and 12, we show the performance of at solving spatial queries for Madrid and Porto datasets respectively. Note that all these queries can be answered using only the component of . Therefore, the size of the temporal component is not considered here and compression values (x-axis) refer only to the size of with respect to the spatial baseline as in Table 1. We show the average query time (in s) depending on the space used by with three different sampling configurations ().
Results show that the queries that involve searching in the $ region of Ψ for every trip.
In both datasets, we can see that uses-x (solved using select on rather than on ) is the fastest query. On average, it takes only around ns per query. Except in the most sparse configuration of , queries ends-with-x, starts-with-x, and from-x-to-y require typically less than s. This basically shows the cost of performing on a compressed . In the most sparse setup, times for starts-with-x and from-x-to-y are always better in Madrid than in Porto dataset, and ends-with-x shows nearly identical times. With the densest configuration (), ends-with-x and from-x-to-y are respectively around -% faster in Madrid dataset (ends-with-x takes s and s respectively, and from-x-to-y takes s and s). However, starts-with-x performs around % faster in Porto dataset (s vs s).
Focusing on top-k queries, we can see huge differences between top-k-starts and the rest of the top-k queries, as the former needs to perform over the compressed instead of a on .
We can also see that due to the small number of stops in Madrid dataset, it is always more efficient to use the sequential version of top-k-starts and top-k algorithms. This is also because a rather uniform frequency among nodes increases the number of insertions in the priority queue () of the binary algorithm needed for retrieving the first nodes (). Moreover, note that for the sequential algorithm is at most , whereas for the binary-partition counterpart it could become up to .
However, in Porto dataset, where nodes follow a biased distribution (some streets are far more used than others by taxis), and whose vocabulary is times larger than Madrid's, the binary-partition version of top-k-starts and top-k algorithms is clearly faster than the sequential counterpart (top-k-seq and top-k-starts-seq). Note that in Madrid dataset, top-100 returns 32% of the nodes (hence sequential processing is worthwhile) whereas in Porto dataset less than 0.2% of the nodes are returned.
The gap between top-10-seq and top-100-seq that we can clearly appreciate in Madrid dataset is due to the cost of the insertion of nodes in the min-heap. However, the gap between the binary top-10 and top-100 is mainly related to the number of iterations performed until the binary-partition algorithm gathers the first and nodes returned respectively. The same discussion applies for top-k-starts queries.
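The binary-partition idea can be sketched as follows. Since the algorithm of Figure 4 is not reproduced in this section, this is a generic reconstruction of the technique rather than the authors' exact procedure: a prefix-sum array is a hypothetical stand-in for obtaining the number of entries covered by a segment of node IDs, and all names are assumptions.

```python
import heapq
from itertools import accumulate

def top_k_bin(freqs, k):
    """freqs[i] = frequency of node i. Returns up to k (node, frequency) pairs
    in non-increasing order of frequency."""
    if not freqs:
        return []
    prefix = [0] + list(accumulate(freqs))

    def weight(lo, hi):
        # Total frequency of nodes lo..hi (the priority of the segment).
        return prefix[hi + 1] - prefix[lo]

    # Max-heap emulated with negated priorities; it starts with the whole vocabulary.
    heap = [(-weight(0, len(freqs) - 1), 0, len(freqs) - 1)]
    result = []
    while heap and len(result) < k:
        neg_w, lo, hi = heapq.heappop(heap)
        if lo == hi:
            # A single node that outweighs every pending segment is the next answer.
            result.append((lo, -neg_w))
            continue
        mid = (lo + hi) // 2
        heapq.heappush(heap, (-weight(lo, mid), lo, mid))
        heapq.heappush(heap, (-weight(mid + 1, hi), mid + 1, hi))
    return result

node_freqs = [5, 1, 9, 3, 7, 2]  # toy frequencies (assumed)
print(top_k_bin(node_freqs, 3))
```

With a skewed distribution the heavy segments are refined first and most of the vocabulary is never split, whereas with near-uniform frequencies many segments must be pushed before the first single node reaches the top of the queue, which is consistent with the behavior discussed above.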
8.3.2 Space/time trade-off when performing temporal queries
In this section we focus on the performance of the temporal component of . We use the same configurations as in Table 2 for and , and show the space/time trade-offs obtained when solving pure temporal queries. Figures 13 and 14 present the results obtained at uses-t and starts-t queries for Madrid and Porto datasets respectively. Note that, in this case, since the is not actually needed to solve temporal queries, we do not include its size within the compression values (x-axis).
We can see that when running uses-t queries, both and obtain rather similar times (requiring less than 4s to perform a operation in all cases) and that those times improve as the height of the structure decreases. We can see that in the highest and , corresponding to using -min intervals in Madrid dataset, uses-t requires less than s. Then, when using -min intervals, the time required to solve uses-t is always below s (yet performs faster than here), and those times are similar to the ones obtained for Porto dataset when using -min intervals. And finally, the best query times (below s) are obtained for Porto dataset with -min intervals.
Regarding starts-t, recall that it also performs a operation, but within a smaller range () in comparison with the range where is performed for uses-t. We can see that, whereas obtains similar times to those of uses-t query, starts-t performs clearly faster than uses-t over .
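To make the kind of operation behind these temporal queries concrete, the following sketch counts, within a range of positions, the values that fall in a time interval, using a plain pointer-based balanced wavelet tree. It is a didactic sketch only: it uses uncompressed Python lists instead of bitvectors with rank support, it is neither the Hu-Tucker-shaped variant nor the wavelet matrix, and the class and method names are assumptions.

```python
class WaveletTree:
    """Balanced wavelet tree over integers (didactic, uncompressed)."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)  # the root requires a non-empty sequence
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        # zeros[i] = how many of the first i symbols go to the left child
        self.zeros = [0]
        for v in seq:
            self.zeros.append(self.zeros[-1] + (v <= mid))
        self.left = WaveletTree([v for v in seq if v <= mid], lo, mid)
        self.right = WaveletTree([v for v in seq if v > mid], mid + 1, hi)

    def count(self, l, r, a, b):
        """Number of positions i in [l, r) whose value lies in [a, b]."""
        if l >= r or b < self.lo or self.hi < a:
            return 0
        if a <= self.lo and self.hi <= b:
            return r - l  # the whole node range qualifies
        lz, rz = self.zeros[l], self.zeros[r]
        return (self.left.count(lz, rz, a, b)
                + self.right.count(l - lz, r - rz, a, b))

times = [3, 7, 1, 7, 4, 0, 5, 3]  # toy sequence of time-IDs (assumed)
wt = WaveletTree(times)
print(wt.count(0, len(times), 3, 5))  # whole sequence
print(wt.count(2, 6, 0, 4))           # restricted to a smaller range of positions
```

The two calls differ only in the range of positions that is searched, which is the difference between counting over the whole sequence and counting over a smaller range.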
As a final note, recall that in Madrid dataset, bitvector always needs more space than counterparts whereas in Porto dataset (as discussed in Section 8.2) obtains the best space values when using -min intervals and still requires less space than when using -min intervals. This is the reason why while plots for Madrid dataset are decreasing from left to right, in Porto dataset the first point () in the left figures (-min intervals), and the third point () in the right figures (-min intervals) require less space than the others () and are also typically faster.
8.3.3 Space/time trade-off when performing spatio-temporal queries
In Figures 15 and 16 we show the space/time tradeoff obtained by when dealing with spatio-temporal queries. Recall that this type of query requires both using the , to exploit indexed access to the nodes in the trips, and the temporal component of to handle temporal constraints. In this case, the space values shown in the figures include both the size of and that of either or . Therefore, we also show the overall space needs of . In the case of we have set (a fixed dense sampling), and for and we used again the same configurations as in the previous sections obtained by varying the bitvectors and the temporal discretization.
For queries starts-with-x, ends-with-x, and uses-x we can see typically small differences between using or . In Madrid dataset, overcomes being -% faster in these types of queries. However, in Porto dataset is slightly faster (from to %) than its counterpart.
For queries from-x-to-y-strong and from-x-to-y-weak we can see a big gap between the times reported by and . This gap arises because in we have used exactly the operation discussed in Section 7 that is implemented with two calls to the operation from the . (For we used exactly the same implementation in [38] and simply added the new operation that calls the underlying from the .) However, in our implementation of we have engineered an improved version of where, during the execution of , we also report and , hence avoiding two calls to .
Finally, we also include results for top-k and top-k-starts queries in Figures 17 and 18. As explained in Section 8.3.1, the sequential approach is preferred when the frequency distribution of nodes is rather uniform (Madrid dataset). Otherwise, the binary-partition counterpart outperforms it. The need for applying a temporal constraint simply accentuates this effect in comparison with the corresponding pure spatial queries.
8.4 Discussion: solving queries on CTR Vs Pre-computing counters
Throughout this section we have focused our experiments on using to answer our set of queries. However, all those queries could somehow be pre-computed in such a way that they could be solved faster at the cost of dealing with additional supporting structures. For example, all the spatial queries could be pre-computed with tables that store for each node, or pair of nodes, the corresponding counters. In any case, all those tables would occupy less than 2MB. In the case of spatio-temporal queries, we could create a straightforward structure where, for each node, we used a sparse array to keep the counters for each time instant. This would permit us to solve time interval queries by summing up the counters matching the temporal constraint of the query. We have implemented those structures for queries starts-with-x, ends-with-x, and uses-x for Madrid and Porto datasets when using -minute intervals. In all cases the pre-computed structure for each query occupies around 20MB and permits us to answer those queries from to orders of magnitude faster than . Finally, we also created a simple pre-computed structure to handle the spatio-temporal query from-x-to-y. We used a sparse array that, for each pair , keeps a counter for all those trajectories that started at and ended at (source code at https://github.com/dgalaktionov/compact-trip-representation/blob/master/src/buildFacade.h). In this case, the resulting structure roughly occupies from to MB in memory. That is, around -% the size of our plain baseline, which is approximately the same space required by . Again, times are roughly from to times faster than in .
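The per-node counter structure for spatio-temporal queries can be sketched as follows. This is not the code from the repository mentioned above: nested dictionaries are a hypothetical stand-in for the sparse arrays, and the function names are assumptions.

```python
from collections import defaultdict

def build_counters(trips):
    """trips: iterable of lists of (node, time_id) pairs.
    Returns node -> {time_id: number of traversals at that time instant}."""
    counters = defaultdict(lambda: defaultdict(int))
    for trip in trips:
        for node, t in trip:
            counters[node][t] += 1
    return counters

def uses_x(counters, node, t1, t2):
    """Sum the counters of `node` whose time instant falls within [t1, t2]."""
    per_time = counters.get(node, {})
    return sum(c for t, c in per_time.items() if t1 <= t <= t2)

# Toy trips (assumed data)
toy_trips = [[("A", 1), ("B", 2)], [("A", 2), ("C", 3)], [("B", 2), ("A", 5)]]
counters = build_counters(toy_trips)
print(uses_x(counters, "A", 1, 2))
```

Such a table answers only the query it was built for; the trips themselves cannot be recovered from it, which is the trade-off discussed in the following paragraph.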
We have seen above that a simple and straightforward implementation of additional pre-computed structures can handle most of the queries proposed in this work and improve the query times obtained with a solution based on compact data structures such as . Yet, still has some advantages: i) actually keeps all the trips implicitly in a compressed and self-indexed way. Therefore, it avoids the need of storing them apart for the case in which we had to support further queries. ii) In some management scenarios, not all the queries can be pre-computed. For example, some indicators in the context of transportation networks require “counting the number of trips that went through two nodes and ”. Using we could rely on the underlying to efficiently locate the ranges corresponding to and and apply to extract the original trips to check if they contain . Other queries such as “Count how many of the trips from X to Y passed through Z” could be solved similarly by initially locating the range corresponding to Y\Z\PsiXYZ$.
9 Conclusions and future work
With the installation of better user-tracking mechanisms in public transportation networks, or the fact that a simple app installed in a mobile phone permits us to track user movements, the problem of storing user trips to finally support network analysis operations has been gaining increasing interest in multiple scenarios. For example, we could consider a network management administration, a taxi company, services like Uber, Cabify, Car2go, or simply end-user applications.
With enough data of vehicle trips from a significant amount of drivers over the network composed of the streets in a city, it would be possible to infer traffic rules by examining turns that nobody takes, their usual driving speed across the network, congestion points at a given time, and other useful information. Also, a taxi company (or similar services) could benefit from knowing the city areas where it is more probable that a user would start a trip, the average time to go from one area to another, etc. This also applies for the administrators of public transportation networks including buses, trains, subway, etc.
We have presented and shown that it is a powerful tool to represent user trips. Actually, we have used to handle user trips from two different scenarios: the network of subway and local trains from Madrid, and taxi trips from Porto. uses compact data structures to store both the nodes traversed (spatial component) by a user during a trip and the corresponding timestamps (temporal component). This permits us not only to reduce the amount of data to store but also to efficiently perform spatial, temporal, and spatio-temporal queries that can help us to analyze the actual usage of the network.
In particular, we used the well-known to represent the spatial component of the trips. For Madrid dataset, the size of is around -% the size of the source data. Porto dataset is still more compressible and requires only around -% the space of the original data. This structure is enough to solve typical spatial queries within microseconds and top-k queries in milliseconds. For the temporal component, we used two -based structures. We adapted the existing balanced and we created a Hu-Tucker-shaped () that makes it possible to exploit a biased distribution of times to gain compression. These structures obtained only a moderate improvement in compression with respect to a plain representation of times (compression ratio from to %), but they provided indexed access to the temporal data, and consequently allowed us to support temporal queries very efficiently. Finally, we have also shown that the overall , including both and either or , also makes it possible to solve spatio-temporal queries efficiently (within microseconds); that is, spatial queries constrained to a time period. The overall compression obtained by is around -% in Madrid dataset and around -% in Porto dataset.
We have presented as a proof of concept development, and we have shown how to solve different types of queries. Yet, based on the underlying data-structures, is flexible enough to allow us to increase its functionality. As future work, we are interested in exploiting the underlying network topology to obtain a more compact representation of the trips in . In this promising line [68, 69] we are working on a succinct representation for the context of public transportation networks. Also, we want to explore ways to improve the compression of the temporal component of . We consider that an inverted-index based representation can be promising.
References
- [1]
N. R. Brisaboa, A. Fariña, D. Galaktionov H., M. A. Rodríguez, Compact trip representation over networks, in: Proc. 23rd International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 9954, 2016, pp. 240–253.
- [2]
M. A. Munizaga, C. Palma, Estimation of a disaggregate multimodal public transport Origin–Destination matrix from passive smartcard data from Santiago, Chile, Transportation Research Part C: Emerging Technologies 24 (2012) 9–18.
doi:10.1016/j.trc.2012.01.007.
- [3]
R. H. Güting, M. H. Böhlen, M. Erwig, C. S. Jensen, N. Lorentzos, E. Nardelli, M. Schneider, J. R. R. Viqueira, Chapter 4: Spatio-temporal models and languages: An approach based on data types, in: Spatio-Temporal Databases: The CHOROCHRONOS Approach, LNCS 2520, 2003, pp. 117–176.
- [4]
S. Spaccapietra, Editorial: Spatio-Temporal Data Models and Languages, GeoInformatica 5 (1) (2001) 5–9.
- [5]
L. Forlizzi, R. H. Güting, E. Nardelli, M. Schneider, A data model and data structures for moving objects databases, in: Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD), 2000, pp. 319–330.
- [6]
M. Erwig, R. H. Güting, M. Schneider, M. Vazirgiannis, Spatio-Temporal Data Types: An Approach to Modeling and Querying Moving Objects in Databases, GeoInformatica 3 (3) (1999) 269–296.
- [7]
N. Pelekis, Y. Theodoridis, Mobility Data Management and Exploration, Springer, 2014.
doi:10.1007/978-1-4939-0392-4.
- [8]
D. Pfoser, C. S. Jensen, Y. Theodoridis, Novel Approaches in Query Processing for Moving Object Trajectories, in: Proc. 26th International Conference on Very Large Data Bases (VLDB), 2000, pp. 395–406.
- [9]
Y. Tao, D. Papadias, MV3R-Tree: A Spatio-Temporal Access Method for Timestamp and Interval Queries, in: Proc. 27th International Conference on Very Large Data Bases (VLDB), 2001, pp. 431–440.
- [10]
E. Frentzos, Indexing Objects Moving on Fixed Networks, in: Proc. 8th International Symposium on Spatial and Temporal Databases (SSTD), 2003, pp. 289–305.
- [11]
V. T. de Almeida, R. H. Güting, Indexing the Trajectories of Moving Objects in Networks, GeoInformatica 9 (1) (2005) 33–60.
doi:10.1007/s10707-004-5621-7.
- [12]
I. Sandu Popa, K. Zeitouni, V. Oria, D. Barth, S. Vial, Indexing In-network Trajectory Flows, The VLDB Journal 20 (5) (2011) 643–669.
doi:10.1007/s00778-011-0236-8.
- [13]
P. Cudré-Mauroux, E. Wu, S. Madden, TrajStore: An Adaptive Storage System for Very Large Trajectory Data Sets, in: Proc. 26th International Conference on Data Engineering (ICDE), 2010, pp. 109–120.
- [14]
Y. Li, C. Chow, K. Deng, M. Yuan, J. Zeng, J. Zhang, Q. Yang, Z. Zhang, Sampling big trajectory data, in: Proc. 24th ACM International Conference on Information and Knowledge Management (CIKM), 2015, pp. 941–950.
- [15]
K. Sadakane, New text indexing functionalities of the compressed suffix arrays, Journal of Algorithms 48 (2) (2003) 294–313.
- [16]
R. Grossi, A. Gupta, J. S. Vitter, High-order entropy-compressed text, in: Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2003, pp. 841–850.
- [17]
O. Wolfson, B. Xu, S. Chamberlain, L. Jiang, Moving objects databases: Issues and solutions, in: Proc. 10th International Conference on Scientific and Statistical Database Management (SSDBM), 1998, pp. 111–122.
- [18]
A. P. Sistla, O. Wolfson, S. Chamberlain, S. Dao, Modeling and querying moving objects, in: Proc. 13th International Conference on Data Engineering (ICDE), 1997, pp. 422–432.
- [19]
R. H. Güting, M. H. Böhlen, M. Erwig, C. S. Jensen, N. A. Lorentzos, M. Schneider, M. Vazirgiannis, A foundation for representing and querying moving objects, ACM Transactions on Database Systems 25 (1) (2000) 1–42.
- [20]
R. H. Güting, M. Schneider, Moving Objects Databases, Morgan Kaufmann, 2005.
- [21]
M. L. Damiani, H. Issa, R. H. Güting, F. Valdés, Symbolic trajectories and application challenges, SIGSPATIAL Special 7 (1) (2015) 51–58.
- [22]
R. H. Güting, V. Teixeira de Almeida, Z. Ding, Modeling and querying moving objects in networks, The VLDB Journal 15 (2) (2006) 165–190.
doi:10.1007/s00778-005-0152-x.
- [23]
Z. Ding, B. Yang, R. H. Güting, Y. Li, Network-matched trajectory-based moving-object database: Models and applications, IEEE Trans. Intelligent Transportation Systems 16 (4) (2015) 1918–1928.
doi:10.1109/TITS.2014.2383494.
- [24]
A. Guttman, R-trees: A dynamic index structure for spatial searching, in: Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD), 1984, pp. 47–57.
- [25]
M. A. Nascimento, J. R. Silva, Towards historical R-trees, in: Proc. ACM symposium on Applied Computing (SAC), ACM, 1998, pp. 235–240.
- [26]
V. P. Chakka, A. Everspaugh, J. M. Patel, Indexing large trajectory data sets with SETI, in: Proc. 1st Conference on Innovative Data Systems Research (CIDR), 2003, pp. 1–12.
- [27]
J.-W. Chang, M.-S. Song, J.-H. Um, TMN-tree: new trajectory index structure for moving objects in spatial networks, in: Proc. 10th International Conference on Computer and Information Technology (CIT), 2010, pp. 1633–1638.
- [28]
D. H. T. That, I. S. Popa, K. Zeitouni, TRIFL: A generic trajectory index for flash storage, ACM Transactions on Spatial Algorithms and Systems 1 (2) (2015) 6.
- [29]
N. Meratnia, R. A. de By, Spatiotemporal compression techniques for moving point objects, in: Proc. 9th International Conference on Extending Database Technology (EDBT), LNCS 2992, 2004, pp. 765–782.
- [30]
M. Potamias, K. Patroumpas, T. Sellis, Sampling Trajectory Streams with Spatiotemporal Criteria, in: Proc. 18th International Conference on Scientific and Statistical Database Management (SSDBM), 2006, pp. 275–284.
- [31]
H. Cao, O. Wolfson, G. Trajcevski, Spatio-temporal data reduction with deterministic error bounds, The VLDB Journal 15 (3) (2006) 211–228.
doi:10.1007/s00778-005-0163-7.
- [32]
K.-F. Richter, F. Schmid, P. Laube, Semantic Trajectory Compression: Representing Urban Movement in a Nutshell, Journal of Spatial Information Science 4 (1) (2012) 3–30.
- [33]
G. Kellaris, N. Pelekis, Y. Theodoridis, Map-matched Trajectory Compression, Journal of Systems and Software 86 (6) (2013) 1566–1579.
doi:10.1016/j.jss.2013.01.071.
- [34]
S. Funke, R. Schirrmeister, S. Skilevic, S. Storandt, Compass-based navigation in street networks, in: Proc. 14th International Symposium on Web and Wireless Geographical Information Systems (W2GIS), LNCS 9080, 2015, pp. 71–88.
- [35]
B. Krogh, N. Pelekis, Y. Theodoridis, K. Torp, Path-based queries on trajectory data, in: Proc. 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL), 2014, pp. 341–350.
- [36]
S. Koide, Y. Tadokoro, T. Yoshimura, SNT-index: Spatio-temporal index for vehicular trajectories on a road network based on substring matching, in: Proc. 1st International ACM SIGSPATIAL Workshop on Smart Cities and Urban Analytics (UrbanGIS@SIGSPATIAL), 2015, pp. 1–8.
- [37]
P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. 41st IEEE Symposium on Foundations of Computer Science (FOCS), 2000, pp. 390–398.
- [38]
F. Claude, G. Navarro, A. Ordóñez, The wavelet matrix: An efficient wavelet tree for large alphabets, Information Systems 47 (2015) 15–32.
- [39]
U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, SIAM Journal on Computing 22 (5) (1993) 935–948.
- [40]
R. Grossi, J. S. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in: Proc. 32nd ACM Symposium on Theory of Computing (STOC), 2000, pp. 397–406.
- [41]
G. Jacobson, Space-efficient static trees and graphs, in: Proc. 30th IEEE Symposium on Foundations of Computer Science (FOCS), 1989, pp. 549–554.
- [42]
I. Munro, Tables, in: Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), LNCS 1180, 1996, pp. 37–42.
- [43]
G. Navarro, V. Mäkinen, Compressed Full-text Indexes, ACM Computing Surveys 39 (1) (2007) article 2.
- [44]
A. Fariña, N. R. Brisaboa, G. Navarro, F. Claude, Á. S. Places, E. Rodríguez, Word-based Self-Indexes for Natural Language Text, ACM Transactions on Information Systems 30 (1) (2012) article 1:.
- [45]
T. Gagie, G. Navarro, S. J. Puglisi, New algorithms on wavelet trees and applications to information retrieval, Theoretical Computer Science 426 (2012) 25–41.
- [46]
G. Navarro, Compact Data Structures – A practical approach, Cambridge University Press, 2016.
- [47]
G. Navarro, Wavelet trees for All, Journal of Discrete Algorithms 25 (2014) 2–20.
doi:10.1016/j.jda.2013.07.004.
- [48]
A. Golynski, R. Grossi, A. Gupta, R. Raman, S. S. Rao, On the size of succinct indices, in: Proc. 15th Annual European Symposium on Algorithms (ESA), LNCS 4698, 2007, pp. 371–382.
- [49]
R. Raman, V. Raman, S. S. Rao, Succinct indexable dictionaries with applications to encoding k-ary trees and multisets, in: Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002, pp. 233–242.
- [50]
D. A. Huffman, A method for the construction of minimum-redundancy codes, Proceedings of the IRE 40 (9) (1952) 1098–1101.
doi:10.1109/JRPROC.1952.273898.
- [51]
P. Ferragina, R. González, G. Navarro, R. Venturini, Compressed text indexes: From theory to practice, Journal of Experimental Algorithmics 13 (2009) 1–12.
- [52]
J. Barbay, G. Navarro, On compressing permutations and adaptive sorting, Theoretical Computer Science 513 (2013) 109–123.
doi:10.1016/j.tcs.2013.10.019.
- [53]
T. C. Hu, A. C. Tucker, Optimal computer search trees and variable-length alphabetical codes, SIAM Journal on Applied Mathematics 21 (4) (1971) 514–532.
- [54]
T. M. Cover, J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing), 2nd Edition, Wiley-Interscience, 2006.
- [55]
Y. Horibe, An improved bound for weight-balanced tree, Information and Control 34 (2) (1977) 148–151.
doi:10.1016/S0019-9958(77)80011-9.
- [56]
E. N. Gilbert, E. F. Moore, Variable-length binary encodings, Bell System Technical Journal 38 (4) (1959) 933–967.
doi:10.1002/j.1538-7305.1959.tb01583.x.
- [57]
F. Claude, G. Navarro, Practical rank/select queries over arbitrary sequences, in: Proc. 15th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 5280, 2008, pp. 176–187.
- [58]
A. Ordóñez, Statistical and repetition-based compressed data structures, Ph.D. thesis, Department of Computer Science, University of A Coruña (2016).
- [59]
A. Fariña, T. Gagie, G. Manzini, G. Navarro, A. Ordóñez, Efficient and Compact Representations of Some Non-canonical Prefix-Free Codes, in: Proc. 23rd International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 9954, 2016, pp. 50–60.
- [60]
C. Morency, M. Trépanier, B. Agard, Measuring transit use variability with smart-card data, Transport Policy 14 (3) (2007) 193–203.
doi:10.1016/j.tranpol.2007.01.001.
- [61]
A. El-Geneidy, D. Levinson, Place rank: Valuing spatial interactions, Networks and Spatial Economics 11 (4) (2011) 643–659.
doi:10.1007/s11067-011-9153-z.
- [62]
G. Wang, Y. Zhong, C.-P. Teo, Q. Liu, Flow-based accessibility measurement: The place rank approach, Transportation Research Part C: Emerging Technologies 56 (2015) 335–345.
doi:10.1016/j.trc.2015.04.017.
- [63]
R. Guimerà, S. Mossa, A. Turtschi, L. A. N. Amaral, The worldwide air transportation network: Anomalous centrality, community structure, and cities’ global roles, Proc. National Academy of Sciences 102 (22) (2005) 7794–7799.
- [64]
G. Nong, S. Zhang, W. H. Chan, Two efficient algorithms for linear time suffix array construction, IEEE Transactions on Computers 60 (10) (2011) 1471–1484.
- [65]
N. J. Larsson, K. Sadakane, Faster suffix sorting, Theoretical Computer Science 387 (3) (2007) 258–272.
doi:10.1016/j.tcs.2007.07.017.
- [66]
D. Okanohara, K. Sadakane, Practical entropy-compressed rank/select dictionary, in: Proc. 9th Workshop on Algorithm Engineering and Experiments (ALENEX), 2007, pp. 60–70.
doi:10.1137/1.9781611972870.6.
- [67]
L. Moreira-Matias, J. Gama, M. Ferreira, J. Mendes-Moreira, L. Damas, Predicting taxi–passenger demand using streaming data, IEEE Transactions on Intelligent Transportation Systems 14 (3) (2013) 1393–1402.
- [68]
Y. Han, W. Sun, B. Zheng, Compress: A comprehensive framework of trajectory compression in road networks, ACM Transactions on Database Systems 42 (2) (2017) 11:1–11:49.
- [69]
S. Koide, Y. Tadokoro, C. Xiao, Y. Ishikawa, Cinct: Compression and retrieval for massive vehicular trajectories via relative movement labeling, in: Proc. 34th IEEE International Conference on Data Engineering (ICDE), 2018.
