Quantitative Comparison of Time-Dependent Treemaps
Eduardo Vernier, Max Sondag, Joao Comba, Bettina Speckmann, Alexandru, Telea, and Kevin Verbeek

TL;DR
This paper provides a comprehensive quantitative comparison of 14 treemapping algorithms for time-dependent hierarchical data, introducing a dataset classification scheme and analyzing algorithm performance based on dataset features.
Contribution
It introduces a novel classification scheme for time-dependent datasets and evaluates treemapping algorithms' performance across these classes.
Findings
Performance varies with dataset characteristics.
Certain algorithms excel in stability or visual quality depending on dataset features.
Open datasets and metrics are provided for further research.
Abstract
Rectangular treemaps are often the method of choice to visualize large hierarchical datasets. Nowadays such datasets are available over time, hence there is a need for (a) treemaps that can handle time-dependent data, and (b) corresponding quality criteria that cover both a treemap's visual quality and its stability over time. In recent years a wide variety of (stable) treemapping algorithms has been proposed, with various advantages and limitations. We aim to provide insights to researchers and practitioners to allow them to make an informed choice when selecting a treemapping algorithm for specific applications and data. To this end, we perform an extensive quantitative evaluation of rectangular treemaps for time-dependent data. As part of this evaluation we propose a novel classification scheme for time-dependent datasets. Specifically, we observe that the performance of treemapping…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
\SpecialIssuePaper\BibtexOrBiblatex\electronicVersion\PrintedOrElectronic
Quantitative Comparison of Time-Dependent Treemaps
Eduardo Vernier2\orcid0000-0001-9483-4978, Max Sondag1\orcid0000-0003-3309-638X, João Comba2\orcid0000-0003-2921-2130, Bettina Speckmann1\orcid0000-0002-8514-7858, Alexandru Telea3\orcid0000-0003-0750-0502, and Kevin Verbeek1\orcid0000-0003-3052-4844
1TU Eindhoven , Netherlands
2UFRGS, Brazil
3Utrecht University, Netherlands
Abstract
Rectangular treemaps are often the method of choice to visualize large hierarchical datasets. Nowadays such datasets are available over time, hence there is a need for (a) treemaps that can handle time-dependent data, and (b) corresponding quality criteria that cover both a treemap’s visual quality and its stability over time. In recent years a wide variety of (stable) treemapping algorithms has been proposed, with various advantages and limitations. We aim to provide insights to researchers and practitioners to allow them to make an informed choice when selecting a treemapping algorithm for specific applications and data. To this end, we perform an extensive quantitative evaluation of rectangular treemaps for time-dependent data. As part of this evaluation we propose a novel classification scheme for time-dependent datasets. Specifically, we observe that the performance of treemapping algorithms depends on the characteristics of the datasets used. We identify four potential representative features that characterize time-dependent hierarchical datasets and classify all datasets used in our experiments accordingly. We experimentally test the validity of this classification on more than 2000 datasets, and analyze the relative performance of 14 state-of-the-art rectangular treemapping algorithms across varying features. Finally, we visually summarize our results with respect to both visual quality and stability to aid users in making an informed choice among treemapping algorithms. All datasets, metrics, and algorithms are openly available to facilitate reuse and further comparative studies.
\ccsdesc
[500]Human-centered computing Treemaps \ccsdesc[300]Information systems Temporal data
\printccsdesc
††volume: 39††issue: 3
1 Introduction
Treemaps are one of the best-known methods for visualizing large hierarchical datasets. Given an input tree whose leaves have several attributes, treemaps recursively partition a 2D spatial region into cells whose visual attributes (area, color, shading, or annotation) encode the tree’s data attributes. Compared to other methods such as node-link techniques, treemaps effectively use all available screen pixels to show data, and thus can display trees of tens of thousands of nodes on a single screen. Most treemaps use rectangles, although there are alternative models such as Voronoi treemaps [BDL05], orthoconvex and L-shaped treemaps [dBSvdW14], and Jigsaw treemaps [Wat05]. In this paper, we focus exclusively on rectangular treemaps.
The input for a rectangular treemap is a rectangle and a set of non-negative values together with a hierarchy on these values (represented by a tree). The output is a treemap , which is a recursive partition of into a set of interior-disjoint rectangles, where each rectangle has area , and the regions of the children of an interior node of the hierarchy form a rectangle (associated with their parent). Such a partition of a rectangle into a set of disjoint rectangles is also called a rectangular layout, or layout for short. Typically the input values are normalized, that is, the sum corresponds to the area of .
Nowadays, large hierarchical datasets are also available over time. Hence, there is a need for time-dependent treemaps which display changing trees and data values. Ideally, such time-dependent treemaps enable the user to easily follow structural changes in the tree and in the data. In a time-dependent setting, the input values become functions for each , where the discrete domain represents the different time steps in the data. We assume that the hierarchy on the values and are not time-dependent, and that the values are properly normalized for each time step separately. We use the special value to represent that data element is not present at time ; and we speak of insertions or deletions if starts or stops to be nonzero, respectively.
The visual quality of rectangular treemaps is usually measured via the aspect ratio of its rectangles. This indicator can become arbitrarily bad: Consider a treemap that consists of only two rectangles. If the area of one of these rectangles tends towards zero, then its aspect ratio tends towards infinity. Nagamochi and Abe [NA07] describe an algorithm (APP) which computes, for a given set of values and a hierarchy, a treemap which provably approximates the optimal aspect ratio. De Berg et al. [dBSvdW14] prove that minimizing the aspect ratio for rectangular treemaps is strongly NP-complete. Kong et al. [KHA10] propose perceptional guidelines to improve treemap design and Zhou et al. [ZCYT17] perform user studies to test the effectiveness of different rectangular treemapping algorithms. Recently Lu et al. [LFH*∗*17] argue that the optimal aspect ratio for treemaps should, in fact, be the golden ratio. In Section 2, we describe the state-of-the-art of rectangular treemaps in detail along with the various characteristics of rectangular treemaps.
For time-dependent treemaps, a second quality criterion is stability. Ideally, small changes in the data should result only in small changes in the treemap. Such stable behavior ensures that the only changes the user sees are due to the data, and not due to the decisions the algorithm makes. In recent years a few non-rectangular treemaps were specifically developed for time-dependent data. Hahn et al. [HTMD14] and Hees and Hage [vHH17] describe stable versions of Voronoi treemaps. Chen et al. [CDY17] propose a small-multiple metaphor to visualize time-dependent hierarchies. Their algorithm computes a global layout for all time steps simultaneously, but does not handle insertions or deletions. Scheibel et al. [SWD18] give an algorithm that maps changes in the data onto an initial layout. However, “treemaps” of subsequent time steps are not proper rectangular layouts as white space is introduced when resolving overlaps between rectangles. Finally, Lukasczyk et al. [LWM*∗*17] and Köpp and Weinkauf [KW19] show how to compute static overviews of the whole evolution of time-dependent hierarchical data sets, while Guerra-Gómez et al. [GGPPS13] and Card et al. [CSP*∗*06] present interactive visualization tools.
Contribution. Despite their enduring popularity, a comprehensive evaluation of treemaps is currently lacking, even more so for the time-dependent case. Individual papers tend to report on only a few algorithms and evaluate only a few datasets, often without a principled discussion of quality metrics. To provide insights to both researchers and practitioners and to allow them to make an informed choice when selecting a treemap for their specific application and data, we perform an extensive quantitative evaluation of rectangular treemaps for time-dependent data. Our three main contributions are:
(1) We introduce a new method to measure the stability of time-dependent treemaps which explicitly considers the input data (Section 3.2). An algorithm is stable if small changes in the input data result in small changes in the layout, that is, data change and layout change correlate positively. Previously proposed stability metrics measure only the layout change and conclude that small layout changes are a sign of a stable algorithm. However, to properly measure stability, we also need to capture the data change and then correlate data and layout change. Here, we have to overcome the difficulty that the data and the layout space are a priori incomparable. We solve this problem by introducing the concept of a baseline treemap which represents the minimum amount of change that any time-dependent treemap must incur (given the input data) when moving from treemap to the next treemap .
(2) We propose a novel classification scheme for time-dependent datasets. Specifically, based on our discussion of the state-of-the-art of treemaps in Section 2, we observe that the performance of treemaps depends on the characteristics of the datasets used. We identify four potential representative features that characterize time-dependent hierarchical datasets and classify all datasets used in our experiments accordingly. We experimentally test the validity of this classification on 2405 datasets, and analyze the relative performance of 14 state-of-the-art rectangular treemapping algorithms across varying features. Generally we conclude that our proposed features do indeed have predictive value, both with respect to visual quality and stability. We also observe that algorithms that are designed to be stable tend to in fact be more stable across features.
(3) We perform a quantitative evaluation of 14 rectangular treemapping algorithms on more than 2000 datasets. We visually summarize our results with respect to both visual quality and stability to aid users in making an informed choice among treemaps. All datasets, metrics, and algorithms are openly available [URLb]. Section 5 reports on our experimental results, we conclude in Section 6.
2 Rectangular Treemaps
We next discuss the most well-known rectangular treemapping algorithms. For a fair comparison during our experiments, we require that treemap rectangles have exactly the correct areas and partition the input rectangle. Algorithms that do not satisfy these requirements are not included in our evaluation. Recall that the input for a rectangular treemap is a rectangle and a set of non-negative values together with a hierarchy on these values (represented by a tree). The children of a node in this hierarchy are given in a particular order in the input. We distinguish two classes of treemaps, which either do or do not use this order. For time-dependent data we also distinguish between state-aware and stateless treemaps. Contrary to stateless treemaps, state-aware treemaps do not compute the treemap separately at a time step, but (can) use the layout of the previous time step to compute a new layout. Most treemaps are stateless; we discuss the state-aware algorithms separately.
Unordered treemaps do not (need to) adhere to the input nodes’ order when computing the layout. Typically, input weights are sorted to help the algorithm achieve good visual quality. Unordered treemaps in our evaluation include Squarified treemaps (SQR) [BHvW00] and Approximation treemaps (APP) [NA07]. APP comes with a guaranteed upper bound on the worst-case aspect ratio, while SQR often achieves near-optimal aspect ratios in practice. The visual quality of unordered treemaps is relatively unaffected by high weight variance, as reordering weights allows the layout to group similar-size rectangles in the treemap, typically leading to better aspect ratios. Yet, the sorted order of the weights may change rapidly over time, especially if the weights change much over time or if the weight variance is low. This can negatively affect the stability of the treemaps.
Ordered treemaps are required to adhere to the order of nodes as given in the input, which roughly ensures that rectangles close to each other in the input are close to each other in the resulting treemap. This typically improves the stability of treemaps, but may worsen visual quality. We include nine ordered treemaps in our evaluation. The first ordered treemaps [SW01] include the Pivot-by-Middle (PBM), Pivot-by-Size (PBZ), and Pivot-By-Split (PBS) algorithms. Similar algorithms are the Strip algorithm (STR) [BSW02] and the Split algorithm (SPL) [Eng05]. Other algorithms, like the Spiral algorithm (SPI) [TS07], and the Hilbert (HIL) and Moore (MOO) algorithms [TC13], lay out rectangles following a space-filling curve. Finally, the very first treemap algorithm (Slice-and-Dice (SND)) by Shneiderman [Shn92] can also be considered an ordered treemap. While not designed to be ordered, the resulting (combinatorial) layout depends only on the hierarchy and not on weights. In fact, SND uniquely uses the depth in the hierarchy to compute the layout (slicing vertically on even depth and horizontally on odd depth), rather than simply applying the same algorithm recursively. Hence, SND’s visual quality strongly depends on the number of levels in the input hierarchy. Typically, laying out large rectangles near small rectangles leads to poor aspect ratios. Hence, the visual quality of ordered treemaps is negatively affected by high weight variance. However, ordered treemaps are relatively stable over time compared to ordered treemaps, as order is maintained. Finally, insertions and deletions may affect the visual quality and stability of ordered treemaps to varying degrees, depending on how they are handled exactly.
State-aware treemaps can use the layout of the previous time step to compute a new layout and so can largely control their stability. The treemap for the first time step is typically an existing unordered treemap. The first state-aware treemap was introduced by Sondag et al. [SSV18]. Their Local Moves algorithm (LM) is initialized with APP, and allows only a small number of local modifications to the (combinatorial) layout between time steps. They also show how to update areas between time steps without significantly changing the layout (layouts remain “order-equivalent”). In our evaluation we include the Local Moves algorithm with 4 local moves between time steps (LM4), and without local moves (LM0). A similar algorithm is the Git algorithm (GIT) by Vernier et al. [VCT18b], which is initialized with SQR, and does not allow any changes to the (combinatorial) layout between time steps. Both state-aware treemaps also support insertions and deletions, updating the layouts locally where necessary (for insertions, the position in the layout can be chosen to maximize visual quality). By design, the stability of state-aware treemaps is relatively unaffected by frequent weight changes over time. Also, the visual quality of the initial treemaps should be relatively high. However, since the layouts cannot change much over time, the visual quality of state-aware treemaps will decrease over time if weights change significantly. Frequent insertions and deletions may also cause treemaps with poor visual quality, as treemaps are not recomputed as a whole. However, many insertions can help to correct rectangles with bad aspect ratio caused by weight changes over time. This is especially helpful for state-aware algorithms that do not allow any changes to the layout, like LM0 and GIT. Note that SND has a fixed layout if the input hierarchy does not change and is hence very stable, but it does not explicitly use the previous state.
Finally, note that the number of levels in the input hierarchy can have a strong effect on all classes of algorithms. In general, more levels imply less freedom in the layout strategy. As a result, unordered treemaps become more similar to ordered treemaps. Overall, the visual quality tends to decrease and the stability tends to increase.
3 Metrics
There are two important criteria to evaluate treemaps: visual quality and stability. We discuss well-established metrics for both below. We also introduce a new method to measure the stability of time-dependent treemaps which captures inherent data changes. We measure metrics for each leaf rectangle separately and then aggregate these values for each algorithm and dataset [URLb]. Note that we do not compute metrics for non-leaf nodes.
3.1 Visual quality
The weight information in a treemap is conveyed by the areas of its rectangles. Since areas of rectangles closer to squares are visually easier to estimate than areas of elongated rectangles, visual treemap quality is commonly measured by the aspect ratio of its rectangles. Although it has been proposed that the ratio should be close to the golden ratio [LFH*∗*17] we aim for the uniformly accepted goal of making rectangles as square as possible. For a rectangle of width and height , we define the aspect ratio as
[TABLE]
Observe that this definition is the inverse of the usual definition for aspect ratio. Its values range from 0 to 1, where values of close to [math] are considered “bad” and values close to are considered “good”. The bounded range allows for easy aggregation. Note that, compared to the usual definition of , rectangles with larger aspect ratios have a smaller influence on the aggregated score.
3.2 Stability
Evaluating the stability of a treemap is more involved than evaluating visual quality. Consider treemaps at two consecutive time steps and . Since stability does not explicitly depend on the value of , we denote the former and the new treemap by and respectively, to simplify notation. We also denote the rectangle areas in and by and , respectively. For a stable treemapping algorithm, the (visual) difference between and should roughly correspond to the difference between and . Note that the combination of large changes in data values and small changes in the layouts is unlikely since rectangle areas in treemaps must exactly match the data values. Hence, we want actually to measure instability, that is, large layout changes that are not caused by large data changes.
Most existing treemap stability metrics consider only the visual change in the treemap’s layout , usually computed by evaluating the change for each rectangle separately and aggregating it over all rectangles. Shneiderman and Wattenberg [SW01] define as the Euclidean distance between the vectors and , where , , , and are the coordinates of the top-left corner, width, and height of a rectangle, respectively. They then define as the average over all rectangles. Hahn et al. [HTMD14, Hah15] simplify this metric by defining as the distance moved by the centroid of a rectangle, again defining as the average. Tak and Cockburn [TC13] use the same as [SW01], but define as the variance over all values computed by . They also propose a drift metric, which measures how much a rectangle moves away from its average position over a long period. Recently, Scheibel et al. [SWD18] introduced two new layout change metrics: The average aspect ratio change defines as the relative change between the aspect ratios of and , and defines as the average. The relative parent change defines as the relative change of the distance between the center of a rectangle and the center of its parent, again defining as the average. Chen et al. [CDY17] propose a metric to quantify the ability of users to track time-dependent data in treemaps, which is closely related to the drift metric [TC13]. A different approach measures layout change using pairs of rectangles. Hahn et al. [HBD17] introduce the relative direction change, which, for every pair of rectangles and , measures how much the angle from the center of to the center of changes. Finally, Sondag et al. [SSV18] proposed the relative position change, which, for every rectangle pair , measures how much the relative position of with respect to changes. The distance is then defined as the average over all rectangle pairs.
Summarizing the above, we distinguish two types of layout change metrics: (1) absolute metrics measure how much individual rectangles move/change, and (2) relative metrics measure how much positions of pairs of rectangles change relative to each other. For our experiments, we use both an absolute and a relative metric. In particular, as an absolute metric, we use the corner-travel distance, which is a well-known metric used in computer vision to quantify change between two shapes using feature points [TM07, Sze10]. In the vision community, it was established already many years ago [S*∗*94, Bie87] that corners are a perceptually useful feature to identify and track. Besides this perceptual validation, the corner-travel metric lies also within a small bounded factor of the original metric introduced by Shneiderman and Wattenberg [SW01]. Specifically, let and be the width and height of an input rectangle , respectively. Let , , , and (, , , and ) be the positions of the corners of a rectangle (). We define the corner-travel (CT) distance for a rectangle as
[TABLE]
where denotes the norm. Simply put, is the corner-to-corner correspondence distance between and . Note that , since a rectangle corner can travel by at most the length of the diagonal of .
As a relative metric, we use the relative position change [SSV18]. We established experimentally that the corner-travel and the relative position change metric correlate clearly on more than 2000 data sets. Hence in Section 5 we report only on experiments using the corner-travel distance. All other data can be found here [URLb].
Data change. The stability metrics discussed above do not take data change into account. If data changes by a large amount, then the layouts should be allowed to change significantly without considering this to be instability. To add data change to a stability metric, one can consider the difference or ratio between the layout change and the data change [VCT18a, VCT18b]. However, there are two problems: (1) we need a way to measure data change, and (2) the metric spaces for data and layouts need to be comparable. For example, data change can be measured in terms of changes of rectangle areas (since these correspond to the data). However, layout changes such as the corner-travel distance measure lengths, not areas. Areas and lengths are not directly comparable, and thus their ratios or differences may not be meaningful. Although such metrics could be made comparable by suitable normalization, such adaptations are necessarily metric-specific.
Baseline treemap. We overcome the above issues with a new method that captures data change in the layout space. For this, we define a baseline treemap with respect to and . The layout of (that is, the combinatorial structure of the rectangular subdivision which constitutes ) is based on the layout of . However, the areas of the rectangles in are the areas of . The idea is that minimizes the layout distance to among all treemaps with the areas of . Put differently: represents the minimum amount of change that any time-dependent treemap must incur when moving from and its associated area values to the next treemap and its area values . As a result, is a good metric for data change in the layout space.
We construct for each tested algorithm and each time step using a hill-climbing algorithm, which was proven to converge in [EMSV12]. For a rectangular layout (treemap) , a maximal segment is a maximal contiguous horizontal or vertical line segment contained in the union of the borders of all rectangles in (for example, the green segments in Figure 1). Put simply, a horizontal maximal segment (which is not part of the input rectangle ) always has endpoints on the interior of two vertical segments and vice versa. For two horizontal maximal segments and , we say that if there is a rectangle in whose bottom side coincides with and whose top side coincides with . This defines a partial order on horizontal maximal segments. We define a partial order on vertical maximal segments analogously (Figure 1). We say that is order-equivalent to if the corresponding partial orders on maximal segments are isomorphic. For every possible set of areas, there exists an order-equivalent treemap to that correctly represents those areas [EMSV12]. In particular, we can initially define as the treemap order-equivalent to (computed with any of the tested algorithms) with the areas of .
If rectangles are inserted or deleted, the baseline treemap cannot be order-equivalent to , so we handle insertions and deletions separately. Dealing with deletions is easy: we simply let the areas go to zero. For insertions, we must be more careful. Indeed, while we consider only rectangles present in both and when measuring stability ( and in Equation 2), inserted rectangles can strongly impact the positions of rectangles in . We observe that the baseline treemap does not need to be a proper treemap: it only needs to capture how much rectangles must minimally move to update to the new data. To minimize the movement of the rectangles due to insertions (and hence be as stable as possible), we distribute the cumulative area of the inserted rectangles over the “walls” (borders) of treemap as evenly as possible. To do so, we replace every maximal segment in by a rectangle, and assign each such rectangle a portion of the inserted area corresponding to the length of (Figure 2). Hence all walls become equally thick and the original rectangles of need to move as little as possible to yield .
The baseline treemap does not necessarily minimize the movement of every rectangle. However, the layout change between and is still a very good estimate for the minimum necessary layout change between and , and thus a good measure for data change (see Figure 3: nearly all points lie on or below the diagonal).
Stability metric. We can now define a stability metric that takes data change into account. Consider a rectangle and the corresponding rectangles and in and , respectively, and let be the layout change function for single rectangles. Two natural choices for spatial stability are the difference or ratio between and . Our experiments showed that the difference is typically more informative, that is, it exhibits clearer, more pronounced patterns, than the ratio. Hence, we define the stability of a single rectangle as
[TABLE]
Note that if , which is possible. Indeed, a value of [math] for represents “very stable”, and is considered to be (roughly) as stable as possible.
Limitations. The stability metrics we use focus only on consecutive time steps. The stability of time-varying treemaps could conceivably be influenced by effects that span multiple time steps, which our metrics do not capture directly. However, we believe that the most salient events influencing stability occur between consecutive time steps and hence we focus on this scenario.
4 Data
The visual quality and/or stability of treemaps clearly depends on the datasets used. Simply measuring the average performance over a (large) collection of datasets does not reveal such information. We aim to provide sufficient insight so that both practitioners and researchers can make informed choices about which algorithm to use for their data. For this, we study the performance of treemaps as a function of the characteristics of the input data. We classify the datasets into data classes along with explicit features and evaluate the metrics of different treemapping algorithms for each class.
4.1 Data features
Our methodology is inspired by the framework proposed by Smith-Miles et al. [SMBWL14] to objectively measure the performance of algorithms across datasets. For each dataset, we compute a number of features that (hopefully) capture the characteristics influencing the relative performance of treemapping algorithms. As a result, every dataset is represented by a point in a low-dimensional feature space . Similar feature-based approaches are also used to measure the relative performance of dimensionality-reduction methods [EMK*∗*19] or in machine learning [Bis06]. Based on the discussion of treemapping algorithms in Section 2, we identify the following four features: 1. Levels of hierarchy, 2. Variance of node weights, 3. Speed of weight change, and 4. Insertions and deletions.
Obviously, other features can be used to characterize (time-dependent) trees, such as the total node count; the min, max, and average node degrees; and the (im)balance of the tree structure [BO73, KY15]. However, for a practical and meaningful analysis, we must keep the number of features low. We hence limit the analysis to the four features discussed above.
4.2 Data classes
Using the feature space , we partition all datasets into classes. For each feature we define a small number of subclasses based on only that feature. The data class of a dataset is then defined as the combination of the subclasses for each feature. We determined the value-ranges defining the subclasses by analyzing the distribution of feature values over our 2405 real-world tree datasets.
Levels of hierarchy (3 subclasses). We use three ranges for classification: 1 level (1L), 2 or 3 levels (2/3L), and more than 3 levels (4+L). Most hierarchical datasets we have analyzed have 2 or 3 levels. This number of levels is quite common for datasets that are visualized via treemaps, since they frequently concern geo-spatial subdivisions such as countries, continents, and their subregions, grouped by a classification scheme, such as the World Bank regional classification. Furthermore, visually understanding the node nesting in deeper treemaps becomes difficult [VvWvdL06, BHvW00]. A special case are datasets with only 1 level, that is, sets of weight values. Such datasets are also often visualized by treemaps, as these are more space-filling than alternatives such as bar charts [VvWvdL06]. These datasets are challenging for treemaps that implicitly use the depth of the hierarchy. Finally, we consider datasets with more than 3 levels, which correspond to deep hierarchies such as, for example, file systems or software architectures [HTMD14, HBD17, VCT18a].
Variance of node weights (2 subclasses). We distinguish between low variance (LWV) and high variance (HWV). To ensure that the total number of tree nodes does not strongly influence our classification, we use the coefficient of variation to determine the subclass. The standard deviation and the mean are computed over all leaf weights over all time steps. We say that there is low variance if and high variance if , respectively.
Speed of weight change (3 subclasses). We distinguish between low weight change (LWC), regular weight change (RWC), and spiky weight change (SWC). The weight change of a single rectangle is measured by the absolute difference in the relative area (with respect to the input rectangle R) between time steps. The weight change of a treemap between two time steps is defined as the sum of weight changes of all rectangles. To determine the subclass of a dataset, we use the distribution of weight changes between time steps over all time steps in the dataset, specifically the mean and the standard deviation . Datasets with low weight change have and . Datasets with a larger mean () and a relatively small coefficient of variation () are classified as having regular weight change. The weights of these datasets steadily change over time, without any extreme changes. Remaining datasets are classified as having spiky weight change. In those datasets weights change drastically (), or there is large variation () along with substantial changes ( or ).
Insertions and deletions (3 subclasses). We distinguish between low insertions and deletions (LID), regular insertions and deletions (RID), and spiky insertions and deletions (SID). We measure the impact of insertions and deletions between two time steps and as the cardinality of the symmetric difference between the two sets of rectangles with non-zero weights at and , divided by the number of rectangles with non-zero weights at . We again classify the datasets based on the distribution ( and ) of impact values over all time steps. Same as for the speed of weight change, LID is defined by and , RID is defined by and , and the remaining datasets are in SID.
The full classification results in data classes. In Section 5, we evaluate how the performance of treemapping algorithms depends on the classes, that is, if the classification is sensible.
4.3 Datasets
We collected a total of 2405 time-dependent hierarchical datasets from a variety of sources, detailed below. We found at least one dataset for 46 (out of 54) instances of our proposed data classes. See Figure 4 for the distribution of datasets over classes: clearly not all classes arise with equal frequency in our data sources.
Worldbank [URL18c]:
World development indicators such as agriculture, rural and urban development, education, trade and health.
GitHub [URL18e]:
Hierarchies of folders, files, and classes, weighted by the number of code lines, extracted from all revisions of several popular GitHub repositories using Scitools [URLa].
Movies:
Movies from MovieLens[HK16] and TMDB [URL18d]. We constructed a time-dependent hierarchy using the group-rows-by-attribute-value partitioning method [Tel06, VvWvdL06]. The hierarchy groups movies based on their genres, actors, release date, and keywords. Each leaf is a movie, whose weight corresponds to its rating over a given period of time.
Custom:
We added several hand-picked datasets to cover classes that did not appear in the automatically mined datasets: Dutch Names [Mee16] contains the frequency of popular baby names in the Netherlands per year; UN Comtrade Coffee [URL17] contains the amount of coffee each country imported per year; ATP contains personal information, historical rankings, and match results from 1968 to 2018 for ATP tennis players [URL18a]; and Earthquakes contains the time, location, depth and intensity of seismic phenomena provided by the USGS Earthquake Hazards Program [URL18b].
Importantly, note that the above selection of dataset sources is orthogonal to the description of the feature space . The former covers the origin of data (which may cover application-specific aspects not captured by our feature space); the latter covers application-independent data aspects as captured by the data classes of .
5 Experimental Results
We ran all 14 algorithms on all time steps of all 2405 datasets, generated the baselines for all these instances (Section 3), and recorded all layouts, that is, the positions of all rectangles at all time steps . Per dataset we aggregate our results for all metrics and algorithms by first by taking the mean over all rectangles in a single time step, and then by taking the mean again over all time steps. This is necessary since the number of rectangles may differ per time step.
We focus on three specific questions: We explore the validity of our data classification (Section 5.1); we study the performance of all algorithms with respect to visual quality and stability across varying data features (Section 5.2); and we compare the performance of all algorithms on each data class separately (Section 5.3). We believe that the resulting visual summary will help researchers and practitioners choose a suitable treemapping algorithm for their data.
5.1 Data classification analysis
We evaluate if the relative performance of treemapping algorithms is more consistent within a data class than for an arbitrary collection of datasets. To perform this analysis we need to establish how we can capture the consistency of relative performance for a collection of datasets, and how we can compare this consistency between multiple collections. We restrict our analysis to data classes that contain at least 50 datasets, for otherwise the observed consistency is not sufficiently reliable. For each such data class, we randomly sample 50 datasets to use in this analysis. We also randomly sample 50 datasets among all 2405 datasets (all classes) as a baseline for comparison. Note that all collections must have the same number of datasets in the analysis to ensure that the comparisons are fair.
Now consider a single collection of datasets. To measure the consistency of relative performance among different datasets in this collection, we cannot directly use the computed metrics for visual quality and stability, as these values may differ greatly between datasets. Alternatively, we could rank the algorithms per dataset, but then algorithms with very similar performance may imply a greater variance in relative performance than is the case. Instead, we define the relative performance (separately for visual quality and stability) per dataset as follows. We compute both the best value (maximum for visual quality, minimum for stability) and the median value over all algorithms over this dataset. The relative performance score for each algorithm on this dataset is then computed by linearly interpolating between these two values, where the best algorithm receives score [math], and the median algorithm receives score . The relative performance score is capped at , to avoid outliers. The resulting scores are comparable between different datasets.
We next analyze the consistency of relative performance within collections of datasets in two different ways. First, we use a quantitative approach: for each algorithm we compute the variance of the relative performance scores over all datasets in a collection, and we sum up these variances over all algorithms. This results in a consistency score for a collection of datasets. Figure 5 displays the consistency scores (for visual quality and stability) of all data classes (with at least 50 datasets) compared to the consistency scores of the baseline collection (created by random sampling). A cell is colored blue (more consistent) if is smaller than ; a cell is colored red (less consistent) if is larger than .
Nearly all data classes for visual quality and most data classes for stability are more consistent than the baseline. This indicates that our features are splitting the datasets into valid data classes where the relative performance of an algorithm is easier to predict than in the baseline. However, the stability column for high weight variance and low weight change is less consistent than the baseline. As discussed in Section 2, the stability of unordered treemaps becomes worse compared to ordered treemaps when the weight variance is low or the weight change is high, due to reordering of the input weights. As a result, the difference with respect to stability between ordered and unordered treemaps is less pronounced for these data classes; the relative performance is hence influenced more by accidental details of individual datasets and less by structural differences between the algorithms. Additionally there are two data classes where the visual quality is less consistent than the baseline. It is not clear to us at this point what the cause of these inconsistencies is; one possibility are hidden correlations in the data classes.
Second, we use a more qualitative approach to assess the consistency of relative performance. For each data class we create a matrix plot, that shows the relative performance scores of all algorithms for all datasets in the collection (see Figures 6 and 7). Each column in the matrix plot represents a dataset, and each row represents an algorithm. The color of every “cell” in the matrix plot indicates the relative performance score of an algorithm on a dataset, where lighter colors indicate better (lower) relative performance scores. Relative performance scores that were capped at are indicated with purple. To better enable the visual assessment of consistency among the different datasets in a collection, we order the datasets (columns) so that those with similar scores are next to each other as much as possible. Also, we order the algorithms (rows) so that the algorithms with better average score are lower in the matrix plot. In particular, the order of algorithms in the matrix plots for different data classes can be different. Figure 6 shows the matrix plots for visual quality, with the corresponding matrix plot for the baseline collection at the left-top. Figure 7 shows the matrix plots for stability.
First consider the matrix plot for visual quality (Figure 6). For the low weight variance subclass we indeed see that the matrix plots are much smoother than the baseline, which confirms the results in Figure 5. We also observe an increasing number of irregularities when going from 1 level treemaps to 2/3 levels or 4+ levels, since more levels impose more restrictions on the layout and hence all algorithms perform more similarly.
Consider now Figure 7. First of all, we notice that there is a set of four algorithms at the bottom of every matrix plot. These are the state-aware algorithms and SND. For nearly all datasets, regardless of the specific data class, these four algorithms are much more stable than any of the others. There is a large difference between the low weight variance and high weight variance subclasses. For low weight variance there is a set of algorithms that perform consistently much worse than the median (purple cells). These include the unordered treemaps which are particularly sensitive to changes in such data.
5.2 Performance analysis across features
The analysis in Section 5.1 shows that our data classification is valid. We now study how visual quality and stability depend on the features of the datasets. We aim to understand how sensitive a given algorithm is to variations in one or several features of its input data. For each data class we calculate the average visual quality and stability. For each subclass of a feature we then take the average over all data classes that belong to it. This ensures that even though we have different numbers of datasets in each data class, they are all weighted equally. We show this data in Figures 8,9,10,11. Each point in each figure represents the score for one algorithm on one subclass of the feature, for example, low weight variance. We draw a polyline that connects the points of one algorithm and use glyphs to indicate the different subclasses. The different algorithms are indicated with categorical colors (see figure legends).
Recall that a low value for the stability metric indicates a stable algorithm and that the visual quality metric (aspect ratio) is bounded between 0 and 1. In particular, note that visual quality () of 0.5 for a single rectangles indicates a 2-by-1 rectangle. A of 0.25 however is perceptually much worse than a of 0.5 in terms of area perception as can be inferred from Kong et al. [KHA10], coming close to their "extreme aspect ratios" of 4.5.
Levels of hierarchy. Figure 8 considers the levels of hierarchy feature, which has three values: 1L, 2/3L, and 4+L. From Figure 8, we see that all algorithms, in particular the stateless ones, are more stable as the number of levels increase. In contrast to most other algorithms, the visual quality of state-aware algorithms (LM0, LM4, GIT) as well as SND increases with the number of levels. We also see that SQR and PBS have the longest polylines, that is, they are the most sensitive to the number of levels.
Variance of node weights. Figure 9 considers the weight variance feature, which has two values: LWV and HWV. Increasing the weight variance decreases the visual quality for all algorithms, except for APP (and SND). Additionally we see that the unordered treemaps are indeed more sensitive to this feature in terms of stability compared to the other algorithms. These algorithms reorder the data based on the weight to determine their layout, and if the weight are close to each other this happen more often.
Speed of weight change. Figure 10 considers the speed of weight change feature, which has 3 values: LWC, RWC, and SWC. The near-vertical polylines for the stateless algorithms show that visual quality seems to be largely unaffected by this feature. The stability however decreases quickly. Conversely, for the state-aware algorithms the polylines are mostly near-horizontal: the stability is largely unaffected, but the visual quality decreases. As the only state-aware algorithm that allows changes to the layout, LM4 makes an explicit tradeoff between stability and visual quality, which is visible by the slightly sloping line.
Insertions and deletions. Finally, Fig. 11 considers the insertions and deletions feature, which has three values: LID, RID, and SID. The plot shows a similar variation of visual quality and stability as seen for the speed of weight change feature (Fig. 10). Yet, the polylines for the stateless algorithms now show a ‘kink’ at the midpoint (RID, regular insertions/deletions). This tells us that these algorithms are most unstable for regular insertions/deletions, and stabler for linear and spiky insertions/deletions. Interestingly, the state-aware methods (LM0, LM4, GIT) show a similar kink but oriented differently. These methods hence achieve poorest visual quality for regular insertions/deletions and highest quality on the other two values of this feature.
5.3 Comparison of data classes
We next compare the relative performance of all algorithms separately on all data classes. Figure 12 supports this comparison as follows: it is structured as a matrix of tables, one per data class. Each table shows the average visual quality (left column) and average stability (right column) of all algorithms for all datasets in the respective data class. The two columns are sorted separately to show the best-ranking algorithms at the top. Cells show the algorithm names and scores, and are categorically color-coded on the algorithm name, following the same color scheme as in Section 5.2. Empty cells indicate data classes for which we did not find datasets. Figure 12 can answer the following practical questions:
Which method is best for my data?
Given a family of datasets with known characteristics (feature values), we search for the corresponding cell and pick the top algorithm(s) in visual quality, stability, or a combination of both, depending on the application requirements. When doing this, we should examine the actual values, since several algorithms score quite close to each other.
How is a given algorithm performing in general?
We scan the table following the color of the respective algorithm, and detect its rank (with respect to visual quality and/or stability) over all data classes. In this way we can find patterns and outliers in the data for this algorithm: for example, LM0 and LM4 are always near the top in stability, and GIT’s performance on visual quality fluctuates widely depending on the data class.
Which algorithms perform similarly?
We locate groups of neighboring rows with the same color pattern in all tables. These indicate algorithms which score similarly regardless of data class.
In general there are a number of insights we can obtain from Figure 12. When we consider only the visual quality, we see that SQR is usually the best for low-weight variance data, but for high weight variance APP is just as often the best algorithm. If the dataset contains only 1 level, SQR performs better, but for the other depth subclasses it depends on the exact data class. If only the stability is important, SND almost always scores best regardless of the data class, but likewise it consistently scores the poorest on visual quality. The state-aware algorithms all perform very well on stability. While LM0 is better in terms of stability than LM4, their exact order as well as their relative order to GIT varies depending on the data class. When considering which algorithm is best for both stability and visual quality, there are no easy answers. There is no algorithm that performs best on both in any of the data classes and hence the answer depends on the desired trade-off and the data class in question.
6 Discussion and Conclusion
We performed an extensive quantitative evaluation of rectangular treemaps for time-dependent data. To do so, we introduced a new methodology based on baseline treemaps to measure the stability of time-dependent treemaps. Baseline treemaps enable us to measure the change in the input data in a manner that is mathematically comparable to the measures for the layout change of the corresponding treemaps. Furthermore, we proposed a novel classification scheme for time-dependent data sets via a four-dimensional feature space (weight variance, weight change, tree depth, and the pattern of insertions and deletions). These four features naturally arose from a discussion on various types of state-of-the-art treemapping algorithms. Our experimental analysis shows that our proposed classification is valid in general and that most data classes are well suited to predict the performance of treemapping algorithms. For most data classes, our visual summary comparing all algorithms across all data classes and both metrics can hence serve as a reliable resource for researchers and practitioners. Last but not least, all datasets, metrics, and algorithms used in our evaluation are openly available [URLb].
Limitations and future work. Our experiments show that our identified features and the resulting feature space generally work well and result in a meaningful classification of datasets. However, there are whole sets of data classes for which we could not find sufficiently many (or even any) datasets. This is partially inherent in the classification and somewhat natural: data sets with low weight variance hardly ever exhibit spiky weight change behavior, so that particular column in our table is essentially empty. But among the 18 classes of treemaps with 4 or more levels we found a significant number of datasets only for two classes, which both are essentially populated by datasets stemming from software repositories. The question remains if there are other significant types of time-dependent hierarchical datasets which have four or more levels and which escaped our searches. As it is, the results for these two particular classes are representative for only a restricted type of data.
Our classification works well for visual quality, with the exception of two cases (2/3 level, spiky insertions and deletions, high weight variance, and low or spiky weight change). We have a large number and variety of datasets at our disposal for these two classes, but nevertheless, it is unclear to us what causes these inconsistencies in the performance of the tested algorithms. There might be a hidden correlation in these datasets and one or more additional features might be needed to separate these classes further.
While we do have a significant number of datasets at our disposal and hence can validate our claims with some certainty, we still might be observing some bias in our collection. We would hence like to construct, and evaluate on, synthetic datasets. Doing so is not trivial; creating datasets that avoid sampling biases and are representative of real-world datasets (for a suitable definition of “real-world”) is a challenging (but important) question in its own right in information visualization in particular and in data science in general.
To complement our quantitative evaluation it would naturally be of interest to evaluate the performance of treemapping algorithms in various usage scenarios through user studies. The two metrics we use for visual quality and stability are both perceptually salient according to studies performed in previous work. However, a study that evaluates the combination of and the trade-offs between visual quality and stability could deliver important insights as to where on the Pareto-front an optimal treemapping algorithm should lie.
Finally, our evaluation currently does not measure the run-time and correspondingly the scalability of the algorithms used in our experiments. Our implementations are not (equally) optimized and hence a fair comparison is currently impossible. Scalability is clearly an important factor in online usage scenarios, and we hope to be able to complement our current set of implementations with optimized versions in the near future.
Acknowledgments. The Netherlands Organisation for Scientific Research (NWO) is supporting M. Sondag and B. Speckmann under project no. 639.023.208, and K. Verbeek under project no. 639.021.541. This study was also financed in part by CAPES (Finance Code 001) and CNPq (Process 308851/2015-3).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[BDL 05] Balzer M., Deussen O., Lewerentz C. : Voronoi treemaps for the visualization of software metrics. In Proc. ACM Symposium on Software Visualization (2005), pp. 165–172.
- 2[B Hv W 00] Bruls M., Huizing K., van Wijk J. J. : Squarified treemaps. In Proc. Data Visualization (2000), pp. 33–42.
- 3[Bie 87] Biederman I. : Recognition-by-components: A theory of human image understanding. Psychological Review 94 , 2 (1987), 115–147.
- 4[Bis 06] Bishop C. : Pattern Recognition and Machine Learning . Springer, 2006.
- 5[BO 73] Boorman S. A., Oliviera D. C. : Metrics on spaces of finite trees. In Journal of Mathematical Psychology (1973), vol. 10, pp. 26–59.
- 6[BSW 02] Bederson B. B., Shneiderman B., Wattenberg M. : Ordered and quantum treemaps: Making effective use of 2D space to display hierarchies. ACM Transactions on Graphics 21 , 4 (2002), 833–854.
- 7[CDY 17] Chen Y., Du X., Yuan X. : Ordered small multiple treemaps for visualizing time-varying hierarchical pesticide residue data. Visual Computer 33 , 6 (2017), 1073–1084.
- 8[CSP ∗ 06] Card S., Suh B., Pendleton B. A., Heer B., Bodnar J. W. : Time tree: Exploring time changing hierarchies. In Proc. Symposium on Visual Analytics Science and Technology (2006), pp. 3–10.
