# RSATree: Distribution-Aware Data Representation of Large-Scale Tabular Datasets for Flexible Visual Query

**Authors:** Honghui Mei, Wei Chen, Yating Wei, Yuanzhe Hu, Shuyue Zhou, Bingru Lin, Ying Zhao, Jiazhi Xia

arXiv: 1908.02005 · 2019-10-14

## TL;DR

RSATree introduces a flexible, distribution-aware data representation for large-scale tabular datasets, enabling arbitrary visual queries with efficient computation and interactive exploration capabilities.

## Contribution

The paper presents RSATree, a novel data structure combining R-tree, locality-sensitive hashing, and summed area tables to support flexible, distribution-aware visual queries on large datasets.

## Key findings

- Supports arbitrary visual queries with low latency
- Enables flexible binning strategies for data analysis
- Demonstrates efficiency on real-world datasets

## Abstract

To visualize and analyze a large-scale dataset, analysts commonly investigate data distributions derived from statistical aggregations and represented by charts such as histograms and binned scatterplots. Aggregate queries are implicitly executed through such a process. Because datasets are often extremely large, response time is typically accelerated by precomputing data cubes. However, queries are then limited to the predefined binning schema of the preprocessed data cubes. This limitation hinders analysts from flexibly adjusting visual specifications to investigate implicit patterns in the data effectively. RSATree enables arbitrary queries and flexible binning strategies by leveraging three schemes: an R-tree-based space partitioning scheme to capture the data distribution, a locality-sensitive hashing technique to achieve locality-preserving random access to data items, and a summed area table scheme to support interactive queries of aggregated values with linear computational complexity. This study presents and implements a web-based visual query system that supports visual specification, query, and exploration of large-scale tabular data with user-adjustable granularities. We demonstrate the efficiency and utility of our approach by performing various experiments on real-world datasets and analyzing time and space complexity.
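To make the summed area table idea concrete, the following is a minimal sketch of the generic technique on a small 2D grid of bin counts. It is not the RSATree layout itself, which the abstract describes as combining the table with R-tree partitioning and locality-sensitive hashing. The function names `build_sat`, `range_sum`, and `rebin`, and the sample grid, are illustrative assumptions and do not come from the paper.

```python
from typing import List

Grid = List[List[int]]


def build_sat(grid: Grid) -> Grid:
    """Build a summed area table padded with one leading row and column of zeros.

    sat[r][c] holds the sum of grid[0:r][0:c], so sat has shape (rows+1, cols+1).
    """
    rows, cols = len(grid), len(grid[0])
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = (
                grid[r][c] + sat[r][c + 1] + sat[r + 1][c] - sat[r][c]
            )
    return sat


def range_sum(sat: Grid, r0: int, c0: int, r1: int, c1: int) -> int:
    """Sum of the half-open block grid[r0:r1][c0:c1] using four table lookups."""
    return sat[r1][c1] - sat[r0][c1] - sat[r1][c0] + sat[r0][c0]


def rebin(sat: Grid, row_edges: List[int], col_edges: List[int]) -> Grid:
    """Aggregate the fine grid into coarser bins given by edge indices.

    The edges are chosen at query time, so the binning schema is not fixed
    when the table is built (hypothetical helper for illustration).
    """
    return [
        [
            range_sum(sat, row_edges[i], col_edges[j],
                      row_edges[i + 1], col_edges[j + 1])
            for j in range(len(col_edges) - 1)
        ]
        for i in range(len(row_edges) - 1)
    ]


# Made-up fine-grained counts, e.g. a 4x4 binned scatterplot.
counts = [
    [1, 2, 0, 3],
    [0, 1, 4, 1],
    [2, 0, 1, 0],
    [1, 1, 0, 2],
]
sat = build_sat(counts)

# Re-bin into 2 row bins and 2 unequal column bins without rescanning the data.
coarse = rebin(sat, row_edges=[0, 2, 4], col_edges=[0, 1, 4])
print(coarse)
```

In this sketch each rectangular aggregate costs four lookups, so re-binning takes time proportional to the number of output bins rather than the number of underlying cells. That property is what lets bin edges be adjusted interactively after preprocessing.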

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1908.02005/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/1908.02005/full.md

## References

72 references — full list in the complete paper: https://tomesphere.com/paper/1908.02005/full.md

---
Source: https://tomesphere.com/paper/1908.02005