Zarr-Based Chunk-Level Cumulative Sums in Reduced Dimensions
Hailiang Zhang, Dieu My T. Nguyen, Christine Smit, Mahabal Hegde

TL;DR
This paper presents a Zarr-based method that uses chunk-level cumulative sums to enable fast, cost-effective analysis of large multi-dimensional geospatial data in cloud environments, outperforming traditional brute-force approaches.
Contribution
It introduces a novel, general-purpose technique that adds a small supplementary dataset for rapid cumulative sum calculations, optimized for cloud-structured data formats like Zarr.
Findings
Achieves 10,000x faster data analysis performance
Reduces computational costs significantly in cloud environments
Requires only 5% additional storage for substantial speed gains
Abstract
Data analysis on massive multi-dimensional data, such as high-resolution large-region time averaging or area averaging for geospatial data, often involves calculations over a significant number of data points. While performing calculations in scalable and flexible distributed or cloud environments is a viable option, a full scan of large data volumes still serves as a computationally intensive bottleneck, leading to significant cost. This paper introduces a generic and comprehensive method to address these computational challenges. This method generates a small, size-tunable supplementary dataset that stores the cumulative sums along specific subset dimensions on top of the raw data. This minor addition unlocks rapid and cheap high-resolution large-region data analysis, making calculations over large numbers of data points feasible with small instances or even microservices in the…
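The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: NumPy arrays stand in for Zarr arrays, the reduced dimension is assumed to be axis 0 (for example, time), and the names `build_chunk_cumsum`, `range_sum`, and `chunk_len` are hypothetical. The supplementary array keeps one cumulative sum per chunk boundary, so a long range sum reads two auxiliary slices plus the partial chunks at each end instead of scanning every data point.

```python
import numpy as np


def build_chunk_cumsum(data, chunk_len):
    """Cumulative sums along axis 0, sampled only at chunk boundaries.

    aux[k] holds the sum of data[0 : k * chunk_len], so aux[0] is all zeros.
    A trailing partial chunk is not represented in aux.
    """
    n_chunks = data.shape[0] // chunk_len
    # Sum each full chunk, then accumulate those sums across chunks.
    per_chunk = data[: n_chunks * chunk_len].reshape(
        n_chunks, chunk_len, *data.shape[1:]
    ).sum(axis=1)
    zeros = np.zeros((1, *data.shape[1:]), dtype=per_chunk.dtype)
    return np.concatenate([zeros, np.cumsum(per_chunk, axis=0)], axis=0)


def range_sum(data, aux, chunk_len, start, stop):
    """Sum of data[start:stop] along axis 0, using aux for whole chunks."""
    first = -(-start // chunk_len)  # first chunk boundary at or after start
    last = stop // chunk_len        # last chunk boundary at or before stop
    if first >= last:
        # No whole chunk inside the range: fall back to a direct scan.
        return data[start:stop].sum(axis=0)
    head = data[start : first * chunk_len].sum(axis=0)  # partial leading chunk
    body = aux[last] - aux[first]                       # all whole chunks
    tail = data[last * chunk_len : stop].sum(axis=0)    # partial trailing chunk
    return head + body + tail


# Toy data of shape (time, lat, lon); the sizes are arbitrary.
rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=(103, 4, 5))
chunk_len = 10
aux = build_chunk_cumsum(data, chunk_len)

# Time average over steps 7..95 for every grid cell.
time_mean = range_sum(data, aux, chunk_len, 7, 96) / (96 - 7)
```

In this sketch the auxiliary array holds about `1 / chunk_len` as many values as the raw data, which is one way a supplementary dataset can be size-tunable. The chunking scheme and storage layout in the paper itself may differ.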
Taxonomy
Topics: Computer Graphics and Visualization Techniques · Matrix Theory and Algorithms · Distributed and Parallel Computing Systems
