Accelerating Large-scale Data Exploration through Data Diffusion
Ioan Raicu, Yong Zhao, Ian Foster, Alex Szalay

TL;DR
This paper introduces a data diffusion approach that dynamically acquires resources, replicates data, and schedules computations near data to improve large-scale data exploration efficiency and scalability.
Contribution
The paper presents a data diffusion method that adapts resource allocation and data placement to demand, aiming to provide the benefits of dedicated hardware without the associated high costs, depending on workload and resource characteristics.
Findings
Data diffusion improves performance over alternative approaches.
Performance scales linearly with the number of data cache nodes.
The approach is effective for large-scale astronomy data analysis.
Abstract
Data-intensive applications often require exploratory analysis of large datasets. If analysis is performed on distributed resources, data locality can be crucial to high throughput and performance. We propose a "data diffusion" approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, thus allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on workload and resource characteristics. The approach is reminiscent of cooperative caching, web-caching, and peer-to-peer storage systems, but addresses different application demands. Other data-aware scheduling approaches assume dedicated…
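The three mechanisms named in the abstract (dynamic resource acquisition, demand-driven replication, and scheduling close to data) can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the class names `CacheNode` and `DiffusionScheduler`, the parameters `tasks_per_node` and `replicate_gap`, the LRU eviction policy, and the load-gap replication rule are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class CacheNode:
    """One dynamically acquired executor with a small LRU file cache (hypothetical)."""

    def __init__(self, name, capacity=2):
        self.name = name
        self.capacity = capacity
        self.cache = OrderedDict()  # file name -> True, kept in LRU order
        self.tasks_run = 0

    def run(self, file_name):
        """Run one task on file_name; return True if the file was cached locally."""
        hit = file_name in self.cache
        if hit:
            self.cache.move_to_end(file_name)
        else:
            # Miss: simulate a copy from persistent storage, evicting the
            # least recently used file if the cache is full.
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)
            self.cache[file_name] = True
        self.tasks_run += 1
        return hit


class DiffusionScheduler:
    """Toy data-aware scheduler with an elastic pool of cache nodes (assumed design)."""

    def __init__(self, max_nodes=4, tasks_per_node=2, replicate_gap=2):
        self.max_nodes = max_nodes
        self.tasks_per_node = tasks_per_node  # assumed sizing rule for the pool
        self.replicate_gap = replicate_gap    # assumed load gap that triggers a new replica
        self.nodes = []
        self.hits = 0
        self.misses = 0

    def resize(self, pending):
        """Acquire nodes as demand rises and release them as it drops."""
        wanted = min(self.max_nodes, -(-pending // self.tasks_per_node))  # ceiling division
        while len(self.nodes) < wanted:
            self.nodes.append(CacheNode(f"node{len(self.nodes)}"))
        del self.nodes[wanted:]  # released nodes take their cached data with them

    def dispatch(self, file_name):
        """Prefer a node that already caches the file; replicate if it is overloaded."""
        idlest = min(self.nodes, key=lambda n: n.tasks_run)
        holders = [n for n in self.nodes if file_name in n.cache]
        node = idlest
        if holders:
            best = min(holders, key=lambda n: n.tasks_run)
            if best.tasks_run - idlest.tasks_run <= self.replicate_gap:
                node = best
        if node.run(file_name):
            self.hits += 1
        else:
            self.misses += 1
        return node.name

    def submit(self, batch):
        """Size the pool for this batch, then dispatch each task in order."""
        self.resize(len(batch))
        return [self.dispatch(f) for f in batch]


if __name__ == "__main__":
    sched = DiffusionScheduler()
    print(sched.submit(["a", "b", "a", "b"]))  # small batch: few nodes acquired
    print(sched.submit(["a"] * 6))             # hot file: pool grows, "a" gains a replica
    print(sched.submit(["a"]))                 # demand drops: extra nodes are released
    print(sched.hits, sched.misses)
```

In this sketch a popular file ends up cached on more than one node only once the node holding it is noticeably busier than an idle one, which is one simple way to model replication in response to demand. A real system would also account for transfer cost, cache size in bytes, and resource allocation latency, none of which are modeled here.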
