# Distributed sequential method for analyzing massive data

**Authors:** Zhanfeng Wang, Yuan-chin Ivan Chang

arXiv: 1812.09424 · 2018-12-27

## TL;DR

This paper introduces a distributed sequential estimation method for analyzing massive datasets, combining parallel processing, adaptive sampling, and variable selection to improve efficiency and accuracy.

## Contribution

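The paper proposes a parallel divide-and-conquer approach in which several conventional sequential estimation procedures run separately and their results are integrated, so that the desired statistical properties are maintained. It adds adaptive sample selection, based on a criterion from statistical experiment design, and adaptive shrinkage estimation to speed up estimation and identify the effective variables.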

## Key findings

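- Sequential estimation procedures run separately and then integrated keep the desired statistical properties, as supported by theoretical justification and results on synthesized data sets.
- Adaptive sample selection, combined with adaptive shrinkage estimation, accelerates the estimation procedure and identifies the effective variables at the same time.
- The method is applied to three real data sets, including appliance energy use and particulate matter concentration.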
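
To make the idea of adaptive sampling concrete, here is a minimal sketch of greedy sample selection under a design criterion. The summary does not name the criterion the authors use, so the sketch assumes D-optimality purely for illustration, for a simple linear model with design points `x = (1, t)`. It relies on the matrix determinant lemma, `det(M + x xᵀ) = det(M) (1 + xᵀ M⁻¹ x)`, so the point that most increases the determinant of the information matrix is the one with the largest `xᵀ M⁻¹ x`. The function names and the `ridge` parameter are hypothetical and are not taken from the paper.

```python
import random


def inverse_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def leverage(m_inv, t):
    """Return x^T M^{-1} x for the design point x = (1, t)."""
    return m_inv[0][0] + (m_inv[0][1] + m_inv[1][0]) * t + m_inv[1][1] * t * t


def greedy_d_optimal(pool, n_select, ridge=1e-6):
    """Greedily pick points from `pool` that most increase det(X^T X).

    Assumption: D-optimality stands in for the paper's unnamed criterion.
    The small ridge keeps the information matrix invertible at the start.
    """
    info = [[ridge, 0.0], [0.0, ridge]]
    remaining = list(pool)
    chosen = []
    for _ in range(n_select):
        m_inv = inverse_2x2(info)
        best = max(remaining, key=lambda t: leverage(m_inv, t))
        remaining.remove(best)
        chosen.append(best)
        # Rank-one update of the information matrix: M += x x^T with x = (1, t).
        info[0][0] += 1.0
        info[0][1] += best
        info[1][0] += best
        info[1][1] += best * best
    return chosen


random.seed(1)
pool = [random.uniform(-1.0, 1.0) for _ in range(1000)]
chosen = greedy_d_optimal(pool, 20)
print(sorted(chosen))
```

In this toy setting the selected points cluster at the two ends of the range of `t`, which are the most informative observations for estimating a slope. That is the intuition behind why selecting samples adaptively can reach a target precision with fewer observations than random sampling.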

## Abstract

To analyse a very large data set containing lengthy variables, we adopt a sequential estimation idea and propose a parallel divide-and-conquer method. We conduct several conventional sequential estimation procedures separately, and properly integrate their results while maintaining the desired statistical properties. Additionally, using a criterion from the statistical experiment design, we adopt an adaptive sample selection, together with an adaptive shrinkage estimation method, to simultaneously accelerate the estimation procedure and identify the effective variables. We confirm the cogency of our methods through theoretical justifications and numerical results derived from synthesized data sets. We then apply the proposed method to three real data sets, including those pertaining to appliance energy use and particulate matter concentration.
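
The divide-and-conquer idea in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' procedure: it estimates a single mean rather than regression coefficients, uses a classical fixed-width stopping rule (sample until `z * s / sqrt(n)` falls below a target half-width), and pools the partition estimates weighted by their stopping sample sizes. The function names, the `n_min` safeguard, and the `sqrt(k)` relaxation of the per-partition target are assumptions made for illustration, and the partitions are processed in a plain loop where a real deployment would run them in parallel.

```python
import math
import random


def sequential_mean(stream, half_width, z=1.96, n_min=30):
    """Consume `stream` until the interval half-width z*s/sqrt(n) <= half_width.

    Returns the running mean and the number of observations used.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # Welford running sum of squared deviations
        if n >= n_min:
            s = math.sqrt(m2 / (n - 1))
            if z * s / math.sqrt(n) <= half_width:
                break
    return mean, n


def distributed_sequential_mean(partitions, target_half_width, z=1.96):
    """Run the sequential rule on each partition and pool the estimates.

    Assumption: partitions are independent with similar variance, so each
    one may stop at the looser half-width sqrt(k) * target and the pooled
    estimate still has roughly the target precision.
    """
    k = len(partitions)
    local_width = target_half_width * math.sqrt(k)
    results = [sequential_mean(iter(p), local_width, z) for p in partitions]
    total = sum(n for _, n in results)
    pooled = sum(m * n for m, n in results) / total
    return pooled, results


random.seed(0)
partitions = [[random.gauss(3.0, 2.0) for _ in range(5000)] for _ in range(4)]
pooled, results = distributed_sequential_mean(partitions, target_half_width=0.1)
print(pooled, [n for _, n in results])
```

Each partition stops after using only a fraction of its 5000 observations, which is the practical appeal of a sequential rule on massive data: the sample size is driven by the precision required, not by how much data happens to be stored.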

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1812.09424/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/1812.09424/full.md

---
Source: https://tomesphere.com/paper/1812.09424