# Sampling for computational efficiency when conducting analyses in big data

**Authors:** Jacqueline E Rudolph, Yiyi Zhou, Karine Yenokyan, Xiaoqiang Xu, Eryka Wentz, Keri L Calkins, Corinne E Joshu, Bryan Lau

PMC · DOI: 10.1093/aje/kwaf268 · American Journal of Epidemiology · 2025-12-05

## TL;DR

In a cohort of nearly 30 million Medicaid beneficiaries, subcohort and case-cohort sampling gave estimates of lung cancer incidence by HIV status close to the full-sample results, while using less computation time and memory than divide-and-recombine.

## Contribution

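The study demonstrates and compares sampling schemes for big data analyses (divide-and-recombine with 10, 20, or 50 samples; subcohort; case-cohort), evaluating point estimates, standard errors, computation time, and memory use against the full-sample analysis.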

## Key findings

- Subcohort and case-cohort methods produced estimates closer to the full sample than divide-and-recombine, and were faster and less memory intensive, especially when estimating the risk ratio.
- Including nonsampled cases in the case-cohort sample increased computation time and memory usage relative to the subcohort approach.
- Findings were similar across the three target parameters: incidence rate ratio, hazard ratio, and risk ratio.
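
The difference between the subcohort and case-cohort designs can be made concrete with a small sketch. This summary does not give the authors' implementation, so everything below is an assumption for illustration: the function names, the toy data, the 10% sampling fraction, and the weighting scheme (cases weighted 1, subcohort non-cases weighted by the inverse sampling fraction, which is one common choice for case-cohort analyses).

```python
import random


def subcohort_sample(n, fraction, seed=0):
    """Hypothetical helper: simple random sample of cohort indices."""
    rng = random.Random(seed)
    return set(rng.sample(range(n), round(n * fraction)))


def case_cohort_sample(is_case, fraction, seed=0):
    """Subcohort plus every case that the subcohort missed.

    Returns {index: weight}. Weights are an assumed scheme:
    cases count once, sampled non-cases stand in for 1/fraction people.
    """
    subcohort = subcohort_sample(len(is_case), fraction, seed)
    weights = {}
    for i in subcohort:
        weights[i] = 1.0 if is_case[i] else 1.0 / fraction
    # Add the cases that were not drawn into the subcohort.
    for i, case in enumerate(is_case):
        if case and i not in weights:
            weights[i] = 1.0
    return weights


# Toy cohort with a rare outcome (made-up 1% case probability).
rng = random.Random(42)
is_case = [rng.random() < 0.01 for _ in range(50_000)]

sub = subcohort_sample(len(is_case), fraction=0.1, seed=1)
cc = case_cohort_sample(is_case, fraction=0.1, seed=1)
print("subcohort size:", len(sub))
print("case-cohort size:", len(cc))
print("weighted size:", round(sum(cc.values())))
```

The case-cohort sample is always at least as large as the subcohort it is built from, which is in line with the extra computation time and memory reported for it above.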

## Abstract

A challenge to research in big data is the inherent computational intensity of analyses, particularly when using rigorous methods to address biases. We demonstrate the use of sampling methods in big data to estimate parameters using fewer resources. Our motivating question was whether lung cancer incidence differs by baseline HIV status, using a cohort of nearly 30 million Medicaid beneficiaries. We targeted three parameters (each listed with its estimator): incidence rate ratio (IRR, Poisson model), hazard ratio (HR, Cox model), and risk ratio (RR, Kaplan–Meier). We controlled for confounders using inverse probability weighting. We ran analyses using the full sample and several sampling schemes: divide-and-recombine (10, 20, 50 samples), subcohort, and case-cohort. We compared point estimates, standard errors, computation time, and memory used. We observed 1113 incident lung cancer diagnoses among 180 980 beneficiaries with HIV and 33 106 diagnoses among 29 179 940 beneficiaries without HIV. Findings were similar across target parameters. The subcohort and case-cohort approaches had estimates closer to the full sample and were faster and less memory intensive than divide-and-recombine, especially when estimating the risk ratio. Including nonsampled cases in the case-cohort resulted in increases in computation time and memory relative to the subcohort approach.
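
As a rough illustration of the divide-and-recombine idea named in the abstract, the sketch below splits a toy cohort into k random partitions, estimates a crude log incidence rate ratio in each, and averages the results. It is a minimal sketch under stated assumptions, not the authors' method: the simulated data, the function names, the crude (unweighted, non-model-based) estimator, and the choice to recombine by a simple mean of log estimates are all hypothetical, and confounder adjustment by inverse probability weighting is left out.

```python
import math
import random


def simulate_cohort(n, seed=0):
    """Hypothetical toy cohort of (exposed, person_years, event) tuples."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        exposed = rng.random() < 0.1
        person_years = rng.uniform(0.5, 5.0)
        rate = 0.04 if exposed else 0.02  # made-up event rates
        event = rng.random() < 1 - math.exp(-rate * person_years)
        cohort.append((exposed, person_years, event))
    return cohort


def log_irr(rows):
    """Crude log incidence rate ratio, exposed vs unexposed."""
    events = {True: 0, False: 0}
    time = {True: 0.0, False: 0.0}
    for exposed, person_years, event in rows:
        events[exposed] += event
        time[exposed] += person_years
    rate_exposed = events[True] / time[True]
    rate_unexposed = events[False] / time[False]
    return math.log(rate_exposed / rate_unexposed)


def divide_and_recombine(rows, k, seed=0):
    """Shuffle, split into k partitions, and average the per-partition log IRRs."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    partitions = [shuffled[i::k] for i in range(k)]
    estimates = [log_irr(part) for part in partitions]
    return sum(estimates) / k


cohort = simulate_cohort(50_000)
print("full-sample IRR:", round(math.exp(log_irr(cohort)), 3))
print("divide-and-recombine IRR (k=10):",
      round(math.exp(divide_and_recombine(cohort, 10)), 3))
```

Note that this toy estimator fails if any partition contains no events in one exposure group, so with a rare outcome and a small exposed group the number of partitions has to be chosen with care.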

## Linked entities

- **Diseases:** lung cancer (MONDO:0005138)

## Full-text entities

- **Diseases:** lung cancer (MESH:D008175)
- **Species:** Human immunodeficiency virus 1 (no rank) [taxon 11676]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12862598/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12862598/full.md

## References

18 references — full list in the complete paper: https://tomesphere.com/paper/PMC12862598/full.md

---
Source: https://tomesphere.com/paper/PMC12862598