Scalable Gaussian Process Regression Via Median Posterior Inference for Estimating Multi-Pollutant Mixture Health Effects
Aaron Sonabend, Jiangshan Zhang, Edgar Castro, Joel Schwartz, Brent A. Coull, Junwei Lu

TL;DR
This paper introduces a scalable Bayesian Gaussian process regression method using median posterior inference to analyze large environmental health datasets, effectively estimating health effects of pollutant mixtures.
Contribution
It develops a divide-and-conquer distributed computing strategy with theoretical guarantees, enabling Gaussian process models to handle massive datasets efficiently.
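The divide-and-conquer idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a plain RBF-kernel GP with fixed hyperparameters, and it simplifies median posterior inference to a geometric median of the subset posterior mean vectors, computed with Weiszfeld's algorithm; the paper works with posterior distributions and includes feature selection and confounder adjustment, all omitted here. Every function name and parameter below (`rbf_kernel`, `gp_posterior_mean`, `geometric_median`, `median_gp_predict`, `n_shards`, `noise_var`) is hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_posterior_mean(x_train, y_train, x_test, noise_var=0.1):
    """GP regression posterior mean at x_test (hyperparameters fixed, an assumption)."""
    k_tt = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_st = rbf_kernel(x_test, x_train)
    return k_st @ np.linalg.solve(k_tt, y_train)


def geometric_median(points, n_iter=200, tol=1e-9):
    """Weiszfeld iterations for the geometric median of the rows of points."""
    median = points.mean(axis=0)
    for _ in range(n_iter):
        dist = np.linalg.norm(points - median, axis=1)
        weights = 1.0 / np.maximum(dist, 1e-12)  # guard against division by zero
        new_median = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_median - median) < tol:
            return new_median
        median = new_median
    return median


def median_gp_predict(x, y, x_test, n_shards=5, seed=0):
    """Split the data into shards, fit a GP on each, and combine by geometric median."""
    rng = np.random.default_rng(seed)
    shards = np.array_split(rng.permutation(len(x)), n_shards)
    # Each shard is fit independently, so this step could run on separate workers.
    means = np.stack([gp_posterior_mean(x[idx], y[idx], x_test) for idx in shards])
    return geometric_median(means)
```

Each shard needs only a linear solve of size n / n_shards rather than n, which is where the computational saving comes from. The median is used instead of an average because it is less sensitive to a single poorly behaved shard.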
Findings
Identified negative effects of traffic pollution on birthweight.
Detected positive associations between ozone, vegetation, and birthweight.
Demonstrated the method on ~650,000 birth records from Massachusetts.
Abstract
Humans are exposed to complex mixtures of environmental pollutants rather than single chemicals, necessitating methods to quantify the health effects of such mixtures. Research on environmental mixtures provides insights into realistic exposure scenarios, informing regulatory policies that better protect public health. However, statistical challenges, including complex correlations among pollutants and nonlinear multivariate exposure-response relationships, complicate such analyses. A popular Bayesian semi-parametric Gaussian process regression framework (Coull et al., 2015) addresses these challenges by modeling exposure-response functions with Gaussian processes and performing feature selection to manage high-dimensional exposures while accounting for confounders. Originally designed for small to moderate-sized cohort studies, this framework does not scale well to massive datasets. To…
