# Statistical matching of non-Gaussian data

**Authors:** Daniel Ahfock, Saumyadipta Pyne, Geoffrey J. McLachlan

arXiv: 1903.12342 · 2019-04-01

## TL;DR

This paper develops computationally feasible, model-based procedures for statistically matching non-Gaussian data across datasets with structured missingness, and compares them with nearest-neighbour imputation on flow cytometry data.

## Contribution

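The paper develops computationally feasible procedures for the statistical matching of non-Gaussian data, using suitable data augmentation schemes and identifiability constraints. In the comparison on flow cytometry datasets, the model-based approach can address some of the weaknesses of nearest-neighbour imputation.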
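As a rough sketch of why identifiability constraints are needed, consider the two-dataset setting from the abstract. Assume independent records and a parametric joint density $f(x, y, z; \theta)$ with margins $f_{XY}$ and $f_{XZ}$; this notation is introduced here for illustration and is not taken from the paper. The observed-data likelihood is then

$$
L(\theta) = \prod_{i \in A} f_{XY}(x_i, y_i; \theta) \prod_{j \in B} f_{XZ}(x_j, z_j; \theta).
$$

Only the $(X, Y)$ and $(X, Z)$ margins appear, so any part of $\theta$ that governs the dependence between $Y$ and $Z$ given $X$ is not determined by the data alone. The conditional independence assumption, $f(y, z \mid x) = f(y \mid x)\, f(z \mid x)$, is one constraint that removes this ambiguity.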

## Key findings

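- Nearest-neighbour matching relies on a conditional independence assumption that may be inappropriate for non-Gaussian data; violating it can lead to improper imputations.
- Model-based approaches are compared with nearest-neighbour imputation on a number of flow cytometry datasets.
- The model-based approach can address some of the weaknesses of the nonparametric nearest-neighbour technique.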

## Abstract

The statistical matching problem is a data integration problem with structured missing data. The general form involves the analysis of multiple datasets that only have a strict subset of variables jointly observed across all datasets. The simplest version involves two datasets, labelled A and B, with three variables of interest $X, Y$ and $Z$. Variables $X$ and $Y$ are observed in dataset A and variables $X$ and $Z$ are observed in dataset B. Statistical inference is complicated by the absence of joint $(Y, Z)$ observations. Parametric modelling can be challenging due to identifiability issues and the difficulty of parameter estimation. We develop computationally feasible procedures for the statistical matching of non-Gaussian data using suitable data augmentation schemes and identifiability constraints. Nearest-neighbour imputation is a common alternative technique due to its ease of use and generality. Nearest-neighbour matching is based on a conditional independence assumption that may be inappropriate for non-Gaussian data. The violation of the conditional independence assumption can lead to improper imputations. We compare model-based approaches to nearest-neighbour imputation on a number of flow cytometry datasets and find that the model-based approach can address some of the weaknesses of the nonparametric nearest-neighbour technique.
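The nearest-neighbour baseline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`nn_match`, `partial_corr`), the one-dimensional $X$, and the synthetic data are all assumptions made for illustration. Each record in dataset B receives the $Y$ value of the dataset A record whose $X$ is closest, and the residual correlation between $Y$ and $Z$ after regressing out $X$ shows what the imputation does to their conditional dependence.

```python
import numpy as np


def nn_match(x_a, y_a, x_b):
    """Donate Y from the A record with the closest X to each B record."""
    # Pairwise |x_b - x_a|: rows are recipients (B), columns are donors (A).
    dist = np.abs(x_b[:, None] - x_a[None, :])
    donor = dist.argmin(axis=1)
    return y_a[donor]


def partial_corr(y, z, x):
    """Correlation of Y and Z after removing a linear effect of X from each."""
    res_y = y - np.polyval(np.polyfit(x, y, 1), x)
    res_z = z - np.polyval(np.polyfit(x, z, 1), x)
    return np.corrcoef(res_y, res_z)[0, 1]


# Synthetic data (an assumption for illustration, not from the paper):
# Y and Z share a skewed latent term, so they are dependent given X.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=2 * n)
shared = rng.exponential(size=2 * n) - 1.0  # non-Gaussian, mean zero
y = x + shared + 0.5 * rng.normal(size=2 * n)
z = x + shared + 0.5 * rng.normal(size=2 * n)

# Dataset A observes (X, Y); dataset B observes (X, Z).
x_a, y_a = x[:n], y[:n]
x_b, z_b = x[n:], z[n:]
y_b_true = y[n:]  # unobserved in practice, kept here only for comparison

y_b_imputed = nn_match(x_a, y_a, x_b)

true_pc = partial_corr(y_b_true, z_b, x_b)
imputed_pc = partial_corr(y_b_imputed, z_b, x_b)
print(f"partial corr(Y, Z | X), true pairs:    {true_pc:.2f}")
print(f"partial corr(Y, Z | X), imputed pairs: {imputed_pc:.2f}")
```

By construction, the true pairs have a strong residual dependence between $Y$ and $Z$, while the imputed pairs have almost none, because the donor's latent term is unrelated to the recipient's. This is the conditional independence assumption at work: matching on $X$ alone cannot recover dependence that is never jointly observed.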

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1903.12342/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/1903.12342/full.md

## References

27 references — full list in the complete paper: https://tomesphere.com/paper/1903.12342/full.md

---
Source: https://tomesphere.com/paper/1903.12342