# Surrogate-guided sampling designs for classification of rare outcomes from electronic medical records data

**Authors:** W. Katherine Tan, Patrick J. Heagerty

arXiv: 1904.00412 · 2020-11-09

## TL;DR

This paper introduces surrogate-guided stratified sampling designs for choosing which EMR records to send for expert labeling, so that a limited labeling budget yields more cases of a rare clinical outcome and a better-trained classifier.

## Contribution

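The paper proposes a class of enrichment sampling designs in which selection for abstraction is stratified by auxiliary variables related to the true outcome. It shows that the resulting case-enriched samples improve model discrimination for rare outcomes in EMR data.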
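A short worked calculation suggests why the specificity of the stratifying variable matters. The notation below ($Y$ for the true outcome, $Z$ for a binary surrogate, $\pi$ for prevalence, $se$ and $sp$ for the surrogate's sensitivity and specificity) and all numbers are illustrative assumptions, not values taken from the paper. By Bayes' rule, the fraction of true cases among surrogate-positive records is

$$
P(Y=1 \mid Z=1) = \frac{se \cdot \pi}{se \cdot \pi + (1 - sp)(1 - \pi)}.
$$

With an assumed $\pi = 0.02$, $se = 0.70$, and $sp = 0.98$:

$$
P(Y=1 \mid Z=1) = \frac{0.70 \times 0.02}{0.70 \times 0.02 + 0.02 \times 0.98} = \frac{0.014}{0.0336} \approx 0.42.
$$

Under these assumptions, a record drawn from the surrogate-positive stratum is a case about 42% of the time, compared with 2% under simple random sampling. Lowering $sp$ inflates the $(1 - sp)(1 - \pi)$ term in the denominator, which dilutes the stratum with non-cases.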

## Key findings

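- Stratifying on highly specific auxiliary variables yields samples enriched with cases, which increases model discrimination for rare outcomes
- Simulations link the choice of sampling design to the resulting prediction model performance
- An application to radiology report text from the LIRE study illustrates the designs for label collection and subsequent machine learning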

## Abstract

Scalable and accurate identification of specific clinical outcomes has been enabled by machine-learning applied to electronic medical record (EMR) systems. The development of classification models requires the collection of a complete labeled data set, where true clinical outcomes are obtained by human expert manual review. For example, the development of natural language processing algorithms requires the abstraction of clinical text data to obtain outcome information necessary for training models. However, if the outcome is rare then simple random sampling results in very few cases and insufficient information to develop accurate classifiers. Since large scale detailed abstraction is often expensive, time-consuming, and not feasible, more efficient strategies are needed. Under such resource constrained settings, we propose a class of enrichment sampling designs, where selection for abstraction is stratified by auxiliary variables related to the true outcome of interest. Stratified sampling on highly specific variables results in targeted samples that are more enriched with cases, which we show translates to increased model discrimination and better statistical learning performance. We provide mathematical details, and simulation evidence that links sampling designs to their resulting prediction model performance. We discuss the impact of our proposed sampling on both model training and validation. Finally, we illustrate the proposed designs for outcome label collection and subsequent machine-learning, using radiology report text data from the Lumbar Imaging with Reporting of Epidemiology (LIRE) study.
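The enrichment idea described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the binary surrogate `z`, the 2% prevalence, the surrogate's sensitivity and specificity, and the 50/50 split of the labeling budget across strata are all hypothetical choices made for illustration.

```python
import random


def simulate_population(n, prevalence, sensitivity, specificity, rng):
    """Return (y, z) pairs: true outcome y and a binary surrogate z.

    All parameter values passed in are illustrative assumptions.
    """
    population = []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        if y == 1:
            z = 1 if rng.random() < sensitivity else 0
        else:
            z = 0 if rng.random() < specificity else 1
        population.append((y, z))
    return population


def simple_random_sample(population, n_label, rng):
    """Select n_label records uniformly at random, ignoring the surrogate."""
    return rng.sample(population, n_label)


def surrogate_guided_sample(population, n_label, frac_positive, rng):
    """Spend frac_positive of the budget on the z == 1 stratum, the rest on z == 0."""
    positives = [rec for rec in population if rec[1] == 1]
    negatives = [rec for rec in population if rec[1] == 0]
    n_pos = min(len(positives), round(n_label * frac_positive))
    n_neg = min(len(negatives), n_label - n_pos)
    return rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)


def count_cases(sample):
    """Number of true cases (y == 1) that expert review would uncover."""
    return sum(y for y, _ in sample)


rng = random.Random(0)  # fixed seed so the comparison is reproducible
population = simulate_population(
    n=50_000, prevalence=0.02, sensitivity=0.70, specificity=0.98, rng=rng
)

srs = simple_random_sample(population, n_label=500, rng=rng)
enriched = surrogate_guided_sample(population, n_label=500, frac_positive=0.5, rng=rng)

srs_cases = count_cases(srs)
enriched_cases = count_cases(enriched)
print(f"cases under simple random sampling: {srs_cases} of {len(srs)}")
print(f"cases under surrogate-guided sampling: {enriched_cases} of {len(enriched)}")
```

With the same labeling budget, the surrogate-guided sample should contain many more true cases than the simple random sample. Because the enriched sample is no longer representative of the source population, any quantity that depends on prevalence would need to account for the stratum sampling fractions; the full paper discusses the impact of the designs on training and validation.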

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.00412/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1904.00412/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/1904.00412/full.md

---
Source: https://tomesphere.com/paper/1904.00412