# Automatic Classifiers as Scientific Instruments: One Step Further Away from Ground-Truth

**Authors:** Jacob Whitehill, Anand Ramakrishnan

arXiv: 1812.08255 · 2019-05-07

## TL;DR

This paper examines how the accuracy of an automatic detector, trained to approximate an existing measurement tool, affects estimated correlations between the measured construct and other phenomena. It shows that such estimates can be misleading, even reversed in sign, at the sample sizes and accuracies used in recent affective computing research.

## Contribution

The paper provides a mathematical analysis of how detector accuracy affects correlation estimates, and shows that training multiple models for the same detection task yields only limited coverage of the set of possible detector outputs.

## Key findings

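- If the true correlation between the construct U and another phenomenon V is r, the expected sample correlation between the detector output and V is qr, i.e. r scaled by the detector accuracy q.
- The probability that the estimated correlation has the opposite sign to the true correlation can be around 20-30% for sample sizes n and accuracies q used in recent affective computing studies.
- Training multiple neural networks with different architectures and hyperparameters provides only limited coverage of the set of vectors whose correlation with U is q.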

## Abstract

Automatic machine learning-based detectors of various psychological and social phenomena (e.g., emotion, stress, engagement) have great potential to advance basic science. However, when a detector $d$ is trained to approximate an existing measurement tool (e.g., a questionnaire, observation protocol), then care must be taken when interpreting measurements collected using $d$ since they are one step further removed from the underlying construct. We examine how the accuracy of $d$, as quantified by the correlation $q$ of $d$'s outputs with the ground-truth construct $U$, impacts the estimated correlation between $U$ (e.g., stress) and some other phenomenon $V$ (e.g., academic performance). In particular: (1) We show that if the true correlation between $U$ and $V$ is $r$, then the expected sample correlation, over all vectors $\mathcal{T}^n$ whose correlation with $U$ is $q$, is $qr$. (2) We derive a formula for the probability that the sample correlation (over $n$ subjects) using $d$ is positive given that the true correlation is negative (and vice-versa); this probability can be substantial (around 20-30%) for values of $n$ and $q$ that have been used in recent affective computing studies. (3) With the goal to reduce the variance of correlations estimated by an automatic detector, we show that training multiple neural networks $d^{(1)},\ldots,d^{(m)}$ using different training architectures and hyperparameters for the same detection task provides only limited "coverage" of $\mathcal{T}^n$.
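
Claims (1) and (2) can be checked numerically with a small Monte Carlo sketch. This is not the paper's closed-form derivation, which is not reproduced in this summary. It assumes that a detector's output vector is drawn uniformly from the vectors whose sample correlation with $U$ is exactly $q$, and the values of `n`, `q`, `r`, and `trials` below are hypothetical choices for illustration. The function names are likewise illustrative.

```python
import numpy as np


def unit_centered(x):
    """Center a vector and scale it to unit length.

    For two such vectors, the dot product equals their Pearson correlation.
    """
    x = x - x.mean()
    return x / np.linalg.norm(x)


def random_orthogonal_unit(rng, n, basis):
    """Draw a random centered unit vector orthogonal to every vector in `basis`.

    Assumes the vectors in `basis` are centered, unit-length, and mutually
    orthogonal, so a single projection pass is enough.
    """
    z = rng.standard_normal(n)
    z = z - z.mean()
    for b in basis:
        z = z - (z @ b) * b
    return z / np.linalg.norm(z)


def simulate(n=40, q=0.5, r=-0.3, trials=20000, seed=0):
    """Return (mean sample correlation of d with V, fraction with wrong sign).

    All parameter defaults are assumed values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Ground-truth construct U, centered and normalized.
    u = unit_centered(rng.standard_normal(n))
    # Other phenomenon V with sample correlation exactly r with U.
    v = r * u + np.sqrt(1 - r**2) * random_orthogonal_unit(rng, n, [u])
    corrs = np.empty(trials)
    for i in range(trials):
        # Detector output d with sample correlation exactly q with U.
        d = q * u + np.sqrt(1 - q**2) * random_orthogonal_unit(rng, n, [u])
        corrs[i] = d @ v  # Pearson correlation, since both are unit-centered
    wrong_sign = np.mean(np.sign(corrs) != np.sign(r))
    return corrs.mean(), wrong_sign


if __name__ == "__main__":
    mean_corr, wrong_sign = simulate()
    print(f"mean corr(d, V): {mean_corr:.3f} (q * r = {0.5 * -0.3:.3f})")
    print(f"fraction of detectors with the wrong sign: {wrong_sign:.3f}")
```

Under these assumptions, the mean of the simulated correlations should land close to $qr$, while the wrong-sign fraction shrinks as `n` or `q` grows. Fixing the correlation with $U$ exactly, rather than in expectation, mirrors the abstract's phrasing "all vectors whose correlation with $U$ is $q$".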

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1812.08255/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/1812.08255/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/1812.08255/full.md

---
Source: https://tomesphere.com/paper/1812.08255