# Customizing Computerized Adaptive Test Stopping Rules for Clinical Settings Using the Negative Affect Subdomain of the NIH Toolbox Emotion Battery: Simulation Study

**Authors:** Saki Amagai, Aaron J Kaat, Rina S Fox, Emily H Ho, Sarah Pila, Michael A Kallen, Benjamin D Schalet, Cindy J Nowinski, Richard C Gershon

PMC · DOI: 10.2196/60215 · 2025-03-21

## TL;DR

This simulation study examines how alternative stopping rules for computerized adaptive tests (CATs) trade off patient burden against score reliability in the NIH Toolbox Anger Affect, Fear Affect, and Sadness item banks.

## Contribution

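The study evaluates alternative CAT stopping rules (an SE-change rule, fixed-length CATs of 4 to 12 items, and a reduced maximum of 8 items) against the original NIH Toolbox rules, with the aim of balancing test burden and reliability in clinical settings.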

## Key findings

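- Relative to the original rules, the alternative stopping rules slightly reduced test burden while increasing the proportion of assessments achieving high reliability for the adult banks.
- The SE-change rule and fixed-length CATs with 8 or fewer items notably increased the share of assessments with reliability below 0.85.
- Among the alternatives explored, the reduced maximum stopping rule (maximum of 8 items) best balanced precision and test length.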

## Abstract

Patient-reported outcome measures are crucial for informed medical decisions and evaluating treatments. However, they can be burdensome for patients and sometimes lack the reliability clinicians need for clear clinical interpretations.

We aimed to assess the extent to which applying alternative stopping rules can increase reliability for clinical use while minimizing the burden of computerized adaptive tests (CATs).

CAT simulations were conducted on 3 adult item banks in the NIH Toolbox for Assessment of Neurological and Behavioral Function Emotion Battery; the item banks were in the Negative Affect subdomain (ie, Anger Affect, Fear Affect, and Sadness) and contained at least 8 items. Under the originally applied NIH Toolbox CAT stopping rules, the CAT stopped if the score SE fell below 0.3 before 12 items were administered. We first contrasted this with an SE-change rule in a planned simulation analysis. In post hoc analyses, we then contrasted the original rules with fixed-length CATs (4 to 12 items), a reduction of the maximum number of items to 8, and other modifications. Burden was measured by the number of items administered per simulation, precision by the percentage of assessments meeting reliability cutoffs (0.85, 0.90, and 0.95), and accurate score recovery by the root mean squared error between the generating θ and the CAT-estimated “expected a posteriori”–based θ.
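The stopping rules and outcome measures described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' simulation code. All function names are hypothetical. It assumes the SE is expressed on a standardized θ metric with unit variance, so that reliability can be approximated as 1 − SE²; under that assumption an SE of 0.3 corresponds to a reliability of about 0.91. The SE-change rule is left out because its threshold is not stated in this summary.

```python
import math
from typing import Dict, Sequence


def reliability_from_se(se: float) -> float:
    """Assumed conversion: reliability = 1 - SE^2 on a unit-variance theta metric."""
    return 1.0 - se ** 2


def should_stop_original(n_items: int, se: float,
                         se_target: float = 0.3, max_items: int = 12) -> bool:
    """Original-style rule: stop once SE < target or the item cap is reached."""
    return se < se_target or n_items >= max_items


def should_stop_reduced_max(n_items: int, se: float,
                            se_target: float = 0.3, max_items: int = 8) -> bool:
    """Reduced-maximum rule: same SE target, but the cap is lowered to 8 items."""
    return should_stop_original(n_items, se, se_target, max_items)


def should_stop_fixed_length(n_items: int, length: int) -> bool:
    """Fixed-length rule: ignore SE and stop after exactly `length` items."""
    return n_items >= length


def pct_meeting_cutoffs(final_ses: Sequence[float],
                        cutoffs: Sequence[float] = (0.85, 0.90, 0.95)
                        ) -> Dict[float, float]:
    """Precision: percentage of simulated assessments at or above each cutoff."""
    rels = [reliability_from_se(se) for se in final_ses]
    return {c: 100.0 * sum(r >= c for r in rels) / len(rels) for c in cutoffs}


def rmse(true_thetas: Sequence[float], est_thetas: Sequence[float]) -> float:
    """Score recovery: root mean squared error between generating and estimated theta."""
    sq = [(t - e) ** 2 for t, e in zip(true_thetas, est_thetas)]
    return math.sqrt(sum(sq) / len(sq))
```

In a simulation loop, the chosen `should_stop_*` function would be checked after each administered item. The number of items at stopping gives the burden measure, and the final SE and θ estimate feed `pct_meeting_cutoffs` and `rmse`.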

In general, relative to the original rules, the alternative stopping rules slightly decreased burden while also increasing the proportion of assessments achieving high reliability for the adult banks; however, the SE-change rule and fixed-length CATs with 8 or fewer items also notably increased assessments yielding reliability <0.85. Among the alternative rules explored, the reduced maximum stopping rule best balanced precision and parsimony, presenting another option beyond the original rules.

Our findings demonstrate the challenges in attempting to reduce test burden while also achieving score precision for clinical use. Stopping rules should be modified in accordance with the context of the study population and the purpose of the study.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC11951945/full.md

---
Source: https://tomesphere.com/paper/PMC11951945