Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation
Saurabh Hinduja, Gurmeet Kaur, Maneesh Bilalpur, Jeffrey Cohn, Shaun Canavan

TL;DR
This paper demonstrates that subject-exclusive cross-validation introduces stochastic variance in facial AU detection evaluation and advocates for leave-one-dataset-out validation for more stable, domain-aware assessment.
Contribution
It quantifies the noise inherent in cross-validation and highlights the benefits of LODO evaluation for assessing model robustness across datasets.
Findings
On BP4D+, repeated 3-fold subject-exclusive cross-validation produces an empirical noise floor of ±0.065 in average F1.
Model rankings can change with different fold assignments.
LODO evaluation reveals domain-level instability not seen in standard cross-validation.
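The first two findings can be illustrated with a minimal sketch of repeated subject-exclusive cross-validation. It is not the authors' code. The helper names (`subject_exclusive_folds`, `repeated_cv_spread`) and the `evaluate_fold` callback are hypothetical. The summary does not say how the noise floor is computed, so the standard deviation of the per-repeat mean F1 is used here as one plausible stand-in.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def subject_exclusive_folds(subjects: Sequence[str], k: int, seed: int) -> List[List[str]]:
    """Partition the unique subject IDs into k disjoint folds.

    Splitting by subject (not by frame) keeps every subject in exactly one fold.
    """
    unique = sorted(set(subjects))
    if k < 2 or k > len(unique):
        raise ValueError("k must be between 2 and the number of unique subjects")
    rng = random.Random(seed)  # the seed controls the fold assignment
    rng.shuffle(unique)
    return [unique[i::k] for i in range(k)]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for one AU; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def repeated_cv_spread(
    subjects: Sequence[str],
    evaluate_fold: Callable[[List[str], List[str]], float],
    k: int = 3,
    seeds: Sequence[int] = range(10),
) -> Tuple[float, float]:
    """Repeat k-fold CV under different fold assignments.

    evaluate_fold(train_subjects, test_subjects) is a hypothetical callback that
    trains a model on the training subjects and returns F1 on the test subjects.
    Returns (mean, standard deviation) of the per-repeat average F1.
    Needs at least two seeds, since a spread cannot be computed from one repeat.
    """
    repeat_means = []
    for seed in seeds:
        folds = subject_exclusive_folds(subjects, k, seed)
        scores = []
        for i, test_subjects in enumerate(folds):
            train_subjects = [s for j, fold in enumerate(folds) if j != i for s in fold]
            scores.append(evaluate_fold(train_subjects, test_subjects))
        repeat_means.append(statistics.mean(scores))
    return statistics.mean(repeat_means), statistics.stdev(repeat_means)
```

Under this sketch, a difference between two models that is smaller than the returned spread would be hard to distinguish from a change in fold assignment alone.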
Abstract
Subject-exclusive cross-validation is the standard evaluation protocol for facial Action Unit (AU) detection, yet reported improvements are often small. We show that cross-validation itself introduces measurable stochastic variance. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of ±0.065 in average F1, with substantially larger variation for low-prevalence AUs. Operating-point metrics such as F1 fluctuate more than threshold-independent measures such as AUC, and model ranking can change under different fold assignments. We further evaluate cross-dataset robustness using a Leave-One-Dataset-Out (LODO) protocol across five AU datasets. LODO removes partition randomness and exposes domain-level instability that is not visible under single-dataset cross-validation. Together, these results suggest that gains often reported in cross-fold validation…
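The LODO protocol described in the abstract can be sketched as a deterministic split generator: each dataset is held out once for testing while the rest are used for training, so no random partition is involved. The function name and the placeholder dataset names below are assumptions for illustration. Of the five datasets, the abstract names only BP4D+.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_dataset_out(dataset_names: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training datasets, held-out dataset) pairs, one per dataset.

    The splits depend only on the list of datasets, so repeated runs always
    produce the same partitions.
    """
    names = list(dataset_names)
    if len(set(names)) != len(names):
        raise ValueError("dataset names must be unique")
    for held_out in names:
        train = [name for name in names if name != held_out]
        yield train, held_out


# Placeholder names: only BP4D+ is named in the abstract.
datasets = ["BP4D+", "dataset_B", "dataset_C", "dataset_D", "dataset_E"]
splits = list(leave_one_dataset_out(datasets))
for train, held_out in splits:
    print("train on", train, "-> test on", held_out)
```

Because every held-out set is an entire dataset, the resulting scores reflect cross-domain transfer, not sensitivity to fold assignment.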
