Reliability and Validity of Image-Based and Self-Reported Skin Phenotype Metrics
John J. Howard, Yevgeniy B. Sirotin, Jerry L. Tipton, and Arun R. Vemury

TL;DR
This study evaluates the reliability of image-based and self-reported skin-tone metrics, revealing that controlled, objective measures are essential for accurate biometric performance assessments across demographic groups.
Contribution
It demonstrates the unreliability of uncontrolled image-based and self-reported skin-tone measures and advocates for objective, controlled methods in biometric evaluations.
Findings
Image-based face-area-lightness measures (FALMs) vary significantly across images of the same individual.
Fitzpatrick Skin Types poorly predict actual skin-tone.
Noisy FALM estimates lead to incorrect demographic differential analysis.
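To make the first finding concrete, the sketch below shows one way a lightness estimate could be computed from face pixels and how its spread across images of the same person could be summarized. It is an illustration under stated assumptions, not the authors' pipeline: the summary does not give the paper's exact FALM definition, so mean CIELAB L* over face pixels is used as a stand-in, and the function names `srgb_to_lightness`, `falm`, and `within_subject_spread` are hypothetical.

```python
from statistics import mean, pstdev


def srgb_to_lightness(r, g, b):
    """Return CIELAB L* (0 to 100) for one 8-bit sRGB pixel (D65 white)."""
    def linearize(c):
        # Undo the sRGB transfer curve to get linear-light intensity.
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    # Relative luminance Y from linear RGB (sRGB / Rec. 709 primaries).
    y = 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)

    # CIE L* from Y, with the linear segment used near black.
    delta = 6 / 29
    f = y ** (1 / 3) if y > delta ** 3 else y / (3 * delta ** 2) + 4 / 29
    return 116 * f - 16


def falm(face_pixels):
    """ASSUMPTION: stand-in FALM = mean L* over (r, g, b) face pixels."""
    return mean(srgb_to_lightness(*p) for p in face_pixels)


def within_subject_spread(falms_by_subject):
    """Population std. dev. of per-image FALMs for each subject."""
    return {subject: pstdev(values) for subject, values in falms_by_subject.items()}
```

A large per-subject spread relative to the differences between subjects would indicate that the image-based estimate is dominated by acquisition conditions rather than by the person's skin, which is the failure mode the findings describe.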
Abstract
With increasing adoption of face recognition systems, it is important to ensure adequate performance of these technologies across demographic groups. Recently, phenotypes such as skin-tone have been proposed as superior alternatives to traditional race categories when exploring performance differentials. However, there is little consensus regarding how to appropriately measure skin-tone in evaluations of biometric performance or in AI more broadly. In this study, we explore the relationship between face-area-lightness-measures (FALMs) estimated from images and ground-truth skin readings collected using a device designed to measure human skin. FALMs estimated from different images of the same individual varied significantly relative to ground-truth FALM. This variation was only reduced by greater control of acquisition (camera, background, and environment). Next, we compare ground-truth…
