Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy
Jonathan Krause, Varun Gulshan, Ehsan Rahimy, Peter Karth, Kasumi Widner, Greg S. Corrado, Lily Peng, Dale R. Webster

TL;DR
This study investigates how grader variability and reference standards affect the development of deep learning models for diabetic retinopathy detection, highlighting the importance of adjudicated grades for optimal model performance.
Contribution
The study demonstrates that using adjudicated reference standards significantly improves model accuracy, bringing model performance in line with that of expert ophthalmologists.
Findings
Adjudicated grades enhance model performance.
Model accuracy is comparable to that of ophthalmologists.
Reference standard variability impacts model training.
Abstract
Diabetic retinopathy (DR) and diabetic macular edema are common complications of diabetes which can lead to vision loss. The grading of DR is a fairly complex process that requires the detection of fine features such as microaneurysms, intraretinal hemorrhages, and intraretinal microvascular abnormalities. Because of this, there can be a fair amount of grader variability. There are different methods of obtaining the reference standard and resolving disagreements between graders, and while it is usually accepted that adjudication until full consensus will yield the best reference standard, the difference between various methods of resolving disagreements has not been examined extensively. In this study, we examine the variability in different methods of grading, definitions of reference standards, and their effects on building deep learning models for the detection of diabetic eye…
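To make the idea of "resolving disagreements between graders" concrete, the following is a minimal Python sketch, not the authors' method. It assumes grades are integers on a 5-point severity scale (0 to 4), derives a majority-vote reference grade that returns None on ties (the cases that would need adjudication), and measures agreement between two graders with a quadratic-weighted Cohen's kappa. The function names `majority_grade` and `quadratic_weighted_kappa` and the 5-point scale are assumptions introduced for illustration.

```python
from collections import Counter


def majority_grade(grades):
    """Return the most common grade, or None when the top count is tied.

    A None result marks an image whose graders disagree without a clear
    majority, i.e. a candidate for adjudication (hypothetical workflow).
    """
    counts = Counter(grades).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def quadratic_weighted_kappa(a, b, num_grades=5):
    """Quadratic-weighted Cohen's kappa between two graders' integer grades.

    Assumes grades lie in 0..num_grades-1 and that a and b are aligned
    per image. Larger grade gaps are penalised quadratically.
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    n = len(a)
    observed = [[0] * num_grades for _ in range(num_grades)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_grades))
              for j in range(num_grades)]
    scale = (num_grades - 1) ** 2
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(num_grades):
        for j in range(num_grades):
            w = (i - j) ** 2 / scale
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / n
    if den == 0:
        # Both graders used one and the same grade for every image.
        return 1.0
    return 1.0 - num / den


if __name__ == "__main__":
    per_image_grades = [[2, 2, 3], [0, 1, 2], [4, 4, 4]]
    print([majority_grade(g) for g in per_image_grades])
    print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 3]))
```

In this sketch a tie is left unresolved rather than broken arbitrarily, which mirrors the distinction the abstract draws between mechanical ways of combining grades and adjudication until consensus.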
