Researchers evaluating the accuracy of artificial intelligence (AI) systems for screening referable diabetic retinopathy (DR) found sensitivities ranging from 50.98% to 85.90%, according to a study published in Diabetes Care. Of the 7 systems tested, 3 (42.86%) had sensitivity comparable to human graders, and 1 (14.29%) had comparable specificity.
Investigators compared human grader scoring with AI classification for 311,604 real-world fundus photographs from 23,724 veterans, including racially diverse populations. The images came from patients with diagnosed diabetes mellitus who were referred for teleretinal DR screening at the Veterans Affairs (VA) Puget Sound Health Care System (HCS) in Washington or the Atlanta VA HCS in Georgia between 2006 and 2018.
“In this independent, external, head-to-head automated DR screening algorithm validation study, we found that the screening performance of state-of-the-art algorithms varied considerably, with substantial differences in overall performance, even though all the tested algorithms are currently being used clinically around the world and one has FDA approval,” the researchers reported.
Ophthalmologists and optometrists at the VA originally graded fundus images manually using the International Clinical Diabetic Retinopathy Severity Scale (ICDR), from 0 for no DR to 4 for proliferative diabetic retinopathy (PDR); ungradable images were labeled 5. Masked clinical experts, including a board-certified ophthalmologist and 2 fellowship-trained retina specialists, re-graded a subset of 7379 photographs from 735 encounters. These arbitrated scores served as the reference standard for comparing VA teleretinal graders with automated screening. Each algorithm, masked and labeled A through G, independently categorized the photographs.
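For readers less familiar with the grading workflow, the sketch below illustrates, in Python, how ICDR grades might be mapped to binary referral decisions. The referral threshold (moderate non-proliferative DR or worse) and the treatment of ungradable images as referable are illustrative assumptions, not details taken from the study's protocol.

```python
# Illustrative sketch only: maps ICDR grades to referral decisions.
# The referral threshold (grade >= 2) and the handling of ungradable
# images (grade 5) are assumptions, not confirmed by the study.
ICDR_LABELS = {
    0: "No DR",
    1: "Mild nonproliferative DR",
    2: "Moderate nonproliferative DR",
    3: "Severe nonproliferative DR",
    4: "Proliferative DR (PDR)",
    5: "Ungradable",
}

def is_referable(grade: int, referral_threshold: int = 2) -> bool:
    """Return True if an ICDR grade should trigger referral.

    Ungradable images (grade 5) are treated as referable here, a common
    screening convention, though the study may handle them differently.
    """
    if grade == 5:
        return True
    return grade >= referral_threshold

if __name__ == "__main__":
    for grade, label in ICDR_LABELS.items():
        print(f"Grade {grade} ({label}): refer = {is_referable(grade)}")
```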
For gradable images, VA clinicians achieved 100% sensitivity for detecting moderate or severe non-proliferative DR and PDR. At the moderate or severe disease levels, algorithms E, F, and G performed with sensitivity similar to human scorers (P =.500, P =.500, and P =1, respectively), the investigators explained. Only algorithm G showed sensitivity (P =.441) and specificity (P =.195) comparable to the VA scorers when evaluated against the arbitrated subset. Algorithm A performed significantly worse than human scorers at all severity levels of DR, and the researchers estimated it would miss 25.58% of advanced retinopathy cases.
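The sensitivity and specificity values cited here summarize how well each grader's referral calls agree with the reference standard. The minimal Python sketch below shows how such metrics are computed from paired labels; the data and function names are hypothetical, and this is not the authors' analysis code.

```python
# Illustrative only: computes sensitivity and specificity of a screener's
# referral calls against a reference standard. Inputs are parallel lists
# of booleans (True = referable).
def sensitivity_specificity(predicted, reference):
    tp = sum(p and r for p, r in zip(predicted, reference))
    tn = sum((not p) and (not r) for p, r in zip(predicted, reference))
    fp = sum(p and (not r) for p, r in zip(predicted, reference))
    fn = sum((not p) and r for p, r in zip(predicted, reference))
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Hypothetical example: 10 images, 4 of which are truly referable.
reference = [True, True, True, True, False, False, False, False, False, False]
algorithm = [True, True, False, True, False, True, False, False, False, False]
sens, spec = sensitivity_specificity(algorithm, reference)
print(f"sensitivity={sens:.2%}, specificity={spec:.2%}")
```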
Investigators found a higher proportion of gradable images and better AI performance in the Atlanta dataset. They speculated this was due to the site's standard practice of dilating all patients' pupils, thorough technician training, and differences in background retinal and choroidal pigmentation; close to 50% of Atlanta participants were African American. These results indicate significantly less generalizability for some algorithms, the study says.
A limitation of the study was its predominantly older, male population, most of whom had type 2 diabetes.
Researchers also estimated the dollar value per DR screening encounter for each algorithm, based on average provider salary per minute. For the 3 best-performing algorithms, values ranged from $15.14 to $18.06 when ophthalmologists served as the human graders and from $7.74 to $9.24 with optometrist graders, the researchers wrote. They noted that the estimated value per encounter was based on care provided at the VA.
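As a rough illustration of how a value-per-encounter figure can be derived, the sketch below multiplies hypothetical provider minutes per encounter by a hypothetical per-minute salary; none of these numbers are the study's actual inputs, and the study's own methodology may differ.

```python
# Back-of-the-envelope sketch: value per encounter estimated as provider
# minutes saved per screening encounter times salary per minute.
# ALL numbers below are hypothetical placeholders, not the study's inputs.
def value_per_encounter(minutes_per_encounter: float, annual_salary: float,
                        working_minutes_per_year: float = 2080 * 60) -> float:
    salary_per_minute = annual_salary / working_minutes_per_year
    return minutes_per_encounter * salary_per_minute

# Hypothetical inputs for illustration only.
print(f"Ophthalmologist grader: ${value_per_encounter(6, 300_000):.2f}")
print(f"Optometrist grader:     ${value_per_encounter(6, 150_000):.2f}")
```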
Disclosure: Several study authors declared affiliations with the biotech or pharmaceutical industries. Please see the original reference for a full list of authors’ disclosures.
Reference
Lee AY, Yanagihara RT, Lee CS, et al. Multicenter, head-to-head, real-world validation study of seven automated artificial intelligence diabetic retinopathy screening systems. Diabetes Care. Published online January 5, 2021. doi:10.2337/dc20-1877