Diabetic retinopathy (DR) is the leading cause of blindness among adults aged 20 to 74 years. As the prevalence of diabetes grows, the number of people with DR worldwide is expected to reach 160 million by 2045. Current guidelines from the American Academy of Ophthalmology recommend that people with diabetes be screened every year for referable DR. Screening involves a vision and a retinal examination; the latter consists of ophthalmoscopy or fundus photography, interpreted by a qualified reader either onsite or via telemedicine.

There are a number of scales for DR severity. One in common use is the five-class International Clinical Diabetic Retinopathy (ICDR) severity scale, which ranges from zero for no apparent DR to four for proliferative DR. Screening also detects the presence or absence of clinically significant diabetic macular edema (DME). Referable DR is generally defined as moderate non-proliferative DR or worse and/or DME; people with referable DR need a full ophthalmic examination and medical or surgical therapy to prevent blindness.

To address the substantial DR screening burden, providers have been using artificial intelligence (AI) for over 20 years to supplement or even replace human graders. The earliest AI devices identified pathological features in fundus images, such as haemorrhages and exudates, to determine whether DR was present. Recent advances in computing power mean that deep learning is now the main AI technique used in DR screening. Put simply, deep learning involves feeding the AI model a large amount of data – such as retinal images – from which it learns to perform a task like classifying DR severity.

Each retinal image carries a label from an expert clinician identifying the degree of DR severity it displays. The model learns to predict that label, refining itself through iterative feedback until it matches the expert grades with a high degree of accuracy. Many deep learning models can now outperform the earlier feature-based models.
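To make this concrete, the sketch below shows the kind of supervised training loop this description implies, written in PyTorch with random tensors standing in for labelled fundus photographs and a generic ResNet backbone. It is a minimal, hypothetical illustration only; none of it is taken from the commercial devices discussed in this review.

```python
# Minimal sketch of supervised training for a 5-class ICDR severity classifier.
# The random tensors below stand in for labelled fundus photographs; a real
# pipeline would load expert-graded images (labels 0-4) from disk instead.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18

# Placeholder data: 32 "images" (3 x 224 x 224) with expert grades 0-4.
images = torch.randn(32, 3, 224, 224)
grades = torch.randint(0, 5, (32,))
loader = DataLoader(TensorDataset(images, grades), batch_size=8, shuffle=True)

# A small convolutional backbone with a 5-way output head (one per ICDR class).
model = resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 5)

criterion = nn.CrossEntropyLoss()          # penalises mismatches with the expert label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(3):                     # the iterative feedback loop
    for batch_images, batch_grades in loader:
        optimizer.zero_grad()
        predictions = model(batch_images)  # predicted DR severity scores
        loss = criterion(predictions, batch_grades)
        loss.backward()                    # feedback: adjust weights to reduce the error
        optimizer.step()
```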

So far, three AI algorithms have been cleared by the US Food and Drug Administration (FDA) for use in DR screening: IDx-DR, EyeArt and AEYE Diagnostic Screening (AEYE-DS). All three are fully autonomous algorithms that work without human supervision. Given these advances, Aaron Lee and colleagues from the Department of Ophthalmology at the University of Washington in Seattle have reviewed the development and impact of AI in diabetic eye screening.

Prospectively studied algorithms

There have been a number of regulatory approval studies for commercial DR screening algorithms, using open-source, retrospective or prospective data sets. In this review, the authors focus mainly on prospective data sets, as these are most closely aligned with real-world clinical implementation.

The FDA classes DR screening algorithms as class II devices, which means they must show equivalence to existing technology. In this case, the existing technology is IDx-DR v2.0, which was approved in 2018; both EyeArt and AEYE-DS were subsequently compared against it and received approval. In the EU, multiple such devices have a CE mark, including EyeArt, IDx-DR, Retmarker, Google and the Singapore Eye Lesion Analyzer (SELENA). To obtain the CE mark, these devices must undergo an external audit and certification process, followed by voluntary uploading onto the European Database on Medical Devices. The authors note that many devices that claim to be certified cannot be found in the database.

IDx-DR was the first fully autonomous AI system in any field of medicine to receive FDA approval. The pivotal trial was a multicentre prospective study in which the algorithm was compared with a reading centre's standard for Early Treatment Diabetic Retinopathy Study (ETDRS) grading and DME detection, with the task of identifying fundus photographs showing more than mild DR. The algorithm showed a sensitivity of 87.2% and a specificity of 90.7%.
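For readers less familiar with these metrics, the short sketch below shows how sensitivity and specificity are computed from an algorithm's referral decisions against the reference-standard grades. The counts are purely illustrative and are not drawn from this or any other trial.

```python
# Minimal sketch of how sensitivity and specificity are calculated from an
# algorithm's referral decisions versus the reference-standard grades.
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """tp, fn, tn, fp: counts of the algorithm's calls against the reference standard."""
    sensitivity = tp / (tp + fn)   # proportion of truly referable cases the algorithm flags
    specificity = tn / (tn + fp)   # proportion of non-referable cases it correctly clears
    return sensitivity, specificity

# Purely illustrative counts: 90 of 100 referable cases flagged, 85 of 100 non-referable cleared.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=85, fp=15)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```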

EyeArt is a deep learning-based classification tool that has been extensively tested in research studies. In the prospective trial used for FDA clearance, it was tested in 893 patients in the US to detect more than mild DR against a reference standard. The trial found that EyeArt had a sensitivity of 95.5% and a specificity of 85%.

Meanwhile, SELENA was trained on a data set of referable DR from a Singaporean population and then validated retrospectively in an ethnically diverse cohort that included patients from China, Hong Kong, Singapore, Mexico, Australia and the US. It was also prospectively validated in Zambia and showed a sensitivity of 92.25% and a specificity of 89.04% for referable DR. SELENA is currently in clinical use in Singapore as part of the national diabetes screening programme. 

The review describes the performance of a number of other DR screening AI algorithms and notes that many have prospective evaluations showing greater than 85% sensitivity and specificity against a human grader reference standard. However, the number of photographs required varies by algorithm, all but the AIDRScreening system use non-mydriatic imaging, the reference grading standards were not uniform across algorithms, and the populations in which they were evaluated differed widely in their demographics. Finally, the algorithms have a lower threshold for deeming an image ungradable, so AI devices often report higher rates of ungradable images than human graders. This has the potential to increase clinical workflow time and the number of unnecessary referrals.
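As a purely hypothetical illustration of why ungradable results add workflow, the sketch below routes a device's output into one of three clinic actions. The labels, quality check and referral policy are assumptions for illustration and do not describe any specific device.

```python
# Hypothetical routing of an autonomous screening device's output in clinic.
# Labels, the quality check and the referral policy are illustrative assumptions.
from enum import Enum

class Action(Enum):
    RESCREEN = "no referable DR: rescreen in 12 months"
    REFER = "refer to an eye care professional"
    UNGRADABLE = "image quality insufficient: re-image or refer"

def route_patient(ai_output: str, image_gradable: bool) -> Action:
    """ai_output: 'referable' or 'non_referable'; image_gradable: device's quality check."""
    if not image_gradable:
        # A stricter gradability threshold than a human grader's means more exams
        # end up here, adding workflow time and potentially unnecessary referrals.
        return Action.UNGRADABLE
    return Action.REFER if ai_output == "referable" else Action.RESCREEN

print(route_patient("non_referable", image_gradable=False))  # Action.UNGRADABLE
```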

Head-to-head validation

It is important to know how different AI models compare when investing in one for a diabetic eye screening programme. Although many studies of different algorithms have been carried out, as described above, comparison between them is challenging. This is because the data sets used in the individual studies differ, and a model's performance can vary dramatically depending on the test data it is fed.

Nevertheless, there have been a few head-to-head validation studies. The largest study of deep-learning algorithms was carried out by US researchers in 2021 in an attempt to validate the performance of multiple commercially available models. They contacted 23 companies, of which five agreed to take part. Each algorithm was anonymised and the results were blinded to the researchers, although they were made available to the companies so that they could adjust their software if needed. The data set consisted of fundus photographs from two Veterans Affairs (VA) hospitals, one in Seattle and one in Atlanta. The performance of the algorithms was compared with that of the VA graders, and a group of independent graders was brought in to assess a subset of the images.

The researchers found wide variability in performance between the different algorithms, with sensitivity ranging from 50.98% to 85.90% and specificity from 60.42% to 83.9%. Most of the algorithms were not superior to the VA graders when judged against the independent graders, save for two that reached higher sensitivities and one that was comparable in both sensitivity and specificity. It is concerning that the majority cannot, as yet, outperform human graders.

The researchers also noted differences between the Seattle and Atlanta cohorts, with worse performance in the former. The Atlanta cohort underwent pharmacological dilation before screening, while the Seattle cohort did not, and the Atlanta cohort was more racially and ethnically diverse than the Seattle cohort. This suggests that the algorithms are very sensitive to differences in data sets, and it is vital that future head-to-head studies look at performance in the population in which the devices will actually be used.

Cost-effectiveness

Whether AI algorithms are actually cost-effective compared with human graders remains unclear, as research has shown conflicting findings. Much depends on factors such as geography and deployment strategy. In the US, for instance, AI screening is estimated to be more cost-effective than human grading, because the high cost of human graders in the US and similar countries makes AI devices comparatively cheap. Cost-effectiveness is likely to be lower in low- to middle-income countries, where labour costs are lower.

Deployment strategy makes a difference too. A study from Singapore showed that a semi-autonomous system, in which human grading was applied to all DR-positive images, was the most cost-effective approach. So, beyond choosing an algorithm, the way it is used in clinical practice should also be considered.
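To make the trade-off concrete, here is a deliberately simplified, hypothetical cost comparison of three deployment strategies. Every number is a placeholder and none is taken from the Singapore study or any published analysis.

```python
# Deliberately simplified, hypothetical comparison of per-patient grading cost
# under three deployment strategies; all figures are made-up placeholders.
def cost_per_patient(strategy: str,
                     ai_fee: float = 5.0,           # assumed AI charge per exam
                     grader_fee: float = 15.0,      # assumed human grading cost per exam
                     dr_positive_rate: float = 0.2  # assumed share of exams the AI flags
                     ) -> float:
    if strategy == "human_only":
        return grader_fee
    if strategy == "fully_autonomous":
        return ai_fee
    if strategy == "semi_autonomous":
        # AI grades every exam; humans re-grade only the AI-positive ones.
        return ai_fee + dr_positive_rate * grader_fee
    raise ValueError(f"unknown strategy: {strategy}")

for s in ("human_only", "fully_autonomous", "semi_autonomous"):
    print(f"{s}: {cost_per_patient(s):.2f} per patient")
```

A real cost-effectiveness analysis would also weigh health outcomes, such as vision-years saved, against these costs, which this sketch deliberately ignores.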

AI does have the potential to reduce the costs of DR screening, but the technology still has to be integrated into billing, insurance and payment structures. Given the differences in healthcare systems around the world, this is currently a significant challenge that makes it hard to calculate the true cost-effectiveness of AI in DR screening.

Equity and bias

Developers of AI models have an ethical responsibility to ensure equitable outcomes when their models are applied to different populations. This means avoiding bias, which can otherwise harm people through underdiagnosis or false-positive results. The most effective way of avoiding bias is to evaluate the model on extensive, diverse cohorts.

For instance, a study of the SELENA model evaluated its performance across a wide range of geographic locations, ethnicities and camera types and found varying levels of performance among Malaysian, Indian, Chinese, African American, Mexican and Hong Kong patients. By contrast, another study, of EyeArt, found that its performance was not significantly affected by ethnicity, sex or camera type. Given this variability, algorithms should be continuously monitored after deployment to check on patient outcomes and ensure equitable access to these novel technologies. Currently, there is no standard for continuous model evaluation of AI algorithms.
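As a hypothetical sketch of what continuous subgroup monitoring could look like in practice, the code below recomputes sensitivity and specificity per group from logged referral decisions and confirmed outcomes. The field names, groups and data are illustrative assumptions, not a description of any existing monitoring standard.

```python
# Hypothetical post-deployment subgroup monitoring: recompute sensitivity and
# specificity per demographic group or site from logged decisions and outcomes.
from collections import defaultdict

def subgroup_performance(records):
    """records: iterable of dicts with 'group', 'ai_referred' (bool), 'truly_referable' (bool)."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for r in records:
        c = counts[r["group"]]
        if r["truly_referable"]:
            c["tp" if r["ai_referred"] else "fn"] += 1
        else:
            c["fp" if r["ai_referred"] else "tn"] += 1
    return {
        g: {
            "sensitivity": c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else None,
            "specificity": c["tn"] / (c["tn"] + c["fp"]) if c["tn"] + c["fp"] else None,
        }
        for g, c in counts.items()
    }

# Illustrative log entries; a real audit would use far larger, routinely collected data.
log = [
    {"group": "site_A", "ai_referred": True, "truly_referable": True},
    {"group": "site_A", "ai_referred": False, "truly_referable": False},
    {"group": "site_B", "ai_referred": False, "truly_referable": True},
]
print(subgroup_performance(log))
```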

Introducing AI-READI

The ability to train, validate and compare AI algorithms for equitable DR screening has been limited by a lack of high-quality, large and inclusive data sets. However, the National Institutes of Health is funding a new project called AI Ready and Equitable Atlas for Diabetes Insights (AI-READI), which should fill this gap with an open-source, equitable and powerful data set for AI training, validation and comparison.

AI-READI will develop a cross-sectional data set of over 4000 people across the US who are balanced for sex, race/ethnicity (White, Black, Asian American, Hispanic) and four stages of diabetes severity (no diabetes, lifestyle controlled, oral-medication controlled and insulin dependent). The database will also contain much more information – namely, data on social determinants of health, continuous glucose monitoring, testing for endocrine, renal and cardiac biomarkers, environmental sensors and 24-hour wearable activity monitoring. It will also include retinal imaging data from colour fundus photography, optical coherence tomography and optical coherence tomography angiography. This comprehensive data set will be able to support numerous AI applications to enhance diabetes care and generally contribute to the advancement of AI in medicine. 
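As a back-of-the-envelope illustration of what such a balanced design implies for recruitment, the sketch below splits a 4,000-participant target evenly across the sex by race/ethnicity by severity strata. The even split is an assumption for illustration and may not reflect AI-READI's actual sampling plan.

```python
# Illustrative even split of a recruitment target across the strata described above.
from itertools import product

sexes = ["female", "male"]
race_ethnicity = ["White", "Black", "Asian American", "Hispanic"]
severity = ["no diabetes", "lifestyle controlled", "oral-medication controlled", "insulin dependent"]

strata = list(product(sexes, race_ethnicity, severity))   # 2 x 4 x 4 = 32 strata
target_total = 4000
per_stratum = target_total // len(strata)                 # 125 participants per stratum

print(f"{len(strata)} strata, about {per_stratum} participants each")
```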

In conclusion

A number of AI devices are available that could make an impact in DR screening, and several already show good performance on prospective data sets, but some knowledge gaps must be filled before the technology can deliver optimal care for people with diabetes.

There is an urgent need for further head-to-head validation studies to inform clinicians about which AI devices are best to deploy in their practice.

Studies have shown that AI devices can be cost-effective, but complex billing arrangements and variability across healthcare systems must still be taken into account before their true cost-effectiveness can be known. Finally, the performance of these algorithms varies across different data sets, so they still need to demonstrate equitable outcomes in the clinical setting. So, while AI devices may significantly reduce the burden of DR screening around the world in the future, the above knowledge gaps must be addressed to ensure their effective use.

To read this paper, go to: Rajesh AE, Davidson OQ, Lee CS, Lee AY. Artificial intelligence and diabetic retinopathy: AI framework, prospective studies, head-to-head validation and cost-effectiveness. Diabetes Care 2023; 46: 1728–1739. http://doi.org/10.2337/dci23-0032

Any opinions expressed in this article are the responsibility of the EASD e-Learning Programme Director, Dr Eleanor D Kennedy.