Comparing AI Models for Dermatological Diagnoses

Researchers emphasized the need for improved reliability and regulatory frameworks for AI tools before they are integrated into dermatological practice.

Physician utilizing AI | Image Credit: © Toowongsa - stock.adobe.com

Artificial intelligence (AI) has been explored as a useful tool across medical fields, including dermatology.1 Large language models (LLMs) such as ChatGPT and Claude 3 Opus offer advanced capabilities in natural language processing and image analysis.2 With the recent introduction of image analysis capabilities, LLMs have attracted significant interest within dermatology.3 Despite this potential, researchers behind a recent comparison noted that AI tools remain underused in clinical practice due to legal and ethical concerns, especially in high-risk areas such as medical diagnosis. The comparison aimed to highlight the models' strengths and limitations, offering guidance for optimizing AI-assisted diagnostic tools in dermatology while addressing regulatory issues.4

Methods

Researchers randomly selected 100 dermoscopic images (50 malignant melanomas, 50 benign nevi) from the International Skin Imaging Collaboration (ISIC) archive using a computer-generated randomization process.5 Each image was presented to Claude 3 Opus and ChatGPT with the same prompt, instructing the models to provide the top 3 differential diagnoses for each image ranked by likelihood. The exact prompt was: “Please provide the top 3 differential diagnoses for this dermoscopic image, ranked by likelihood. Focus on distinguishing between melanoma and benign nevi.” The models’ responses were recorded, and researchers assessed primary diagnosis accuracy, accuracy of the top 3 differential diagnoses, and malignancy discrimination ability. The McNemar test was used to compare the models’ performance.
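The McNemar test is the standard choice here because the 2 models classified the same 100 images: it compares paired classifiers using only the discordant cases (images one model got right and the other got wrong). A minimal sketch of the continuity-corrected version follows; the discordant counts passed in at the end are hypothetical illustrations, not data from the study.

```python
import math

def mcnemar(b, c):
    """Continuity-corrected McNemar chi-square for 2 paired classifiers.

    b: cases model A classified correctly and model B incorrectly
    c: cases model B classified correctly and model A incorrectly
    Concordant cases (both right or both wrong) do not enter the statistic.
    Returns the chi-square statistic and its p-value (1 degree of freedom).
    """
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # For 1 degree of freedom the chi-square survival function reduces to
    # erfc(sqrt(chi2 / 2)), so no statistics library is needed.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical discordant counts, for illustration only:
chi2, p = mcnemar(18, 10)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")  # chi2 = 1.750, p ≈ 0.186
```

With heavily unbalanced discordant counts the p-value drops below .05, which is the pattern behind the significant malignancy-discrimination result reported below.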

Results

For the primary diagnosis, researchers found that Claude 3 Opus achieved 54.9% sensitivity (95% CI 44.08% to 65.37%), 57.14% specificity (95% CI 46.31% to 67.46%), and 56% accuracy (95% CI 46.22% to 65.42%), while ChatGPT demonstrated 56.86% sensitivity (95% CI 45.99% to 67.21%), 38.78% specificity (95% CI 28.77% to 49.59%), and 48% accuracy (95% CI 38.37% to 57.75%). The McNemar test showed no significant difference between the 2 models (P=.17).

For the top 3 differential diagnoses, the comparison reported that Claude 3 Opus and ChatGPT included the correct diagnosis in 76% (95% CI 66.33% to 83.77%) and 78% (95% CI 68.46% to 85.45%) of cases, respectively. The McNemar test again showed no significant difference (P=.56).

In malignancy discrimination, researchers stated that Claude 3 Opus outperformed ChatGPT with 47.06% sensitivity, 81.63% specificity, and 64% accuracy, compared with 45.1%, 42.86%, and 44%, respectively. The McNemar test showed a significant difference (P<.001). Claude 3 Opus had an odds ratio of 3.951 (95% CI 1.685 to 9.263) in discriminating malignancy, while ChatGPT had an odds ratio of 0.616 (95% CI 0.297 to 1.278).
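All of these per-model figures derive from a 2x2 confusion matrix. The sketch below reproduces Claude 3 Opus's malignancy-discrimination numbers; the raw counts were not published, so the ones used here are inferred from the reported percentages (e.g., 47.06% sensitivity ≈ 24/51), which are consistent with an effective split of 51 malignant and 49 benign evaluable images — an assumption on our part, not the study's data.

```python
def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and diagnostic odds ratio
    from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)        # true positives among all malignant
    specificity = tn / (tn + fp)        # true negatives among all benign
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    odds_ratio = (tp * tn) / (fn * fp)  # diagnostic odds ratio
    return sensitivity, specificity, accuracy, odds_ratio

# Counts inferred from Claude 3 Opus's reported malignancy discrimination
# (47.06% sensitivity, 81.63% specificity, 64% accuracy, OR 3.951), assuming
# 51 malignant and 49 benign images; illustrative, not published raw data.
sens, spec, acc, odds = binary_metrics(tp=24, fn=27, tn=40, fp=9)
print(f"sensitivity={sens:.2%}, specificity={spec:.2%}, "
      f"accuracy={acc:.2%}, OR={odds:.3f}")
```

Note how the odds ratio combines both error types: ChatGPT's odds ratio below 1 reflects its specificity falling under 50%, despite a sensitivity similar to Claude 3 Opus's.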

Discussion and Future Research

The comparison evaluated the effectiveness of LLMs in dermatological diagnosis and identified their limitations. Researchers found Claude 3 Opus outperformed ChatGPT in distinguishing between malignant and benign skin lesions, but both models had issues: Claude 3 Opus incorrectly labeled some melanomas as benign, while ChatGPT frequently misclassified nevi as melanomas. These results highlight the necessity for further development and thorough clinical validation of AI diagnostic tools before they are widely adopted in dermatology. Researchers suggested that future efforts should focus on enhancing model reliability and interpretability through collaboration among AI researchers, dermatologists, and other healthcare professionals. They also noted that addressing legal and ethical concerns is crucial, as the European Commission is already proposing regulations for high-risk AI applications. Overall, while LLMs like Claude 3 Opus and ChatGPT show promise, the researchers stated these models cannot yet replace human expertise and require ongoing research and regulatory attention to ensure their safe and effective integration into clinical practice.

References

  1. Gomolin A, Netchiporouk E, Gniadecki R, Litvinov IV. Artificial intelligence applications in dermatology: where do we stand? Front Med (Lausanne). 2020;7:100. Published 2020 Mar 31. doi:10.3389/fmed.2020.00100
  2. Rundle CW, Szeto MD, Presley CL, et al. Analysis of ChatGPT generated differential diagnoses in response to physical exam findings for benign and malignant cutaneous neoplasms. J Am Acad Dermatol. 2024;90(3):615-616. doi:10.1016/j.jaad.2023.10.040
  3. Hebebrand M. Can AI diagnose dermatological conditions? Dermatology Times. June 13, 2024. Accessed August 8, 2024. https://www.dermatologytimes.com/view/can-ai-diagnose-dermatological-conditions-.
  4. Liu X, Duan C, Kim MK, et al. Claude 3 opus and ChatGPT with GPT-4 in dermoscopic image analysis for melanoma diagnosis: Comparative performance analysis. JMIR Med Inform. 2024;12:e59273. Published 2024 Aug 6. doi:10.2196/59273
  5. International Skin Imaging Collaboration. ISIC. Accessed August 8, 2024. https://www.isic-archive.com/.
© 2024 MJH Life Sciences

All rights reserved.