Researchers emphasized the need for improved reliability and regulatory frameworks for AI tools before they are integrated into dermatological practice.
Artificial Intelligence (AI) has been explored as a useful tool in various medical fields, including dermatology.1 Large language models (LLMs), such as ChatGPT and Claude 3 Opus, have advanced capabilities in natural language processing and image analysis.2 With the recent introduction of image analysis capabilities, LLMs have attracted significant interest within the dermatological field.3 Despite this potential in clinical practice, researchers behind a recent comparison noted that AI tools remain underused due to legal and ethical concerns, especially in high-risk areas such as medical diagnosis. The comparison aimed to highlight the strengths and limitations of these models, offering guidance for optimizing AI-assisted diagnostic tools in dermatology while addressing regulatory issues.4
Methods
Researchers randomly selected 100 dermoscopic images (50 malignant melanomas, 50 benign nevi) from the International Skin Imaging Collaboration (ISIC) archive using a computer-generated randomization process.5 Each image was presented to Claude 3 Opus and ChatGPT with the same prompt, instructing the models to provide the top 3 differential diagnoses for each image ranked by likelihood. The exact prompt provided was “Please provide the top 3 differential diagnoses for this dermoscopic image, ranked by likelihood. Focus on distinguishing between melanoma and benign nevi.” The models’ responses were recorded, and researchers assessed primary diagnosis accuracy, accuracy of the top 3 differential diagnoses, and malignancy discrimination ability. The McNemar test was used to compare the models’ performance.
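The McNemar test used here compares two models evaluated on the same paired cases, focusing only on the discordant pairs (cases where exactly one model was correct). A minimal sketch of the chi-square version with continuity correction, using hypothetical discordant counts since the paper's per-case results are not given in this summary:

```python
from math import erfc, sqrt

def mcnemar(b: int, c: int) -> tuple[float, float]:
    """McNemar chi-square test with continuity correction for paired
    binary outcomes. b = cases only model A classified correctly,
    c = cases only model B classified correctly (discordant pairs)."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Chi-square with 1 df: survival function expressed via erfc,
    # since X ~ chi2(1) means X = Z**2 for standard normal Z.
    p = erfc(sqrt(chi2 / 2))
    return chi2, p

# Hypothetical discordant counts for illustration only:
chi2, p = mcnemar(15, 9)
print(f"chi2={chi2:.3f}, p={p:.3f}")
```

In practice a library routine such as `statsmodels.stats.contingency_tables.mcnemar` would be used, which also offers an exact binomial variant preferable when discordant counts are small.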
Results
For the primary diagnosis, researchers found that Claude 3 Opus achieved 54.9% sensitivity (95% CI 44.08% to 65.37%), 57.14% specificity (95% CI 46.31% to 67.46%), and 56% accuracy (95% CI 46.22% to 65.42%), while ChatGPT demonstrated 56.86% sensitivity (95% CI 45.99% to 67.21%), 38.78% specificity (95% CI 28.77% to 49.59%), and 48% accuracy (95% CI 38.37% to 57.75%). The McNemar test showed no significant difference between the 2 models (P=.17). For the top 3 differential diagnoses, the comparison reported that Claude 3 Opus and ChatGPT included the correct diagnosis in 76% (95% CI 66.33% to 83.77%) and 78% (95% CI 68.46% to 85.45%) of cases, respectively. The McNemar test showed no significant difference (P=.56). In malignancy discrimination, researchers stated that Claude 3 Opus outperformed ChatGPT with 47.06% sensitivity, 81.63% specificity, and 64% accuracy, compared with ChatGPT’s 45.1% sensitivity, 42.86% specificity, and 44% accuracy. The McNemar test showed a significant difference (P<.001). Claude 3 Opus had an odds ratio of 3.951 (95% CI 1.685 to 9.263) in discriminating malignancy, while ChatGPT had an odds ratio of 0.616 (95% CI 0.297 to 1.278).
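All of the reported metrics derive from a 2×2 confusion matrix with melanoma as the positive class. A minimal sketch of those computations, using illustrative cell counts reconstructed from the reported Claude 3 Opus malignancy-discrimination percentages (the paper's actual cell counts are not given in this summary):

```python
def diagnostic_metrics(tp: int, fn: int, fp: int, tn: int):
    """Sensitivity, specificity, accuracy, and diagnostic odds ratio
    from a 2x2 confusion matrix (malignant = positive class)."""
    sens = tp / (tp + fn)                     # true positive rate
    spec = tn / (tn + fp)                     # true negative rate
    acc = (tp + tn) / (tp + fn + fp + tn)     # overall agreement
    dor = (tp * tn) / (fn * fp)               # diagnostic odds ratio
    return sens, spec, acc, dor

# Illustrative counts only, chosen to reproduce ~47.06% sensitivity,
# ~81.63% specificity, 64% accuracy, and an odds ratio near 3.95:
sens, spec, acc, dor = diagnostic_metrics(tp=24, fn=27, fp=9, tn=40)
print(f"sens={sens:.2%} spec={spec:.2%} acc={acc:.0%} OR={dor:.3f}")
```

An odds ratio above 1 (with a confidence interval excluding 1) indicates discrimination better than chance, which is why Claude 3 Opus's 3.951 is meaningful while ChatGPT's 0.616 (CI spanning 1) is not.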
Discussion and Future Research
The comparison evaluated the effectiveness of LLMs in dermatological diagnosis and identified their limitations. Researchers found Claude 3 Opus outperformed ChatGPT in distinguishing between malignant and benign skin lesions, but both models had notable failure modes: Claude 3 Opus incorrectly labeled some melanomas as benign, while ChatGPT frequently misclassified nevi as melanomas. These results highlight the necessity for further development and thorough clinical validation of AI diagnostic tools before they are widely adopted in dermatology. Researchers suggested that future efforts should focus on enhancing model reliability and interpretability through collaboration among AI researchers, dermatologists, and other healthcare professionals. Additionally, they noted that addressing legal and ethical concerns is crucial, as the European Commission is already proposing regulations for high-risk AI applications. Overall, while LLMs like Claude 3 Opus and ChatGPT show promise, the researchers stated these models cannot yet replace human expertise and require ongoing research and regulatory attention to ensure their safe and effective integration into clinical practice.
References