Despite high accuracy scores, the study found AI models often produce overly complex explanations, making them less effective for patient education.
Artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT, is increasingly influencing medicine by rapidly providing complex information. These models are becoming popular sources for healthcare information, reflecting the public's interest in learning about diseases and treatment options.1 However, the accuracy and readability of the information provided by LLMs can vary significantly, particularly in specialized fields like dermatology and Mohs micrographic surgery (MMS).2-3
Studies have found patients often struggle to recall medical information communicated during consultations, making it crucial to provide educational resources that are both accurate and easily understandable.4 While studies have shown that LLMs can offer accurate information about MMS, the complexity of their responses tends to exceed patient comprehension levels. Research has also shown variability in the accuracy of different LLMs and compared their effectiveness to traditional sources like Google.5
A recent study aimed to evaluate the usefulness of LLMs as educational tools for MMS by analyzing feedback from a panel of 15 Mohs surgeons from various regions and comparing LLM responses with Google search results. Researchers hope the findings will provide a better understanding of how these AI tools can serve patient education in different settings.6
Methods
In November 2023, the study evaluated the quality of responses to common patient questions about MMS generated by OpenAI's ChatGPT 3.5, Google's Bard (now Gemini), and Google Search. The questions, sourced from Google's search engine and faculty experience, were analyzed using a standardized survey to assess 3 key factors: appropriateness, accuracy, and sufficiency.
The survey was completed by 15 MMS surgeons from various regions, and responses were evaluated for readability using the Flesch Reading Ease Score (FRES) and Flesch-Kincaid Grade Level.
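Both readability metrics are simple functions of average sentence length and average syllables per word. The sketch below is a minimal, self-contained illustration of the standard published formulas, not the study's tooling; its syllable counter is a rough vowel-group heuristic, so scores may differ slightly from dictionary-based implementations.

```python
# Minimal sketch of the Flesch Reading Ease Score (FRES) and
# Flesch-Kincaid Grade Level, using the published formulas.
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels (heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    wps = len(words) / sentences                               # words per sentence
    spw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw    # higher = easier to read
    fk_grade = 0.39 * wps + 11.8 * spw - 15.59   # approximate US school grade
    return fres, fk_grade

sample = ("Mohs surgery removes skin cancer in thin layers. "
          "Each layer is checked under a microscope before the next is taken.")
print(readability(sample))
```

On the FRES scale, 60-70 reads at roughly an 8th-9th grade level, while scores in the 50s, like the study average reported below, correspond to 10th-12th grade text.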
Results
In evaluating patient-facing responses about MMS, researchers found about 92% of all responses were deemed appropriate for use outside of clinical settings. Both ChatGPT and Google Bard/Gemini received high approval ratings for appropriateness, whereas the study found responses from Google Search were less frequently approved. The mean approval ratings for appropriateness were very similar between ChatGPT (13.25 out of 15) and Google Bard/Gemini (13.33 out of 15), with no significant difference between them (P = 0.237).
Regarding accuracy, the study found 75% of responses were rated as "mostly accurate" or better. ChatGPT achieved the highest average accuracy score (3.97), followed by Google Bard/Gemini (3.82) and Google Search (3.59), with no significant differences in accuracy between the platforms (P = 0.956).
As for sufficiency, the study found only 33% of responses were approved as suitable for clinical practice, while 31% were rejected for being too verbose and 22% for lacking important details. Researchers found Google Bard/Gemini had the highest sufficiency approval rating (8.7 out of 15), significantly better than ChatGPT and Google Search (P < 0.0001). ChatGPT and Google Search responses were most commonly rejected for needing to be more concise or specific.
The study reported low interrater agreement across all measures, with no category reaching more than a fair degree of agreement (no Fleiss' kappa exceeded 0.40). The highest agreement was observed for insufficiency ratings (Fleiss' kappa 0.121) and for responses from Google Search (Fleiss' kappa 0.145).
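For readers unfamiliar with the statistic, Fleiss' kappa measures how much a fixed panel of raters agrees beyond what chance alone would predict. The sketch below is a minimal illustration of the calculation, not the study's analysis code; the ratings matrix is hypothetical, with rows as questions and columns as approve/reject counts from 15 raters.

```python
# Minimal Fleiss' kappa sketch (hypothetical data, not the study's).
# counts[i][j] = number of raters who assigned item i to category j.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    n_items = counts.shape[0]
    n_raters = counts[0].sum()                       # raters per item (assumed constant)
    p_j = counts.sum(axis=0) / (n_items * n_raters)  # overall category proportions
    # Per-item agreement, then observed vs. chance-expected agreement.
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 4 questions, each rated approve/reject by 15 surgeons.
ratings = np.array([[12, 3], [9, 6], [14, 1], [8, 7]], dtype=float)
print(round(fleiss_kappa(ratings), 3))  # 0.00-0.20 = slight, 0.21-0.40 = fair agreement
```

Kappa values in the 0.12 to 0.15 range, like those reported here, fall in the "slight" band of the conventional Landis and Koch scale, underscoring how much the surgeons disagreed on what counts as a sufficient answer.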
In terms of comprehensibility, the study found FRES ranged from 32.4 to 73.8, with an average score of 51.2, suggesting a required reading level around the 10th grade. Google Bard/Gemini had the best average FRES score (60.6), followed by Google Search (52.2) and ChatGPT (40.9).
Conclusion
In this study, only about one-third of LLM responses were deemed sufficient for clinical use by surgeons, a lower figure than in previous studies. While LLM responses showed higher appropriateness than Google Search results, with Google Bard/Gemini slightly outperforming the others, the study found their comprehensibility often exceeded patient reading levels, indicating a need for simpler language. This complexity, along with some inaccuracies, suggests that while LLMs represent an improvement over traditional search engines, they still require refinement for clinical application. The study highlights the need for careful implementation of AI in healthcare, emphasizing the importance of validation, standardization, and collaboration with LLM developers to ensure reliable and patient-friendly outcomes.
References