Despite high accuracy scores, the study found AI models often produce overly complex explanations, making them less effective for patient education.
Artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT, is increasingly influencing medicine by rapidly providing complex information. These models are becoming popular sources for healthcare information, reflecting the public's interest in learning about diseases and treatment options.1 However, the accuracy and readability of the information provided by LLMs can vary significantly, particularly in specialized fields like dermatology and Mohs micrographic surgery (MMS).2-3
Studies have found patients often struggle to recall medical information communicated during consultations, making it crucial to provide educational resources that are both accurate and easily understandable.4 While studies have shown that LLMs can offer accurate information about MMS, the complexity of their responses tends to exceed patient comprehension levels. Research has also shown variability in the accuracy of different LLMs and compared their effectiveness to traditional sources like Google.5
A recent study aimed to evaluate the usefulness of LLMs as educational tools for MMS by analyzing feedback from a panel of 15 Mohs surgeons from various regions and comparing LLM responses with Google search results. Researchers hope the findings will provide a better understanding of how these AI tools can serve patient education in different settings.6
Methods
In November 2023, the study evaluated the quality of responses to common patient questions about MMS generated by OpenAI's ChatGPT 3.5, Google's Bard (now Gemini), and Google Search. The questions, sourced from Google's search engine and faculty experience, were analyzed using a standardized survey to assess 3 key factors: appropriateness, accuracy, and sufficiency.
The survey was completed by 15 MMS surgeons from various regions, and responses were evaluated for readability using the Flesch Reading Ease Score (FRES) and Flesch-Kincaid Grade Level.
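Both readability metrics are simple functions of average sentence length and average syllables per word. The sketch below is a minimal, self-contained illustration of the standard published formulas, not the study's tooling; its syllable counter is a rough vowel-group heuristic, so scores may differ slightly from dictionary-based implementations.

```python
# Minimal sketch of the Flesch Reading Ease Score (FRES) and
# Flesch-Kincaid Grade Level, using the published formulas.
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels (heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    wps = len(words) / sentences                               # words per sentence
    spw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw    # higher = easier to read
    fk_grade = 0.39 * wps + 11.8 * spw - 15.59   # approximate US school grade
    return fres, fk_grade

sample = ("Mohs surgery removes skin cancer in thin layers. "
          "Each layer is checked under a microscope before the next is taken.")
print(readability(sample))
```

On the FRES scale, 60-70 reads at roughly an 8th-9th grade level, while scores in the 50s, like the study average reported below, correspond to 10th-12th grade text.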
Results
In evaluating patient-facing responses about MMS, researchers found about 92% of all responses were deemed appropriate for use outside of clinical settings. Both ChatGPT and Google Bard/Gemini received high approval ratings for appropriateness, whereas the study found responses from Google Search were less frequently approved. The mean approval ratings for appropriateness were very similar between ChatGPT (13.25 out of 15) and Google Bard/Gemini (13.33 out of 15), with no significant difference between them (P = 0.237).
Regarding accuracy, the study found 75% of responses were rated as "mostly accurate" or better. ChatGPT achieved the highest average accuracy score (3.97), followed by Google Bard/Gemini (3.82) and Google Search (3.59), with no significant differences in accuracy between the platforms (P = 0.956).
As for sufficiency, the study found only 33% of responses were approved as suitable for clinical practice, while 31% were rejected for being too verbose and 22% for lacking important details. Researchers found Google Bard/Gemini had the highest sufficiency approval rating (8.7 out of 15), significantly better than ChatGPT and Google Search (P < 0.0001). ChatGPT and Google Search responses were most commonly rejected for needing to be more concise or specific.
The study reported low interrater agreement across all measures, with no category reaching more than a fair degree of agreement (no Fleiss' kappa exceeded 0.40). The highest agreement was observed for insufficiency ratings (Fleiss' kappa 0.121) and for responses from Google Search (Fleiss' kappa 0.145).
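For readers unfamiliar with the statistic, Fleiss' kappa measures how much a fixed panel of raters agrees beyond what chance alone would predict. The sketch below is a minimal illustration of the calculation, not the study's analysis code; the ratings matrix is hypothetical, with rows as questions and columns as approve/reject counts from 15 raters.

```python
# Minimal Fleiss' kappa sketch (hypothetical data, not the study's).
# counts[i][j] = number of raters who assigned item i to category j.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    n_items = counts.shape[0]
    n_raters = counts[0].sum()                       # raters per item (assumed constant)
    p_j = counts.sum(axis=0) / (n_items * n_raters)  # overall category proportions
    # Per-item agreement, then observed vs. chance-expected agreement.
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 4 questions, each rated approve/reject by 15 surgeons.
ratings = np.array([[12, 3], [9, 6], [14, 1], [8, 7]], dtype=float)
print(round(fleiss_kappa(ratings), 3))  # 0.00-0.20 = slight, 0.21-0.40 = fair agreement
```

Kappa values in the 0.12 to 0.15 range, like those reported here, fall in the "slight" band of the conventional Landis and Koch scale, underscoring how much the surgeons disagreed on what counts as a sufficient answer.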
In terms of comprehensibility, the study found FRES ranged from 32.4 to 73.8, with an average score of 51.2, suggesting a required reading level around the 10th grade. Google Bard/Gemini had the best average FRES score (60.6), followed by Google Search (52.2) and ChatGPT (40.9).
Conclusion
In this study, only about one-third of LLM responses were deemed sufficient for clinical use by surgeons, a lower figure than in previous studies. While LLM responses showed higher appropriateness than Google Search results, with Google Bard/Gemini slightly outperforming the others, the study found their comprehensibility often exceeded patient reading levels, indicating a need for simpler language. This complexity, along with some inaccuracies, suggests that while LLMs represent an improvement over traditional search engines, they still require refinement for clinical application. The study highlights the need for careful implementation of AI in healthcare, emphasizing the importance of validation, standardization, and collaboration with LLM developers to ensure reliable and patient-friendly outcomes.
References