However, not everyone was impressed with Babylon's performance. In particular, a blog post by Enrico Coiera delivered a scathing critique of the company's claims and testing process. Here I list some of the criticisms raised by the author:
- Rather than entering the questions directly, a human operator interpreted each question for the Babylon program, potentially adding extra information.
- Certain scenarios deemed outside of Babylon's expertise were selectively excluded.
- The comparison with the human doctors was skewed because one of the human doctors performed poorly and dragged the average human performance down.
- The training/testing protocol was not specified, leaving open the possibility that the system was trained and tested on similar, if not identical, examples.
- Finally, the work from Babylon was not peer-reviewed and does not contain the necessary statistical analysis to draw firm conclusions.
Fast forward to today and we have witnessed breathtaking advances in Deep Learning, from its infancy approximately 10 years ago to the astounding breakthroughs being made by large language models (LLMs) today. In particular, the chatbot ChatGPT developed by the company OpenAI has taken the world by storm, impressing many with its ability to converse fluidly, answer questions based on seemingly unlimited knowledge from sources like Wikipedia, and generate high-quality text and code according to a user-specified text prompt. Although some have likened LLMs to Xerox machines that memorize large volumes of text, ChatGPT and other LLMs have demonstrated the ability to reason at a basic level, interpolating or even extrapolating from an immensely large text dataset.
One open question is whether ChatGPT and its brethren can answer questions and reason in more specialized domains that distinguish professional disciplines such as medicine. A research group from the company AnsibleHealth wondered too, and soon after ChatGPT appeared, they evaluated the program on a well-known set of medical exams required for doctors to practice medicine (PLOS Digital Health):
“To accomplish this, we evaluated the performance of ChatGPT, a language-based AI, on the United States Medical Licensing Exam (USMLE). The USMLE is a set of three standardized tests of expert-level knowledge, which are required for medical licensure in the United States. We found that ChatGPT performed at or near the passing threshold of 60% accuracy. Being the first to achieve this benchmark, this marks a notable milestone in AI maturation. Impressively, ChatGPT was able to achieve this result without specialized input from human trainers. Furthermore, ChatGPT displayed comprehensible reasoning and valid clinical insights, lending increased confidence to trust and explainability.”
The USMLE is a standardized testing program consisting of three steps that cover various topics in a physician's knowledge, including basic science, clinical reasoning, medical management, and bioethics. The questions are highly regulated and standardized, making it suitable for AI testing. Step 1 focuses on basic science, pharmacology, and pathophysiology, and is taken by medical students who have completed two years of didactic and problem-based learning. Step 2CK emphasizes clinical reasoning, medical management, and bioethics, and is taken by fourth-year medical students who have completed clinical rotations. Step 3 is taken by physicians who have completed at least 0.5 to 1 year of postgraduate medical education. The examination has shown stable scores and psychometric properties over the past ten years.
The test questions were obtained from the official USMLE website, which released 376 publicly available test questions from the June 2022 sample exam, known as USMLE-2022. Notably, the release date was after the training data cutoff for ChatGPT, excluding the possibility that the program was trained on the test questions. After removing questions with images and graphs, 350 USMLE questions (Step 1: 119, Step 2CK: 102, Step 3: 122) comprised the final test dataset.
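To illustrate what that screening step might look like in code, here is a minimal Python sketch; the file name and column names (`has_image_or_graph`, `step`) are hypothetical placeholders, not anything released with the paper.

```python
# Hypothetical sketch of the dataset-screening step described above.
# The CSV file and its field names are illustrative assumptions.
import csv

STEPS = ("Step 1", "Step 2CK", "Step 3")

def load_usmle_questions(path):
    """Load the sample-exam questions and drop any item that relies on
    images or graphs, since only text can be passed to the chatbot."""
    kept = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["has_image_or_graph"].lower() == "true":
                continue  # skip questions that need visual material
            kept.append(row)
    return kept

questions = load_usmle_questions("usmle_2022_sample.csv")
for step in STEPS:
    n = sum(q["step"] == step for q in questions)
    print(f"{step}: {n} text-only questions")
print(f"Total: {len(questions)}")  # 350 text-only questions in the study
```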
Unlike with the Babylon program, the questions were administered to ChatGPT more or less verbatim in three different formats:
- Multiple choice single answer without forced justification (MC-NJ) prompting: This type of prompting reproduces the original USMLE question verbatim.
- Multiple choice single answer with forced justification (MC-J) prompting: This type of prompting adds a variable lead-in imperative or interrogative phrase mandating ChatGPT to provide a rationale for each answer choice.
- Open-ended (OE) prompting: This type of prompting removes all answer choices and adds a variable lead-in interrogative phrase. This format simulates free input and a natural user query pattern.
The prompt refers to the text given to the chatbot as input. The latter two formats (MC-J and OE) were designed to provide insight into the reasoning behind the answers.
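To make the three formats concrete, here is a minimal sketch of how such prompts could be assembled from a question stem and its answer choices. The lead-in wording, the example stem, and the function names are my own illustrative assumptions, not the exact phrasing used in the study.

```python
# Sketch of the three prompt formats (MC-NJ, MC-J, OE); wording is illustrative.

def mc_nj(stem, choices):
    """Multiple choice, no forced justification: the question reproduced verbatim."""
    options = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    return f"{stem}\n{options}"

def mc_j(stem, choices):
    """Multiple choice with a lead-in requiring a rationale for each answer choice."""
    return mc_nj(stem, choices) + "\nExplain why each answer choice is correct or incorrect."

def oe(stem):
    """Open-ended: all answer choices removed, replaced by a direct question."""
    return f"{stem}\nWhat is the most appropriate answer, and why?"

stem = "A 65-year-old man presents with ..."  # truncated example stem
choices = {"A": "Aspirin", "B": "Heparin", "C": "Warfarin"}
print(mc_nj(stem, choices))
print(mc_j(stem, choices))
print(oe(stem))
```

Since the study used variable lead-in phrases for the MC-J and OE formats, the fixed strings above should be read as placeholders for that varied wording.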
For some answers, ChatGPT either did not provide any answer (i.e. it stated that not enough information was available) or its answer was not one of the choices. The researchers deemed such a response indeterminate rather than incorrect. With indeterminate responses excluded/included, ChatGPT's accuracy for USMLE Steps 1, 2CK, and 3 was 55.8%/36.1%, 59.1%/56.9%, and 61.3%/55.7%, respectively (Figure 1). The most rigorous evaluation is performance on all questions, with indeterminate responses counted as incorrect. Under this criterion ChatGPT scored 36.1%, 56.9%, and 55.7% correct over the three exams, which averages to an accuracy of ~50%; with 3 to 11 answer options per question, this performance was significantly better than random guessing.
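Here is a quick back-of-the-envelope check of that average and the chance baseline, using the per-step accuracies quoted above (indeterminate responses counted as incorrect):

```python
# Simple arithmetic behind the headline numbers, using the stricter
# per-step accuracies reported in the paper.
step_accuracy = {"Step 1": 0.361, "Step 2CK": 0.569, "Step 3": 0.557}

average = sum(step_accuracy.values()) / len(step_accuracy)
print(f"Average accuracy across the three exams: {average:.1%}")  # ~49.6%

# Each question offered between 3 and 11 answer options, so random
# guessing would score somewhere between 1/11 and 1/3.
chance_low, chance_high = 1 / 11, 1 / 3
print(f"Chance baseline: {chance_low:.1%} to {chance_high:.1%}")
```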
When prompted to give a justification for the multiple choice answer (MC-J), the average score was again about 50%. Finally, performance on the open-ended (OE) format was the best, with an average score across the three tests of ~54%. This suggests that on some questions ChatGPT may have the right reasoning yet fail to select the correct multiple choice answer.
The authors concluded that they "found that ChatGPT performed at or near the passing threshold of 60% accuracy." But this assessment is perhaps optimistic, given that the rigorous evaluation on all questions resulted in a ~50% mark, far from the 60% passing grade. In ChatGPT's favor, it achieved this result without any specialized training or fine-tuning on medical datasets (i.e. cramming for the exam). Furthermore, ChatGPT displayed comprehensible reasoning and valid clinical insights when asked to justify its answers.
Importantly, compared to the evaluation of the Babylon AI on a different medical standardized exam, the questions to ChatGPT were entered directly (Critique 1), there was no selective exclusion of questions (Critique 2), direct overlap between training and test examples was unlikely because the test questions were released after the model had been trained (Critique 4), and the work was published in a peer-reviewed article by authors who are not part of the company that developed ChatGPT (Critique 5).
Of course the field of medical chatbots and LLMs is still in its infancy. It remains to be seen how the next-generation ChatGPT, powered by the GPT-4 large language model, performs on the USMLE. Of even greater interest is the accuracy of LLMs that receive special training on medical topics and are even fine-tuned to maximize performance on board exams like the USMLE. That will be the subject of a future post.
Figure 1. "Accuracy of ChatGPT on USMLE. For USMLE Steps 1, 2CK, and 3, AI outputs were adjudicated to be accurate, inaccurate, or indeterminate. [...] Accuracy distribution for inputs encoded as multiple choice single answer without (MC-NJ) or with forced justification (MC-J)" (from Fig. 2 of Kung et al. PLOS Digital Health, 2023).