
Sunday, May 14, 2023

Examining the claim that Google’s Med-PaLM 2 scored 85% on the USMLE, far surpassing previous results

In a previous blog post I described how a research group not affiliated with OpenAI (the developer of ChatGPT) examined the performance of ChatGPT on the standardized medical licensing exam, the USMLE. Strikingly, ChatGPT achieved ~50% accuracy with the questions delivered verbatim (not translated in any way by a human intermediary) and with a strict separation by date between the training and testing data (the training cutoff preceded the public release of the test questions). A passing score is 60%, so ChatGPT came close but did not clear the threshold.

Of note, ChatGPT is not specialized or fine-tuned for medical expertise. It was trained on bulk text from the web and other sources that included large amounts of material on medical topics but consisted mainly of non-medical content. The generalist nature of ChatGPT raises the question of how well a chatbot or large language model (LLM) would do on the exam if it had received more specialized training.

Google has invested time and resources into medical LLM research, and last year they built Med-PaLM, a version of their earlier language model PaLM tuned for the medical domain. In their announcement, Google claimed that Med-PaLM was the first LLM to obtain a “passing score” on U.S. medical licensing-style (USMLE) questions. It not only answered multiple-choice and open-ended questions accurately, but also provided rationales and evaluated its own responses.

Last month, Google updated its progress by reporting on the next iteration of Med-PaLM, Med-PaLM 2. They now assert that the LLM consistently performed at an “expert” doctor level on medical exam (USMLE) questions, scoring 85%, an improvement of roughly 18 percentage points over Med-PaLM's previous score that far surpasses similar AI models such as ChatGPT. Can we evaluate this astounding result? Not really, because Google provided so few details in their blurb. However, we can go back and look more closely at the Med-PaLM results and assume that Med-PaLM 2 was tested in a similar fashion.

In the original Med-PaLM paper, the authors first constructed MultiMedQA, an open-source medical question-answering (QA) benchmark that combines six existing medical datasets (MedQA, MedMCQA, PubMedQA, LiveQA, MedicationQA, and the MMLU clinical topics) with a seventh dataset, HealthSearchQA, which consists of free-response medical questions drawn from (presumably Google) online searches. The MedQA portion of the benchmark contains over 12,500 USMLE-style multiple-choice questions (Figure 1).
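To make the setup concrete, here is a minimal Python sketch of how a MedQA-style multiple-choice item might be represented and scored by exact match against the keyed answer. The field names and the placeholder "model" are my own illustrative assumptions, not the actual MultiMedQA schema or evaluation code.

```python
from dataclasses import dataclass

@dataclass
class MedQAItem:
    question: str             # USMLE-style clinical vignette (illustrative field names)
    options: dict[str, str]   # e.g. {"A": "Aspirin", "B": "Alteplase", ...}
    answer: str               # key of the correct option, e.g. "B"

def accuracy(items: list[MedQAItem], predict) -> float:
    """Fraction of items for which the predicted option key matches the keyed answer."""
    return sum(predict(item) == item.answer for item in items) / len(items)

# Toy example with a placeholder "model" that always guesses option "A".
items = [
    MedQAItem(
        question="A 65-year-old man presents with sudden-onset chest pain ...",
        options={"A": "Aspirin", "B": "Alteplase", "C": "Heparin", "D": "Warfarin"},
        answer="B",
    ),
]
print(f"accuracy = {accuracy(items, lambda item: 'A'):.1%}")  # 0.0% for this guesser
```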

Second, they developed an LLM fine-tuned for the medical domain, in contrast to ChatGPT, which is a general-knowledge LLM without specialized medical training. They started with PaLM, the Pathways Language Model, a very large 540-billion-parameter, decoder-only Transformer model trained with the Pathways system, which enables efficient distributed training of a modular architecture and makes a model of this size practical to train. They then employed a technique called instruction fine-tuning, in which the model receives further training on new tasks using natural language instructions and examples given in a text prompt. This approach was previously used on a class of models called Fine-tuned LAnguage Net (FLAN); here it was used to upgrade PaLM into Flan-PaLM, which benefits from the additional training signal provided by detailed instructions in the prompt.
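As a rough illustration of what instruction fine-tuning data looks like, the sketch below renders a multiple-choice question into an (instruction + input, target) text pair on which a model would then be fine-tuned with the usual language-modeling loss. The prompt template is a hypothetical stand-in, not the actual FLAN or Flan-PaLM template.

```python
# Illustrative only: turn one example into an (instruction + input, target) pair.
TEMPLATE = (
    "Answer the following USMLE-style question by choosing one option.\n"
    "Question: {question}\n"
    "Options: {options}\n"
    "Answer:"
)

def to_instruction_pair(example: dict) -> tuple[str, str]:
    """Format one example; a model is then fine-tuned to produce `target` given `prompt`."""
    prompt = TEMPLATE.format(
        question=example["question"],
        options=" ".join(f"({key}) {text}" for key, text in example["options"].items()),
    )
    target = f" ({example['answer']})"
    return prompt, target

prompt, target = to_instruction_pair({
    "question": "Deficiency of which vitamin causes scurvy?",
    "options": {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D"},
    "answer": "B",
})
print(prompt)
print(target)  # " (B)"
```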

Flan-PaLM achieved state-of-the-art (SOTA) performance on MultiMedQA, often outperforming several strong LLM baselines by a significant margin (Figure 2). On the MedQA dataset of USMLE-style questions, Flan-PaLM exceeded the previous SOTA by more than 17 percentage points. More specifically, the Flan-PaLM 540B model achieved an accuracy of 67.6% on the MedQA questions with 4 options and 62.0% on the questions with 5 options. Both results exceeded the 60% threshold considered a passing score. By comparison, ChatGPT scored ~50% with the number of choices ranging from 3 to 11 (QH).

But closer examination revealed that the reasoning behind the answers, even for correct responses, often fell short of that of a human expert. The researchers then turned to a technique called instruction prompt tuning to further align the model with the medical domain. Instruction prompt tuning (IPT) learns a small set of soft prompt vectors that are prepended to a hard prompt of natural language instructions and exemplars; only those soft prompt parameters are updated while the LLM's weights stay frozen (in effect, a parameter-efficient combination of the instruction fine-tuning described above and prompt tuning). The hard prompt can be anything that helps the LLM understand the task, such as a description of the task, a list of examples, or a set of rules. For example, a text passage might be prefixed with the instruction “Summarize the following text:” while the learned soft prompt steers the frozen model toward the desired style of response.
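The toy PyTorch sketch below captures the core idea under generic assumptions: a small block of soft prompt embeddings is the only trainable parameter, it is prepended to the embedded hard prompt, and gradients flow through the frozen network back to the soft prompt. The model, dimensions, and training step are stand-ins, not Med-PaLM's actual code.

```python
import torch
import torch.nn as nn

d_model, n_soft_tokens, vocab = 512, 20, 32000

# Frozen "LLM": embedding table plus a toy stand-in for the transformer body.
embed = nn.Embedding(vocab, d_model)
body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
)
lm_head = nn.Linear(d_model, vocab)
for module in (embed, body, lm_head):
    for p in module.parameters():
        p.requires_grad = False

# The only trainable parameters: the soft prompt vectors.
soft_prompt = nn.Parameter(torch.randn(n_soft_tokens, d_model) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def forward(hard_prompt_ids: torch.Tensor) -> torch.Tensor:
    """Prepend the soft prompt to the embedded hard prompt and run the frozen model."""
    hard = embed(hard_prompt_ids)                                # (batch, seq, d_model)
    soft = soft_prompt.unsqueeze(0).expand(hard.size(0), -1, -1)
    hidden = body(torch.cat([soft, hard], dim=1))
    return lm_head(hidden)                                       # logits over the vocabulary

# One illustrative training step on a dummy batch of token ids.
ids = torch.randint(0, vocab, (2, 16))
targets = torch.randint(0, vocab, (2, 16))
logits = forward(ids)[:, n_soft_tokens:]                         # drop soft-prompt positions
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()                                                  # gradients reach only soft_prompt
optimizer.step()
print(float(loss))
```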

The Google researchers then used instruction prompt tuning to refine Flan-PaLM into a new model named Med-PaLM, which showed a dramatic improvement in medical reasoning: “a panel of clinicians judged only 61.9% of Flan-PaLM long-form answers to be aligned with scientific consensus, compared to 92.6% for Med-PaLM answers.” The superior reasoning did not necessarily translate into better results on the USMLE questions, but it increased confidence in the model's output, which showed better agreement between the explanation and the answer.

Med-PaLM 2 is the next version of Google's medical LLM, and as mentioned above in the publicity blurb, Google claims that Med-PaLM 2 achieved an accuracy of 85% on USMLE-style questions (Figure 2), roughly 18 percentage points better than Med-PaLM's 67% score. Unfortunately, we cannot fully assess this result without more details. In particular, it would be of interest to know what has changed from its predecessor.

Overall, the Med-PaLM results appear to be solid, with thorough documentation of the large benchmark, a standard split into training and testing datasets, some important innovations in model development, and careful evaluation. As a result, there is no reason to suspect that the Med-PaLM 2 results are any less believable. The 85% accuracy on USMLE questions demonstrates that specialized LLMs can significantly outperform a general-knowledge chatbot like ChatGPT.

In the meantime, we await a publication describing Med-PaLM 2 with further details on the dataset, the training-testing procedure, model development, and analysis. At this rate of improvement, one can expect a specialized LLM to score in the 90-100% range on the USMLE within the next few years.

Figure 1. Example of a U.S. Medical Licensing Examination (USMLE) style multiple-choice question with 5 possible answers from the National Medical Board Examination. Approximately 12,500 such questions make up the MedQA dataset used to train and test Med-PaLM and Med-PaLM 2.

Figure 2. Med-PaLM 2 attained 85.4% accuracy on the USMLE-style questions of the MedQA dataset, far exceeding previous LLM results.
