Saturday, June 1, 2024

Med-Gemini scores over 90% on USMLE

The United States Medical Licensing Examination (USMLE) is a standardized test consisting of three steps (parts) that cover the breadth of a physician's knowledge, including basic science, clinical reasoning, medical management, and bioethics. Step 1 focuses on basic science, pharmacology, and pathophysiology, and is taken by medical students who have completed two years of didactic and problem-based learning. Step 2 CK emphasizes clinical reasoning, medical management, and bioethics, and is taken by fourth-year medical students who have completed their clinical rotations. Step 3 is taken by physicians who have completed at least half a year to one year of postgraduate medical education.

Google has invested time and resources into developing large language models (LLMs, the technology behind chatbots) specialized for the medical domain. One task it has trained these LLMs on is answering USMLE questions. In 2022, Google introduced Med-PaLM, a version of its earlier PaLM language model tuned for medical tasks. In the announcement, Google claimed that Med-PaLM was the first LLM to obtain a “passing score” (67.6%) on USMLE-style questions; the passing threshold on each of the three steps of the exam is roughly 60% correct.

Last year, Google updated its progress by reporting on the next iteration, Med-PaLM 2. In particular, the researchers used a technique called instruction prompt tuning to further align the model with the medical domain. They asserted that the LLM consistently performed at an “expert” doctor level on USMLE-style questions, scoring 85%, an 18-percentage-point improvement over Med-PaLM's previous score that far surpasses similar AI models such as ChatGPT.

The two Med-PaLM versions were built on the older-generation LLM that powered Bard, Google's first-generation conversational AI chatbot. This year Bard was replaced by Google Gemini, the second-generation chatbot. The Gemini models are multimodal: they excel at processing and understanding various types of information, including text, code, images, and video. They have demonstrated performance that surpasses previous state-of-the-art results on widely used academic benchmarks. Remarkably, the Gemini Ultra model even outperforms human experts on MMLU, a rigorous test of language understanding, world knowledge, and problem-solving skills.

Last month Google introduced Med-Gemini, a family of advanced multimodal models built on Gemini and designed specifically for the medical field. These models integrate web search capabilities and can be adapted to new modalities through the use of custom encoders. Google researchers assessed Med-Gemini across 14 medical benchmarks, setting new state-of-the-art records on 10 of them and, according to the authors, consistently outperforming the GPT-4 model family on every applicable test, frequently by a substantial margin (link).

Of particular interest was the performance on the USMLE exam. On this benchmark, Med-Gemini demonstrated state-of-the-art performance with 91.1% accuracy (Figure 1). This represents a substantial improvement over the previous Med-PaLM 2 model (a 4.5% increase) and over the MedPrompt technique used with GPT-4 (a 0.9% increase). Unlike MedPrompt, Med-Gemini uses web search in a guided way, making it adaptable to complex medical scenarios beyond answering multiple-choice questions.

Med-Gemini took advantage of two new innovations, one during training and one during inference (prediction). The first was self-training with search: web search was used to generate synthetic examples of clinical reasoning, which were then used to train the model on its own outputs. This method involved the following steps:
  • Web Search: For each medical question, the model generates search queries to retrieve relevant information from a web search API.
  • In-context Demonstrations: For each type of reasoning path, either with or without search results, five expert demonstrations are curated. These demonstrations are detailed and explain why the selected answer is the best among the potential answers.
  • Generating Chains of Thought (CoTs): The model uses these demonstrations as seeds to generate CoTs over a training dataset. Before these CoTs are used for further training, they are screened to eliminate any that may lead to incorrect predictions.
  • Fine-tuning Loop: The model is then fine-tuned on these generated CoTs. This process of generating CoTs and fine-tuning is repeated iteratively until no further improvement is observed.
This approach allowed the model to progressively refine its clinical reasoning by combining its internal reasoning capabilities with external, search-based information.
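
To make the loop concrete, below is a minimal Python sketch of the procedure described above. Every name in it (web_search, generate_cot, fine_tune, the question format) is a hypothetical placeholder rather than actual Med-Gemini code or a Google API; it only illustrates the generate-filter-fine-tune cycle.

import random

def web_search(query):
    """Placeholder: return text snippets for a search query."""
    return ["snippet about " + query]

def generate_cot(model, question, demonstrations, search_results=None):
    """Placeholder: ask the model for a chain of thought (CoT) and a final
    answer, seeded with expert demonstrations and, optionally, search results."""
    return {"reasoning": "...", "answer": random.choice(question["options"])}

def fine_tune(model, cots):
    """Placeholder: fine-tune the model on the curated CoTs."""
    return model

def self_train_with_search(model, train_set, demonstrations, max_rounds=5):
    best_accuracy = 0.0
    for _ in range(max_rounds):
        curated = []
        for q in train_set:
            results = web_search(q["question"])            # retrieve external evidence
            for context in (None, results):                # CoTs without and with search
                cot = generate_cot(model, q, demonstrations, context)
                if cot["answer"] == q["label"]:            # drop CoTs that lead to wrong answers
                    curated.append(cot)
        model = fine_tune(model, curated)                  # fine-tune on the surviving CoTs
        accuracy = sum(
            generate_cot(model, q, demonstrations)["answer"] == q["label"]
            for q in train_set
        ) / len(train_set)
        if accuracy <= best_accuracy:                      # stop when no further improvement
            break
        best_accuracy = accuracy
    return model

With a real model and search backend plugged into the placeholders, this is the generate, filter, fine-tune, repeat cycle the authors describe; the stubs here just make the control flow visible.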

The second key innovation was uncertainty-guided search at inference, which involved the following steps:
  • Multiple Paths: The model initially generates several possible reasoning paths (answers) to the medical question.
  • Uncertainty Check: The model estimates its uncertainty using the entropy of the answers across the reasoning paths. If the uncertainty is high, the search process is triggered.
  • Search Query Generation: If unsure, the model is instructed to create search queries specifically designed to resolve the areas of uncertainty it has.
  • Web Search & Refinement: The queries are sent to a web search engine. The search results are then fed back into the model's input, allowing it to reconsider its answer in a new iteration.
So the model employs web search both during training and when answering questions. One can argue that using web search during an exam is a form of cheating, but it is also possible to consider the search engine as part of the AI.
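
The following Python sketch shows how the inference-time steps fit together. The helper names (sample_reasoning_paths, generate_search_query, web_search) and the entropy threshold are hypothetical placeholders rather than Med-Gemini's actual implementation; the sketch only captures the pattern of sampling several answers, measuring their disagreement with entropy, and searching only when the model is unsure.

import math
import random
from collections import Counter

def sample_reasoning_paths(model, question, context, n):
    """Placeholder: sample n independent reasoning paths from the model."""
    return [{"answer": random.choice(question["options"])} for _ in range(n)]

def generate_search_query(model, question, paths):
    """Placeholder: ask the model for a query targeting its uncertainty."""
    return question["question"]

def web_search(query):
    """Placeholder: return text snippets from a web search API."""
    return ["snippet about " + query]

def answer_entropy(answers):
    """Shannon entropy of the answer distribution across reasoning paths."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def uncertainty_guided_answer(model, question, n_paths=5,
                              entropy_threshold=1.0, max_rounds=3):
    context = ""
    answers = []
    for _ in range(max_rounds):
        # 1. Sample several reasoning paths and collect their final answers.
        paths = sample_reasoning_paths(model, question, context, n_paths)
        answers = [p["answer"] for p in paths]

        # 2. If the paths mostly agree (low entropy), stop and take the majority.
        if answer_entropy(answers) < entropy_threshold:
            break

        # 3. Otherwise generate a query aimed at the uncertainty, run a web
        #    search, and feed the results back into the prompt for another round.
        query = generate_search_query(model, question, paths)
        context += "\n".join(web_search(query))

    return Counter(answers).most_common(1)[0][0]

Even if the uncertainty never drops below the threshold, the loop still returns the majority answer after a fixed number of search-augmented retries.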

In summary, the authors of the Med-Gemini technical report conclude that the model sets a new state of the art on USMLE, achieved through self-training-based fine-tuning and the incorporation of web search. They also comment on the limitations of using USMLE as a measure of medical understanding: they found that about 4% of the questions are incomplete, and another 3% may suffer from labeling errors. Indeed, a small fraction of Med-Gemini's roughly 10% error rate was due to faulty questions.
Figure 1. The steady climb towards a perfect score on USMLE by medical LLMs. With an accuracy of 91.1%, Med-Gemini is close to the summit. 
