Saturday, September 14, 2024

OpenEvidence AI and Retrieval Augmented Generation (RAG)

The medical AI company OpenEvidence first burst on the scene with the following announcement:
“OpenEvidence, a generative Artificial Intelligence (AI) company working on aligning Large Language Models (LLMs) to the medical domain, announced today that OpenEvidence AI has become the first AI in history to score above 90% on the United States Medical Licensing Examination (USMLE). Previously, AIs such as ChatGPT and Google's Med-PaLM 2 have reported scores of 59% and 86%, respectively.”
By claiming to be the first AI to surpass the 90% threshold on the USMLE, OpenEvidence generated some buzz. Subsequently, however, both ChatGPT and Google have surpassed 90% with their latest generative AI efforts, and OpenEvidence never documented its claim with a paper or technical report.

They have, however, published earlier work taking a different approach from their competitors, one that focuses less on large "foundation models" like ChatGPT and Google Gemini and instead builds smaller specialist models tailored to the medical domain:
“Earlier this year, The New England Journal of Medicine AI featured a paper titled "Do We Still Need Clinical Language Models?" published by OpenEvidence, in partnership with researchers from MIT and Harvard Medical School, that found that language models that have been specialized to deal with medical text outperform much larger general domain models trained on general text (such as GPT-3) when compared on the same medical domain-specific intelligence tasks. OpenEvidence's paper went on to win Best Paper at the 2023 Conference on Health, Inference, and Learning (CHIL), the preeminent community of computer scientists working in medical applications.”
In this vein, earlier this year, OpenEvidence released ClinicalKey AI in collaboration with the scientific publisher Elsevier. ClinicalKey AI uses generative AI, i.e. a large language model (LLM), to provide real-time access to the latest medical research and information. It serves as a chatbot for answering medical questions, allowing doctors to input symptoms, explore drug interactions, and access data from hundreds of medical journals and verified sources. What distinguishes ClinicalKey AI from ChatGPT or Gemini is that its answers are based on a very large corpus of medical research papers, including the most recent articles published in the literature (e.g. in Elsevier journals), supplemented with reference textbooks, clinical overviews, and drug information. The foundation chatbots instead rely on more general medical information scraped from a wide range of sources, absorbed into the memory (network weights) of the LLM during training and then regurgitated. As a result, ChatGPT, for example, may not have been trained on the latest medical research papers (there is a cut-off date for the training data).

Most likely, ClinicalKey AI takes advantage of a technique called Retrieval-Augmented Generation (RAG). RAG is a paradigm in natural language processing (NLP) that combines information retrieval (IR) with text generation by large language models like ChatGPT. Information retrieval involves finding relevant information in a large corpus (such as a collection of documents or web pages) based on a query. A large language model, having been pre-trained on a gargantuan amount of text, can generate coherent and contextually relevant text on its own from a given prompt. A RAG system first retrieves a set of relevant documents or passages from a large corpus using an IR system. Critically, the retrieved documents can lie outside the LLM's training data (e.g. new or more detailed information). The retrieved text is added to the prompt along with the original query and then fed into the LLM. As a result, the LLM can use this additional text as a source of information to enhance the quality and relevance of its generated responses.
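To make this concrete, here is a minimal sketch of the RAG pattern in Python. The toy corpus, the TF-IDF retriever, and the prompt template are all illustrative assumptions on my part; a production system like ClinicalKey AI presumably uses dense vector embeddings over millions of articles, and its actual pipeline has not been made public.

# A minimal sketch of retrieval-augmented generation (RAG).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for a database of medical abstracts (hypothetical).
corpus = [
    "Fish oil supplementation and incident stroke: a cohort study...",
    "Omega-3 fatty acids and risk of atrial fibrillation...",
    "Statin therapy for primary prevention of cardiovascular disease...",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Step 1 (IR): rank the corpus by similarity to the query."""
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(corpus)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    return [corpus[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(query: str, passages: list[str]) -> str:
    """Step 2: prepend the retrieved passages to the original question."""
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the numbered sources below, "
        "citing them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )

query = "Can fish oil supplements increase the risk of stroke and heart problems?"
prompt = build_prompt(query, retrieve(query))
# Step 3: feed `prompt` to the LLM, which can now ground its answer in the
# retrieved text rather than only in its training-time weights.
print(prompt)

The key point is the division of labor: the IR step supplies up-to-date passages, and the LLM only has to synthesize them into an answer.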

More specifically, by leveraging information from retrieved documents (which may not be in the training set), RAG systems can potentially generate responses that are more accurate and contextually appropriate than those from LLMs without RAG. A second benefit is reduced hallucinations. Hallucinations arise when an LLM makes up facts as it interpolates (or extrapolates) between different pieces of data in the training set, and does so incorrectly. Putting the most relevant data directly in the prompt allows the LLM to access accurate information without interpolation.
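To illustrate how this grounding works in practice, here is what the final generation step might look like. I am using the OpenAI chat API purely as a stand-in, since OpenEvidence has not disclosed which model or provider powers its system; the system message is what pushes the model to answer from the retrieved sources rather than interpolating from its weights.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# `prompt` is the augmented prompt built in the previous sketch.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        # Restricting the model to the provided sources is what curbs
        # interpolation-driven hallucinations.
        {
            "role": "system",
            "content": (
                "Answer only from the provided sources. If the sources do "
                "not address the question, say so rather than guessing."
            ),
        },
        {"role": "user", "content": prompt},
    ],
)
print(response.choices[0].message.content)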

One can observe this process in action by visiting the OpenEvidence website and asking a question (Figure 1). For example, the question "Can fish oil supplements increase the risk of stroke and heart problems?" elicits a response that is most likely produced by a RAG system similar to the one used in ClinicalKey AI. At the top is a short summary of about five paragraphs: an introduction, two or three paragraphs answering the question with supporting references, and a concluding paragraph. Below this summary is the reference list, with each entry containing either the abstract of the reference or a generated summary. There are typically 3 to 10 references.

Overall, I found these summaries to be a useful complement to the responses generated by foundation LLMs without RAG, such as ChatGPT or Google Gemini, because, thanks to the retrieval step, they tended to provide more relevant and recent references along with information drawn from those references.

Figure 1. Snapshot of the OpenEvidence website. You can enter your question into the search box, and three example questions and responses are shown below.
