Author: Cami Rosso writes about science, technology, innovation, and leadership.
An emerging trend at the intersection of artificial intelligence (AI) and healthcare is enhancing the capabilities of standard large language models (LLMs) with higher-quality training datasets and customized coding. A recent study published in JAMA Network Open showed how an enhanced LLM passed all parts of the US Medical Licensing Examination (USMLE) with scores that outperform most physicians and all other existing AI tools.
There is a growing trend of evaluating LLMs for use in healthcare. According to one survey by Statista conducted in 2024, 18% of respondents working in healthcare used LLMs for biomedical research and at least one-fifth used medical chatbots and LLMs to answer questions from patients.
“Improving LLM performance in health care with targeted, up-to-date clinical knowledge is an important step in LLM implementation and acceptance,” wrote lead author Peter L. Elkin, MD, along with co-authors Guresh Mehta, Frank LeHouillier, Melissa Resnick, Sarah Mullin, Crystal Tomlin, Skyler Resendez, Jiaxing Liu, Jonathan Nebeker, and Steven Brown.
AI is rapidly disrupting healthcare. When OpenAI released its chatbot ChatGPT to the public on November 30, 2022, the proverbial genie was let out of the bottle, and there was no going back. ChatGPT broke records, garnering one million users in just a week, and within two months it had 100 million users, according to figures from UBS. Examples of LLMs include Meta’s Llama, Google’s BERT, Bard, Gemini, and LaMDA, Google AI’s PaLM 2 (Bison-001), Anthropic’s Claude, Technology Innovation Institute (TII)’s Falcon, Cohere, Microsoft’s Orca, Guanaco, Vicuna by LMSYS, MPT-30B, 30B Lazarus by CalderaAI, Flan-T5 by former Google researchers, WizardLM, Stanford University’s Alpaca 7B, and many others.
Many hospitals worldwide are already using AI or evaluating the use of machine learning. According to the Future Health Index 2024 global report, which surveyed nearly 3,000 healthcare leaders from 14 countries, respondents report that AI has already been implemented by roughly 43% for in-hospital patient monitoring, 37% for medication management, 37% for treatment planning, 36% for radiology, 36% for preventative care, 35% for pathology, 33% for remote patient monitoring, and 32% for clinical command centers.
For example, the Mayo Clinic has over 200 AI projects, including the creation of models for early detection of anxiety, depression, neuromuscular disease, breast cancer, pancreatic cancer, and cardiovascular disease. Johns Hopkins University & Medicine is testing the use of AI as a clinical tool for summarizing patient medical charts, clinical documentation, drafting responses to patient messages, and categorizing and assigning incoming messages. Doctors at Mass General Brigham are currently evaluating the feasibility of using AI for predicting the malignancy risk of oral lesions, and ChatGPT for recommending imaging services for patients with breast pain and breast cancer and answering colonoscopy patient questions. Mass General Brigham physician Vesela Kovacheva, M.D., Ph.D. is developing a machine learning algorithm to automate the administration of anesthesia to pregnant mothers who are about to have Cesarean sections (C-sections). The University of California San Diego Health recently published a study in NEJM AI, a New England Journal of Medicine Group journal, showing how LLMs can use input data from Fast Healthcare Interoperability Resources to output Severe Sepsis and Septic Shock Management Bundle (SEP-1) abstraction with high accuracy.
Machine learning, a branch of artificial intelligence, “learns” what to look for when making predictions by identifying patterns in massive training datasets rather than relying on hard-coded, explicit programming instructions. Large language models are a type of machine learning program.
LLMs are artificial neural networks: deep learning models with attention mechanisms, pre-trained on massive datasets to “learn” enough to predict the next word or token in a sequence given the preceding text.
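That next-token objective can be sketched in a few lines of Python. The toy vocabulary and scores below are invented for illustration; a real LLM computes such scores from billions of learned parameters, but the final step of converting scores to probabilities and picking the most likely token is the same idea.

```python
import math

# Toy vocabulary with hand-set scores (logits) standing in for a trained
# network's output for the prompt "The organ that pumps blood is the ..."
# These values are illustrative, not from any real model.
logits = {"heart": 3.1, "lung": 1.4, "kidney": 0.9, "banana": -2.0}

def softmax(scores):
    """Turn raw scores into a probability distribution that sums to 1."""
    exps = {word: math.exp(s) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

def predict_next_token(scores):
    """Greedy decoding: return the single most probable next token."""
    probs = softmax(scores)
    return max(probs, key=probs.get)

print(predict_next_token(logits))  # -> heart
```

In practice, models often sample from the probability distribution rather than always taking the top token, which is what makes generated text varied rather than deterministic.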
Standard large language models may seem intelligent and capable of complex reasoning; in reality, they largely rely on pattern recognition to make predictions, which is analogous to being a good guesser.
While “large language models (LLMs) are being implemented in health care,” wrote the researchers, “enhanced accuracy and methods to maintain accuracy over time are needed to maximize LLM benefits.”
There are currently a number of diverse ways to improve the performance of AI models. Examples include increasing the size and quality of the training dataset, adding more parameters to the AI model itself, and boosting the computational power during training. For LLMs in particular, performance improvements may be achieved with tuning, prompt distillation, and prompt engineering techniques.
For this new study, the researchers decided to augment native LLMs with pertinent clinical knowledge using retrieval-augmented generation (RAG), a natural language processing (NLP) method that optimizes LLM output by referencing a knowledge base outside of its training data prior to generating a response.
There are many advantages to using retrieval-augmented generation to improve LLM performance. RAG offers more flexibility to modify and manage the input data sources for the LLMs. Using RAG, an LLM can obtain the most up-to-date information from live data sources such as regularly updated data repositories, news sites, journals, publications, research, and social media feeds. RAG can also provide citations, references, and source attribution, increasing model transparency and making the AI more explainable.
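As a rough illustration of the RAG pattern, retrieved text is simply prepended to the prompt before the LLM generates its answer. The clinical snippets and keyword-overlap retriever below are hypothetical stand-ins; production systems use vector embeddings and curated knowledge sources rather than simple word matching.

```python
# Hypothetical knowledge base; in the study this role is played by a
# curated, up-to-date clinical knowledge source.
knowledge_base = [
    "Metformin is a first-line therapy for type 2 diabetes.",
    "Amoxicillin is commonly used for bacterial ear infections.",
    "Statins lower LDL cholesterol and cardiovascular risk.",
]

def retrieve(question, docs, top_k=1):
    """Rank documents by word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, docs):
    """Prepend retrieved context so the LLM answers from it, not memory."""
    context = "\n".join(retrieve(question, docs))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt(
    "What is a first-line therapy for type 2 diabetes?", knowledge_base
)
print(prompt)
```

Because the context travels with the prompt, the model can also be asked to cite which retrieved passage supports its answer, which is the basis for RAG's transparency benefits.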
The researchers also used a data structure called semantic triples, which groups data by subject, relation, and object, to give the LLM richer context.
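A minimal sketch of how semantic triples might be verbalized into text an LLM can consume as context (the drug facts below are illustrative examples, not from the study):

```python
# Hypothetical clinical facts as (subject, relation, object) triples.
triples = [
    ("metoprolol", "is_a", "beta blocker"),
    ("beta blocker", "treats", "hypertension"),
    ("metoprolol", "has_side_effect", "bradycardia"),
]

def triples_to_context(triples):
    """Render each triple as a plain sentence for the LLM's context window."""
    return [f"{s} {r.replace('_', ' ')} {o}." for s, r, o in triples]

for sentence in triples_to_context(triples):
    print(sentence)
```

The appeal of the triple format is that facts stay machine-readable (easy to update, query, and deduplicate) while remaining trivially convertible to the natural-language context a language model expects.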
The team named their technique SCAI (pronounced sky), which is short for Semantic Clinical Artificial Intelligence.
“Our hypothesis was that augmenting native LLMs with relevant clinical knowledge as semantic triples would improve accuracy and decrease confabulation,” the scientists wrote.
The team tested their SCAI RAG–enhanced LLMs with Meta’s Llama 2 13B parameter model, Llama 3 70B, and Llama 3.1 405B models on text questions from US Medical Licensing Examination Steps 1, 2, and 3.
“In this comparative effectiveness research study, we found that semantic augmentation using SCAI RAG was associated with significantly improved scores on USMLE Steps 1, 2, and 3,” concluded the scientists.
With this proof-of-concept, the next steps would be to expand the SCAI RAG method to more LLMs to determine the generalizability of the technique.
With these study results, the researchers emphasize their belief that AI will serve as an assistive tool rather than replace human clinicians, though clinicians who use AI may replace those who do not. In the not-so-distant future, collaboration between human physicians and AI may become the norm rather than the exception.
Copyright © 2025 Cami Rosso All rights reserved.