ChatGPT Can Pass Clinical Exams Like a Fellow, but With Some Caveats

While the AI did fine on multiple-choice questions and could justify its answers, it failed a retest just 2 weeks later.

As artificial intelligence (AI) continues to spread across the medical field, researchers are learning that, at least for now, it has shortcomings that make it no match for humans. A new study shows that while chatbots may be capable of taking certification tests with enough accuracy to pass, their test-taking skills can also be hit-or-miss.

“My personal take-home is that it's encouraging, but I don't think we can trust this quite yet,” said Emmanouil S. Brilakis, MD, PhD (Minneapolis Heart Institute, MN), senior author of a study that pitted ChatGPT 4.0 (OpenAI) against interventional cardiology fellows to see if it could pass a simulation test for the American College of Cardiology (ACC)/American Board of Internal Medicine (ABIM) Collaborative Maintenance Pathway (CMP).

Since its release in late 2022, the large language model, which produces human-like text responses, has raised concerns in the academic medical community, prompting some journal editors, such as those of the JACC family of journals and JAMA, to issue policies on how AI-based tools may be used by researchers in their scientific publications. Recently, a small study showed that it was fairly proficient at responding to simple CVD prevention questions.

Following the publication last year of a survey of cardiologists that suggested most support the potential of AI-enabled tools to improve quality and efficiency of care, Brilakis said his team decided to test ChatGPT 4.0’s clinical test-taking abilities.

In a research letter published in JACC: Cardiovascular Interventions, they describe how ChatGPT passed a 60-question multiple-choice version of the exam, scoring 76.7% versus an average of 82.2% for fellows who took the same test. In addition to selecting an answer, the AI was required to explain why the answers it didn’t choose were incorrect.

When the researchers retested ChatGPT 2 weeks later, however, its score dropped to 65%. Oddly, while it got three answers right that it had previously gotten wrong, it answered incorrectly on 10 others that it had gotten right the first time.

To TCTMD, Brilakis said that finding is both problematic and unexpected.

“AI is a black box in some ways, because what exactly is going on? My thought would be that it would have improved [on the retest], as humans do,” he added. “I don’t know if it’s the algorithm or what, but this is a major concern.”

Videos? Forget About It

Brilakis and colleagues, led by Michaella Alexandrou, MD (Minneapolis Heart Institute), also gave ChatGPT a multiple-choice-only version of the exam, with no explanations required; on that version it scored just 61.7%.

Despite being able to explain its answers on the version of the exam that it passed, ChatGPT was unable to answer most questions that required viewing a video, responding that it lacked the ability to watch one. When some of those questions were reformatted as multiple choice only, it got all but one correct. It did well overall on image-based questions, answering five of six correctly.

According to Brilakis, ChatGPT’s underperformance relative to the fellows and its poor test-retest reliability suggest it may help researchers or clinicians phrase certain questions to get the best answer, but it may not be a reliable tool for clinical decision-making.

“You can ask a very specific question that will be hard to phrase in a Google search, for example. Most of the time, [it] is going to give you appropriate answers,” he added. “So, it’s better than [the search engine] for getting answers to complex questions and the speed is going to be faster as well. But we also know that whatever [ChatGPT] tells you, it has to be confirmed, sources have to be confirmed, so accuracy is still an issue.”

  • Alexandrou reports no relevant conflicts of interest.
  • Brilakis reports consulting/speaker fees from Abbott Vascular, Amgen, Asahi Intecc, Biotronik, Boston Scientific, CSI, Elsevier, GE Healthcare, IMDS, Medicure, Medtronic, Siemens, Teleflex, and Terumo; and research support from Boston Scientific and GE Healthcare.