
Not long after the artificial intelligence company OpenAI released its ChatGPT chatbot, the application went viral. Five days after its release, it had garnered 1 million users. Since then, it has been called world-changing, a tipping point for artificial intelligence, and the beginning of a new technological revolution.

Like others, we began exploring potential medical applications for ChatGPT, which was trained on more than 570 gigabytes of online textual data extracted from books, web texts, Wikipedia, articles, and other internet content, including some focused on medicine and health care. Although the potential of AI such as ChatGPT for medical applications excites us, inaccuracies, confabulation, and bias make us hesitant to endorse its use outside of certain situations. These range from streamlining education and administrative tasks to assisting clinical decision-making, though even there the application has significant problems and pitfalls.


As an educational aid

In the United States, medical education continues to inch away from memorizing and retaining information toward curating and applying medical knowledge. AI systems like ChatGPT could facilitate this transition by helping medical students and physicians learn more efficiently, from creating unique memory devices (“create a mnemonic for the names of the cranial nerves”) to explaining complex concepts in language of varying complexity (“explain tetralogy of Fallot to me like I’m a 10th grader, a first-year medical student, or a cardiology fellow”).

By asking ChatGPT, we learned it can aid in studying for standardized medical exams by generating quality practice questions alongside detailed explanations for the correct and incorrect answers. Perhaps it should come as no surprise that, in a recent study released as a preprint — in which ChatGPT was listed as a co-author — the application passed the first two steps of the United States Medical Licensing Exam, the national exam that most U.S. medical students take to qualify for medical licensure.

ChatGPT can also simulate a patient, supplying a medical history, physical exam findings, lab results, and more on request. With its capacity to answer follow-up questions, ChatGPT could offer opportunities to refine a physician’s diagnostic skills and clinical acumen more generally, though its responses should be met with a high level of skepticism.


Although ChatGPT can help physicians, they need to tread carefully and not use it as a primary source without verification.

For administrative work

In 2018, the last year for which we could find solid statistics, 70% of physicians said they spent at least 10 hours a week on paperwork and administrative tasks, with nearly one-third of them spending 20 hours or more.

ChatGPT could be used to help health care workers save time on nonclinical tasks, which contribute to burnout and take time away from interacting with patients. We found that ChatGPT’s dataset includes the Current Procedural Terminology (CPT) code set, a standardized system for identifying medical procedures and services that most physicians use to bill for procedures or the care they provide. To test how well it worked, we asked ChatGPT for several billing codes: it gave us the correct code for Covid vaccines but inaccurate ones for amniocentesis and an x-ray of the sacrum. In other words, for now it is close but no cigar, and substantial improvement is needed.

Clinicians spend an inordinate amount of time writing letters to insurers and third-party contractors to advocate for their patients’ needs. ChatGPT could help with this time-consuming task. We asked ChatGPT, “Can you write an authorization letter for Blue Cross regarding transesophageal echocardiogram usage in a patient with valve disease? The service is not covered by the insurance provider. Please incorporate references that include scientific research.” Within seconds, we received a personalized letter that could serve as a time-saving template for this request. It required some editing, but generally got the message across.

Clinical applications

The use of ChatGPT in clinical medicine warrants far more caution than its use for educational and administrative work. In clinical practice, ChatGPT could streamline the documentation process, generating medical charts, progress notes, and discharge instructions. Jeremy Faust, an emergency medicine physician at Brigham and Women’s Hospital in Boston, for instance, put ChatGPT to the test by requesting a chart for a fictional patient with a cough, to which the system responded with a template that Faust remarked was “eerily good.” The potential is obvious: helping health care workers sort through a set of symptoms, determine treatment dosages, recommend a course of action, and the like. But the risk is significant.

One of ChatGPT’s major issues is its potential to generate inaccurate or false information. When we asked the application to give a differential diagnosis for postpartum hemorrhage, it appeared to do an expert job, and even offered supporting scientific evidence. But when we looked into the sources, none of them actually existed. Faust identified a similar error when ChatGPT stated that costochondritis — a common cause of chest pain — can be caused by oral contraceptive pills, but confabulated a fake research paper to support this statement.

This potential for deception is particularly worrisome given that a recent preprint showed that scientists have difficulty differentiating between real abstracts and fake ones generated by ChatGPT. The risk of misinformation is even greater for patients, who might use ChatGPT to research their symptoms, as many currently do with Google and other search engines. Indeed, ChatGPT generated a horrifyingly convincing explanation of how “crushed porcelain added to breast milk can support the infant digestive system.”

Our concerns about clinical misinformation are further heightened by the potential for bias in ChatGPT’s responses. When a user asked ChatGPT to generate computer code to check if a person would be a good scientist based on their race and gender, the program defined a good scientist as being a white male. While OpenAI may be able to filter out certain instances of explicit bias, we worry about more implicit instances of bias that could work to perpetuate stigma and discrimination within health care. Such biases can arise because of smaller sample sizes of training data and limited diversity in that data. But given that ChatGPT was trained on more than 570 gigabytes of online textual data, the program’s biases may instead reflect the universality of bias across the internet.

What’s next?

Artificial intelligence tools are here to stay. They are already being used as clinical decision support aids to help predict kidney disease, simplify radiology reports, and accurately forecast leukemia remission rates. The recent releases of Google’s Med-PaLM, a similar AI model tailored for medicine, and of OpenAI’s application programming interface, which developers can use to build health care software on top of ChatGPT, only further emphasize the technological revolution transforming health care.

But in this seemingly endless plane of progress, an imperfect tool is being deployed without the necessary guardrails in place. Although there may be acceptable uses of ChatGPT across medical education and administrative tasks, we cannot endorse the program’s use for clinical purposes — at least in its current form.

Launched to the public as a beta product, ChatGPT will undoubtedly improve, and we anticipate the arrival of GPT-4, which we hope will bring greater precision and efficiency. The release of a powerful tool such as ChatGPT will instill awe, but in medicine it needs to elicit appropriate action to evaluate its capabilities, mitigate its harms, and facilitate its optimal use.

Rushabh H. Doshi is a medical student at the Yale School of Medicine. Simar S. Bajaj is an undergraduate student at Harvard University. The authors thank Harlan M. Krumholz, the director of the Center for Outcome Research Evaluation at Yale-New Haven Hospital, for his input and help with this essay.

