Diagnosis is an especially tantalizing application for generative AI: Even when given tough cases that might stump doctors, the large language model GPT-4 has solved them surprisingly well.
But a new study points out that accuracy isn’t everything, and shows exactly why health care leaders already rushing to deploy GPT-4 should slow down and proceed with caution. When the tool was asked to drum up likely diagnoses, or come up with a patient case study, in some cases it produced problematic, biased results.
“GPT-4, being trained off of our own textual communication, shows the same — or maybe even more exaggerated — racial and sex biases as humans,” said Adam Rodman, a clinical reasoning researcher who co-directs the iMED Initiative at Beth Israel Deaconess Medical Center and was not involved in the research.