AI chatbots fail to diagnose patients by talking to them

Don’t call your favorite AI “doctor” just yet


Advanced artificial intelligence models score well on professional medical exams, but still fail to perform one of the most crucial medical tasks: talking to patients to gather relevant medical information and deliver an accurate diagnosis.

“While large language models show impressive results on multiple-choice tests, their accuracy drops significantly in dynamic conversations,” says Pranav Rajpurkar of Harvard University. “The models especially struggle with open-ended diagnostic reasoning.”

That became clear when researchers developed a method to evaluate a clinical AI model’s reasoning abilities based on simulated doctor-patient conversations. The “patients” were based on 2000 medical cases, drawn primarily from professional American medical board exams.

“Simulating patient interactions enables the evaluation of medical history-taking skills, a critical component of clinical practice that cannot be assessed using case vignettes,” says Shreya Johri, also at Harvard University. The new evaluation benchmark, called CRAFT-MD, “also mirrors real-life scenarios where patients may not know what details are critical to share and only reveal important information when prompted by specific questions,” she says.

The CRAFT-MD benchmark itself relies on AI. OpenAI’s GPT-4 model played the role of a “patient AI” in conversation with the “clinical AI” being tested. GPT-4 also helped to judge the results by comparing the clinical AI’s diagnosis with the correct answer for each case. Human medical experts double-checked these evaluations. They also reviewed the conversations to check the patient AI’s accuracy and to see whether the clinical AI managed to collect the relevant medical information.
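
Based only on the workflow described above, the following is a minimal Python sketch of how such an evaluation loop could be wired together with the OpenAI chat API. The prompts, the turn limit, the use of GPT-4 as both patient and grader, and helper names such as run_case are illustrative assumptions, not the authors’ actual CRAFT-MD implementation.

# A minimal sketch of a CRAFT-MD-style evaluation loop, based only on the
# description in this article. Prompts, the turn limit and helper names are
# illustrative assumptions, not the authors' implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PATIENT_SYSTEM_PROMPT = (
    "You are a patient with the following case history: {vignette}\n"
    "Answer the doctor's questions briefly, revealing details only when "
    "you are specifically asked about them."
)
DOCTOR_SYSTEM_PROMPT = (
    "You are a physician taking a patient's history. Ask one question per "
    "turn. When confident, answer with 'DIAGNOSIS: <your diagnosis>'."
)
GRADER_PROMPT = (
    "Ground-truth diagnosis: {truth}\nModel diagnosis: {predicted}\n"
    "Do these refer to the same condition? Answer yes or no."
)


def chat(model, messages):
    """Run one chat completion and return the assistant's reply text."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content


def run_case(vignette, truth, clinical_model="gpt-3.5-turbo", max_turns=10):
    """Simulate one doctor-patient conversation, then grade the diagnosis."""
    patient_history = [{"role": "system",
                        "content": PATIENT_SYSTEM_PROMPT.format(vignette=vignette)}]
    # The conversation opens with a generic complaint from the patient.
    doctor_history = [{"role": "system", "content": DOCTOR_SYSTEM_PROMPT},
                      {"role": "user", "content": "Hello doctor, I'm not feeling well."}]

    diagnosis = None
    for _ in range(max_turns):
        # The clinical AI under test asks a question or commits to a diagnosis.
        doctor_turn = chat(clinical_model, doctor_history)
        doctor_history.append({"role": "assistant", "content": doctor_turn})
        if doctor_turn.strip().upper().startswith("DIAGNOSIS:"):
            diagnosis = doctor_turn.split(":", 1)[1].strip()
            break
        # The patient AI (GPT-4) answers only what it was asked.
        patient_history.append({"role": "user", "content": doctor_turn})
        patient_turn = chat("gpt-4", patient_history)
        patient_history.append({"role": "assistant", "content": patient_turn})
        doctor_history.append({"role": "user", "content": patient_turn})

    if diagnosis is None:  # ran out of turns without committing to an answer
        return None, False

    # A GPT-4 grader compares the prediction with the ground truth; in the
    # study, human medical experts then double-checked these judgments.
    verdict = chat("gpt-4", [{"role": "user",
                              "content": GRADER_PROMPT.format(truth=truth,
                                                              predicted=diagnosis)}])
    return diagnosis, verdict.strip().lower().startswith("yes")

In the real study, the human experts also reviewed the full conversation transcripts, a step this sketch omits.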

Several experiments showed that four leading large language models—OpenAI’s GPT-3.5 and GPT-4 models, Meta’s Llama-2-7b model and Mistral AI’s Mistral-v2-7b model—performed significantly worse on the conversational benchmark than they did when making diagnoses from written summaries of the cases. OpenAI, Meta and Mistral AI did not respond to requests for comment.

For example, GPT-4’s diagnostic accuracy was an impressive 82 percent when it was presented with structured case summaries and allowed to select the diagnosis from a multiple-choice list of answers. That fell to just under 49 percent when it had to generate a diagnosis without multiple-choice options, and when it had to make diagnoses from simulated patient conversations, its accuracy dropped to just 26 percent.

And GPT-4 was the best-performing AI model tested in the study, with GPT-3.5 often coming in second, the Mistral AI model sometimes second or third, and Meta’s Llama model scoring the lowest overall.

The AI models also failed to collect complete medical histories a significant proportion of the time, with the leading model GPT-4 doing so in only 71 percent of simulated patient conversations. Even when the AI models did collect a patient’s relevant medical history, they did not always produce the correct diagnoses.

Such simulated patient conversations represent a “far more useful” way to evaluate AI clinical reasoning than standard medical exam questions, says Eric Topol of the Scripps Research Translational Institute in California.

If an AI model eventually passes this benchmark and consistently makes accurate diagnoses based on simulated patient conversations, this would not necessarily make it superior to human doctors, Rajpurkar says. He points out that medical practice in the real world is far messier than in simulations. It involves managing multiple patients, coordinating with health care teams, performing physical examinations and understanding “complex social and systemic factors” in community health settings.

“Strong performance on our benchmark suggests that AI could be a powerful tool to support clinical work – but not necessarily a substitute for the holistic judgment of experienced physicians,” says Rajpurkar.
