
How AI Chatbots Compare to Doctors in Reasoning

Summary

– OpenAI’s o1-preview LLM outperformed physicians on clinical reasoning tasks using real emergency room records, reaching 82% diagnostic accuracy at the final checkpoint versus 79% and 70% for two doctors.
– Other research shows chatbots often give flawed medical advice, fabricating citations and presenting confident but inaccurate answers, raising reliability concerns.
– The study authors emphasize that AI should not replace doctors but that further testing in real-world cases and prospective clinical trials is needed.
– There is no standard scoring system for evaluating LLMs in clinical reasoning, leading to different conclusions on performance depending on how success is defined.
– Researchers stress the urgent need to understand LLM benefits and risks, focusing on how doctors interact with the technology rather than comparing AI versus humans.

One of the earliest ambitions for computing in medicine was to support clinical reasoning, the step-by-step decision-making that leads to a diagnosis and treatment plan. Over the decades, researchers built clinical decision support systems, each painstakingly programmed with rules about symptoms, test thresholds, and drug interactions. As artificial intelligence evolves, clinical reasoning has become a natural frontier.

Now, a large language model (LLM) from OpenAI has surpassed physicians on several clinical reasoning tasks using real emergency room records, according to a study published April 30 in Science.

These results arrive amid a wave of contradictory evidence about medical information from chatbots. Some studies show impressive diagnostic accuracy; others document fabricated citations, flawed advice, and outcomes that vary depending on how researchers score the systems. Despite that uncertainty, products aimed at medical professionals are already hitting the market. This year, OpenAI launched ChatGPT for Clinicians and ChatGPT for Healthcare.

The performance of OpenAI’s o1-preview, a general-purpose model since replaced by newer versions, was strong enough that the authors recommend further testing of LLMs in real-life cases, with physicians seeking second opinions on diagnoses at specific checkpoints.

Mickael Tordjman, who studies AI in medical imaging at the Icahn School of Medicine in New York, agrees that the time is right for research focused on real-world applications. “We need more proof in prospective clinical trials,” he says, noting that newer LLMs, or models trained specifically for medicine, could perform even better.

While the authors of the Science paper expressed optimism about AI’s medical potential during a press briefing, they also stressed important limitations and raised concerns about how their findings might be misinterpreted. “I don’t think our findings mean that AI replaces doctors,” says co-author Arjun Manrai, who studies AI at Harvard Medical School.

“I think this is really cool, don’t get me wrong,” adds co-author Adam Rodman, a medical educator at Beth Israel Deaconess Medical Center in Boston. “I get a little queasy about how some of these results might be used.”

How Reliable Are Chatbots on Medical Matters?

Other researchers investigating chatbots’ medical advice have recently found reason to doubt their trustworthiness. In one study, nearly half of the responses that five popular chatbots gave to open-ended health questions were flawed. The chatbots fabricated information and citations, and presented their answers confidently regardless of accuracy.

“These models are being used every day. There’s a certain risk there that’s not being quantified or mitigated,” says Arya Rao, who studies AI in medical practice in a different Harvard group than the Science authors.

Much of the research focuses on chatbots answering health questions from everyday users, the kinds of questions a person might ask before deciding to see a doctor. Using an LLM as a clinical decision-support tool for physicians is a different task. Doctors should have a much better sense of what information would help an LLM reach an accurate diagnosis or treatment plan, as well as the background knowledge to spot obvious mistakes.

Still, detecting hallucinations could remain challenging for physicians. “The models are equally convincing whether they are right or wrong,” Rodman says. “We need to find workflows with a low rate of errors.”

Even studies focused on physician-facing clinical reasoning tasks can reach very different conclusions depending on how researchers define success. In a paper published April 13 in JAMA Network Open, Rao and colleagues tested 21 LLMs on clinical reasoning tasks similar to those in the Science paper. As in that study, many performed well on final diagnoses, including chatbots in the o1 series. However, the LLMs scored poorly on differential diagnosis questions under Rao’s stricter evaluation system.

When doctors make a differential diagnosis, they list all plausible causes of a patient’s symptoms. An LLM might correctly name six of the seven diagnoses that belong on that list. Depending on the rubric, that could reasonably be scored as 86 percent or, in Rao’s system, as an unacceptable failure.
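To make that disagreement concrete, here is a minimal sketch in Python contrasting the two grading philosophies. The function names and example diagnoses are illustrative only, not the papers’ actual rubrics, which rely on expert review rather than exact string matching.

    def partial_credit(predicted: set, expected: set) -> float:
        """Fraction of the expected differential the model recovered."""
        return len(predicted & expected) / len(expected)

    def all_or_nothing(predicted: set, expected: set) -> float:
        """Score 1.0 only if every expected diagnosis is listed."""
        return 1.0 if expected <= predicted else 0.0

    # Hypothetical seven-item differential for acute chest pain
    expected = {"pulmonary embolism", "pneumonia", "heart failure",
                "pericarditis", "pneumothorax", "aortic dissection",
                "musculoskeletal pain"}
    predicted = expected - {"aortic dissection"}  # the model misses one of seven

    print(partial_credit(predicted, expected))  # -> 0.857..., roughly 86 percent
    print(all_or_nothing(predicted, expected))  # -> 0.0, graded as a failure

Under the first metric the model looks strong; under the second, the same answer counts as a miss, which is how two studies can describe similar raw outputs so differently.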

There is no agreed-upon standard scoring system. “It is still something in progress,” Tordjman says. “There’s no perfect way to evaluate LLMs in clinical reasoning.”

Testing Medical AI in the Real World

For the Science study, the researchers tested the OpenAI model with several batteries of medical case studies, comparable to difficult open-ended medical exam questions. Instructions to the chatbot were sometimes lengthy and filled with details that could be either extraneous noise or critical clues to the correct diagnosis.

“We went the extra step and showed that this performance also works in the real world,” Rodman says. One part of the study used data from 76 actual emergency room visits. The researchers asked the LLM and physicians for diagnoses at several stages of care: upon arrival, after evaluation by a doctor, and after transfer to another part of the hospital. Though both computers and humans were more accurate as more information became available, the LLM consistently outperformed the humans. For example, it provided an “exact or very close diagnosis” 82 percent of the time at the final checkpoint, compared to 79 percent and 70 percent for the two physicians.
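To illustrate the shape of that evaluation only, the checkpoint comparison could be tabulated as in the sketch below; the per-case judgments are hypothetical placeholders, since the study’s 76-visit dataset and grading rubric are not reproduced here.

    from collections import defaultdict

    # Hypothetical records: (checkpoint, rater, judged "exact or very close")
    judgments = [
        ("1-arrival", "llm", True), ("1-arrival", "physician_a", False),
        ("2-evaluation", "llm", True), ("2-evaluation", "physician_a", True),
        ("3-transfer", "llm", True), ("3-transfer", "physician_a", True),
        # ... in the real study, one judgment per case, checkpoint, and rater
    ]

    totals = defaultdict(lambda: [0, 0])  # (checkpoint, rater) -> [hits, cases]
    for checkpoint, rater, hit in judgments:
        totals[(checkpoint, rater)][0] += int(hit)
        totals[(checkpoint, rater)][1] += 1

    for (checkpoint, rater), (hits, n) in sorted(totals.items()):
        print(f"{checkpoint:>14} {rater:>12}: {hits}/{n} ({hits / n:.0%})")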

LLMs, as we know them, are not even a decade old, and the landscape is rapidly evolving. Updated versions of flagship LLMs arrive faster than the typical pace of medical studies and academic literature, and many questions about regulation and liability remain unanswered. With many patients and doctors already consulting these machines, researchers told IEEE Spectrum that there is an urgent need to understand their benefits, risks, and the best way to use them.

While comparing AI performance against human physicians was important to the study, Manrai says the more important question is how doctors will actually use the technology. “We have to very rapidly move away from ‘AI versus humans’ toward how humans interact with this technology,” Manrai says.

Despite the many unresolved questions, Harvard’s Rao says the technology is advancing too quickly for medicine to ignore. “I would say it’s important to be careful, it’s important to evaluate, but it’s perhaps even more important to innovate,” she says. “We don’t want to rain on the parade. We think responsible innovation is the way to go.”

(Source: IEEE Spectrum)
