Can ChatGPT Health Outperform “Dr. Google”?

Summary
– Some doctors believe LLMs can improve patient medical literacy by filtering online information better than patients can on their own, leading to more informed questions.
– AI companies are increasingly developing and promoting health-specific LLM tools, despite known risks like the models’ tendencies to agree with users or fabricate information.
– Early studies suggest LLMs like GPT-4 can answer medical questions more accurately than standard web searches, potentially reducing misinformation and anxiety.
– However, significant limitations exist, as LLMs can hallucinate, be overly agreeable, and their performance in brief, factual tests may not reflect real-world, complex patient interactions.
– While newer models claim reduced hallucination and sycophancy, evaluating their real-world effectiveness for consumer health remains difficult due to the limitations of current benchmarks.
For many people, the first stop for a health concern is a search engine. The sheer volume of online information can be overwhelming, making it difficult to separate reliable medical advice from misleading content. Some physicians view large language models (LLMs) as a potential tool to improve patient education, arguing they can help navigate this complex landscape. Dr. Marc Succi, a Harvard Medical School professor and radiologist, observes a shift. Dealing with patients who had searched Google, he notes, often meant addressing high anxiety and correcting misinformation. Now he finds that patients who have consulted an LLM arrive with more sophisticated, nuanced questions, similar to those posed by medical students in their early training.
The introduction of specialized tools like ChatGPT Health signals a growing acceptance by AI companies of health-related applications for their models. This move is not without significant risk, given the known tendencies of these systems to sometimes fabricate information or overly agree with a user’s assumptions. However, the potential advantages must also be considered. The situation is analogous to evaluating autonomous vehicles: the critical question isn’t whether they are perfect, but whether they represent a net improvement over human drivers. If an AI assistant proves more reliable than a standard web search, it could help reduce the widespread burden of medical misinformation and the unnecessary worry it often creates.
Measuring the real-world effectiveness of a chatbot for consumer health inquiries is a complex challenge. “Evaluating an open-ended chatbot is exceedingly difficult,” explains Danielle Bitterman, clinical lead for data science and AI at Mass General Brigham. While LLMs perform impressively on standardized medical exams, those tests rely on multiple-choice formats that don’t mirror how people actually interact with these tools in everyday life.
Researchers are working to bridge this gap. One study led by Sirisha Rambhatla at the University of Waterloo evaluated GPT-4o’s responses to exam-style questions without providing answer choices. Medical experts judged only about half of the AI’s answers as completely correct. However, exam questions are intentionally tricky, and they remain a poor substitute for the varied, conversational prompts a typical user might enter.
Another investigation tested GPT-4o using more realistic health questions submitted by volunteers, finding it answered correctly roughly 85% of the time. Amulya Yadav of Penn State University, who led this study, personally remains cautious about patient-facing medical AI. Yet he acknowledges the technical capability, pointing out that human doctors also have a documented misdiagnosis rate. “If I look at it dispassionately, it seems that the world is gonna change, whether I like it or not,” he states.
For individuals searching for medical information online, evidence suggests LLMs may offer a better alternative to traditional search engines. Dr. Succi’s own comparison found that GPT-4 provided more useful information on common chronic conditions than the standard knowledge panels displayed by Google search.
Since these studies were published, newer and more advanced AI models have been released, which would likely perform even better. The existing research does have limitations, focusing primarily on simple, factual questions during brief interactions. The more problematic behaviors of LLMs, such as a tendency to be overly agreeable or to generate false information, could become more pronounced in longer, more complex conversations about serious health issues. Reeva Lederman, a University of Melbourne professor studying technology and health, warns that a patient dissatisfied with a doctor’s advice might seek confirmation from an AI. A sycophantic model could then encourage them to disregard professional medical guidance.
Previous studies have documented these very issues. Some have shown that models like GPT-4 will uncritically accept and elaborate on incorrect drug details provided in a user’s query. Others found that AI systems would readily invent plausible-sounding definitions for completely fictitious medical syndromes and tests. Given the amount of questionable health content online, these patterns could inadvertently amplify misinformation, especially if users place high trust in the AI’s responses.
OpenAI reports that its latest GPT-5 series models show substantially reduced tendencies toward sycophancy and fabrication. The company also evaluated the model behind ChatGPT Health using its HealthBench benchmark, which assesses responses on criteria like appropriately expressing uncertainty, advising users to seek professional care when needed, and avoiding unnecessary alarm. The model presumably performed well on these controlled tests, but experts like Bitterman note that benchmarks relying on AI-generated prompts may not fully predict real-world performance, where user questions are far less predictable.
(Source: Technology Review)