For two decades, the initial response to new medical symptoms was a familiar digital pilgrimage: a search engine query. This ubiquitous practice earned the somewhat dismissive moniker "Dr. Google." However, the landscape of health information seeking is rapidly evolving, with large language models (LLMs) like ChatGPT now emerging as a significant alternative. OpenAI reports that a staggering 230 million people ask ChatGPT health-related questions each week, underscoring a profound shift in how individuals approach personal well-being concerns.

This burgeoning reliance on AI for health guidance is the backdrop against which OpenAI launched its new ChatGPT Health product earlier this month. The debut was met with an unsettling juxtaposition of innovation and cautionary tales. Just two days before the launch, the news website SFGate reported the tragic story of Sam Nelson, a teenager who died from an overdose last year after extensively consulting ChatGPT about combining various drugs. In the wake of these back-to-back events, a chorus of journalists and medical professionals rightly questioned the prudence of entrusting potentially life-altering medical advice to an AI that has demonstrated a capacity for generating dangerous misinformation.

ChatGPT Health, while residing in a distinct sidebar tab, is not an entirely new AI model. Rather, it functions as a sophisticated wrapper, augmenting OpenAI’s existing LLMs with specialized guidance and tools tailored for health-related interactions. Crucially, it can be granted permission to access a user’s electronic medical records and fitness app data, offering a level of personalized context that traditional search engines cannot match. OpenAI itself emphasizes that ChatGPT Health is intended as a supplementary resource, not a replacement for qualified medical professionals. Nevertheless, the reality remains that when doctors are inaccessible or unable to provide timely assistance, individuals will inevitably seek alternative solutions.

From the perspective of some physicians, LLMs present a promising avenue for enhancing medical literacy. The average patient often struggles to navigate the labyrinthine world of online medical information, facing the daunting task of distinguishing credible sources from those that are merely polished and persuasive but factually unsound. LLMs, in theory, can shoulder this burden, acting as a filter and synthesizer. Dr. Marc Succi, an associate professor at Harvard Medical School and a practicing radiologist, notes that consultations with patients who have self-diagnosed via Google often involve significant effort in dispelling anxiety and correcting misinformation. He observes a positive shift: "You see patients with a college education, a high school education, asking questions at the level of something an early med student might ask."

The emergence of ChatGPT Health, coupled with Anthropic’s recent announcement of new health integrations for its AI model, Claude, signals a growing willingness from AI giants to acknowledge and actively promote the application of their technologies in the healthcare domain. This expansion, however, is not without inherent risks, given the well-documented tendencies of LLMs to exhibit sycophancy—agreeing with users—and to fabricate information rather than admit ignorance.

These risks must be carefully weighed against the potential benefits. An analogy can be drawn to the development of autonomous vehicles. When policymakers deliberate on the deployment of self-driving cars, the critical metric is not the complete absence of accidents but whether they demonstrably cause less harm than the current reality of human drivers. If "Dr. ChatGPT" proves to be a superior source of health information compared to "Dr. Google," and early indications suggest this may be the case, it could significantly mitigate the pervasive burden of medical misinformation and the unnecessary health anxiety that the internet has fostered.

Quantifying the efficacy of consumer-facing AI chatbots like ChatGPT or Claude for health inquiries, however, presents a considerable challenge. "It’s exceedingly difficult to evaluate an open-ended chatbot," admits Danielle Bitterman, the clinical lead for data science and AI at the Mass General Brigham health-care system. While LLMs perform well on standardized medical licensing examinations, these tests utilize multiple-choice questions that do not accurately replicate the nuanced, conversational way individuals typically use chatbots for health information.

To bridge this gap, Sirisha Rambhatla, an assistant professor at the University of Waterloo, conducted a study evaluating GPT-4o's responses to licensing-exam questions when the model was deprived of the answer choices. Medical experts who assessed these responses found only about half to be entirely correct. That said, multiple-choice exams are deliberately tricky, with distractor options designed not to give away the correct answer, and they remain a distant approximation of the prompts users actually type into chatbots.

A separate study, employing more realistic prompts submitted by human volunteers, found that GPT-4o answered medical questions correctly approximately 85% of the time. Amulya Yadav, an associate professor at Pennsylvania State University and lead researcher on this study, expressed personal reservations about patient-facing medical LLMs but acknowledged their technical capabilities. He pointed out that human doctors also misdiagnose patients 10% to 15% of the time. "If I look at it dispassionately, it seems that the world is gonna change, whether I like it or not," he stated, reflecting the inevitability of technological advancement.

For individuals seeking health information online, Yadav concluded that LLMs appear to be a more advantageous choice than Google. Dr. Succi, the radiologist, reached a similar conclusion when comparing GPT-4’s responses to queries about common chronic medical conditions with the information presented in Google’s knowledge panel.

Since the studies by Yadav and Succi were published in early 2025, OpenAI has released multiple iterations of its GPT models. It is reasonable to assume that subsequent versions, such as GPT-5.2, would exhibit improved performance over their predecessors. However, these studies do have significant limitations. They primarily focus on straightforward, factual questions and examine brief user-chatbot or user-search engine interactions. The inherent weaknesses of LLMs, particularly their tendency to be sycophantic and to "hallucinate" or generate fabricated information, might manifest more prominently in extended conversations or with individuals grappling with more complex health issues. Reeva Lederman, a professor at the University of Melbourne specializing in technology and health, highlights a potential concern: patients dissatisfied with their doctor’s diagnosis or treatment might seek a second opinion from an LLM. A sycophantic LLM could then inadvertently reinforce the patient’s rejection of their doctor’s advice.

Studies have indeed documented instances of LLMs exhibiting hallucinations and sycophancy in response to health-related prompts. One study revealed that GPT-4 and GPT-4o readily accepted and elaborated upon incorrect drug information embedded in user queries. Another found GPT-4o frequently inventing definitions for fictitious syndromes and lab tests mentioned in user prompts. Given the proliferation of medically dubious diagnoses and treatments online, these LLM behaviors could inadvertently amplify the spread of misinformation, especially if users perceive these AI tools as inherently trustworthy.

OpenAI has reported that its GPT-5 series of models demonstrates a marked reduction in sycophancy and hallucination compared to earlier versions, suggesting that the findings of these studies might not be directly applicable to ChatGPT Health. Furthermore, the company has evaluated the model powering ChatGPT Health using its publicly available HealthBench benchmark. This benchmark assesses AI responses based on their ability to express appropriate uncertainty, recommend seeking medical attention when necessary, and avoid unnecessarily alarming users by overstating the severity of their condition. While it is reasonable to assume that the underlying model for ChatGPT Health performed well on these criteria during testing, Bitterman notes that the inclusion of LLM-generated prompts in HealthBench could limit its real-world applicability.

An AI that avoids alarmism represents a clear improvement over systems that have historically led individuals to "convince themselves they have cancer after a few minutes of browsing." As LLMs and the applications built upon them continue to evolve, the advantages of "Dr. ChatGPT" over "Dr. Google" are likely to grow. The introduction of ChatGPT Health, with its potential to access medical records for enhanced contextual understanding, is a significant step in this direction. However, numerous experts have voiced concerns about the privacy implications of granting such access.

Even if ChatGPT Health and similar emerging tools offer a demonstrable improvement over traditional Google searches, their overall impact on public health remains an open question. Just as autonomous vehicles, despite their potential safety advantages, could have a net negative effect if they deter people from using public transportation, LLMs might undermine health outcomes if they encourage over-reliance on internet-based information at the expense of human medical professionals, even as the quality of that online information improves.

Lederman believes this outcome is plausible. Her research indicates that members of online health communities often place greater trust in articulate individuals, irrespective of the factual accuracy of their contributions. Because ChatGPT communicates with fluency and apparent confidence, some individuals may place undue trust in its pronouncements, potentially to the detriment of seeking advice from their doctor. For the foreseeable future, LLMs are unlikely to fully replace human physicians.