AI Outperforms Emergency Room Doctors in Harvard Diagnosis Study
A new study from Harvard Medical School and Beth Israel Deaconess Medical Center reveals that OpenAI's large language models can deliver more accurate diagnoses than human physicians in emergency room settings.
Study Overview
- Published in: the journal Science this week
- Research Team: Physicians and computer scientists from Harvard Medical School and Beth Israel Deaconess Medical Center
- Focus: Comparing AI diagnostic performance against human doctors in real emergency room cases
Key Findings
Emergency Room Diagnostic Accuracy
The study examined 76 real emergency room cases, comparing diagnoses from:
- Two internal medicine attending physicians
- OpenAI's o1 and GPT-4o models
Two additional attending physicians evaluated all diagnoses without knowing which were AI-generated or human-made.
Performance Results at Initial ER Triage:
- o1 Model: 67% exact or very close diagnosis
- Physician 1: 55% exact or very close diagnosis
- Physician 2: 50% exact or very close diagnosis
Critical Context
The AI models received unprocessed electronic medical records, the same information available to physicians at each diagnostic touchpoint. The performance gap was "especially pronounced at the first diagnostic touchpoint (initial ER triage), where there is the least information available about the patient and the most urgency to make the correct decision."
Research Methodology
- AI models were tested with text-based information only
- No pre-processing of medical data
- Multiple diagnostic touchpoints evaluated throughout patient care
- Blinded evaluation by independent physicians
Important Limitations
What This Study Does NOT Claim:
Not Ready for Real-World Deployment: Researchers emphasized an "urgent need for prospective trials to evaluate these technologies in real-world patient care settings."
Text-Only Performance: The study examined only text-based diagnoses; existing research suggests AI is "more limited in reasoning over nontext inputs."
Specialist Comparison Issue: AI was compared to internal medicine physicians, not ER specialists who regularly perform triage.
Different Goals: As emergency physician Kristen Panthagani noted, "As an ER doctor seeing a patient for a first time, my primary goal is not to guess your ultimate diagnosis. My primary goal is to determine if you have a condition that could kill you."
Expert Commentary
Arjun Manrai (AI Lab Head, Harvard Medical School):
"We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines."
Adam Rodman (Beth Israel Deaconess Physician, Study Lead Author):
"There's no formal framework right now for accountability" around AI diagnoses. Patients still "want humans to guide them through life or death decisions."
Bottom Line
While the study demonstrates impressive AI diagnostic capabilities in controlled settings, researchers stress that clinical trials are necessary before real-world implementation. The lack of accountability frameworks and the human element in critical medical decisions remain significant barriers to deployment.