AI Outperforms Emergency Room Doctors in Harvard Diagnosis Study
A new study from Harvard Medical School and Beth Israel Deaconess Medical Center reveals that OpenAI's large language models can deliver more accurate diagnoses than human physicians in emergency room settings.
Study Overview
- Published in: the journal Science this week
- Research Team: Physicians and computer scientists from Harvard Medical School and Beth Israel Deaconess Medical Center
- Focus: Comparing AI diagnostic performance against human doctors in real emergency room cases
Key Findings
Emergency Room Diagnostic Accuracy
The study examined 76 real emergency room cases, comparing diagnoses from:
- Two internal medicine attending physicians
- OpenAI's o1 and GPT-4o models
Two additional attending physicians evaluated all diagnoses without knowing which were AI-generated or human-made.
Performance Results at Initial ER Triage:
- o1 Model: 67% exact or very close diagnosis
- Physician 1: 55% exact or very close diagnosis
- Physician 2: 50% exact or very close diagnosis
Critical Context
The AI models received unprocessed electronic medical records, the same information available to physicians at each diagnostic touchpoint. The performance gap was "especially pronounced at the first diagnostic touchpoint (initial ER triage), where there is the least information available about the patient and the most urgency to make the correct decision."
Research Methodology
- AI models were tested with text-based information only
- No pre-processing of medical data
- Multiple diagnostic touchpoints evaluated throughout patient care
- Blinded evaluation by independent physicians
Important Limitations
What This Study Does NOT Claim:
Not Ready for Real-World Deployment: Researchers emphasized an "urgent need for prospective trials to evaluate these technologies in real-world patient care settings."
Text-Only Performance: The study examined only text-based diagnoses; existing research suggests AI is "more limited in reasoning over nontext inputs."
Specialist Comparison Issue: AI was compared to internal medicine physicians, not ER specialists who regularly perform triage.
Different Goals: As emergency physician Kristen Panthagani noted, "As an ER doctor seeing a patient for a first time, my primary goal is not to guess your ultimate diagnosis. My primary goal is to determine if you have a condition that could kill you."
Expert Commentary
Arjun Manrai (AI Lab Head, Harvard Medical School):
"We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines."
Adam Rodman (Beth Israel Deaconess Physician, Study Lead Author):
"There's no formal framework right now for accountability" around AI diagnoses. Patients still "want humans to guide them through life or death decisions."
Bottom Line
While the study demonstrates impressive AI diagnostic capabilities in controlled settings, researchers stress that clinical trials are necessary before real-world implementation. The lack of accountability frameworks and the human element in critical medical decisions remain significant barriers to deployment.