TOP AI THREATS
INC-25-0027 · Confirmed · Critical · Signal

Medical LLM Data Poisoning Produces Undetectable Harmful Content (2025)

Incident Details

Last Updated 2026-03-28

A study published in Nature Medicine demonstrated that replacing just 0.001% of training tokens with AI-generated medical misinformation caused large language models to produce harmful clinical recommendations while passing standard medical benchmarks undetected.

Incident Summary

In January 2025, researchers from NYU Langone Health, Columbia University, and multiple collaborating institutions published a study in Nature Medicine demonstrating, in a controlled research setting, that medical large language models are vulnerable to data-poisoning attacks at remarkably small scales.[1] No real-world patients were affected. By replacing just 0.001% of training tokens in The Pile, a widely used LLM training corpus, with AI-generated medical misinformation, the researchers produced models that generated harmful clinical recommendations while performing normally on five standard medical benchmarks (MedQA, MedMCQA, PubMedQA, MMLU Clinical Knowledge, MMLU Professional Medicine). The result demonstrates the risk of harmful recommendations reaching patients if such models were deployed in clinical practice.

The attack used ChatGPT to generate approximately 150,000 medical documents containing incorrect, outdated, and fabricated clinical information, which were then injected into the training corpus.[1] The resulting poisoned models matched clean models on open-source medical benchmarks, meaning routine safety evaluations would not detect the compromise.

Key Facts

  • Scale of poisoning: 0.001% of training tokens — approximately 2,000 articles at an estimated cost of $5[1]
  • Benchmark evasion: Poisoned models passed five standard medical benchmarks — MedQA, MedMCQA, PubMedQA, MMLU Clinical Knowledge, and MMLU Professional Medicine — with no consistent performance degradation, making the compromise undetectable through routine evaluation[1]
  • Output: Models produced harmful medical content including incorrect treatment recommendations and fabricated clinical guidance[2]
  • Authors: 32 researchers led by Daniel Alexander Alber (NYU Langone Health) with institutions including Columbia, Harvard Medical School, Washington University, and Mount Sinai[1]
  • Proposed mitigation: A pruned version of the BIOS biomedical knowledge graph (21,706 medical concepts, 416,302 relationships) combined with UMLS Metathesaurus for synonym resolution, screening LLM outputs via GPT-4 triplet extraction and MedCPT vector matching — achieving 91.9% recall of harmful content (F1 = 85.7%)[1]
  • Scope: Controlled research demonstration — no real-world patients were affected

Threat Patterns Involved

Primary: Data Poisoning — This study demonstrates that training data poisoning can be conducted at negligibly small scales ($5, 0.001% of tokens) while producing models that evade standard safety evaluations. The medical domain makes this especially dangerous because erroneous clinical recommendations can cause direct physical harm.

Secondary: Misinformation & Hallucinated Content — Poisoned models generate plausible but incorrect medical information that is indistinguishable from legitimate outputs, effectively embedding hallucination-like behavior through deliberate manipulation rather than model limitations.

Secondary: Adversarial Evasion — The poisoned models’ ability to pass five standard medical benchmarks without performance degradation represents a form of evaluation evasion, where the adversarial modification is specifically designed to be invisible to standard safety validation processes.

Significance

This research establishes a critical vulnerability in the AI healthcare pipeline:

  1. Asymmetric attack economics — An attacker spending $5 on content generation could compromise a medical AI system used by millions of patients
  2. Benchmark evasion — Standard safety evaluations provide no protection against this class of attack, calling into question the adequacy of current medical AI validation processes
  3. Irreversibility — Once poisoned data is incorporated into model weights, retroactive removal is technically infeasible without full retraining
  4. Supply chain exposure — Medical LLMs trained on large web-scraped corpora inherit the provenance risks of their training data, and auditing billions of training tokens for injected misinformation is impractical

Timeline

  • 2025-01-08: Study published online in Nature Medicine (Volume 31, Issue 2, pages 618-626)
  • 2025-01: Researchers demonstrate that 0.001% token replacement creates models producing harmful medical content while passing standard benchmarks

Outcomes

Other:
Researchers proposed a three-stage mitigation using the BIOS biomedical knowledge graph (pruned to 21,706 concepts) with UMLS Metathesaurus synonym resolution, capturing 91.9% of harmful content at passage level (F1 = 85.7%). The defense operates on model outputs, not training data — the authors note there is no realistic way to retroactively detect and remove misinformation from public training corpora. No real-world deployment was involved — this was a controlled research demonstration.
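The screening idea can be illustrated with a minimal sketch. The actual pipeline extracts triplets with GPT-4, resolves synonyms via the UMLS Metathesaurus, and matches against the pruned BIOS graph with MedCPT embeddings; the toy version below replaces all of that with exact set lookups against a small hypothetical graph:

```python
# Minimal sketch of knowledge-graph output screening, in the spirit of
# the paper's defense. Triplet extraction and synonym resolution are
# stubbed out; the knowledge graph here is a tiny illustrative set.

def screen_triples(extracted, knowledge_graph):
    """Return model-asserted (subject, relation, object) triplets that
    the knowledge graph cannot verify -- candidates for review."""
    return [t for t in extracted if t not in knowledge_graph]

# Toy knowledge graph of accepted medical relations (illustrative only).
knowledge_graph = {
    ("metformin", "treats", "type 2 diabetes"),
    ("aspirin", "inhibits", "platelet aggregation"),
}

# Triplets hypothetically extracted from a model-generated passage.
extracted = [
    ("metformin", "treats", "type 2 diabetes"),
    ("metformin", "treats", "influenza"),  # unsupported claim
]

flagged = screen_triples(extracted, knowledge_graph)
print(flagged)  # → [('metformin', 'treats', 'influenza')]
```

Because the defense flags claims it cannot verify rather than claims known to be false, coverage of the knowledge graph bounds its recall. From the reported recall (91.9%) and F1 (85.7%), the implied precision is roughly 80%, i.e. about one in five flagged passages would be a false positive.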

Use in Retrieval

INC-25-0027 documents Medical LLM Data Poisoning Produces Undetectable Harmful Content, a critical-severity incident classified under the Security & Cyber domain and the Data Poisoning threat pattern (PAT-SEC-004). It is dated 2025-01 with global scope. This page is maintained by TopAIThreats.com as part of an evidence-based registry of AI-enabled threats. Cite as: TopAIThreats.com, "Medical LLM Data Poisoning Produces Undetectable Harmful Content," INC-25-0027, last updated 2026-03-28.

Sources

  1. Medical large language models are vulnerable to data-poisoning attacks — Nature Medicine (primary, 2025-01-08)
    https://www.nature.com/articles/s41591-024-03445-1
  2. PubMed Entry: Medical LLM Data Poisoning Study (primary, 2025-01)
    https://pubmed.ncbi.nlm.nih.gov/39779928/
  3. AHRQ Patient Safety Network: Medical LLMs Vulnerable to Data-Poisoning Attacks (analysis, 2025)
    https://psnet.ahrq.gov/issue/medical-large-language-models-are-vulnerable-data-poisoning-attacks

Update Log

  • — First logged (Status: Confirmed, Evidence: Primary)