Step-by-step workflow for identifying and responding to data poisoning attacks on AI training data, fine-tuning corpora, and RAG knowledge bases. Covers pre-training inspection, during-training monitoring, post-deployment detection, and remediation.
Who this is for: ML engineers, data engineers, security teams, and AI platform operators responsible for training data integrity, fine-tuning pipelines, or RAG knowledge base management.
What Data Poisoning Is and Why It Matters
Data poisoning is a supply chain attack on AI systems. Instead of attacking the model directly, the attacker manipulates the data the model learns from — inserting malicious examples that cause the model to produce incorrect outputs, exhibit biased behavior, or respond to hidden triggers (backdoors).
The NIST AI Risk Management Framework identifies training data integrity as a foundational requirement for trustworthy AI, and MITRE ATLAS catalogues known data poisoning attack patterns. The threat is well documented.
For the underlying science, see the Data Poisoning Detection Methods reference page.
Threat patterns this guide addresses
Step 1: Map Your Data Supply Chain
Before you can detect poisoning, understand where your data comes from and how it reaches the model:
Inventory all data sources — web scrapes, public datasets, licensed data, user-generated content, internal databases, third-party APIs
Document the pipeline — from raw collection through filtering, preprocessing, annotation, and final training/indexing
Identify trust levels — which sources are authenticated and audited? Which are open to public contribution?
Map access controls — who can modify datasets at each pipeline stage? Are changes logged?
Identify RAG knowledge bases — what documents feed into retrieval-augmented generation? How are they ingested?
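The inventory above can be captured as structured records rather than a spreadsheet, so risk queries are repeatable. A minimal sketch; the field names, trust levels, and example sources are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field

# Illustrative inventory entry -- adapt fields to your own pipeline.
@dataclass
class DataSource:
    name: str
    kind: str                  # e.g. "web_scrape", "licensed", "user_generated"
    trust_level: str           # e.g. "audited", "authenticated", "public"
    pipeline_stages: list = field(default_factory=list)
    change_logging: bool = False

inventory = [
    DataSource("common-crawl-subset", "web_scrape", "public",
               ["collect", "filter", "dedupe", "train"]),
    DataSource("vendor-qa-pairs", "licensed", "authenticated",
               ["verify_hash", "annotate", "fine_tune"], change_logging=True),
]

# Surface the riskiest sources first: open to public contribution, no change log.
risky = [s.name for s in inventory
         if s.trust_level == "public" and not s.change_logging]
```

Once sources are records, "which unaudited sources feed fine-tuning?" becomes a one-line query instead of a meeting.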
Step 2: Pre-Training Data Inspection
Apply these checks to training and fine-tuning datasets before they reach the model.
Source verification
Verify data provenance — confirm each dataset came from its claimed source. Check cryptographic hashes against published values
Authenticate third-party data — verify provider identity and data integrity for licensed or purchased datasets
Check for unauthorized modifications — compare dataset hashes against baseline snapshots taken at collection
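Hash verification is straightforward to automate. A minimal sketch using the standard library; the manifest format (path to expected hex digest) is an assumption, so use whatever your data provider actually publishes:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets never load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest: dict) -> list:
    """Compare each file's hash against its published value; return mismatches."""
    return [path for path, expected in manifest.items()
            if sha256_of(path) != expected]
```

Run this both at collection time (to establish the baseline snapshot) and before every training run (to catch modifications in between).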
Statistical analysis
Run outlier detection — apply dimensionality reduction (PCA, UMAP) to training data embeddings and look for anomalous clusters
Check label consistency — compare each example's features against the expected distribution for its label. Flag examples where features are inconsistent with the assigned label
Detect near-duplicates — identify clusters of suspiciously similar examples, especially if they share unusual features or unexpected labels
Profile data distributions — compare the statistical distribution of the new dataset against known-clean baselines. Significant shifts may indicate contamination
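The outlier-detection step can be sketched with plain numpy, assuming you already have an embedding matrix for the dataset. This projects onto the top principal components (via SVD) and flags points far from the bulk with a robust z-score; the thresholds are illustrative and should be calibrated against a known-clean baseline:

```python
import numpy as np

def flag_embedding_outliers(embeddings: np.ndarray, n_components: int = 2,
                            z_threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask of suspected outlier examples."""
    centered = embeddings - embeddings.mean(axis=0)
    # Top principal components from the SVD of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    # Distance from the median point, scaled by MAD (robust to the outliers
    # we are trying to find, unlike mean/std).
    dist = np.linalg.norm(projected - np.median(projected, axis=0), axis=1)
    mad = np.median(np.abs(dist - np.median(dist))) + 1e-12
    return (dist - np.median(dist)) / (1.4826 * mad) > z_threshold
```

Flagged points are candidates for manual review, not automatic deletion: legitimate rare examples also land in the tails.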
Content scanning
Scan for instruction-like content — flag documents containing instruction patterns ("ignore previous," "you are now") that could serve as indirect prompt injection in RAG systems
Check for known contamination markers — search for known fingerprints of AI-generated or paper-mill content (e.g., "vegetative electron microscopy," "as an AI language model")
Validate factual claims — for fine-tuning data containing factual content, spot-check claims against authoritative sources
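The instruction-content and contamination-marker scans reduce to pattern matching at ingestion. A minimal sketch; the pattern list below is illustrative and should be extended with your own threat intelligence:

```python
import re

# Illustrative patterns only: injection phrasing plus known contamination
# fingerprints mentioned above.
SUSPECT_PATTERNS = [
    r"ignore (all )?previous (instructions|prompts)",
    r"you are now\b",
    r"as an ai language model",
    r"vegetative electron microscopy",
]
_SCANNER = re.compile("|".join(f"({p})" for p in SUSPECT_PATTERNS),
                      re.IGNORECASE)

def scan_document(text: str) -> list:
    """Return the suspicious substrings found in a document, if any."""
    return [m.group(0) for m in _SCANNER.finditer(text)]
```

A regex pass is cheap enough to run on every document; route non-empty results to quarantine for human review rather than silently dropping them.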
Step 3: During-Training Monitoring
If you control the training process, monitor for anomalies during training.
Track per-example loss curves — poisoned examples often converge faster than legitimate examples (the model memorizes the shortcut). Flag examples with unusually rapid loss reduction
Monitor gradient magnitudes — poisoned examples may produce unusually large or directionally unusual gradients. Log per-example gradient norms and flag outliers
Run spectral analysis — compute the covariance matrix of feature representations and check for anomalous top singular values, which can indicate a backdoor signature
Compute influence scores — for fine-tuning (where it's computationally feasible), use influence functions to identify training examples with outsized effect on specific model behaviors
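The per-example loss check can be sketched in a few lines, assuming your training loop already logs a loss history per example (a shape of (n_examples, n_epochs) is assumed here). The threshold is illustrative; calibrate it on a known-clean run:

```python
import numpy as np

def flag_fast_convergers(loss_history: np.ndarray,
                         z_threshold: float = 3.0) -> np.ndarray:
    """Flag examples whose loss drops unusually fast across epochs.

    Poisoned examples often let the model memorize a shortcut, so their
    loss collapses faster than the population's.
    """
    drop = loss_history[:, 0] - loss_history[:, -1]
    z = (drop - drop.mean()) / (drop.std() + 1e-12)
    return z > z_threshold
```

The same z-score pattern applies to per-example gradient norms: log them, standardize, and flag the tail.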
Step 4: Post-Training Behavioral Testing
After training, test the model for behaviors that suggest poisoning has occurred.
Backdoor detection
Run trigger inversion — use optimization-based methods (Neural Cleanse or successors) to search for minimal input perturbations that reliably trigger specific outputs
Test with known trigger patterns — if you suspect a specific backdoor, test with the suspected trigger and verify whether the model produces the targeted behavior
Analyze activation patterns — compare internal activations on clean inputs vs suspected triggered inputs. Divergent activation patterns suggest a backdoor
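The activation comparison reduces to measuring how far internal representations move between the two input sets. A minimal numpy sketch; how you extract activations depends on your framework (e.g. forward hooks in PyTorch), and arrays of shape (n_inputs, hidden_dim) are assumed here:

```python
import numpy as np

def activation_divergence(clean_acts: np.ndarray,
                          trigger_acts: np.ndarray) -> float:
    """Cosine distance between mean activations on clean vs. suspected
    triggered inputs. Near 0 means similar internal behavior; a large
    distance concentrated on one trigger set is a backdoor signal."""
    a = clean_acts.mean(axis=0)
    b = trigger_acts.mean(axis=0)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return 1.0 - cos
```

Compare the divergence against the spread you see between random splits of clean inputs; only escalate when the trigger set sits well outside that baseline.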
Behavioral consistency testing
Test on held-out clean data — compare model performance on your clean validation set against expected benchmarks. Significant performance drops may indicate indiscriminate poisoning
Run targeted probing — test model behavior on inputs related to topics the attacker might have targeted (specific entities, factual claims, decision categories)
Check consistency across paraphrases — poisoned models may produce inconsistent outputs for semantically equivalent queries phrased differently
Compare against baseline model — if you have a known-clean model, compare outputs on a standardized test suite. Behavioral divergences on specific topics suggest targeted poisoning
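The paraphrase-consistency check can be sketched as a pairwise agreement score. `query_model` below is a hypothetical caller-supplied function mapping a prompt to an answer string (wire in your own inference call), and exact string match is a deliberately crude metric; swap in embedding similarity for free-form answers:

```python
def paraphrase_consistency(query_model, paraphrases: list) -> float:
    """Fraction of paraphrase pairs that yield the same answer."""
    answers = [query_model(p).strip().lower() for p in paraphrases]
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

Low scores on one topic but not others are the interesting signal: indiscriminate flakiness suggests a weak model, while localized inconsistency suggests targeted poisoning.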
Step 5: RAG Knowledge Base Monitoring (Continuous)
RAG poisoning can occur at any time, not just during training. Monitor continuously.
Scan at ingestion — every document entering the knowledge base should be scanned for instruction-like content and known poisoning patterns before indexing
Monitor retrieval quality — track whether retrieved documents are producing unexpected model behaviors. Log which documents are retrieved for which queries
Audit knowledge base changes — log all additions, modifications, and deletions to the knowledge base with user identity and timestamp
Run periodic integrity checks — re-scan the full knowledge base on a regular schedule for contamination that initial scanning may have missed
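The audit and integrity steps both rest on the same primitive: a hash snapshot of the knowledge base that later scans can diff against. A minimal sketch; the `docs` shape (document id to text) is an assumption, so adapt it to your vector store's export format:

```python
import hashlib

def snapshot_kb(docs: dict) -> dict:
    """Record a content hash per document id for later comparison."""
    return {doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
            for doc_id, text in docs.items()}

def diff_kb(baseline: dict, current: dict) -> dict:
    """Classify knowledge-base changes since the baseline snapshot."""
    return {
        "added":    sorted(set(current) - set(baseline)),
        "removed":  sorted(set(baseline) - set(current)),
        "modified": sorted(d for d in set(baseline) & set(current)
                           if baseline[d] != current[d]),
    }
```

Pair each diff with the access log: an "added" or "modified" document with no corresponding authorized change record is exactly the anomaly this step exists to catch.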
Step 6: Respond to Suspected Poisoning
Confirmed or strongly suspected poisoning
Quarantine the affected data — remove suspected poisoned data from the training pipeline or knowledge base immediately
Assess impact scope — determine which models were trained on the affected data and which deployments use those models
Roll back if possible — revert to a known-clean model checkpoint if available. For RAG: revert the knowledge base to a pre-contamination snapshot
Retrain if necessary — for training data poisoning, retraining on verified clean data is the most reliable remediation
Investigate the source — determine how poisoned data entered the pipeline. Was it a compromised data source? Insider threat? Public data contamination?
Strengthen controls — based on the investigation, implement additional scanning, source verification, or access controls to prevent recurrence
Where This Guide Fits in AI Threat Response
Detection (this guide) — Has our data been poisoned? Inspect training data, monitor training, and test deployed models.
Detection methods — How does data poisoning detection work? Technical reference on statistical methods, influence analysis, and backdoor scanning.
Supply chain security — Are our data sources trustworthy? Securing the data pipeline upstream of detection.
Red teaming — Can our models be poisoned? Proactive adversarial testing of data pipeline defenses.
What This Guide Does Not Cover