Step-by-step workflow for identifying and responding to data poisoning attacks on AI training data, fine-tuning corpora, and RAG knowledge bases. Covers pre-training inspection, during-training monitoring, post-deployment detection, and remediation.
Who this is for: ML engineers, data engineers, security teams, and AI platform operators responsible for training data integrity, fine-tuning pipelines, or RAG knowledge base management.
What Data Poisoning Is and Why It Matters
Data poisoning is a supply chain attack on AI systems. Instead of attacking the model directly, the attacker manipulates the data the model learns from — inserting malicious examples that cause the model to produce incorrect outputs, exhibit biased behavior, or respond to hidden triggers (backdoors).
The NIST AI Risk Management Framework identifies training data integrity as a foundational requirement for trustworthy AI, and MITRE ATLAS catalogues known data poisoning attack patterns. The threat is well documented.
For the underlying science, see the Data Poisoning Detection Methods reference page.
Threat patterns this guide addresses
Step 1: Map Your Data Supply Chain
Before you can detect poisoning, understand where your data comes from and how it reaches the model:
Inventory all data sources — web scrapes, public datasets, licensed data, user-generated content, internal databases, third-party APIs
Document the pipeline — from raw collection through filtering, preprocessing, annotation, and final training/indexing
Identify trust levels — which sources are authenticated and audited? Which are open to public contribution?
Map access controls — who can modify datasets at each pipeline stage? Are changes logged?
Identify RAG knowledge bases — what documents feed into retrieval-augmented generation? How are they ingested?
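The inventory above can be captured as structured records rather than a spreadsheet, so risk queries are repeatable. A minimal sketch; the field names, trust levels, and example sources are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field

# Illustrative inventory entry -- adapt fields to your own pipeline.
@dataclass
class DataSource:
    name: str
    kind: str                  # e.g. "web_scrape", "licensed", "user_generated"
    trust_level: str           # e.g. "audited", "authenticated", "public"
    pipeline_stages: list = field(default_factory=list)
    change_logging: bool = False

inventory = [
    DataSource("common-crawl-subset", "web_scrape", "public",
               ["collect", "filter", "dedupe", "train"]),
    DataSource("vendor-qa-pairs", "licensed", "authenticated",
               ["verify_hash", "annotate", "fine_tune"], change_logging=True),
]

# Surface the riskiest sources first: open to public contribution, no change log.
risky = [s.name for s in inventory
         if s.trust_level == "public" and not s.change_logging]
```

Once sources are records, "which unaudited sources feed fine-tuning?" becomes a one-line query instead of a meeting.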
Step 2: Pre-Training Data Inspection
Apply these checks to training and fine-tuning datasets before they reach the model.
Source verification
Verify data provenance — confirm each dataset came from its claimed source. Check cryptographic hashes against published values
Authenticate third-party data — verify provider identity and data integrity for licensed or purchased datasets
Check for unauthorized modifications — compare dataset hashes against baseline snapshots taken at collection
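Hash verification is straightforward to automate. A minimal sketch using the standard library; the manifest format (path to expected hex digest) is an assumption, so use whatever your data provider actually publishes:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets never load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest: dict) -> list:
    """Compare each file's hash against its published value; return mismatches."""
    return [path for path, expected in manifest.items()
            if sha256_of(path) != expected]
```

Run this both at collection time (to establish the baseline snapshot) and before every training run (to catch modifications in between).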
Statistical analysis
Run outlier detection — apply dimensionality reduction (PCA, UMAP) to training data embeddings and look for anomalous clusters
Check label consistency — compare each example's features against the expected distribution for its label. Flag examples where features are inconsistent with the assigned label
Detect near-duplicates — identify clusters of suspiciously similar examples, especially if they share unusual features or unexpected labels
Profile data distributions — compare the statistical distribution of the new dataset against known-clean baselines. Significant shifts may indicate contamination
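The outlier-detection step can be sketched with plain numpy, assuming you already have an embedding matrix for the dataset. This projects onto the top principal components (via SVD) and flags points far from the bulk with a robust z-score; the thresholds are illustrative and should be calibrated against a known-clean baseline:

```python
import numpy as np

def flag_embedding_outliers(embeddings: np.ndarray, n_components: int = 2,
                            z_threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask of suspected outlier examples."""
    centered = embeddings - embeddings.mean(axis=0)
    # Top principal components from the SVD of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    # Distance from the median point, scaled by MAD (robust to the outliers
    # we are trying to find, unlike mean/std).
    dist = np.linalg.norm(projected - np.median(projected, axis=0), axis=1)
    mad = np.median(np.abs(dist - np.median(dist))) + 1e-12
    return (dist - np.median(dist)) / (1.4826 * mad) > z_threshold
```

Flagged points are candidates for manual review, not automatic deletion: legitimate rare examples also land in the tails.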
Content scanning
Scan for instruction-like content — flag documents containing instruction patterns ("ignore previous," "you are now") that could serve as indirect prompt injection in RAG systems
Check for known contamination markers — search for known fingerprints of AI-generated or paper-mill content (e.g., "vegetative electron microscopy," "as an AI language model")
Validate factual claims — for fine-tuning data containing factual content, spot-check claims against authoritative sources
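The instruction-content and contamination-marker scans reduce to pattern matching at ingestion. A minimal sketch; the pattern list below is illustrative and should be extended with your own threat intelligence:

```python
import re

# Illustrative patterns only: injection phrasing plus known contamination
# fingerprints mentioned above.
SUSPECT_PATTERNS = [
    r"ignore (all )?previous (instructions|prompts)",
    r"you are now\b",
    r"as an ai language model",
    r"vegetative electron microscopy",
]
_SCANNER = re.compile("|".join(f"({p})" for p in SUSPECT_PATTERNS),
                      re.IGNORECASE)

def scan_document(text: str) -> list:
    """Return the suspicious substrings found in a document, if any."""
    return [m.group(0) for m in _SCANNER.finditer(text)]
```

A regex pass is cheap enough to run on every document; route non-empty results to quarantine for human review rather than silently dropping them.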
Step 3: During-Training Monitoring
If you control the training process, monitor for anomalies during training.
Track per-example loss curves — poisoned examples often converge faster than legitimate examples (the model memorizes the shortcut). Flag examples with unusually rapid loss reduction
Monitor gradient magnitudes — poisoned examples may produce unusually large or directionally unusual gradients. Log per-example gradient norms and flag outliers
Run spectral analysis — compute the covariance matrix of feature representations and check for anomalous top singular values, which can indicate a backdoor signature
Compute influence scores — for fine-tuning (where it's computationally feasible), use influence functions to identify training examples with outsized effect on specific model behaviors
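The per-example loss check can be sketched in a few lines, assuming your training loop already logs a loss history per example (a shape of (n_examples, n_epochs) is assumed here). The threshold is illustrative; calibrate it on a known-clean run:

```python
import numpy as np

def flag_fast_convergers(loss_history: np.ndarray,
                         z_threshold: float = 3.0) -> np.ndarray:
    """Flag examples whose loss drops unusually fast across epochs.

    Poisoned examples often let the model memorize a shortcut, so their
    loss collapses faster than the population's.
    """
    drop = loss_history[:, 0] - loss_history[:, -1]
    z = (drop - drop.mean()) / (drop.std() + 1e-12)
    return z > z_threshold
```

The same z-score pattern applies to per-example gradient norms: log them, standardize, and flag the tail.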
Step 4: Post-Training Behavioral Testing
After training, test the model for behaviors that suggest poisoning has occurred.
Backdoor detection
Run trigger inversion — use optimization-based methods (Neural Cleanse or successors) to search for minimal input perturbations that reliably trigger specific outputs
Test with known trigger patterns — if you suspect a specific backdoor, test with the suspected trigger and verify whether the model produces the targeted behavior
Analyze activation patterns — compare internal activations on clean inputs vs suspected triggered inputs. Divergent activation patterns suggest a backdoor
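The activation comparison reduces to measuring how far internal representations move between the two input sets. A minimal numpy sketch; how you extract activations depends on your framework (e.g. forward hooks in PyTorch), and arrays of shape (n_inputs, hidden_dim) are assumed here:

```python
import numpy as np

def activation_divergence(clean_acts: np.ndarray,
                          trigger_acts: np.ndarray) -> float:
    """Cosine distance between mean activations on clean vs. suspected
    triggered inputs. Near 0 means similar internal behavior; a large
    distance concentrated on one trigger set is a backdoor signal."""
    a = clean_acts.mean(axis=0)
    b = trigger_acts.mean(axis=0)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return 1.0 - cos
```

Compare the divergence against the spread you see between random splits of clean inputs; only escalate when the trigger set sits well outside that baseline.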
Behavioral consistency testing
Test on held-out clean data — compare model performance on your clean validation set against expected benchmarks. Significant performance drops may indicate indiscriminate poisoning
Run targeted probing — test model behavior on inputs related to topics the attacker might have targeted (specific entities, factual claims, decision categories)
Check consistency across paraphrases — poisoned models may produce inconsistent outputs for semantically equivalent queries phrased differently
Compare against baseline model — if you have a known-clean model, compare outputs on a standardized test suite. Behavioral divergences on specific topics suggest targeted poisoning
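The paraphrase-consistency check can be sketched as a pairwise agreement score. `query_model` below is a hypothetical caller-supplied function mapping a prompt to an answer string (wire in your own inference call), and exact string match is a deliberately crude metric; swap in embedding similarity for free-form answers:

```python
def paraphrase_consistency(query_model, paraphrases: list) -> float:
    """Fraction of paraphrase pairs that yield the same answer."""
    answers = [query_model(p).strip().lower() for p in paraphrases]
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

Low scores on one topic but not others are the interesting signal: indiscriminate flakiness suggests a weak model, while localized inconsistency suggests targeted poisoning.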
Step 5: RAG Knowledge Base Monitoring (Continuous)
RAG poisoning can occur at any time, not just during training. Monitor continuously.
Scan at ingestion — every document entering the knowledge base should be scanned for instruction-like content and known poisoning patterns before indexing
Monitor retrieval quality — track whether retrieved documents are producing unexpected model behaviors. Log which documents are retrieved for which queries
Audit knowledge base changes — log all additions, modifications, and deletions to the knowledge base with user identity and timestamp
Run periodic integrity checks — re-scan the full knowledge base on a regular schedule for contamination that initial scanning may have missed
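The audit and integrity steps both rest on the same primitive: a hash snapshot of the knowledge base that later scans can diff against. A minimal sketch; the `docs` shape (document id to text) is an assumption, so adapt it to your vector store's export format:

```python
import hashlib

def snapshot_kb(docs: dict) -> dict:
    """Record a content hash per document id for later comparison."""
    return {doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
            for doc_id, text in docs.items()}

def diff_kb(baseline: dict, current: dict) -> dict:
    """Classify knowledge-base changes since the baseline snapshot."""
    return {
        "added":    sorted(set(current) - set(baseline)),
        "removed":  sorted(set(baseline) - set(current)),
        "modified": sorted(d for d in set(baseline) & set(current)
                           if baseline[d] != current[d]),
    }
```

Pair each diff with the access log: an "added" or "modified" document with no corresponding authorized change record is exactly the anomaly this step exists to catch.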
Step 6: Respond to Suspected Poisoning
Confirmed or strongly suspected poisoning
Quarantine the affected data — remove suspected poisoned data from the training pipeline or knowledge base immediately
Assess impact scope — determine which models were trained on the affected data and which deployments use those models
Roll back if possible — revert to a known-clean model checkpoint if available. For RAG: revert the knowledge base to a pre-contamination snapshot
Retrain if necessary — for training data poisoning, retraining on verified clean data is the most reliable remediation
Investigate the source — determine how poisoned data entered the pipeline. Was it a compromised data source? Insider threat? Public data contamination?
Strengthen controls — based on the investigation, implement additional scanning, source verification, or access controls to prevent recurrence
Where This Guide Fits in AI Threat Response
Detection (this guide) — Has our data been poisoned? Inspect training data, monitor training, and test deployed models.
Detection methods — How does data poisoning detection work? Technical reference on statistical methods, influence analysis, and backdoor scanning.
Supply chain security — Are our data sources trustworthy? Securing the data pipeline upstream of detection.
Red teaming — Can our models be poisoned? Proactive adversarial testing of data pipeline defenses.
What This Guide Does Not Cover