Skip to main content
TopAIThreats home TOP AI THREATS
AI Capability

Fine-Tuning

The process of further training a pre-trained machine learning model on a smaller, task-specific or domain-specific dataset to adapt its behaviour, improve its performance on particular tasks, or align it with specific requirements. Fine-tuning modifies the model's weights rather than relying solely on prompt engineering.

Definition

Fine-tuning is a transfer learning technique where a pre-trained model — typically a large language model or other foundation model trained on broad data — undergoes additional training on a curated dataset to specialise its capabilities. The fine-tuning dataset is usually much smaller than the pre-training dataset but is carefully selected to teach the model specific behaviours: following instructions, producing outputs in a particular format, demonstrating domain expertise, or adhering to safety guidelines. Fine-tuning methods range from full-weight updates (modifying all model parameters) to parameter-efficient techniques such as LoRA (Low-Rank Adaptation) that modify only a small subset of weights. The technique is distinct from prompt engineering, which adapts model behaviour through input instructions without changing model weights.

How It Relates to AI Threats

Fine-tuning has significant security implications within the Security and Cyber Threats and Information Integrity Threats domains. Malicious fine-tuning can remove safety guardrails from open-weight models, creating uncensored models capable of generating harmful content. Fine-tuning on poisoned datasets can introduce backdoor behaviours that activate on specific trigger inputs. Commercially fine-tuned models may inherit biases or vulnerabilities from their fine-tuning data. Additionally, fine-tuning services that accept user-provided data create a supply chain attack vector where adversarial training examples can alter model behaviour. Safety alignment achieved through RLHF during pre-training can be undone through relatively small amounts of adversarial fine-tuning.

Why It Occurs

  • Pre-trained models are generalists; many applications require specialised knowledge or behaviour that fine-tuning provides
  • Fine-tuning is more computationally efficient than training a model from scratch, making it accessible to smaller organisations
  • Open-weight models (Llama, Mistral, Qwen) can be freely fine-tuned, enabling both beneficial customisation and safety circumvention
  • The fine-tuning supply chain (data collection, curation, training) introduces multiple points where adversarial manipulation can occur
  • Parameter-efficient methods like LoRA have reduced fine-tuning costs to hours on consumer hardware, democratising access

Real-World Context

Research has demonstrated that safety-aligned LLMs can be “de-aligned” through fine-tuning on as few as 100 adversarial examples, removing refusal behaviours and content safety measures. Fine-tuning has been used to create AI models specifically designed for malicious purposes (documented in dark-web forum analysis). Conversely, fine-tuning is the standard technique for creating specialised AI assistants, domain-specific tools, and aligned AI systems. The dual-use nature of fine-tuning — essential for beneficial customisation but exploitable for harm — makes it a key concern in AI safety governance and model distribution policies.

Related Threat Patterns

Last updated: 2026-04-03