Fine-Tuning

Definition

Fine-tuning is a transfer learning technique where a pre-trained model — typically a large language model or other foundation model trained on broad data — undergoes additional training on a curated dataset to specialise its capabilities. The fine-tuning dataset is usually much smaller than the pre-training dataset but is carefully selected to teach the model specific behaviours: following instructions, producing outputs in a particular format, demonstrating domain expertise, or adhering to safety guidelines. Fine-tuning methods range from full-weight updates (modifying all model parameters) to parameter-efficient techniques such as LoRA (Low-Rank Adaptation) that modify only a small subset of weights. The technique is distinct from prompt engineering, which adapts model behaviour through input instructions without changing model weights.

How It Relates to AI Threats

Fine-tuning has significant security implications within the Security and Cyber Threats and Information Integrity Threats domains. Malicious fine-tuning can remove safety guardrails from open-weight models, creating uncensored models capable of generating harmful content. Fine-tuning on poisoned datasets can introduce backdoor behaviours that activate on specific trigger inputs. Commercially fine-tuned models may inherit biases or vulnerabilities from their fine-tuning data. Additionally, fine-tuning services that accept user-provided data create a supply chain attack vector where adversarial training examples can alter model behaviour. Safety alignment achieved through RLHF during pre-training can be undone through relatively small amounts of adversarial fine-tuning.

Why It Occurs

Pre-trained models are generalists; many applications require specialised knowledge or behaviour that fine-tuning provides
Fine-tuning is more computationally efficient than training a model from scratch, making it accessible to smaller organisations
Open-weight models (Llama, Mistral, Qwen) can be freely fine-tuned, enabling both beneficial customisation and safety circumvention
The fine-tuning supply chain (data collection, curation, training) introduces multiple points where adversarial manipulation can occur
Parameter-efficient methods like LoRA have reduced fine-tuning costs to hours on consumer hardware, democratising access

Real-World Context

Research has demonstrated that safety-aligned LLMs can be “de-aligned” through fine-tuning on as few as 100 adversarial examples, removing refusal behaviours and content safety measures. Fine-tuning has been used to create AI models specifically designed for malicious purposes (documented in dark-web forum analysis). Conversely, fine-tuning is the standard technique for creating specialised AI assistants, domain-specific tools, and aligned AI systems. The dual-use nature of fine-tuning — essential for beneficial customisation but exploitable for harm — makes it a key concern in AI safety governance and model distribution policies.

Definition

How It Relates to AI Threats

Why It Occurs

Real-World Context

Related Threat Patterns

Related Terms