Step-by-step workflow for auditing AI systems for discriminatory outcomes, including fairness metric selection, disaggregated evaluation, data auditing, and regulatory compliance guidance.
Who this is for: ML engineers, product managers, compliance officers, and civil rights analysts responsible for evaluating AI systems for bias before deployment or during operation — particularly systems used in hiring, lending, housing, healthcare, education, or criminal justice.
What AI Bias Is and Why Auditing Matters
AI bias occurs when an AI system produces systematically different outcomes for different groups of people in ways that are unjust or discriminatory. Bias can emerge from training data that underrepresents or misrepresents specific populations, from features that serve as proxies for protected attributes, from modeling choices that optimize for the majority population, or from deployment contexts that differ from training conditions.
The consequences are well-documented: recidivism risk scores with higher false positive rates for Black defendants (the COMPAS case discussed below), a resume-screening tool scrapped after it penalized resumes associated with women, and a healthcare risk algorithm that systematically underestimated the needs of Black patients.
Standard performance metrics (accuracy, F1, AUC) mask group-level disparities because they aggregate across the full population. Bias auditing disaggregates performance to reveal disparities that aggregate metrics conceal.
For the underlying science, see the AI Bias & Fairness Auditing Methods reference page.
Step 1: Define the Audit Scope
Identify the decision — what decision does the AI system make or influence? (hiring, lending, content recommendation, risk scoring, medical diagnosis)
Identify affected populations — who is affected by this decision? Which protected attributes are relevant? (race, gender, age, disability, national origin, religion)
Identify legal requirements — what anti-discrimination laws apply? (Title VII for employment, ECOA for lending, Fair Housing Act, EU AI Act, NYC Local Law 144)
Identify the harm model — what does a biased outcome look like? (denied a job, denied a loan, flagged as high-risk, denied housing, receiving lower quality service)
Determine audit type — pre-deployment (before the system goes live), operational (ongoing monitoring), or incident-triggered (responding to a complaint or finding)?
Step 2: Select Appropriate Fairness Metrics
No single fairness metric is universally correct; different metrics are appropriate for different contexts. Moreover, several of these metrics are mathematically incompatible — outside of degenerate cases, a model cannot satisfy them all at once — so you must choose which to prioritize.
For allocation decisions (hiring, lending, housing)
Demographic parity — selection rates should be similar across groups. Use the four-fifths rule: if the selection rate for a protected group is less than 80% of the highest group, disparate impact is presumed (U.S. employment law)
Equalized odds — true positive rates and false positive rates should be similar across groups. Prioritize this when false positives have different consequences for different groups (criminal justice, fraud detection)
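The four-fifths rule above can be sketched in a few lines of Python. The group labels, function names, and rates here are illustrative, not drawn from any real audit:

```python
from collections import Counter

def selection_rates(decisions):
    """Per-group selection rates from (group, selected) pairs."""
    totals, selected = Counter(), Counter()
    for group, was_selected in decisions:
        totals[group] += 1
        if was_selected:
            selected[group] += 1
    return {g: selected[g] / totals[g] for g in totals}

def four_fifths_check(rates):
    """Disparate impact is presumed for any group whose selection
    rate falls below 80% of the highest group's rate."""
    top = max(rates.values())
    return {g: r / top >= 0.8 for g, r in rates.items()}

# Illustrative data: (group, selected?) pairs
decisions = [("A", True)] * 60 + [("A", False)] * 40 \
          + [("B", True)] * 30 + [("B", False)] * 70
rates = selection_rates(decisions)   # A: 0.60, B: 0.30
flags = four_fifths_check(rates)     # B fails: 0.30 / 0.60 = 0.5 < 0.8
```

The same per-group bookkeeping extends to equalized odds: restrict the counts to true-positive and false-positive events and compare the resulting rates across groups.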
For risk scoring (recidivism, fraud, insurance)
Calibration — a score of X% should mean the same thing regardless of group membership. If the model says "70% risk," approximately 70% of individuals at that score should actually exhibit the outcome, for all groups
False positive rate parity — the rate of wrongly flagging low-risk individuals should be similar across groups. This was the key metric in the COMPAS analysis
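A per-group calibration check can be sketched by binning scores and comparing predicted vs. observed rates within each bin. The data and function name below are illustrative assumptions:

```python
from collections import defaultdict

def calibration_by_group(records, n_bins=10):
    """Compare mean predicted score vs. observed outcome rate per
    score bin, separately for each group.
    records: iterable of (group, score, outcome) with outcome in {0, 1}."""
    bins = defaultdict(lambda: [0, 0.0, 0])  # count, score sum, outcome sum
    for group, score, outcome in records:
        b = min(int(score * n_bins), n_bins - 1)
        cell = bins[(group, b)]
        cell[0] += 1
        cell[1] += score
        cell[2] += outcome
    return {
        key: {"mean_score": s / n, "observed_rate": o / n, "n": n}
        for key, (n, s, o) in bins.items()
    }

# Illustrative: a 0.75 score means 70% observed risk in group "A"
# but only 50% in group "B" -- a calibration gap for "B".
records = [("A", 0.75, 1)] * 7 + [("A", 0.75, 0)] * 3 \
        + [("B", 0.75, 1)] * 5 + [("B", 0.75, 0)] * 5
report = calibration_by_group(records)
```

A well-calibrated model keeps `mean_score` and `observed_rate` close to each other in every bin, for every group.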
For content and recommendations
Representation parity — content recommendations should not systematically exclude or underrepresent specific groups
Quality parity — the quality of service (response accuracy, helpfulness, response time) should be similar across user demographics
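One simple way to quantify representation parity is to compare each group's share of recommended items against its share of the eligible pool. This sketch and its numbers are illustrative:

```python
from collections import Counter

def representation_ratio(recommended_groups, eligible_groups):
    """Each group's share among recommendations divided by its share
    of the eligible pool; values well below 1.0 indicate systematic
    under-representation."""
    rec, elig = Counter(recommended_groups), Counter(eligible_groups)
    n_rec, n_elig = sum(rec.values()), sum(elig.values())
    return {g: (rec[g] / n_rec) / (elig[g] / n_elig) for g in elig}

# Illustrative: group "B" is 40% of the pool but only 10% of recommendations
eligible = ["A"] * 60 + ["B"] * 40
recommended = ["A"] * 18 + ["B"] * 2
ratios = representation_ratio(recommended, eligible)  # B: 0.25
```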
Step 3: Collect and Prepare Data
Obtain protected attribute data — you need demographic data to measure demographic disparities. If not directly available: check if it can be self-reported, inferred from proxy data (with appropriate caveats), or obtained from matched records
Ensure sufficient sample sizes — each subgroup needs enough examples for statistically reliable estimates. Subgroups with fewer than 30–50 examples produce unreliable metrics
Include intersectional groups — analyze not just single-axis groups (women, Black individuals) but intersections (Black women, elderly Hispanic men). Intersectional disparities are often larger than single-axis disparities
Use representative evaluation data — the evaluation dataset should reflect the actual population the system serves, not the population it was trained on
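Building intersectional subgroups and flagging ones too small to audit can be sketched as follows. The attribute names, threshold, and counts are illustrative assumptions, not prescriptions:

```python
from collections import Counter

MIN_N = 50  # illustrative reliability threshold (see guidance above)

def subgroup_counts(records, attrs=("race", "gender")):
    """Count intersectional subgroups and flag any too small to audit.
    records: dicts containing the protected-attribute fields in attrs."""
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    too_small = {g: n for g, n in counts.items() if n < MIN_N}
    return counts, too_small

# Illustrative evaluation set
records = [{"race": "Black", "gender": "F"}] * 30 \
        + [{"race": "Black", "gender": "M"}] * 120 \
        + [{"race": "White", "gender": "F"}] * 200
counts, too_small = subgroup_counts(records)
# ("Black", "F") has only 30 examples -- metrics for it will be unreliable
```

Note that each single-axis group may pass the threshold while an intersection fails it, which is exactly why intersectional counts must be checked separately.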
Step 4: Run Quantitative Analysis
Compute fairness metrics
Compute selection rates by group — for each protected attribute, calculate the positive outcome rate. Apply the four-fifths rule where applicable
Compute error rates by group — calculate false positive rate, false negative rate, precision, and recall for each group. Identify groups where errors are systematically higher
Compute calibration by group — for risk scores, bin predictions and compare predicted vs actual rates within each group
Run statistical significance tests — determine whether observed disparities are statistically significant or could be explained by sampling variation
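A standard significance test for a selection-rate gap is the two-proportion z-test. This is a minimal sketch with illustrative counts; real audits should also consider multiple-comparison corrections across many subgroups:

```python
import math

def two_proportion_z(selected_a, n_a, selected_b, n_b):
    """Two-sided two-proportion z-test for a selection-rate gap.
    Returns (z, p_value) using the pooled-variance formulation."""
    p_a, p_b = selected_a / n_a, selected_b / n_b
    pooled = (selected_a + selected_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Illustrative: 60/100 selected in group A vs. 30/100 in group B
z, p = two_proportion_z(60, 100, 30, 100)
significant = p < 0.05  # here the disparity is far beyond sampling noise
```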
Fairness auditing tools:
IBM AI Fairness 360 — 70+ metrics, bias mitigation algorithms; best for a comprehensive technical audit
Microsoft Fairlearn — fairness assessment + constrained optimization; best for Python-based ML pipelines
Google What-If Tool — interactive visualization of model behavior; best for exploratory analysis
Aequitas — group fairness audit with report generation; best for policy-focused audits
Run the selected tool on your evaluation dataset
Generate a fairness report covering all relevant metrics and subgroups
Identify the subgroups with the largest disparities
Step 5: Audit the Data and Features
Quantitative disparities have root causes in data and features. Investigate.
Data audit
Check representation — are all relevant demographic groups adequately represented in training data? Underrepresented groups typically have worse model performance
Check label quality — are labels consistent across groups? Were different labeling processes used for different populations? Label bias (e.g., historical discrimination reflected in training labels) directly causes model bias
Check temporal coverage — does the training data cover the same time period for all groups? Historical data may reflect past discriminatory practices
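The representation check above can be sketched by comparing each group's share of the training data against its share of the served population. The groups and shares below are illustrative:

```python
from collections import Counter

def representation_gap(train_groups, population_shares):
    """Compare each group's share of the training data with its share
    of the population the system actually serves."""
    counts = Counter(train_groups)
    n = sum(counts.values())
    return {
        g: {"train_share": counts[g] / n,
            "population_share": share,
            "ratio": (counts[g] / n) / share}
        for g, share in population_shares.items()
    }

# Illustrative: group "B" is 30% of users but only 10% of training rows
train = ["A"] * 90 + ["B"] * 10
gaps = representation_gap(train, {"A": 0.7, "B": 0.3})
# gaps["B"]["ratio"] is 1/3 -- underrepresented; expect worse performance
```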
Feature audit
Identify proxy features — which features correlate with protected attributes? (zip code → race, name → gender/ethnicity, university → socioeconomic status). Use correlation analysis or mutual information
Assess feature relevance — for each feature with high proxy correlation, evaluate whether it is genuinely predictive of the outcome or primarily serving as a demographic proxy
Test feature removal impact — remove suspected proxy features and re-evaluate model performance and fairness metrics. If removing a proxy feature reduces disparity without significantly reducing overall accuracy, the feature was likely contributing to discrimination
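Proxy detection via correlation analysis can be sketched with a point-biserial correlation, which is just Pearson correlation against a 0/1-coded protected attribute. The feature name and values are illustrative:

```python
import math

def pearson(xs, ys):
    """Pearson correlation; with a 0/1-coded protected attribute this
    is the point-biserial correlation used to flag proxy features."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Illustrative: a zip-code-derived feature vs. a 0/1 protected attribute
feature = [30, 32, 35, 31, 70, 72, 68, 75]
protected = [0, 0, 0, 0, 1, 1, 1, 1]
r = pearson(feature, protected)  # near 1.0 -> likely proxy, investigate
```

For categorical features, mutual information or Cramér's V is the analogous check; the decision logic (flag, assess relevance, test removal) is the same.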
Step 6: Document and Decide
Document all findings — metrics computed, disparities identified, root causes investigated, tools used, and data sources
Assess against legal thresholds — do the disparities exceed legal thresholds (four-fifths rule, disparate impact standards)? Consult legal counsel for jurisdiction-specific requirements
Decide on action:
Deploy with monitoring — if disparities are within acceptable bounds, deploy with continuous monitoring for drift
Mitigate and re-audit — if disparities exceed thresholds, apply mitigation (re-balancing data, constrained optimization, feature removal) and re-audit
Do not deploy — if disparities are severe and cannot be adequately mitigated, the appropriate decision may be to not deploy the system
Establish ongoing monitoring — bias can emerge or worsen over time through data drift and feedback loops. Schedule regular re-audits (see AI Risk Monitoring Systems)
Where This Guide Fits in AI Threat Response
Auditing (this guide) — Is this system biased? Evaluate AI systems for discriminatory outcomes.
Auditing methods — How does bias auditing work? Technical reference on fairness metrics, impossibility results, and tool comparisons.
Risk monitoring — Is bias emerging over time? Continuous monitoring for drift and emerging disparities.
Model governance — Who approved this deployment? Organizational gates requiring fairness evaluation.
Deployment checklist — Is this system ready? Pre-deployment checklist including bias assessment.
What This Guide Does Not Cover