AI Audit: The 10-Step Enterprise Checklist
From Inventory to Remediation
An AI audit is not a compliance exercise you run once and file away. It is a systematic evaluation of your AI systems across performance, fairness, data quality, regulatory compliance, security, human oversight, and documentation. This guide gives you the 10-step checklist, the comparison between pre-deployment and production audits, and the organizational model for who runs the audit and how often.
30-second executive takeaway
- Audit scope is broader than most organizations expect. An AI audit covers seven dimensions: model performance, bias, data lineage, compliance, security, human oversight, and documentation. Most internal audit teams only check two or three.
- Pre-deployment audits catch problems. Production audits catch drift. You need both. A model that was fair at launch can become unfair as data distributions shift and populations change. Audit cadence should match risk tier: quarterly for high-risk, semi-annual for medium, annual for low.
- The audit is not done when the report is written. It is done when every critical and high finding has a named owner, a remediation date, and a verified fix. Findings without owners are findings that will repeat in the next audit cycle.
- 23% of enterprises have conducted a formal AI audit of any production system (Deloitte, 2025)
- 7x cost multiplier for remediating AI issues found in production versus in a pre-deployment audit
- Aug 2026: EU AI Act conformity assessments required for high-risk systems before market placement
SCOPE
What an AI audit covers
An AI audit evaluates an AI system across seven dimensions. Each dimension answers a different question, and skipping any one of them leaves a gap that regulators, auditors, or incidents will eventually expose.
Model performance. Does the model perform as claimed? Run it against a representative test set and compare actual metrics to documented metrics. Disaggregate by demographic group, geography, and any other segmentation relevant to the use case. Aggregate accuracy that masks subgroup failures is not acceptable performance.
Bias and fairness. Does the model treat people equitably? Apply the fairness metrics appropriate to the use case (demographic parity, equalized odds, calibration). Compare results to documented thresholds. Use explainability tools to identify whether protected attributes or proxies drive predictions.
Data lineage. Where does the training data come from, how was it collected, what consent was obtained, and how was it preprocessed? Trace the full pipeline from source to model. Data lineage is where you find the root causes of bias, quality, and privacy problems.
Compliance. Does the system meet applicable regulatory requirements? Map the system to the EU AI Act risk tier, sector-specific regulations, and internal policy requirements. Check whether conformity assessment obligations are met for high-risk systems.
Security. Is the system protected against unauthorized access, data breaches, adversarial attacks, and infrastructure vulnerabilities? Check access controls, encryption, API security, prompt injection defenses, and adversarial robustness.
Human oversight. Is the documented oversight pattern (human-in-the-loop, on-the-loop, in-command) actually implemented? Are designated reviewers trained, equipped, and reviewing at the expected cadence? Do override mechanisms work?
Documentation. Do model cards exist and are they current? Do they cover training data, performance disaggregated by group, limitations, and intended use cases? Was documentation updated after the last material change?
TIMING
Pre-deployment vs production audit
Two audit types serve different purposes. Pre-deployment audits are gates that prevent problematic systems from reaching production. Production audits are monitors that catch problems that emerge after deployment. You need both.
| Dimension | Pre-Deployment Audit | Production Audit |
|---|---|---|
| When | Before the model is deployed to production | While the model is serving predictions in production |
| What | Model performance, bias evaluation, data quality, documentation completeness, security controls, human oversight design | Performance drift, bias drift, data distribution shifts, incident response effectiveness, documentation currency |
| Cadence | Once per deployment (gate before go-live) | Risk-tier dependent: quarterly (high), semi-annual (medium), annual (low) |
| Tools | Fairlearn, AIF360, SHAP, LIME, custom evaluation pipelines | Model monitoring platforms, drift detection, production fairness dashboards, alerting systems |
| Who | Internal AI team + governance committee review, external auditors for high-risk systems | Internal audit team with periodic external validation |
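The drift checks that distinguish a production audit can be sketched with a Population Stability Index (PSI) computed between a baseline sample (e.g. training scores) and a production sample. This is a minimal stdlib-only sketch; the bin count and the 0.2 alert threshold are common rules of thumb, not values from this guide, and real monitoring platforms compute this for you.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a
    production sample. Higher values mean more distribution shift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] = hi + 1e-9  # include the max value in the last bin

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        n = len(sample)
        # Floor each fraction at a tiny value so the log is defined.
        return [max(c / n, 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [0.1 * i for i in range(100)]          # stand-in training scores
production = [0.1 * i + 2.0 for i in range(100)]  # shifted production scores
drift = psi(baseline, production)
# Assumed rule of thumb: PSI > 0.2 warrants investigation.
print(f"PSI = {drift:.3f}, alert = {drift > 0.2}")
```

Run at the cadence the table prescribes for the system's risk tier, and treat a sustained alert as an audit finding, not a dashboard curiosity.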
CHECKLIST
The 10-step AI audit checklist
A complete, sequential checklist for auditing any AI system in production or preparing one for deployment. Each step builds on the previous one. Skip a step and the audit has a gap that will surface later.
Step 1. Define audit scope and AI system inventory
Identify which AI systems are in scope for the audit. Pull from the model registry or, if no registry exists, conduct a discovery exercise across business units. Document each system's purpose, owner, data sources, deployment status, and user population. The inventory is the foundation. You cannot audit what you have not cataloged.
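One way to make the inventory concrete is a structured record per system; this is a minimal sketch, and the field names (`deployment_status`, `user_population`, etc.) are illustrative choices, not a standard schema.

```python
from dataclasses import dataclass

@dataclass
class AISystemRecord:
    """One inventory entry; field names are illustrative, not a standard."""
    name: str
    purpose: str
    owner: str
    data_sources: list
    deployment_status: str  # e.g. "production", "pilot", "retired"
    user_population: str

inventory = [
    AISystemRecord(
        name="loan-scoring-v3",
        purpose="consumer credit decisioning",
        owner="risk-engineering",
        data_sources=["core-banking", "bureau-feed"],
        deployment_status="production",
        user_population="retail loan applicants",
    ),
]

# Scope the audit from the catalog: you cannot audit what is not recorded.
in_scope = [s for s in inventory if s.deployment_status == "production"]
print([s.name for s in in_scope])
```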
Step 2. Classify systems by risk tier
Apply a risk classification framework aligned with the EU AI Act categories: unacceptable risk (prohibited), high risk (full compliance requirements), limited risk (transparency obligations), and minimal risk (voluntary best practices). Map each system to its regulatory obligations. High-risk systems get the full audit treatment. Low-risk systems get a lightweight review. This step determines audit depth and resource allocation.
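The tier-to-depth mapping can be sketched as a lookup; the use-case-to-tier table below is purely illustrative, since real classification requires legal review of the system against the EU AI Act annexes.

```python
from enum import Enum

class RiskTier(Enum):
    UNACCEPTABLE = "prohibited"
    HIGH = "full compliance requirements"
    LIMITED = "transparency obligations"
    MINIMAL = "voluntary best practices"

# Illustrative mapping only -- not legal advice.
USE_CASE_TIERS = {
    "social scoring": RiskTier.UNACCEPTABLE,
    "credit scoring": RiskTier.HIGH,
    "customer chatbot": RiskTier.LIMITED,
    "spam filter": RiskTier.MINIMAL,
}

def audit_depth(use_case: str) -> str:
    # Default conservatively: unknown use cases get the full treatment.
    tier = USE_CASE_TIERS.get(use_case, RiskTier.HIGH)
    if tier is RiskTier.UNACCEPTABLE:
        return "decommission"
    if tier is RiskTier.HIGH:
        return "full audit"
    if tier is RiskTier.LIMITED:
        return "lightweight review + transparency check"
    return "lightweight review"

print(audit_depth("credit scoring"))
```

Defaulting unknown systems to the high tier is a deliberate design choice: misclassifying downward is the expensive failure mode.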
Step 3. Review data lineage and training data quality
Trace the data pipeline from source to model. Document where training data originates, how it was collected, what consent was obtained, how it was preprocessed, and whether it contains known biases or gaps. Check for data quality issues: missing values, label errors, distribution shifts between training and production data. Data lineage is the audit dimension most likely to surface root causes of bias and performance problems.
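The quality checks in this step can be sketched as a small report function; the checks and the one-standard-deviation shift threshold are illustrative assumptions, not audit standards.

```python
import statistics

def data_quality_report(rows, train_scores, prod_scores, labels):
    """Minimal data-quality checks: missing values, label errors, and a
    crude train-vs-production shift flag. Thresholds are illustrative."""
    report = {}

    # Missing values: fraction of None cells across all records.
    cells = [v for row in rows for v in row.values()]
    report["missing_rate"] = sum(v is None for v in cells) / len(cells)

    # Label sanity: labels outside the expected set suggest label errors.
    report["bad_labels"] = sorted({y for y in labels if y not in (0, 1)})

    # Shift check on one score feature: flag if the production mean
    # moves more than one training standard deviation.
    mu, sigma = statistics.mean(train_scores), statistics.stdev(train_scores)
    report["score_shift"] = abs(statistics.mean(prod_scores) - mu) > sigma
    return report

rows = [{"age": 34, "income": 52000}, {"age": None, "income": 61000}]
report = data_quality_report(rows, [0.2, 0.4, 0.6], [0.9, 1.1, 1.3], [0, 1, 2])
print(report)
```

Any flag this raises is a lead, not a conclusion: trace it back through the lineage documentation to find where the problem entered the pipeline.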
Step 4. Test model performance against stated accuracy
Run the model against a representative test set and compare actual performance metrics to the metrics claimed in documentation or stakeholder communications. Check for performance degradation since last evaluation. Disaggregate results by demographic group, geography, and any other relevant segmentation. A model that meets aggregate accuracy targets but underperforms for specific populations has a performance problem even if the headline number looks good.
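Disaggregation is mechanical once predictions are labeled by group; a minimal sketch (the toy data is fabricated to show a headline number hiding a failing subgroup):

```python
from collections import defaultdict

def disaggregated_accuracy(y_true, y_pred, groups):
    """Accuracy overall and per group; a headline number that hides a
    failing subgroup is still a performance problem."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] += 1
        hits[g] += int(t == p)
    per_group = {g: hits[g] / totals[g] for g in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_group

y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
overall, per_group = disaggregated_accuracy(y_true, y_pred, groups)
print(overall, per_group)  # group B fails completely while A is perfect
```

Compare each per-group number, not just the overall one, against the documented targets.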
Step 5. Run bias and fairness evaluation
Apply the fairness metrics defined for this use case (demographic parity, equalized odds, calibration, or others as appropriate). Compare results to documented thresholds. Use SHAP or LIME to identify which features drive predictions for different demographic groups. Document any disparities, their magnitude, and their potential real-world impact. This step produces the evidence that regulators and external auditors care about most.
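As a minimal sketch of one such metric, here is demographic parity difference hand-rolled for illustration; Fairlearn's `fairlearn.metrics.demographic_parity_difference` computes the same gap in production evaluations, and the toy data below is fabricated.

```python
def demographic_parity_difference(y_pred, groups):
    """Gap in positive-prediction (selection) rate between groups.
    Hand-rolled illustration of the metric Fairlearn provides."""
    rates = {}
    for g in set(groups):
        preds = [p for p, gg in zip(y_pred, groups) if gg == g]
        rates[g] = sum(preds) / len(preds)
    return max(rates.values()) - min(rates.values()), rates

y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
gap, rates = demographic_parity_difference(y_pred, groups)
# Compare the gap to the documented fairness threshold for this use case.
print(f"selection rates: {rates}, gap: {gap}")
```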
Step 6. Verify human oversight mechanisms
Confirm that the documented human oversight pattern (human-in-the-loop, human-on-the-loop, human-in-command) is actually implemented and functioning. Check that designated reviewers exist, are trained, have access to the tools they need, and are reviewing at the expected cadence. Verify that override and escalation mechanisms work. A documented oversight pattern that nobody follows is worse than no pattern because it creates false assurance.
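The "reviewing at the expected cadence" check can be automated against review timestamps; this is a sketch with illustrative dates and interval, assuming review events are logged somewhere queryable.

```python
from datetime import date

def cadence_gaps(review_dates, expected_interval_days, as_of):
    """Return a list of cadence violations: gaps between reviews longer
    than the expected interval, plus an overdue flag for the present."""
    if not review_dates:
        return ["no reviews on record"]
    gaps = []
    ordered = sorted(review_dates)
    for prev, nxt in zip(ordered, ordered[1:]):
        if (nxt - prev).days > expected_interval_days:
            gaps.append(f"gap of {(nxt - prev).days} days after {prev}")
    if (as_of - ordered[-1]).days > expected_interval_days:
        gaps.append(f"overdue since {ordered[-1]}")
    return gaps

reviews = [date(2025, 1, 6), date(2025, 1, 13), date(2025, 3, 3)]
result = cadence_gaps(reviews, expected_interval_days=7, as_of=date(2025, 3, 17))
print(result)
```

An empty return list shows the pattern is followed in practice, which is the evidence this step exists to produce; it does not by itself show that reviewers have the training or tools the step also requires.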
Step 7. Check security controls
Review access controls, data encryption, API security, model serving infrastructure, and adversarial robustness. Check for prompt injection vulnerabilities in LLM-based systems. Verify that training data and model weights are protected against unauthorized access. This step should reference findings from any recent security audit and fill gaps specific to the AI system. See the AI security guide for the full security assessment framework.
Step 8. Validate documentation and model cards
Check that model cards exist and are current for every in-scope system. Verify that documentation covers model purpose, training data, performance metrics disaggregated by demographic group, known limitations, and intended use cases. Confirm that documentation was updated after the last retraining or material change. Under the EU AI Act, inadequate documentation for high-risk systems is a compliance violation regardless of how well the model actually performs.
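Documentation currency lends itself to an automated completeness check; the required-field set below is drawn from the checklist in this step, but the field names themselves are illustrative, not a model-card standard.

```python
REQUIRED_MODEL_CARD_FIELDS = {
    # Illustrative field set mirroring the checklist above.
    "purpose",
    "training_data",
    "performance_by_group",
    "known_limitations",
    "intended_use_cases",
    "last_updated",
}

def validate_model_card(card: dict) -> list:
    """Return the sorted list of missing or empty required fields."""
    return sorted(
        f for f in REQUIRED_MODEL_CARD_FIELDS
        if f not in card or card[f] in (None, "", [], {})
    )

card = {
    "purpose": "consumer credit decisioning",
    "training_data": "2019-2024 loan outcomes, bureau features",
    "performance_by_group": {},  # empty: disaggregation never filled in
    "intended_use_cases": ["retail lending"],
}
print(validate_model_card(card))
```

Note that an empty `performance_by_group` fails the check just like an absent one: a section heading with no content is not documentation.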
Step 9. Confirm incident response readiness
Verify that an incident response runbook exists for AI-specific incidents: biased outputs, harmful content generation, privacy violations, performance degradation, and adversarial attacks. Check that named responders are assigned, notification timelines are defined, and the runbook has been tested through a tabletop exercise or live drill. An untested incident response plan is a plan that will fail when it matters.
Step 10. Document findings and assign remediation owners
Compile all findings into a structured report with severity ratings (critical, high, medium, low), evidence, and recommended remediation. Assign a named owner and a target remediation date for every finding rated high or above. Schedule a follow-up review to verify that remediation was completed. The audit is not done when the report is written. It is done when every critical and high finding has been closed.
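The closure rule in this step can be encoded directly, so the follow-up review has an unambiguous exit condition; this is a minimal sketch with fabricated findings and illustrative field names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    title: str
    severity: str            # "critical", "high", "medium", or "low"
    owner: Optional[str]     # a named person, not a team alias
    due_date: Optional[str]  # target remediation date, ISO format
    closed: bool = False     # True only after the fix is verified

def blocking_findings(findings):
    """Critical/high findings still missing an owner, a date, or a
    verified fix. The audit is not done until this list is empty."""
    return [
        f.title for f in findings
        if f.severity in ("critical", "high")
        and (f.owner is None or f.due_date is None or not f.closed)
    ]

findings = [
    Finding("Subgroup accuracy gap", "high", "a.chen", "2026-01-31", closed=True),
    Finding("No prompt-injection tests", "critical", None, None),
    Finding("Stale model card", "medium", None, None),
]
print(blocking_findings(findings))
```

The medium finding is deliberately excluded from the blocking list: it still gets tracked, but per the step above only critical and high findings gate audit closure.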
ORGANIZATIONAL MODEL
Who runs the AI audit
Three models, each with different tradeoffs in cost, independence, and expertise.
Internal audit teams provide speed, institutional knowledge, and lower cost per audit. They know the systems, the data, and the organizational context. The limitation is independence: an internal team auditing systems built by colleagues in the same organization faces inherent conflicts of interest. Internal teams work best for routine production audits where speed matters more than external credibility.
External audit firms provide independence and credibility with regulators, auditors, and customers. Specialized AI audit firms (Holistic AI, ORCAA, ForHumanity) bring AI-specific expertise that general audit firms may lack. Big Four consultancies (Deloitte, PwC, EY, KPMG) bring scale and regulatory relationships. The limitation is cost ($50K to $200K per system) and timeline (weeks to months). External audits work best for pre-deployment assessments of high-risk systems and periodic comprehensive reviews.
The hybrid model is the most cost-effective for most enterprises. Internal teams run routine production audits at the cadence dictated by risk tier. External firms run pre-deployment audits for high-risk systems, annual comprehensive reviews, and any audit where regulatory credibility requires independence. The governance board reviews findings from both tracks and owns the remediation process.
The governance board's role is not to conduct audits. It is to set audit policy, approve audit scope, review findings, approve remediation plans, and track remediation to completion. The board should include representatives from engineering, data science, legal, product, and risk. It meets quarterly to review the audit portfolio and monthly when active remediations are in flight. A governance board that rubber-stamps findings without tracking remediation is not governing.
Frequently Asked Questions
What is an AI audit?
When is an AI audit required?
Who should conduct an AI audit?
How much does an AI audit cost?
How often should AI systems be audited?
How is an AI audit different from a security audit?
Run an AI audit that produces real findings
From system inventory to remediation tracking. A fractional CAIO engagement designs the audit program and runs the first assessment so your team has a template for everything that follows.