ctaio.dev

AI Audit: The 10-Step Enterprise Checklist

From Inventory to Remediation

An AI audit is not a compliance exercise you run once and file away. It is a systematic evaluation of your AI systems across performance, fairness, data quality, regulatory compliance, security, human oversight, and documentation. This guide gives you the 10-step checklist, the comparison between pre-deployment and production audits, and the organizational model for who runs the audit and how often.

30-second executive takeaway

  • Audit scope is broader than most organizations expect. An AI audit covers seven dimensions: model performance, bias, data lineage, compliance, security, human oversight, and documentation. Most internal audit teams only check two or three.
  • Pre-deployment audits catch problems. Production audits catch drift. You need both. A model that was fair at launch can become unfair as data distributions shift and populations change. Audit cadence should match risk tier: quarterly for high-risk, semi-annual for medium, annual for low.
  • The audit is not done when the report is written. It is done when every critical and high finding has a named owner, a remediation date, and a verified fix. Findings without owners are findings that will repeat in the next audit cycle.

  • 23%: share of enterprises that have conducted a formal AI audit of any production system (Deloitte, 2025)
  • 7x: cost multiplier for remediating AI issues found in production versus in a pre-deployment audit
  • Aug 2026: EU AI Act conformity assessments required for high-risk systems before market placement

What an AI audit covers

An AI audit evaluates an AI system across seven dimensions. Each dimension answers a different question, and skipping any one of them leaves a gap that regulators, auditors, or incidents will eventually expose.

Model performance. Does the model perform as claimed? Run it against a representative test set and compare actual metrics to documented metrics. Disaggregate by demographic group, geography, and any other segmentation relevant to the use case. Aggregate accuracy that masks subgroup failures is not acceptable performance.

Bias and fairness. Does the model treat people equitably? Apply the fairness metrics appropriate to the use case (demographic parity, equalized odds, calibration). Compare results to documented thresholds. Use explainability tools to identify whether protected attributes or proxies drive predictions.

Data lineage. Where does the training data come from, how was it collected, what consent was obtained, and how was it preprocessed? Trace the full pipeline from source to model. Data lineage is where you find the root causes of bias, quality, and privacy problems.

Compliance. Does the system meet applicable regulatory requirements? Map the system to the EU AI Act risk tier, sector-specific regulations, and internal policy requirements. Check whether conformity assessment obligations are met for high-risk systems.

Security. Is the system protected against unauthorized access, data breaches, adversarial attacks, and infrastructure vulnerabilities? Check access controls, encryption, API security, prompt injection defenses, and adversarial robustness.

Human oversight. Is the documented oversight pattern (human-in-the-loop, on-the-loop, in-command) actually implemented? Are designated reviewers trained, equipped, and reviewing at the expected cadence? Do override mechanisms work?

Documentation. Do model cards exist and are they current? Do they cover training data, performance disaggregated by group, limitations, and intended use cases? Was documentation updated after the last material change?

Pre-deployment vs production audit

Two audit types serve different purposes. Pre-deployment audits are gates that prevent problematic systems from reaching production. Production audits are monitors that catch problems that emerge after deployment. You need both.

  • When. Pre-deployment: before the model is deployed to production. Production: while the model is serving predictions in production.
  • What. Pre-deployment: model performance, bias evaluation, data quality, documentation completeness, security controls, human oversight design. Production: performance drift, bias drift, data distribution shifts, incident response effectiveness, documentation currency.
  • Cadence. Pre-deployment: once per deployment (gate before go-live). Production: risk-tier dependent (quarterly for high, semi-annual for medium, annual for low).
  • Tools. Pre-deployment: Fairlearn, AIF360, SHAP, LIME, custom evaluation pipelines. Production: model monitoring platforms, drift detection, production fairness dashboards, alerting systems.
  • Who. Pre-deployment: internal AI team plus governance committee review, with external auditors for high-risk systems. Production: internal audit team with periodic external validation.

The 10-step AI audit checklist

A complete, sequential checklist for auditing any AI system in production or preparing one for deployment. Each step builds on the previous one. Skip a step and the audit has a gap that will surface later.

01

Define audit scope and AI system inventory

Identify which AI systems are in scope for the audit. Pull from the model registry or, if no registry exists, conduct a discovery exercise across business units. Document each system's purpose, owner, data sources, deployment status, and user population. The inventory is the foundation. You cannot audit what you have not cataloged.
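The inventory step reduces to one structured record per system. A minimal sketch in Python, assuming no registry exists yet; the field names and the example system are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class AISystemRecord:
    """One row in the AI system inventory (illustrative fields)."""
    name: str
    purpose: str
    owner: str
    data_sources: list
    deployment_status: str  # e.g. "production", "staging", "retired"
    user_population: str

# Discovery output: one record per system found across business units.
inventory = [
    AISystemRecord(
        name="resume-screener-v2",
        purpose="Rank inbound job applications",
        owner="talent-eng@example.com",
        data_sources=["ats_exports", "recruiter_labels"],
        deployment_status="production",
        user_population="External job applicants",
    ),
]

def audit_scope(records, statuses=("production",)):
    """Systems in scope for the audit: deployed systems by default."""
    return [r for r in records if r.deployment_status in statuses]
```

Even a flat list like this beats no inventory: the scope filter makes explicit which systems the audit covers and which it defers.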

02

Classify systems by risk tier

Apply a risk classification framework aligned with the EU AI Act categories: unacceptable risk (prohibited), high risk (full compliance requirements), limited risk (transparency obligations), and minimal risk (voluntary best practices). Map each system to its regulatory obligations. High-risk systems get the full audit treatment. Low-risk systems get a lightweight review. This step determines audit depth and resource allocation.
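A sketch of the tier mapping, with a loud caveat: real EU AI Act classification requires legal review of the Annex III categories, not a keyword table. The domain lists here are illustrative only:

```python
# Illustrative only: actual classification needs legal review of the
# EU AI Act Annex III categories, not a keyword lookup.
HIGH_RISK_DOMAINS = {"hiring", "credit", "healthcare", "law_enforcement", "education"}
LIMITED_RISK_DOMAINS = {"chatbot", "content_generation"}

def classify_risk_tier(domain: str) -> str:
    """Map a use-case domain to an audit-depth tier."""
    if domain in HIGH_RISK_DOMAINS:
        return "high"      # full compliance requirements, full audit
    if domain in LIMITED_RISK_DOMAINS:
        return "limited"   # transparency obligations
    return "minimal"       # voluntary best practices, lightweight review
```

The point of encoding the mapping, even crudely, is that audit depth and resource allocation follow mechanically from the tier rather than from ad hoc judgment.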

03

Review data lineage and training data quality

Trace the data pipeline from source to model. Document where training data originates, how it was collected, what consent was obtained, how it was preprocessed, and whether it contains known biases or gaps. Check for data quality issues: missing values, label errors, distribution shifts between training and production data. Data lineage is the audit dimension most likely to surface root causes of bias and performance problems.
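Two of the quality checks above can be sketched in a few lines: missing-value rate and a crude drift signal between training and production distributions. This is a simplification; production audits typically use tests like PSI or Kolmogorov-Smirnov, and the income figures below are synthetic:

```python
from statistics import mean, stdev

def missing_rate(column):
    """Fraction of missing (None) entries in a column."""
    return sum(v is None for v in column) / len(column)

def mean_shift(train, prod):
    """Shift of the production mean from the training mean, measured in
    training standard deviations. A crude drift signal; real audits use
    tests like PSI or Kolmogorov-Smirnov."""
    return abs(mean(prod) - mean(train)) / stdev(train)

train_income = [40_000, 52_000, 61_000, 48_000, 55_000]
prod_income = [70_000, 82_000, 91_000, 78_000, 85_000]

# Flag features whose production distribution has drifted noticeably.
drifted = mean_shift(train_income, prod_income) > 2.0
```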

04

Test model performance against stated accuracy

Run the model against a representative test set and compare actual performance metrics to the metrics claimed in documentation or stakeholder communications. Check for performance degradation since last evaluation. Disaggregate results by demographic group, geography, and any other relevant segmentation. A model that meets aggregate accuracy targets but underperforms for specific populations has a performance problem even if the headline number looks good.
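The disaggregation logic is simple enough to sketch directly. The synthetic data below shows the failure mode the step is designed to catch: an acceptable aggregate number hiding a failing subgroup:

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy overall and per group; surfaces the subgroup failures
    that the aggregate number hides."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] += 1
        hits[g] += (t == p)
    per_group = {g: hits[g] / totals[g] for g in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_group

# Aggregate looks tolerable; group "B" is failing badly.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
overall, per_group = accuracy_by_group(y_true, y_pred, groups)
```

Here the overall accuracy is 0.625, but group A scores 1.0 while group B scores 0.25: exactly the pattern that makes headline accuracy unacceptable on its own.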

05

Run bias and fairness evaluation

Apply the fairness metrics defined for this use case (demographic parity, equalized odds, calibration, or others as appropriate). Compare results to documented thresholds. Use SHAP or LIME to identify which features drive predictions for different demographic groups. Document any disparities, their magnitude, and their potential real-world impact. This step produces the evidence that regulators and external auditors care about most.
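Fairlearn computes these metrics directly (for example, demographic parity difference), but a hand-rolled version makes the metric concrete. The predictions and groups below are synthetic:

```python
def selection_rate(y_pred, groups, group):
    """Positive-prediction rate for one group."""
    preds = [p for p, g in zip(y_pred, groups) if g == group]
    return sum(preds) / len(preds)

def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rate across groups.
    0.0 means identical selection rates."""
    rates = [selection_rate(y_pred, groups, g) for g in set(groups)]
    return max(rates) - min(rates)

y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
dpd = demographic_parity_difference(y_pred, groups)  # A: 0.75, B: 0.25
```

A result like 0.5 here would be compared against the documented threshold for the use case, then traced back to driving features with SHAP or LIME.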

06

Verify human oversight mechanisms

Confirm that the documented human oversight pattern (human-in-the-loop, human-on-the-loop, human-in-command) is actually implemented and functioning. Check that designated reviewers exist, are trained, have access to the tools they need, and are reviewing at the expected cadence. Verify that override and escalation mechanisms work. A documented oversight pattern that nobody follows is worse than no pattern because it creates false assurance.
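One part of this verification is mechanical: comparing the review log against the documented cadence. A minimal sketch, assuming review timestamps are available from whatever tooling the reviewers use (the dates are illustrative):

```python
from datetime import date, timedelta

def cadence_gaps(review_dates, expected_every_days):
    """Intervals between consecutive reviews that exceed the documented
    cadence: evidence the oversight pattern is not being followed."""
    ordered = sorted(review_dates)
    return [
        (a, b) for a, b in zip(ordered, ordered[1:])
        if (b - a) > timedelta(days=expected_every_days)
    ]

# Weekly cadence documented; reviews stopped for seven weeks.
reviews = [date(2025, 1, 6), date(2025, 1, 13), date(2025, 3, 3)]
gaps = cadence_gaps(reviews, expected_every_days=7)
```

A non-empty gap list is the kind of concrete evidence that turns "oversight exists on paper" into an audit finding.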

07

Check security controls

Review access controls, data encryption, API security, model serving infrastructure, and adversarial robustness. Check for prompt injection vulnerabilities in LLM-based systems. Verify that training data and model weights are protected against unauthorized access. This step should reference findings from any recent security audit and fill gaps specific to the AI system. See the AI security guide for the full security assessment framework.

08

Validate documentation and model cards

Check that model cards exist and are current for every in-scope system. Verify that documentation covers model purpose, training data, performance metrics disaggregated by demographic group, known limitations, and intended use cases. Confirm that documentation was updated after the last retraining or material change. Under the EU AI Act, inadequate documentation for high-risk systems is a compliance violation regardless of how well the model actually performs.
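Documentation completeness is checkable automatically if model cards are stored in a structured format. A sketch assuming cards are dictionaries; the required-section names mirror this step's list but are otherwise illustrative:

```python
REQUIRED_SECTIONS = {
    "purpose", "training_data", "performance_by_group",
    "limitations", "intended_use",
}

def missing_sections(model_card: dict):
    """Required sections that are absent or left empty."""
    return sorted(s for s in REQUIRED_SECTIONS if not model_card.get(s))

card = {
    "purpose": "Rank inbound job applications",
    "training_data": "2019-2024 ATS exports, recruiter labels",
    "performance_by_group": "",  # present but empty: still a gap
    "limitations": "Not validated outside the US market",
}
findings = missing_sections(card)
```

Running a check like this across every in-scope card turns "documentation currency" from a judgment call into a pass/fail list.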

09

Confirm incident response readiness

Verify that an incident response runbook exists for AI-specific incidents: biased outputs, harmful content generation, privacy violations, performance degradation, and adversarial attacks. Check that named responders are assigned, notification timelines are defined, and the runbook has been tested through a tabletop exercise or live drill. An untested incident response plan is a plan that will fail when it matters.

10

Document findings and assign remediation owners

Compile all findings into a structured report with severity ratings (critical, high, medium, low), evidence, and recommended remediation. Assign a named owner and a target remediation date for every finding rated high or above. Schedule a follow-up review to verify that remediation was completed. The audit is not done when the report is written. It is done when every critical and high finding has been closed.
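The close-out rule above ("done when every critical and high finding is closed") can be encoded directly, which keeps the audit from being declared finished by report delivery alone. A sketch with illustrative findings:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Finding:
    title: str
    severity: str  # "critical" | "high" | "medium" | "low"
    owner: Optional[str] = None
    due: Optional[date] = None
    verified_closed: bool = False

def unowned_critical_or_high(findings):
    """Findings that block audit close-out: high severity or above
    with no named owner or no remediation date."""
    return [
        f for f in findings
        if f.severity in ("critical", "high")
        and (f.owner is None or f.due is None)
    ]

def audit_closed(findings):
    """Done only when every critical/high finding is verified fixed."""
    return all(
        f.verified_closed
        for f in findings
        if f.severity in ("critical", "high")
    )

report = [
    Finding("Subgroup accuracy gap", "high", owner="ml-lead", due=date(2026, 3, 1)),
    Finding("Stale model card", "medium"),
    Finding("No override mechanism", "critical"),
]
```

In this example the unowned critical finding surfaces immediately, and `audit_closed` stays false until every critical/high item is verified, matching the close-out rule in the text.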

Who runs the AI audit

Three models, each with different tradeoffs in cost, independence, and expertise.

Internal audit teams provide speed, institutional knowledge, and lower cost per audit. They know the systems, the data, and the organizational context. The limitation is independence: an internal team auditing systems built by colleagues in the same organization faces inherent conflicts of interest. Internal teams work best for routine production audits where speed matters more than external credibility.

External audit firms provide independence and credibility with regulators, auditors, and customers. Specialized AI audit firms (Holistic AI, ORCAA, ForHumanity) bring AI-specific expertise that general audit firms may lack. Big Four consultancies (Deloitte, PwC, EY, KPMG) bring scale and regulatory relationships. The limitation is cost ($50K to $200K per system) and timeline (weeks to months). External audits work best for pre-deployment assessments of high-risk systems and periodic comprehensive reviews.

The hybrid model is the most cost-effective for most enterprises. Internal teams run routine production audits at the cadence dictated by risk tier. External firms run pre-deployment audits for high-risk systems, annual comprehensive reviews, and any audit where regulatory credibility requires independence. The governance board reviews findings from both tracks and owns the remediation process.

The governance board's role is not to conduct audits. It is to set audit policy, approve audit scope, review findings, approve remediation plans, and track remediation to completion. The board should include representatives from engineering, data science, legal, product, and risk. It meets quarterly to review the audit portfolio and monthly when active remediations are in flight. A governance board that rubber-stamps findings without tracking remediation is not governing.

Frequently Asked Questions

What is an AI audit?
An AI audit is a systematic evaluation of an AI system across multiple dimensions: model performance against stated accuracy, bias and fairness across demographic groups, data lineage and training data quality, compliance with applicable regulations, security controls, human oversight mechanisms, and documentation completeness. It produces a findings report with evidence, risk ratings, and assigned remediation owners. An AI audit is not a one-time event. It is a recurring process that runs before deployment (to catch issues before they reach production) and in production (to catch issues that emerge from data drift, population changes, and model degradation over time).
When is an AI audit required?
Three triggers. Regulatory: the EU AI Act requires conformity assessments for high-risk AI systems before deployment, with ongoing post-market monitoring. NYC Local Law 144 requires annual bias audits for automated employment decision tools. Sector-specific regulations in financial services, healthcare, and insurance add additional audit obligations. Organizational: most mature AI governance programs require pre-deployment audits for any system classified as high-risk, plus periodic production audits at a cadence that matches the risk tier. Event-driven: any material change to a production AI system (retraining, architecture change, new data source, expansion to new markets or populations) should trigger a re-audit of the affected dimensions.
Who should conduct an AI audit?
Three options, each with tradeoffs. Internal audit teams provide speed and institutional knowledge but may lack independence and AI-specific expertise. External audit firms (Big Four consultancies, specialized AI audit firms like Holistic AI or ORCAA) provide independence and credibility with regulators but cost more and take longer. Hybrid approaches use internal teams for routine production audits and external firms for pre-deployment audits of high-risk systems and periodic comprehensive reviews. The EU AI Act requires independent conformity assessments for certain high-risk categories, which practically means external auditors. For everything else, the hybrid model is the most cost-effective approach.
How much does an AI audit cost?
An internal audit of a single AI system costs $15K to $40K in staff time, tooling, and documentation effort, depending on the system complexity and the depth of evaluation. An external audit by a specialized firm runs $50K to $200K per system for a comprehensive assessment covering bias, performance, compliance, and documentation. Annual audit programs for a portfolio of 10 to 50 AI systems typically cost $200K to $800K when combining internal and external resources. The cost of not auditing is harder to quantify but includes regulatory fines (up to 35 million euros or 7% of global turnover under the EU AI Act), litigation exposure, and reputational damage that can exceed the audit cost by orders of magnitude.
How often should AI systems be audited?
Cadence should match risk tier. High-risk systems (hiring, credit, healthcare, law enforcement): full audit before deployment, quarterly production reviews, and re-audit after any material change. Medium-risk systems (content moderation, customer service automation, recommendation engines): full audit before deployment, semi-annual production reviews. Low-risk systems (internal productivity tools, non-customer-facing automation): lightweight review before deployment, annual production check. Every system should be re-audited after retraining, architecture changes, new data sources, or expansion to new markets or populations, regardless of the scheduled cadence.
How is an AI audit different from a security audit?
A security audit evaluates whether the AI system is protected against unauthorized access, data breaches, adversarial attacks, and infrastructure vulnerabilities. An AI audit evaluates whether the system performs as intended, treats people fairly, complies with regulations, maintains proper documentation, and has appropriate human oversight. The two overlap in areas like data protection, access controls, and adversarial robustness, but an AI audit covers fairness, transparency, documentation, and human oversight dimensions that a security audit does not. Most mature organizations run both: a security audit as part of the broader security program, and an AI audit as part of the AI governance program. The AI audit should reference security findings but not duplicate the security assessment.
Thomas Prommer, Technology Executive (CTO/CIO/CTAIO)


Run an AI audit that produces real findings

From system inventory to remediation tracking. A fractional CAIO engagement designs the audit program and runs the first assessment so your team has a template for everything that follows.