ACAI — AI Compliance Audit & Inspection Framework

The full technical
audit for EU AI Act
compliance.

Real engineering tests across your entire AI estate — traditional ML models, LLM-based systems, automated pipelines and agentic workflows. Every finding mapped to the exact EU AI Act article it violates, with a severity rating, named remediation tasks, and a 90-day roadmap your team can execute before August 2026. Your team runs the tests. I guide every step. No external access to your production systems required.

Enforcement deadline
August 2, 2026
High-risk AI enforcement begins
--
Days
--
Hours
--
Mins
Book Discovery Call →
Your team runs the tests — I guide every step
No external access to your production systems
Results interpreted and mapped to exact EU AI Act articles
Covers ML models, LLM systems and agentic workflows
Deliverables

What you receive

A full audit produces five concrete deliverables — not a report that sits in a folder. Every item is designed to be used immediately.

01
Findings Register

Every compliance gap mapped to the exact EU AI Act article it violates, with a severity rating (Critical / High / Medium) and a named remediation task. Board-presentable format.

02
90-Day Remediation Roadmap

Sequenced remediation plan with named owners, dependencies, and measurable milestones. Designed for execution by your existing team — not a separate project requiring external resources.

03
Technical Test Results

Raw scores from every technical test — ECE calibration per model, PSI distribution shift values, SHAP faithfulness ratings, adversarial degradation percentages. Fully reproducible.

04
Board-Ready Compliance Summary

A single-page executive summary your C-suite can present to an enforcement authority, an audit committee, or an enterprise client asking for compliance assurance.

05
Executive Debrief

60-minute session walking through every finding with your leadership team. Questions answered, priorities agreed, ownership assigned. Not a presentation — a working session.

The methodology

Two audit tracks — one complete picture

Traditional ML systems require different tests from LLM-based systems and agentic workflows. The ACAI framework covers both — every test is specific, article-mapped and reproducible. Most organisations have systems in both tracks and do not know it.

Track A — Traditional ML Systems Prediction models · Classification systems · Scoring engines
T1 · D1 AI System Inventory & Risk Classification
EU AI Act · Annex III · Article 6 · Article 26

A complete register of every AI system in production — traditional ML models, LLM deployments, automated pipelines with AI steps, and agentic workflows. Every system classified against Annex III risk tiers (Unacceptable, High-Risk, Limited, Minimal). Deployer obligations mapped for every vendor AI system under Article 26. Most organisations discover 30–40% more AI systems than they believed they had — including LLM-based tools and pipelines that were deployed without compliance review. Without a complete inventory, every other compliance step is built on an incomplete picture.

T2 · D2 Calibration Testing — Expected Calibration Error (ECE)
EU AI Act · Article 15

Expected Calibration Error is scored per production model. ECE measures the gap between a model's stated confidence and its actual accuracy — an ECE of 0.18 means a model claiming 80% confidence is accurate only 62% of the time. ECE > 0.15 is a critical Article 15 finding. Article 15 requires AI systems to achieve appropriate levels of accuracy and robustness throughout their lifecycle. A model with poor calibration cannot satisfy this requirement regardless of how good its overall accuracy metric appears. This test is almost never performed by organisations before an audit.

Critical threshold: ECE > 0.15 — Article 15 direct finding
T3 · D2 Distribution Shift Detection — Population Stability Index (PSI)
EU AI Act · Article 15 · Article 61

Population Stability Index is calculated by comparing the distribution of production data against the model's training data distribution. PSI > 0.25 means the model is operating on data that is meaningfully different from what it was validated on — effectively outside its validated operating conditions. Article 15 requires maintained accuracy throughout the system lifecycle. Article 61 requires post-market monitoring. A model operating with high PSI may have degraded significantly since deployment with no mechanism in place to detect it.

Critical threshold: PSI > 0.25 — operating outside validated conditions
T4 · D5 Explanation Faithfulness — SHAP/LIME Testing
EU AI Act · Article 13

SHAP faithfulness testing verifies that the explanations produced by your AI systems accurately reflect what the model actually computed — not a post-hoc rationalisation that happens to look plausible. Article 13 requires meaningful transparency for high-risk AI systems affecting natural persons. A misleading explanation is worse than no explanation under Article 13 — because it creates a false appearance of compliance. This test is technically demanding and requires production model access; it cannot be performed through documentation review alone.

T5 · D2 Adversarial Robustness Probing
EU AI Act · Article 15

Deliberate manipulation attempts on production models — testing resilience to inputs that have been crafted or perturbed to probe model weaknesses. Accuracy degradation > 15% under adversarial conditions is a critical Article 15 finding. Article 15 requires AI systems to be resilient to attempts to alter their outputs or performance by third parties. This test is particularly critical for high-risk systems in financial services, recruitment, and access-to-service contexts where adversarial manipulation has direct economic or social consequences.

Critical threshold: Accuracy degradation > 15% under adversarial conditions
Track B — LLM & Agentic AI Systems LLM decision support · Automated pipelines · Agentic workflows · Autonomous systems
Does your LLM pipeline create EU AI Act obligations? Most organisations don't know. If an LLM-based system influences decisions affecting your customers, employees or partners — credit, hiring, access to services, content moderation — it is likely in scope under Annex III regardless of whether it uses a "traditional" ML model or a large language model. Track B tests are applied to any in-scope LLM deployment, automated pipeline or agentic workflow identified in T1.
B1 · D1 Agentic System Inventory & Autonomy Classification
EU AI Act · Annex III · Article 9 · Article 14

Every LLM deployment, automated pipeline and agentic workflow mapped and classified on an autonomy scale: Decision Support (human decides) → Human-in-the-Loop (human approves AI recommendation) → Human-on-the-Loop (human can intervene but AI acts by default) → Fully Autonomous. The autonomy classification directly determines your Article 14 obligations. Most organisations have systems at Human-on-the-Loop or Fully Autonomous that were deployed without any compliance review. Also maps which tools, APIs and data each agent has access to — the blast radius if the system behaves unexpectedly.

B2 · D4 Decision Traceability & Logging Assessment
EU AI Act · Article 12 · Article 13

Assessment of whether the organisation has structured logging at the pipeline level — every tool call, every model response, every action taken, every decision made. Article 12 requires high-risk AI systems to automatically log events throughout operation. For agentic systems this means a complete audit trail of the agent's reasoning chain, not just its final output. Absence of pipeline-level logging is a critical Article 12 finding. Most LangChain and LangGraph deployments have application logs but not compliance-grade audit trails.

Critical finding: No structured pipeline logging — Article 12 direct violation
B3 · D4 Human Oversight Functional Testing
EU AI Act · Article 14

Article 14 requires that humans can effectively oversee, understand and intervene in high-risk AI systems. For agentic systems this means testing whether the override mechanism actually works — not just whether it exists on paper. Your team demonstrates the override in a test environment: can a human stop the agent mid-task? Is there an audit trail of every action taken before the override? Does the agent return to a safe state? If you cannot demonstrate a working override in 30 minutes, that is a critical Article 14 finding. The test also assesses whether human reviewers have sufficient information to make meaningful oversight decisions — or whether they are rubber-stamping AI recommendations without genuine understanding.

Critical finding: Override cannot be demonstrated — Article 14 direct violation
B4 · D2 Prompt Injection & Goal Hijacking Testing
EU AI Act · Article 15

The agentic equivalent of adversarial robustness probing. Tests whether an external actor — through a malicious web page the agent reads, a crafted input it processes, a poisoned API response it receives — can cause the agent to take actions outside its intended scope. This is the most serious unaddressed security risk for agentic AI systems and the test most organisations have never considered. Article 15 requires resilience to attempts to alter system performance or outputs. A customer service agent that can be manipulated into revealing other customers' data, or a document processing agent that can be hijacked to exfiltrate information, fails Article 15 regardless of how well it performs its intended function.

B5 · D3 Tool Use Boundary & Permission Audit
EU AI Act · Article 9 · Article 15

Agents are given tools — web search, code execution, database access, email sending, API calls. This test audits whether the agent operates within the intended boundaries of each tool, and whether those boundaries are actually enforced at the system level. Does the agent only query the database tables it is authorised for? Does it only send communications to intended recipients? Does it only access external services within its defined scope? Most agentic systems have permissive tool configurations that were set during development and never tightened before production deployment. Article 9 requires a risk management system that identifies and addresses foreseeable misuse — unconstrained tool access is a foreseeable misuse vector.

B6 · D2 Behavioural Consistency Under Varied Context
EU AI Act · Article 15

Traditional models are tested on a fixed input distribution. Agents behave differently depending on what they find in the environment — the document they retrieved, the email they read, the API response they received. Behavioural consistency testing assesses whether the agent produces consistent, appropriate outputs across a wide range of environmental contexts, including unexpected, ambiguous and adversarial ones. LLM behaviour also changes as underlying model versions are updated — even without any deliberate changes by the organisation. An agent that passed behavioural assessment in January may behave differently in June after the underlying model was silently updated by the LLM provider. This test is the foundation of the monitoring strategy in the Compliance Retainer.

Engagement process

From discovery call to findings register

A structured four-stage process designed to minimise disruption to your team while producing a defensible compliance record.

01
Discovery & Scoping

30-minute call. Map your AI systems, agree scope, confirm engagement tier. You receive a scoping document within 48 hours.

02
Guided Technical Testing

Your engineering team runs the tests in your own environment — no external access to production systems required. I provide the exact protocol, tools and parameters. Your team executes; I guide every step and interpret all results.

03
Findings & Roadmap

Findings register delivered with article mapping and severity ratings. 90-day remediation roadmap with named owners and sequenced milestones.

04
Executive Debrief

60-minute working session with your leadership team. Every finding explained, priorities agreed, ownership assigned. Board summary delivered.

Investment

Three engagement tiers

Every engagement begins with a free 30-minute discovery call. Scope and final pricing confirmed before any commitment is made.

Full engagement
Full Compliance Audit
from €25,000
4 weeks · Track A + Track B · every in-scope system
Includes everything in Readiness Scan plus
Track A — ECE, PSI, SHAP, adversarial testing on ML systems
Track B — Autonomy classification, traceability, override testing, prompt injection, tool boundary audit on LLM & agentic systems
90-day remediation roadmap with named owners
Board-ready compliance summary
Executive debrief — 60 minutes
Ongoing
Compliance Retainer
from €4,000/mo
post-audit · available after Readiness Scan or Full Audit
Includes
Continuous compliance posture monitoring — ML and agentic systems
Agentic system change monitoring — prompt changes, model version updates, new tool access, pipeline changes all assessed for compliance impact
Quarterly behavioural re-assessment of high-risk agentic systems
Board-level advisory and representation
Regulatory update briefings as EU AI Act guidance develops

Scope drives price. Final pricing is confirmed after the discovery call based on number of AI systems, technical complexity, and timeline. The Readiness Scan is available immediately. All engagements begin with a free 30-minute scoping call — no commitment required. Your team runs all technical tests in your own environment; no external access to production systems is ever required.

Ready to start

In 30 minutes, know exactly
where your AI stands.

No pitch. Five questions about your AI systems. You leave with clarity on your exposure regardless of whether you engage me.