Real engineering tests across your entire AI estate — traditional ML models, LLM-based systems, automated pipelines and agentic workflows. Every finding mapped to the exact EU AI Act article it violates, with a severity rating, named remediation tasks, and a 90-day roadmap your team can execute before August 2026. Your team runs the tests. I guide every step. No external access to your production systems required.
A full audit produces five concrete deliverables — not a report that sits in a folder. Every item is designed to be used immediately.
Every compliance gap mapped to the exact EU AI Act article it violates, with a severity rating (Critical / High / Medium) and a named remediation task. Board-presentable format.
Sequenced remediation plan with named owners, dependencies, and measurable milestones. Designed for execution by your existing team — not a separate project requiring external resources.
Raw scores from every technical test — ECE calibration per model, PSI distribution shift values, SHAP faithfulness ratings, adversarial degradation percentages. Fully reproducible.
A single-page executive summary your C-suite can present to an enforcement authority, an audit committee, or an enterprise client asking for compliance assurance.
60-minute session walking through every finding with your leadership team. Questions answered, priorities agreed, ownership assigned. Not a presentation — a working session.
Traditional ML systems require different tests from LLM-based systems and agentic workflows. The ACAI framework covers both — every test is specific, article-mapped and reproducible. Most organisations have systems in both tracks and do not know it.
A complete register of every AI system in production — traditional ML models, LLM deployments, automated pipelines with AI steps, and agentic workflows. Every system classified against Annex III risk tiers (Unacceptable, High-Risk, Limited, Minimal). Deployer obligations mapped for every vendor AI system under Article 26. Most organisations discover 30–40% more AI systems than they believed they had — including LLM-based tools and pipelines that were deployed without compliance review. Without a complete inventory, every other compliance step is built on an incomplete picture.
Expected Calibration Error is scored per production model. ECE measures the gap between a model's stated confidence and its actual accuracy — an ECE of 0.18 means a model claiming 80% confidence is accurate only 62% of the time. ECE > 0.15 is a critical Article 15 finding. Article 15 requires AI systems to achieve appropriate levels of accuracy and robustness throughout their lifecycle. A model with poor calibration cannot satisfy this requirement regardless of how good its overall accuracy metric appears. This test is almost never performed by organisations before an audit.
Population Stability Index is calculated by comparing the distribution of production data against the model's training data distribution. PSI > 0.25 means the model is operating on data that is meaningfully different from what it was validated on — effectively outside its validated operating conditions. Article 15 requires maintained accuracy throughout the system lifecycle. Article 61 requires post-market monitoring. A model operating with high PSI may have degraded significantly since deployment with no mechanism in place to detect it.
SHAP faithfulness testing verifies that the explanations produced by your AI systems accurately reflect what the model actually computed — not a post-hoc rationalisation that happens to look plausible. Article 13 requires meaningful transparency for high-risk AI systems affecting natural persons. A misleading explanation is worse than no explanation under Article 13 — because it creates a false appearance of compliance. This test is technically demanding and requires production model access; it cannot be performed through documentation review alone.
Deliberate manipulation attempts on production models — testing resilience to inputs that have been crafted or perturbed to probe model weaknesses. Accuracy degradation > 15% under adversarial conditions is a critical Article 15 finding. Article 15 requires AI systems to be resilient to attempts to alter their outputs or performance by third parties. This test is particularly critical for high-risk systems in financial services, recruitment, and access-to-service contexts where adversarial manipulation has direct economic or social consequences.
Every LLM deployment, automated pipeline and agentic workflow mapped and classified on an autonomy scale: Decision Support (human decides) → Human-in-the-Loop (human approves AI recommendation) → Human-on-the-Loop (human can intervene but AI acts by default) → Fully Autonomous. The autonomy classification directly determines your Article 14 obligations. Most organisations have systems at Human-on-the-Loop or Fully Autonomous that were deployed without any compliance review. Also maps which tools, APIs and data each agent has access to — the blast radius if the system behaves unexpectedly.
Assessment of whether the organisation has structured logging at the pipeline level — every tool call, every model response, every action taken, every decision made. Article 12 requires high-risk AI systems to automatically log events throughout operation. For agentic systems this means a complete audit trail of the agent's reasoning chain, not just its final output. Absence of pipeline-level logging is a critical Article 12 finding. Most LangChain and LangGraph deployments have application logs but not compliance-grade audit trails.
Article 14 requires that humans can effectively oversee, understand and intervene in high-risk AI systems. For agentic systems this means testing whether the override mechanism actually works — not just whether it exists on paper. Your team demonstrates the override in a test environment: can a human stop the agent mid-task? Is there an audit trail of every action taken before the override? Does the agent return to a safe state? If you cannot demonstrate a working override in 30 minutes, that is a critical Article 14 finding. The test also assesses whether human reviewers have sufficient information to make meaningful oversight decisions — or whether they are rubber-stamping AI recommendations without genuine understanding.
The agentic equivalent of adversarial robustness probing. Tests whether an external actor — through a malicious web page the agent reads, a crafted input it processes, a poisoned API response it receives — can cause the agent to take actions outside its intended scope. This is the most serious unaddressed security risk for agentic AI systems and the test most organisations have never considered. Article 15 requires resilience to attempts to alter system performance or outputs. A customer service agent that can be manipulated into revealing other customers' data, or a document processing agent that can be hijacked to exfiltrate information, fails Article 15 regardless of how well it performs its intended function.
Agents are given tools — web search, code execution, database access, email sending, API calls. This test audits whether the agent operates within the intended boundaries of each tool, and whether those boundaries are actually enforced at the system level. Does the agent only query the database tables it is authorised for? Does it only send communications to intended recipients? Does it only access external services within its defined scope? Most agentic systems have permissive tool configurations that were set during development and never tightened before production deployment. Article 9 requires a risk management system that identifies and addresses foreseeable misuse — unconstrained tool access is a foreseeable misuse vector.
Traditional models are tested on a fixed input distribution. Agents behave differently depending on what they find in the environment — the document they retrieved, the email they read, the API response they received. Behavioural consistency testing assesses whether the agent produces consistent, appropriate outputs across a wide range of environmental contexts, including unexpected, ambiguous and adversarial ones. LLM behaviour also changes as underlying model versions are updated — even without any deliberate changes by the organisation. An agent that passed behavioural assessment in January may behave differently in June after the underlying model was silently updated by the LLM provider. This test is the foundation of the monitoring strategy in the Compliance Retainer.
A structured four-stage process designed to minimise disruption to your team while producing a defensible compliance record.
30-minute call. Map your AI systems, agree scope, confirm engagement tier. You receive a scoping document within 48 hours.
Your engineering team runs the tests in your own environment — no external access to production systems required. I provide the exact protocol, tools and parameters. Your team executes; I guide every step and interpret all results.
Findings register delivered with article mapping and severity ratings. 90-day remediation roadmap with named owners and sequenced milestones.
60-minute working session with your leadership team. Every finding explained, priorities agreed, ownership assigned. Board summary delivered.
Every engagement begins with a free 30-minute discovery call. Scope and final pricing confirmed before any commitment is made.
Scope drives price. Final pricing is confirmed after the discovery call based on number of AI systems, technical complexity, and timeline. The Readiness Scan is available immediately. All engagements begin with a free 30-minute scoping call — no commitment required. Your team runs all technical tests in your own environment; no external access to production systems is ever required.
No pitch. Five questions about your AI systems. You leave with clarity on your exposure regardless of whether you engage me.