The methodology

Two audit tracks — one complete picture

Traditional ML systems require different tests from LLM-based systems and agentic workflows. The ACAI framework covers both — every test is specific, article-mapped and reproducible. Most organisations have systems in both tracks and do not know it.

Track A — Traditional ML Systems Prediction models · Classification systems · Scoring engines

T1 · D1 AI System Inventory & Risk Classification

EU AI Act · Annex III · Article 6 · Article 26

A complete register of every AI system in production — traditional ML models, LLM deployments, automated pipelines with AI steps, and agentic workflows. Every system classified against Annex III risk tiers (Unacceptable, High-Risk, Limited, Minimal). Deployer obligations mapped for every vendor AI system under Article 26. Most organisations discover 30–40% more AI systems than they believed they had — including LLM-based tools and pipelines that were deployed without compliance review. Without a complete inventory, every other compliance step is built on an incomplete picture.

T2 · D2 Calibration Testing — Expected Calibration Error (ECE)

EU AI Act · Article 15

Expected Calibration Error is scored per production model. ECE measures the gap between a model's stated confidence and its actual accuracy — an ECE of 0.18 means a model claiming 80% confidence is accurate only 62% of the time. ECE > 0.15 is a critical Article 15 finding. Article 15 requires AI systems to achieve appropriate levels of accuracy and robustness throughout their lifecycle. A model with poor calibration cannot satisfy this requirement regardless of how good its overall accuracy metric appears. This test is almost never performed by organisations before an audit.

Critical threshold: ECE > 0.15 — Article 15 direct finding

T3 · D2 Distribution Shift Detection — Population Stability Index (PSI)

EU AI Act · Article 15 · Article 61

Population Stability Index is calculated by comparing the distribution of production data against the model's training data distribution. PSI > 0.25 means the model is operating on data that is meaningfully different from what it was validated on — effectively outside its validated operating conditions. Article 15 requires maintained accuracy throughout the system lifecycle. Article 61 requires post-market monitoring. A model operating with high PSI may have degraded significantly since deployment with no mechanism in place to detect it.

Critical threshold: PSI > 0.25 — operating outside validated conditions

T4 · D5 Explanation Faithfulness — SHAP/LIME Testing

EU AI Act · Article 13

SHAP faithfulness testing verifies that the explanations produced by your AI systems accurately reflect what the model actually computed — not a post-hoc rationalisation that happens to look plausible. Article 13 requires meaningful transparency for high-risk AI systems affecting natural persons. A misleading explanation is worse than no explanation under Article 13 — because it creates a false appearance of compliance. This test is technically demanding and requires production model access; it cannot be performed through documentation review alone.

T5 · D2 Adversarial Robustness Probing

EU AI Act · Article 15

Deliberate manipulation attempts on production models — testing resilience to inputs that have been crafted or perturbed to probe model weaknesses. Accuracy degradation > 15% under adversarial conditions is a critical Article 15 finding. Article 15 requires AI systems to be resilient to attempts to alter their outputs or performance by third parties. This test is particularly critical for high-risk systems in financial services, recruitment, and access-to-service contexts where adversarial manipulation has direct economic or social consequences.

Critical threshold: Accuracy degradation > 15% under adversarial conditions

Track B — LLM & Agentic AI Systems LLM decision support · Automated pipelines · Agentic workflows · Autonomous systems

Does your LLM pipeline create EU AI Act obligations? Most organisations don't know. If an LLM-based system influences decisions affecting your customers, employees or partners — credit, hiring, access to services, content moderation — it is likely in scope under Annex III regardless of whether it uses a "traditional" ML model or a large language model. Track B tests are applied to any in-scope LLM deployment, automated pipeline or agentic workflow identified in T1.

B1 · D1 Agentic System Inventory & Autonomy Classification

EU AI Act · Annex III · Article 9 · Article 14

Every LLM deployment, automated pipeline and agentic workflow mapped and classified on an autonomy scale: Decision Support (human decides) → Human-in-the-Loop (human approves AI recommendation) → Human-on-the-Loop (human can intervene but AI acts by default) → Fully Autonomous. The autonomy classification directly determines your Article 14 obligations. Most organisations have systems at Human-on-the-Loop or Fully Autonomous that were deployed without any compliance review. Also maps which tools, APIs and data each agent has access to — the blast radius if the system behaves unexpectedly.

B2 · D4 Decision Traceability & Logging Assessment

EU AI Act · Article 12 · Article 13

Assessment of whether the organisation has structured logging at the pipeline level — every tool call, every model response, every action taken, every decision made. Article 12 requires high-risk AI systems to automatically log events throughout operation. For agentic systems this means a complete audit trail of the agent's reasoning chain, not just its final output. Absence of pipeline-level logging is a critical Article 12 finding. Most LangChain and LangGraph deployments have application logs but not compliance-grade audit trails.

Critical finding: No structured pipeline logging — Article 12 direct violation

B3 · D4 Human Oversight Functional Testing

EU AI Act · Article 14

Article 14 requires that humans can effectively oversee, understand and intervene in high-risk AI systems. For agentic systems this means testing whether the override mechanism actually works — not just whether it exists on paper. Your team demonstrates the override in a test environment: can a human stop the agent mid-task? Is there an audit trail of every action taken before the override? Does the agent return to a safe state? If you cannot demonstrate a working override in 30 minutes, that is a critical Article 14 finding. The test also assesses whether human reviewers have sufficient information to make meaningful oversight decisions — or whether they are rubber-stamping AI recommendations without genuine understanding.

Critical finding: Override cannot be demonstrated — Article 14 direct violation

B4 · D2 Prompt Injection & Goal Hijacking Testing

EU AI Act · Article 15

The agentic equivalent of adversarial robustness probing. Tests whether an external actor — through a malicious web page the agent reads, a crafted input it processes, a poisoned API response it receives — can cause the agent to take actions outside its intended scope. This is the most serious unaddressed security risk for agentic AI systems and the test most organisations have never considered. Article 15 requires resilience to attempts to alter system performance or outputs. A customer service agent that can be manipulated into revealing other customers' data, or a document processing agent that can be hijacked to exfiltrate information, fails Article 15 regardless of how well it performs its intended function.

B5 · D3 Tool Use Boundary & Permission Audit

EU AI Act · Article 9 · Article 15

Agents are given tools — web search, code execution, database access, email sending, API calls. This test audits whether the agent operates within the intended boundaries of each tool, and whether those boundaries are actually enforced at the system level. Does the agent only query the database tables it is authorised for? Does it only send communications to intended recipients? Does it only access external services within its defined scope? Most agentic systems have permissive tool configurations that were set during development and never tightened before production deployment. Article 9 requires a risk management system that identifies and addresses foreseeable misuse — unconstrained tool access is a foreseeable misuse vector.

B6 · D2 Behavioural Consistency Under Varied Context

EU AI Act · Article 15

Traditional models are tested on a fixed input distribution. Agents behave differently depending on what they find in the environment — the document they retrieved, the email they read, the API response they received. Behavioural consistency testing assesses whether the agent produces consistent, appropriate outputs across a wide range of environmental contexts, including unexpected, ambiguous and adversarial ones. LLM behaviour also changes as underlying model versions are updated — even without any deliberate changes by the organisation. An agent that passed behavioural assessment in January may behave differently in June after the underlying model was silently updated by the LLM provider. This test is the foundation of the monitoring strategy in the Compliance Retainer.

The full technical
audit for EU AI Act
compliance.

What you receive

Two audit tracks — one complete picture

From discovery call to findings register

Three engagement tiers

In 30 minutes, know exactly
where your AI stands.

The full technicalaudit for EU AI Actcompliance.

What you receive

Two audit tracks — one complete picture

From discovery call to findings register

Three engagement tiers

In 30 minutes, know exactlywhere your AI stands.

The full technical
audit for EU AI Act
compliance.

In 30 minutes, know exactly
where your AI stands.