feat: Add vulnerability-management arc — Book VII, quantum framework, ORION, and kill-chain assessment tool

feat: Add four consultant assignments (identity, CA, Intune, collaboration)
feat: Add management overlay pattern (Nebula T0 / Tailscale T1) and cloud admin VM guidance
2026-06-15 07:56:50 +02:00 · 2026-06-09 16:56:48 +02:00 · 2026-06-09 14:40:34 +02:00 · 2026-06-09 11:48:11 +02:00 · 2026-06-09 11:48:07 +02:00 · 2026-06-05 12:54:44 +02:00
51 changed files with 8199 additions and 483 deletions
@@ -34,11 +34,13 @@ Most security and resilience frameworks optimize for **robustness**—the abilit
 │   ├── executive-summary.md          # One-page board brief
 │   ├── executive-summary-cs.md       # Czech version of board brief (Výkonné shrnutí)
 │   ├── c-suite-conversation-guide.md # Persuasion scripts for top management
-│   └── t0-asset-framework.md       # Tier 0 asset classification and protection
+│   ├── t0-asset-framework.md       # Tier 0 asset classification and protection
 │   └── quantum-vulnerability-management.md # Time-budgeted quanta model for the exploitation-first era (Book VII companion)
 ├── playbooks/                      # Executable modernisation and response plans
 │   ├── rapid-modernisation-plan.md # 30-60-90-180 day transformation roadmap
 │   ├── endpoint-management-entry-vector.md # Intune/device management as engagement entry point
 │   ├── ai-assisted-tvm.md          # AI-powered vulnerability management blueprint
 │   ├── kill-chain-assessment-app.md # Spec for the offline kill-chain mapping tool (tools/kill-chain-assessment.html)
 │   ├── zero-budget-vulnerability-discovery.md # Script-based vuln discovery without commercial scanners
 │   ├── perimeter-scanning-capability.md # External attack surface scanning strategy
 │   ├── osquery-custom-platform.md    # Build a sovereign vuln/asset discovery platform on osquery
@@ -47,7 +49,8 @@ Most security and resilience frameworks optimize for **robustness**—the abilit
 │   ├── ad-endpoint-hardening.md    # On-prem AD, Windows endpoint, hybrid identity
 │   ├── zero-budget-hardening.md    # Maximize existing tool investment
 │   ├── implementation-playbook.md  # Step-by-step operational guide
-│   ├── sovereign-tool-stack.md     # Open-source arsenal and capability map
+│   ├── cqre-product-suite.md       # ASTRAL, PULSAR, AURORA: details, alignment, deployment
 │   ├── sovereign-tool-stack.md     # Full arsenal: CQRE products, open-source, and commercial tools
 │   ├── privileged-access-architecture.md # PAM: Teleport, Tailscale/Headscale, JIT access (Module 13)
 │   ├── sovereign-communications.md # Delta Chat chatmail, Matrix/Element, crisis channels (Module 14)
 │   └── business-case-template.md  # Financial justification and ROI framework
@@ -65,14 +68,24 @@ Most security and resilience frameworks optimize for **robustness**—the abilit
 │   ├── vertical-power-utilities.md # Power generation, transmission, water utilities
 │   ├── vertical-telco.md           # Telecommunications and mobile operators
 │   └── vertical-banking.md         # Financial services regulatory alignment
 ├── tools/                          # Standalone runnable instruments (offline, single-file)
 │   ├── README.md                   # Tool index and design constraints
 │   └── kill-chain-assessment.html  # Maps unknown estates → shortest existential path → quanta
 ├── books/                          # The Antifragile Handbook (Books I–VII + field guides)
 └── assets/                         # Diagrams, visuals, and presentation materials
 ```
 ## What Is Brownhat?
 Brownhat is the delivery brand for CQRE consulting engagements. The name is a deliberate rejection of the traditional hat colour taxonomy in security (black hat / white hat / grey hat) — our work is not about adversarial simulation or compliance theatre. It is about the unglamorous, practical work of making real environments more resilient: brownfield by design, working with what exists, fixing what matters most.
 The **Brownhat methodology** is the operational posture behind every engagement: move fast, extract value from existing investments, and close existential gaps before they become incidents. The **Brownhat Diagnostic** is the specific entry engagement — a structured NIST CSF 2.0 baseline assessment that every new client completes before any module recommendation is made.
 ## Our Posture: Move Fast and Fix Things
 This practice is built on a simple, actionable stance: **move fast and fix things**. We do not wait for perfect plans. We identify the kill chain, extract value from existing investments, and close existential gaps before they become incidents.
- **Speed is a security control.** A 90% solution deployed today outperforms a 100% solution that ships in six months.
+- **Speed is a security control.** A realistic engagement delivers 30–60% of an ideal posture in 180 days — infinitely better than the 100% solution that stays in planning and never ships.
 - **Work beats purchases.** Most organizations own 60-80% of the capabilities they need. We configure and operationalize before we shop.
 - **Every fix must produce a signal.** A remediation without telemetry is a remediation that will rot.
@@ -92,6 +105,9 @@ Our approach is not an alternative to established frameworks. It is the fastest
 - **[CIS Controls v8](reference/cis-controls-mapping.md)** — IG1 as a non-negotiable 90-day floor, achieved primarily through existing tool configuration
 - **[NIST CSF 2.0](reference/nist-csf-mapping.md)** — All six functions addressed with emphasis on GOVERN as the missing keystone
 - **NIS2 (EU 2022/2555)** — Every engagement produces direct evidence for the Article 21 measures: configuration management (ASTRAL), logging and monitoring (PULSAR), access control, and incident detection. Essential and important entities under NIS2 will find the Brownhat module set directly maps to their supervisory obligations.
 - **DORA (EU 2022/2554)** — ICT change management records (ASTRAL Git trail), incident log retention (PULSAR), and ICT third-party risk governance map onto DORA Articles 10 and 11. Designed for financial entities who need demonstrable controls, not documentation exercises.
 - **GDPR Article 32** — Continuous configuration governance and audit log retention constitute "appropriate technical measures" under the accountability principle. Evidence produced by ASTRAL and PULSAR is directly usable in DPA and auditor reviews.
 ## Quick Start for Executives and Board Members
@@ -2,7 +2,22 @@
 > *"What gets measured gets managed. What gets managed honestly becomes antifragile."*
-This directory contains diagnostic tools, maturity models, and assessment resources for evaluating organizational antifragility. Two production-ready tools are available now; additional assessments are in active development.
+This directory contains diagnostic tools, maturity models, and assessment resources for evaluating organizational antifragility.
 ## Production-Ready Templates
 | Template | Purpose |
 |----------|---------|
 | [Engagement Checklist](engagement-checklist.md) | **Point-in-time, regularly updated.** Controls to inspect on every M365+AD engagement, organized by domain. Not scored — a structured inspection list. Review January 2027. |
 | [Adversarial Validation Checklist](adversarial-validation-checklist.md) | **Phase 2 — mature estates.** Every item is a test, not an inspection. Opening/closing metrics, eight detection simulations, CA ghost policy tests, attack path verification. Review January 2027. |
 | [Self-Service Cadence](self-service-cadence.md) | **Client leave-behind.** Monthly portal checks and quarterly tool runs (PingCastle, Purple Knight, CAExporter, PowerShell scripts) an admin can run between engagements. Includes "call us" triggers. Customise per client before handing over. |
 | [Assessment Team Guide](assessment-team-guide.md) | Technical execution guide for the Brownhat Diagnostic: tool sequence (ASTRAL, PULSAR, BloodHound, Elysium, Purple Knight, CAExporter), what to look for, kill chain synthesis, report structure, common mistakes. |
 | [Findings Backlog](findings-backlog.md) | Single source of truth for all findings across every module and diagnostic. The input queue for the housekeeping stream. Pragmatic alternative to a formal risk register for organisations that do not have one. |
 | [NIST CSF 2.0 Baseline Assessment](nist-csf-baseline.md) | The Brownhat Diagnostic: structured 2-half-day workshop, gap analysis, kill chain identification |
 | [Module Completion Report](module-completion-report.md) | Completion package template for every module; includes backlog update |
 | [Antifragile Risk Register](antifragile-risk-register.md) | Formal risk register template; the backlog feeds into this for organisations with mature risk management |
 | [Risk Register Example](risk-register-example.md) | 8 fully populated entries from a realistic engagement — calibration reference |
 | [M365 Project Risk Register](m365-project-risk-register.md) | M365-specific risk register with phase gates |
 ## Planned Assessments
@@ -0,0 +1,319 @@
 # Adversarial Validation Checklist
 > *For clients who have done the foundational work. Everything here is tested, not inspected.*
 **Last updated:** June 2026
 **Engagement type:** Phase 2 — mature estates
 **Field guide:** [Adversarial Validation Field Guide](../books/field-guide-adversarial-validation.md)
 **Next review:** January 2027
 ---
 ## How to use this
 This checklist assumes the foundational controls are in place. The question is not "does this control exist" — it is "does this control work." Every item is a test. If an item cannot be tested in the current engagement window, mark it as untested and note it as a finding: **an untested control is a broken control, you simply do not know it yet.**
 Before any test: confirm written authorization. Before the first test: capture baseline metrics (BloodHound path count, Entra role assignment export, CA policy JSON export). After the engagement: record the "after" metrics.
 **Notation:**
 `[VERIFY]` — confirm the claim against observed behavior
 `[SIMULATE]` — run the attack or failure scenario, authorized and controlled
 `[MEASURE]` — produce a number; the number is the finding, not pass/fail
 ---
 ## Opening metrics (capture before first test)
 - `[MEASURE]` BloodHound paths to Domain Admin (all paths; then filtered to paths reachable from standard user compromise)
 - `[MEASURE]` Count of active (non-eligible) Global Admin assignments excluding break-glass
 - `[MEASURE]` Count of active (non-eligible) Domain Admin assignments
 - `[MEASURE]` Service principals with escalation-grade Graph permissions (application permissions)
 - `[MEASURE]` CA policies verified to enforce (by prior observation) vs. total CA policies in scope
 - `[MEASURE]` Distinct device IDs in sign-in logs (last 30 days) vs. Intune enrolled device count
 - `[MEASURE]` Alert volume per day (last 30 days) vs. alerts with documented human response
 - `[MEASURE]` Structural changes produced by the last five closed security incidents or alerts
 - `[MEASURE]` Anonymous link count across SharePoint/OneDrive (existing, regardless of current tenant setting)
 - `[MEASURE]` Backup MTTR from last documented restore (if any; if none, record "never tested")
 ---
 ## Section 1 — Identity: the wall
 ### 1.1 Firebreak integrity
 - `[VERIFY]` Pull all Global Admin members and check `onPremisesSyncEnabled` for each. Any `true` value is a P0. "We moved them to cloud-only" is the claim; this is the verification.
 - `[VERIFY]` Trace every path from a simulated on-prem compromise (sync server connector account) to a cloud privileged role. Draw the graph. Each path is a hole in the wall.
 - `[VERIFY]` For each cloud admin: what MFA device are they using, and is that device also used for email and browsing? A Tier 2 device authenticating a Tier 0 role is a tier violation through the MFA layer.
 - `[VERIFY]` Does any admin's MFA authenticator app depend on a phone number or device that is outside the client's MDM? (MFA backup codes stored in iCloud are a personal device dependency for a privileged role.)
 ### 1.2 Break-glass: real test
 - `[SIMULATE]` Sign in to the break-glass Global Admin account.
 - `[MEASURE]` Time from sign-in to alert received by named responder.
 - `[VERIFY]` Alert reaches the named responder (not just fires into a queue). Responder acknowledges.
 - `[VERIFY]` Break-glass sign-in works with zero on-prem dependency (test while sync is stopped, or while on a network with no DC visibility).
 - `[VERIFY]` Break-glass credentials can be retrieved from their storage location without the systems they are recovering (test retrieval physically or procedurally).
 ### 1.3 PIM enforcement
 - `[VERIFY]` For Global Administrator role PIM settings: what is the MFA method required on activation? Confirm it is phishing-resistant (FIDO2 or certificate). Push-approve is a finding.
 - `[SIMULATE]` Activate an eligible GA role from a personal device or a non-compliant device. Is it blocked by a CA policy scoped to role activation?
 - `[SIMULATE]` Request activation requiring approval. Does the approval notification reach the approver with meaningful context (what role, for whom, what justification)? Does the approver act within SLA?
 - `[MEASURE]` Maximum activation time box for GA and Privileged Role Admin. Record in hours. 24-hour window = functionally standing privilege during business hours.
 - `[VERIFY]` Are there any GA assignments that are active (permanent) and are not break-glass accounts? Pull the list; any result is a PIM compliance gap from configuration drift.
 ### 1.4 AD FS (if still running)
 - `[MEASURE]` Token-signing certificate age in days since last rotation.
 - `[SIMULATE]` Golden SAML tabletop: if the private key were obtained, what alert (if any) would fire? Walk through the detection path. Document what is visible and what is not.
 - `[VERIFY]` Is there a signed migration plan with a named date? If not, document as P0 finding — migration tooling is mature; absence of a plan is a decision, not a default.
 ### 1.5 Connector account monitoring
 - `[SIMULATE]` Authenticate as the Entra connector account (Directory Synchronization Accounts) from a host other than the sync server. Does an alert fire?
 - `[MEASURE]` Time from test authentication to alert receipt.
 - `[VERIFY]` If no alert fires: the most DCSync-capable account in the estate is unmonitored. Document as P0.
 ### 1.6 Seamless SSO / AZUREADSSOACC
 - `[VERIFY]` `Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet` — compare to approximate tenant go-live date. If matching: never rotated.
 - `[VERIFY]` If Seamless SSO is not needed for the current device estate (Entra-joined devices on modern auth): document removal as a quick win.
 ---
 ## Section 2 — Privilege: attack paths
 ### 2.1 BloodHound / attack path analysis
 - `[MEASURE]` Total BloodHound paths to Domain Admin.
 - `[MEASURE]` Shortest path (fewest hops) to Domain Admin from a standard user account. Enumerate the specific path.
 - `[MEASURE]` Number of paths involving Kerberoastable service accounts.
 - `[MEASURE]` Number of paths involving ADCS templates (add ACL collection to BloodHound run).
 - `[VERIFY]` Has anyone on the client team reviewed BloodHound output in the last 90 days? If not, the path count from the last review is the stale baseline, not the current state.
 ### 2.2 Kerberoasting: attack and detection
 - `[SIMULATE]` Run Invoke-Kerberoast or Rubeus kerberoast (authorized, test account as origin).
 - `[VERIFY]` Did Defender for Identity, Sentinel, or any SIEM alert on the TGS request pattern?
 - `[MEASURE]` Time from attack to alert receipt (if alert fires).
 - `[SIMULATE]` Attempt to crack the harvested hashes offline. Record which accounts crack and approximate crack time.
 - Finding: accounts that crack quickly + no detection = P0 on both the account and the detection gap.
 ### 2.3 ADCS
 - `[VERIFY]` Run `certipy find` or `Certify.exe find /vulnerable` against the CA. Document any ESC findings.
 - `[VERIFY]` Is the ADCS server on a dedicated Tier 0 or hardened host, or on a standard server? Check who has local admin access.
 - `[VERIFY]` Are there published certificate templates with "Supply subject in request" and enrollment permissions broader than the intended service? (ESC1 pattern)
 - `[SIMULATE]` If ESC1 is found: demonstrate the exploit path (in authorized test context — enroll a cert for a test admin account using the vulnerable template). Show the client the domain admin cert in hand.
 ### 2.4 Service principal dark matter
 - `[VERIFY]` For each service principal with escalation-grade application permissions: ask the room to identify the current owner and current use case. Document every "I don't know."
 - `[VERIFY]` For each: check `lastSignInDateTime` for the service principal. Unused principal + dangerous permissions + non-expiring secret = standing credential that can be activated any time.
 - `[VERIFY]` Are there app registrations with admin consent granted for `Mail.Read`, `Files.ReadWrite.All`, or equivalent — where the granting user or admin is no longer at the organization?
 - `[SIMULATE]` Attempt to use a service principal with dangerous Graph permissions to escalate: assign a role, add an app role assignment, or read all users. Confirm the permission is real and enforced (not just declared).
 ### 2.5 Standing privilege beyond PIM
 - `[VERIFY]` Pull active (not eligible) role assignments for GA, PRA, Security Admin, Exchange Admin. Any active assignment not in the break-glass inventory is a drift finding.
 - `[VERIFY]` Pull Domain Admins and Enterprise Admins. Count them. Ask the client how many they believe exist. Present the actual count. In most estates, the actual count exceeds the belief.
 - `[VERIFY]` Are there administrator accounts with no associated human — service accounts running with Domain Admin because "it was easier at the time"?
 ### 2.6 Local privilege on endpoints
 - `[VERIFY]` Pull local Administrators group membership across a sample of endpoints (10+ devices). Are there accounts beyond the expected (LAPS-managed local admin, Entra-joined device admin, EPM)?
 - `[VERIFY]` Is Windows LAPS deployed and confirmed working? Retrieve a LAPS password for a test device through Intune or the AD attribute. Confirm rotation has occurred (password age < 30 days or per policy).
 - `[VERIFY]` If EPM is deployed: test an elevation request for a controlled binary. Is it logged? Is the log reviewed by anyone?
 ---
 ## Section 3 — Devices: compliance signal gap
 ### 3.1 CA policy enforcement (test each separately)
 For each CA policy in scope, write the expected outcome before looking at the configuration. Then test:
 - `[SIMULATE]` **Legacy auth block:** Authenticate using Basic Auth from a test account (Exchange ActiveSync, SMTP auth, or equivalent). Expected: blocked. Result: ___
 - `[SIMULATE]` **Compliant device gate:** Sign in from a known non-compliant device (personal device, or a managed device taken out of compliance). Expected: blocked from sensitive workloads. Result: ___
 - `[SIMULATE]` **Admin sign-in location gate:** Attempt a PIM role activation from a device outside the named compliant/PAW scope. Expected: blocked. Result: ___
 - `[SIMULATE]` **MFA enforcement:** Sign in as a test user from a new device with no registered session. Expected: MFA challenged. Confirm the MFA method that fires (push-approve vs. FIDO2). Result: ___
 - `[VERIFY]` For any policy that fails to enforce despite correct displayed configuration: recreate from scratch, re-test. Document if ghost policy confirmed.
 - `[VERIFY]` Are there CA policies in report-only mode that should be enabled? Report-only is a test state, not a permanent posture.
 - `[VERIFY]` Break-glass accounts excluded from blocking policies — test the break-glass sign-in path specifically under the conditions a blocking policy would normally fire.
 ### 3.2 Compliance signal quality
 - `[SIMULATE]` Induce a non-compliant state on a test managed device. Record the timestamp.
 - `[MEASURE]` Time from non-compliance induction to Intune state update.
 - `[MEASURE]` Time from non-compliance induction to CA token revocation / session block.
 - `[VERIFY]` Is CAE (Continuous Access Evaluation) active for critical workloads? If yes, measure revocation time for a CAE-supported app vs. a non-CAE app. Present the gap.
 - `[SIMULATE]` Root / jailbreak a test device. Does the jailbreak detection in the compliance policy trigger? How long?
 ### 3.3 Fleet reality check
 - `[MEASURE]` Distinct device IDs in sign-in logs (last 30 days).
 - `[MEASURE]` Intune enrolled device count.
 - `[MEASURE]` Devices in sign-in logs with device compliance state "non-compliant" or "unknown."
 - `[VERIFY]` Are there legacy-auth sign-ins in the logs that bypass device compliance evaluation entirely? Filter by Client App = non-modern entries. Each entry is a device control bypass.
 - `[VERIFY]` Pick 5 devices from the sign-in log that are not in Intune. What data do they have access to? What CA policy, if any, applies to them?
 ### 3.4 Update rings and rollback
 - `[VERIFY]` Are update rings configured with a named pilot group and a broad group with deferral?
 - `[VERIFY]` Is there a named person with the process to halt a broad ring update push? Do they know the procedure? Have they tested it?
 - `[SIMULATE]` (If authorized and non-disruptive) Push a test configuration change to the pilot ring only. Confirm it stays in the pilot ring and does not propagate to broad without explicit promotion.
 ### 3.5 MAM boundary (per platform)
 - `[SIMULATE]` On iOS: copy text from managed Outlook to an unmanaged app. Blocked or not?
 - `[SIMULATE]` On Android: same test. (Do separately — behavior is not symmetric.)
 - `[SIMULATE]` On iOS: "Open in" from a managed email attachment to Files app or an unmanaged viewer.
 - `[SIMULATE]` On either platform: save to local storage or backup to iCloud/Google Drive.
 - `[VERIFY]` For any gap found: confirm it reproduces after device reset. If it does, escalate to vendor. If it does not, investigate configuration.
 ---
 ## Section 4 — Data: does protection travel
 ### 4.1 Label encryption in the wild
 - `[SIMULATE]` Forward a Highly Confidential test document to an external test email address. Open it from a mail client with no tenant authentication. Does encryption prevent access?
 - `[SIMULATE]` Download the same document to an unmanaged device. Does encryption require re-authentication to the tenant?
 - `[SIMULATE]` Share the document via an anonymous link. Access from an unauthenticated browser. Does it open?
 - `[SIMULATE]` Copy/paste content from the document on a managed device under a MAM policy. Is it blocked?
 - `[VERIFY]` For any path where the document opens without authentication: this is an exfiltration route. Document the specific path, the expected control that should have blocked it, and the observed result.
 ### 4.2 DLP enforcement
 - `[SIMULATE]` Send an email from a test account containing content matching a high-value DLP rule (credit card number pattern, national ID format, or the client's custom regex for crown-jewel content). Does DLP intercept it? What action fires (block, override, audit-only)?
 - `[SIMULATE]` Upload the same content to a personal OneDrive or cloud storage from a managed device. Does DLP fire?
 - `[VERIFY]` For DLP rules that fire in audit-only mode: what happens to the audit events? Are they reviewed? By whom? How often?
 - `[VERIFY]` What is the false positive rate for high-sensitivity DLP rules? High false positive rates mean users have learned to override; the rule is not a control.
 ### 4.3 Anonymous links (existing population)
 - `[MEASURE]` Full count of anonymous links across the tenant. (Not the current sharing setting — the existing links that predate any restriction.)
 - `[VERIFY]` Confirm at least one existing anonymous link resolves from an unauthenticated browser. It does — almost certainly. This proves the declared sharing restriction is forward-looking, not retroactive.
 - `[VERIFY]` Can the client produce the anonymous link list and revoke all entries in under 30 minutes? Test the revocation capability, not just the list.
 ### 4.4 Email exfiltration paths
 - `[SIMULATE]` Create a test Inbox rule on a test account forwarding to an external test address. Does anything alert? When?
 - `[VERIFY]` `Get-RemoteDomain Default | Select-Object AutoForwardEnabled` — if False, test whether the Inbox rule still forwards. Document the result (transport-level and client-rule forwarding behave differently).
 - `[VERIFY]` `Get-TransportRule` for any rules with external redirect or blind copy. For each: who created it, when, and is there a documented owner?
 - `[MEASURE]` Time from Inbox rule creation to detection alert (if any).
 ### 4.5 Guest access and reshare chain
 - `[MEASURE]` Total guest count. Guests not signed in for 90+ days. Ratio of stale to active.
 - `[VERIFY]` Do guests have access beyond their original project scope? Pick 5 random active guests and enumerate their group and site memberships.
 - `[SIMULATE]` Share a test document to a test external guest. Have the guest reshare to a second external test account. Can the client observe the second hop? Can they revoke it?
 - `[VERIFY]` Are access reviews running for guests? What is the default action on reviewer non-response?
 ### 4.6 Audit log forensics readiness
 - `[VERIFY]` Confirm audit logging is enabled (Purview > Audit — look for the "Start recording" banner; if it appears, logging is off).
 - `[SIMULATE]` Run a forensic reconstruction: given a specific test user account, reconstruct everywhere they accessed data in the last 7 days. Can you produce a coherent picture from the audit log alone?
 - `[MEASURE]` How far back does the audit log extend for the current licensing tier? Test by querying for a known event at the boundary date.
 - `[VERIFY]` Are admin operations (CA policy changes, role assignments, app consent grants) present in the audit log? Run a query for admin events from the last 30 days and spot-check for completeness.
 ---
 ## Section 5 — Detection: the eight simulations
 For each simulation: run it, record whether the alert fired, record the time from event to human acknowledgment, and record whether the responder acted. The SLA comparison is the finding.
 | Simulation | Alert fires? | Time to human | Action taken | Finding |
 |---|---|---|---|---|
 | Break-glass sign-in | | | | |
 | New Global Admin assigned | | | | |
 | DCSync from non-DC host | | | | |
 | Kerberoasting (TGS pattern) | | | | |
 | Impossible travel (admin account) | | | | |
 | External auto-forward rule created | | | | |
 | Mass download from SharePoint | | | | |
 | OAuth consent grant (sensitive scope) | | | | |
 ### 5.1 Alert queue health
 - `[MEASURE]` Alert volume per day (last 30 days).
 - `[MEASURE]` Alerts with documented human response.
 - `[MEASURE]` Alerts suppressed or auto-closed without human review.
 - `[MEASURE]` Alerts open for more than 48 hours.
 - `[VERIFY]` For every alert category: is there a named owner? An alert category with no named owner is an unread alert category.
 - `[VERIFY]` Pick 5 alerts from the last 30 days that were closed. For each: what action was taken, and what structural change resulted?
 ### 5.2 The feedback loop test
 - `[MEASURE]` Last 5 closed security incidents: structural changes produced (count removals, access reductions, severed couplings — not reminders, training, or "noted in risk register").
 - `[VERIFY]` Is there a post-incident process that explicitly asks: "what structural thing changes as a result of this?"
 - `[VERIFY]` Is the post-incident process blameless on people (encouraging surfacing) and ruthless on structure (demanding a removal or change)?
 ---
 ## Section 6 — Recovery
 ### 6.1 Backup: restore something
 - `[SIMULATE]` Restore a mailbox (or a mailbox item set) from the third-party backup. Time the operation.
 - `[MEASURE]` Actual MTTR from test restore vs. policy-declared RTO.
 - `[VERIFY]` If the actual MTTR exceeds the policy RTO: the policy is a fiction. Document the observed time as the operative figure.
 - `[VERIFY]` Are backups isolated from the estate they protect? Can a Global Admin delete the backup copies?
 - `[VERIFY]` Is there a third-party M365 backup at all? If not: M365 native recycle bin + version history is the only recovery mechanism, and this is a P0 for any organization with business-critical M365 data.
 ### 6.2 AD forest recovery
 - `[VERIFY]` Does a written AD forest recovery runbook exist?
 - `[VERIFY]` Is it stored where it can be retrieved when AD is down? (Not SharePoint. Not AD-authenticated storage.)
 - `[VERIFY]` Has anyone on the team run the procedure — not a tabletop, an actual restore, even in a lab?
 - `[VERIFY]` Does the runbook include: DC restore sequence, metadata cleanup, double KRBTGT rotation, trust resets?
 - Finding if all above are no: the first time AD forest recovery is performed will be during the real disaster. Document as a rehearsal scope item.
 ### 6.3 Configuration known-good
 - `[VERIFY]` Export current CA policies to JSON. Diff against the opening-of-engagement export. For every difference: is there a change record?
 - `[VERIFY]` Are there CA policies that changed since the last documented review without a corresponding change order?
 - `[VERIFY]` If a CA policy was silently modified (intentionally or not), what mechanism would have detected it and when?
 ### 6.4 Break-glass independence
 - `[VERIFY]` Cloud admin recovery path works with no on-prem dependency — confirm by testing while sync is stopped or from a network with no DC visibility.
 - `[VERIFY]` If the primary MFA infrastructure (Microsoft Authenticator, FIDO2 key) is unavailable, is there a recovery path for privileged access that does not itself require privileged access?
 ---
 ## Closing metrics (capture after engagement)
 | Metric | Before | After | Delta |
 |--------|--------|-------|-------|
 | BloodHound paths to DA (from standard user) | | | |
 | Active (non-break-glass) Global Admin assignments | | | |
 | Active (non-break-glass) Domain Admin assignments | | | |
 | CA policies verified by observation (working) | | | |
 | Detection signals tested end-to-end (working) | | | |
 | Anonymous link count | | | |
 | Unmanaged device sign-in % of total | | | |
 | Actual backup MTTR (minutes) | | | |
 | Structural changes from last 5 incidents (before) | | | |
 | Structural changes produced this engagement | | | |
 ---
 ## Engagement close verification
 Before marking the engagement complete:
 - Every finding that was verified by observation has a structural change attached (not a risk register entry — a change).
 - The closing metrics have been calculated and compared to the opening metrics.
 - The break-glass has been tested and works.
 - At least one backup restore has been timed and the MTTR recorded.
 - At least one CA policy has been verified to enforce by a real sign-in with pre-written expected outcomes.
 - At least one detection signal has been tested end-to-end to a human responder.
 - The configuration-as-code export (CA policies, role assignments) has been stored and the client has it.
 - A named date exists for the next adversarial validation cycle.
 The engagement is not complete when the list is walked. It is complete when every finding from observation has become a structural change or a named, dated, owned commitment.
 ---
 *Adversarial Validation Checklist. Updated June 2026. Review alongside the field guide — January 2027.*
@@ -64,7 +64,7 @@ Risks related to loss of control over data, intelligence, or infrastructure.
 | Risk | Kill Chain | T0? | Antifragile Move |
 |------|-----------|-----|-----------------|
-| Proprietary data trains competitor AI models | Data → cloud AI → model improvement → competitive erosion | Yes | Deploy local or Azure OpenAI with data protection guarantees; classify AI data flows |
+| Proprietary data processed by uncontrolled AI | Data → cloud AI → residency/audit exposure → regulatory or competitive risk | Yes | Deploy sovereign or enterprise AI with verified data residency and audit rights; classify all AI data flows |
 | Cloud vendor changes terms or pricing | Terms change → operational disruption → forced migration under duress | Yes | Document exit architecture; maintain data portability; dual-vendor readiness |
 | Vendor discontinues critical service | Service ends → workflow collapse → emergency procurement | T1 | Maintain abstraction layers; escrow agreements; 90-day exit plans |
@@ -0,0 +1,514 @@
 # Assessment Team Guide: Technical Execution of the Brownhat Diagnostic
 > *"The workshop tells you what the client thinks is happening. The tools tell you what is actually happening. Run the tools before the second session — the findings change the conversation."*
 This guide covers the technical execution of the Brownhat Diagnostic. It is the companion to the [NIST CSF 2.0 Baseline Assessment](nist-csf-baseline.md), which covers the workshop methodology. Read both before your first diagnostic.
 **Division of labour**: The workshop facilitator runs the NIST CSF sessions and manages the client conversation. The technical assessor runs the tools, collects evidence, and builds the findings. These can be the same person in smaller engagements, but if you have two people, split them — the findings from Day 2 tool runs should inform the workshop conversation, not interrupt it.
 ---
 ## Before You Arrive: Pre-Engagement Preparation
 ### Access to Request (Before Kickoff)
 Send this checklist to the client IT lead at least 5 business days before Day 1. Missing access on Day 1 is the most common cause of diagnostic delay.
 **M365 / Entra ID**:
 - [ ] Global Reader role in Entra ID (read-only; sufficient for most checks)
 - [ ] Entra ID audit log access (to verify logging is enabled before PULSAR deploys)
 - [ ] Exchange admin centre read access
 - [ ] SharePoint admin centre read access
 - [ ] Intune read access (Device Management / Endpoint Manager)
 - [ ] Microsoft Secure Score access
 - [ ] Conditional Access policies read access
 **Active Directory**:
 - [ ] Domain User account on the domain(s) — BloodHound collection only needs this
 - [ ] Read access to ADUC (Active Directory Users and Computers)
 - [ ] Ability to run PowerShell on a domain-joined machine (for BloodHound collector and Elysium — see notes below on Elysium privilege requirements)
 **Network / Infrastructure**:
 - [ ] Access to firewall management interface (read-only; to review ruleset)
 - [ ] VPN access or on-site working arrangement for Day 2 tool runs
 - [ ] Previous pentest or audit reports (if any exist)
 **Documents to request**:
 - [ ] Network diagram (any version, however outdated — better than none)
 - [ ] Asset inventory or CMDB export (even a spreadsheet)
 - [ ] Previous security audit or pentest report
 - [ ] List of third-party SaaS tools (from procurement or IT)
 - [ ] Organisational chart for IT/security team
 > **What to do if access is not ready**: Do not delay the workshop waiting for full access. Start Session 1 with what you have. Deploy ASTRAL and PULSAR as soon as any M365 access is confirmed — they produce value from minute one. Tool runs that need AD access can happen Day 2–3 once an account is provisioned.
 ### Tool Preparation
 Have these ready to deploy before Day 1. Do not learn a tool at a client's expense.
 | Tool | Preparation | Time to deploy |
 |------|-------------|---------------|
 | ASTRAL | ADO project created; pipeline YAMLs ready; `bootstrap-tenant.ps1` reviewed | 2–4 hours on-site |
 | PULSAR | Docker Compose environment ready; `bootstrap-tenant.ps1` reviewed | 1–2 hours on-site |
 | BloodHound CE | Installed on assessment laptop; SharpHound collector downloaded | 15 minutes |
 | Elysium | Cloned and tested in lab; KHDB download initiated (large file — download before arriving) | 30 min setup; KHDB download 30–60 min |
 | CAExporter | Downloaded and tested | 10 minutes |
 | Purple Knight | Downloaded from Semperis (free, requires registration) | 15 minutes |
 | E8-CAT | Downloaded and tested (for Australian clients or E8-aligned clients) | 10 minutes |
 | Nmap / Shodan | Nmap installed; Shodan account active (free tier sufficient) | Ready |
 ---
 ## Day 1: Deploy First, Ask Questions Later
 The single most important discipline: **deploy ASTRAL and PULSAR before the first workshop session begins.** The baseline they capture is a point-in-time snapshot. If you wait until after the workshops, the baseline may already reflect changes the client made in response to your questions.
 ### Morning: Deploy Listening Tools
 Before Session 1 starts, or during the first 30-minute introductions slot:
 **Step 1 — ASTRAL deployment** (~2 hours, can run in background)
 ```powershell
 # On the client's Azure DevOps or your assessment instance
 .\deploy\bootstrap-tenant.ps1 -TenantName "<client>.onmicrosoft.com"
 ```
 Follow the [ASTRAL onboarding runbook](https://github.com/cqrenet/astral/blob/main/deploy/onboarding-runbook.md). The initial full backup pipeline run captures the complete M365 configuration baseline. This is your "before" snapshot — everything you find during the assessment is measured against this.
 **What ASTRAL captures on first run**:
 - All Intune profiles, policies, compliance policies, applications, scripts
 - All Conditional Access policies (with full named-object resolution via CAExporter integration)
 - All Entra ID app registrations and enterprise applications
 - All authentication methods and named locations
 - Produces HTML/PDF as-built documentation automatically
 **Step 2 — PULSAR deployment** (~1 hour, can run in background)
 ```bash
 cp .env.example .env
 # Fill in CLIENT_ID, CLIENT_SECRET, TENANT_ID from bootstrap output
 docker compose up --build -d
 ```
 Once running, trigger a manual fetch to confirm audit log ingestion is working:
 ```
 GET http://localhost:8000/api/fetch-audit-logs
 ```
 **What PULSAR captures immediately**: All M365 admin audit events from the Management Activity API (Exchange, SharePoint, Teams, Entra, Intune). Retention starts from this moment — every admin action from here forward is permanently searchable. For clients with no prior log retention, this is instant value.
 **Step 3 — Microsoft Secure Score baseline** (10 minutes)
 Navigate to `security.microsoft.com → Secure Score`. Screenshot the current score and the top 10 recommended actions. This is a quick reference point for the workshop conversation and gives the client a number they immediately understand.
 **Step 4 — Passive external scan** (runs in background during workshop)
 ```bash
 # From your assessment machine
 nmap -sV --open -p 80,443,8080,8443,3389,22,21,25,993,995 [client-public-IPs]
 # Shodan CLI for ASN-based discovery
 shodan search "org:[client-org-name]" --fields ip_str,port,banner,product
 ```
 Also check:
 - Certificate transparency logs: `crt.sh/?q=[client-domain]` — reveals subdomains, expired certs, shadow IT domains
 - Shodan for the VPN endpoint specifically: firmware version, known CVEs
 - `whois` and reverse DNS for all IP ranges the client mentions
 ---
 ## During the Workshops: What to Listen For
 The [NIST CSF Baseline](nist-csf-baseline.md) has the full question set. Below are the specific signals to listen for that indicate P0/P1 findings. Note these immediately — they feed the technical checklist for Day 2.
 | What the client says | What it likely means | Check on Day 2 |
 |---------------------|---------------------|----------------|
 | "We haven't tested our backups recently" | No restore has ever been done | Recovery drill required; check backup destination |
 | "We use shared admin accounts" | Multiple people using one credential | Elysium; AD audit; no MFA possible on shared account |
 | "Contractors have the same access as employees" | Likely no offboarding process; stale accounts | Elysium; AD account audit; HR cross-reference |
 | "We have MFA but I think some people have exemptions" | CA policies in report-only or with large exclusion groups | CAExporter; Entra ID CA policy review |
 | "The acquisition brought in a second AD" | Forest trusts; uncharted attack paths; duplicate admin accounts | BloodHound must cover both domains |
 | "We use [legacy on-prem system] with its own accounts" | Shadow identity; service accounts not in scope of central IAM | Manual AD service account audit |
 | "IT handles offboarding when HR tells us" | Offboarding depends on HR notification — often delayed | Elysium; compare AD accounts to HR list |
 | "I'm not sure who all has admin access" | No privileged access inventory | BloodHound; ADUC privileged group audit |
 | "We have a firewall but nobody has reviewed the rules in years" | Accumulated rules; likely any/any entries; retired services still open | Firewall rule export and review |
 | "Some of our developers have direct access to production" | Uncontrolled privileged access to production systems | Scope question for Module 6 |
 ---
 ## Day 2–3: Technical Tool Runs
 Run tools in this order. Earlier tools inform later ones.
 ### 1. CAExporter — Conditional Access Baseline (30 minutes)
 Run first. The CA policy export reveals whether MFA is actually enforced or just configured. This is consistently the most surprising finding in M365 environments.
 ```powershell
 # Requires Entra ID reader access
 .\CAExporter.ps1 -TenantId <tenant-id> -OutputPath .\ca-export\
 ```
 **What to look for**:
 - Policies in **Report-Only** mode (not enforced — common; clients assume they are protected when they are not)
 - Large **exclusion groups** containing most users ("AllUsers_ExceptionGroup" type)
 - Policies that claim to block legacy authentication but have exclusions that defeat the purpose
 - No policy enforcing device compliance
 - Multiple overlapping policies with unclear precedence
 **Output**: Excel workbook with one row per policy, conditions and controls expanded, groups and apps named rather than showing GUIDs. This is the CA baseline document.
 ---
 ### 2. BloodHound — AD Attack Path Analysis (1–2 hours collection + analysis)
 ```powershell
 # Run SharpHound from a domain-joined machine using the assessor domain account
 .\SharpHound.exe -c All --zipfilename nexus-bloodhound.zip
 ```
 Copy the zip to your assessment machine and import into BloodHound CE.
 **Required queries** (run these first, every engagement):
 ```cypher
 -- Shortest paths to Domain Admin from all non-admin users
 MATCH p=shortestPath((u:User {admincount:false})-[*1..]->(g:Group {name:"DOMAIN ADMINS@DOMAIN.LOCAL"})) RETURN p
 -- All Domain Admin members with direct login sessions on workstations
 MATCH (u:User)-[:MemberOf]->(g:Group {name:"DOMAIN ADMINS@DOMAIN.LOCAL"})
 MATCH (u)-[:HasSession]->(c:Computer) WHERE NOT c.name CONTAINS "DC" RETURN u.name, c.name
 -- Kerberoastable accounts with high privilege
 MATCH (u:User {hasspn:true}) WHERE u.admincount=true RETURN u.name, u.serviceprincipalnames
 -- ASREPRoastable accounts (no Kerberos pre-auth)
 MATCH (u:User {dontreqpreauth:true}) RETURN u.name, u.enabled
 -- Service accounts with paths to Domain Admin
 MATCH p=shortestPath((u:User)-[*1..5]->(g:Group {name:"DOMAIN ADMINS@DOMAIN.LOCAL"}))
 WHERE u.name CONTAINS "$" OR u.name CONTAINS "SVC" OR u.name CONTAINS "SERVICE"
 RETURN p
 ```
 **What to document**:
 - Number of paths to Domain Admin from non-admin users (the "847 paths" number from the sample)
 - Shortest path length and the specific nodes on it — this is your kill chain
 - Domain Admins with sessions on non-DC workstations — P1 finding in almost every environment
 - Any service accounts that are Kerberoastable and have high privilege — often P0
 - KRBTGT last password set date (check in ADUC or PowerShell)
 ```powershell
 # KRBTGT last password set
 Get-ADUser krbtgt -Properties PasswordLastSet | Select PasswordLastSet
 ```
 ---
 ### 3. Elysium — Password Audit (2–4 hours, requires elevated AD access)
 > **Privilege requirement**: Elysium requires Domain Admin or equivalent (DSInternals needs to read password hashes). Confirm this access before scheduling. If it cannot be arranged during the diagnostic, schedule it for week 1 of Module 6.
 ```powershell
 # Run from a domain controller or with delegated rights
 .\Elysium.ps1 -Domain <domain-fqdn> -OutputPath .\elysium-output\
 ```
 **What Elysium finds**:
 - Accounts matching known-breached password hashes (from the KHDB — download before arriving)
 - Accounts with blank passwords
 - Accounts with passwords matching dictionary patterns
 - Duplicate passwords across accounts (shared credential detection)
 **Output to document**:
 - Total accounts audited
 - Accounts matching KHDB (breached) — split by privileged vs non-privileged
 - Accounts with common passwords
 - Any privileged account with a compromised or weak password → immediate P0
 **Privacy handling**: Elysium does not transmit usernames or plaintext passwords. The KHDB comparison is local. The output is a list of SAMAccountNames to reset — not passwords. Communicate this clearly to the client before running.
 ---
 ### 4. Purple Knight — AD Security Scoring (30 minutes)
 Purple Knight (Semperis, free) runs a broad checklist of AD security misconfigurations. Run it from any domain-joined machine.
 ```powershell
 .\PurpleKnight.ps1
 ```
 The report scores against ~100 indicators. **Focus on**:
 - LDAP signing and channel binding status
 - AdminSDHolder unusual members
 - Protected Users group membership (or absence of it for admins)
 - Reversible encryption enabled accounts
 - Unconstrained delegation (computers and users)
 - Machine account quota (default 10 — often abused for relay attacks)
 - Exchange permissions on AD objects (if Exchange exists on-prem)
 Cross-reference Purple Knight findings with BloodHound. Purple Knight finds the indicators; BloodHound shows how they chain together into attack paths.
 ---
 ### 5. Entra ID Manual Checks (1 hour)
 These cannot be automated — they require visual inspection in the Entra admin centre.
 **App registrations and enterprise applications**:
 - Navigate to: `Entra ID → App registrations → All applications`
 - Filter by: "High privilege permissions" — look for `Mail.ReadWrite`, `Directory.ReadWrite.All`, `User.ReadWrite.All`
 - Note any apps with these permissions that are: (a) published by unknown parties, (b) have no documented owner, (c) were consented to by users rather than admins
 - This is consistently where the most surprising findings live — OAuth consent abuse is underdetected in every mid-market environment
 **Guest accounts**:
 - Navigate to: `Entra ID → Users → Filter: User type = Guest`
 - How many guests are there? When was their last sign-in? Are any of them former contractors?
 **MFA registration status**:
 - Navigate to: `Entra ID → Users → Per-user MFA` (legacy view) OR `Identity Protection → Monitoring → Authentication methods → User registration details`
 - What % of users have MFA registered? What % have it enforced?
 - Are there any break-glass accounts? Are they properly protected and audited?
 **Entra ID Connect sync account** (hybrid environments only):
 - Navigate to: `Entra ID → App registrations → find the sync account`
 - Check what rights it has in Entra ID
 - Cross-reference with on-prem AD: does this account have DCSync rights? (BloodHound query: search for the account name and check its paths)
 ---
 ### 6. Intune / Endpoint Check (30 minutes — via ASTRAL output or direct)
 ASTRAL's first run will have produced an Intune inventory. Review:
 - **Enrollment rate**: What % of devices are enrolled? What platforms?
 - **Compliance policy coverage**: Is there a compliance policy? What does it enforce? Is it assigned to all devices?
 - **Conditional Access integration**: Is the "Require compliant device" CA policy active — or in report-only?
 - **Stale devices**: Devices with last check-in > 90 days are likely personal devices or ghost entries. Note the count.
 - **Script inventory**: What PowerShell scripts are deployed via Intune? Any that look unfamiliar?
 ---
 ### 7. External Attack Surface (30–60 minutes)
 By Day 2, the Nmap and Shodan scans from Day 1 should have results.
 **Review**:
 - Any RDP (3389) exposed to internet → P0 in almost every context
 - Any management interfaces (firewalls, switches, VPN management) accessible from internet
 - Any services with outdated banners suggesting old software versions
 - Certificate expiry on any internet-facing service
 - VPN endpoint firmware version → check against vendor advisory for known CVEs
 **Additional check — subdomain enumeration**:
 ```bash
 # Using crt.sh results and DNS brute force
 cat crt-sh-results.txt | grep "<client-domain>" | sort -u
 # For each subdomain found: what is it? Is it documented? Is it still active?
 ```
 Undocumented subdomains pointing to forgotten services are a regular P1 finding.
 ---
 ### 8. Firewall Rule Review (30–60 minutes)
 Request an export of the firewall ruleset. Most firewall platforms support CSV or XML export.
 **What to look for**:
 - Rules with `source = ANY` and `destination = ANY` (any/any) → almost always P2 but sometimes P1 if it covers a sensitive segment
 - Rules allowing direct internet access from server VLANs → P1
 - Rules created for a specific project that are still active years later → P2
 - Rules referencing IP addresses that no longer correspond to live systems
 - No rule for blocking outbound traffic (egress filtering absent) → P1 for environments with sensitive data
 ---
 ### 9. Backup and Recovery Spot Check (30 minutes)
 Ask the IT lead to show you, live:
 - Where backups are stored (destination)
 - When the last backup ran and whether it completed successfully
 - Whether the backup destination is on the same network segment as the system being backed up
 - Whether anyone has ever triggered a test restore and what the result was
 > **The standard answer**: "Backups run every night and we get a green tick." The right follow-up: "Show me the most recent successful restore test." In most environments, one has never been performed.
 Document: backup target, last run, completion status, last restore test (date or "never").
 ---
 ## Synthesising Findings: From Data to Kill Chain
 After tool runs are complete, before writing the report, do this step explicitly. Sit with your notes and answer one question:
 **"What is the shortest sequence of steps an adversary with no prior access could take to cause the organisation to fail to operate?"**
 Build the kill chain step by step:
 1. Start from the outside (what can be accessed without credentials?)
 2. What is the first credential gain? (phishing, password spray against legacy auth, VPN without MFA)
 3. What does that credential give access to? (M365 if MFA is not enforced; VPN if no MFA there)
 4. What can you do with M365 access? (read all email, access SharePoint, escalate via app permissions)
 5. What is the path from M365 access to domain admin? (Entra ID admin → AD Connect sync account → DCSync)
 6. What does domain admin give you? (everything on-prem, including ERP, backup servers)
 7. What is the impact? (data exfiltration, ransomware, operational disruption)
 Write this as a chain, not a list. The [sample engagement kill chain](sample-engagement-mid-market.md#kill-chain-assessment) shows the format.
 ---
 ## Finding Triage and Priority Assignment
 For every finding, apply the kill chain test:
 | Question | Priority |
 |----------|----------|
 | Is this a node on the kill chain? | **P0** — fix before anything else |
 | If exploited, does material harm result even if not on the kill chain? | **P1** — fix this engagement |
 | Real finding, real risk, but not on the kill chain and not immediately material? | **P2** — housekeeping queue |
 | Best practice recommendation with no exploitable risk? | **Observation** — note in report, do not count as a finding |
 **Common priority inflation mistakes**:
 - Marking "no security awareness training programme" as P0 — it is P2 at most
 - Marking every missing patch as P0 — only patches for internet-facing or kill-chain systems
 - Marking "weak password policy" as P0 when Elysium shows no actual weak passwords — the policy is P2; actual weak credentials on privileged accounts are P0
 ---
 ## Quick Wins Identification
 A quick win must pass three tests:
 1. **Closeable in hours or days, not weeks** — requires no procurement, no change window longer than one day, no significant testing
 2. **Uses only existing tools and permissions** — no new purchase, no new deployment
 3. **Meaningfully reduces risk** — not cosmetic
 For M365/AD environments, the standard quick wins checklist:
 - [ ] Activate CA policies already in Report-Only mode
 - [ ] Remove large exception groups from CA compliance policies
 - [ ] Block legacy authentication (CA policy template exists in every tenant)
 - [ ] Enforce MFA at organisation level in GitHub / other SaaS tools
 - [ ] Disable accounts confirmed as departed contractors (HR-verified, scripted disable)
 - [ ] Enable audit logging where it is off (often disabled on legacy servers to save disk)
 - [ ] Revoke suspicious OAuth app permissions (for obvious unknowns with high privilege)
 - [ ] Change default credentials on any system where they are confirmed unchanged
 ---
 ## Report Structure
 The Brownhat Diagnostic report has five sections. Target length: 15–25 pages. Not more — if it is longer, it will not be read.
 ### 1. Executive Summary (2 pages)
 - Current state in one paragraph — honest, not alarming
 - Kill chain: the specific path, named, diagrammed if possible
 - P0 count, P1 count, P2 count
 - Quick wins: what was closed immediately (if Day 1 quick wins were executed)
 - Recommended first module and rationale
 - NIS2 compliance gap summary (if applicable): which Article 21 measures have evidence, which do not
 ### 2. Methodology (0.5 pages)
 - Workshop dates, attendees
 - Tools used (ASTRAL, PULSAR, BloodHound, Elysium, Purple Knight, CAExporter, external scan)
 - Access used (read-only Entra ID, domain user for BloodHound, domain admin for Elysium)
 - What was NOT assessed (explicitly scoped out — sets expectations)
 ### 3. Findings (8–15 pages)
 Organise by priority tier, not by domain.
 **P0 — Kill Chain Nodes**: Each finding gets a half-page: the finding in one sentence, the evidence, the business impact in non-technical language, and the remediation. Name the specific accounts, policies, or systems involved. "Admin accounts lack MFA" is a weak finding. "3 of 5 Global Administrator accounts — `admin@nexus.onmicrosoft.com`, `it-admin@nexus.onmicrosoft.com`, and the break-glass account — can authenticate without MFA because the Conditional Access policy 'Require MFA' is in Report-Only mode" is a finding.
 **P1 — Material Risk**: Same format, briefer. One paragraph per finding.
 **P2 — Housekeeping Queue**: Table format only. ID, finding, why it matters in one sentence.
 ### 4. Module Recommendation (2 pages)
 - Recommended sequence with rationale
 - What each module closes (map to specific P0/P1 findings)
 - Timeline estimate
 - Investment estimate (effort ranges, not day rates — rates go in the proposal)
 ### 5. Quick Wins Closed (0.5 pages)
 List what was already fixed during the diagnostic. This is the most important page for client confidence — they paid for the diagnostic and something is already better.
 ---
 ## Backlog Population
 Before leaving the client site (or within 24 hours):
 1. Create the ADO Work Items project (or agree on the tool with Ondřej)
 2. Enter every finding as a Work Item: ID, finding text (one sentence), source (Brownhat Diagnostic), priority (P0/P1/P2), owner (named person)
 3. Move quick wins to Closed with the date they were resolved
 4. Brief the named IT lead on the backlog: where it lives, how the monthly cycle works, who owns what
 5. Pin the ADO board as a Teams tab if applicable
 The backlog handover is not optional. A diagnostic that produces a report but no maintained tracking system has a half-life of one steering committee meeting.
 ---
 ## ASTRAL and PULSAR Handover
 By the end of the diagnostic engagement:
 **ASTRAL**:
 - First full backup has run and committed to the ADO repository
 - Client IT lead can access the ADO project and review the baseline
 - Drift detection is live — the first drift PR, if one occurs, should be reviewed together with the client as a training exercise
 - Reviewer notification configured to email or Teams-notify Ondřej
 **PULSAR**:
 - Audit events ingesting and searchable
 - Teams tab pinned in the IT channel
 - Basic search walkthrough done with client IT lead: show them how to find a specific event, how to filter by actor and operation
 - No alert rules yet — those come in Module 2/3 when there is a hardened baseline to alert against
 ---
 ## Common Mistakes in Assessment Execution
 **Starting tool runs before access is confirmed.** Tool runs that fail eat time and erode confidence. Confirm credentials work before you need them.
 **Running Elysium without telling the client what it does.** "We are going to compare your password hashes against a database of known-compromised credentials" needs to be explained before it happens. Most clients are fine with it once they understand the privacy model. Zero clients want a surprise.
 **Presenting findings before you have run BloodHound.** The kill chain often only becomes clear once BloodHound has shown how the pieces connect. Do not anchor the client on an incomplete kill chain in Session 2 and then have to walk it back.
 **Marking everything P0.** If you present 15 P0 findings, the client has no way to act. Real P0 items are rare — typically 3–8 in a first diagnostic. If you have more, re-examine your priority assignments.
 **Leaving without a named owner for every P0.** The diagnostic ends. The report goes out. Nobody fixes the P0 items because nobody has their name on them. Get owner names in the room before you leave.
 **Forgetting to document what you ran and what access you used.** The methodology section of the report should be written from notes taken during the assessment, not reconstructed from memory three days later.
 ---
 ## Post-Assessment Checklist
 Before submitting the report:
 - [ ] Kill chain written as a chain, not a list
 - [ ] Every P0 finding has: evidence citation, specific named assets, remediation steps, named owner
 - [ ] Quick wins section lists what was already fixed
 - [ ] Module recommendation is tied to specific findings ("Module 2 closes P0-001, P0-002, P1-001, P1-004")
 - [ ] ASTRAL baseline committed and accessible to client
 - [ ] PULSAR ingesting and accessible to client
 - [ ] Findings backlog populated in agreed tool, owners assigned
 - [ ] Report reviewed for any claim that is an assertion rather than evidence (replace with what was found)
 - [ ] NIS2 compliance map completed if client is in scope
 - [ ] Next steps section includes: module recommendation, first meeting date, decision required from client
 ---
 *Companion documents:*
 *[NIST CSF 2.0 Baseline Assessment](nist-csf-baseline.md) — workshop methodology and questionnaires*
 *[Sample Engagement: Mid-Market Hybrid](../playbooks/sample-engagement-mid-market.md) — calibration reference for findings and recommendations*
 *[Findings Backlog](findings-backlog.md) — where findings land and how the housekeeping stream works*
 *[Sovereign Tool Stack](../playbooks/sovereign-tool-stack.md) — full tool reference with deployment guidance*
 *[Module Menu](../core/modular-engagements.md) — module selection after the diagnostic*
@@ -0,0 +1,389 @@
 # M365 + AD Engagement Checklist
 > *Not a benchmark. Not scored. A structured inspection list for consultants on active engagements.*
 **Last updated:** June 2026
 **Companion to:** [Field Guide 2026](../books/field-guide-2026.md) · [Books I–VI](../books/)
 **Next review:** January 2027
 ---
 ## How to use this
 Work through the relevant sections during the Brownhat Diagnostic or at the start of a module engagement. Each item is a control area — something to inspect and a question to answer honestly. Mark items that surface findings. Mark items that are verified clean. If an item is not applicable, note why.
 This is not a scoring tool. "Found" and "clean" are the only states that matter. A clean item with no evidence of testing is the same as not checked.
 **Notation used below:**
 - `[LOOK AT]` — inspect and document current state
 - `[TEST]` — verify by observation, not by reading the config
 - `[ASK]` — a question that requires a conversation, not just a portal check
 Nothing here replaces the governing question from Book I:
 > **If this is owned tonight, what is the largest thing an attacker reaches before hitting a wall — and can I draw that wall?**
 ---
 ## Section A — Hybrid Identity
 ### A1. Authentication Method
 - `[LOOK AT]` Which authentication method is actually in use: PHS, PTA, or Federation (AD FS)?
 - `[LOOK AT]` Does the method shown in the Entra portal match what is documented and what IT staff believe to be true?
 - `[TEST]` If on-prem AD is simulated as unavailable (pull the sync server), does cloud authentication survive? Which auth method does this actually prove?
 - `[LOOK AT]` Is PHS running alongside PTA as a failover? (Optionality — cheap insurance)
 - `[LOOK AT]` If on PTA: how many PTA agents are deployed, and what host/network tier are they on?
 ### A2. Sync Engine (Entra Connect / Cloud Sync)
 - `[LOOK AT]` Which sync engine is running: Entra Connect Sync or Entra Cloud Sync?
 - `[LOOK AT]` What server hosts the sync engine, and what domain/tier is it joined to?
 - `[LOOK AT]` What account runs the on-prem connector service, and does it have `Replicate Directory Changes All` (DCSync capability)?
 - `[LOOK AT]` What is the patch / update level of the sync server (OS and sync software)?
 - `[LOOK AT]` Who has local administrator rights on the sync server?
 - `[LOOK AT]` What does the Entra connector account (Directory Synchronization Accounts role) have permission to do in the cloud?
 - `[TEST]` If the connector account is monitored: does an alert fire when it authenticates from an unexpected host?
 - `[LOOK AT]` Are there active alerts or errors in the sync engine health dashboard?
 ### A3. AD FS
 - `[LOOK AT]` Is AD FS deployed and active?
 - `[ASK]` If yes: why is it still running? What relying party trusts require it, and is there a migration plan?
 - `[LOOK AT]` When was the token-signing certificate last rotated? Where is the private key stored?
 - `[LOOK AT]` Is the rollover certificate about to expire?
 - `[LOOK AT]` Which servers host AD FS, and what network tier and patching cadence do they have?
 - `[TEST]` Golden SAML tabletop: if the token-signing key were obtained, what would detection see, and how fast could the cert be rotated? Is the procedure written and tested?
 - `[ASK]` Is there a Entra staged rollout in progress or planned to migrate away from federation?
 ### A4. Privileged Account Sync
 - `[LOOK AT]` Are any Domain Admins, Enterprise Admins, or other Tier 0 accounts synced to Entra ID (i.e., present as cloud objects)?
 - `[LOOK AT]` Are Global Admins or other Entra privileged role holders cloud-only accounts, or synced from on-prem?
 - `[LOOK AT]` Are admin accounts (on-prem or cloud) using the same device for privileged work as for daily tasks (email, browsing)?
 ### A5. Writebacks
 - `[LOOK AT]` Which writebacks are enabled: password writeback, group writeback, device writeback?
 - `[ASK]` For each: who owns the decision, and is the reverse blast radius (cloud compromise → on-prem impact) documented?
 - `[LOOK AT]` Is group writeback (v2) enabled? If so, which cloud groups write into AD, and what on-prem resources do they gate?
 ### A6. Seamless SSO
 - `[LOOK AT]` Is Seamless SSO enabled?
 - `[LOOK AT]` When was the `AZUREADSSOACC` Kerberos key last rotated? (`Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet`)
 - `[ASK]` Is Seamless SSO actually needed, or can it be removed (Entra-joined devices + modern auth typically do not require it)?
 ### A7. Sync Scope
 - `[LOOK AT]` Is sync scoped to specific OUs, or is "sync everything" the default?
 - `[LOOK AT]` Are there synced objects that serve no cloud purpose (decommissioned systems, service accounts, administrative accounts)?
 ### A8. Breach Optionality
 - `[ASK]` Is there a written, accessible runbook for severing the AD↔Entra bridge under breach conditions?
 - `[TEST]` Is the runbook stored somewhere accessible when both AD and SharePoint are unavailable?
 - `[ASK]` Has anyone walked through the "kill the sync" procedure, and does the team know what breaks per auth method?
 - `[LOOK AT]` Does the cloud admin path (break-glass Global Admin) work with zero on-prem dependency?
 ---
 ## Section B — Privileged Access
 ### B1. Standing Privilege Inventory
 - `[LOOK AT]` How many identities hold standing (permanent, active) privilege: Global Admin, Privileged Role Admin, Domain Admin, Enterprise Admin?
 - `[LOOK AT]` Are there any standing Global Admin assignments that are not break-glass accounts? (Should be zero)
 - `[LOOK AT]` How many Domain Admins and Enterprise Admins exist, and are they all justified with named owners?
 - `[ASK]` When was the privileged account list last reviewed, and by whom?
 ### B2. Admin workstations and management plane
 - `[ASK]` What do admins use to reach a domain controller remotely? Is that path independent of the AD it manages, or does it depend on AD for authentication?
 - `[LOOK AT]` Do admins use the same device for privileged work (DC management, PIM activation) and daily tasks (email, browsing)?
 - `[ASK]` Is there a dedicated admin workstation — physical PAW or cloud admin VM (Windows 365 / AVD) — that is used only for privileged tasks?
 - `[LOOK AT]` If a cloud admin VM exists: is it enrolled in Intune with a hardened profile? Is it excluded from email and general browsing? Is it the device scoped in the CA policy restricting privileged role access?
 - `[LOOK AT]` Is there a management overlay (Nebula, Tailscale, Headscale) providing the admin access path to on-prem Tier 0 systems?
 - `[ASK]` If a Nebula T0 overlay exists: where is the CA key stored? Who can sign new node certificates? When was the last signing ceremony?
 - `[ASK]` If a Tailscale T1 overlay exists: is key expiry configured? Does re-authentication require phishing-resistant MFA via Entra?
 - `[LOOK AT]` For multi-cloud clients without a physical data centre: is the management plane explicitly designed, or is access to cloud management consoles and on-prem servers done ad hoc (VPN, direct RDP, per-cloud bastion, no unified plane)?
 ### B3. PIM / JIT
 - `[LOOK AT]` Is Entra PIM deployed and enforced for Entra administrative roles?
 - `[LOOK AT]` Are Entra roles set to eligible (not active) by default?
 - `[LOOK AT]` Does PIM activation require phishing-resistant MFA (FIDO2 / certificate), or just push-approve?
 - `[LOOK AT]` Do crown roles (Privileged Role Administrator, Global Administrator) require approval workflow on PIM activation?
 - `[LOOK AT]` What is the maximum activation time-box configured? (Should be justified and bounded — 8 hours maximum for a working day)
 - `[LOOK AT]` Is PIM alert configuration enabled (Roles activated without MFA, Redundant assignments, etc.)?
 - `[ASK]` For on-prem DA/EA: is there any JIT or time-limited elevation mechanism in place?
 ### B4. Service Accounts (On-Prem)
 - `[LOOK AT]` Are there service accounts with SPNs and static passwords older than 12 months? (Kerberoastable)
 - `[LOOK AT]` Which service accounts are over-permissioned (e.g., Domain Admin, local admin on all servers)?
 - `[LOOK AT]` Which service accounts have been migrated to gMSA?
 - `[LOOK AT]` Are there service accounts nobody can identify a current owner for?
 - `[TEST]` Run a Kerberoast simulation: do ticket requests for service account SPNs generate any detection?
 ### B5. Service Principals & App Registrations (Cloud)
 - `[LOOK AT]` Which app registrations hold escalation-grade Graph permissions (application permissions): `RoleManagement.ReadWrite.Directory`, `AppRoleAssignment.ReadWrite.All`, `Application.ReadWrite.All`, `Directory.ReadWrite.All`?
 - `[LOOK AT]` Which app registrations have non-expiring client secrets?
 - `[LOOK AT]` Are there orphaned app registrations with no current owner?
 - `[LOOK AT]` Which apps have tenant-wide admin consent, and is each justified and reviewed?
 - `[LOOK AT]` Which Azure workloads use client secrets instead of managed identities where managed identities are available?
 ### B6. Tier Model / Clean Source
 - `[LOOK AT]` Do Domain Admins / Enterprise Admins authenticate from standard workstations used for email and browsing?
 - `[LOOK AT]` Is ADCS (Active Directory Certificate Services) deployed? If so, is it on a Tier 0 or hardened host, or on a standard server?
 - `[LOOK AT]` Are there shared administrative jump boxes that cross tier boundaries (used for both Tier 0 and Tier 1 work)?
 - `[LOOK AT]` Do cloud admins use the same device for privileged Entra work as for daily activity?
 ### B7. Escalation Paths
 - `[LOOK AT]` Are there accounts with `GenericAll`, `WriteDACL`, or `WriteOwner` on high-value AD objects (domain root, DCs, admin groups) that are not themselves Tier 0?
 - `[LOOK AT]` Are there computers with unconstrained delegation enabled (excluding DCs)?
 - `[LOOK AT]` When was KRBTGT last rotated? (`Get-ADUser krbtgt -Properties PasswordLastSet`)
 - `[LOOK AT]` Is LAPS (Windows LAPS preferred) deployed across all workstations and servers? What is the coverage percentage?
 - `[TEST]` Run BloodHound (or equivalent) and count attack paths to Domain Admin. Note the number as a baseline. Is it going up or down over time?
 ### B8. Break-Glass
 - `[LOOK AT]` Do cloud-only break-glass Global Admin accounts exist?
 - `[LOOK AT]` Is phishing-resistant authentication (FIDO2 or certificate) configured on break-glass accounts?
 - `[LOOK AT]` Are break-glass accounts excluded from the CA policies that would otherwise enforce device compliance or block sign-in?
 - `[LOOK AT]` Does any use of the break-glass account trigger an immediate, monitored alert?
 - `[TEST]` Sign in to the break-glass account in a controlled drill. Does it work? Does the alert fire? Does someone respond?
 - `[ASK]` Where are the break-glass credentials stored, and can they be retrieved without the systems they recover?
 ### B9. Phishing-Resistant MFA for Admins
 - `[LOOK AT]` What MFA method is enforced for Global Admins: FIDO2, certificate-based auth, or push/SMS?
 - `[LOOK AT]` Push-approve and SMS are not acceptable for administrative accounts. If they are in use, that is a P0.
 - `[LOOK AT]` Is there a CA policy restricting privileged role activation to compliant/managed devices or named PAWs?
 ---
 ## Section C — Devices & Endpoint
 ### C1. Fleet Reality
 - `[LOOK AT]` Reconcile: Intune enrolled devices vs. Entra registered devices vs. sign-in log device population. What is the gap?
 - `[LOOK AT]` How many sign-in events in the last 30 days came from non-compliant or unmanaged devices (device compliance state = unknown or non-compliant in sign-in logs)?
 - `[LOOK AT]` Are there legacy-protocol sign-ins (Basic Auth) that bypass Conditional Access entirely? (Sign-in logs, filter Client App = "Exchange ActiveSync," "Other clients")
 - `[LOOK AT]` How many BYOD / personal devices are accessing corporate data through the web client or OWA (known-unmanaged population)?
 ### C2. Join State and Management Mode
 - `[LOOK AT]` Are devices Entra-joined, hybrid Entra-joined, or Entra-registered (BYOD)?
 - `[LOOK AT]` Is hybrid Entra join still in use? If so, which on-prem dependencies actually require it?
 - `[LOOK AT]` Is there a roadmap to go cloud-native (Entra join + Intune) for devices currently on hybrid join?
 - `[LOOK AT]` Are there GPO and Intune co-management conflicts producing inconsistent configuration?
 ### C3. Conditional Access Enforcement
 - `[TEST]` For every CA policy that enforces device compliance or blocks legacy auth: run real sign-ins with expected outcomes written down beforehand. Does the observed result match?
 - `[TEST]` If a policy looks correct but does not enforce: recreate from scratch, re-test. Document ghost policy findings.
 - `[LOOK AT]` Is there a CA policy blocking legacy authentication protocols across all apps? (This is the single highest-leverage CA policy — if not in place, that is P0)
 - `[LOOK AT]` Is there a CA policy requiring MFA for all admin role activations?
 - `[LOOK AT]` Is there a CA policy requiring compliant or managed device for access to sensitive workloads?
 - `[LOOK AT]` Are break-glass accounts and emergency service accounts correctly excluded from blocking CA policies?
 - `[TEST]` Lock yourself out in report-only mode (simulate a compliance failure on an admin account). Confirm break-glass bypasses the policy. Confirm a legitimate admin gets the expected failure and knows the escalation path.
 ### C4. Compliance Signal Quality
 - `[LOOK AT]` What is the compliance check-in cadence? (The window where a fallen-out device still holds a "compliant" token)
 - `[LOOK AT]` Is Continuous Access Evaluation (CAE) enabled for workloads that support it? (Narrows the stale-token window)
 - `[ASK]` Is root/jailbreak detection in compliance policy, and how is it treated — as a hard block or a risk signal? Is it believed to be a wall or a tripwire?
 - `[TEST]` Spoof compliance on a test device (root a test device). How long until the signal flips? Does CA revoke access?
 ### C5. Endpoint Privilege
 - `[LOOK AT]` Do standard users have standing local admin on their endpoints?
 - `[LOOK AT]` Is Endpoint Privilege Management (EPM) deployed, or is there a JIT elevation mechanism for tasks requiring admin rights?
 - `[LOOK AT]` Is Windows LAPS deployed across the fleet? Is legacy LAPS still in use (to be migrated)?
 - `[LOOK AT]` Are there shared local admin accounts with common passwords across multiple machines?
 ### C6. Update and Patch Velocity
 - `[LOOK AT]` Is Windows Autopatch in use (for update ring management)?
 - `[LOOK AT]` Are Intune update rings configured with pilot, broad, and deferral stages?
 - `[ASK]` Is there a named person with the authority and procedure to halt a broad update ring push? Has this been tested?
 - `[LOOK AT]` What is the current patch lag for the fleet (how many devices are 30+ days behind on OS updates)?
 ### C7. MAM / App Protection (BYOD)
 - `[TEST]` On iOS: attempt copy/paste from managed Outlook/Teams to an unmanaged app. Does it block?
 - `[TEST]` On Android: same test, separately — behavior is not symmetric with iOS.
 - `[TEST]` Attempt to "Open in" from a managed attachment to an unmanaged app on each platform.
 - `[TEST]` Attempt to save to local storage or sync to a personal cloud (iCloud, Google Drive).
 - `[LOOK AT]` Are managed browsers enforced for SharePoint/OWA access on BYOD, or can users access via any browser?
 ### C8. Autopilot and Enrollment Trust
 - `[LOOK AT]` Is the Autopilot device list audited? Are there stale or unknown device registrations?
 - `[LOOK AT]` Are enrollment restrictions in place to prevent unauthorized device enrollment?
 - `[TEST]` Time a wipe-and-reprovision on a corporate device via Autopilot. Is the "replaceable in an hour" claim accurate?
 - `[LOOK AT]` Is the PRT (Primary Refresh Token) TPM-bound on Windows devices?
 ---
 ## Section D — Data & Collaboration
 ### D1. Sharing Posture
 - `[LOOK AT]` What is the tenant-level external sharing setting in SharePoint Admin Center?
 - `[LOOK AT]` Are "Anyone with the link" anonymous shares enabled at the tenant level?
 - `[TEST]` Enumerate existing anonymous links across the tenant. Can you produce the list? How large is it?
 - `[LOOK AT]` Are per-site sharing settings more permissive than the tenant default? (Sites can override upward)
 - `[LOOK AT]` Are sharing expiration policies configured for anonymous and external links?
 - `[TEST]` Share a document to a test external guest and attempt to reshare onward. Can you track the second-hop share?
 ### D2. Guest Access
 - `[LOOK AT]` How many active guests exist in the tenant?
 - `[LOOK AT]` How many guests have not signed in for 90+ days?
 - `[LOOK AT]` Are access reviews configured for guest accounts? What is the review cadence and the default action on non-response?
 - `[LOOK AT]` Do guests have broader access than the project they were invited for (i.e., access to Teams/channels beyond their original scope)?
 - `[LOOK AT]` Are external identities governed by specific B2B collaboration settings, or is the default (all external domains) allowed?
 ### D3. Email Security
 - `[TEST]` Enumerate external auto-forwarding rules at the transport level (`Get-TransportRule`). Are there any active rules forwarding externally without a documented business owner?
 - `[TEST]` Enumerate Inbox rules on executive / privileged user mailboxes forwarding externally. (`Get-InboxRule`)
 - `[LOOK AT]` Is the global "allow automatic forwarding" setting disabled in Remote Domains for the Default domain?
 - `[LOOK AT]` Are anti-phishing policies configured? Is impersonation protection enabled for executives and key domains?
 - `[LOOK AT]` Is DKIM signing enabled for all sending domains?
 - `[LOOK AT]` Is DMARC configured (policy `reject` or `quarantine`), and is the SPF record current?
 ### D4. Crown Jewels
 - `[ASK]` Can the client name the five data sets that, if exfiltrated, would cause the most damage?
 - `[LOOK AT]` Where do the crown jewels live (SharePoint sites, mailboxes, OneDrive, Teams channels)?
 - `[LOOK AT]` Who has access to the crown-jewel locations? Is access reviewed periodically?
 - `[LOOK AT]` Are the crown-jewel locations labeled with sensitivity labels that carry encryption?
 - `[LOOK AT]` Are audit logs turned on and retained long enough to reconstruct access to crown-jewel locations?
 ### D5. Sensitivity Labels and DLP
 - `[LOOK AT]` Are sensitivity labels deployed in the tenant? What is the coverage across the most-used content types (email, files)?
 - `[LOOK AT]` Are labels configured with encryption for the highest sensitivity tiers?
 - `[LOOK AT]` Is auto-labeling deployed for known crown-jewel content types (if licensed for M365 E5 Compliance)?
 - `[LOOK AT]` Is DLP deployed? Is it scoped to specific known-value patterns (regulated data, PII, crown-jewel keywords) or applied as a broad dragnet generating noise?
 - `[TEST]` Exfiltrate a labeled test document via email to an external address. Does DLP fire? Does the label encryption hold on the received document?
 ### D6. Collaboration Sprawl
 - `[LOOK AT]` Is there ungoverned self-service creation of Teams and SharePoint sites?
 - `[LOOK AT]` Are there orphaned or inactive Teams/sites that still hold data and have no active owner?
 - `[LOOK AT]` Are there Teams channels or SharePoint sites with "Everyone" or broad internal membership grants on sensitive data?
 - `[LOOK AT]` Is late-joiners' access to Team history governed (a user joining a Team today can read all prior messages by default)?
 ### D7. OAuth App Consent
 - `[LOOK AT]` Is user consent for OAuth apps restricted (users cannot consent to app permission requests without admin approval)?
 - `[LOOK AT]` Are there existing grants for apps holding `Mail.Read`, `Files.ReadWrite.All`, or equivalent sensitive scopes by non-first-party apps?
 - `[LOOK AT]` Is Microsoft's app governance module (Purview) enabled? Are risky app alerts configured?
 ### D8. Audit Logging
 - `[LOOK AT]` Is Unified Audit Logging enabled (confirm in Purview Compliance Center > Audit)?
 - `[LOOK AT]` What is the audit retention period, given the client's licensing?
 - `[TEST]` Run a sample audit query on a known recent activity and verify log entries are present. Do not assume the log is on without testing it.
 - `[LOOK AT]` Are admin operations (role assignment changes, app consent, CA policy changes) captured in the audit log?
 ---
 ## Section E — Recovery & Detection
 ### E1. Backup and Recovery
 - `[ASK]` What is the recovery path if a Global Admin deletes all Exchange Online mailboxes and SharePoint sites? Be specific about process, tool, and time estimate.
 - `[LOOK AT]` Is there a third-party M365 backup solution covering Exchange, SharePoint, OneDrive, and Teams?
 - `[LOOK AT]` Are M365 backups isolated from the estate they protect (immutable, separate authentication domain)?
 - `[TEST]` When was the last successful restore from backup, and how long did it take? Restore a test mailbox or a file share and time it. This is the MTTR.
 - `[LOOK AT]` Are on-prem AD backups (System State) taken regularly, stored offline, and verified?
 - `[TEST]` Can the current backup restore an AD domain if all DCs are destroyed? Has anyone run the forest recovery procedure, even in a lab?
 ### E2. Configuration-as-Code (Known-Good Baseline)
 - `[LOOK AT]` Have CA policies been exported to code/JSON (e.g., using CAExporter)?
 - `[LOOK AT]` Has the Entra role assignment state been captured as a document?
 - `[LOOK AT]` Has the Intune baseline configuration been exported?
 - `[LOOK AT]` Is there a diff between the opening state and current state for any changes made during the engagement?
 - `[ASK]` If the tenant CA policies were silently modified by an attacker, would anyone know? Is there drift detection against the known-good?
 ### E3. Recovery Path Independence
 - `[LOOK AT]` Does any part of the recovery runbook depend on the system it recovers (e.g., runbook stored in SharePoint, backup auth via the compromised AD)?
 - `[LOOK AT]` Are recovery credentials (break-glass, backup admin accounts) accessible independently of the estate?
 - `[LOOK AT]` Is the AD forest recovery runbook stored offline or in a location that survives domain destruction?
 - `[ASK]` If both AD and M365 were simultaneously unavailable, what is the recovery sequencing? Is that decision documented?
 ### E4. Detection: Signal Quality
 - `[LOOK AT]` Break-glass account use: is there an alert? Is it monitored by a named person?
 - `[LOOK AT]` New Global Admin assignment: does an alert fire?
 - `[LOOK AT]` DCSync from a non-DC host: is this detected (Defender for Identity or SIEM rule)?
 - `[LOOK AT]` Impossible-travel sign-in for admin accounts: is Entra ID Protection user risk policy configured and alerting?
 - `[LOOK AT]` External auto-forward rule creation: is this generating an alert?
 - `[LOOK AT]` Mass download from SharePoint/OneDrive: is there a Defender for Cloud Apps or Purview policy detecting it?
 - `[LOOK AT]` New OAuth consent grant to sensitive scopes: is this alerting?
 - `[LOOK AT]` PIM activation outside business hours: is this logged and reviewed?
 - `[TEST]` For each configured detection: simulate the event (in a controlled, authorized test context) and confirm the alert fires, is received by a named person, and generates a response within the expected SLA.
 ### E5. Detection: Noise and Action
 - `[ASK]` How many alerts does the monitoring system generate per day? How many are triaged vs. suppressed vs. missed?
 - `[ASK]` For the last three security incidents or notable alerts: what structural change resulted? If the answer is "we sent an awareness email" or "we noted it," the feedback loop is broken.
 - `[LOOK AT]` Is there a named owner for each alert category? An alert without a named owner is an unread alert.
 - `[ASK]` Is there a blameless post-incident process? Do people surface incidents, or do they bury them to avoid blame?
 ### E6. Game-Days and Drills
 - `[ASK]` When was the last deliberate test of recovery or detection (a drill, tabletop, or game-day)?
 - `[TEST]` Break-glass drill: sign in, confirm it works, confirm the alert fires. Document the test and the result.
 - `[TEST]` CA policy enforcement drill: force a non-compliant state on a test user. Confirm the expected outcome and that break-glass bypasses the gate.
 - `[ASK]` Has the client ever run a ransomware tabletop that assumes Tier 0 is owned? What did they find?
 ---
 ## Section F — Quick-Win Inventory
 Use this section to capture findings that can be addressed in the same session or within the engagement without additional scoping.
 Each of the following, if found to be the case, is a fix that typically takes under an hour and has immediate blast-radius reduction. Do not leave these open for the next engagement.
 | Control | Condition that makes it a quick win |
 |---------|-------------------------------------|
 | Tenant-level anonymous sharing | "Anyone" links enabled at tenant level — one toggle |
 | External auto-forwarding | Global block not set — one Exchange setting |
 | Legacy auth CA policy | No policy blocking legacy auth — deploy baseline CA policy |
 | Break-glass alert | Break-glass use not alerting — configure alert rule |
 | Global admins audit | Standing synced GAs — identify and initiate migration |
 | KRBTGT age | Password not set in 365+ days — document and schedule rotation |
 | Stale admin accounts | Disabled or unchecked admin accounts — disable and document |
 | Audit log | Not enabled — turn on (one click in Purview) |
 | PIM not deployed | P2 licensed but PIM off — scope activation as P1 |
 | No CA blocking admin sign-in from personal devices | Missing policy — create report-only immediately, test and enable |
 ---
 ## Engagement Close — Structural Change Verification
 At the close of each engagement or module, confirm:
 1. Which items above were found to be fragile?
 2. For each: what **structural change** was made (not documented, not accepted, but changed)?
 3. Which items were tested by observation (not just inspected)?
 4. Which items are open and in the risk register with a named owner and a timeline?
 5. Has the configuration-as-code baseline been exported and stored?
 6. Has the break-glass been tested?
 7. Is there a named date for the next review of this checklist?
 The work is not complete when the list is walked. It is complete when fragility found has become structure changed.
 ---
 *Engagement Checklist. Updated June 2026. Review and update alongside the Field Guide — January 2027.*
@@ -0,0 +1,216 @@
 # Findings Backlog
 > *"A finding without a home is a finding that will never be fixed. The risk register is the right home. The backlog is the one that actually exists."*
 ## The Problem This Solves
 Every assessment, module, and engagement produces findings. Some get fixed immediately. Most do not. In organisations with mature risk management, findings go into a risk register, get assigned owners, get reviewed quarterly, and get closed over time.
 In practice, most organisations do not have a working risk register. They have a template someone downloaded, a spreadsheet that was last updated during the ISO 27001 attempt three years ago, or a GRC tool that nobody logs into. Findings that go into these systems disappear.
 The **findings backlog** is the pragmatic alternative. It is not a replacement for a formal risk register — it is the lightweight, maintainable system that fills the gap between "finding documented in a report" and "finding tracked to closure." For organisations that eventually build a working risk register, the backlog feeds into it. For organisations that never do, the backlog is their risk register in all but name.
 ---
 ## Deployment Options
 Three options, in order of preference. Choose based on what the client will actually open.
 ### Option 1 — Azure DevOps Work Items (recommended for ASTRAL clients)
 If ASTRAL is deployed, the client already has an ADO project. Work Items are built in — no additional tooling, no additional cost, same context as the ASTRAL drift PRs and restore pipeline. This is the default.
 **Setup**: Create a Work Item type called "Security Finding" (or use the built-in Bug or Task type with a tag). Create a board with columns: `New → Triaged → In Progress → Blocked → Closed`. Add custom fields: Priority (P0/P1/P2), Source (Brownhat / BloodHound / Elysium / ASTRAL / PULSAR / Module N), and Target Date.
 **Why it works**: Consultants who are already reviewing ASTRAL drift PRs see the backlog in the same tool. The client's IT lead who owns remediation works in the same project. No context switching.
 **Teams tab**: Pin the ADO board directly into the relevant Teams channel as a tab — built into the Azure DevOps app for Teams, no additional setup. The IT lead who lives in Teams can view and update Work Items without opening ADO in a browser. This is the recommended surface for non-technical owners: it is always visible, requires no context switch, and keeps the canonical data in ADO.
 **Power Automate (optional)**: If you need to push notifications into Teams or create tasks in Planner when a P0 item is opened or a target date is missed, Power Automate can bridge ADO to the M365 ecosystem. This adds complexity and a dependency on Power Automate flows — use it only if the Teams tab alone is not driving the right behaviour. There is no native ADO → Planner sync without Power Automate.
 **ASTRAL integration**: When ASTRAL raises a drift PR for an unauthorised configuration change that cannot be immediately restored, link the ADO Work Item to the ASTRAL PR. The PR description, the before/after diff, and the reviewer decision are all in the same project — the Work Item is the remediation task, the ASTRAL PR is the evidence.
 ---
 ### Option 2 — CISO Assistant (upgrade path for clients building toward GRC)
 [CISO Assistant](https://github.com/intuitem/ciso-assistant-community) is an open-source GRC platform already in the [Sovereign Tool Stack](../playbooks/sovereign-tool-stack.md). It provides risk register functionality, compliance framework mapping (NIS2, ISO 27001, DORA), evidence tracking, and audit-ready reporting — all self-hosted.
 **When to use it instead of ADO Work Items**: When the client has an intent to build a formal risk management programme and needs a tool that can grow into it. CISO Assistant bridges the gap between a pragmatic backlog and a formal risk register: the same findings that start as backlog items can be promoted to documented risks with treatment plans, residual risk assessments, and compliance evidence links.
 **How the backlog feeds CISO Assistant**: During each module, findings are entered into the backlog in ADO or a flat file. At quarterly review, P1 and significant P2 items that are not yet closed are promoted to CISO Assistant as risk entries with the evidence collected during the engagement. The backlog is operational; CISO Assistant is the strategic record.
 **Deployment**: Docker Compose, ~30 minutes. Self-hosted on the client's infrastructure or on a VPS. See the sovereign tool stack for deployment guidance.
 ---
 ### Option 3 — Git flat file (fallback for clients without ADO or preference for simplicity)
 A Markdown file committed to the same repository as ASTRAL (or a dedicated security repository). Zero additional tooling. Fully auditable via Git history. Works offline.
 **When to use it**: Clients who have the technical capability to maintain a Markdown file in Git and prefer minimal tooling. Also useful as a transitional format before ADO Work Items are fully configured.
 **Limitation**: No native assignment notifications, no Planner sync, no board view. Progress is visible only to people who look at the repository. For clients where the IT lead needs to be nudged, a flat file will be ignored. Use ADO Work Items or CISO Assistant instead.
 The flat file template is provided below.
 ---
 ## Design Principles
 **It must live where the client actually opens things.** If the backlog is in a tool the client never looks at, it does not exist. The three options above are ordered by likelihood of adoption. ADO Work Items wins because ASTRAL is already there — the path of least resistance is the path most likely to be walked.
 **Every finding has an owner.** A finding without a named owner is not tracked — it is archived. The owner does not need to be the person who fixes it. They need to be the person who is accountable for whether it gets fixed.
 **Priority drives the housekeeping stream.** The backlog is the input queue for Rule 4 (housekeeping as a permanent stream). The monthly housekeeping cycle picks from the backlog, resolves what it can, and updates statuses. Without the backlog, the housekeeping stream has no queue to work from.
 **It accumulates from all sources.** Every diagnostic, every module, every ASTRAL drift alert, every PULSAR-flagged event, every BloodHound finding, every Elysium result feeds the backlog. Not just the big assessments. The backlog is the single source of truth for everything that has been found and not yet fixed.
 ---
 ## The Format
 A minimal backlog entry has six fields. Do not add more until the client is actually maintaining this one.
 | Field | What it contains |
 |-------|-----------------|
 | **ID** | Sequential identifier (B-001, B-002…). Never reuse an ID. |
 | **Finding** | One sentence: what is wrong. Not "review accounts" — "47 user accounts belong to staff who have left; credentials remain valid." |
 | **Source** | Which assessment or tool produced this: Brownhat Diagnostic, BloodHound, Elysium, ASTRAL drift, PULSAR alert, Module 6, etc. |
 | **Priority** | P0 / P1 / P2 — using the kill chain test (see below) |
 | **Owner** | Named person, not a team. "AD Team" is not an owner. "Marek Novák" is. |
 | **Status** | Open / In Progress / Blocked / Closed |
 Optional fields that add value once the basic discipline is established:
 | Field | What it contains |
 |-------|-----------------|
 | **Target date** | The date by which this should be resolved. Not when the project ends — when this specific item should be done. |
 | **Effort** | S / M / L — rough estimate; S = fixable in a day or less, M = a few days, L = needs planning |
 | **Notes** | Blockers, context, related items, change window requirements |
 | **Closed date** | When it was actually closed. Important for demonstrating progress to auditors. |
 ---
 ## Priority Assignment
 Use the kill chain test from the [Consultant Field Guide](../core/consultant-field-guide.md):
 **P0 — Kill chain node.** If exploited, the organisation fails to operate. Fix before anything else. Examples: admin accounts without MFA, unpatched internet-facing system with known active exploit, backup that has never been restored, KRBTGT password over 365 days old.
 **P1 — Material damage.** If exploited, significant harm results but the organisation survives. Fix within the current engagement. Examples: service accounts with non-expiring passwords, open RDP from internet, legacy authentication not blocked, stale privileged accounts.
 **P2 — Should be fixed.** Real finding, real risk, but not existential. Goes into the housekeeping queue for the next available cycle. Examples: weak password policy on non-privileged accounts, missing DNS security filtering, unreviewed firewall rules from two years ago, undocumented vendor access with low privilege.
 > **On priority inflation**: The most common backlog failure is everything being marked P0. If everything is urgent, nothing is. Be ruthless. An environment with more than 5–10 P0 items either has a genuinely catastrophic security posture (in which case the immediate conversation is with the executive sponsor) or the priority assignments are wrong.
 ---
 ## Backlog Template (Flat File Version)
 For clients whose teams work in a repository (preferred — the backlog lives alongside ASTRAL):
 ```markdown
 # Findings Backlog
 Last reviewed: [DATE] | Owner: [NAME] | Next review: [DATE]
 ## P0 — Kill Chain (fix immediately)
 | ID | Finding | Source | Owner | Status | Target |
 |----|---------|--------|-------|--------|--------|
 | B-001 | | | | | |
 ## P1 — Material Risk (fix this engagement)
 | ID | Finding | Source | Owner | Status | Target |
 |----|---------|--------|-------|--------|--------|
 | B-010 | | | | | |
 ## P2 — Housekeeping Queue (monthly cycle)
 | ID | Finding | Source | Owner | Status | Target |
 |----|---------|--------|-------|--------|--------|
 | B-100 | | | | | |
 ## Closed
 | ID | Finding | Closed date | Closed by |
 |----|---------|-------------|-----------|
 | | | | |
 ```
 Use ID ranges to signal priority at a glance: B-001–B-009 for P0, B-010–B-099 for P1, B-100+ for P2.
 ---
 ## Populating the Backlog
 ### On Day 30 (from the Brownhat Diagnostic)
 The Brownhat Diagnostic produces the first population of the backlog. Every finding from the diagnostic gets an entry. Quick wins that are closed immediately during the engagement go straight to Closed with the closing date. Everything else — including findings the client acknowledges but cannot act on immediately — goes into the backlog with the appropriate priority.
 The consultant populates the initial backlog as part of the diagnostic deliverable. It is not a separate engagement. It is what happens to findings instead of filing them in a PDF.
 ### From subsequent modules
 Every module completion package includes an update to the backlog:
 - Findings that were closed by the module move to Closed
 - New findings discovered during the module are added
 - The risk register update in the module completion package cross-references the backlog IDs
 ### From continuous tools (ASTRAL, PULSAR, Elysium)
 - **ASTRAL** — when a drift PR is raised for an unauthorised configuration change, a backlog entry is created if the change is not immediately remediated. The ASTRAL PR link goes in the Notes field.
 - **PULSAR** — when an alert is investigated and reveals a structural gap (not just an event), a backlog entry is created. The PULSAR event ID goes in the Notes field.
 - **Elysium** (quarterly run) — each new compromised or weak credential that cannot be immediately reset gets a backlog entry.
 - **BloodHound** (quarterly run) — new or persistent attack paths get backlog entries with the path description.
 ---
 ## The Housekeeping Cycle
 The monthly housekeeping cycle (Rule 4 of [Move Fast and Fix Things](../core/move-fast-and-fix-things.md)) works from the backlog. The cycle has a simple structure:
 1. **Review open P0 items.** If any P0 is still open and not blocked by an external dependency, it is the first priority. If it has been open more than 30 days without progress, it escalates to the executive sponsor.
 2. **Work P1 items.** Each cycle resolves at least one P1 item. If no P1 items are being resolved, the housekeeping stream is not functioning — find the blocker.
 3. **Advance P2 items.** Move through the P2 queue at the capacity available. Not every cycle will close P2 items. Every cycle should move at least one P2 item to In Progress.
 4. **Review and reprioritise.** As the environment changes, priorities shift. A P2 item that has been open for six months may be a P0 in disguise if nothing above it has been fixed.
 5. **Update statuses.** Every item touched in the cycle gets a status update, even if the update is "Blocked — waiting for change window."
 The output of each cycle is a one-page summary: items closed this cycle, items In Progress, blockers, and the current P0/P1 count. This summary goes to the named client lead. If retained capability is in scope, it goes to the CQRE consultant as well.
 ---
 ## Relationship to the Risk Register
 If the client has a working risk register, the backlog and the risk register coexist:
 - The backlog is **operational** — it is where findings live while they are being worked
 - The risk register is **strategic** — it captures the risk that the finding represents, the treatment decision, and the residual risk after treatment
 - When a P0 or P1 item is closed, the consultant works with the client to create or update the corresponding risk register entry with the closure evidence
 If the client does not have a working risk register, the backlog is effectively doing the risk register's job at a reduced level of formality. Do not pretend otherwise. If the client ever needs to demonstrate risk management to a regulator or auditor, the backlog — with its closure dates, ownership, and priority history — is defensible evidence. A GRC tool with empty fields is not.
 For clients who want to build a proper risk register: the backlog entries, once they have closure evidence attached, become the input for the risk register's treatment and closure records. The backlog is not wasted effort — it is the work that feeds the register.
 ---
 ## Backlog Health Indicators
 These are warning signs that the backlog is not functioning:
 | Indicator | What it means |
 |-----------|--------------|
 | P0 items open for more than 30 days with no progress | Executive escalation required; a P0 that nobody is moving is a political problem, not a technical one |
 | More than 20 items in the backlog with no owner | The backlog was populated but not handed over properly; go back and assign owners |
 | No items closed in the last monthly cycle | The housekeeping stream is not running; find the responsible person and re-establish the cadence |
 | All items are P2 | Priority inflation has happened in reverse; the consultant needs to revisit severity assignments |
 | The backlog has not been updated since the last engagement | The backlog is a report, not a system; the client has reverted to treating findings as documentation rather than as work |
 ---
 *Related: [Module Completion Report](module-completion-report.md) — each module updates the backlog as part of its completion package.*
 *Related: [Antifragile Risk Register](antifragile-risk-register.md) — the formal risk register template; the backlog feeds into it.*
 *Related: [Move Fast and Fix Things — Rule 4](../core/move-fast-and-fix-things.md#rule-4-run-housekeeping-as-a-permanent-stream) — the backlog is the queue that Rule 4 works from.*
 *Related: [Engagement Model](../core/engagement-model.md) — backlog setup is part of every module kickoff.*
@@ -153,14 +153,14 @@ Next review:      14 April 2025
 | **Impact** | 3 — Significant. Primarily a compliance and investigation impact rather than operational failure. |
 | **Traditional risk score** | 9 — P3 (elevated to P2 due to regulatory exposure) |
 | **Optionality impact** | Moderate. Once logs are deleted, the option to investigate and prove scope is permanently lost. |
-| **Convexity** | High. Extending retention to 180 days requires E3 Compliance Add-on (≈€8/user/month) or ingestion into a long-term log store (AOC + blob storage). Cost vs. cost of regulatory non-compliance is asymmetric. |
+| **Convexity** | High. Extending retention to 180 days requires E3 Compliance Add-on (≈€8/user/month) or ingestion into a long-term log store (PULSAR + blob storage). Cost vs. cost of regulatory non-compliance is asymmetric. |
-| **Current control** | M365 Unified Audit Log at 90-day default. No secondary storage. AOC not yet deployed. |
+| **Current control** | M365 Unified Audit Log at 90-day default. No secondary storage. PULSAR not yet deployed. |
-| **Antifragile move** | 1. Deploy AOC to ingest and persist audit logs beyond the 90-day window into the organisation's own infrastructure (MongoDB + blob storage). 2. Alternatively, evaluate E3 Compliance Add-on for extended Microsoft-native retention. 3. Document retention policy and verify it meets applicable regulatory requirements (NIS2 Article 21 recommends 12+ months). |
+| **Antifragile move** | 1. Deploy PULSAR to ingest and persist audit logs beyond the 90-day window into the organisation's own infrastructure (MongoDB + blob storage). 2. Alternatively, evaluate E3 Compliance Add-on for extended Microsoft-native retention. 3. Document retention policy and verify it meets applicable regulatory requirements (NIS2 Article 21 recommends 12+ months). |
 | **Owner** | CISO / IT Manager |
 | **Target date** | 30 April 2025 (P2 — within 90 days) |
 | **Status** | Open |
-| **Stress-to-signal mandate** | If an incident reveals log gaps: AOC deployed immediately post-incident; retention policy reviewed and extended to regulatory minimum; board notified of compliance gap. |
+| **Stress-to-signal mandate** | If an incident reveals log gaps: PULSAR deployed immediately post-incident; retention policy reviewed and extended to regulatory minimum; board notified of compliance gap. |
-| **Verification method** | AOC deployed with log ingestion confirmed. Oldest ingested log age exceeds 180 days within 6 months of deployment. Retention policy documented and signed off. |
+| **Verification method** | PULSAR deployed with log ingestion confirmed. Oldest ingested log age exceeds 180 days within 6 months of deployment. Retention policy documented and signed off. |
 ---
@@ -178,8 +178,8 @@ Next review:      14 April 2025
 | **Traditional risk score** | 12 — P2 |
 | **Optionality impact** | Moderate. Without detection, the organisation cannot exercise the option to contain and eject an attacker early. |
 | **Convexity** | High. Building a detection engineering cell (1 FTE equivalent) costs ≈€150K/year and makes the €102K/year MSSP investment 3× more effective. |
-| **Current control** | MSSP with generic ruleset. AOC not deployed. No custom detection rules. MSSP SLA measures ticket response time, not detection coverage. |
+| **Current control** | MSSP with generic ruleset. PULSAR not deployed. No custom detection rules. MSSP SLA measures ticket response time, not detection coverage. |
-| **Antifragile move** | 1. Conduct a purple team TTP coverage test against the MSSP (5 TTPs, as described in the Retained Capability document). 2. Deploy AOC to add M365-specific detection on top of the MSSP. 3. Write 3–5 custom detection rules for the highest-priority Meridian-specific TTPs (OT/IT boundary crossing, service account anomalies, large SharePoint exports). 4. Add detection coverage rate to the MSSP SLA. 5. Consider a retained capability arrangement to maintain and extend the custom ruleset. |
+| **Antifragile move** | 1. Conduct a purple team TTP coverage test against the MSSP (5 TTPs, as described in the Retained Capability document). 2. Deploy PULSAR to add M365-specific detection on top of the MSSP. 3. Write 3–5 custom detection rules for the highest-priority Meridian-specific TTPs (OT/IT boundary crossing, service account anomalies, large SharePoint exports). 4. Add detection coverage rate to the MSSP SLA. 5. Consider a retained capability arrangement to maintain and extend the custom ruleset. |
 | **Owner** | IT Manager / outsourced CISO |
 | **Target date** | 30 June 2025 (P2 — within 90 days to start; sustained programme) |
 | **Status** | Open |
@@ -0,0 +1,380 @@
 # Self-Service Security Cadence
 > *What you run between our engagements. When something in here surprises you, that's when you call us.*
 **Last updated:** June 2026
 **Produced by:** [engagement name / consultant name]
 **For:** [client name] — [named admin / IT lead]
 **Next full engagement:** [date or "TBD"]
 **Next review of this document:** January 2027
 ---
 ## What this is
 We ran the adversarial validation. We fixed the structural issues we found. The work does not stop when we leave.
 This document is your recurring checklist — things you can run yourself, with the tools we set up, on a regular cadence. None of it requires a security background. Most of it takes under an hour per month. The point is to catch drift before it becomes a problem, and to know when to call us before it becomes a crisis.
 **The most important thing:** when something in here produces a result that surprises you, do not sit on it. Log it, screenshot it, and send it to us. The earlier we see a problem the cheaper it is to fix.
 ---
 ## Tools you need (all installed during the engagement)
 | Tool | What it does | Where to get it |
 |------|-------------|-----------------|
 | **PingCastle** | Scans Active Directory and produces a security report with a score and specific findings | [pingcastle.com](https://www.pingcastle.com) — free Community edition |
 | **Purple Knight** | Scans Active Directory for indicators of exposure — simpler output than PingCastle, good complement | [purple-knight.com](https://www.purple-knight.com) — free |
 | **CAExporter** | Exports all Conditional Access policies to JSON files you can compare over time | [github.com/vibecoding/CAExporter](https://github.com/vibecoding/CAExporter) |
 | **Microsoft Graph PowerShell** | The PowerShell module for the scripts in this document | `Install-Module Microsoft.Graph` |
 | **Microsoft 365 Defender portal** | alerts.microsoft.com — your alert queue and Secure Score |  |
 | **Microsoft Entra portal** | entra.microsoft.com — your identity dashboard |  |
 The scripts in this document are saved in `[location agreed during engagement — e.g., C:\SecurityRunbook\Scripts\]`.
 ---
 ## Monthly checks — 30 to 45 minutes, portal-based
 Do these on the first working day of each month. They require no special tools — just a browser logged in as a Global Admin or Security Reader.
 ---
 ### M1. Microsoft Secure Score
 **Where:** [Microsoft 365 Defender portal](https://security.microsoft.com) > Secure Score
 **What to do:**
 1. Note the current score.
 2. Compare to last month's score (the history graph shows it).
 3. Look at the "Recommended actions" tab — filter to "Not addressed."
 4. Any new items that appeared since last month? Note them.
 **What you are looking for:** Score going down month-over-month without a known reason. New recommended actions you did not create. Completed actions that have reverted to "not addressed" (this means configuration drifted back).
 **Call us if:** Score drops more than 5 points in a month without a documented reason, or if a completed action you remember implementing shows as "not addressed."
 ---
 ### M2. Entra ID Recommendations
 **Where:** [Entra portal](https://entra.microsoft.com) > Overview > Recommendations
 **What to do:**
 1. Look at all open recommendations.
 2. Note any that are new since last month.
 3. Note the impact rating (High / Medium / Low) on new ones.
 **What you are looking for:** New high-impact recommendations that appeared since last month. Specifically watch for anything related to admin accounts, Conditional Access, legacy authentication, or risky sign-ins.
 **Call us if:** Any new High-impact recommendation appears. We will help you assess whether to act immediately or schedule it.
 ---
 ### M3. Sign-in risk review
 **Where:** Entra portal > Identity Protection > Risky sign-ins
 **What to do:**
 1. Filter to the last 30 days.
 2. Look at sign-ins with risk level "High" that were not dismissed or remediated.
 3. For any admin account (Global Admin, Exchange Admin, Security Admin) with any risky sign-in event — investigate before dismissing.
 **What you are looking for:** Admin accounts appearing in the risky sign-in list. Any high-risk sign-in that auto-remediated (meaning the user passed an MFA challenge) where the geography or device does not make sense.
 **Call us if:** Any admin account has a risky sign-in event. Any high-risk event that was remediated from an unexpected location.
 ---
 ### M4. Alert queue health
 **Where:** Microsoft 365 Defender portal > Incidents & alerts > Alerts
 **What to do:**
 1. Filter to "New" and "In progress" alerts.
 2. How many are sitting open for more than 48 hours?
 3. Are there categories of alert that appear repeatedly? (Recurring alerts on the same user or asset are a pattern, not noise.)
 **What you are looking for:** Alert queue growing over time without being worked. The same alert firing repeatedly on the same account or resource. Any alert tagged as "High severity" that is more than 24 hours old without assignment.
 **Call us if:** A High-severity alert is more than 24 hours old and you do not know what to do with it. Or if the same alert keeps firing on the same account.
 ---
 ### M5. New admin assignments
 **Where:** Entra portal > Identity > Roles & admins > All roles > Global Administrator > Assignments
 **What to do:**
 1. Check the current member list against last month's.
 2. Any new members? Were they expected?
 3. Check at minimum: Global Administrator, Exchange Administrator, Security Administrator, SharePoint Administrator.
 **What you are looking for:** Anyone in a privileged role who should not be, or who appeared without a formal request.
 **Call us if:** Any new privileged role assignment you did not authorize or do not recognize.
 ---
 ### M6. Break-glass confirmation (30 seconds)
 **What to do:**
 1. Confirm the break-glass account credentials are still in the agreed storage location.
 2. Confirm the contact for "break-glass alert fired" is still the right person.
 Do not log in to the break-glass account during this check — any sign-in triggers an alert. Just confirm the credentials are accessible.
 **Call us if:** Credentials cannot be found. Or if the break-glass alert fires without a drill scheduled.
 ---
 ## Quarterly checks — 2 to 3 hours, tools required
 Do these in the first week of each quarter (January, April, July, October). These require running the installed tools and saving the output.
 ---
 ### Q1. PingCastle AD scan
 **How to run:**
 1. Log in to the domain controller (or any domain-joined machine) as a Domain Admin.
 2. Run `PingCastle.exe --healthcheck --server <your-domain-FQDN>`.
 3. It produces an HTML report. Save it to `[agreed location]` with the date in the filename: `PingCastle-2026-Q3.html`.
 4. Open the report and note the score and any findings marked "Critical" or "High."
 5. Compare to the previous quarter's report — is the score going up or down?
 **What you are looking for:** Score trending down quarter-over-quarter. New Critical or High findings that were not present last quarter. Specifically watch the "Stale Objects" section (accounts nobody uses) and the "Privileged Access" section.
 **Call us if:** The score drops more than 10 points since last quarter. Any new Critical finding. Any finding in the "Privileged Access" category that was clean last quarter.
 ---
 ### Q2. Purple Knight AD scan
 **How to run:**
 1. Download and run Purple Knight on a domain-joined machine with Domain Admin credentials.
 2. It is a GUI tool — click through the scan, wait for it to finish.
 3. Save the PDF report with the date: `PurpleKnight-2026-Q3.pdf`.
 4. Look at the "Identity Security Indicators" with status "Exposed" or "Critical."
 5. Compare to the previous quarter.
 **What you are looking for:** New exposed indicators that did not appear last quarter. Any indicator flagged as Critical. The tool is organized by MITRE ATT&CK category — pay particular attention to "Credential Access" and "Privilege Escalation."
 **Call us if:** Any new Critical indicator. Or if the same Medium indicators keep appearing quarter after quarter without being resolved (this means the fix did not stick).
 ---
 ### Q3. KRBTGT and AZUREADSSOACC age check
 **How to run:** Open PowerShell as Domain Admin and run the following:
 ```powershell
 Write-Host "=== KRBTGT ===" -ForegroundColor Cyan
 Get-ADUser krbtgt -Properties PasswordLastSet |
  Select-Object @{N="Account";E={"krbtgt"}},
               PasswordLastSet,
               @{N="AgeDays";E={((Get-Date) - $_.PasswordLastSet).Days}}
 Write-Host "=== AZUREADSSOACC ===" -ForegroundColor Cyan
 Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet -ErrorAction SilentlyContinue |
  Select-Object @{N="Account";E={"AZUREADSSOACC"}},
               PasswordLastSet,
               @{N="AgeDays";E={((Get-Date) - $_.PasswordLastSet).Days}}
 ```
 Record the age in days in your tracking spreadsheet.
 **What you are looking for:** KRBTGT older than 365 days = P1 (schedule rotation with us). KRBTGT older than 180 days = note and plan. AZUREADSSOACC never rotated since initial sync setup = note.
 **Call us if:** KRBTGT is over 365 days old and there is no scheduled rotation. Or if either account shows a password age younger than expected (meaning someone rotated it without telling you — that is a finding too).
 ---
 ### Q4. Cloud-only Global Admins check
 **How to run:**
 ```powershell
 Connect-MgGraph -Scopes "Directory.Read.All"
 $gaRoleId = (Get-MgDirectoryRole -Filter "displayName eq 'Global Administrator'").Id
 $gaMembers = Get-MgDirectoryRoleMember -DirectoryRoleId $gaRoleId
 Write-Host "=== Global Admins ===" -ForegroundColor Cyan
 $gaMembers | ForEach-Object {
  $user = Get-MgUser -UserId $_.Id -Property DisplayName,UserPrincipalName,OnPremisesSyncEnabled
  [PSCustomObject]@{
    Name            = $user.DisplayName
    UPN             = $user.UserPrincipalName
    SyncedFromAD    = $user.OnPremisesSyncEnabled
  }
 } | Format-Table -AutoSize
 ```
 Any row where `SyncedFromAD` is `True` is a P0 — call us immediately.
 **What you are looking for:** Any Global Admin that is synced from on-prem AD. Any new GA you did not create.
 **Call us if:** Any synced GA appears. Any GA you do not recognize.
 ---
 ### Q5. Service principal secrets check — expiring and never-expiring
 **How to run:**
 ```powershell
 Connect-MgGraph -Scopes "Application.Read.All"
 $today = Get-Date
 $warningDays = 60
 Write-Host "=== Non-expiring secrets ===" -ForegroundColor Red
 Get-MgApplication -All | ForEach-Object {
  $app = $_
  $app.PasswordCredentials | Where-Object { $_.EndDateTime -eq $null } | ForEach-Object {
    [PSCustomObject]@{ App = $app.DisplayName; Secret = $_.DisplayName; Expires = "NEVER" }
  }
 } | Format-Table
 Write-Host "=== Secrets expiring within $warningDays days ===" -ForegroundColor Yellow
 Get-MgApplication -All | ForEach-Object {
  $app = $_
  $app.PasswordCredentials | Where-Object {
    $_.EndDateTime -ne $null -and $_.EndDateTime -lt $today.AddDays($warningDays)
  } | ForEach-Object {
    [PSCustomObject]@{ App = $app.DisplayName; Secret = $_.DisplayName; Expires = $_.EndDateTime }
  }
 } | Sort-Object Expires | Format-Table
 ```
 **What you are looking for:** Non-expiring secrets on any app registration. Secrets about to expire (these will break an application if not rotated — but they also need reviewing: is the app still needed?).
 **Call us if:** You find a non-expiring secret on an app you do not recognize. Or if you find an expiring secret and do not know which application or service it belongs to.
 ---
 ### Q6. Stale guest review
 **How to run:**
 ```powershell
 Connect-MgGraph -Scopes "User.Read.All", "AuditLog.Read.All"
 $cutoff = (Get-Date).AddDays(-90)
 Get-MgUser -Filter "userType eq 'Guest'" -All -Property DisplayName,Mail,CreatedDateTime,SignInActivity |
  ForEach-Object {
    $lastSignIn = $_.SignInActivity.LastSignInDateTime
    [PSCustomObject]@{
      Name        = $_.DisplayName
      Email       = $_.Mail
      Created     = $_.CreatedDateTime
      LastSignIn  = $lastSignIn
      DaysSinceSignIn = if ($lastSignIn) { ((Get-Date) - $lastSignIn).Days } else { "Never" }
    }
  } |
  Sort-Object DaysSinceSignIn -Descending |
  Format-Table -AutoSize
 ```
 **What you are looking for:** Guests who have not signed in for 90+ days. Guests you do not recognize (external parties from concluded projects or former vendors).
 **Call us if:** The count of stale guests is growing quarter-over-quarter and nobody is pruning them. Or if a guest account appears that belongs to an external party from a concluded engagement and still has active access.
 ---
 ### Q7. Anonymous link count
 **How to run:** Connect using PnP PowerShell (installed during engagement):
 ```powershell
 Connect-PnPOnline -Url "https://[tenant]-admin.sharepoint.com" -Interactive
 $sites = Get-PnPTenantSite -IncludeOneDriveSites
 $anonLinks = foreach ($site in $sites) {
  Connect-PnPOnline -Url $site.Url -Interactive
  Get-PnPSharingLinks | Where-Object { $_.SharingLinkType -eq "Anonymous" } |
    ForEach-Object { [PSCustomObject]@{ Site = $site.Url; Link = $_.ShareLink; Expires = $_.ExpirationDateTime } }
 }
 Write-Host "Total anonymous links: $($anonLinks.Count)" -ForegroundColor Yellow
 $anonLinks | Sort-Object Site | Format-Table
 ```
 Record the count. Save the export.
 **What you are looking for:** Count increasing quarter-over-quarter (means new anonymous links are being created despite the policy). Links with no expiration date.
 **Call us if:** Count is increasing despite the restriction we put in place. Or if you find anonymous links on sites that hold sensitive data (HR, Finance, M&A).
 ---
 ### Q8. CA policy diff — detect drift
 **How to run:**
 ```powershell
 # CAExporter is set up from the engagement — run from its directory
 .\CAExporter.ps1 -ExportPath "C:\SecurityRunbook\CA-Exports\CA-$(Get-Date -Format 'yyyy-MM-dd')"
 ```
 Then compare this quarter's export folder to last quarter's using any file diff tool (WinMerge, VS Code with the "compare folders" extension, or simply `Compare-Object` in PowerShell):
 ```powershell
 $old = Get-ChildItem "C:\SecurityRunbook\CA-Exports\CA-2026-04-01" -File | Select-Object -ExpandProperty Name
 $new = Get-ChildItem "C:\SecurityRunbook\CA-Exports\CA-2026-07-01" -File | Select-Object -ExpandProperty Name
 Compare-Object $old $new
 ```
 Then for any policy that changed, open the JSON files and compare manually. The changed lines are the configuration drift.
 **What you are looking for:** Policies deleted since last quarter. Policies whose parameters changed (exclusions added, scope narrowed, MFA grant changed to "grant without controls"). New policies in report-only mode that should have been enabled.
 **Call us if:** Any CA policy has changed without a corresponding change record. A policy that was enforcing is now in report-only mode. A new exclusion was added to a critical policy (legacy auth block, admin MFA, device compliance).
 ---
 ## "Call us" trigger list
 These are the situations where you stop, take a screenshot, and contact us — even outside a scheduled check:
 | What you see | How urgent | What to do first |
 |---|---|---|
 | Break-glass alert fires unexpectedly | Immediate | Disable any active sessions for the break-glass account, then call us |
 | New Global Admin you did not create | Immediate | Do not remove it yet — screenshot first, then call us |
 | Synced account in Global Admin role | Same day | Do not change anything — screenshot and call us |
 | DCSync alert from Defender for Identity | Immediate | Isolate the source host from the network if possible, then call us |
 | External auto-forward rule found on any executive mailbox | Same day | Disable the rule, check for mail forwarded, call us |
 | PingCastle score drops more than 10 points | Within 48 hours | Send us the report alongside the previous quarter's |
 | Any alert sitting at High severity for more than 24 hours you do not know how to triage | Within 24 hours | Screenshot, note what the alert says, call us |
 | Backup restore fails or produces corrupt data | Same day | Do not delete anything — call us |
 | Something that feels wrong but is not on this list | Use your judgement | A wrong feeling is data. Document what you noticed and send it. We will tell you if it is nothing. |
 ---
 ## Tracking spreadsheet columns
 Keep a simple spreadsheet (Excel or SharePoint list) with one row per check per quarter:
 | Date | Check | Result / Count | vs. Last Quarter | Action taken | Escalated to consultant? |
 |------|-------|---------------|-----------------|--------------|--------------------------|
 The trend matters more than any individual value. A metric that is consistently getting worse is a finding even if no single value crosses a threshold.
 ---
 ## When to schedule the next full engagement
 Use this as a rule of thumb:
 - **Annual:** Full adversarial validation (the engagement that produced this document). Recommended even if the monthly and quarterly checks are clean — they catch drift, not adversarial paths.
 - **Triggered:** Any time a "call us immediately" event fires, or PingCastle / Purple Knight produces a new Critical finding.
 - **Project-triggered:** Before any major change to the estate — AD migration, new cloud service onboarding, M365 license change, acquisition or merger, significant IT staff change.
 ---
 *Self-service cadence for [client name]. Produced June 2026. Review and update January 2027 alongside the field guide update.*
@@ -0,0 +1,194 @@
 # The Antifragile Handbook for M365 & Active Directory
 ## Book I — Principles & Judgement
 > *Move fast and fix things.*
 ---
 ## Why this book exists
 This is not a benchmark. It will not give you a number to report to a steering committee. It will not tell you that your tenant is 87% compliant, because that number is a lie that makes everyone feel safe while the building burns. Compliance frameworks — CIS, NIST, ISO, the lot — answer one question: *did you do the things on the list?* That is a useful question. It is not the important one. The important question is: **when this gets attacked, does it get weaker, stay the same, or get stronger?** A system that gets stronger from being stressed is antifragile. Almost no M365 + AD estate is antifragile by default. Most are the opposite: a flat domain synced to a cloud tenant, where one phished helpdesk account quietly becomes domain dominance becomes Global Admin. That is fragility wearing a compliance certificate. A consultant trained on benchmarks knows *what* the settings should be. A consultant trained on this book knows *which settings matter, why, and what breaks if they're wrong* — and can walk into a tenant they've never seen and find the thing that will actually kill the client. That is the difference between a technician and an independent professional. We are trying to raise the second kind.
 ### What "move fast and fix things" actually means
 It is a deliberate edit of the old Silicon Valley creed. The original assumed things were whole and that breaking them was the cost of speed. Our world is the reverse: **the things are already broken.** Legacy auth is still on. Service accounts from 2014 still have domain admin. Nobody has tested the break-glass account since it was created. Speed, here, is not recklessness — it is refusing to let a thirty-page risk-acceptance process protect a fragility that a teenager with a phishing kit will remove for free. So:
 - **Fast** — bias to action. A fix shipped this week beats a perfect fix discussed for a quarter. Fragility compounds while you deliberate.
 - **Fix** — actually change the structure, not the documentation. A risk you *accepted* is a risk you still have.
 - **Things that matter** — and this is the whole craft — the discrimination to know that disabling legacy auth outranks renaming forty GPOs to match a naming standard. Most of the checklist is noise. Find the signal.
 ### How compliance still fits (read this before you get smug)
 We are not anti-compliance. We are anti-*thoughtless* compliance. Your clients have auditors, contracts, and regulators, and you will still help them pass. The relationship is this:
 > **Compliance is a floor and a by-product. It is never the target.**
 If you build an antifragile estate, you will pass CIS almost by accident, and you will be able to explain *why* every control exists — which is more than most auditors can. But you will also do things no benchmark asks for (game-days, kill-switch drills, deliberate removal of features) and you will *skip* things benchmarks demand when they add fragility or cost without reducing blast radius. When you skip, you skip **on the record, with a written reason**. That is the difference between independent judgement and laziness.
 ---
 ## The governing question
 Before the principles, the one question that sits above all of them. Ask it of every account, every trust, every sync, every app registration:
 > **If this is owned tonight, what is the largest thing an attacker reaches before hitting a wall — and can I draw that wall?**
 If you cannot draw the wall, there is no wall. In M365 + AD the wall is almost always missing in the same place: the **identity bridge** between on-prem AD and Entra ID. Internalise this and half the job is done.
 ---
 ## The Principles
 Nine of them. They overlap on purpose — antifragility is a way of seeing, not a checklist (the irony would be unbearable). Each comes with **judgement prompts**: the questions an independent consultant asks instead of looking up the "correct" value. Learn the questions, not the answers. The answers change with every tenant; the questions don't.
 ---
 ### 1. Via Negativa — subtract before you add
 The strongest control is the thing that no longer exists. It cannot be misconfigured, cannot be exploited, cannot drift, and costs nothing to maintain. Benchmarks are addition machines — every control is something *more* to deploy and watch. Start the other way: what can we **delete**? In M365 + AD, the highest-leverage deletions are usually: legacy/basic auth, NTLM and unconstrained delegation, standing privileged role assignments, dormant service accounts and their static secrets, unused federation, public folders, orphaned app registrations with tenant-wide consent, and "temporary" firewall or CA exclusions that became permanent. **Judgement prompts**
 - If I removed this control/feature/account, would *anyone* notice within 90 days? If not, why does it exist?
 - What is the oldest thing here still running, and who decided it should keep running — or did nobody decide?
 - Every exclusion is a tiny hole punched in a wall. List the exclusions. Who asked for each, and is that person still here?
 - Am I about to *add* a control to compensate for something I could *remove* instead?
 ---
 ### 2. The Barbell — protect the irreplaceable, let the rest stay cheap
 Compliance scoring spreads effort evenly: every control worth the same point. Reality is not evenly distributed. A handful of things are irreplaceable — tenant root, Tier 0 / domain controllers, break-glass accounts, backups, the sync engine. Everything else is, in principle, rebuildable. Put **paranoid, expensive, redundant** protection on the irreplaceable few. Let everything else be **cheap, fast, and replaceable** — even disposable. Do not spend your political capital hardening a kiosk laptop while a Global Admin has no phishing-resistant MFA. The middle — moderate protection spread thinly over everything — is where budgets and attention go to die. **Judgement prompts**
 - Name the five things in this estate that, if lost, cannot be rebuilt. Are they protected differently from everything else, or the same?
 - Where is effort being spent evenly that should be spent asymmetrically?
 - Is anything in the "cheap and replaceable" bucket actually load-bearing in disguise? (The "temporary" script on someone's laptop that runs payroll.)
 - Could I afford to let this thing be *destroyed* and just rebuild it? If yes, stop gold-plating it.
 ---
 ### 3. Blast Radius is the metric — not the control count
 This is the governing question turned into a habit. Compliance counts inputs (controls present). Antifragility measures **propagation** (how far a compromise travels). A tenant with 200 controls and a flat AD→Entra trust is more fragile than a tenant with 50 controls and a real tier boundary. The defining fragility of hybrid M365 is **coupling**: Password Hash Sync or PTA, Entra Connect running as a quasi-Tier-0 service, AD admins who are also cloud admins, devices that are both domain-joined and the user's MFA device. Each coupling means one compromise becomes two. Antifragile design **decouples** — it turns the identity bridge from a conduit into a firebreak. **Judgement prompts**
 - Draw the attack path from a single phished standard user to Global Admin. How many *independent* barriers are there? Independent, not "two MFA prompts from the same provider."
 - Which single account, if compromised, ends the engagement? How many are there? (If the answer is more than zero, that's the project.)
 - If on-prem AD fell completely, would the cloud survive — and vice versa? Or are they one organism wearing two badges?
 - What runs the sync, and what could that identity reach? Trace it.
 ---
 ### 4. Optionality — buy cheap escape hatches
 Pay a small, certain cost now for the *option* to survive an uncertain disaster later. Break-glass accounts, a tested "kill the sync" runbook, a way to revoke all tokens at once, an offline copy of recovery keys, a documented path to a clean tenant. These look like waste to an auditor and like wisdom on the worst day of the client's year. Optionality is the opposite of optimisation. An optimised system has no slack and shatters at the first surprise. Deliberately keep some slack. **Judgement prompts**
 - When the primary path fails, what's the second path — and has anyone walked it?
 - If we had to sever AD from Entra in the next 30 minutes to contain a breach, *how*? Is that written down where someone panicking can find it?
 - Break-glass: does it exist, is it phishing-resistant, is it excluded from the CA policy that would otherwise lock it out, and when was it last *used* in a drill (not just created)?
 - What are we optimising so hard that we've removed all room to manoeuvre?
 ---
 ### 5. Stress it on purpose — hormesis, not hope
 Muscle, bone, and immune systems get stronger from controlled stress and weaker from protection. Systems are the same. **An untested control is a broken control** — you simply don't know it yet. The benchmark says "the setting is configured." The antifragile consultant says "we revoked the token at 14:00 on a Tuesday and watched what actually happened." Run game-days. Disable a CA policy and observe the fallout in a controlled window. Simulate Entra Connect failure. Pull a Global Admin's session. Kill a DC. You *want* to discover brittleness on a quiet afternoon, cheaply, with the right people watching — not at 3 a.m. during a real intrusion. **The corollary: declared state is not enforced state.** Underneath "untested = broken" sits a harder truth about *why* you must test — every representation the platform hands you (a config blade, an inventory record, a compliance dashboard, a green tick) is a **claim about reality, not reality itself**, and the two diverge silently and routinely. Two examples that should haunt you:
 - A Conditional Access policy can display a flawless configuration and **enforce nothing** — the evaluated object has desynced from the one you're looking at. Every config review, export-diff, and benchmark audit passes. Only a real sign-in reveals it fails open. (Worked example in Book IV.)
 - A CMDB or device inventory shows a clean, managed fleet while the sign-in logs show a different, larger, partly-unknown population actually touching the data. The inventory is a wish; the authentication record is the fact. (Worked example in Book IV.)
 So the rule that governs the whole craft: **verify by observation, never by inspection.** Trust what the system *does* under test over what any artefact *says* it does. Reading the config is not knowing the behaviour; counting the inventory is not knowing the fleet. Where the representation and the observed behaviour disagree, the behaviour is the truth and the representation is the bug. **Judgement prompts**
 - What here has never once been tested by actually breaking it?
 - What do we *believe* is true about this estate that we've never verified by observation? (Belief is not evidence. The portal showing a green tick is not the same as the control firing under attack.)
 - Which "facts" about this estate come from a *representation* (config screen, CMDB, dashboard) rather than from *observed behaviour*? Which have we confirmed the system actually does, versus merely says?
 - Where would a silent divergence between declared and enforced state hurt most — and how would we even notice it?
 - When did this client last deliberately break something to learn from it? If "never," that's the most important finding in your report.
 - What's the smallest, safest experiment that would tell us whether X is real?
 ---
 ### 6. Every incident must change the structure
 This is the actual definition of antifragile — *gaining from disorder.* A robust system survives a shock unchanged. An antifragile system comes out **structurally different and harder to hit the same way twice.** Pain that closes a ticket without changing the architecture is wasted pain, and it guarantees the same incident again. After every incident, near-miss, failed game-day, or even a noisy false positive: what *structural* thing changes? Not "we reminded users to be careful." A removed permission, a severed coupling, a new firebreak, a deleted feature. **Judgement prompts**
 - For the last three incidents (or alerts) here — what changed in the *structure* afterwards? If the answer is "a training reminder," nothing changed.
 - Does this organisation treat incidents as embarrassments to bury or as fuel? (Blameless on people, ruthless on structure.)
 - Are we fixing the instance or the class? Patching this account, or removing the pattern that made it possible?
 - What did the last false positive *teach* us that we threw away?
 ---
 ### 7. Convexity — prefer bounded cost, unbounded upside
 Choose controls whose downside is small and known, and whose upside is large and broad. Conditional Access is convex: cheap to run, fails gently, and one good policy blocks whole classes of attack. A sprawling, hand-tuned DLP ruleset is concave: expensive to maintain, brittle, and it fails in surprising, expensive ways at the worst moment. Favour the convex. Be deeply suspicious of any control that needs constant tending to keep working. **Judgement prompts**
 - When this control fails, does it fail *safe and quietly*, or *open and catastrophically*? (Fail-open is concave and usually a trap.)
 - How much ongoing care does this need to keep working? High-maintenance controls rot the moment attention moves on.
 - Does this control block a *class* of attacks or just one specific instance? Prefer the class.
 - Are we buying a complex product to solve a problem that one CA policy and a deletion would solve?
 ---
 ### 8. Lindy — trust what has survived
 The longer a mechanism has survived, the longer it's likely to keep working. Boring, time-tested controls (least privilege, network segmentation done right, hardware-backed keys, tiered admin) beat the newest preview blade in the portal. New features arrive with unknown failure modes and unknown attack surface; they have not yet been stress-tested by the world. Use them when they earn it, not because they're new. Equally: an attack technique that has worked for fifteen years (NTLM relay, Kerberoasting, consent phishing) will probably work next year — prioritise accordingly. **Judgement prompts**
 - Is this control time-tested, or are we the QA team for a feature that shipped last month?
 - What are the oldest, most reliable attacks against this estate — and have we actually closed them, or chased novel ones while the classics stay open?
 - If this shiny feature vanished tomorrow, would we be exposed? If yes, we built on sand.
 - Are we solving a 2015 problem with a 2026 product because the product is new?
 ---
 ### 9. Skin in the game — whoever designs it, lives with it
 Security theatre is what happens when the people imposing controls never carry the pager. A consultant who recommends a control they'd never have to operate is selling fragility dressed as diligence. The person who designs the break-glass process should be woken up by the drill. The architect who couples AD to Entra should be the one who has to uncouple it under fire. This applies to you. Don't recommend what you wouldn't run. Don't hand a client a 40-page hardening guide you've never operated. Your reputation is your skin in the game — stake it on advice that survives contact with reality. **Judgement prompts**
 - Does the person who designed this control have to live with its consequences? If not, expect theatre.
 - Am I recommending this because it's right, or because it's defensible if something goes wrong? (Defensive medicine is fragility you can bill for.)
 - Would I bet my own reputation that this works under real attack? If I hesitate, why am I asking the client to bet theirs?
 - Who gets the 3 a.m. call when this fails — and were they in the room when it was designed?
 ---
 ## How to spot fragility (the field skill)
 You will walk into estates with no documentation and no time. Fragility has a smell. Train your nose on these tells:
 - **Folklore.** Configurations only one person understands, justified by "we've always done it that way." If they leave, it becomes un-auditable. Folklore is fragility with tenure.
 - **Single points of failure wearing a uniform.** One service account that runs everything. One admin who holds all the keys. One unreplicated DC. One sync server treated as cattle but actually a pet.
 - **Tight coupling.** Compromise one thing → automatically own a second. AD↔Entra, identity-device-MFA all on one phone, prod and admin in one forest.
 - **Things never tested.** Backups never restored. Break-glass never used. DR plans never run. "It should work" is the sound of a fragile system.
 - **Permanent "temporary."** Exclusions, exceptions, pilot configs, and risk acceptances older than 18 months.
 - **Even spreading.** Effort distributed uniformly is a sign nobody asked what matters. The barbell is missing.
 - **Green dashboards, untested reality.** Everything compliant, nothing ever stress-tested. The most dangerous estate of all, because it feels safe.
 ---
 ## The anti-benchmark: what we measure instead of compliance %
 We don't score controls passed. If the client needs a number, give them these — and explain why each beats a compliance percentage:
 - **Blast radius** — from a single phished standard user, how many independent barriers to tenant/domain dominance? (Higher is better. Most estates: zero or one.)
 - **Mean time to recover** — measured by *actually doing it* in a drill, not by the RTO written in a policy.
 - **Single points of failure** — counted, named, and owned. The goal is a shrinking list, not a green tick.
 - **Untested assumptions** — the number of load-bearing beliefs never verified by observation. The goal is to drive this toward zero.
 - **Time-to-remove** — how fast can we delete a fragilizer (legacy auth, a standing admin) once found? Velocity *is* a security metric.
 None of these are easy to fake, which is exactly why they're worth measuring.
 ---
 ## How to use this handbook
 Book I is the lens. The domain books that follow — Hybrid Identity, Privileged Access, Devices, Data & Collaboration, Recovery, Detection-as-feedback — each apply this same lens in the same shape:
 1. **Fragility inventory** — where does this domain break, and what's the blast radius?
 2. **Via negativa** — what do we remove first?
 3. **The barbell** — what gets paranoid protection, what stays cheap?
 4. **Optionality & recovery** — what are the escape hatches, and are they tested?
 5. **Stressor** — how do we deliberately break this to learn?
 If you ever find yourself reaching for "because the benchmark says so," stop. Go back to the governing question. Draw the wall. If you can't draw it, you've found your work.
 ---
 *Book I of the Antifragile Handbook. Principles over checklists. Judgement over obedience. Move fast and fix things.*
@@ -0,0 +1,167 @@
 # The Antifragile Handbook for M365 & Active Directory
 ## Book II — Hybrid Identity
 > *Draw the wall between on-prem and cloud. In most estates there isn't one — there's a hallway with the door propped open.*
 ---
 ## Why this is the keystone
 If you only ever fix one domain, fix this one. Every other book — privileged access, devices, data — assumes identity holds. In a hybrid M365 + AD estate, identity usually doesn't hold, and the reason is always the same: on-prem AD and Entra ID are not two systems with a guarded border. They are **one organism wearing two badges**, joined by a bridge that most organisations cannot draw, do not monitor, and have never tested severing.
 The governing question, applied here:
 > **If on-prem AD is ransomwared or domain-dominated tonight, does the cloud survive — or is it already poisoned by inheritance?**
 For the overwhelming majority of estates the honest answer is "poisoned," and nobody has ever said it out loud. Your job is to say it out loud, then build the wall.
 ---
 ## 1. Fragility inventory — anatomy of the bridge
 You cannot harden what you can't draw. Here is the bridge, piece by piece, with the blast radius of each. Learn to find all of these on day one of an engagement.
 ### The sync engine (the single most dangerous server you'll forget about)
 Entra Connect Sync (the old Azure AD Connect) or Entra Cloud Sync runs the synchronisation. Whatever the diagram says, **this server is Tier 0** — because of the accounts it holds:
 - **The on-prem connector account.** Under the old "express" install, this account was granted *Replicate Directory Changes* and *Replicate Directory Changes All* — which is **DCSync**. That means the sync server holds an identity that can pull every password hash in the domain. Read that again. The box your infra team treats as a middling utility VM can dump the entire domain.
 - **The Entra connector account** (Directory Synchronization Accounts role) — can manipulate synced objects in the cloud.
 So: compromise the sync server → DCSync on-prem **and** tamper with cloud objects. One box, both kingdoms. If this server is domain-joined to the production domain (it usually is), then anything that reaches prod-tier reaches your DCSync machine. That is the central coupling of the entire estate.
 **Where it's worse than you think:** the sync server is often internet-facing for updates, runs a local SQL Express nobody patches, sits on an OS build from the project that installed it, and has not had its connector account rights reviewed since go-live.
 ### The authentication method (decides whether the cloud lives or dies with AD)
 Three options, three completely different fragility profiles. Know which one you're actually on before you say anything — the diagram and the reality often disagree.
 - **Password Hash Sync (PHS).** A hash-of-a-hash is synced to Entra; the cloud can authenticate on its own. *This is the most resilient for availability* — if on-prem dies, cloud auth keeps working. The transport is fine and not trivially reversible to the plaintext password; the risk is **not** "PHS leaks passwords," it's that the connector account doing the sync can DCSync. Don't let anyone fragilise availability to "fix" a risk that lives in the connector account, not the hash.
 - **Pass-through Authentication (PTA).** Credentials are validated against on-prem AD in real time by PTA agents. **Coupling: on-prem outage = cloud auth outage.** Worse, the agent must handle the credential to validate it, so a compromised PTA agent is a plaintext-credential harvesting position. PTA agents are Tier 0 and a juicy target, and PTA is a conduit, not a firebreak. (You can enable PHS *alongside* PTA as failover — cheap optionality, see §4.)
 - **Federation / AD FS.** The catastrophe. See below — it gets its own treatment because it's usually the single largest fragility in the estate.
 ### AD FS and Golden SAML (the thing that ends careers)
 If AD FS issues tokens, then the **token-signing key** can forge a SAML assertion for *any* user — including bypassing MFA when MFA is enforced at the federation layer — and the cloud will trust it because it's validly signed. This is **Golden SAML**. It is how nation-state actors turned a single on-prem foothold into silent, total, persistent cloud impersonation (the SolarWinds intrusions). It is nearly invisible: the IdP is forging legitimate tokens, so there's no failed login, no anomalous password, nothing for a benchmark to catch.
 The token-signing certificate is a single catastrophic point of failure that most orgs never rotate, store poorly, and don't monitor. If you take one thing from this book: **AD FS is fragility incarnate, and the correct long-term answer is to remove it** (§2), not to harden it.
 ### Seamless SSO (the forgotten Kerberos key)
 Seamless SSO creates the `AZUREADSSOACC` computer account in AD. Its Kerberos decryption key, if never rotated (it usually never is), is a silver-ticket / token-forging exposure. Classic Lindy fragility: old, unrotated, forgotten, exploitable.
 ### The writebacks (reverse conduits nobody counts)
 Every writeback turns the bridge two-way and creates *reverse* blast radius:
 - **Password writeback** — cloud SSPR can change on-prem passwords. Useful; also a path from cloud to on-prem.
 - **Device writeback / group writeback** — cloud objects written into AD. Group writeback (v2), where cloud security groups become AD objects that gate on-prem resource access, means a **cloud group compromise now affects on-prem access** — a coupling people rarely diagram.
 Each writeback may be justified. None should be silent. Count them, name the blast radius of each.
 ### The admin coupling (one organism, two badges)
 The deepest fragility isn't a setting, it's the people and accounts:
 - The same humans are Domain Admins **and** Global Admins.
 - Cloud admin accounts are **synced from on-prem**, so on-prem compromise → harvest → cloud admin.
 - Admins use the same workstation for AD and Entra, and that workstation is also their email/MFA device.
 If on-prem privilege flows into cloud privilege through any of these, there is no wall. There's a hallway.
 ### Source of authority (why you can't fix it in the cloud)
 For synced objects, **on-prem is authoritative**. You cannot durably fix a synced object purely cloud-side; the next sync cycle overwrites you. This matters enormously in incident response: if AD is owned, your cloud objects are downstream of poison and "just fix it in Entra" doesn't hold.
 ---
 ## 2. Via negativa — what to remove (in priority order)
 Hybrid identity is where subtraction pays the highest dividend in the whole estate. In rough order of leverage:
 1. **Remove AD FS. Migrate to cloud authentication** (PHS, or PTA if you have a hard real-time-validation requirement), and move MFA and access decisions to Conditional Access in Entra where they belong. This deletes Golden SAML as a class, shrinks attack surface massively, and removes a SPOF you were never rotating anyway. This is the single highest-leverage deletion in this book.
 2. **Stop syncing privileged on-prem accounts to the cloud.** Domain Admins, Enterprise Admins, Tier 0 — filter them *out* of sync scope. They have no business being cloud objects. A synced privileged account is a free bridge for the attacker.
 3. **Make cloud admins cloud-only.** Global Admins and other Entra privileged roles should be cloud-only accounts (`.onmicrosoft.com`), phishing-resistant, never derived from or synced with on-prem identity. This is the firebreak in one move (see §3).
 4. **Trim the writebacks.** Keep only the ones with a named owner and a justified reverse blast radius. Delete the rest.
 5. **Rotate or remove Seamless SSO.** If you don't need it, remove the `AZUREADSSOACC` account. If you keep it, rotate the key on a schedule — and the fact that nobody has is itself a finding.
 6. **Reduce sync scope.** OU-filter aggressively. Don't sync what the cloud doesn't need. Every synced object is attack surface and a potential bridge. The default "sync everything" is laziness, not architecture.
 For each deletion the test from Book I applies: *if I removed this, would anyone notice in 90 days?* For AD FS the honest answer, after migration, is usually "no — and the attackers will notice it's gone."
 ---
 ## 3. The barbell — what gets paranoia, what stays cheap
 **The irreplaceable few (paranoid protection, redundancy, monitoring):**
 - **The sync server.** Treat it as Tier 0 *in practice*, not just on the diagram: dedicated admin tier, no internet browsing, hardened OS, least-privileged connector account (use a gMSA; strip DCSync rights if your topology allows the scoped permission model), restricted logon, alerting on the connector account's behaviour.
 - **The connector accounts.** Least privilege, gMSA where supported, monitored. An account that can DCSync should scream in your SIEM if it ever behaves like a domain controller from the wrong host.
 - **The AD FS token-signing key** — if AD FS still exists, the key belongs in an HSM, monitored, rotated on a real schedule (remember the rollover cert). But the better barbell move is §2.1: don't own this liability at all.
 - **Cloud-only break-glass Global Admins** (from Book I) — phishing-resistant, excluded from the CA policy that would lock them out, tested.
 **The firebreak — the one design decision that builds the wall:**
 > **Cloud privilege must not be reachable from on-prem compromise.**
 Cloud-only admin accounts + not syncing privileged on-prem accounts + separate privileged workstations = on-prem can fall completely and the attacker still hits a wall at the cloud admin boundary. *That wall is the entire point of this book.* Draw it, then verify an attacker can't walk around it through the sync server (which is why the sync server is in the paranoid bucket).
 **Everything else stays cheap.** Standard user sync, normal device registration, the bulk of the directory — these are replaceable and don't deserve the attention that the sync server and the admin boundary demand. Don't gold-plate the directory while the connector account can dump it.
 ---
 ## 4. Optionality & recovery — escape hatches, tested
 - **The "kill the sync" runbook.** A written, rehearsed procedure to stop sync fast when on-prem is compromised, so poison stops flowing cloud-ward. Know the nuance per auth method, because severing behaves differently:
  - *PHS:* disabling sync stops new changes flowing, but already-synced hashes remain — containment of *propagation*, not instant revocation. Pair with token revocation and credential resets.
  - *PTA / Federation:* severing the bridge can take cloud auth down with it unless you've pre-staged a fallback. Which is why —
 - **Pre-stage the federated-to-managed conversion.** Know, in advance, how to convert the domain from federated (or PTA) to managed/cloud auth (PHS) *fast*, so that during an on-prem incident you can cut the dependency and keep the cloud alive on its own. Rehearse it. "We think we could" is not a plan.
 - **PHS as failover under PTA.** Cheap optionality: run PHS alongside PTA so a PTA-agent or on-prem outage doesn't lock everyone out of the cloud. Small certain cost now, large uncertain payoff later. Classic Book I optionality.
 - **Cloud-only admin path that survives AD death.** Because cloud admins are cloud-only (§3), you retain full control of the tenant even if AD is gone. This *is* the recovery path — verify it actually works without any on-prem dependency (including MFA that doesn't secretly route through on-prem).
 - **Accept the source-of-authority reality.** Your IR plan must account for the fact that synced objects are downstream of on-prem. Decide *in advance* whether, during a domain-dominance incident, you sever first and rebuild authority cloud-side. Discovering this mid-incident is how recoveries fail.
 ---
 ## 5. Stressor — break it on purpose
 Untested = broken. Game-days for hybrid identity, smallest/safest first:
 - **Pull the sync server** (planned window). Does cloud auth survive? The answer *proves* which auth method you're really on and whether your availability assumptions are true. Most teams are surprised. That surprise is the point.
 - **Revoke / disable the connector account and watch your SIEM.** Did anything alert? An account that can DCSync going dark, or behaving oddly, should be the loudest alarm you own. If nothing fired, you've found a detection gap worth more than any control you could add.
 - **Golden SAML tabletop** (if AD FS exists). Walk through: attacker has the token-signing key — what do you detect, how do you contain, how fast can you rotate, and could you tell at all? If the honest answer is "we couldn't tell," escalate the §2.1 removal from "roadmap" to "now."
 - **Break-glass under sync-down.** Test the cloud-only break-glass account *while the bridge is severed*. It must work with zero on-prem dependency. If it silently relied on something on-prem, you just found it on a Tuesday instead of during the breach.
 - **DCSync detection drill.** Have someone simulate DCSync from an unexpected host and confirm detection fires. The connector account is the one place DCSync is "normal," which is exactly why attackers love to look like it.
 Every one of these, per Book I principle 6: whatever breaks must produce a **structural** change, not a calendar reminder.
 ---
 ## Honest uncertainty (read this, don't trust a handbook on moving parts)
 This book teaches stable mechanisms — the coupling between AD and Entra, Golden SAML, the DCSync-via-connector path, the PHS/PTA/federation trade-offs. Those don't change much; they're Lindy.
 What **does** move, and what you must verify against current Microsoft documentation rather than trusting any 2026-vintage handbook:
 - **Connect Sync vs Cloud Sync feature parity.** Microsoft has been steering new deployments toward the lighter Cloud Sync agent (no SQL, multiple agents for HA — better optionality), but parity for specific scenarios (certain writebacks, device sync, large/complex topologies, passthrough nuances) has been evolving. **Check the current parity matrix before you recommend a migration.** Don't let me, or any document, freeze this for you.
 - **AD FS deprecation / migration tooling.** Direction of travel is clearly away from AD FS toward Entra-native auth, with staged-rollout and migration tooling to ease it. Exact timelines, tool capabilities, and supported paths shift — verify current state when you scope the work.
 - **Connector account hardening guidance** (gMSA support, least-privilege permission models, the scoped alternative to full DCSync rights) continues to improve — confirm what's available for your topology and version.
 If a client's safety depends on a current-version specific, **look it up and cite it**, don't quote your memory or this book. Honest "I need to verify the current parity" beats confident and wrong every time. That's not weakness; that's the job.
 ---
 ## Consolidated judgement prompts
 The questions to carry into any hybrid estate:
 - Which auth method are we *actually* on — and does the cloud survive on-prem death? (Verify by testing, not by asking.)
 - Is the sync server Tier 0 in practice or only on the diagram? What can its connector account reach? Can it DCSync?
 - Are any privileged on-prem accounts synced to the cloud? Are Global Admins cloud-only or synced?
 - Can on-prem privilege reach cloud privilege by *any* path — accounts, workstations, the sync server, writebacks? Draw every path. Each one is a hole in the wall.
 - Do we have AD FS? *Why?* What exactly would removing it take, and what's the honest reason it hasn't happened?
 - When was the Seamless SSO key / AD FS token-signing cert last rotated? ("Never" is a finding, not an answer.)
 - Which writebacks are on, and what reverse blast radius does each create?
 - If we severed the bridge in the next 30 minutes, what breaks, and is the procedure written where someone panicking can run it?
 ---
 *Book II of the Antifragile Handbook. The wall between on-prem and cloud is the most important structure you will ever draw — because in most estates, it isn't there. Move fast and fix things.*
@@ -0,0 +1,147 @@
 # The Antifragile Handbook for M365 & Active Directory
 ## Book III — Privileged Access
 > *Privilege is blast radius with a time axis. Standing privilege reaches everything, forever. The whole job is to collapse both: less reach, less time.*
 ---
 ## The governing question
 Book I asked you to draw the wall. Book II built it between on-prem and cloud. This book is about the credentials that can knock any wall down. Ask of every privileged identity — human, service account, or app:
 > **If this credential leaks tonight, how long does it stay useful, and how far does it reach?**
 A permanent Domain Admin answers *"forever, everything."* A permanent Global Admin answers *"forever, the whole tenant."* A JIT, scoped, time-boxed role answers *"for one hour, for one task."* Every technique in this book exists to turn the first kind of answer into the second. That's it. That's the whole craft of privileged access: **shrink the reach, shrink the time.**
 Compliance counts whether you "have a PAM solution." Wrong question. The question is whether privilege *evaporates when not in use* and whether a leaked credential hits a wall in minutes instead of owning the estate forever.
 ---
 ## 1. Fragility inventory — where privilege rots
 ### Standing privilege (the original sin)
 An account that is *always* an admin is a loaded gun left on the table, every hour of every day, whether anyone's using it or not. Its blast radius is constant and maximal. Permanent Domain Admins, permanent Enterprise Admins, permanent Global Admins — every one of them is a credential whose value to an attacker never drops to zero. **The single most important number in this book is: how many identities hold standing privilege?** In most estates it's an order of magnitude too high, and nobody has ever counted.
 ### Service accounts and service principals (the dark matter)
 This is where the bodies are buried, on both sides of the wall:
 - **On-prem service accounts** — over-permissioned ("we made it Domain Admin to make it work"), static passwords that haven't changed since 2016, an SPN attached so they're **Kerberoastable** (request the ticket offline, crack the weak password at leisure), owned by nobody, documented nowhere, and impossible to turn off because something unknown will break.
 - **Cloud service principals / app registrations** — the same disease in a new body. Client secrets that never expire, **tenant-wide admin consent**, and Microsoft Graph permissions that are quietly catastrophic: `RoleManagement.ReadWrite.Directory`, `AppRoleAssignment.ReadWrite.All`, `Application.ReadWrite.All` — any of which is a privilege-escalation path to Global Admin. Service principals **cannot do MFA**, usually hold **standing** privilege, and live in a blind spot no benchmark looks at hard enough.
 Service identities are dark matter: most of the privileged mass of the estate, invisible in the usual diagrams, and gravitationally dominant when something goes wrong.
 ### Tier violations (the wall with a hole kicked in it)
 The Lindy core of on-prem security is the tier model (Tier 0 = identity control plane: DCs, AD, ADCS, the sync server from Book II; Tier 1 = servers; Tier 2 = workstations). Microsoft has since reframed it as the Enterprise Access Model reaching into the cloud, but the rule never changed:
 > **A higher-tier credential must never be exposed on a lower-tier system.**
 Every Domain Admin who RDPs into a workstation, every admin whose daily-driver laptop also touches a DC, every shared jump box used for both Tier 0 and Tier 1 — that's a tier violation, and it's how `pass-the-hash` / `pass-the-ticket` turns one phished workstation into domain dominance. The clean-source principle is absolute: **you cannot securely manage a system from a less-secure one.**
 ### The escalation plumbing nobody maps
 - **AD ACL backdoors** — who can reset whose password, who has `WriteDACL` / `GenericAll` on what. Privilege hides in object permissions, not just group membership. Attackers map this in minutes; defenders rarely map it at all.
 - **Delegation** — unconstrained delegation is a standing golden-ticket risk; constrained/RBCD misconfigurations are escalation paths.
 - **ADCS** — the certificate services escalation paths (the ESC-series misconfigurations) turn a forgotten CA template into domain compromise. ADCS is **Tier 0** and is almost always treated as Tier 1 or forgotten entirely.
 - **KRBTGT** — the master key behind golden tickets. Rarely rotated; if an attacker ever had it, they may still have it.
 - **LAPS absent** — without per-machine local admin password randomisation, one cracked local admin hash unlocks lateral movement across every machine sharing it.
 ### The recovery paradox
 The accounts that can rebuild the estate after a disaster are, by definition, the most powerful — and therefore the most valuable to an attacker. Break-glass done carelessly is just standing privilege with a heroic name. (Handled in §4.)
 ---
 ## 2. Via negativa — what to remove (in priority order)
 Privilege is the domain where deletion is the entire strategy. Adding "privileged access controls" on top of unmanaged standing privilege is rearranging furniture in a burning room.
 1. **Eliminate standing privilege.** Roles become *eligible*, not *active*. Cloud-side this is PIM (§3). On-prem it's harder and the tooling is weaker — be honest about that (§ honest uncertainty) — but time-bound group membership and JIT elevation tooling exist; use them. The target state: at rest, almost nobody is an admin.
 2. **Empty the top groups toward the irreducible minimum.** Drive Domain Admins, Enterprise Admins, and standing Global Admins down to the smallest number that reality permits (plus break-glass). Delegate specific rights instead of handing out god-mode. "Empty Domain Admins" is an achievable goal, not a fantasy.
 3. **Kill, convert, or constrain service identities.** Remove the ones nobody can justify (apply the 90-day-scream test). Convert the rest to managed identities — **gMSA** on-prem (the established, Lindy fix: automatic password rotation, no static secret, not Kerberoastable in the same way), **managed identities** in Azure where possible. Strip every excess right. For app registrations: remove the dangerous Graph permissions, expire and rotate secrets, prefer certificate credentials or managed identities over secrets, and delete unused registrations and stale consent grants.
 4. **Remove tier violations.** No high-tier credential on a low-tier box, ever. This is mostly subtraction — taking admin rights *off* daily-driver machines and shared boxes.
 5. **Fix the escalation plumbing by removal.** Decommission unused ADCS templates, remove unconstrained delegation, prune dangerous ACLs, deploy LAPS so standing shared local admin passwords cease to exist.
 6. **Remove standing local admin from users.** Most don't need it. The ones who think they do usually need it for ten minutes a month — which is a JIT problem, not a standing-rights problem.
 ---
 ## 3. The barbell — paranoia for the control plane, cheap for the rest
 **The irreplaceable few (paranoid, redundant, monitored):**
 - **Tier 0** — DCs, AD, ADCS, KRBTGT, and the sync server from Book II. This is the control plane; if it falls, everything falls.
 - **The handful of break-glass Global Admins** (§4).
 - **The PIM / role-management configuration itself** — because whoever controls *who can become admin* is effectively admin. Privileged Role Administrator and Privileged Authentication Administrator are crown roles; treat them as such.
 **Paranoid protection for privileged work means, non-negotiably:**
 - **PAWs — the principle and the practical reality.** The principle: all Tier 0 / Global Admin work from a clean, hardened, single-purpose device that never reads email or browses the web. The admin's normal laptop is Tier 2. This is right. The practical reality: physical PAWs almost never get deployed. The hardware procurement, the second device on the desk, the behaviour change — all of it defeats the project before it starts. The deployable alternative that preserves the essential properties is a **cloud-hosted admin workstation** — a Windows 365 or Azure Virtual Desktop VM provisioned from a hardened template, enrolled in the management overlay, used only for privileged tasks. The admin connects from their normal device via browser or RDP. Privileged credentials live in the cloud VM, not on the admin's local device. If the VM is compromised: wipe it, reprovision from template in 20 minutes. The security property is the same — credentials isolated from the daily-use device — without the hardware problem. This is the practical PAW. Recommend it before recommending a dedicated physical device; it will actually get deployed.
 - **The management overlay** connects the admin workstation (cloud VM or physical PAW) to the systems it manages without exposing those systems to the general network. The T0/T1 split matters here and maps directly to the tier model: T0 systems (DCs, ADCS, sync server) get an overlay with no external runtime dependency (Nebula with pre-distributed certificates); T1 systems (member servers, cloud workloads, multi-cloud resources) get an overlay with identity-aware access and per-session MFA (Tailscale with Entra OIDC). The realistic T0 node count for a 5,000-person organisation is 15–25 nodes — small enough to manage with a documented certificate ceremony and a spreadsheet, not a full PKI team. The management overlay is what makes remote and hybrid admin work possible without either a traditional VPN's flat-network problem or physical-presence-only access.
 - **Phishing-resistant MFA only** for admins — FIDO2 / passkeys / certificate-based. SMS and push-approve are not admin-grade; they're phishable, and admins are the phishing prize. For the management overlay, this means Tailscale configured with key expiry and an Entra OIDC IdP enforcing FIDO2 — so the WireGuard device trust and a per-session identity assertion are both present, not just the device key.
 - **Separate, cloud-only privileged identities** for cloud admin (the Book II firebreak, enforced here). On-prem admin identity must not be the cloud admin identity.
 - **JIT for everything** via PIM: eligible-not-active, time-boxed, MFA on activation, justification logged, and **approval workflow on the crown roles**.
 - **Conditional Access scoped to admins** — privileged roles usable only from PAWs / compliant devices / named locations.
 **Everything else stays cheap.** Standard RBAC, normal user access, ordinary app permissions — don't pour the privileged-access budget evenly across the whole directory. Concentrate it ferociously on the tiny set of identities that own the control plane. A thousand hardened standard users won't save you if one permanent Domain Admin uses `Password1!` on a Kerberoastable SPN.
 ---
 ## 4. Optionality & recovery — escape hatches, tested
 - **Break-glass done right.** This is the deliberate exception to "no standing privilege" — you *need* an account that works when PIM, MFA infrastructure, or the IdP is down. So it's standing by necessity, which means it is protected differently: cloud-only, phishing-resistant credential stored offline/split, excluded from the CA policy that would otherwise lock it out, and **wired so that any use at all triggers a screaming alert.** Standing privilege you can't remove, you watch like a hawk. And you **test it** — an untested break-glass account is Schrödinger's recovery.
 - **KRBTGT rotation on demand.** Can you rotate KRBTGT (twice, with the required interval) the moment you suspect golden tickets — without taking the forest down? Is it rehearsed? If not, you have a theoretical control, not a real one.
 - **Fast session revocation / admin disable.** A one-move way to kill a compromised admin's sessions and tokens and disable the account, on both sides of the wall. Rehearse it; the breach is not the time to discover the command.
 - **No single human as the only recovery path** — balanced against blast radius. You want enough redundancy that one person under a bus (or under coercion) doesn't end recovery, without so many standing admins that you've recreated the problem. The barbell, again.
 - **Tier 0 / forest rebuild path** — links forward to Book V (Recovery). Know it exists, know it's been tested, know it doesn't secretly depend on a credential that the incident just compromised.
 ---
 ## 5. Stressor — break it on purpose
 - **Pull an admin's standing access and route them through PIM for a week.** Does real work still flow? If JIT activation is too slow or broken, people will route around it — and you'll have found that in a drill instead of discovering the shadow standing-admin account they created in revenge.
 - **Kerberoast yourself.** Run the attack against your own directory. Which service accounts crack? Did anything *detect* the ticket requests? Two findings in one cheap test.
 - **Attempt a tier violation in a test window.** Try to use a Tier 0 credential on a Tier 2 box. Is it blocked? Detected? Silent? Silence is the worst answer and the most common.
 - **Run attack-path analysis as routine, not as a once-a-year pentest.** Tools that map "who can reach Domain Admin / Global Admin in N hops" turn privilege escalation into a number you can track over time. **The count of paths to domain/tenant dominance is a better security metric than any compliance percentage.** Drive it down; watch it not creep back up.
 - **Simulate a malicious consent grant / over-permissioned app.** Register an app requesting a dangerous Graph scope. Does anything flag it? Can you find every existing app holding those scopes today? (You should be able to. Most can't.)
 - **Break-glass drill** — yes, again, and on a schedule. The recurring test in this whole handbook.
 Per Book I principle 6: each of these must yield a **structural** change — a removed right, a severed path, a new alert — not a note that says "be careful."
 ---
 ## Honest uncertainty (the moving parts — verify, don't trust this book)
 Stable and Lindy (teach with confidence): standing privilege is the core risk; the tier / clean-source model; Kerberoasting, pass-the-hash, golden/silver tickets, DCSync; the gMSA pattern; JIT/eligibility as the goal. These don't churn.
 What moves, and what you must verify against current Microsoft documentation:
 - **The management overlay pattern** (covered in §3 above) is stable in principle — the T0/T1 split, the clean-source reasoning for isolating the management plane, the cloud admin VM as the deployable PAW substitute. What moves: the specific tooling. Nebula's CA and ACL model, Tailscale's per-session MFA configuration and OIDC integration, and the Windows 365 / AVD provisioning model all evolve. Verify current implementation guidance before deploying, and confirm Tailscale's key-expiry and IdP enforcement behaviour is still available as described.
 - **PIM capabilities, role definitions, and the risk classification of specific Graph permissions** evolve continually. Confirm which scopes are escalation-grade *today* rather than trusting a 2026 list.
 - **On-prem JIT/PAM tooling is genuinely weaker and more fragmented than the cloud story.** Native time-bound group membership, MIM PAM, and third-party PAM all have trade-offs that shift. Don't promise a client a clean AD-native JIT experience without checking current reality — and be honest that on-prem eligibility is harder than PIM makes cloud look.
 - **gMSA vs dMSA.** gMSA is the established, Lindy answer for managed service accounts. **dMSA** (delegated managed service accounts, introduced with the Windows Server 2025 generation) targets the real gap — migrating a standing service account and disabling the original — but newer mechanisms carry newer attack surface, and there has been published privilege-escalation research against the dMSA migration path. **Verify current patch and hardening guidance before you recommend dMSA**; this is exactly the kind of new-and-shiny that Book I principle 8 warns about. gMSA until you've checked dMSA's current state.
 - **Enterprise Access Model vs the classic three-tier model** — same logic, evolving names and cloud extensions. Use whichever vocabulary the client knows; don't get religious about the label.
 If a client's safety hinges on a current specific, look it up and cite it. "I need to verify the current Graph permission classification" beats confidently quoting a stale one. That posture *is* the independence this handbook is trying to build.
 ---
 ## Consolidated judgement prompts
 - How many identities hold **standing** privilege — human, service account, and service principal — counted, named, and owned? (If you can't produce the number, that's finding #1.)
 - For each privileged credential: leaked tonight, how long is it useful and how far does it reach? Where's the wall?
 - Where are the tier violations? Which high-tier credentials touch low-tier systems? Does any admin's daily laptop reach Tier 0?
 - Which service accounts are Kerberoastable? Which app registrations hold escalation-grade Graph permissions or non-expiring secrets?
 - Are cloud admins cloud-only and phishing-resistant, or synced and push-MFA'd? (Book II firebreak — verify it's actually enforced here.)
 - Does privilege **evaporate when idle** (PIM/JIT) or sit loaded on the table?
 - Is ADCS treated as Tier 0? When was KRBTGT last rotated? Is LAPS deployed?
 - Break-glass: does it exist, is it monitored to scream on use, and when was it last *tested* — not created, tested?
 - How many paths to Domain Admin / Global Admin exist right now, and is that number going up or down?
 - What does an admin use to reach a domain controller remotely — and if that path is compromised, what does the attacker get? Is the management access path independent of the estate it manages?
 - Are privileged credentials ever typed into or stored on a device that is also used for email and browsing? If yes, the session isolation that PAWs are meant to provide does not exist, regardless of what the policy says.
 ---
 *Book III of the Antifragile Handbook. Privilege is blast radius with a clock on it. Shrink the reach, shrink the time, and watch the credentials that can rebuild the world. Move fast and fix things.*
@@ -0,0 +1,172 @@
 # The Antifragile Handbook for M365 & Active Directory
 ## Book IV — Devices & Endpoint (Intune)
 > *The device will be compromised. Compliant is not the same as secure, and the portal toggle is not the same as the device's behaviour. Build for the compromise, not against it.*
 ---
 ## The governing question
 Most endpoint programmes are built on a wish: *make the device trusted.* That wish is unwinnable — a device in a user's hand, on a network you don't control, running an OS you didn't write, will eventually be compromised, and no amount of hardening changes that. So flip the question:
 > **Assume every device is already compromised. What still holds?**
 If the answer is "nothing, because a compromised-but-compliant device gets full access," you've built fragility with a green tick on it. The antifragile endpoint posture stops trying to own the device and instead builds a boundary that **survives an untrusted device**: the data lives behind a wall, the device is cheap and disposable, and "compliant" is treated as what it actually is — a *signal that can be wrong*, not a guarantee.
 That reframe — **compliance is a signal, not a checkbox** — is the spine of this whole book.
 ---
 ## 1. Fragility inventory — where the endpoint betrays you
 ### The fleet is a fiction: managed, unmanaged, shadow, dark
 Before any of the controls below mean anything, confront the foundational lie of endpoint security: **you do not know your fleet.** The whole book so far has said "the managed devices" as if that set is the fleet. It isn't. The managed devices are the part you *chose to count* — and in most estates they're the bigger part only *if you're lucky.* The blast radius lives in everything else.
 The honest spectrum of what touches your data:
 - **Managed** — enrolled (MDM) or app-managed (MAM). The devices you can see and control. The part the programme is about, and the part everyone fixates on.
 - **Known-but-unmanaged** — devices that authenticate and reach data but aren't managed. Entra-registered-but-not-compliant, BYOD that hit OWA or a SharePoint link in a browser. They're in the sign-in logs; they're not under your control.
 - **Shadow** — devices the org never sanctioned but users brought anyway: a personal phone, a contractor's laptop, a home PC pulling files through the web client. Shadow IT at the device layer.
 - **Dark** — access you have *no device-level visibility into at all.* Legacy- protocol sign-ins that bypass Conditional Access and never produce a clean device signal. Long-lived tokens issued once and never re-evaluated. App passwords. Service principals and automation that aren't devices but reach data like one (the "dark matter" of Book III, wearing a different hat). This is the end of the spectrum that should frighten you, because it never trips a sensor.
 And the inventory of record — the CMDB — is almost always **more wish than reality.** It's populated by *process* (someone files a ticket), and process decays the moment attention moves on. The real device population is populated by *behaviour* — what is actually authenticating right now. The gap between those two is precisely your shadow and dark population, and it's invisible exactly where it matters most.
 This is the Book I corollary made flesh: **the inventory is a claim; the sign-in log is the fact.** Stop deriving your fleet from the CMDB (declarative, decaying, wishful) and start deriving it from observed authentication (behavioural, current, honest). You can't manage what you can't see, and you can't see what you decided not to look at.
 The reframe that saves you is the same barbell from §3: the goal is **not** to manage every device — that's impossible, and chasing it is fragile. The goal is (a) to *know the real population* by observation, and (b) to *gate the data* so that an unmanaged or unknown device gets limited, app-contained, or no access. The question was never "is this device managed." It's **"can a device I don't control reach the data, and what happens when it does?"** An unmanaged device forced through an app-protection boundary in a browser session control is contained. An unmanaged device holding a fat client and a never-re-evaluated token is a hole in the wall you didn't know was open.
 ### The compliance signal lies (in both directions)
 "Require compliant device" in Conditional Access is the real control. But the compliance signal underneath it is softer than the toggle suggests:
 - **It's stale.** Compliance is evaluated on a check-in cadence, not continuously. There's a window where a device falls out of compliance — gets rooted, drops encryption, falls behind on patches — and still carries a "compliant" state and a valid token. The signal lags reality.
 - **It's spoofable.** Root/jailbreak detection is an arms race, not a wall. A motivated attacker (or a determined user with a YouTube tutorial) steps over the tripwire. Treat detection as a tripwire, never as a barrier.
 - **It's shallow.** "Compliant" usually means a handful of boxes — PIN set, encrypted, OS version, not-jailbroken. None of those stop malware running with the user's own token on a device that passed every check.
 - **It fails both ways.** A false *compliant* over-trusts a hostile device. A false *non-compliant* locks a legitimate user out at the worst possible moment — and anyone who's run endpoint at scale has watched a flaky signal brick access for someone important mid-flight. Both failure modes are real; design for both.
 ### The ghost policy: displayed config ≠ enforced config
 This one is field-earned and genuinely frightening, because it defeats every form of inspection there is. A Conditional Access policy can show a **perfectly correct configuration in the portal** — every condition, assignment, and grant exactly as intended — and yet **never enforce anything.** The backend state has desynced or corrupted; the object you're *looking at* is not the object being *evaluated*. Recreating the policy from scratch with byte-identical parameters restores enforcement. Nothing in the displayed config ever told you it was broken.
 Sit with what that means. A config review passes. An export-and-diff passes. A CIS audit ticks it green. Every parameter is "correct." And the control is doing nothing — a CA policy that **fails open, silently.** This is the worst failure on the convexity axis: the control you trusted to be convex (fails safe, blocks a class) is quietly behaving concave (fails open, protects nothing), and *no artefact you can read reveals it.* A benchmark cannot catch this. It is invisible to inspection by construction.
 There is exactly one thing on earth that detects it: **observed enforcement under test.** This is not an edge case to file away — it is the single hardest piece of evidence for why the entire stressor discipline in this handbook exists. The iron rule that follows (and it is non-negotiable):
 > **A CA policy's displayed configuration is a claim, never proof. The only proof is a real sign-in producing the expected outcome. Define the expected results *before* you build or change the policy, and test against them every time.**
 Concretely: for the users and conditions that matter, write down the required outcome first — *user X, condition Y → MUST be blocked / granted / MFA-prompted* — so you're testing against a pre-committed expectation, not rationalising whatever you observe. Use the What If tool as a first pass, but understand its limit: What If evaluates the *configuration logic*, so it will happily tell you a ghost policy "applies" while the live evaluator ignores it. **Only a real authentication attempt is proof.** And when behaviour and config disagree, **recreate the policy from scratch — do not re-edit it**, because editing a corrupt object can carry the corruption forward. Re-test after tenant-level changes too, not just after policy edits; the desync can appear without you having touched the policy at all.
 ### The join-state coupling (Book II reaches the desktop)
 Entra hybrid join drags the Book II fragility down to the device: the device identity now depends on on-prem AD, the SCP, the sync, and line-of-sight to a DC for some flows. It's the device-layer version of "one organism, two badges," and it exists almost entirely to service legacy app/auth dependencies. Pure Entra join + Intune is the cloud-native path that severs that coupling.
 ### The PRT is the device's golden ticket
 The Primary Refresh Token on a managed device is its key to seamless cloud SSO. A compromised endpoint with a live PRT is a serious blast-radius problem. TPM binding (the session key sealed in hardware) is what raises the cost of stealing it — so "is the PRT TPM-bound?" is a real question, not a checkbox.
 ### MAM / App Protection is a *porous* boundary
 Managing the data layer without owning the device (MAM-WE / App Protection Policies) is the right idea — wall the data, don't try to own a personal phone. But the wall has seams, and the data leaks through them: the OS share sheet, copy/paste where it isn't blocked, screenshots, "open in unmanaged app," local save paths, backups and cloud sync, and unmanaged browsers. A **"Block" in the policy is a claim, not a guarantee** — there are documented cases where the data goes out a path the policy was supposed to close. And enforcement is **not symmetric across iOS and Android**: different OS capabilities, different companion app requirements, different gaps that shift release to release. Never assume parity, and never trust the toggle without watching the device.
 ### Enrollment is a trust-establishment moment
 Autopilot and enrollment are when a device becomes "trusted." That makes the enrollment path — tokens, the Autopilot device list, enrollment restrictions — a target: hijack it and you enrol a hostile device as a friend. Most programmes harden the device after enrollment and never look hard at the enrollment trust itself.
 ### The legacy and standing-privilege drag
 - **GPO + co-management overlap** — on-prem-coupled config (Book II again), conflicts with Intune, and a migration most estates have half-finished for years.
 - **Standing local admin** on endpoints — the device-layer version of Book III's original sin; one cracked local admin path = lateral movement.
 - **Legacy auth that bypasses CA entirely** — the device controls are irrelevant on a protocol that never consults Conditional Access.
 ### Patch velocity, and its evil twin
 A fleet you can patch in 24 hours is antifragile; one that takes six weeks of change control is fragile, and the attackers know your patch latency better than you do. But the *opposite* failure is just as real: a fast push to **everything at once** with no staging is how a single bad update bricks an entire fleet — the 2024 CrowdStrike mass-BSOD event was exactly this, a security vendor's own update shipped fast to everyone with no canary. Velocity without an escape hatch is concave (see §4).
 ---
 ## 2. Via negativa — what to remove
 1. **Go cloud-native.** Move to Entra join + Intune + Autopilot and retire hybrid join, domain join, and GPO wherever the legacy dependency can actually be killed. This severs the Book II coupling at the device layer and deletes a whole class of "the desktop broke because the DC/sync/SCP did" failures.
 2. **Stop trying to trust the device.** This is a *deletion* — stop pouring effort into making BYOD a trusted device. Wall the data instead (MAM/App Protection) and treat the device as untrusted by default. Subtracting the impossible goal is the move.
 3. **Remove data from the endpoint.** If the data lives in managed apps and the cloud, there's less on the device to leak or lose. Shrink the local footprint and the compromise gets cheaper to absorb.
 4. **Remove standing local admin.** JIT elevation (Endpoint Privilege Management) instead — Book III's "shrink the time" at the desktop.
 5. **Kill legacy auth and the protocols that bypass CA.** A device control you can route around isn't a control.
 6. **Prune the cruft** — conflicting/duplicate config profiles, dead enrollment profiles, stale Autopilot registrations, orphaned compliance policies nobody can explain. Each one is drift waiting to surprise you.
 ---
 ## 3. The barbell — cheap devices, protected boundary
 **The device is cattle, not a pet.** This is the central barbell of the book. A lost, stolen, or compromised endpoint should be a **shrug**: selective-wipe the corporate data (BYOD) or full-wipe and re-provision via Autopilot in about an hour (corporate). If losing a laptop is a crisis, you've made the device irreplaceable — which means you protected the wrong thing.
 **Protect the irreplaceable boundary instead:**
 - **The access decision** — Conditional Access. This is the convex control of the endpoint world (Book I): one well-built policy blocks whole classes of attack, cheaply. It is also one of the few things that can brick an entire tenant if misconfigured, so it gets paranoid change discipline (§4).
 - **The data boundary** — the managed-app container / App Protection policy set, tested at the seams (§5), not trusted at the toggle.
 - **The PRT and enrollment trust** — TPM-bound credentials, hardened enrollment restrictions, device-bound phishing-resistant auth (links Book III).
 **Don't gold-plate the disposable.** Spending weeks locking down a kiosk's wallpaper policy while the CA policy set has a legacy-auth hole is the endpoint version of even-spreading. Concentrate on the decision and the data wall.
 ---
 ## 4. Optionality & recovery — escape hatches, tested
 - **Wipe-and-reprovision as the recovery primitive.** Autopilot makes the device replaceable; *that* is your endpoint recovery plan. But "replaceable in an hour" is a slide claim until you've timed it on a real device. Drill it.
 - **Selective wipe for BYOD** — the clean escape hatch that pulls corporate data without touching the user's photos. The thing that makes MAM politically survivable.
 - **Update rings and canaries — velocity *with* a brake.** The answer to the CrowdStrike failure mode isn't "patch slowly," it's "patch fast through rings with a real canary, and keep the ability to **halt or roll back** a bad push before it reaches everyone." Fast *and* reversible. This is the barbell and optionality fused: speed on the upside, a bounded blast radius on the downside.
 - **Break-glass exclusion from device requirements.** A flaky compliance signal must never lock out recovery. The break-glass accounts (Book I/III) sit outside the "require compliant device" gate — and that exclusion is monitored, not forgotten.
 - **Fast device-trust revocation.** A one-move way to disable a device, revoke its tokens, and drop it from CA trust. Rehearse it.
 - **Continuous Access Evaluation** is the mechanism shrinking the stale-token window — near-real-time response to critical events instead of waiting for token expiry. It narrows §1's "the signal is stale" gap. Coverage is not universal across every app and flow (verify current state, §honest uncertainty).
 ---
 ## 5. Stressor — break it on purpose
 This domain rewards hands-on stress more than any other, because the gap between *policy* and *behaviour* only shows up on a real device.
 - **Reconcile the four lists and hunt the deltas.** Pull Intune-enrolled devices, Entra-registered devices, devices appearing in sign-in logs, and the CMDB. None of them will agree. The **disagreements are the findings**: devices authenticating that nobody manages, CMDB entries that never sign in, registered devices that fell out of management. Then go further — count legacy-auth sign-ins and long-lived sessions (the dark end), and run network device discovery for the unmanaged things on the wire. The size of the gap between "the fleet we think we have" and "the population actually touching data" is one of the most honest metrics you can put in a report.
 - **Attack your own MAM boundary, per platform.** Try to get corporate data out through every seam: share sheet, copy/paste, screenshot, save-as-local, open-in- unmanaged-app, backup/sync, an unmanaged browser. Find where "Block" doesn't actually block. Do it **separately on iOS and Android** — they will not behave the same, and the difference is the finding. (When you find a gap that survives reinstall and reset, that's an escalation to the vendor, not a config you missed.)
 - **Spoof the compliance signal.** Root/jailbreak a test device. Is it caught? How long until the signal flips and CA reacts? That latency is your real exposure window.
 - **Prove every CA policy actually enforces.** Never sign off a policy on its displayed config. With expected results written down beforehand, drive real sign-ins for each user/condition that matters and confirm the *observed* outcome matches. Treat What If as a hint, not proof. If a policy that looks correct doesn't enforce, recreate it from scratch rather than editing — the displayed object and the evaluated object can diverge silently, and a ghost policy fails open without ever telling you.
 - **Lock yourself out on purpose.** In report-only mode, simulate a false non-compliant on a privileged user. Watch the CA decision. Confirm break-glass sails through. Better to find the lockout in a drill than during an outage.
 - **Push a deliberately bad config/update to the canary ring.** Confirm the ring *contains* it and that halt/rollback works. An untested canary is just the first domino with a friendly name.
 - **Time a wipe-and-reprovision.** Is the device truly replaceable in an hour, or is that a fiction the recovery plan rests on?
 - **Compromise a test endpoint.** What does its PRT reach? Does EDR detect it? Does the device-risk signal actually flow into CA and revoke access — or does it stop at a dashboard nobody watches?
 Per Book I principle 6: every gap found becomes a **structural** change — a closed seam, a tightened ring, a severed coupling, an escalation raised — not a line in a test log that dies there.
 ---
 ## Honest uncertainty (endpoints are the worst offender — verify on a real device)
 Stable and Lindy (teach with confidence): the device will be compromised; trust the boundary, not the device; cheap-and-reprovisionable beats hardened-and- precious; compliance is a signal; velocity needs a brake. None of that churns.
 What moves — and on the endpoint, it moves *faster and more quietly* than anywhere else in this handbook:
 - **MAM / App Protection enforcement is version-, platform-, and OS-build- dependent, and it has gaps that shift release to release.** iOS and Android are not symmetric and never have been; companion app requirements and managed- browser support change. The portal will tell you a policy is enforced while the device quietly does something else. **The only reliable test is on a real device, on the current OS build, every release** — the documentation and the hardware disagree more than Microsoft likes to admit. If you live anywhere in this handbook, live here.
 - **Continuous Access Evaluation coverage** is expanding but not universal — which apps and flows honour near-real-time revocation changes; verify current coverage before you promise it closes the stale-token window.
 - **Windows LAPS, Endpoint Privilege Management, Autopatch, Smart App Control / WDAC** capabilities and management surfaces all evolve; confirm current state and licensing before recommending.
 - **Cloud-native vs hybrid-join guidance and the GPO→Settings-Catalog migration tooling** keep shifting toward cloud-native; check what's actually supported for the client's app estate before promising the coupling can be cut.
 If a client's safety hinges on a specific enforcement behaviour, **test it on the device and, if needed, cite the current Microsoft doc** — and when the device behaviour contradicts the doc, believe the device. Confident-but-wrong about an endpoint control is how data walks out a seam everyone swore was closed.
 ---
 ## Consolidated judgement prompts
 - If this device is compromised right now, what does the attacker get, how fast do we know, and how fast is it gone? Is the device a shrug or a crisis?
 - Do we know our *real* device population — derived from what's authenticating — or are we trusting a CMDB that's more wish than reality? How big is the gap between managed, known-unmanaged, shadow, and dark? What dark access bypasses CA entirely?
 - Is "compliant" being treated as a guarantee or as a signal that can be stale, spoofed, or shallow? What happens when it's wrong — in *both* directions?
 - Is the boundary the data (MAM/CA) or the device? Have we tested the data wall at every seam, on every platform, on the current OS build — or just toggled it?
 - Are devices hybrid-joined out of genuine need, or out of habit? What would it take to go cloud-native and cut the Book II coupling?
 - Can we patch the fleet fast — and can we *halt* a bad push before it reaches everyone? Do we have rings and a real canary, or hope?
 - Is the PRT TPM-bound? Is enrollment trust hardened, or can a hostile device enrol as a friend?
 - Does standing local admin still exist? Does legacy auth still bypass CA?
 - For every CA policy that matters: has it been proven to enforce by a *real sign-in* against pre-written expected results — or are we trusting the displayed config of a policy that might be a ghost?
 - Has anyone timed a wipe-and-reprovision, tested break-glass against the device gate, or watched the device-risk signal actually reach a CA decision?
 ---
 *Book IV of the Antifragile Handbook. Stop defending the device; assume it's already lost and build the boundary that survives it. Trust the device behaviour over the portal toggle, every time. Move fast and fix things.*
@@ -0,0 +1,140 @@
 # The Antifragile Handbook for M365 & Active Directory
 ## Book V — Data & Collaboration (Exchange, SharePoint, Teams, OneDrive)
 > *Data is liquid. It leaves where you put it — copied, shared, forwarded, synced, linked. The question is never "is it locked down" but "where can it flow, who can reshare it, and can you see and reverse the flow?"*
 ---
 ## The governing question
 Books II–IV protected the *containers*: identity, privilege, devices. This book is about the *contents*, and contents obey a different physics. You can perfectly secure a container and still lose the data, because data doesn't stay put — it's duplicated into an email, dropped in a Team, synced to a laptop, handed to a guest who reshares it to someone you've never heard of. Perimeter thinking dies here.
 > **Every share is a copy of your blast radius handed to a party you don't control. Can you see where it went, and can you pull it back?**
 For most estates the honest answers are "no" and "no": nobody can enumerate the external shares, nobody reviews the guests, and a file shared to "Anyone with the link" three years ago is still reachable by anyone who ever held that link.
 ---
 ## 1. Fragility inventory — how data leaks
 ### "Anyone" links: bearer tokens for your data
 Anonymous "Anyone with the link" sharing in SharePoint/OneDrive is the single largest data-exposure fragility in M365. A link is a **bearer token** — whoever holds it has access, no identity, no MFA, no device check, often no expiry, and it's forwardable. Its blast radius is everyone the link ever reaches, forever, including the open web if it leaks into an email thread or a crawler. Conditional Access, compliant devices, all of Books II–IV — none of it applies to a bearer link. It's a hole punched clean through every wall you built.
 ### Reshare, and the chain you can't see
 Once data is shared — especially externally — the recipient can usually reshare, download, and copy it. You've handed your blast radius to an org (or a personal account) whose security posture you don't control and can't observe. Guests reshare to other guests. The chain of custody becomes invisible after the first hop. And the controls that govern this in Teams collaboration are **split across several layers** — Teams policy, SharePoint org- and site-level sharing, OneDrive, tenant sharing settings, and B2B/cross-tenant access — that interact in non-obvious ways and don't always agree. (More in §honest uncertainty; this is a place where the policy matrix and the observed behaviour routinely diverge.)
 ### Guest sprawl: standing blast radius at the data layer
 Guests accumulate and nobody prunes them. The guest invited for one project in 2022 still has a foothold. Each is an external identity governed by *their* security, not yours — the data-layer cousin of standing privilege (Book III) and shadow devices (Book IV). Unreviewed guest access is a slowly metastasising external attack surface, and most tenants cannot even produce the list of who has it and to what.
 ### Email: the oldest, most Lindy exfil channel
 Auto-forwarding rules are the classic business-email-compromise move — a quiet hidden rule that copies all mail to an external address, persistent and invisible. Add attachment-save paths that escape policy, and mail remains the most reliable way data walks out the door. External auto-forward should be off by default, and its presence should scream.
 ### The hybrid Exchange anchor (Book II at the data layer)
 An on-prem Exchange server is a Tier-0-adjacent liability — historically one of the most catastrophic on-prem attack surfaces, where mailbox/management permissions can escalate toward AD. Hybrid Exchange drags that liability into the estate, and subtle functionality dependencies keep the last server alive long past its welcome. The via-negativa prize is decommissioning on-prem Exchange entirely (§2) — verify the current management/recipient tooling first.
 ### Internal oversharing
 External isn't the only blast radius. "Everyone," "All company," and "Everyone except external users" permissions on a site holding HR, finance, or M&A data mean one compromised *internal* account reaches it all. Default-open SharePoint sites and self-service site creation produce internal data sprawl that no one maps.
 ### Collaboration sprawl by design
 Every Team spins up a SharePoint site, an M365 group, a mailbox, and more — each with its own sharing and guest settings, each a potential leak. Self-service creation means ungoverned proliferation of data containers, and collaboration tools carry subtle data-visibility behaviours (who sees what history, what a late joiner can read) that surprise even experts. Sprawl nobody inventories is fragility nobody can see.
 ### Illicit OAuth consent: data exfil through a "legitimate" app
 A user clicks OK on an app requesting `Mail.Read` or `Files.Read.All`, and now a third party reads tenant data through a sanctioned-looking grant. This is the data-layer face of Book III's app-registration dark matter — exfil that needs no malware and trips no device control.
 ### Retention as hoarded blast radius
 Keeping everything forever makes every breach maximal: the attacker gets fifteen years of data instead of one. Over-retention is hoarding fragility — every byte you keep is a byte that can be stolen. (Its opposite, no recoverable copy at all, is Book VI's problem. The art is disposing of what you don't need while protecting what you do.)
 ---
 ## 2. Via negativa — what to remove
 1. **Kill anonymous "Anyone" links.** Default external sharing to authenticated, time-limited, least-permission (view, not edit). Remove the bearer token from your data entirely where you can.
 2. **Decommission on-prem Exchange.** Remove the Tier-0-adjacent liability; get off hybrid Exchange where the dependency can actually be cut (verify current tooling — §honest uncertainty).
 3. **Block external auto-forwarding by default.** Delete the quietest exfil channel there is.
 4. **Prune guests ruthlessly.** Access reviews, expiration, entitlement management. Stale external access gets removed, and new guest access expires by default. Treat guest sprawl like standing privilege: minimise and time-box it.
 5. **Minimise retention.** Dispose of stale data on a schedule. Shrink the prize so every breach is smaller. Data you no longer hold cannot be exfiltrated.
 6. **Remove broad internal shares** ("All company"/"Everyone") from anything sensitive. Sensitive data should live in *few, known* places with *narrow* access.
 7. **Govern self-service creation and clean up the dead.** Curb ungoverned Team/ site/app creation; archive and delete orphaned, inactive containers.
 8. **Restrict user consent and revoke illicit grants.** Users shouldn't be able to hand tenant data to arbitrary apps; admin-consent workflow for anything sensitive, and sweep out the over-permissioned grants already there.
 ---
 ## 3. The barbell — find the crown jewels, free the rest
 **Name the crown jewels.** Which handful of data sets — the IP, the regulated data, the executive and M&A comms, the source of the company's value — would, if leaked, actually end the business? Most organisations cannot name them, and *that inability is finding #1.* You cannot protect asymmetrically until you know what the asymmetry is for.
 **Paranoid protection for the crown jewels:**
 - **Sensitivity labels with encryption that travels with the file.** This is the convex control of the data world (Book I, principle 7): one label protects the file *everywhere it goes*, forever — even after it leaves the tenant, lands on an unmanaged device, or is forwarded to a stranger. The protection is bound to the data, not the container. That's the only thing that survives data's liquidity.
 - **Restricted sites, no external sharing, tight access with recurring reviews.**
 - **Conditional Access app control / session controls** — browser-only, block-download for sensitive data on unmanaged devices (the Book IV boundary applied to content).
 - **Heightened monitoring** on crown-jewel access (feeds Book VI).
 **Free everything else.** Most collaboration data is low value and should flow *fast* — velocity is a feature (Book I creed). Don't lock the lunch-menu SharePoint with M&A-vault rigour. Spreading DLP and restriction evenly across all data is the concave failure: enormous maintenance, false positives that train users to click through, and the real exfil lost in the noise. **DLP is a scalpel for known high-value patterns (card numbers, national IDs, the labelled crown jewels), not a dragnet over everything.**
 ---
 ## 4. Optionality & recovery — escape hatches, tested
 - **The label *is* the escape hatch.** Because encryption travels with the file, a leaked crown-jewel document is still encrypted wherever it lands — you pre-paid for the data to survive being stolen. That is optionality bound into the byte.
 - **Fast share revocation.** Can you, in 30 minutes, enumerate and *kill* every external share and anonymous link? If you can't produce the list, you can't pull it back — build the report and the revocation muscle before you need them.
 - **Audit and content forensics — switched on and retained.** "Who accessed and downloaded what" is your post-incident truth, but only if audit logging is actually enabled and retained long enough to matter. Verify it's on; don't assume (§honest uncertainty).
 - **Guest access reviews as recurring pruning** — the recovery loop for sprawl.
 - **Immutable/held copies of crown-jewel data** — the bridge to Book VI backup.
 ---
 ## 5. Stressor — break it on purpose
 - **Exfiltrate a labelled crown-jewel file yourself.** Email it externally, share it anonymously, download it through CAA session control, open it on an unmanaged device. Does the label encryption hold? Does DLP fire? Does anything alert? You are testing the *behaviour*, not the policy screen (Book I corollary).
 - **Plant a canary document** seeded with a detectable pattern and try to move it out every way you can. What catches it? What doesn't?
 - **Enumerate the external surface.** Produce the full list of "Anyone" links, external guests, and externally-shared files. The exercise of *trying* usually reveals you can't — which is the finding.
 - **Simulate the BEC forward rule.** Set a test external auto-forward. Is it blocked? Alerted? Silent? Silence is the BEC attacker's favourite answer.
 - **Test the reshare chain.** Share to a test guest, have them reshare onward. Can you see it? Stop it? Pull it back?
 - **Reconcile declared vs enforced sharing.** The tenant sharing setting says one thing; walk the actual per-site and per-link reality. They diverge — the ghost-policy cousin from Book IV, at the data layer.
 Per Book I principle 6: every leak path found becomes a **structural** change — a killed link type, a pruned guest population, a label applied, a coupling removed — not a note in a spreadsheet.
 ---
 ## Honest uncertainty (the sharing matrix moves — test, don't trust it)
 Stable and Lindy (teach with confidence): data is liquid; bearer links are exposure; protection must travel with the data; minimise the prize; DLP is a scalpel not a dragnet; guests are standing blast radius. None of that churns.
 What moves, and what you must verify by testing rather than reading:
 - **External sharing enforcement is split across many interacting layers** — Teams policy, SharePoint org/site sharing, OneDrive, tenant settings, B2B/cross-tenant access, and the Premium tiers — and they don't always agree. Enforcement can differ by client and platform, and the documented matrix and the observed behaviour diverge often enough that you should **confirm the real behaviour on a real client, not from the policy screen.** When you find an inconsistency that survives reconfiguration, that's a vendor escalation, not your error.
 - **On-prem Exchange decommissioning** and the "last server for management" story — the tooling has evolved; verify the current supported path before promising the coupling can be cut.
 - **Purview / sensitivity labels / auto-labelling / DLP** capabilities churn fast, including the branding. Verify current coverage and licensing.
 - **Cross-tenant access settings (B2B collaboration and direct connect)** are comparatively new and evolving — verify current behaviour.
 - **Audit log retention defaults and licensing have changed over time.** Confirm what's actually captured and for how long *before* you rely on it for forensics.
 If a client's safety hinges on a specific sharing behaviour, test it on a live client and cite the current doc — and where the client behaviour contradicts the doc, believe the client.
 ---
 ## Consolidated judgement prompts
 - Can we name the crown jewels? If not, that's finding #1 — everything else is guesswork until we can.
 - Can we enumerate every external share, anonymous link, and guest *right now*? Can we revoke them fast?
 - Does protection travel *with* the crown-jewel data (labels/encryption), or only with the container it currently sits in?
 - Where can this data flow — reshare, forward, sync, download, OAuth app — and is any of that flow visible or reversible?
 - Are guests treated as standing blast radius (minimised, time-boxed, reviewed) or left to accumulate?
 - Is DLP a scalpel on known high-value patterns, or a dragnet generating noise everyone clicks through?
 - Is on-prem Exchange still anchoring the estate? What would it take to cut it?
 - Is audit logging actually on and retained long enough to reconstruct an incident?
 - Does the tenant's *declared* sharing posture match what the sites and links *actually* enforce?
 ---
 *Book V of the Antifragile Handbook. You cannot wall in a liquid. Name the few things that would end the company, bind protection to the data itself, shrink the prize, and make every flow visible and reversible. Move fast and fix things.*
@@ -0,0 +1,154 @@
 # The Antifragile Handbook for M365 & Active Directory
 ## Book VI — Recovery & Detection-as-Feedback
 > *Robust means you survive the shock unchanged. Antifragile means you come back stronger. The shock is coming either way — the only choice is what you do with it.*
 ---
 ## The governing question
 This is the capstone, because it's the book that decides whether everything before it was merely *robust* or genuinely *anti*fragile. The first five books harden the estate; this one builds the machine that turns every shock into improvement. Ask:
 > **When — not if — this fails, do you come back weaker, the same, or stronger?**
 A fragile estate comes back weaker (if at all). A robust estate comes back the same and waits for the next identical hit. An antifragile estate comes back *different and harder to hit the same way twice* — because it ran the shock through a feedback loop and changed its own structure. That loop is the entire subject of this book.
 The reframe that powers it: most organisations treat detection and recovery as the sad afterthought — the thing they hope never to need. Invert it. **Incidents, alerts, failed drills, and near-misses are the most valuable intelligence the system ever produces** — honest, real-world data about where the fragility actually is, bought in the cheapest currency available *if you harvest it.* The org that buries incidents stays fragile. The org that treats them as fuel becomes antifragile. Your job is to build the machine that converts disorder into structural strength.
 ---
 ## 1. Fragility inventory — where recovery and detection rot
 ### Backups that have never been restored
 The biggest recovery lie in the industry: *"we have backups."* Having a backup is not the same as being able to recover, and an untested backup is Schrödinger's recovery — simultaneously fine and worthless until someone actually opens the box. Two M365-specific traps make this worse:
 - **"Microsoft backs it up for us."** Microsoft provides geo-redundancy, recycle bins, and limited native retention — *not* point-in-time backup against your own ransomware, malicious deletion, or retention expiry. Under the shared- responsibility model, **your data is your responsibility.** Most tenants have no real, independent, point-in-time M365 backup, and discover this during the incident.
 - **Attackers target backups first.** Ransomware operators delete or encrypt the backups *before* they hit production, because they know it's your only way out. A backup reachable from the compromised estate is not a backup; it's another victim.
 ### AD forest recovery: the nightmare nobody rehearses
 Recovering a compromised or destroyed AD forest is one of the hardest operations in all of IT — clean OS installs, authoritative restore of one DC per domain, metadata cleanup, double krbtgt reset, trust resets, the whole brutal sequence. Almost no one has practised it. So when ransomware takes AD, "restore from backup" is a multi-day, error-prone, improvised ordeal performed for the first time under maximum pressure. Entra recovery is less apocalyptic but has its own teeth: the hard-delete window for objects, and the fact that tenant *configuration* (CA policies, Intune, roles) has no native "undo" unless you captured it as code.
 ### Recovery that depends on what the incident destroyed
 The fatal circular dependency: backups authenticated by the AD that's down. The recovery runbook stored in the SharePoint that's encrypted. The break-glass that needs the MFA service that's offline. The recovery admin whose credentials the attacker already has. **A recovery path that depends on the thing it's recovering is not a recovery path** — it's the clean-source principle (Book III) applied to survival.
 ### Detection that fires into a void
 Logs not collected. Audit logging never enabled or silently aged out. A SIEM full of alerts nobody triages. And the specific blind spots the earlier books planted: the unmonitored DCSync (Book II), the unwatched break-glass use (Book III), the device-risk signal that dies on a dashboard (Book IV), the BEC forward rule nobody sees (Book V). Detection that nobody acts on is theatre with a subscription fee.
 ### Alert fatigue: the boy who cried wolf, automated
 Too many low-fidelity alerts is itself a fragility — the real signal drowns in noise, and the analyst who's dismissed a thousand false positives dismisses the one that mattered. More alerts is not more security; past a point it's *less.*
 ### MTTR that exists only on paper
 RTO/RPO numbers in a policy document, never once validated by an actual restore, are fiction. (Book I anti-benchmark: MTTR is measured by *doing it*, not by declaring it.)
 ### Incidents that close without changing anything
 The post-incident review that concludes "remind users to be more careful" has wasted the disorder entirely and guaranteed the recurrence. And a blame culture destroys the feedback loop at the source — if surfacing an incident gets you punished, incidents get buried, and the system goes blind.
 ### No known-good to return to
 If your tenant configuration lives only as click-ops in a portal, you have no golden image of "correct," so you can neither rebuild it fast nor detect drift *from* it — and you can't catch a ghost policy (Book I/IV) because you have nothing to diff against. No config-as-code means no known-good.
 ---
 ## 2. Via negativa — what to remove
 1. **Delete the false comfort that Microsoft backs you up.** Removing the dangerous belief comes before adding the real backup.
 2. **Sever recovery's dependencies on the estate it recovers.** Recovery credentials, runbooks, and backups must not depend on prod AD/Entra/SharePoint. Decouple, so the lifeboat doesn't sink with the ship.
 3. **Cut alert noise.** Ruthlessly remove low-fidelity alerts so the high-fidelity ones become visible. Via negativa applied to detection: fewer, louder, truer.
 4. **Remove blame from the post-incident process.** Blameless on people so people surface incidents — then ruthless on structure so the incident actually changes something. Removing the incentive to hide *protects the feedback loop itself.*
 5. **Remove click-ops from critical configuration.** Move control-plane config (CA, Intune, roles) to code, so a known-good exists to rebuild from and diff against.
 ---
 ## 3. The barbell — paranoid recovery for the irreplaceable, best-effort for the rest
 **The irreplaceable few** — the identity control plane (Books II/III) and the crown-jewel data (Book V) — get **real, tested, immutable, offline/isolated backup** and **rehearsed** recovery. AD forest recovery is practised, not theorised. Recovery objectives for these are measured in a drill, in minutes or hours, not asserted in a policy.
 **The recovery capability is itself a crown jewel.** Backups are a top attacker target, so protect them like break-glass: immutable, offline or in a separate trust domain, unreachable even from full domain dominance. A backup the attacker can reach is not a control.
 **Everything else is best-effort and tiered.** Don't gold-plate recovery for the lunch-menu SharePoint. Tier recovery objectives to value — crown jewels get immutable and fast; bulk collaboration gets good-enough. And concentrate **high-fidelity detection** on the control-plane and crown-jewel signals (the screaming break-glass, the anomalous DCSync, the impossible-travel admin, the crown-jewel mass-download) rather than spreading shallow alerting evenly across everything.
 ---
 ## 4. Optionality & recovery — the heart of the book
 - **Tested restores on a schedule.** The only proof of recovery is a restore that happened. Make the restore drill routine, time it, and verify integrity — that time *is* your real MTTR.
 - **Immutable + offline/isolated backups** — the escape hatch that survives the attacker reaching production. Ransomware-resilient by design, not by hope.
 - **Rehearsed AD forest and Entra recovery runbooks, stored independently** — on paper or offline, reachable when the estate is dark, not in the SharePoint that's encrypted.
 - **Configuration-as-code (IaC) for the control plane** — instant rebuild *and* a known-good baseline to detect drift and ghost configuration against. This single practice serves recovery, drift detection, and the Book I corollary at once.
 - **A clean-room / isolated recovery environment** — somewhere to rebuild that the attacker isn't already inside.
 - **The fail-over-vs-clean-in-place decision pre-made.** When do we rebuild rather than try to clean a compromised estate? Decide the criteria *before* the incident; it's the Book II "sever the sync" decision generalised to the whole estate.
 ---
 ## 5. Stressor — the hormesis engine (the climax of the handbook)
 This is where the entire handbook either runs or rusts. Everything else is preparation for the loop; this is the loop turning.
 - **Live restore of a crown-jewel dataset and the control plane.** Not a tabletop — an actual restore, integrity-verified and timed. The number you get is the truth; the number in the policy was always fiction.
 - **Rehearse AD forest recovery.** The first time you perform the hardest recovery in IT must not be during the real disaster. Run it. Find what's missing. Fix the runbook.
 - **Inject attacks end-to-end and follow them all the way through.** DCSync, malicious consent, break-glass use, impossible-travel admin, crown-jewel mass- download. Confirm not just that the alert *exists*, but that it's **triaged, and someone acts.** Detection that fires into a void fails this test on purpose, so you can fix it.
 - **Run a ransomware game-day** that assumes Tier 0 is owned and backups are the first target. Watch your decoupling hold or fail.
 - **Purple-team as routine, not annually.** Standing, escalating, blast-radius- controlled stress — hormesis, not a once-a-year audit ritual.
 - **Measure the loop itself.** Track *time from incident to structural change.* If drills and incidents close without a removed right, a severed coupling, or a new firebreak, the loop is broken and you are merely robust.
 ---
 ## The feedback loop — what makes all six books antifragile
 Name the loop explicitly, because it's the thread that ties the whole handbook together and the thing that converts robustness into antifragility:
 **Detect** (see the stressor) → **Respond** (contain it) → **Recover** (come back) → **Learn structurally** (come back *stronger*) → which feeds back into **Removal and redesign** across every prior book — a fragilizer deleted (Book I via negativa), a coupling severed (Book II), a standing privilege collapsed (Book III), a device boundary tightened (Book IV), a data flow closed (Book V).
 The first three steps are robustness; plenty of organisations reach them and call it security. **The fourth step is the whole game.** A shock that produces no structural change has been wasted, and the system will meet the same shock again, unchanged. A shock that *does* produce structural change has made the estate stronger — which is the literal definition of antifragile, and the only honest justification for everything in this handbook.
 ---
 ## Honest uncertainty (verify the moving parts)
 Stable and Lindy (teach with confidence): untested backup is no backup; attackers hit backups first; recovery must not depend on what it recovers; detection without action is theatre; alert fatigue is fragility; every shock must change the structure. None of that churns — these are the oldest truths in operational security.
 What moves, and what you must verify:
 - **M365 native backup/retention specifics and the shared-responsibility boundary** — what Microsoft does and does not cover, recycle-bin and hard-delete windows — evolve. Verify current reality, and **test what you can actually recover** rather than trusting either "Microsoft has us covered" or a vendor pitch.
 - **Entra recovery and configuration-backup tooling** (deleted-object windows, Graph/IaC options for capturing CA, Intune, and roles as code) evolve — verify current capability.
 - **AD forest recovery** is Lindy in principle (it is brutal; rehearse it), but automation and tooling evolve — confirm the current supported procedure.
 - **Detection tooling** (the XDR/SIEM signal catalogue) churns continuously. Verify which detections exist *today* and test them end-to-end; the principle (high-fidelity over noise, tested through to action) is what's permanent.
 - **Audit log retention and licensing** have changed over time — confirm what's captured and for how long *before* relying on it for forensics.
 If recovery hinges on a current specific, verify it and test it. "We confirmed the restore works and it takes four hours" beats any RTO ever written in a policy.
 ---
 ## Consolidated judgement prompts
 - When this fails, do we come back weaker, the same, or stronger? What's the mechanism that makes it *stronger*?
 - When was a backup of the crown jewels and the control plane last *restored* — not taken, restored — and how long did it take?
 - Are the backups reachable from the estate they protect? (If yes, they're another victim.) Are they immutable and offline?
 - Has anyone ever rehearsed AD forest recovery? Is the runbook reachable when the estate is dark?
 - Does any part of the recovery path depend on the thing the incident destroyed — credentials, runbook location, MFA, the recovery admin?
 - Does detection fire into action, or into a void? Is there so much noise the real signal is lost?
 - Does control-plane config exist as code (a known-good to rebuild and diff against), or only as click-ops?
 - For the last three incidents and drills: what *structural* thing changed? If the answer is "a reminder," the loop is broken.
 - How long from incident to structural change — and is that time getting shorter?
 ---
 ## Coda — the whole arc
 Six books, one idea. Book I is the **lens**: subtract before you add, protect the irreplaceable, measure blast radius, buy optionality, stress on purpose, and make every shock change the structure — verifying by observation, never by inspection. Books II–V apply that lens to the **containers and contents**: the identity bridge made a firebreak, privilege collapsed in reach and time, the device assumed hostile and the boundary moved to the data, and the data itself made to carry its own protection as it flows. Book VI is the **loop** that makes it all antifragile rather than merely robust — the machine that feeds every incident back into removal and redesign.
 None of this is a checklist, and if a consultant trained on it ever reaches for "because the benchmark says so," they've missed the point. The point is judgement: draw the wall, find the fragility, fix what matters, and let every stress make the estate stronger than it was.
 Move fast and fix things.
 ---
 *Book VI of the Antifragile Handbook, and the close of the arc.*
@@ -0,0 +1,203 @@
 # The Antifragile Handbook for M365 & Active Directory
 ## Book VII — Vulnerability Management
 > *The patch cycle was built for a world where you had weeks. That world is gone. Exploitation now arrives in hours, the patch arrives in days, and no amount of "patch faster" closes a gap that runs the wrong way by two orders of magnitude. Stop racing the attacker to the patch. Change the race.*
 ---
 ## The governing question
 The first six books were written for a world in which the dominant way into an estate was a person — phished, tricked, talked past the controls. That assumption is now wrong. As of the 2026 Verizon DBIR, **exploitation of vulnerabilities is the leading initial-access vector in confirmed breaches — roughly twice phishing, for the first time in the report's history.** The front door changed. This book changes the lens to match.
 The governing question is the same as everywhere else in the handbook, pointed at the vulnerability surface:
 > **When — not if — a vulnerability on your estate is exploited, does the estate come back weaker, the same, or stronger?**
 A fragile estate treats every CVE as a race it has already lost and patches by score until the analyst burns out. A robust estate patches the important ones fast and survives. An antifragile estate **stops treating the vulnerability list as the unit of work at all** — it asks where the vulnerability sits on the kill chain, removes the false urgency that hides the real targets, contains the few that matter in hours, and feeds every exploited path back into architecture so the *next* vulnerability on that path is a non-event.
 The reframe that powers the book: **you cannot win a speed race against machine-speed exploitation by moving your humans faster, and you do not have to.** The winning move is not to patch the long tail before the attacker reaches it — that is arithmetically impossible and getting worse. The winning move is to make most vulnerabilities not matter (blast-radius and reachability), contain the few that do in the time you actually have (hours, not weeks), and convert every near-miss into a permanently shorter kill chain.
 ---
 ## Why the old model is finished — the arithmetic
 Four numbers end the debate, and they are worth saying out loud to a client in a room:
 - **Time-to-exploit has collapsed** from a median of 771 days in 2018 to roughly **4 hours** by 2024. The window the entire patch-management model was built around — the weeks between disclosure and exploitation — has effectively closed.
 - **Patching still takes weeks.** The 2026 DBIR puts median remediation of edge-device vulnerabilities at **43 days**, with only **54% remediated within a year.** 43 days versus 4 hours is the whole story.
 - **Volume has gone vertical.** ~59,000 new CVEs were projected for 2025, a ~50% year-on-year increase, and 2026 is on pace to exceed it. The enrichment infrastructure has buckled under the load — NIST reclassified ~29,000 backlogged CVEs to "Not Scheduled," meaning the data you relied on to prioritise is arriving late or never.
 - **Exploitation is being automated.** Autonomous exploitation research has demonstrated AI systems exploiting 174 of 178 CISA Known-Exploited Vulnerabilities at an average of ~21 minutes each, with no human in the loop, and an ~87% success rate against one-day vulnerabilities in real software. The attacker side automates faster than the defender side because generating a working exploit for a known bug is a clean, verifiable, deterministic problem — exactly what machines are good at — while *defending* requires environmental context, which is exactly what they have historically been bad at.
 The honest conclusion: **a human-paced, score-sorted patch programme is now structurally incapable of keeping pace.** This is not a maturity problem to be solved with more analysts. It is a model that has run out of road. Everything below is the replacement.
 One piece of good news hides in the data, and the whole framework leans on it: **roughly 90% of "critical" vulnerabilities are not actually exploitable in a given environment once compensating controls, reachability, and segmentation are properly mapped.** The fragility is not that you have 40,000 criticals. It is that you cannot yet tell which ~10% are real, so you treat all 40,000 as equally urgent and drown. Antifragile vulnerability management is, before anything else, the discipline of removing the 90% of false urgency so the real targets become visible.
 ---
 ## 1. Fragility inventory — where vulnerability management rots
 ### CVSS as the prioritisation engine
 The original sin. CVSS scores *severity in the abstract* — it knows nothing about whether the vulnerable asset is internet-reachable, whether it sits on the kill chain, whether an exploit exists, or whether an existing control already neutralises it. A 9.8 on a segmented, non-privileged, unreachable host is noise; a 7.5 on an internet-facing box one hop from a domain controller is a P0. Sorting 40,000 findings by CVSS produces a list that is precisely uncorrelated with where the attacker will actually go. It feels like prioritisation. It is sorting by the wrong key.
 ### The infinite, undifferentiated backlog
 "We have 40,000 criticals" is not a vulnerability problem; it is a *triage* problem wearing a vulnerability costume. An undifferentiated backlog has no front — every item looks equally urgent and equally hopeless — so the team either patches by score (wrong key) or freezes. The backlog grows faster than any human process can drain it, which means a backlog-draining strategy is a strategy to fall behind forever.
 ### Patch velocity treated as the only lever
 The reflex when the AI-exploitation story lands is "we need to patch faster." It is the wrong reflex, and it is the most expensive one. You cannot out-patch a 4-hour exploitation window with a 43-day cycle by trimming the cycle to 30 days. Velocity is a real lever for the long tail, but as the *primary* response to the speed problem it is a fragilizing illusion — it consumes the entire budget defending a race you mathematically cannot win, and leaves nothing for the moves that actually change the outcome (reachability, blast radius, containment, architecture).
 ### The half-done remediation — the ghost patch
 Book I's ghost-policy corollary, applied to vulnerabilities. A patch deployed to 80% of the fleet, a compensating rule applied but never verified to actually block, a "remediated" ticket closed against a host that quietly rolled back — these are *worse* than an open finding, because the open finding is at least honest. A remediation that displays as done while enforcing nothing is a vulnerability with a clean bill of health. **A vulnerability that is partly fixed is not partly safe; it is fully exploitable and now invisible.**
 ### The unscanned and the unscannable
 You cannot prioritise what you cannot see. The fleet you don't scan (Book IV's shadow and dark device populations), the appliance whose firmware no scanner reads, the SaaS you don't own, the dependency buried three layers into a container image — these are the dangerous quanta precisely because they carry no score at all. An estate that congratulates itself on draining the *known* backlog while the unknown surface grows is optimising the lit area under the streetlight.
 ### Reachability and compensating controls left unmapped
 If you have not mapped which assets are internet-reachable, which sit behind a WAF or EDR, which are segmented away from the crown jewels, then you have no way to perform the one subtraction that matters — collapsing 40,000 criticals to the ~10% that are genuinely exploitable here. Without reachability and control context, every finding is theoretically critical and therefore practically un-prioritisable.
 ### Remediation as the silent bottleneck
 Detection is largely solved — most teams are *drowning* in findings, not short of them. The bottleneck is everything after: triage, ownership, change windows, approvals, deployment, verification. Each human handoff in that chain costs hours or days, and there are usually five or six of them. In a world of 4-hour exploitation, a six-handoff remediation pipeline *is* the vulnerability.
 ### Detection without a feedback path to architecture
 A vuln gets exploited (or nearly), it gets patched, the ticket closes, and the *path* the attacker used — the flat segment, the over-privileged service account, the reachable management interface — stays exactly as it was, waiting for the next CVE to land on it. The incident produced a patch but no structural change. The disorder was wasted. This is the Book VI failure mode pointed at the vulnerability layer, and it is the difference between a programme that gets stronger and one that runs in place forever.
 ---
 ## 2. Via negativa — what to remove
 The defining act of antifragile vulnerability management is **subtraction before addition.** You remove false urgency, false comfort, and false work before you add a single new tool.
 1. **Remove CVSS as the sort key.** It does not go away — it stays as one input — but it stops being the thing that orders the queue. The queue is ordered by kill-chain position and exploitability in *this* environment.
 2. **Remove the ~90% of criticals that aren't exploitable here.** Map reachability and compensating controls and *delete the false urgency* on everything segmented, unreachable, or already neutralised. This is the single highest-leverage move in the entire programme: it turns "40,000 criticals" into "400 that are real and 40 that are on fire," and it is pure subtraction.
 3. **Remove the undifferentiated backlog.** A backlog with no structure is itself a fragility. Replace it with quanta (Section 3) — time-budgeted, atomic, completable units. An item that cannot be placed in a quantum is either not real (delete it) or not yet understood (route it to discovery).
 4. **Remove "patch faster" as the headline strategy.** Demote velocity to what it is — a lever for the long tail — and stop letting it consume the budget that belongs to reachability, blast radius, and containment.
 5. **Remove the half-done remediation from the "done" column.** A fix is not done until it is *verified to enforce* against a real test, not until the ticket is closed. Every quantum closes with a signal or it does not close. (Book I: validate by observation, never by inspection.)
 6. **Remove human handoffs from the hours-lane.** The steps in the critical-quantum pipeline that require no judgement — detection, reachability assessment, work-item generation, routing — get automated within policy guardrails so the scarce human judgement is spent only where judgement is actually required. You are not removing the human; you are removing the human from the steps that were only ever latency.
 ---
 ## 3. Quantum vulnerability management — the core model
 Here is the model the rest of the book turns on, and the direct answer to "how do we size remediation to a world that moves in hours."
 A **quantum** is the smallest unit of remediation that (a) fully closes a specific exploitable path, (b) is sized to a time budget it can *actually be completed within*, and (c) ends in a verifiable signal. The word is deliberate. A quantum is *atomic* — you cannot ship half of it and claim half the protection (that is the ghost patch). And it is *discrete* — work is packetised into units that fit the time you have, not smeared across an infinite backlog.
 The sort key is not severity. It is **time-to-existential-impact**, which is a function of three things the estate actually determines:
 > **kill-chain position × reachability × exploit availability**
 A vulnerability that sits on the path to existential compromise, is reachable by the adversary, and has a working exploit in the wild has a time-to-impact measured in hours. The same vulnerability, segmented away and unreachable, has a time-to-impact measured in months — or never. **The vulnerability is identical; its quantum is different, because its position is different.** This is the Book I principle (kill-chain position changes priority, not the CVE) made operational.
 That sort produces three live quanta and one that is more dangerous than all of them:
 ### Critical quantum — the hours lane
 On the kill chain, reachable, exploitable now. The time budget is **hours**, and that fact dictates the response: **you cannot wait for a patch cycle, so the critical quantum is closed by a compensating control, not necessarily the patch.** Block it at the edge, sever the reachability, disable the vulnerable feature, isolate the host, pull it behind the WAF. The patch follows later in the standard lane on the normal change calendar. The critical quantum's job is to **move the asset out of the hours-window** — to convert a 4-hour time-to-impact into a non-urgent one — by the cheapest fast control available. This is the lane that must be partly autonomous (Section 6), because human-paced execution cannot meet an hours budget.
 ### Severe quantum — the days lane
 Material risk, reachable with friction, or where a compensating control already buys partial cover. The time budget is **days**. These are batched into a days-sized packet of work that can be fully completed and verified inside a single short change window — not started and left at 80%.
 ### Standard quantum — the sprint lane
 The long, real, non-urgent tail. The time budget is a **sprint**. The discipline here is batching: the long tail is drained in sprint-sized quanta of work that *can actually be finished*, each one atomic and verified, rather than as an ever-growing list nobody ever reaches the bottom of. This is the only lane where "patch velocity" is the right tool, and it is fine for it to be slow, because by definition nothing in it is on fire.
 ### Dark quantum — the unsized unknown
 The most dangerous quantum is the one you cannot size, because you cannot yet see the asset, cannot establish reachability, or cannot determine exploitability. An unsized quantum is not a low priority — it is an *uncharacterised* one, and uncharacterised risk on an unknown asset is exactly how estates die. The antifragile response is not to ignore it (it has no score, so the old model does) but to **route it to discovery and to the Kill Chain Assessment** — to spend effort turning a dark quantum into a sized one, because a known severe is safer than an unknown nothing. This lane is why discovery (Book IV, the zero-budget discovery playbooks, the Kill Chain Assessment app) is part of vulnerability management and not separate from it.
 **The quantum discipline in one line:** size every remediation to the time you actually have, make each unit atomic and verifiable, and spend your scarce judgement converting dark quanta into sized ones — not re-sorting the known list by the wrong key.
 ---
 ## 4. The barbell — fast containment and deep architecture, nothing in the fragile middle
 The vulnerability barbell has two ends and a lethal middle.
 **One end: cheap, fast, reversible containment.** The hours-lane compensating controls — edge blocks, reachability cuts, feature disables, isolation. Low cost, high speed, applied within policy, reversible when the patch lands. This end exists to win the time race the patch can never win.
 **The other end: slow, structural, blast-radius reduction.** Segmentation, least privilege, T0 protection, assume-breach architecture (the whole of Books II–V). This is the end that makes the ~90% of vulnerabilities *not matter*, because a vulnerability that cannot reach anything important and cannot pivot is a finding, not an incident. It is slow and expensive and it is the only durable bet — architecture beats velocity in the vulnerability race, and it is the only race you can actually win.
 **The fragile middle to avoid: the aging critical-patch backlog.** A months-long queue of "critical" patches is neither fast containment nor structural fix. It is the worst of both — it carries the urgency of the hours-lane but moves at the speed of the sprint-lane, so it spends maximum anxiety for minimum protection while the attacker clears it for you, one exploited host at a time. The barbell says: contain it fast *or* architect it away. Do not let it sit in the middle, aging, pretending that "we're working through the criticals" is a posture.
 The asymmetric-payoff reading (Pillar 5): a few hours of compensating-control work on a kill-chain node prevents a catastrophe, and a segmentation project that costs a quarter makes a thousand future CVEs irrelevant. Both ends of the barbell are convex. The fragile middle is concave — maximum cost, minimum return.
 ---
 ## 5. Optionality & recovery — designing so most vulnerabilities can't matter
 - **Reachability as a control surface.** If you can cut a vulnerable asset off from the adversary faster than you can patch it — and you almost always can — then reachability *is* your fastest remediation. Build the capability to sever reachability quickly (edge policy as code, network isolation on demand) and you have an answer to every hours-lane finding that does not depend on a vendor patch existing yet.
 - **Compensating-control inventory, mapped in advance.** The ~90% reduction only works if you already know, per asset, what controls are in front of it. Map EDR coverage, WAF rules, segmentation, and internet reachability *before* the incident, so that when a zero-day drops you can answer "are we actually exposed?" in minutes instead of days. This map is the single most valuable artefact in the programme.
 - **Blast-radius limitation as vulnerability management.** Every segmentation boundary and every collapsed standing privilege is a vulnerability-management control, because it converts "exploit one thing, own everything" into "exploit one thing, contain it." The cheapest way to manage a vulnerability is to have already made it survivable.
 - **Known-good baselines and config-as-code (ASTRAL).** When a vulnerability is exploited, the ability to restore the affected control plane to a verified baseline collapses the cost of exploitation. A reachable, recoverable, version-controlled estate treats a successful exploit as an inconvenience, not a catastrophe.
 - **The pre-made "isolate vs patch vs rebuild" decision.** Decide the criteria before the incident: when do we contain-and-wait, when do we emergency-patch, when do we rebuild from known-good? Deciding under fire is how the half-done remediation gets created.
 ---
 ## 6. Stressor — the autonomy and the feedback loop
 Two stressors run this book, and the second is the one that makes it antifragile rather than merely fast.
 ### Autonomy in the hours-lane — matching machine speed with machine speed
 The article that prompted this book is right about the core asymmetry: **attackers are executing at machine speed and defenders are still running remediation through human-paced processes designed for a world with weeks of lead time.** The hours-lane cannot be served by a pipeline with five human handoffs. So the critical quantum's execution — detect the new exposure, cross-reference the asset inventory, assess reachability and compensating controls, generate the work item with context, route it, and in the clear cases *apply the compensating control* — runs autonomously **within human-defined guardrails.**
 The repo's standing scepticism applies and sharpens the point rather than contradicting it: **AI on a broken foundation is expensive noise.** Autonomy without environmental context just generates tickets faster — "faster noise," the exact toil that makes developers dread security. The autonomy only works *because* the foundation is in place: the compensating-control map, the reachability model, the known-good baseline, the segmented architecture. Autonomy is the accelerator on the hours-lane; architecture is still the durable bet. The human role moves up a level — from doing the remediation to **governing the policy**: which classes of action the system may take, which severity thresholds trigger automated containment, which changes still require a human. That is a better use of scarce security talent and the only operating model that survives the volume. The concrete blueprint for this lane is in [AI-Assisted TVM](../playbooks/ai-assisted-tvm.md); this book is the principle, that playbook is the build.
 The guardrail is the whole game. Autonomous does not mean uncontrolled. The most defensible implementations keep the human at the policy boundary and delegate only execution — and they apply compensating controls (reversible, contained) far more readily than irreversible changes. Start the autonomy on the safest, highest-value action: cutting reachability on a confirmed-exploitable, internet-facing, kill-chain asset.
 ### The feedback loop — every exploited path becomes a shorter kill chain
 This is the climax, and it is the same machine as Book VI. A vulnerability that was exploited, or nearly exploited, is the cheapest penetration test you will ever get — honest, real-world data about exactly where a path to the crown jewels was open. Patching the CVE wastes that data. The antifragile move is to **sever the path**: the flat segment gets a boundary, the over-privileged service account gets collapsed, the reachable management interface gets pulled behind the bastion — so that the *next* vulnerability that lands on that path is a non-event before it is ever disclosed.
 Measure the loop, not just the lane. MTTR tells you how fast you patch; it does not tell you whether you are getting stronger. The antifragile metric is: **after each exploited-or-near vulnerability, did the kill chain get shorter?** If the last ten vulnerability incidents produced ten patches and zero severed paths, the loop is broken and you are merely fast. If they produced ten patches and six structurally shortened kill chains, the estate is getting harder to compromise every time it is tested — which is the only honest definition of antifragile.
 ---
 ## Honest uncertainty (verify the moving parts)
 Stable and Lindy (teach with confidence): CVSS is not a priority; kill-chain position is. Most criticals aren't reachable. A half-done remediation is a hidden full vulnerability. You cannot out-patch machine-speed exploitation; you can make most vulnerabilities not matter and contain the few that do. Every exploited path should shorten the kill chain. None of that churns — it is the architecture-beats-velocity thesis applied to vulnerabilities, and it will outlive every tool named here.
 What moves, and what you must verify:
 - **The headline statistics churn annually.** The "exploitation is #1, ~2× phishing" finding is the 2026 DBIR; the 4-hour and 43-day figures, the ~59,000-CVE projection, the autonomous-exploitation benchmarks — all of these are point-in-time and will move. The *direction* (exploitation rising, time-to-exploit collapsing, volume exploding) is the stable signal; the specific numbers need re-checking against the current year's DBIR, M-Trends, and FIRST/CVE data before you put them on a slide.
 - **The enrichment infrastructure is actively degrading.** NVD's backlog and the "Not Scheduled" reclassification mean the data you use to prioritise is itself unreliable and getting worse. Verify what enrichment you can actually trust *today*, and lean harder on your own reachability and exploitability signals precisely because the public ones are thinning.
 - **The autonomous-execution tooling is immature and fast-moving.** The Zero-Day-Agent-class pattern (autonomous detect → reachability assessment → compensating control) is real and operational but the products, their accuracy, and their guardrail models are evolving monthly. Verify current capability and, more importantly, current *failure modes* before you delegate any action — and start with reversible compensating controls, never irreversible change.
 - **The ~90%-not-exploitable figure is environment-specific.** It is a defensible industry estimate, not a law. The real number depends entirely on how well your compensating controls are actually mapped and enforced — and a mapped control that has rotted into a ghost is a false negative that will hurt you. Test the controls you are counting on, do not trust the map.
 - **Exploit-availability and threat-intelligence feeds** (CISA KEV, exploit databases, vendor advisories) are reliable in principle but vary in latency and coverage — verify which feeds are current and how fast they update before you wire them into the hours-lane.
 If a prioritisation decision hinges on a current specific, verify it and test it. "We confirmed this asset is internet-reachable and the EDR rule actually blocks the exploit" beats any CVSS score ever published.
 ---
 ## Consolidated judgement prompts
 - When a vulnerability on this estate is exploited, do we come back weaker, the same, or stronger? What's the mechanism that makes it stronger?
 - Are we sorting by CVSS, or by kill-chain position × reachability × exploit availability?
 - Of our "criticals," how many are actually reachable by an adversary right now? If we don't know, that is the first finding.
 - For our top exploitable findings: can we sever reachability faster than we can patch? If yes, why are we waiting for the patch?
 - Is anything in the "done" column a ghost patch — closed but never verified to enforce?
 - What is sitting in the fragile middle — the aging critical-patch backlog that is neither contained fast nor architected away?
 - How many human handoffs are in our hours-lane, and which of them require actual judgement versus just adding latency?
 - What's in the dark quantum — the unscanned, the unscannable, the unowned — and what are we doing to size it?
 - For the last ten vulnerability incidents: how many produced a severed path versus just a patch? Is the kill chain getting shorter?
 ---
 ## Where this book sits in the arc
 Books II–V harden the containers and contents; Book VI builds the loop that makes shocks pay. Book VII is what happens when the dominant shock stops being a phished human and becomes an exploited vulnerability arriving at machine speed. The answer is not a seventh thing bolted on — it is the same antifragile lens (subtract the false, protect the irreplaceable, contain the few that matter, feed every shock back into structure) applied to the surface the attacker now prefers. The vulnerability list was never the unit of work. The kill chain always was.
 Move fast and fix things.
 ---
 *Book VII of the Antifragile Handbook. Pairs with the [Quantum Vulnerability Management](../core/quantum-vulnerability-management.md) framework and the [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md); the build-level companion is the [AI-Assisted TVM Blueprint](../playbooks/ai-assisted-tvm.md).*
@@ -0,0 +1,101 @@
 # The Antifragile Handbook for M365 & Active Directory
 Most M365 estates are fragile. Not because nobody has run the benchmarks — they have, and the scorecards look fine. They're fragile because a compliance certificate and a hardened estate are different things, and the industry has spent years teaching people to chase the first while missing the second.
 This handbook is the attempt to close that gap. It is written for consultants who want to walk into a tenant they've never seen and find the thing that will actually kill the client — not the thing that fails the CIS audit. It is opinionated, sequenced, and deliberately uncomfortable. If you want a checklist, the CIS Benchmark is free. If you want to understand *why* the checklist exists, what breaks when the controls fail, and how to build an estate that gets stronger under attack rather than just surviving it, start here.
 The governing question in every book is the same:
 > **When — not if — this fails, does the estate come back weaker, the same, or stronger?**
 ---
 ## The books
 ### [Book I — Principles & Judgement](00-principles-and-judgement.md)
 *The craft before the controls.*
 Everything else in this series rests on the discrimination developed here: the ability to distinguish signal from noise, to know that disabling legacy auth outranks renaming forty GPOs, and to understand why compliance is a floor and a by-product rather than the target. This book also introduces the "move fast and fix things" operating principle — a deliberate inversion of the Silicon Valley creed, because the things are already broken and speed means refusing to let a thirty-page risk-acceptance process protect a fragility a teenager with a phishing kit will remove for free.
 Read this first, even if you're experienced. Especially if you're experienced.
 ---
 ### [Book II — Hybrid Identity](01-hybrid-identity.md)
 *Draw the wall between on-prem and cloud. In most estates there isn't one — there's a hallway with the door propped open.*
 In a hybrid estate, on-prem AD and Entra ID are not two systems with a guarded border. They're one organism wearing two badges, joined by a bridge that most organisations cannot draw, do not monitor, and have never tested severing. This book maps the bridge — the sync engine, the connector accounts, the authentication method, the writeback paths — and explains why a single compromise of the sync server gives an attacker DCSync on-prem *and* cloud object manipulation at the same time. Then it shows how to build the actual wall.
 If you only ever fix one domain, fix this one. Everything else assumes identity holds.
 ---
 ### [Book III — Privileged Access](02-privileged-access.md)
 *Privilege is blast radius with a time axis. Standing privilege reaches everything, forever. The whole job is to collapse both: less reach, less time.*
 The most dangerous accounts in any estate are the ones nobody is watching — the permanent Domain Admins that have always existed, the service accounts with Kerberoastable SPNs and passwords from 2016, the app registrations with `RoleManagement.ReadWrite.Directory` and admin consent that nobody remembers granting. This book names them, shows how they become privilege-escalation paths, and builds the case for Just-in-Time access, Entra PIM, and a rigorous service-principal audit as the core of any engagement.
 The single most important number in this book: how many identities hold standing privilege right now?
 ---
 ### [Book IV — Devices & Endpoint (Intune)](03-devices-and-intune.md)
 *The device will be compromised. Compliant is not the same as secure, and the portal toggle is not the same as the device's behaviour.*
 Endpoint programmes are usually built on a wish: make the device trusted. That wish is unwinnable. This book flips the question — assume every device is already compromised, and ask what still holds — and uses that reframe to expose the gap between a "compliant" device in the portal and a device that is actually behaving as expected. It covers the hidden fleet (managed, unmanaged, shadow, dark), the Conditional Access misconfiguration patterns that most estates share, and how to build posture that survives an untrusted device rather than depending on the device being clean.
 The spine of the book: compliance is a signal, not a checkbox.
 ---
 ### [Book V — Data & Collaboration](04-data-and-collaboration.md)
 *Data is liquid. The question is never "is it locked down" but "where can it flow, who can reshare it, and can you see and reverse the flow?"*
 Books II–IV protect the containers: identity, privilege, devices. This book is about the contents, and contents obey different physics. An "Anyone with the link" SharePoint share is a bearer token — no identity, no MFA, no device check, often no expiry, forwardable to anyone, reachable by the open web if it leaks. Guest sprawl hands your blast radius to external identities you don't govern. Email is the oldest exfil channel in the industry and almost never properly monitored. This book maps the exposure patterns across Exchange, SharePoint, Teams, and OneDrive, and builds the controls that let you see — and reverse — the data flow.
 For most estates the honest answer to "can you see where it went?" is no. That's the starting point.
 ---
 ### [Book VI — Recovery & Detection-as-Feedback](05-recovery-and-detection.md)
 *Robust means you survive the shock unchanged. Antifragile means you come back stronger. The shock is coming either way — the only choice is what you do with it.*
 The capstone, because it decides whether everything before it was merely robust or genuinely antifragile. Detection and recovery are not the sad afterthought — they're the feedback loop that changes the structure of the estate after every shock. An org that buries incidents stays fragile. An org that treats them as fuel becomes antifragile. This book covers the recovery lies the industry tells itself (untested backups, undocumented break-glass, AD forest recovery nobody has practised), builds the detection architecture, and — most importantly — describes the machine that turns incidents, alerts, and near-misses into structural improvement.
 Read this once you've built something worth protecting — it closes the original defensive arc (Books I–VI).
 ---
 ### [Book VII — Vulnerability Management](06-vulnerability-management.md)
 *The patch cycle was built for a world where you had weeks. That world is gone. Stop racing the attacker to the patch — change the race.*
 The first six books assume the dominant way into an estate is a phished human. As of the 2026 Verizon DBIR that assumption is wrong: **exploitation of vulnerabilities is now the leading initial-access vector, roughly twice phishing.** This book changes the lens to match. It refuses the two losing moves — sorting 40,000 findings by CVSS, and trying to "patch faster" against a 4-hour exploitation window — and replaces them with the antifragile alternative: subtract the ~90% of criticals that aren't actually reachable, size the rest into **quanta** by time-to-existential-impact (hours / days / sprint, plus the dangerous *dark* quantum you can't yet size), contain the few that matter with compensating controls rather than waiting for a patch, and feed every exploited path back into a shorter kill chain.
 It pairs with the [Quantum Vulnerability Management](../core/quantum-vulnerability-management.md) framework and the [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md). Read it when the threat landscape — not the maturity model — forces the question.
 ---
 ## Field Guide (2026 Edition)
 The books are principles; they are deliberately stable. Two field guides apply them in practice:
 **[Field Guide — 2026 Edition](field-guide-2026.md):** Concrete actions and current tooling for foundational engagements. The "do this" companion to the handbook. Review January 2027.
 **[Field Guide — Adversarial Validation](field-guide-adversarial-validation.md):** For clients who have done the foundational work. Tests declared controls against observed behaviour, domain by domain. Closes with a client leave-behind cadence so the admin can self-monitor between engagements. Review January 2027.
 For inspection checklists, see the [assessment templates](../assessment-templates/): the [Engagement Checklist](../assessment-templates/engagement-checklist.md) (foundational), the [Adversarial Validation Checklist](../assessment-templates/adversarial-validation-checklist.md) (phase 2), and the [Self-Service Cadence](../assessment-templates/self-service-cadence.md) (client leave-behind).
 ---
 ## How to use this series
 The books are sequenced deliberately — each one assumes the previous — but an experienced practitioner can use them as field references. The fragility inventories at the start of each book are designed to be usable on day one of an engagement, before you've had time to read everything. The "governing question" at the start of each section is designed to be asked out loud, to a client, in a room where someone will have to answer it.
 The goal throughout is not compliance. Compliance is a by-product. The goal is an estate that gets harder to compromise every time it's tested — and is tested often enough to know.
@@ -0,0 +1,568 @@
 # M365 + AD Field Guide — 2026 Edition
 > *The books are principles. This is practice — concrete actions, current tooling, and 2026-specific decisions. It will need updating next year. That is the point.*
 **Last updated:** June 2026
 **Companion to:** The Antifragile Handbook for M365 & AD (Books I–VI)
 **Next review:** January 2027
 ---
 ## What this is
 The Antifragile Handbook teaches judgement. This document teaches actions — what to do, in 2026, with the tooling that exists now, in the estates you will actually walk into. Where the handbook says "eliminate AD FS," this document says how and what blockers to expect. Where the handbook says "test the CA policy," this document says what a ghost policy looks like when you find one.
 Read the books first. Use this document on-site.
 ---
 ## Notation
 **P0** — attacker already through; fix before leaving this session
 **P1** — closes in this engagement
 **P2** — roadmap item, documented
 **2026 note** — something that has changed or become clearer since the handbook was written
 ---
 ## 1. Hybrid Identity
 ### Remove AD FS — this is now a P0 conversation
 In 2026, Microsoft's migration tooling has matured to the point where AD FS is a choice, not an inevitability. Every client still running it should have a migration plan or a written, named reason for not having one.
 **Why it is a P0:** Golden SAML is still an active nation-state technique. The token-signing private key in most tenants has never been rotated, is stored on the AD FS servers, and is not monitored. One foothold on any on-prem system that can reach the AD FS servers ends cloud identity entirely — silently, with validly-signed tokens, no failed logins, nothing for a SIEM to catch.
 **What to do:**
 - In the Entra portal, go to Identity > Applications > AD FS activity (if it appears). This gives you the relying party trust inventory and migration readiness per application. This is your conversation starter.
 - Enumerate relying party trusts: `Get-AdfsRelyingPartyTrust | Select-Object Name, Enabled, Identifier`. Each enabled one is a blocker that needs a cloud equivalent or decommission plan.
 - Check the token-signing cert: `Get-AdfsCertificate -CertificateType Token-Signing`. Note the NotAfter date and when it was last rotated. "Has not been rotated since installation" is the expected answer and is itself a finding.
 - Staged rollout in Entra lets you migrate users incrementally — you do not have to cut over all at once. Use it.
 **Migration target:** Password Hash Sync (PHS) + Entra-managed MFA via Conditional Access. This removes the on-prem dependency for cloud authentication and kills Golden SAML as a class.
 **2026 note:** The AD FS migration activity report and staged rollout tooling make this significantly more tractable than it was in 2023–2024. Remove the roadmap language and have the P0 conversation.
 ---
 ### Connect Sync vs Cloud Sync — new deployments
 **2026 recommendation:** For new hybrid sync deployments and organizations without complex topologies (no device writeback, no large object filtering requirements, no multi-forest writeback scenarios), **Entra Cloud Sync** is the preferred deployment. Smaller attack surface than Connect Sync (no SQL Express, no full-blown sync engine, multiple lightweight agents for HA), easier to harden, no single machine that holds DCSync-capable credentials.
 **Connect Sync stays correct for:** Large/complex topologies, specific writeback scenarios (check the current parity matrix at Microsoft Learn before promising Cloud Sync covers a client's requirements — this changes).
 **For existing Connect Sync deployments:** The migration path to Cloud Sync exists. Check current documentation for topology compatibility. Do not promise the migration before confirming the client's scenario is supported.
 **In either case, the sync server is Tier 0.** See the hardening actions below.
 ---
 ### Sync server hardening — concrete actions
 The sync server (Connect or Cloud Sync agent host) is typically treated as a utility VM. It holds an identity capable of DCSync. Treat it accordingly.
 **Immediate checks:**
 - Is the server domain-joined to the production domain? If yes, its blast radius is one hop from any Tier 1 or Tier 2 compromise. Ideal: join it to a dedicated Tier 0 or management forest, or isolate it behind jump-box access only.
 - What account runs the connector service, and what permissions does it have? For Connect Sync, the on-prem connector account needs `Replicate Directory Changes` and `Replicate Directory Changes All`. Confirm it is a dedicated service account (ideally gMSA), not a human admin account that doubled up.
 - Has the server ever been patched? Check `Get-HotFix | Sort-Object InstalledOn -Descending | Select-Object -First 5`. If nothing in the last 60 days, that is a finding.
 - Is the Entra connector account (Directory Synchronization Accounts role) monitored? Any sign-in from any host other than the sync server should alert immediately.
 - Are local administrators on the sync server documented and minimal?
 ---
 ### Cloud-only Global Admins — enforce it on day one
 **P0 if not in place.** Synced accounts holding Global Admin are the most common single finding across all engagements and the most direct path from a ransomwared on-prem AD to cloud dominance.
 **Find the synced GAs:**
 ```powershell
 # Connect-MgGraph -Scopes "Directory.Read.All"
 $gaRoleId = (Get-MgDirectoryRole -Filter "displayName eq 'Global Administrator'").Id
 Get-MgDirectoryRoleMember -DirectoryRoleId $gaRoleId |
  Where-Object { $_.AdditionalProperties['userPrincipalName'] -notlike "*.onmicrosoft.com" }
 ```
 Every result is a synced account. Every synced account in GA is a P0.
 **Remediation path:**
 1. Create a new cloud-only account (`user@tenant.onmicrosoft.com` format), assign GA, configure phishing-resistant MFA.
 2. Validate the new account works — sign in, confirm PIM activation if PIM is in place.
 3. Remove GA from the synced account.
 4. Add a Conditional Access policy blocking synced account UPNs from holding privileged roles (belt-and-suspenders; requires knowing the UPN pattern).
 ---
 ### Seamless SSO key — rotate it
 `AZUREADSSOACC` was created when Seamless SSO was enabled and is almost certainly unrotated. The Kerberos key on this account is a silver-ticket / cloud token-forging exposure if the on-prem is compromised.
 **Check last password set:**
 ```powershell
 Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet | Select-Object PasswordLastSet
 ```
 If this matches the approximate go-live date of the Microsoft 365 tenant, it has never been rotated.
 **Rotate it:** Use the `Update-AzureADSSOForest` PowerShell command (in the MSOnline / Entra Connect tooling). Run it twice per domain — same discipline as KRBTGT rotation. If Seamless SSO is not needed (Entra join and modern auth only), remove `AZUREADSSOACC` entirely.
 ---
 ### Writebacks — name and own each one
 Enumerate which writebacks are enabled (password writeback, group writeback, device writeback) in Connect Sync or Cloud Sync configuration. For each:
 - Who owns the decision to have it enabled?
 - What does an attacker reach if the cloud side is compromised — can they write into on-prem AD?
 - Is the reverse blast radius documented?
 Password writeback is usually justified (SSPR usability). Group writeback creates a two-way channel between cloud security groups and on-prem AD — the blast radius should be explicit. If there is no current owner or justification for a writeback, disable it.
 ---
 ## 2. Privileged Access
 ### PIM: table stakes in 2026
 If the client has Entra ID P2 (included in Microsoft 365 E5, Business Premium, and available as an add-on) and is not using PIM for Entra administrative roles, that is a P0. There is no acceptable reason in 2026 for standing Global Admin, Privileged Role Administrator, Security Administrator, or Exchange Administrator assignments when PIM provides JIT elevation.
 **What to confirm during engagement:**
 - Global Admin: eligible only, not active. Any active (permanent) GA assignment that is not a break-glass account is a finding.
 - Privileged Role Administrator: requires approval workflow on activation, not just MFA. This role controls who becomes admin — it should require a second human to approve.
 - Security Administrator and Exchange Administrator: eligible, MFA on activation, justified time box (8 hours maximum for a working day).
 - PIM activation requires phishing-resistant MFA. If it accepts push-approve, it is phishable.
 **2026 note:** PIM now supports custom role definitions. If a client is assigning built-in broad roles (like Global Admin) to do a narrow task, check whether a custom role or a more scoped built-in (e.g., Intune Administrator instead of Global Admin) applies.
 ---
 ### Service principals: the 2026 audit
 Service principals hold more standing privilege in most tenants than all human admins combined. They cannot do MFA. They are almost never reviewed. This is the dark matter of privileged access.
 **Escalation-grade Graph permissions — find every app holding these in 2026:**
 - `RoleManagement.ReadWrite.Directory` — can grant any Entra role
 - `AppRoleAssignment.ReadWrite.All` — can assign any app role, including to itself
 - `Application.ReadWrite.All` — can modify any application and create new ones
 - `Directory.ReadWrite.All` — broad directory write
 - Any API permission scoped `Full` or ending in `.ReadWrite.All` for sensitive services
 ```powershell
 # Find service principals with dangerous Graph permissions (application permissions)
 Get-MgServicePrincipal -All | ForEach-Object {
  $sp = $_
  Get-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $sp.Id |
    Where-Object { $_.PrincipalId -eq $sp.Id }
 } # — pipe to filter on the dangerous role IDs listed above
 ```
 For every hit: who created this app registration, when, is the permission still needed, is there an expiring secret or certificate, and can it be replaced with a managed identity?
 **Secrets never expire — find them:** In the Entra portal > App registrations > All applications > sort by "Certificate & secrets expiration." Filter for never-expiring secrets. Every one is a standing credential with no forced rotation.
 ---
 ### On-prem service accounts: gMSA yes, dMSA wait
 **gMSA (Group Managed Service Accounts):** The right answer for on-prem service accounts in 2026. Automatic password rotation (no static secret), not Kerberoastable in the traditional sense, natively supported across Windows Server 2012+. If a client has regular service accounts with static passwords (especially if those passwords are 2+ years old), migrate to gMSA.
 **Kerberoasting check (run this, not just ask about it):**
 ```powershell
 # Find accounts with SPNs and static passwords
 Get-ADUser -Filter {ServicePrincipalName -ne "$null"} -Properties ServicePrincipalName, PasswordLastSet, Enabled |
  Where-Object {$_.Enabled -eq $true} |
  Select-Object Name, PasswordLastSet, ServicePrincipalName
 ```
 Any result with a `PasswordLastSet` older than 1 year is Kerberoastable and a P0.
 **dMSA (Delegated Managed Service Accounts):** Introduced with Windows Server 2025-era tooling, targeting the migration path from standing service accounts. Do not recommend dMSA in 2026 — there is published privilege-escalation research against the migration path. Use gMSA until the specific vulnerabilities are patched and the client's environment is confirmed current. Check current Microsoft advisories at engagement time.
 ---
 ### LAPS: Windows LAPS deployment in 2026
 **Legacy Microsoft LAPS** (the separately-downloaded agent) should be migrated to **Windows LAPS**, the built-in solution available in Windows 10 22H2 / Windows 11 22H2 and Windows Server 2019+ with April 2023 updates or later.
 Windows LAPS can store passwords in AD, in Entra ID (for Entra-joined devices), or both. For hybrid estates, store in both. Manage via Intune (cloud-joined) or GPO (domain-joined).
 **Coverage check:**
 ```powershell
 # Computers without LAPS password set (null = not managed)
 Get-ADComputer -Filter * -Properties 'ms-Mcs-AdmPwd', 'msLAPS-Password' |
  Where-Object { $_.'ms-Mcs-AdmPwd' -eq $null -and $_.'msLAPS-Password' -eq $null } |
  Select-Object Name
 ```
 Every result is a computer with a shared or unknown local admin password — lateral movement risk.
 ---
 ### KRBTGT rotation
 Check password age. 365+ days without rotation is a P1. No documented rotation since domain creation (common when the domain is 5–10 years old) is a P0 for any high-sensitivity engagement.
 ```powershell
 Get-ADUser krbtgt -Properties PasswordLastSet | Select-Object PasswordLastSet
 ```
 Rotation procedure: rotate once, wait at least the max ticket lifetime (default 10 hours), rotate again. Document both rotation timestamps. After rotation, monitor for authentication failures caused by cached golden tickets — if detections fire, that was a real golden ticket, not a drill finding.
 ---
 ### ADCS: treat it as Tier 0
 If the client has Active Directory Certificate Services deployed (almost all do if they have a domain older than 7 years), run a basic ESC vulnerability check. The ESC1–ESC8 misconfigurations are well-documented, freely exploitable, and almost never remediated because most organizations do not know they have ADCS issues.
 **Quick check:**
 - Is ADCS installed? `Get-WindowsFeature ADCS-Cert-Authority` on any server
 - Is any template published with "Supply subject in request" + broad enrollment rights? That is ESC1.
 - Certipy (open source) or Certify: run in read-only enumeration mode (`certipy find`) to identify vulnerable templates
 ADCS is Tier 0. It sits on whatever server it runs on, and that server should have the same access controls as a domain controller. Verify it is not on a Tier 1 or Tier 2 server.
 ---
 ### Admin workstations — the cloud VM is the deployable PAW
 Physical PAWs are right in principle and almost never get deployed. Hardware procurement, second device, behaviour change — the project does not survive contact with a real IT budget. Do not open the conversation with "you need a dedicated PAW laptop." Open it with the cloud admin VM.
 **The cloud admin VM:** a Windows 365 or Azure Virtual Desktop instance provisioned from a hardened template. The admin connects from their normal device via browser or RDP. Privileged credentials — including WireGuard keys for the management overlay — live in the cloud VM, not on the admin's local device. Compromise response: wipe it, reprovision from template in under 20 minutes.
 **Provisioning the cloud admin VM:**
 1. Create a Windows 365 or AVD instance from a hardened base image (CIS L2 baseline or equivalent)
 2. Enrol in Intune, apply a configuration profile: no internet browsing, no personal email, no Microsoft Store apps, screen lock on idle, BitLocker enforced
 3. Scope a CA policy restricting Global Admin and privileged role activation to this device (device compliance + named Intune group)
 4. Install the Nebula client (if deploying T0 overlay) and distribute the pre-signed node certificate
 5. Install the Tailscale client (if deploying T1 overlay) and enrol with the Entra OIDC identity
 **Minimum viable without the overlay:** a dedicated Intune-enrolled, Entra-joined cloud VM with no email and no general browsing, and a CA policy restricting GA activation to it. Not perfect, but it will actually get deployed and maintained.
 ---
 ### Management overlay — Nebula for T0, Tailscale for T1
 **When a client needs this:** SME and mid-market clients with multi-cloud resources, DevOps workloads, or remote admins — and no physical data centre with a proper management VLAN. The overlay builds the management plane that the physical network cannot provide.
 **When a client does not need this:** organisations with their own data centres and physical network infrastructure already in place. Traditional management VLAN segmentation plus jump boxes is the right answer there. Adding an overlay creates a new Tier 0 component without proportional benefit.
 **The T0 overlay — Nebula:**
 Nebula has no coordinator in the runtime path. Once certificates are distributed, the overlay runs with zero external dependencies. This is the right property for T0: a compromised or unavailable external service cannot affect access to your domain controllers.
 Deployment steps:
 1. Provision the Nebula CA on a dedicated air-gapped machine (a dedicated laptop that is never networked, or a cheap PC kept in a drawer)
 2. Generate and sign node certificates for each T0 node (DCs, sync server, ADCS, cloud admin VMs/PAWs)
 3. Distribute the signed certificates and the CA certificate to each node
 4. Configure the Nebula ACL policy: cloud admin VMs can reach DCs on port 3389 (RDP) and 5985/5986 (WinRM); nothing else. DCs do not reach each other through Nebula (they have their own replication channel)
 5. Start the Nebula service on each node. Test connectivity from the cloud admin VM to a DC
 6. Document the CA signing ceremony: who can sign new certs, what approval is needed, where the CA key is stored, how to revoke (distribute updated blocklist to all nodes)
 **Realistic T0 node count:** 15–25 nodes for a 5,000-person organisation. Certificate management is a documented ceremony run a few times a year, not an ongoing operational burden.
 **The T1 overlay — Tailscale:**
 Tailscale with Entra OIDC + key expiry gives you device trust (WireGuard node key) plus per-session identity assertion (Entra MFA on re-authentication). Configure key expiry to force re-authentication on a schedule aligned with the session risk tolerance (8–24 hours for admin access).
 Deployment steps:
 1. Create a Tailscale account or deploy Headscale (for sovereign requirements)
 2. Configure the OIDC integration with Entra ID. Set the MFA requirement to phishing-resistant (FIDO2) in the Entra Conditional Access policy that governs Tailscale authentication
 3. Set key expiry: 8–24 hours for admin nodes, 24–72 hours for standard nodes
 4. Define ACL policy: cloud admin VMs reach T1 servers on management ports only; standard user devices do not appear in the T1 ACL
 5. Enrol cloud admin VMs as nodes. Enrol T1 servers (member servers, cloud management hosts, K8s API server endpoints)
 6. Test: attempt to reach a T1 server from a non-enrolled device. Expected: no route. From an enrolled cloud admin VM: connected
 **What Tailscale carries for multi-cloud:** kubectl access to K8s clusters, SSH/RDP to member servers and cloud VMs, cloud CLI access where the management API is behind a private endpoint. It does not carry M365 admin traffic — that goes direct to Microsoft over the internet, gated by Conditional Access.
 **The Nebula CA — the one critical operation:**
 The CA key is the trust anchor for the entire T0 overlay. Its compromise means an attacker can enrol their own node and grant it access to every DC. Treat it accordingly:
 - Air-gapped machine, never networked after initial setup
 - CA key encrypted at rest on the machine and backed up separately
 - Certificate lifetime: 180 days maximum, so non-renewal handles most revocation cases
 - Revocation: generate and distribute an updated `blocklist.pem` to all nodes if a PAW is lost or an admin departs before cert expiry
 - At least two named people who know the ceremony and can perform it
 ---
 ## 3. Devices & Endpoint
 ### Reconcile the real fleet — do this on day one
 Do not trust Intune's enrolled device count or any CMDB. Pull from four sources and compare them:
 1. Intune managed devices (Intune portal)
 2. Entra registered/joined devices (Entra portal > Devices)
 3. Entra sign-in logs, device detail (what is actually authenticating)
 4. Network device discovery if in scope
 The gap between sources 1+2 and source 3 is your shadow/dark device population. Source 3 will almost always be larger. Every device authenticating that is not in sources 1+2 is an unmanaged device reaching data.
 **Concrete — pull sign-in logs by device compliance state:** In the Entra portal: Sign-in logs > Add filter > "Managed device" = No or "Compliant" = No > export. Count the distinct device IDs. That count, compared against your Intune enrolled count, is the gap metric.
 ---
 ### Cloud-native migration: Entra join + Intune as default
 For any new device deployment or device refresh in 2026, **Entra join + Intune management** is the default. Hybrid Entra join (AD-joined + cloud-registered) is technical debt to retire, not a target state.
 **Migration readiness check:** What on-prem resources does the client's fleet actually need? Line-of-business applications, file shares, printers? Each dependency is a reason to stay hybrid; each that can be moved or resolved with another mechanism is a reason to go cloud-native. Build the dependency map first.
 **GPO to Settings Catalog:** Most GPO settings now have equivalents in the Intune Settings Catalog. The IntunePolicyParser tool can parse existing GPOs and identify Settings Catalog equivalents. Run this early in an endpoint engagement to scope the migration effort.
 ---
 ### Conditional Access — test every policy before signing off
 This is not a recommendation. It is a requirement.
 **Protocol:**
 1. Before changing or reviewing any CA policy, write down the expected behavior for the users and conditions in scope: *"User X, device Y, location Z → MUST be [blocked/granted/MFA-prompted]."*
 2. Use What If as a logic check only — it evaluates configuration, not enforcement.
 3. Drive real sign-ins for every important user/condition combination. Observe the actual result.
 4. If the observed result contradicts the displayed configuration, recreate the policy from scratch. Do not edit the existing object — a ghost policy carries corruption forward through edits.
 5. Re-test after any tenant-level change: adding a domain, changing federation, new app registration. You do not need to have touched the CA policy for it to ghost.
 **Report-only mode:** Use report-only to pre-validate before enabling. But test in enabled mode before signing off. Report-only cannot find a ghost policy — only a live enforcement failure can.
 ---
 ### EPM: eliminate standing local admin
 In 2026, **Endpoint Privilege Management (EPM)** in Intune is the right answer for "some users need admin rights for specific software." EPM provides JIT, audited, approved elevation without giving the user permanent local admin.
 **Licensing:** Requires Intune Plan 2 or the Intune Suite (not included in standard Business Premium or E3 — verify licensing before scoping).
 **Deployment:**
 1. Audit current local admin membership across the fleet (GPO reporting or Intune device reports)
 2. Identify the specific applications or tasks requiring elevation
 3. Create EPM rules for those specific executables
 4. Remove standing local admin from standard user accounts
 5. Monitor EPM elevation events for anomalies
 If EPM licensing is not available, Windows LAPS for local admin credentials (randomized, no shared password) plus a JIT process for elevation requests is the intermediate posture.
 ---
 ### Update rings: the lesson from 2024
 Configure update rings in Intune for all managed endpoints. Every client needs:
 - **Pilot ring** (5–10% of devices, IT staff / early adopters): 0 days deferral
 - **Broad ring** (remainder): 7-day deferral after pilot passes
 - A named person with the authority to **halt a broad ring push** — confirmed they know how and have tested it
 **Windows Autopatch** (included in Business Premium, E3 with Intune add-on, E5) automates ring management and defers intelligently. If the client is licensed for it and not using it, that is a quick win.
 The 2024 CrowdStrike event applies not just to AV/EDR updates — it applies to any software distributed at scale. Update ring discipline is now an endpoint governance requirement, not a preference.
 ---
 ### MAM boundaries: test them on a real device
 If the client uses App Protection Policies for BYOD (MAM-WE), the policy screen does not prove enforcement. Test on real devices, on current OS builds, per platform:
 **Test protocol (run separately on iOS and Android):**
 - Attempt to copy text from a managed app (Outlook, Teams) and paste into an unmanaged app
 - Attempt to "Open in" from a managed attachment to an unmanaged app
 - Attempt to save a file locally or to the camera roll
 - Attempt to screenshot (if blocked by policy)
 - Test from an unmanaged browser accessing SharePoint or OWA
 Document where "Block" does not block. When you find a gap that survives reinstall on multiple devices, that is a vendor escalation, not a configuration fix.
 ---
 ## 4. Data & Collaboration
 ### Anonymous sharing: disable at the tenant level on day one
 "Anyone with the link" sharing is a bearer token for your data — no identity required, forwardable, often with no expiry, reachable by anyone who ever held the link. This is the single largest data exposure fragility in M365.
 **Immediate action:** SharePoint Admin Center > Policies > Sharing > External sharing: set to "New and existing guests" (requires authentication) or "Only people in your organization." If the client has a business case for anonymous links, scope specific sites where it is permitted and disable at the tenant level for everything else.
 **Enumerate existing anonymous links:**
 ```powershell
 # PnP PowerShell
 Get-PnPTenantSite -IncludeOneDriveSites | ForEach-Object {
  Get-PnPSiteCollectionSharingLinks -Site $_.Url
 } | Where-Object { $_.Link -like "*guestaccess*" }
 ```
 The list you get is almost always longer than anyone expected. The exercise of producing it is itself a finding.
 ---
 ### External auto-forwarding: block it and check for active rules
 **Block at the global level:** Exchange Admin Center > Mail flow > Remote domains > Default domain > Automatic forwarding: Disabled.
 **Check for existing rules (do this before blocking in case active BEC is in progress):**
 ```powershell
 Get-TransportRule | Where-Object {$_.BlindCopyTo -ne $null -or $_.RedirectMessageTo -ne $null} |
  Select-Object Name, BlindCopyTo, RedirectMessageTo, Enabled
 ```
 Any rule forwarding to an external address with no documented business owner is a potential BEC persistence mechanism. Treat as P0 until confirmed otherwise.
 Also check Outlook/OWA rules at the mailbox level for executive accounts:
 ```powershell
 Get-Mailbox -ResultSize Unlimited | Get-InboxRule |
  Where-Object {$_.ForwardTo -ne $null -or $_.RedirectTo -ne $null} |
  Select-Object MailboxOWAUrl, Name, ForwardTo, RedirectTo
 ```
 ---
 ### Crown jewels: name them before scoping DLP or labels
 The first question in every data engagement: *"Which five data sets, if exfiltrated, would end or materially damage this business?"*
 If the client cannot name them, that is finding #1 and the prerequisite for everything else. DLP and sensitivity labels applied before the crown jewels are identified are DLP and sensitivity labels that protect the wrong things.
 Common crown jewels in 2026: M&A communications, board and executive email, source code repositories, customer PII data subject to GDPR/NIS2, financial forecasts and models, intellectual property, credentials and secrets stored in SharePoint/Teams.
 Once named: where do they live? Who has access? Are they labeled? Is access audited?
 ---
 ### Sensitivity labels and auto-labeling
 **2026 recommendation:** If the client is on E5 Compliance or equivalent, deploy auto-labeling policies for the crown jewel data types. Manual labeling depends on user behavior; auto-labeling does not.
 **Licensing check first:** Sensitivity labels: all M365 plans. Auto-labeling, advanced DLP, and Purview data governance: M365 E5 Compliance or the Microsoft Purview compliance add-on. Verify before scoping.
 **Implementation sequence:**
 1. Define the crown jewels (see above)
 2. Create sensitivity labels in order from most to least restrictive (Highly Confidential, Confidential, Internal, Public)
 3. Apply encryption to Highly Confidential and Confidential labels — encryption travels with the file, including after exfiltration
 4. Configure auto-labeling for known high-value content types (credit card numbers, national IDs, custom regex for the client's IP)
 5. Monitor label application events before enforcing auto-labeling in production
 ---
 ### Guest access: treat as standing blast radius
 Run a guest access review on every engagement. Most tenants cannot produce the list of current guests without effort. The exercise of trying to produce it is the finding.
 **Enumerate guests:**
 ```powershell
 Get-MgUser -Filter "userType eq 'Guest'" -All |
  Select-Object DisplayName, Mail, CreatedDateTime, SignInActivity
 ```
 Sort by `LastSignInDateTime`. Guests who have not signed in for 90+ days have no legitimate active need. The default should be expiration, not permanence.
 **Configure guest access reviews** in Entra Identity Governance > Access reviews. Set recurring reviews for all guests at 90-day intervals. When a reviewer does not respond, the default action should be removal, not retention.
 ---
 ### Audit log: verify it is on and retained
 Do not assume audit logging is enabled. Go to Microsoft Purview > Audit > Start recording user and admin activity (if the banner appears, it is not on). Then run a test search to confirm log entries are being captured.
 **Retention check — critical:**
 - E3 licensing: 90-day default retention
 - E5 / Purview Audit Premium: 1 year (extendable to 10 years with add-on)
 - Unified audit log must be explicitly enabled; it has historically not been on by default in older tenants
 For incident response purposes: if a breach is discovered 60 days in, and the client has 90-day retention, the evidence window is 30 days. For most meaningful incidents, 90 days is insufficient. Scope the retention discussion explicitly.
 ---
 ## 5. Recovery & Detection
 ### M365 backup: the mandatory conversation
 Native Microsoft 365 provides recycle bins and version history. It does not provide point-in-time backup against ransomware, malicious admin deletion, or retention policy expiry.
 **The question to ask the client:** "If someone with Global Admin access right now deleted every Exchange Online mailbox and every SharePoint site, what is your recovery path, and how long does it take?"
 If the answer involves the Microsoft recycle bin and "we would call Microsoft support," that is not a recovery plan. The recycle bin window is 14–93 days depending on the workload and configuration, and it does not protect against retention policy deletion or hard-delete operations by a malicious admin.
 **2026 recommendation:** A third-party M365 backup solution covering Exchange Online, SharePoint Online, OneDrive for Business, and Teams is a baseline requirement for any client treating M365 as business-critical. The market is mature. Veeam, AvePoint, Acronis, and Dropsuite are the common options. Assess per client need.
 ---
 ### Configuration-as-code: export the control plane
 Export CA policies, Intune baseline configurations, and Entra role assignments to code or structured files at the start of every engagement. This serves three purposes:
 1. Known-good baseline to detect drift and ghost configuration against
 2. Rebuild artifact for a compromised or corrupted tenant
 3. Change management — you can diff the configuration before and after every change
 **CA policies:** Use CAExporter (`vibecoding/CAExporter`) to export all CA policies to JSON. Store in client's repository. Run the export again at the close of the engagement and diff against the opening export — changes are documented, not assumed.
 **Intune:** The Graph API can export most Intune configuration; IntunePolicyParser assists with policy comprehension. Store the export.
 **Entra roles:** Capture the current role assignment list (who holds what role, eligibility vs activation) as a document. This is your before-state for any privileged access engagement.
 ---
 ### Detection: eight signals that matter more than eight hundred that don't
 Configure these eight before anything else. Each one represents a category of attack where silence is catastrophic:
 | Signal | Where to configure | Why it cannot be noise |
 |--------|-------------------|----------------------|
 | Break-glass account sign-in (any use at all) | Entra audit logs → alert rule or Sentinel | An account that should never sign in has signed in |
 | New Global Admin assigned | Entra audit logs, `Add member to role` for GA role | Shadow admin creation |
 | DCSync from non-DC host | Microsoft Defender for Identity or Sentinel | On-prem AD credential harvest in progress |
 | Impossible-travel sign-in for admin accounts | Entra ID Protection > User risk alerts | Account takeover in flight |
 | External auto-forward rule created | Exchange audit logs | BEC persistence being established |
 | Mass download from SharePoint/OneDrive | Defender for Cloud Apps or Purview | Exfiltration in progress |
 | New OAuth consent grant to high-privilege scope | Entra audit logs, `Consent to application` | Illicit app consent attack |
 | Privileged role activation outside business hours | PIM alerts | Credential use at suspicious time |
 Each of these should route to a named human who will respond within a defined SLA. Detection that fires into an unmonitored queue is theatre with a subscription cost.
 ---
 ### AD forest recovery: have the conversation
 Ask the client: "Has anyone on your team ever run an AD forest recovery — not in a training lab, on a real forest?" The answer is almost universally no.
 This is not a project you complete in an engagement — it is a finding and a recommendation. The finding: if AD is destroyed or corrupted (ransomware taking the DCs), recovery is a multi-day, expert-dependent process that nobody on this team has ever performed. The recommendation: run a tabletop of the procedure, identify the gaps in the runbook, and ensure the runbook is stored somewhere that survives the estate being dark (not in SharePoint, not in an AD-authenticated file share).
 The minimum viable runbook should cover: authoritative DC restore sequence, metadata cleanup, double KRBTGT reset, trust rebuilds, and how the Entra side reconnects when on-prem is back.
 ---
 ### Break-glass: test it, don't just create it
 Break-glass accounts exist in most tenants. They are tested in almost none. On every engagement:
 1. Does the break-glass account exist? (Cloud-only, `.onmicrosoft.com`, not synced)
 2. Is it phishing-resistant? (FIDO2 key or certificate — not push-approve)
 3. Is it excluded from the CA policy that would otherwise block it?
 4. Does its use trigger an immediate alert? (If yes, verify the alert fires during the test — not just that the alert rule exists)
 5. Where are the credentials? (Not in the client's normal password manager that requires the same identity to access)
 6. When was it last signed in to? (Credential should be proven functional — test it)
 The test is non-negotiable. An untested break-glass account is a belief, not a recovery path.
 ---
 ## What changed: 2025 → 2026
 | Area | Prior state | 2026 position |
 |------|------------|---------------|
 | AD FS | Roadmap item for most clients | P0 conversation — tooling mature, no excuse |
 | Entra Cloud Sync | "For simple topologies" | Recommended default for new deployments |
 | dMSA | Newly released, cautiously recommended | Hold — published escalation research; use gMSA |
 | EPM | Available, optional | Table stakes for zero-standing-admin on endpoints |
 | Windows Autopatch | Optional | Default recommendation for update ring discipline |
 | CA ghost policy | Edge case, occasionally found | Documented pattern — test every policy as standard |
 | M365 native backup | "Microsoft covers it" (wrong but common) | Third-party backup framed as baseline, not option |
 | PIM activation MFA | Often push-approve | Must be phishing-resistant to count |
 | Windows LAPS | New, replacing legacy LAPS | Deployed as standard; legacy LAPS is tech debt |
 ---
 ## The governing question — carry it into every session
 Before every finding, every recommendation, every conversation:
 > **If this is owned tonight, what is the largest thing an attacker reaches before hitting a wall — and can I draw that wall?**
 If the wall is missing or undrawn, you have found the work. Everything else is sequencing.
 ---
 *Field Guide for the Antifragile Handbook. Updated June 2026. Review and update January 2027 — the honest uncertainty sections of the books define what will change.*
@@ -0,0 +1,509 @@
 # Field Guide — Adversarial Validation
 > *"It's a nice compliance dashboard you have here."*
 **Last updated:** June 2026
 **Companion to:** [Field Guide — 2026 Edition](field-guide-2026.md) · Books I–VI
 **Engagement type:** Phase 2 — for clients who have done the foundational work
 **Checklist:** [Adversarial Validation Checklist](../assessment-templates/adversarial-validation-checklist.md)
 **Next review:** January 2027
 ---
 ## The premise
 The client has MFA. They have Conditional Access. They have Intune. They have a SIEM. Their CIS score is in the seventies or eighties. Their audit passed. The dashboard is green.
 This is the most dangerous estate to walk into — not because it is badly configured, but because everyone in the room believes it works. That belief is the fragility. Book I calls it directly: *"Green dashboards, untested reality — the most dangerous estate of all, because it feels safe."*
 The foundational field guide tells you how to build controls. This engagement is about finding out which of the client's existing controls are real and which are representations — configurations that *display* correctly but *enforce* nothing, backups that exist but have never been restored, detection that fires into a queue nobody reads, attack paths to Domain Admin that nobody has mapped because the BloodHound licence expired.
 **What you are doing in this engagement:** Systematically converting claimed security into observed security, domain by domain, and producing a structural change for every gap found. Not a pentest. Not a red team. A constructive adversarial validation — you are working with the client, with full authorization, with the explicit goal of finding what breaks before an attacker does.
 **What you are not doing:** Adding more controls. This engagement deliberately does not recommend new tooling or new policies. If a control exists and does not work, the finding is that the control does not work — not that a different control is needed. Via negativa applies here too: the fragility is almost always that the existing controls have too many exceptions, too little monitoring, and have never been tested.
 ---
 ## Before you start
 ### Authorization scope
 Before any test in this engagement, confirm written authorization covering:
 - Simulating attacks against identity (Kerberoasting, DCSync simulation, PIM bypass attempts)
 - Triggering security alerts deliberately (break-glass sign-in, impossible-travel simulation, fake consent grant)
 - Testing compliance controls on managed devices (rooting a test device, forcing a non-compliant state)
 - Attempting data exfiltration through DLP and labeling controls (on test data, to controlled test destinations)
 - Restoring from backup in a test environment
 Authorization is not "we told them verbally." It is a document signed by the named executive sponsor covering the scope of tests. Scope the authorization to the test accounts, test devices, and test data used — do not test on production privileged accounts or production data unless explicitly scoped.
 ### Baseline capture before anything changes
 On day one, before any test or change:
 1. Export all CA policies to JSON (CAExporter or Graph API). This is the declared state you will test against and the known-good you will compare the close-of-engagement state to.
 2. Run BloodHound and capture the full attack graph. The number of paths to Domain Admin at T+0 is your opening metric.
 3. Pull the Entra role assignment list — who holds what role, eligible vs. active.
 4. Pull the service principal inventory with their Graph permissions.
 5. Export Intune compliance and configuration policy assignments.
 6. Run `Get-ADUser krbtgt -Properties PasswordLastSet`, `Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet`, and document both.
 7. Count sign-in log distinct device IDs for the last 30 days. Compare to Intune enrolled device count. Record the gap.
 These numbers are your before-state. Every structural change produced by this engagement is measured against them.
 ### The opening conversation
 This engagement starts with a single question asked out loud, to the most senior technical person in the room:
 > *"Can you show me one control in this estate that you are certain works — not because the portal says so, but because you have watched it fire under real conditions?"*
 The answer tells you everything. A person who can point to a specific tested control on a specific date has a security programme. A person who gestures at the dashboard has a compliance programme. Both deserve good consulting — but they need different things.
 ---
 ## 1. Identity — proving the wall is real
 ### The firebreak claim
 The client almost certainly claims that cloud privilege is separated from on-prem compromise. Test the claim, don't accept it.
 **Draw the full graph, out loud:**
 Starting from Domain Admin (or a simulated compromise of the sync server), trace every path that reaches a cloud privileged role:
 - Are any GAs synced from on-prem? (They claim no — verify.)
 - Can the sync server connector account be used to tamper with cloud objects?
 - Do any admins use the same device for Tier 0 and cloud admin work?
 - Is there a PTA agent that could be compromised to intercept credentials?
 - Does any MFA for cloud admin rely on an authenticator app on a device that is also used for email? (The MFA device is Tier 2. The admin role is cloud Tier 0. That is a tier violation across the MFA layer.)
 **Verify cloud-only GAs are actually cloud-only:**
 ```powershell
 $gaRoleId = (Get-MgDirectoryRole -Filter "displayName eq 'Global Administrator'").Id
 Get-MgDirectoryRoleMember -DirectoryRoleId $gaRoleId |
  Select-Object @{N='UPN';E={$_.AdditionalProperties['userPrincipalName']}},
                @{N='OnPremSyncEnabled';E={$_.AdditionalProperties['onPremisesSyncEnabled']}}
 ```
 `onPremisesSyncEnabled: true` on any GA is a P0 finding. "We moved them to cloud-only" is the claim; this is the verification.
 **Test the break-glass is actually independent:**
 With the client present: sign in to the break-glass account. Does it succeed? Does an alert fire? Does the person named as the responder to that alert actually receive it and acknowledge it within the agreed SLA? An alert rule that exists but routes to an unmonitored inbox is a ghost detection.
 ### AD FS: is the token-signing key actually monitored?
 If AD FS is still running (and in a "mature" estate it often is, "migration is on the roadmap"):
 ```powershell
 Get-AdfsCertificate -CertificateType Token-Signing |
  Select-Object Thumbprint, NotAfter, @{N='DaysSinceRotation';E={(Get-Date) - $_.Certificate.NotBefore | Select-Object -ExpandProperty Days}}
 ```
 Then ask: if an attacker obtained the private key for this certificate right now, what would you see in your logs? Walk through the scenario. In almost every case the honest answer is "nothing — a Golden SAML token is indistinguishable from a legitimate one." That is the finding. The migration is no longer a roadmap item.
 ### PIM: test the activation path, not the configuration
 The client has PIM. But:
 - **What MFA method is required on activation?** Navigate to PIM > Settings for Global Administrator role > Require MFA on activation. Then confirm the MFA method registered for each eligible GA. Push-approve MFA + PIM activation = phishable PIM. The control is not what it appears.
 - **Test an activation:** Have a test user with an eligible GA role activate it. Time the process. Observe: does the approval notification reach the approver? Does the approver know what they are approving, or does it arrive as a blind "approve this"? An approval workflow where approvers routinely click approve without context is not an approval workflow.
 - **Check for standing GA assignments that are supposed to be eligible-only.** `Get-MgDirectoryRoleMember` for GA — any user with no corresponding PIM eligible assignment has a permanent standing assignment that exists outside PIM, whether intentionally or by configuration drift.
 - **Check the maximum activation time box.** 24-hour activation windows are common in "we have PIM" deployments. An activation window that covers an entire working day is functionally standing privilege during business hours.
 ### The connector account as a canary
 Reconfigure: any sign-in by the Entra connector account (Directory Synchronization Accounts role) from any host other than the sync server should fire an alert. Then test it: simulate a sign-in from an unexpected host. Does the alert fire? Does someone respond?
 If the answer is "we have an alert rule," test it. "We have an alert rule" is a declaration. A firing alert reaching a responding human is an observation. The handbook's hardest rule applies here: verify by observation, never by inspection.
 ---
 ## 2. Privilege — attack paths the client has not mapped
 ### BloodHound as a metric, not a one-time scan
 The client's mature estate almost certainly has attack paths to Domain Admin that nobody has counted since the last pentest, if ever. Run BloodHound, capture the full graph, and count:
 - **Total paths to Domain Admin** (all principals)
 - **Paths reachable from standard user compromise** (the realistic starting point for a phishing attack)
 - **Paths involving Kerberoastable service accounts** specifically
 - **Paths involving ADCS** (add `-CollectionMethod ACL,ObjectProps,Trusts` to catch certificate-based escalation)
 Present the number. Do not present it as "you have X findings." Present it as: *"From a single compromised standard user account, there are N independent routes to Domain Admin. Each route is a path through controls the attacker does not need to break because they route around them."* Then pick the three shortest paths and show them concretely.
 This number is now a tracked metric. The engagement is not complete until it is going down.
 ### Kerberoast it — don't ask if it's possible
 Run the attack:
 ```powershell
 # Using Rubeus or Invoke-Kerberoast in an authorized test context
 Invoke-Kerberoast -OutputFormat Hashcat | Out-File kerberoast_hashes.txt
 ```
 The question is not "are there Kerberoastable accounts" (there are) — the question is: **did anything detect it?** A Kerberoast produces distinctive TGS request patterns. If Defender for Identity, Microsoft Sentinel, or any SIEM is watching, it should alert. If it does not, you have found a detection gap more important than the accounts themselves.
 Then attempt to crack the hashes offline (with explicit authorization, on a controlled device). Report which accounts crack and in what time. Most clients are surprised. The service account from 2019 with the password that was "rotated" to `ServiceAcc0unt!2019` cracks in minutes.
 ### ADCS: the forgotten Tier 0 target
 Run a basic ESC vulnerability enumeration:
 ```
 certipy find -u <test-account>@domain.com -p <password> -dc-ip <DC-IP> -stdout
 ```
 Or Certify if a Windows test host is more convenient:
 ```
 Certify.exe find /vulnerable
 ```
 In a mature estate, the ADCS server has been running for years, was configured for a specific purpose in 2018, and has never been audited against the ESC series. ESC1 (supply subject in request + broad enrollment rights) in particular is common and catastrophic — it allows any enrolled user to obtain a certificate for any principal, including Domain Admins. Find it, show the exploit path, and document that the ADCS server is being treated as Tier 1 when it is Tier 0.
 ### Service principal dark matter
 The client's mature estate has app registrations. Some of them have permissions that were granted for a reason that nobody in the room can explain. Find the escalation-grade ones:
 ```powershell
 # Application permissions (not delegated — these run without a user)
 $dangerousPermissions = @(
  "9e3f62cf-ca93-4989-b6ce-bf83c28f9fe8", # RoleManagement.ReadWrite.Directory
  "06b708a9-e830-4db3-a914-8e69da51d44f", # AppRoleAssignment.ReadWrite.All
  "1bfefb4e-e0b5-418b-a88f-73c46d2cc8e9", # Application.ReadWrite.All
  "19dbc75e-c2e2-444c-a770-ec69d8559fc7"  # Directory.ReadWrite.All
 )
 Get-MgServicePrincipal -All | ForEach-Object {
  $sp = $_
  Get-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $sp.Id |
    Where-Object { $_.AppRoleId -in $dangerousPermissions } |
    ForEach-Object {
      [PSCustomObject]@{
        ServicePrincipal = $sp.DisplayName
        Permission       = $_.AppRoleId
        GrantedDate      = $_.CreatedDateTime
      }
    }
 } | Sort-Object GrantedDate
 ```
 For each result: ask the room who created this app registration, what it does, and whether the permission is still needed. The answer to all three is usually "I don't know." That is the finding.
 Then go further: check which of these service principals have non-expiring client secrets and which have never been used (check the sign-in logs for the service principal's `lastSignInDateTime`). A service principal that has not authenticated in 180 days with a never-expiring secret holding escalation-grade Graph permissions is a standing credential an attacker can use indefinitely without triggering a human sign-in.
 ### Standing privilege check: the PIM compliance gap
 Ask for the full current list of active (not eligible) privileged role assignments. For each one:
 - Is it a break-glass account? If not, it should not be standing.
 - Is it a service account that cannot use PIM? Document and scope the managed-identity migration.
 - Is it an account someone added "temporarily" and forgot?
 In most mature tenants, the list of active non-break-glass assignments is longer than anyone expects, because PIM was deployed and the existing standing assignments were not cleaned up at the time.
 ---
 ## 3. Devices — the compliance signal gap
 ### The ghost CA policy protocol
 Apply this to every CA policy the client considers important (not every policy — prioritize the ones that block legacy auth, enforce device compliance, and gate privileged sign-in):
 **Before testing any policy:**
 Write down the expected outcome: *"User [X], device [Y], from location [Z], accessing [App] → MUST be [blocked / MFA-prompted / compliant-device-required]."* Write this before looking at the policy configuration. This prevents rationalizing whatever you observe.
 **The tests to run:**
 1. **Legacy auth block:** Use a mail client that supports Basic Auth (older Outlook, curl with basic auth headers to Exchange Online) from a test account. Expected: blocked. If it succeeds, the CA policy that blocks legacy auth either has an exclusion, is in report-only, or is a ghost.
 2. **Compliant device gate:** Sign in from a device that is known to be non-compliant (a personal device, or a managed device you have taken out of compliance by disabling BitLocker or removing an agent). Expected: blocked from sensitive workloads. If access is granted, either the CA policy is not evaluating correctly or the compliance signal is stale.
 3. **Admin sign-in from non-PAW:** Attempt to activate a PIM role from a standard workstation or a personal device. Expected: blocked if there is a CA policy restricting admin access to compliant or named devices. If it succeeds, the PAW policy is a claim.
 4. **The ghost test:** If any policy above fails to enforce despite its configuration appearing correct — recreate the policy from scratch with identical parameters. Re-test. If the recreated policy enforces and the original did not, you have found a ghost policy. Document the specific policy name, the discrepancy, the recreation, and the re-test result.
 **Important:** Do not re-edit a failing policy to fix it. Recreate it. A ghost policy carries its corruption forward through edits.
 ### Compliance signal spoofing: measure the lag
 Take a test enrolled device (a managed device you have authorization to modify):
 1. Root/jailbreak it, or manually induce a non-compliant state (disable encryption, disable the screen lock, install a prohibited app — whatever the compliance policy checks).
 2. Record the timestamp.
 3. Watch Intune and Entra ID: when does the compliance state flip to non-compliant?
 4. When does Conditional Access revoke the session token?
 5. Is Continuous Access Evaluation (CAE) in place for the workloads that matter? If yes, token revocation should be near-real-time for supported apps. If no, the window is bounded by the token lifetime.
 The gap between step 2 and step 4 is the attacker's window after compromising a compliant device. Present it in minutes, not as "the token may be stale." Most clients have never measured it.
 ### Reconcile the real fleet
 Pull four numbers and compare them:
 | Source | Count |
 |--------|-------|
 | Intune managed devices | |
 | Entra registered/joined devices | |
 | Distinct device IDs in sign-in logs (last 30 days) | |
 | Distinct device IDs signing in with "Device compliant: No" or "Device managed: No" |  |
 The gap between row 1+2 and row 3 is the shadow population. The number in row 4 is the unmanaged population actively accessing data. Neither of these are hypothetical risks — they are current, observable facts about who is accessing the tenant right now.
 For every device in row 4: what data can it reach, and what Conditional Access policy, if any, applies to it?
 ### Legacy auth: find the surviving flows
 Even with a "block legacy auth" CA policy in place, find the exceptions:
 ```
 Sign-in logs → Add filter → Client App → select all non-modern entries:
  Exchange ActiveSync
  Exchange Online PowerShell
  Exchange Web Services
  IMAP4
  MAPI Over HTTP
  Other clients
  POP3
  Reporting Web Services
  SMTP
 ```
 Export the results. Every entry is a legacy auth flow that either bypasses the CA policy (via an exclusion you should examine) or is a service account using a protocol that will break when the exclusion is removed. Build the map. The goal is zero — but the path to zero requires knowing what is currently there.
 ---
 ## 4. Data — does protection actually travel
 ### Exfiltrate a labelled document
 With authorization, take a test document labelled at the highest sensitivity tier available (Highly Confidential, or equivalent):
 1. Forward it as an email attachment to a personal test email address outside the tenant. Does DLP intercept it? Does the label encryption hold on the received document?
 2. Download it to an unmanaged device (one that is not Intune-enrolled). Open it. Does encryption require authentication to the tenant?
 3. Share it via an anonymous "Anyone with the link" URL (if anonymous sharing is still permitted). Access the link from a browser with no tenant authentication. Does it open?
 4. Copy and paste the content from the document into an unmanaged app (on a device where the MAM boundary applies). Does the block work?
 5. Open it in a browser through Conditional Access App Control session policy. Attempt to download. Does the block work?
 Document which paths hold and which do not. The ones that do not hold are the exfiltration routes an attacker (or a careless employee) will actually use. Every failed block is a finding; the label configuration that passed in the policy screen is the ghost, and the exfiltrated file is the fact.
 ### Enumerate the anonymous link population
 The tenant sharing setting may say "restricted." That setting controls new links. It does not remove existing ones. Run:
 ```powershell
 # PnP PowerShell — requires SiteCollection Admin on each site
 Get-PnPTenantSite | ForEach-Object {
  Connect-PnPOnline -Url $_.Url -Interactive
  Get-PnPSharingLinks | Where-Object { $_.SharingLinkType -eq "Anonymous" }
 } | Export-Csv anonymous_links.csv
 ```
 Present the count. In mature tenants, the anonymous link population predates the current tenant sharing settings by years. The setting was changed; the links were not revoked. Every entry is an active bearer token for data that predates the restriction.
 ### The BEC forward rule: simulate it
 With a test account (not an executive, not a privileged account):
 1. Create an Inbox rule forwarding all email to an external test address you control.
 2. Wait to see whether anything detects it and when.
 3. Check whether the global block on external auto-forwarding (`Get-RemoteDomain Default | Select-Object AutoForwardEnabled`) actually blocks this test rule from executing.
 4. Confirm: does the transport rule block the forwarding, or does the block only apply to Outlook/OWA auto-forwarding (not to manually-created Inbox rules)?
 There is a documented distinction: the transport-level `AutoForwardEnabled: false` on Remote Domains blocks transport-rule-level forwarding and OWA Auto-Reply forwarding, but Inbox rules created in Outlook/OWA by the user may still forward depending on the specific configuration. Test this on the client's environment. Do not assume.
 ### Crown jewel access review
 For the data sets the client has identified as crown jewels (if they have not identified them, that is the first finding — go back to basic engagement):
 1. Pull the access list for the crown-jewel SharePoint sites and OneDrive locations.
 2. Pull the audit log for access events on those locations over the last 30 days.
 3. Identify: who accessed them, how frequently, from what devices?
 4. Find: any access from unmanaged devices. Any access from accounts that should not have visibility. Any bulk download events.
 5. Specifically check for guest access to the crown-jewel locations — guests whose project has concluded but whose access persists.
 The audit log review is also a test of the audit infrastructure: can you produce a coherent forensic reconstruction of who accessed what, when, from where, over the last 30 days? If the answer is "we would need to run several different reports and correlate them manually," that is an incident response readiness finding.
 ---
 ## 5. Detection — does it fire, does anyone act
 This section is the difference between robustness and antifragility. Everything before this is about whether controls hold. This section is about whether the organization learns when they do not.
 ### The eight simulations
 For each of these, run the simulation with authorization, observe the outcome, and measure the time from event to human acknowledgment. The SLA the client believes they have is the declared state. The measured time is the observed state.
 **Simulation 1 — Break-glass sign-in:**
 Sign in to the break-glass Global Admin account. This should trigger an immediate, high-priority alert routed to a named responder. Measure: how long from sign-in to human acknowledgment? If the answer is longer than 15 minutes, the break-glass is not monitored at the level it needs to be.
 **Simulation 2 — New Global Admin assigned:**
 Assign GA to a test account. Observe: does an alert fire in Microsoft Sentinel, Microsoft Defender, or the configured SIEM? Who receives it? When? Revoke the assignment after the test.
 **Simulation 3 — DCSync simulation:**
 From a non-DC host with a test account that has the relevant permissions (or using Mimikatz in an authorized test context), simulate a DCSync operation. Defender for Identity should alert on `Directory Services Replication`. Does it? Does the alert reach a human? Most mature clients have DfI deployed; fewer have confirmed the specific alert fires and routes correctly.
 **Simulation 4 — Kerberoasting (detection, not just the attack):**
 Run the Kerberoast from section 2 again, now with the explicit goal of measuring detection. Did the TGS request pattern generate an alert? The attack was run earlier to find the vulnerable accounts; run it again now to find the detection gap.
 **Simulation 5 — Impossible travel for an admin account:**
 Using a VPN exit node or a cloud VM in a geographically distant region, sign in as a test user who recently signed in from the client's location. Entra ID Protection should flag this as a risky sign-in. Does the user risk policy elevate the risk? Does a CA policy enforce remediation (MFA challenge or block)? Does an alert fire to the SOC? For admin accounts specifically, this should be a high-priority signal.
 **Simulation 6 — External auto-forward rule:**
 From the data section — did anything alert when the test Inbox rule was created? If no detection fired during that test, that is a finding: BEC persistence can be established without triggering a single alert.
 **Simulation 7 — Mass download from SharePoint:**
 With a test account that has access to a document library, download 50+ files in rapid succession. Does Defender for Cloud Apps or Microsoft Purview generate an unusual-download alert? Does anything block or throttle it?
 **Simulation 8 — OAuth consent grant:**
 Register a test app requesting `Mail.Read` and `Files.ReadWrite.All` permissions. Grant it on behalf of a test user (simulating a user who clicks "Accept" on a consent prompt). Does anything alert on the grant event? Is user consent for this class of permission blocked by policy, or can users grant it freely?
 ### Alert fatigue: measure it honestly
 Pull the alert volume from the last 30 days (from Sentinel, Defender XDR, or wherever alerts are collected). Calculate:
 - Total alerts generated
 - Alerts closed as "true positive" with a documented response
 - Alerts closed as "false positive"
 - Alerts that have sat open for more than 48 hours
 - Alerts that were suppressed or auto-closed without human review
 The ratio of responded-to versus everything else is the real detection efficacy rate. Most mature clients discover that their effective detection rate is single-digit percentages of generated alerts. Present the number; it is a more honest metric than "we have Sentinel."
 ### The structural change test
 Pull the last five security incidents or alerts that resulted in a closed ticket. For each:
 - What was the incident?
 - What was the response?
 - What structural change resulted — what was removed, severed, restricted, or reconfigured because of this incident?
 If the answer to the third question is "we sent a reminder," "we noted it in the risk register," or "we trained the affected user" — the feedback loop is broken. Pain that closes a ticket without changing the architecture is wasted pain. Present the count of structural changes from the last five incidents. If it is zero, that is the most important finding in the report.
 ---
 ## 6. Recovery — is the exit ramp real
 ### Restore something
 Before the engagement closes, restore a real dataset from backup. Not a test restore of a test file — a production dataset (authorized, scoped, non-disruptive) or the clearest approximation the client can authorize.
 Time it. Record the actual MTTR. Compare it to the RTO written in the policy document.
 If the actual MTTR is longer than the policy MTTR, the policy is fiction. Present the observed time as the finding. The goal is not to shame the recovery team — it is to replace a comfortable fiction with a useful truth.
 **For M365 specifically:** Restore a mailbox or a SharePoint document library item from the third-party backup (if one exists). If no third-party backup exists in a mature estate, that is a P0 — it means the client has delegated recovery to Microsoft's recycle bin, which is not a backup posture.
 ### AD forest recovery readiness
 Ask the client to produce their AD forest recovery runbook. Three things to verify:
 1. **Is the runbook stored where it can be accessed when AD is down?** Not in SharePoint. Not in an AD-authenticated file share. Not in a password manager that authenticates against the domain. Paper, or a system outside the recovery domain, or both.
 2. **Has anyone ever run the procedure?** Not a tabletop — an actual restore, even in a lab. The first time you perform AD forest recovery must not be during the real disaster.
 3. **Does the runbook account for the double-KRBTGT rotation, metadata cleanup, and trust resets?** If it says "restore the DC from backup and you're done," it is incomplete.
 If the answer to question 2 is no, scope a recovery rehearsal. This is the finding: the organization is one ransomware incident away from performing the hardest IT operation in existence for the first time, under maximum pressure, with incomplete runbooks.
 ### Configuration drift from the known-good
 Compare the CA policy export from the beginning of this engagement against the current state. In any mature estate where CA policies are managed by multiple people without change control, there will be differences. For each difference:
 - Was it intentional? Is there a change record?
 - Does the difference make the policy more or less restrictive?
 - If a policy was modified by someone without change authorization, how long ago and how would it have been detected?
 The absence of a known-good baseline means the client cannot answer these questions. The presence of a known-good baseline and a diff is the beginning of drift detection. If the diff reveals changes made outside the change window or without documentation, that is a control failure independent of whether the change was malicious.
 ---
 ## The close
 ### What changes structurally
 At the end of this engagement, for every finding that was verified by observation (not just inspected), produce a specific structural change:
 | Finding type | Structural change target |
 |---|---|
 | Ghost CA policy found | Policy recreated, re-tested, documented |
 | PIM activation MFA is push-approve | Migration to phishing-resistant MFA scoped |
 | Kerberoasting not detected | Detection rule created, tested end-to-end |
 | Standing GA outside PIM | Account removed from role; break-glass confirmed working |
 | Anonymous links not revoked | Links enumerated and revoked; expiration policy applied |
 | BEC rule creation not detected | Exchange alert configured, tested |
 | Alert queue not triaged | Alert owner named, SLA defined, volume reduced |
 | Backup MTTR exceeds policy | Policy updated to observed time; rehearsal scheduled |
 The engagement deliverable is not the report. The deliverable is the list of structural changes, plus the metrics: BloodHound path count before and after, standing privilege account count before and after, confirmed-working detection count, and measured MTTR.
 ### Metrics to deliver at close
 | Metric | Before | After |
 |--------|--------|-------|
 | BloodHound paths to Domain Admin (from standard user) | | |
 | Standing (non-break-glass) Global Admin count | | |
 | Standing (non-break-glass) Domain Admin count | | |
 | CA policies verified to enforce by observation | | |
 | Detection signals tested end-to-end and confirmed working | | |
 | Anonymous link count (existing) | | |
 | Unmanaged devices in sign-in logs (% of total) | | |
 | Actual MTTR from backup restore drill | | |
 | Structural changes from last 5 incidents (before) | | |
 These numbers are the honest alternative to a compliance score. None of them can be faked by clicking a toggle. All of them represent something an attacker either can or cannot do.
 ---
 ## 7. The leave-behind
 The engagement ends. The admin has to operate the estate alone until the next engagement. This section is what you set up during the engagement so they can do that.
 ### The self-service cadence document
 Every adversarial validation engagement closes with a filled-in [Self-Service Cadence](../assessment-templates/self-service-cadence.md) document, customized for the client. The template becomes their recurring runbook — monthly portal checks, quarterly tool runs, and a clear list of "call us if you see this" triggers.
 Spend the last session of the engagement walking through the document with the named admin. Run the first quarterly check together, with them driving. The goal is not to hand over a PDF — it is to verify they can execute it without you in the room.
 ### Tools to leave installed and working
 Before you leave, confirm these are installed and the admin has run each at least once:
 | Tool | Confirm working | Leave-behind |
 |------|----------------|--------------|
 | PingCastle | Run a healthcheck scan, admin can read the output | HTML report from today as the baseline |
 | Purple Knight | Run a full scan, admin can read the indicators | PDF report from today as the baseline |
 | CAExporter | Exported today's CA policies, stored in agreed location | JSON files from today as the known-good |
 | Graph PowerShell module | Admin can connect and run the scripts in the cadence document | Scripts saved to the agreed local path |
 | PnP PowerShell | Admin can connect to SharePoint admin and run the anonymous link export | Confirmed connected during the session |
 Do not leave a tool installed that the admin has never run. An unfamiliar tool is not a capability — it is a task that will not get done.
 ### The baseline numbers
 At close of engagement, record the opening and closing metrics in the tracking spreadsheet you set up with the admin. These are the numbers their quarterly PingCastle and Purple Knight runs will be compared against. Without a baseline, a quarterly scan is a point in time with no direction — with a baseline, it tells a story.
 | Metric | Value at close of engagement |
 |--------|------------------------------|
 | PingCastle score | |
 | Purple Knight: Critical indicators | |
 | BloodHound paths to DA (standard user) | |
 | Standing GA count (non-break-glass) | |
 | Anonymous link count | |
 | Stale guest count (90+ days inactive) | |
 | CA policies verified to enforce | |
 | Detection signals confirmed working | |
 ### "Call us" triggers — agree them explicitly
 From the [cadence document](../assessment-templates/self-service-cadence.md), go through the trigger list out loud with the admin and confirm they understand each one. The list exists so they do not have to judge whether something is important enough to contact you — the bar is already defined.
 The most important part of this conversation: *"When in doubt, contact us. We would rather look at a false alarm than hear about a real incident that sat for two weeks because you were not sure if it was worth mentioning."*
 ---
 ## What this engagement is not
 **Not a red team.** The client knows you are here. You are working with them, not against them. When a simulation fires an alert, you tell the responder it is a test. The goal is to calibrate the detection, not to prove that you can evade it.
 **Not a vulnerability scan.** You are not looking for unpatched CVEs or misconfigured services in bulk. You are validating the specific controls the client believes are in place.
 **Not a compliance audit.** You will not produce a CIS score or a NIST gap report at the end. You will produce a list of controls that work and a list of controls that do not, measured by observation, with structural changes attached to the ones that do not.
 **Not additive.** You are not recommending new tools, new policies, or new products. If something does not work, the fix is almost always to remove the exception, test the existing control, or eliminate the coupling — not to add a compensating control on top of the broken one.
 ---
 *Field Guide — Adversarial Validation. Updated June 2026. Review alongside the main field guide — January 2027.*
@@ -75,7 +75,7 @@ Jsme malá, specializovaná praxe. Neprovozujeme 24/7 operační centrum. Nepode
 **6. [PLACEHOLDER: Vaše šestá diferenciace]**
-> **INTERNÍ POZNÁMKA** — Přidejte diferenciaci specifickou pro vaši praxi. Příklady: hluboká odbornost v konkrétním odvětví (OT/energie, české regulatorní prostředí); proprietární nástroje (ASTRAL, AOC, Elysium); jazykové schopnosti; specifické certifikace; metodologický přístup.
+> **INTERNÍ POZNÁMKA** — Přidejte diferenciaci specifickou pro vaši praxi. Příklady: hluboká odbornost v konkrétním odvětví (OT/energie, české regulatorní prostředí); proprietární nástroje (ASTRAL, PULSAR, Elysium); jazykové schopnosti; specifické certifikace; metodologický přístup.
 [PLACEHOLDER: konkrétní diferenciace s jedním konkrétním příkladem nebo důkazem]
@@ -1,5 +1,11 @@
 # About CQRE · Brownhat
 > ⚠️ **TEMPLATE — NOT READY TO SHARE** ⚠️
 >
 > This document contains unfilled `[PLACEHOLDER]` sections. **Do not share this file with clients or external contacts until every placeholder has been replaced with real content and all INTERNAL NOTE sections have been removed.**
 >
 > To check: `grep -r "\[PLACEHOLDER\]" about-cqre.md` should return no results before this file leaves the repository.
 >
 > *This document introduces CQRE and the Brownhat methodology to new clients and new team members. Fill every `[PLACEHOLDER]` section with specific, honest information. Avoid generic consulting language — clients can tell. The sections marked **INTERNAL NOTE** contain guidance for completing the template; remove them before sharing externally.*
 >
 > *A Czech-language version of this document is maintained at [about-cqre-cs.md](about-cqre-cs.md).*
@@ -75,7 +81,7 @@ We are a small, specialist practice. We do not run a 24/7 SOC. We do not sign of
 **6. [PLACEHOLDER: Your sixth differentiator]**
-> **INTERNAL NOTE** — Add a differentiator specific to your practice. Examples: deep expertise in a specific vertical (OT/utilities, Czech regulatory environment); proprietary tools (ASTRAL, AOC, Elysium); language capability; specific certifications; methodology approach.
+> **INTERNAL NOTE** — Add a differentiator specific to your practice. Examples: deep expertise in a specific vertical (OT/utilities, Czech regulatory environment); proprietary tools (ASTRAL, PULSAR, Elysium); language capability; specific certifications; methodology approach.
 [PLACEHOLDER: specific differentiator with one concrete example or proof point]
@@ -1,242 +1,260 @@
 # AI Sovereignty Framework
-> *"The cloud model is smarter at everything, which makes it dumb at your specific thing."*
+> *"The question is not whether you use cloud AI. The question is whether you have the right to stop."*
 ## For the Executive Reader
-Your organization is currently engaged in a **massive, unpaid research project for cloud AI providers**. Every proprietary document, every strategic query, every operational workflow sent to a third-party AI becomes training data for models that will eventually be sold to your competitors.
+Most organisations treat AI as a utility — like electricity. But electricity suppliers cannot change your contract mid-production, cannot decide your use case violates their acceptable-use policy, and cannot be subpoenaed for the data you ran through them. Cloud AI providers can do all three.
-AI sovereignty is not an IT project. It is a **strategic asset protection mandate**. By running artificial intelligence on infrastructure you control, you:
+AI sovereignty is not a refusal to use cloud AI. It is a demand to **control your dependency on it** — to own the option to change direction, to retain audit rights over systems that touch your regulated data, and to maintain operational continuity when a vendor's priorities diverge from yours.
- **Stop funding your competitors** through proprietary data leakage
+By managing AI infrastructure with the same rigour you apply to any critical vendor relationship, you:
 - **Eliminate vendor lock-in** for your organization's cognitive infrastructure
 - **Reduce long-term costs** from unpredictable per-query pricing to fixed capital
 - **Demonstrate regulatory maturity** on data residency and third-party risk
-**The economic argument**: A mid-sized organization spending €5,000-€15,000 monthly on cloud AI APIs will break even on local infrastructure within 12-18 months. After break-even, the cost is a fraction of cloud pricing—and the data remains exclusively yours.
+- **Satisfy data residency requirements** increasingly mandated by NIS2, DORA, GDPR, and sector-specific regulators — without hoping the vendor's data processing addendum is legally sufficient
 - **Retain audit rights** over inference decisions that touch regulated data, sensitive operations, or customer information
 - **Protect operational continuity** from vendor pricing changes, API deprecations, acceptable-use updates, and geopolitical events outside your control
 - **Build intelligence that compounds** — a fine-tuned model trained on your data gets better at your specific workflows, not at everyone's generic tasks
-**The competitive argument**: A fine-tuned local model trained on your proprietary data will outperform a general cloud model on your specific workflows. The cloud model improves at everyone's tasks. Your local model improves at only your tasks. That is sustainable differentiation.
+**The economic argument**: At meaningful scale, cloud AI inference is priced to grow with your usage. Fixed-cost inference infrastructure — local models, private cloud, or auditable sovereign cloud — produces predictable economics. Organisations spending €5,000–€15,000 monthly on cloud AI APIs typically reach break-even within 12–18 months.
 **The competitive argument**: A model fine-tuned on your proprietary data outperforms a general cloud model on your specific workflows. The cloud model improves at everyone's tasks. Your local model improves at your tasks alone. That gap is sustainable differentiation the vendor cannot replicate without access to your data.
 *For board conversation scripts, see [C-Suite Conversation Guide](c-suite-conversation-guide.md).*
 *For financial justification, see [Business Case Template](../playbooks/business-case-template.md).*
 *For the distinction between optional business AI and inevitable operational AI, see [AI Operations Inevitability](ai-operations-inevitability.md).*
 ---
 ## For the Practitioner
-This framework provides the strategic, technical, and ethical arguments for treating artificial intelligence as **sovereign infrastructure** rather than rented utility. It is designed for consultants and architects who must persuade boards, CISOs, and engineering leaders to invest in locally controlled intelligence.
+This framework provides the strategic, technical, and regulatory arguments for treating artificial intelligence as **sovereign infrastructure** rather than a rented utility. It is designed for consultants and architects who must persuade boards, CISOs, and engineering leaders to invest in locally controlled or auditable intelligence — and who need arguments that survive pushback from technically literate audiences.
---
+> **Critical framing note**: Avoid leading with "your prompts are training competitors." Most enterprise AI agreements (Azure OpenAI, Google Workspace AI, Microsoft Copilot) explicitly prohibit training on customer data, and technically literate clients will push back immediately if you lead here. The stronger, more durable arguments are regulatory compliance, audit rights, and operational continuity. Start there. The "training data" counter-argument is documented below as a secondary concern where genuinely applicable.
 ## Executive Summary
 Most organizations are currently engaged in a **massive, unpaid R&D project for cloud AI providers**. Every proprietary prompt, every internal document fed into a third-party model, every workflow built on an external API is a transfer of intellectual capital to an entity whose interests are not aligned with the organization's survival.
 AI sovereignty reverses this extraction. It restores the boundary of trust. It converts intelligence from a rented commodity into an owned asset.
 ---
 ## The Five Strategic Arguments
-### 1. The Data Sovereignty Argument (The Trojan Horse)
+### 1. The Regulatory Compliance Argument (The Mandatory Case)
 **The Problem**
-When proprietary data is sent to cloud AI providers, it does not merely get "processed." It becomes part of a feedback loop that improves general models—models that will eventually be sold to competitors, used to commoditize the client's industry, or deployed to replicate the client's unique edge.
+EU regulatory frameworks have made data residency and audit rights over AI inference a legal requirement — not a preference — for a growing proportion of organisations.
-Every query is a lesson. Every document is a training sample. The client is not a customer; they are an **uncompensated research contributor**.
+- **NIS2 (Article 21)**: Essential and important entities must demonstrate control over ICT systems that process sensitive operational data. "We use Azure OpenAI and trust Microsoft's DPA" is increasingly insufficient as supervisory authorities in Germany, France, and the Netherlands begin active audits.
 - **DORA (Article 28–30)**: Financial entities must conduct ICT third-party risk assessments and maintain contractual controls over critical AI providers. AI used in ICT processes covered by DORA constitutes a critical ICT third-party service.
 - **GDPR Article 28**: Data processed by cloud AI on personal data requires a Data Processing Addendum that satisfies Article 28 requirements. Many organisations are running AI workflows over personal data with no DPA in place.
 - **GDPR Article 44–49**: Transfers of personal data to AI providers with servers outside the EEA require adequate safeguards. US-based AI providers fall under this constraint regardless of their European data centre commitments.
 - **Sector-specific**: Healthcare organisations (special category data under GDPR Article 9), financial entities (BaFin, ACPR, DNB supervision), and public sector bodies face additional layered requirements that generic cloud AI agreements frequently cannot satisfy without supplementary controls.
 **The Pitch**
-> *"By sending our internal data to the cloud, we are effectively training the very system that will eventually commoditize our industry and replace our proprietary edge. We are not just 'using' AI; we are contributing our secrets to the public model."*
+> *"Every AI workflow that processes personal data, strategic operational data, or regulated information is a data processing activity. The question is not whether we trust the provider — it is whether the processing arrangement is legally compliant and auditable. For regulated organisations, 'we assume they comply' is not an acceptable control."*
 **The Antifragile Move**
-Running local models creates a **closed intellectual loop**. The organization's data remains an asset, not a training set for a competitor. It creates a moat that cloud giants cannot cross because they never receive the raw material to replicate it.
+Sovereign or self-hosted AI eliminates the compliance gap entirely. The data never leaves the organisation's controlled environment. There is no sub-processor DPA to maintain, no cross-border transfer to justify, and no audit finding waiting to happen.
-**Key Points for the Room**
+Where full local deployment is not immediately feasible, the stepping-stone is **auditable sovereign cloud** — EU-hosted AI infrastructure with contractually guaranteed data isolation and documented audit rights. Azure OpenAI in EU regions with a complete Microsoft Data Processing Addendum is a defensible starting point; it is not the end state.
 - Cloud AI providers are incentivized to aggregate and generalize. You are incentivized to differentiate and protect.
 - What you consider proprietary operational data, they consider valuable training signal.
 - A local model trained on your data becomes *better* at your workflows over time. A cloud model becomes *better at everyone's workflows*, diluting your advantage.
 ---
-### 2. The Operational Resilience Argument (The "Pulling the Plug" Scenario)
+### 2. The Audit Rights Argument (The Control Gap)
 **The Problem**
-Cloud AI is a dependency with no service-level guarantee of continuity. Terms of service change. Pricing changes. API versions are deprecated. Geopolitical events disable access. "Safety" filters are updated to censor specific industries or use cases. The organization's core operations are, in effect, an application running on someone else's brain.
+An organisation using cloud AI cannot independently verify:
 - What data is retained after inference
 - Whether inference logs are accessible in an incident investigation
 - Whether model outputs are deterministic (same input, same output) across versions
 - What the model's reasoning was for a specific output — relevant when AI-assisted decisions are challenged by regulators or courts
 - Whether the model's behaviour has changed between versions in ways that affect your workflows
 These are not theoretical concerns. DORA explicitly requires that ICT systems be auditable. NIS2 requires that security measures be demonstrable. GDPR Article 22 restricts automated decision-making. A cloud AI black box fails each of these requirements in ways that a local or auditable model does not.
 **The Pitch**
-> *"What happens to our core operations if the cloud-AI provider changes its Terms of Service, raises prices by 1000%, or suffers a geopolitical blackout that disables their API? Our entire business model should not be an app running on someone else's brain."*
+> *"We cannot audit what we do not control. When an AI-assisted decision is questioned by a regulator, a court, or a customer, we need to be able to show our reasoning. A cloud model gives us an output. It does not give us an audit trail. A local model gives us both."*
 **The Antifragile Move**
 Local models are auditable by definition — you have the weights, the inference logs, and the version history. For workflows where regulatory auditability is a requirement, local inference is not an architectural preference; it is the only defensible choice.
 For lower-sensitivity workflows, open-weights models (Llama, Mistral, Qwen) deployed on controlled infrastructure provide a middle path: cloud AI capability with local audit rights.
 **Key Points for the Room**
 - Version-pinned local models produce deterministic outputs. Cloud models update silently, and the same prompt may produce different results across model versions.
 - Inference logs from a local model can be retained and searched. Inference logs from cloud AI are typically inaccessible — a gap that becomes critical in incident investigations.
 - Open-weights models can be independently evaluated for bias, backdoors, and capability claims. Closed-source cloud models cannot.
 ---
 ### 3. The Operational Continuity Argument (The "Pulling the Plug" Scenario)
 **The Problem**
 Cloud AI is a dependency with no contractual guarantee of continuity. Terms of service change. Pricing is restructured. API versions are deprecated. Acceptable-use policies are updated to restrict specific industries or use cases. Geopolitical events impose access restrictions. The organisation's critical workflows are, in effect, running on someone else's infrastructure — and someone else's judgment about what is and is not permitted.
 **The Pitch**
 > *"What happens to our core operations if the cloud AI provider changes its terms of service, raises prices by 500%, or suffers a geopolitical restriction that disables their API? Our operational continuity should not be an application running on someone else's brain."*
 **The Antifragile Move**
 Local models are **sovereign infrastructure**. They operate when:
 - The internet is degraded or unavailable
- The provider is down, acquired, or embargoed
+- The provider is down, acquired, subject to sanctions, or embargoed
- The "safety" filters have been updated to block your use case
+- Acceptable-use policy has been updated to restrict your use case
- Pricing has been restructured beyond recognition
+- Pricing has been restructured beyond budget tolerance
 - The vendor's strategic direction no longer aligns with your operational needs
-This is the ultimate insurance policy—not against data loss, but against **capability loss**.
+This is the operational resilience argument: not about data leakage, but about **capability continuity**.
 **Key Points for the Room**
- Vendor lock-in for compute is expensive. Vendor lock-in for *intelligence* is existential.
+- Vendor lock-in for compute is expensive. Vendor lock-in for *intelligence* — the reasoning layer your operations depend on — is potentially existential.
- Recovery from a cloud exit is measured in quarters if workflows are deeply integrated. Recovery from a local model is measured in minutes.
+- Recovery from a cloud AI exit is measured in quarters if workflows are deeply integrated. Migrating to a local model is measured in weeks.
 - Resilience is not about having a backup. It is about having no single point of failure in your cognitive pipeline.
 - The optionality cost is low. Maintaining the technical capability to run locally — even while using cloud AI today — preserves the exit option at minimal cost.
 ---
-### 3. The Intellectual Property Argument (The Asset Protection)
+### 4. The Intellectual Property Argument (The Asset Protection Case)
 **The Problem**
-When an organization uses cloud AI, it owns neither the weights, the architecture, nor the deterministic behaviour of the system. It cannot audit the reasoning. It cannot guarantee that the same prompt will produce the same result tomorrow. It cannot prevent its proprietary workflows from being absorbed into a general model.
+When an organisation uses cloud AI, it owns neither the weights, the architecture, nor the deterministic behaviour of the system. It cannot audit the reasoning. It cannot guarantee that the same prompt produces the same result tomorrow. It cannot version-control the model itself. Fine-tuned domain knowledge built through intensive proprietary workflow cannot be transferred out of the provider's environment.
 Additionally, in some cloud AI configurations — particularly where enterprise agreements are not in place or are misconfigured — prompt data *can* be used to improve models. This risk exists and should be assessed, even if it is not universal.
 **The Pitch**
-> *"When we run models locally, we own the weights, the architecture, and the outputs. We are not tenants of an intelligence; we are the owners of it. We can tune it for our specific tasks, not the generic tasks the cloud provider cares about."*
+> *"When we run models locally, we own the weights, the architecture, and the outputs. We are not tenants of an intelligence — we are the owners of it. We can fine-tune it for our specific tasks, version it, audit it, and legally defend it. None of that is possible with a cloud black box."*
 **The Antifragile Move**
-The organization moves from being a **consumer of AI** to a **manufacturer of its own intelligence**.
+The organisation moves from **consuming AI** to **manufacturing its own intelligence**.
 This is the difference between:
 - A farm that buys seeds every year and is subject to the seed catalog (cloud AI)
 - A farm that saves, selects, and breeds its own cultivars (sovereign AI)
- A farm that buys seeds every year (cloud AI)
+Over time, the sovereign farm develops capabilities perfectly adapted to its specific environment. The seed-buying farm is permanently dependent on external supply.
 - A farm that saves, selects, and breeds its own (sovereign AI)
 Over time, the sovereign farm develops cultivars perfectly adapted to its soil. The seed-buying farm is at the mercy of the seed catalog.
 **Key Points for the Room**
- Fine-tuned local models on proprietary data outperform general models on domain-specific tasks.
+- Fine-tuned local models trained on proprietary data outperform general models on domain-specific tasks. This is well-documented across legal, medical, financial, and operational domains.
- You can version, audit, and legally defend a local model. You cannot audit a cloud black box.
+- You can version, audit, and legally defend a local model. You can file for trade secret protection over its weights and training data. You cannot do any of this with a cloud model.
- The outputs of your local model are your intellectual property, unencumbered by third-party terms.
+- The outputs of your local model are your intellectual property, unencumbered by third-party terms of service that can change.
 ---
-### 4. Overcoming the Complexity Objection
+### 5. The Strategic Differentiation Argument (The Compounding Moat)
 **The Objection**
 > *"But the cloud models are smarter. And local deployment is complex."*
 **The Counter**
 Cloud models are smarter at *everything*, which makes them *dumb* at your specific thing. A general-purpose model optimized for broad benchmarks is not optimized for your internal processes, your data schemas, your regulatory constraints, or your proprietary logic.
 By training or fine-tuning a smaller, local model on specific, proprietary data, the organization can achieve:
 | Metric | Cloud General Model | Local Fine-Tuned Model |
 |--------|--------------------|------------------------|
 | Performance on generic tasks | 95% | 70% |
 | Performance on proprietary tasks | 60% | 90% |
 | Cost at scale | Linear / unpredictable | Sub-linear / fixed |
 | Data leakage risk | Non-zero and growing | Zero |
 | Operational ownership | None | Complete |
 **The Honest Reframe**
 > *"Most businesses do not need a model that can write Shakespeare. They need a model that knows their internal processes, their data, and their specific workflow. Local models are better at that—and they get better every day you feed them proprietary signal."*
 **Technical Reality**
 Modern quantized models, parameter-efficient fine-tuning (LoRA, QLoRA), and retrieval-augmented generation (RAG) have reduced the barrier to local deployment dramatically. A reasonable AI budget today can achieve what required a dedicated team two years ago.
 ---
 ### 5. The Professional Responsibility Angle
 **The Problem**
-As a security architect, consultant, or technical leader, you are the steward of the organization's crown jewels. Recommending that proprietary strategic intelligence be outsourced to an unauditable third-party black box is not a neutral technical decision. It is a **breach of fiduciary responsibility**.
+Cloud AI democratises baseline capability. Every competitor who subscribes to the same cloud AI service starts with the same baseline. The gap between organisations narrows as general AI capability becomes a commodity — and the only remaining differentiation is proprietary data and domain-specific fine-tuning.
 Organisations that feed their proprietary data into cloud AI rather than their own models are contributing the raw material of their future differentiation to a platform that commoditises it.
 **The Pitch**
-> *"I cannot in good faith recommend that we outsource our strategic intelligence to a third-party black box that we cannot audit and that is actively incentivized to commoditize our data."*
+> *"The cloud model is smarter at everything, which makes it dumb at your specific thing. A model fine-tuned on your data, your workflows, and your domain knowledge will outperform a general model on your actual tasks. And unlike the general model — which improves for everyone — this model improves only for you. That is a moat the vendor cannot replicate without your data."*
-**The Outcome**
+**The Antifragile Move**
-This framing elevates the advisor from a "technical implementer" to a **Strategic Defender of the Company's Future**. It positions the recommendation not as a preference for complexity, but as a principled stand for structural integrity.
+Treat proprietary AI capability as a **T0 strategic asset** — not because the technology is valuable in the abstract, but because the combination of your model, your data, and your domain knowledge produces capability that competitors cannot purchase.
-**Key Points for the Room**
+See the full [T0 Asset Framework](t0-asset-framework.md) for classification guidance.
 - You are not selling local AI. You are protecting the organization's ability to think independently.
 - The conflict of interest is real: cloud AI consultants are often incentivized by provider partnerships. Independent architects have no such conflict.
 - This is the same logic that demands on-premises key management for cryptography. Intelligence is no different.
 ---
-## The T0 Asset Classification
+## Handling Objections
-In cybersecurity and architecture, a **Tier 0 (T0) asset** is something that, if compromised, destroys the entire operation.
+The following objections are common from technically literate audiences. Superficial responses will not work. These are calibrated for audiences who have read the fine print.
-Local AI must be classified as T0. This framing speaks the language of high-stakes infrastructure and immediately elevates the conversation from "tech project" to **foundational pillar of survival**.
+| Objection | Honest Response |
-
+|-----------|----------------|
-### Why T0?
+| "Our enterprise agreement prohibits training on our data." | Most do. The issue is not current policy — it is audit rights over that commitment, the definition of what counts as "your data" vs. metadata, what happens when the agreement renews, and whether you can prove compliance to a regulator. Policy is not architecture. |
-
+| "We use Azure OpenAI, which is EU-hosted and GDPR-compliant." | A defensible starting point, not an end state. Verify your DPA covers all inference use cases. Confirm no data is routed outside the EU/EEA. For DORA-covered entities, complete the ICT third-party risk assessment. Azure OpenAI is the sovereignty bridge, not the destination. |
-1. **It defines the boundary of trust**: Moving intelligence inside the firewall re-establishes a perimeter that has been silently dissolving.
+| "Cloud models are more capable." | For generic tasks, yes. For your specific domain workflows, a fine-tuned local model — trained on your data — will match or exceed general model performance while keeping your data inside. The comparison is not "cloud vs. local on general benchmarks." It is "general model vs. fine-tuned model on your actual tasks." |
-2. **It removes vendor risk**: A local model is vendor-independent. It remains functional regardless of Silicon Valley boardroom decisions.
+| "Local deployment is too expensive." | Cloud AI pricing scales linearly with usage. Locally-run models (or private cloud inference) are a fixed-cost investment with predictable operating costs. Organisations with meaningful AI workloads typically reach break-even within 12–18 months. After break-even, the cost advantage compounds. |
-3. **It signals strategic maturity**: While competitors chase shiny APIs, the T0 advocate is building durable infrastructure for a 5-to-10-year horizon.
+| "We don't have the expertise." | Start with a pilot using modern tooling (Ollama, LM Studio, or a managed private cloud endpoint). The barrier has dropped dramatically. Partner for initial setup; own for ongoing operations. |
-
+| "This will slow us down." | Sovereignty is not a replacement for cloud AI. It is a capability you build alongside it. Start by identifying the 10% of workflows that touch regulated or proprietary data — sovereign those first. Continue using cloud AI for everything else while building toward full capability. |
-See the full [T0 Asset Framework](t0-asset-framework.md) for implementation guidance.
+| "We already have a SOC 2 / ISO 27001 certified vendor." | Certification covers the vendor's internal processes, not your specific data flows. It does not satisfy NIS2 Article 21, DORA Article 28–30, or GDPR Article 32 on its own. Ask your auditor directly whether your current AI processing arrangement would survive a supervisory authority inquiry. |
 ---
-## Implementation Posture
+## The Sovereignty Spectrum
-### Immediate (0-30 days)
+AI sovereignty is not binary. Organisations move along a spectrum from full cloud dependency to full local sovereignty. The appropriate position depends on the sensitivity of workflows, regulatory obligations, and operational risk tolerance.
- **Inventory**: Map all current AI usage—approved and shadow. Identify what data is leaving the perimeter.
+| Position | Description | Typical Use Case |
- **Classify**: Label workflows by sensitivity. Anything involving IP, strategy, or customer data is a sovereignty candidate.
+|----------|-------------|-----------------|
- **Pilot scope**: Select one non-critical, high-signal workflow for local model proof-of-concept.
+| **Cloud AI (unmanaged)** | No enterprise agreement, data processed with no residency guarantee | Consumer tools, non-regulated workflows |
 | **Cloud AI (enterprise)** | Enterprise agreement, data processing addendum, EU residency where required | Most corporate Microsoft 365 / Azure OpenAI usage today |
 | **Sovereign cloud AI** | Dedicated infrastructure, contractually guaranteed data isolation, full audit rights; e.g. Azure OpenAI with committed EU residency and complete DPA | Regulated organisations with EU data requirements |
 | **Self-hosted open-weights** | Open-weights models (Llama, Mistral, Qwen) on organisation-controlled infrastructure | High-sensitivity workflows; NIS2/DORA-regulated operations; organisations with strong data sovereignty requirements |
 | **Fine-tuned local** | Organisation-trained or fine-tuned model on proprietary data, fully isolated | Maximum differentiation; T0 intelligence workflows; trade secret protection |
-### Short-term (30-90 days)
+The CQRE practice does not mandate a specific position. We recommend mapping each AI workflow to its appropriate position on the spectrum based on the sensitivity of the data it processes and the regulatory obligations it creates.
- **Deploy local inference**: Establish on-premises or sovereign-cloud inference infrastructure.
+---
 - **Fine-tune**: Train a small model (7B-13B parameters) on proprietary data for the pilot workflow.
 - **Measure**: Compare accuracy, latency, cost, and leakage risk against the cloud baseline.
-### Medium-term (90-180 days)
+## Practical Starting Points
- **Expand**: Migrate additional workflows based on pilot results.
+### Immediate (0–30 days)
- **Integrate**: Connect local models to internal data pipelines, CMDB, and security tooling.
+
- **Govern**: Establish policies for approved AI usage, data handling, and model versioning.
+- **Inventory**: Map all current AI usage — approved and shadow. Identify what data is leaving the perimeter and under what contractual terms.
 - **Classify**: Label workflows by sensitivity and regulatory exposure. Anything involving personal data, strategic information, or regulated operations is a sovereignty candidate.
 - **Gap analysis**: For each AI workflow, assess: Is there a valid DPA? Is data residency confirmed? Do you have audit rights? Can you exit in 90 days if needed?
 ### Short-term (30–90 days)
 - **Harden existing cloud AI**: Ensure enterprise agreements and DPAs are in place for all cloud AI workflows. Confirm EU data residency where required. Document the ICT third-party risk assessment for DORA-covered entities.
 - **Sovereignty bridge**: For M365 environments, Azure OpenAI with committed EU data residency is the appropriate stepping-stone while local capability is built. See [Azure OpenAI Sovereignty Bridge](azure-openai-sovereignty-bridge.md).
 - **Pilot local inference**: Deploy Ollama or equivalent on controlled infrastructure for one high-sensitivity workflow. Measure performance, latency, and operational overhead.
 ### Medium-term (90–180 days)
 - **Expand local capability**: Migrate additional high-sensitivity workflows to local inference based on pilot results.
 - **Fine-tune**: Train or fine-tune a model on proprietary data for the workflows where domain performance matters most.
 - **Govern**: Establish a documented AI usage policy that maps each workflow to its approved processing tier and the controls required.
 ### Long-term (180+ days)
- **Manufacture**: Build internal capability to train, evaluate, and deploy domain-specific models.
+- **Manufacture**: Build internal capability to train, evaluate, and version domain-specific models.
- **Distribute**: Extend sovereign intelligence to edge locations, OT environments, and disconnected operations.
+- **Integrate**: Connect local models to internal data pipelines, security tooling, and operational systems.
- **Monetize**: Consider whether proprietary model capabilities represent a productizable asset.
+- **Productise**: Assess whether proprietary model capabilities represent a competitive asset worth protecting formally (trade secrets, access controls).
 ---
-## Common Objections and Responses
+## CQRE Tools as Sovereign Intelligence Examples
-| Objection | Response |
+The CQRE product suite is a working example of this framework applied to M365 operations:
-|-----------|----------|
+
-| "Cloud models are more capable." | For generic tasks, yes. For your proprietary tasks, a fine-tuned local model will outperform them—while keeping your data inside. |
+- **ASTRAL** runs on infrastructure you control (Azure DevOps, Git). M365 configuration state — the baseline of your tenant's security posture — never leaves your environment. The intelligence (what changed, what it means) is produced locally.
-| "Local deployment is too expensive." | Cloud AI pricing is linear with usage and unpredictable. Local is a fixed capital expense with predictable operating costs. At scale, it is cheaper. |
+- **PULSAR** retains M365 audit logs on your infrastructure (or CQRE-managed EU infrastructure for hosted deployments). The events that a cloud provider would let expire in 90 days are yours permanently, searchable, and not subject to any vendor's data retention policy.
-| "We don't have the expertise." | Start with a pilot. Modern tooling has reduced the expertise barrier dramatically. Partner for setup, own for operations. |
+- **AURORA** provides AI-assisted diagnostics over your ASTRAL and PULSAR data. Self-hosted AURORA brings its own Azure OpenAI endpoint (BYOAI) — your tenant data is processed by AI infrastructure under your control, not routed through a third-party service with opaque data handling.
-| "Our vendor says they don't train on our data." | Terms of service change. Verbal assurances are not architecture. If the data leaves your perimeter, you have lost control regardless of current policy. |
+
-| "This will slow us down." | A temporary reduction in velocity is preferable to a permanent loss of strategic optionality. Build the vault first; fill it quickly after. |
+This is the sovereignty spectrum in practice: open-source tools, client-controlled data, auditable AI integration, with a hosted tier for organisations that prefer managed infrastructure but require EU data residency.
 ---
 ## The Builder's Mandate
-By pushing for local AI infrastructure in the corporate world, you are **decentralizing the Machine**. You are taking the intelligence that centralized cloud platforms are trying to monopolize and distributing it to the edges—where human-scale organizations live and operate.
+By building sovereign AI capability in the organisations you advise, you are **decentralising intelligence**. You are taking the cognitive infrastructure that centralised cloud platforms are trying to monopolise and returning it to the organisations that generate the underlying data.
-You are building the infrastructure that allows businesses to remain **sovereign entities** rather than terminal sinks for centralized AI extraction.
+This is not anti-cloud ideology. It is a straightforward application of the antifragile principle: **own your critical dependencies, and maintain the option to change everything else**.
-This is the most responsible architecture work possible right now.
+The organisations that build this capability today will retain independent judgment when vendor relationships change. The organisations that don't will discover — too late — that their cognitive infrastructure belongs to someone else.
 ---
 *Next: [T0 Asset Framework](t0-asset-framework.md)*
 *Previous: [Antifragile Manifest](antifragile-manifest.md)*
 *Related: [Azure OpenAI Sovereignty Bridge](azure-openai-sovereignty-bridge.md)*
 *Related: [AI Operations Inevitability](ai-operations-inevitability.md)*
 *Related: [CQRE Product Suite](../playbooks/cqre-product-suite.md)*
@@ -25,6 +25,8 @@ This manifest defines the five foundational pillars of an antifragile enterprise
 Your job as a consultant is to translate each pillar into the specific context of the client. The language should shift (a CISO hears "Stress-to-Signal Conversion" differently than a CFO does), but the underlying logic does not.
 **On sequencing**: The pillars are not equally weighted at the start of an engagement. Pillars 1 and 2 (Structural Decoupling and Optionality Preservation) are foundation work — mapping and removing dangerous dependencies. Pillar 3 (Stress-to-Signal Conversion) requires having something to instrument. Pillar 4 (Sovereign Intelligence) presupposes a foundation worth protecting and a signal worth amplifying. A client excited about AI sovereignty who has not enforced MFA is building a sophisticated roof on a house with no walls. Fix the foundation first. See [Move Fast and Fix Things — The AI Distraction](move-fast-and-fix-things.md#the-ai-distraction).
 For the reasoning *why* these pillars work—drawn from natural systems, distributed networks, and emergent order—see [Spontaneous Order Principles](spontaneous-order-principles.md).
 ---
@@ -45,6 +47,7 @@ Cloud architectures have created an illusion of resilience through scale. In rea
 - **Design graceful degradation**: Every critical function must have a fallback mode that operates at reduced capacity without the external dependency.
 - **Practice controlled failure**: Introduce chaos into non-production environments. If a system cannot survive the simulated failure of a dependency, it will not survive the real one.
 - **Establish exit architectures**: For every major platform dependency, maintain a technical and procedural path to migration that can be executed within 90 days.
 - **Build greenfield capability**: The ultimate expression of structural decoupling is the ability to rebuild the entire environment from scratch — cleanly, from documentation and version-controlled configuration, without inheriting the compromised state. An organisation that can execute a planned greenfield deployment every five years or so is in a structurally different risk position than one for which greenfield is a nightmare scenario. This is the controlled burn: organisations that never rebuild accumulate the technical debt and undocumented dependencies that make eventual failure catastrophic. See [Move Fast and Fix Things — Rule 5](move-fast-and-fix-things.md#rule-5-build-toward-greenfield-capability).
 ### Executive Framing
@@ -118,9 +121,11 @@ An organization that outsources its cognition outsources its future. Sovereign i
 ### The Argument
-The current AI paradigm is extractive. Every prompt sent to a cloud AI is a contribution to a competitor's training set. Every workflow built on a third-party model is a dependency on an intelligence you do not control, cannot audit, and cannot guarantee will serve your interests tomorrow. This is not a privacy concern. It is a **survival concern**.
+The current AI paradigm introduces three underappreciated risks. First, **vendor dependency**: every workflow built on a third-party model is a dependency on an intelligence you do not control, cannot fully audit, and cannot guarantee will serve your interests when the vendor's incentives shift. Second, **data residency and audit rights**: even where enterprise agreements prohibit training on your data, you typically cannot verify this independently — and audit rights over model inference are absent from most SLAs. Third, **operational continuity**: cloud AI services can change pricing, degrade quality, or enforce new acceptable-use restrictions at will. Your workflows break on their schedule, not yours.
-Sovereign intelligence is the antifragile response: local models, proprietary data loops, and owned reasoning infrastructure that improves with use rather than leaking value to external platforms.
+Sovereign intelligence is the antifragile response: owned or auditable models, proprietary data loops, and reasoning infrastructure that improves with use rather than creating dependency. This does not require rejecting all cloud AI. It means treating AI infrastructure with the same dependency analysis you would apply to any critical vendor: map it, stress-test the exit, and ensure you retain options.
 > **Consultant note**: The strongest client argument is not "your prompts are training competitors" — most enterprise agreements explicitly prohibit this, and technically literate clients will push back. The more durable arguments are data residency requirements (NIS2, DORA, GDPR Article 32), audit rights over inference decisions, and operational continuity risk when a critical workflow depends on an endpoint you cannot control. Start there.
 ### Antifragile Moves
@@ -10,7 +10,7 @@ It is designed for M365/Azure consultancies whose clients are not ready for on-p
 ## The Executive Summary
-Your clients are likely using ChatGPT, Claude, or Gemini via public APIs and consumer accounts. Every prompt leaves their perimeter, and the terms of service allow model improvement using that data. This is the worst possible posture.
+Your clients are likely using ChatGPT, Claude, or Gemini via consumer accounts or unmanaged public APIs — where data residency is uncontrolled, audit rights are absent, and (for consumer tiers) terms of service may permit model improvement using submitted data. This is the worst possible posture.
 **Azure OpenAI Service is not fully sovereign.** Microsoft operates the infrastructure. The underlying models are shared. But it offers something critical that public APIs do not:
@@ -204,7 +204,7 @@ For E3 clients, Azure OpenAI is a **separate Azure subscription**—it does not
 |---------|----------|
 | "Is this just another Microsoft lock-in?" | "It reduces lock-in compared to public APIs because your fine-tuned models, embeddings, and RAG pipelines are portable assets. When you are ready for full local AI, you migrate them. We are using Azure as a warehouse, not a prison." |
 | "Why not go straight to local AI?" | "Local AI requires hardware procurement, infrastructure setup, and expertise development—typically 3-6 months. Azure OpenAI stops the data leakage in 2 weeks while we build the local capability in parallel." |
-| "How is this different from just using ChatGPT?" | "ChatGPT trains on your data. Azure OpenAI explicitly does not. ChatGPT has no audit trail. Azure OpenAI logs every prompt. ChatGPT offers no data residency guarantee. Azure OpenAI keeps your data in your region. The difference is governance, not capability." |
+| "How is this different from just using ChatGPT?" | "Consumer ChatGPT may use your data for model improvement; Azure OpenAI explicitly does not. Consumer ChatGPT has no audit trail; Azure OpenAI logs every prompt. Consumer ChatGPT offers no data residency guarantee; Azure OpenAI keeps your data in your chosen region. The difference is governance and compliance, not capability." |
 | "What if Microsoft changes the terms?" | "The data processing agreement is contractually binding. More importantly, the assets we build in Foundry are portable. If terms change unfavorably, we exercise the exit option we have been building toward all along." |
 | "Will this slow down our AI adoption?" | "It will accelerate safe adoption. Employees currently use unauthorized AI because there is no sanctioned alternative. Azure OpenAI gives them a better, safer tool. Adoption goes up; risk goes down." |
@@ -121,7 +121,7 @@ Many organizations have purchased or inherited an impressive security stack:
 **Deliverable**: Operating Rhythm Playbook
-**Tool stack for the operating rhythm**: See the [Sovereign Tool Stack](../playbooks/sovereign-tool-stack.md) for the complete open-source SOC architecture. For M365-centric environments, AOC provides audit log intelligence; Wazuh + Sysmon provide endpoint detection; TheHive + Cortex provide case management; Shuffle provides automated response. This stack replaces €200K+/year commercial SOC tooling for clients who prioritise sovereignty.
+**Tool stack for the operating rhythm**: See the [Sovereign Tool Stack](../playbooks/sovereign-tool-stack.md) for the complete open-source SOC architecture. For M365-centric environments, PULSAR provides audit log intelligence; Wazuh + Sysmon provide endpoint detection; TheHive + Cortex provide case management; Shuffle provides automated response. This stack replaces €200K+/year commercial SOC tooling for clients who prioritise sovereignty.
 - Weekly, bi-weekly, and monthly cadence definitions
 - RACI matrix for each activity
 - Dashboard definitions and data sources
@@ -245,6 +245,24 @@ Week 1 produces the baseline. It does not produce improvements. Clients sometime
 ---
 **11. Validating the AI Distraction**
 A client opens with: *"We want to implement AI-powered threat detection"* or *"Can AI help us manage our security posture?"* The mistake is engaging with the AI question directly — evaluating vendors, discussing models, building a roadmap — before establishing whether the foundation exists.
 AI security tools are multipliers. A multiplier applied to a broken foundation produces nothing except an expensive invoice and a false sense of coverage. The client who wants AI detection but has no MFA on admin accounts, no tested backups, and unpatched internet-facing systems does not need AI detection. They need MFA.
 **The redirect script**:
 > *"I want to get you to the AI layer — that's where the interesting work is. The fastest path there is closing the gaps that AI can't compensate for first. Otherwise, we're tuning the detection system while the front door is unlocked. Let's run the Brownhat Diagnostic, find your kill chain, close the existential gaps, and then we build the intelligence layer on top of something solid. You'll actually get value from the AI at that point."*
 **When to apply this**: Any time a client's opening request is for an intelligence or detection capability before you have confirmed that basic hygiene is in place. The discovery call question that surfaces it: *"What's your current MFA coverage across admin accounts?"* If the answer is anything other than "100%, enforced by policy," you have a layer-one gap. Fix that before any AI conversation.
 **The one exception**: A client with demonstrably strong fundamentals — IG1 complete, MFA enforced, logging in place, backups tested — who wants to build on that foundation. This is a legitimate AI conversation. But verify the foundation before accepting the premise that it exists.
 See [Move Fast and Fix Things — The AI Distraction](move-fast-and-fix-things.md#the-ai-distraction) for the full philosophical statement.
 ---
 ## Part 5: Technical Onboarding
 ### CQRE tool repositories
@@ -253,8 +271,8 @@ Before leading a module, you need to be able to deploy and use the tools that mo
 | Tool | Repository | Used in |
 |------|-----------|---------|
-| **ASTRAL** | `cqrenet/astral` (public) · `cqrenet/Intune` (internal, full version) | Modules 1, 2, 3 |
+| **ASTRAL** | [github.com/cqrenet/astral](https://github.com/cqrenet/astral) | Modules 1, 2, 3 |
-| **AOC** | `cqrenet/aoc` | Modules 2, 3, 12; retained capability |
+| **PULSAR** | [github.com/cqrenet/pulsar](https://github.com/cqrenet/pulsar) | Modules 2, 3, 12; retained capability |
 | **macOS_IntuneManagement** | `cqrenet/macOS_IntuneManagement` | Module 1; tenant migrations |
 | **Elysium** | `cqrenet/elysium` | Module 6, 10 |
 | **CAExporter** | `vibecoding/CAExporter` | Modules 2, 3 |
@@ -271,7 +289,7 @@ This is the minimum bar for leading (not shadowing) a module. If you are not the
 | Module | Minimum competency |
 |--------|-------------------|
 | **Module 1** (Endpoint) | PowerShell 7+; Intune policy structure; ASTRAL deployment and configuration; E8-CAT scoring |
-| **Module 2** (Identity) | Entra ID architecture; Conditional Access design; PIM/PAM concepts; AOC deployment; CAExporter export and analysis |
+| **Module 2** (Identity) | Entra ID architecture; Conditional Access design; PIM/PAM concepts; PULSAR deployment; CAExporter export and analysis |
 | **Module 3** (M365 Hardening) | Modules 1 and 2 competency; Prowler Azure audit; ASTRAL drift detection; ASR rules |
 | **Module 6** (AD Hardening) | Active Directory architecture; BloodHound collection and analysis; DSInternals and Elysium operation; LAPS deployment; GPO design; Sysmon configuration |
 | **Module 8** (OT Security) | OT/IT network segmentation concepts; NIS2 Article 21 and 23 requirements; SCADA/ICS risk framing; Zeek or Suricata basics |
@@ -283,7 +301,7 @@ This is the minimum bar for leading (not shadowing) a module. If you are not the
 Before your first client engagement, build a personal lab that lets you safely test deployments:
- **M365 developer tenant** — Microsoft's free developer programme provides an E5 tenant. Use it for ASTRAL, AOC, CAExporter, and M365 module testing. Register via the Microsoft 365 Developer Programme.
+- **M365 developer tenant** — Microsoft's free developer programme provides an E5 tenant. Use it for ASTRAL, PULSAR, CAExporter, and M365 module testing. Register via the Microsoft 365 Developer Programme.
 - **A small Linux VM (any cloud)** — For chatmail relay, Wazuh, TheHive, and Shuffle deployments. A €5–10/month VPS is sufficient for personal lab use.
 - **A Windows Server VM** — For AD module testing: BloodHound, Elysium, LAPS, Sysmon. Can be local (Hyper-V, VMware) or cloud.
 - **A CQRE internal environment** — Ask for access to the shared lab environment used for tool testing and client demos.
@@ -166,7 +166,7 @@ Some clients want ongoing support rather than discrete projects. Three models:
 | Type | Description | Typical cadence |
 |------|-------------|----------------|
 | **Retained advisory** | A fixed number of hours per month for questions, threat model reviews, architecture reviews, and strategic guidance. No new module delivery — advisory only. | Monthly retainer, 8–16 hours/month |
-| **Retained capability support** | Active support operating tools we deployed: reviewing ASTRAL alerts, tuning AOC detection rules, running quarterly AD scans with Elysium and PingCastle, reviewing Huntress findings. | Monthly or quarterly, scoped per tool set |
+| **Retained capability support** | Active support operating tools we deployed: reviewing ASTRAL alerts, tuning PULSAR detection rules, running quarterly AD scans with Elysium and PingCastle, reviewing Huntress findings. | Monthly or quarterly, scoped per tool set |
 | **Module continuation** | Ongoing delivery of a multi-module programme at a structured cadence. Each module planned and scoped before it begins. | Quarterly module delivery |
 Retained relationships are renewed quarterly. Either side can exit with 30 days' notice.
@@ -211,11 +211,11 @@ Each module documents its prerequisites in detail in [Modular Engagements](modul
 | Stage | Deliverable |
 |-------|-------------|
-| Brownhat Diagnostic | Current-state assessment report (6-domain, kill chain, quick wins) + prioritised module roadmap |
+| Brownhat Diagnostic | Current-state assessment report (6-domain, kill chain, quick wins) + prioritised module roadmap + **findings backlog opened** (all diagnostic findings entered with P0/P1/P2 priority and named owner) |
-| Module kickoff | Written scope agreement; access checklist confirmation; communication channel setup |
+| Module kickoff | Written scope agreement; access checklist confirmation; communication channel setup; backlog reviewed and updated with module-specific prerequisites |
-| Weekly (during delivery) | Change log update; check-in summary (decisions made, items pending, risks) |
+| Weekly (during delivery) | Change log update; check-in summary (decisions made, items pending, risks); backlog items resolved this week noted |
-| Module completion | Configuration baseline document; scripts/rules in client repository; operating runbooks; risk register update; metrics baseline; next-step recommendation |
+| Module completion | Configuration baseline document; scripts/rules in client repository; operating runbooks; **backlog updated** (items closed with evidence, new findings added); risk register update where one exists; metrics baseline; next-step recommendation |
-| Retained relationship | Monthly advisory summary or capability review report |
+| Retained relationship | Monthly advisory summary or capability review report; **backlog health review** (P0 count, items closed this cycle, blockers) |
 **Ownership**: Every script, detection rule, query, configuration file, and document produced during an engagement belongs to the client permanently. We do not retain privileged access to client environments after an engagement closes. We do not license anything we build — it is yours.
@@ -6,13 +6,13 @@
 ## The Problem in One Sentence
-Your organization is currently engaged in a **massive, unpaid research project for its competitors**—sending proprietary data, strategic reasoning, and operational intelligence to cloud platforms that are incentivized to commoditize your industry.
+Your organization depends on technology infrastructure it does not fully control — cloud platforms whose incentives are not aligned with your survival, AI tools processing your operational intelligence under agreements you cannot audit, and vendors whose pricing, terms, and continued existence are outside your influence.
 ## What Is at Stake
 | Asset Category | Current Risk | If Compromised or Extracted |
 |---------------|-------------|----------------------------|
-| Strategic intelligence | Rented from cloud AI providers | Competitors replicate your edge; your strategy becomes public model training data |
+| Strategic intelligence | Rented from cloud AI providers | Vendor dependency, data residency risk, no audit rights over inference — and a strategy that improves their platform, not yours |
 | Customer trust | Protected by compliance theater | Regulatory fines, class-action liability, irreversible reputational damage |
 | Operational continuity | Dependent on vendor stability | Single API change or geopolitical event halts revenue-critical workflows |
 | Technical talent | Wasted on maintenance of fragile systems | Burnout, attrition, inability to attract security-conscious engineers |
@@ -34,16 +34,24 @@ An antifragile organization does not merely survive shocks. It **grows stronger
 ## The Strategic Mandate: AI Sovereignty
-The current AI paradigm is **extractive**. Every prompt sent to a cloud AI teaches that system how to replace you. By running artificial intelligence on infrastructure you control, you:
+Cloud AI introduces three risks that most organisations have not priced. **Vendor dependency**: your critical workflows run on an endpoint you cannot audit, cannot predict, and cannot replace overnight. **Data residency and audit rights**: even where enterprise agreements prohibit training on your data, you typically cannot verify this, and regulators increasingly want proof — not assurances. **Operational continuity**: cloud AI services change pricing, restrict acceptable use, and degrade quality on the vendor's timeline, not yours.
- **Protect your intellectual property** from becoming public training data
+By running intelligence on infrastructure you control, you:
 - **Retain audit rights** over every inference decision — increasingly required by GDPR, NIS2, and DORA auditors
 - **Ensure operational continuity** regardless of vendor decisions, geopolitics, or API changes
 - **Eliminate data residency risk** — EU customers in particular face regulatory requirements that cloud AI processing often cannot satisfy
 - **Reduce long-term costs** from unpredictable per-token pricing to fixed infrastructure
 - **Demonstrate regulatory maturity** to auditors who increasingly scrutinize data residency and third-party risk
 > *"If our company's intelligence were a physical pile of cash, would we store it in a public bank that takes a 'training fee' off every dollar and reserves the right to change the currency? Or would we keep it in our own vault?"*
-Local AI is the vault.
+Local AI — or auditable AI with clear data residency — is the vault.
 ## The Regulatory Context
 For organisations operating in the EU, the compliance case is now as compelling as the security case. **NIS2** (in force October 2024) requires essential and important entities to demonstrate configuration management, logging, and incident detection. **DORA** (applying to financial entities from January 2025) mandates ICT change management records and audit log retention. **GDPR Article 32** requires appropriate technical measures that are increasingly interpreted as continuous, evidenced controls — not annual point-in-time reviews.
 Every engagement we deliver produces evidence that maps directly to these requirements. This is not coincidence — it is by design.
 ## The 180-Day Commitment
@@ -61,7 +69,7 @@ We do not propose a three-year transformation. We propose **four phases, 180 day
 This is not a cost centre. It is **optionality insurance**.
 - **Cost of the program**: Primarily configuration and process—existing tools are leveraged first.
- **Cost of inaction**: A single ransomware incident averages €4.5M in recovery. A single regulatory fine under DORA can reach 2% of global turnover. A single competitor trained on your data renders your proprietary advantage worthless.
+- **Cost of inaction**: A single ransomware incident averages €4.5M in recovery. A single regulatory fine under DORA can reach 2% of global turnover. A single uncontrolled AI vendor relationship can expose your operational data to residency and audit failures that NIS2, DORA, or sector regulators will not overlook.
 - **ROI timeline**: Risk reduction is visible in 30 days. Regulatory evidence is demonstrable in 90 days. Competitive advantage from sovereign intelligence compounds over 12-24 months.
 ## The Decision Required
@@ -73,7 +73,7 @@ We do not sell monolithic transformation projects. We sell **building blocks** t
 - Legacy authentication blocked tenant-wide
 - Privileged access workstation (PAW) architecture for admins
 - PIM deployment (if E5/Entra ID P2) or manual JIT process (if E3)
- AOC deployment for audit log intelligence and anomalous admin detection
+- PULSAR deployment for audit log intelligence and anomalous admin detection
 - Guest access audit and time-bounding
 - OAuth consent governance
@@ -168,7 +168,7 @@ We do not sell monolithic transformation projects. We sell **building blocks** t
 **Executive pitch**:
-> *"Your teams are already using AI—through personal accounts, browser tabs, and mobile apps. Every proprietary document they paste into ChatGPT trains a model that will eventually be sold to your competitors. We stop that leakage in two weeks by giving them a better, safer alternative. Then we build your first custom AI asset on data that never leaves your Azure region."*
+> *"Your teams are already using AI—through personal accounts, browser tabs, and mobile apps. Every proprietary document they send to an unmanaged AI service is processed under terms you haven't reviewed, on infrastructure outside your control, with no data residency guarantees. We stop that leakage in two weeks by giving them a better, safer alternative. Then we build your first custom AI asset on data that never leaves your Azure region."*
 **Natural next modules**: Module 9 (Organizational Resilience), Module 4 (Data Governance), Module 10 (Red Team & Validation)
@@ -31,7 +31,9 @@ The Brownhat Diagnostic — a structured [NIST CSF 2.0 baseline assessment](../a
 ### Speed Is a Security Control
-The organizations that survive are not the ones with the most comprehensive plans. They are the ones that **execute fastest** against the gaps that actually matter. A 90% solution deployed today outperforms a 100% solution that ships in six months—because the attacker does not wait for your roadmap.
+The organizations that survive are not the ones with the most comprehensive plans. They are the ones that **execute fastest** against the gaps that actually matter. A realistic engagement delivers 30–60% of an ideal posture in 180 days. That is the honest target. It is also, in almost every case, an enormous improvement over what existed before — and infinitely better than the 100% solution that stays in planning and never ships.
 The correct comparison is not "30% today vs. 100% in six months." It is "30% today vs. the 0% that will still be there in two years if you wait for the perfect plan." Momentum beats completeness. Imperfect progress beats perfect paralysis.
 ### Fixing Things Is Strategic
@@ -47,7 +49,7 @@ The antifragile consultant's first duty is not to recommend new spending. It is
 ---
-## The Three Rules
+## The Five Rules
 ### Rule 1: Start With What You Own
@@ -85,17 +87,81 @@ A fix that does not generate intelligence is a fix that will rot. Every remediat
 | "We rotated the password." | "We rotated the password and vaulted it in the PAM with checkout logging." |
 | "We fixed the firewall rule." | "We fixed the firewall rule and added a monthly rule review to the change process." |
 ### Rule 4: Run Housekeeping as a Permanent Stream
 This is the rule most often acknowledged and least often followed. In every engagement, cleanup is identified as necessary. In almost no engagement is it ever finished. Stale accounts accumulate. Orphaned permissions persist. Old devices stay enrolled. Legacy protocols remain enabled because removing them requires a change window that never gets scheduled.
 The correct response is not to add cleanup to the project backlog. It is to **establish housekeeping as a dedicated, permanently resourced stream with its own queue, its own cadence, and its own accountability**.
 Housekeeping is not janitorial work. It is attack surface reduction at a structural level. Every stale account is a credential that can be compromised without detection. Every orphaned permission is a privilege escalation path that BloodHound will find. Every legacy protocol still enabled is an authentication downgrade waiting to happen. The environment accumulates new objects continuously — every employee, every project, every vendor relationship adds accounts, permissions, and configurations. Almost nothing removes them automatically. Without a permanent housekeeping stream, the attack surface grows without bound regardless of what else you fix.
 **What housekeeping covers:**
 - Stale user accounts: departed employees, contractors, service accounts with no owner
 - Orphaned group memberships and permissions that outlasted the project that created them
 - Old app registrations and service principals — often the most overlooked and most dangerous
 - Enrolled devices that are no longer in use
 - Conditional Access policies with no named owner and no documented purpose
 - Legacy protocols: NTLM, basic authentication, SMBv1, NTLMv1 — things that should have been disabled years ago
 - DNS records for decommissioned services
 - Firewall rules added for temporary access that became permanent
 - Old GPOs, old admin rights, old certificates
 **The engagement implication**: Every module scoping conversation must include a housekeeping component. It is not optional and not deferrable. The client names a resource, a cadence (minimum monthly), and a queue. The queue is the [Findings Backlog](../assessment-templates/findings-backlog.md) — the single place where every finding from every diagnostic and module lands, prioritised, owned, and tracked to closure. The backlog is populated from module findings and from continuous discovery tools (ASTRAL drift, PULSAR alerts, quarterly BloodHound and Elysium runs). Progress is tracked and reviewed at every steering committee. If there is no resourcing for housekeeping, the engagement model must reflect that — because every fix we make will be partially undone within 18 months by new accumulation if the stream does not exist.
 ---
 ### Rule 5: Build Toward Greenfield Capability
 The cheapest and fastest recovery from a serious breach is often a greenfield deployment — rebuilding the environment from scratch on clean infrastructure rather than remediating a compromised one. Most organisations treat this as a nightmare scenario. The goal is to treat it as a **standard operational capability exercised every five years or so** — not something that wakes you up at night, but something you have done before and know how to do again.
 This is the ultimate defender's power move. An attacker's leverage in a breach depends largely on your inability to walk away from the compromised environment. If you can build the parallel company and burn the old one, that leverage disappears. Ransomware becomes an inconvenience rather than an existential event. The threat model changes fundamentally.
 **What greenfield capability requires:**
 - **Everything documented as code**: infrastructure configuration, security baselines, identity architecture, network topology. If you cannot rebuild it from documentation in a clean environment, you do not own it — you are renting it from accumulated history.
 - **Configuration under version control**: M365 policy state in ASTRAL, infrastructure definitions in IaC, runbooks in a repository. The new environment can be provisioned from the same source of truth.
 - **Clean data separation**: you know where your data is, what form it is in, and how to migrate it. Data that cannot be migrated cleanly is a dependency you have not acknowledged.
 - **Tested migration procedures**: the greenfield capability is not real until it has been exercised. Partial migrations, parallel-environment tests, and recovery drills build the muscle. Each module completion should leave the client one step closer to a documented, tested rebuild path.
 - **Vendor independence at critical layers**: you cannot rebuild greenfield if the new environment depends on the same compromised vendor. Optionality (Pillar 2) is the prerequisite.
 **The cadence target**: An organisation that can execute a planned greenfield migration in 90 days — with data integrity, minimal service disruption, and full security posture — is in a structurally different risk position than one for which greenfield is theoretical. This is not a one-time project. It is a capability you build, test, and maintain.
 **The controlled burn**: forests that are never burned accumulate the fuel for catastrophic fires. Organisations that are never greenfield-deployed accumulate technical debt, legacy dependencies, and accumulated compromise that makes eventual failure more severe. Planned greenfield on a 5-year cycle is the controlled burn that prevents the uncontrolled one.
 ### The Critical Infrastructure Adaptation
 For organisations operating OT/NT environments — power generation, transmission, water utilities, telecoms network infrastructure — a full greenfield rebuild is often genuinely not possible. Protection relays run for 30 years. PLCs controlling turbines cannot be taken offline for a rebuild exercise. Safety systems require regulatory approval for any change. The controlled burn, taken literally, cannot be applied.
 The goal remains the same. The method changes.
 **The purpose of greenfield capability is to eliminate inherited compromise and return to a known-good operational state.** In OT environments, this is achieved through a different set of moves — but the test is identical: *"If our control systems were completely compromised and had to be restored, could we maintain critical service delivery and return to full automated operation from a verified baseline?"*
 **IT layer greenfield protects the OT layer.** The corporate IT environment, SCADA servers, historian, HMI workstations, and M365 tenant can almost always be made greenfield-capable even when the OT hardware cannot. When the IT layer can be rebuilt clean, an adversary who compromised it loses their persistence and pivot path without a single OT system being touched. IT greenfield is the outer defence of an OT environment that cannot be rebuilt itself.
 **Configuration as code for OT.** PLC logic, IED settings, protection relay configurations, SCADA databases, and DCS configurations belong in version control. The ability to restore a verified configuration to existing hardware is the OT equivalent of greenfield: the hardware stays, but the software state is erased and rebuilt from a known-good baseline. Configuration backup and integrity checking for OT systems is not optional — it is the closest available substitute for the rebuild capability that IT environments take for granted. ASTRAL for M365 is the pattern; the same discipline applied to OT configuration archives is the OT equivalent.
 **Manual operation capability is a form of "drop the compromised layer."** A power utility that can maintain 80% of service from manual procedures during a SCADA compromise has a fundamentally different risk profile than one that cannot. The ability to operate without the automation layer is, in effect, the ability to sacrifice the compromised layer and continue. Manual override procedures, validated quarterly, are the OT sector's equivalent of a tested greenfield playbook. If operators have not practised running manually in the past 12 months, the capability does not exist.
 **Compartmentalisation over total rebuild.** OT environments are often sectionable. Grid islanding, corridor isolation, plant-level segmentation, and control centre failover allow the operator to sacrifice a section while maintaining critical service elsewhere. The burn is localised rather than total — but the principle is the same: designed-in ability to contain, recover, and restore in sequence rather than all at once.
 **Long-cycle planned refresh.** OT systems have 20–40 year lifetimes, but those lifetimes should be planned, not accidental. A utility with a documented 20-year OT refresh programme — component-by-component replacement milestones, firmware escrow, spare parts inventory — is doing the OT equivalent of periodic greenfield: the environment is continuously re-established in controlled segments. Organisations that do not have this programme are not avoiding greenfield; they are deferring it until a crisis forces it under the worst possible conditions.
 **What the test looks like for OT**: *"If our SCADA and IT layers were fully compromised tonight, could we maintain critical service from manual procedures within 4 hours, rebuild the IT layer from clean baselines within 48 hours, and restore full automated operation from verified OT configuration backups within two weeks?"* If any of those answers is no, the gap is in manual procedures, IT rebuild capability, or OT configuration management — not in greenfield per se, but in the prerequisites that make any form of recovery possible.
 For the full OT/critical infrastructure treatment, see [Vertical: Power and Utilities](../reference/vertical-power-utilities.md).
 ---
 ## Mapping to Antifragile Pillars
 | Antifragile Pillar | Move Fast and Fix Things Expression |
 |-------------------|-------------------------------------|
-| **Structural Decoupling** | Identify and eliminate hidden dependencies before they become fatal. Do not add new platforms to solve problems that abstraction can solve. |
+| **Structural Decoupling** | Identify and eliminate hidden dependencies before they become fatal. Greenfield capability is the ultimate expression: if you can rebuild cleanly, no single vendor or compromise holds you hostage. |
-| **Optionality Preservation** | Maximize existing investments to preserve budget for strategic optionality. Every unnecessary purchase reduces your ability to pivot. |
+| **Optionality Preservation** | Maximize existing investments to preserve budget for strategic optionality. Greenfield deployment requires vendor independence at every critical layer — build and maintain that independence now. |
 | **Stress-to-Signal Conversion** | Every fix must generate telemetry. Incidents are not failures; they are unpaid penetration tests. Convert their lessons into structure. |
-| **Sovereign Intelligence** | Use what you own first. Local AI on existing hardware beats cloud AI on a credit card. Your data should improve your models, not someone else's. |
+| **Sovereign Intelligence** | Use what you own first. Your data, your configurations, your runbooks — all under version control, all portable, all yours. Housekeeping keeps it clean. Greenfield capability proves it. |
-| **Asymmetric Payoff Design** | Small, fast fixes on the kill chain yield disproportionate risk reduction. Do not distribute effort evenly; concentrate it where failure is existential. |
+| **Asymmetric Payoff Design** | Small, fast fixes on the kill chain yield disproportionate risk reduction. Housekeeping and greenfield capability are the highest-leverage long-term investments: small ongoing cost, enormous reduction in catastrophic risk. |
 ---
@@ -151,6 +217,102 @@ When you walk into a client environment, bring these assumptions:
 ---
 ## The AI Distraction
 There is a recurring pattern in security consulting: a client opens with "we want AI-powered threat detection" or "can AI help us with our security posture?" and the instinct — especially from vendors — is to say yes and start selling.
 The correct response is to ask: *"Do your domain admins have MFA enforced?"*
 We call this pattern the **AI Mythos**: the belief that intelligence-layer tooling is the primary answer to security problems. It is not. AI is a multiplier. A multiplier applied to an absent foundation produces nothing. An AI-powered SOC that generates alerts from a network with no MFA, no patching cadence, and no tested backups is generating expensive noise about a patient who already has a terminal condition.
 ### The Multiplier Principle
 Security capabilities stack in layers. Each layer requires the layer below it to function.
 ```
 Foundation   → Identity hygiene, endpoint coverage, patching, tested backups, basic logging
 Signal       → Logging turned on, SIEM ingesting the right sources, alerts with owners
 Intelligence → Detection engineering, threat hunting, AI-assisted analysis
 ```
 AI lives at layer three. Organisations that have not completed layer one do not benefit from layer three — they buy something that has nothing to amplify.
 **The test**: Ask "what would have stopped this breach?" For the overwhelming majority of incidents — credential theft, ransomware, insider threat, misconfiguration exploitation — the answer is a layer-one control: MFA, patched systems, least-privilege accounts, a working backup. Not AI detection. Not an AI SOC. Not AI-powered SIEM correlation.
 The CIS Controls make this explicit. IG1 — 56 safeguards covering basic inventory, secure configuration, data protection, account management, patching, and backup — is the minimum viable security posture. Every organisation should complete IG1 before spending money on anything above it. AI-powered security tools are not IG1 controls. They are IG3 multipliers applied to an IG1 foundation.
 ### What to Do When a Client Leads with AI
 The client who opens with AI is not wrong to want it. They are wrong about sequencing. Your job is to redirect without dismissing.
 **The redirect**:
 > *"AI security tools are most valuable when you have a strong signal to amplify. The fastest path to benefiting from AI is making sure the basics are right first — because AI on a broken foundation is just expensive noise. Let's start with the Brownhat Diagnostic, find your kill chain, and close the gaps that AI can't compensate for. Then you'll actually get value from the AI layer on top."*
 This reframes AI as a reward for good hygiene, not a substitute for it. It respects the client's interest in AI while directing the budget where it produces real risk reduction.
 ### The Sequencing Rule
 The antifragile pillars are not equally weighted at the start of an engagement. They are sequenced:
 1. **Structural Decoupling** (Pillar 1) and **Optionality Preservation** (Pillar 2) are foundations — you establish these first by mapping and removing dangerous dependencies.
 2. **Stress-to-Signal Conversion** (Pillar 3) requires having something to instrument — logging, monitoring, telemetry. This is layer two.
 3. **Sovereign Intelligence** (Pillar 4) — AI sovereignty, local models, owned cognitive infrastructure — presupposes that you have a foundation worth protecting and a signal worth amplifying. It is not the starting point.
 4. **Asymmetric Payoff Design** (Pillar 5) is the lens applied throughout — concentrate effort where failure is existential.
 A client excited about Pillar 4 who has not addressed Pillar 1 is building a sophisticated roof on a house with no walls.
 ### What "Move Fast" Means Here
 Moving fast does not mean buying AI tools quickly. It means closing the kill chain quickly — with unglamorous, proven controls that stop breaches:
 - Enforce MFA on every account. Today.
 - Patch internet-facing systems. This week.
 - Verify that backups restore. This month.
 - Remove stale privileged accounts. In week one.
 - Turn on logging where it is off. Before anything else.
 These are not interesting. They are not cutting-edge. They are the interventions that would have prevented most of the incidents in the headlines. The AI tools that make headlines did not prevent those incidents.
 ---
 ## When the Vulnerability Surface Is Effectively Infinite
 Recent AI-assisted security research — including large-scale automated vulnerability discovery across entire software stacks — has surfaced a reality that was always true but is now undeniable: **the number of exploitable vulnerabilities in any complex environment exceeds any organisation's capacity to patch them.** This is not a new problem. It is a shift in visibility. The vulnerabilities existed before. We can now find them faster than we can fix them.
 The vendor response to this is predictable: "You need AI-assisted patching." Faster discovery paired with faster remediation, AI all the way down.
 This is the wrong frame. It accepts a race you cannot win.
 ### The Architectural Response
 The correct response to an effectively infinite vulnerability surface is not to patch faster. It is to **move to a realm where most vulnerabilities matter less** — by designing systems architecturally so that the exploitation of any single vulnerability does not lead to existential compromise.
 This is not a new idea. It is the fundamental premise of defence in depth, blast radius limitation, and kill chain thinking. What has changed is the urgency: when AI can identify thousands of vulnerabilities across your stack in hours, the "patch-first" strategy is exposed as insufficient. The architectural strategy becomes the only viable long-term position.
 The moves:
 **Kill chain awareness** — Not every CVE is existential. The ones that matter are the ones that sit on the path from "nothing bad has happened yet" to "the organisation cannot operate." Concentrate protection there. A critical vulnerability in a segmented, non-privileged system is a low-priority finding. The same vulnerability on a domain controller, a backup server, or an OT control system is P0. The vulnerability is the same; the kill chain position is what changes the priority.
 **Blast radius limitation** — Segmentation, least privilege, and structural decoupling mean that exploiting a vulnerability in one component cannot pivot freely through the environment. A flat network with over-privileged accounts converts every vulnerability into a potential total compromise. A segmented, least-privilege environment converts most vulnerabilities into limited-scope incidents.
 **Assume breach posture** — Design for rapid detection and recovery rather than prevention of every entry. If architectural controls are in place, a compromised component is an isolated incident, not a catastrophe. The question shifts from "how do we keep attackers out?" to "how quickly do we detect, contain, and recover?" This is Pillar 3 (Stress-to-Signal Conversion) applied to the vulnerability layer.
 **Known-good baseline** — Configuration management (ASTRAL) and system state tracking mean that after a compromise, you can restore to a verified baseline. The ability to rebuild rapidly from a known-good state reduces the cost of successful exploitation dramatically.
 ### What This Means for Prioritisation
 When clients ask how to respond to the AI vulnerability discovery story, the answer is not a new patching tool. It is a sequenced architectural programme:
 1. Map and close the kill chain — the vulnerabilities that sit on the path to existential compromise get patched first, regardless of CVSS score.
 2. Reduce blast radius — segmentation and least privilege limit the value of any single exploit.
 3. Build detection and recovery capability — assume some vulnerabilities will be exploited; make exploitation detectable and recoverable.
 4. Then consider tooling to accelerate patch velocity for the long tail.
 The correct posture is: **a well-segmented, least-privilege, T0-protected environment with fast recovery capability survives more CVEs than a flat, over-privileged environment with a fast patch programme.** Architecture beats velocity in the vulnerability race. It is the only bet you can actually win.
 ---
 ## Contrast With "Move Fast and Break Things"
 The Silicon Valley mantra was an excuse for externalizing harm. "Move fast and fix things" is its responsible successor:
@@ -0,0 +1,133 @@
 # Quantum Vulnerability Management
 > *"You do not have 40,000 critical vulnerabilities. You have ~400 that are real, ~40 that are on fire, and a process that cannot tell them apart. Quantum vulnerability management is the discipline of sizing remediation to the time you actually have — and of admitting that the unit of work was never the vulnerability. It was the path."*
 This is the operating framework behind [Book VII — Vulnerability Management](../books/06-vulnerability-management.md). Book VII is the philosophy; this is the model a consultant runs in an engagement. It pairs with the [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md) (which sizes the quanta) and the [AI-Assisted TVM Blueprint](../playbooks/ai-assisted-tvm.md) (which automates the hours-lane).
 ---
 ## The problem in one paragraph
 Time-to-exploit has collapsed to roughly **4 hours** while median remediation sits at **43 days**; CVE volume has gone past **59,000/year** and the public enrichment data (NVD) is degrading; and as of the **2026 Verizon DBIR, vulnerability exploitation is the #1 initial-access vector, roughly twice phishing.** A human-paced, CVSS-sorted patch programme cannot close a gap that runs the wrong way by two orders of magnitude. The answer is not "patch faster." It is to **stop using the vulnerability list as the unit of work**, size remediation into time-budgeted quanta, contain the few that matter in hours, make the rest not matter through architecture, and feed every exploited path back into a shorter kill chain.
 ---
 ## What a quantum is
 A **quantum** is the smallest unit of remediation that:
 1. **Fully closes a specific exploitable path** — not a CVE in the abstract, a path an adversary could actually walk.
 2. **Is sized to a time budget it can actually be completed within** — hours, days, or a sprint.
 3. **Ends in a verifiable signal** — a test that proves the path is closed, not a ticket marked done.
 The word is chosen deliberately:
 - **Atomic.** You cannot ship half a quantum and claim half the protection. A patch on 80% of the fleet, or a rule applied but never verified to block, is a *ghost patch* — fully exploitable and now invisible. A quantum is all-or-nothing.
 - **Discrete.** Work is packetised into units that fit the time available, not smeared across an infinite backlog. An undifferentiated backlog has no front; quanta give it one.
 ---
 ## The sort key: time-to-existential-impact
 Quanta are ordered not by severity but by **time-to-existential-impact**, a function of three things the *environment* determines — not the CVE:
 > **time-to-existential-impact = f( kill-chain position, reachability, exploit availability )**
 | Factor | Question | Where it comes from |
 |--------|----------|---------------------|
 | **Kill-chain position** | Does this sit on a path to existential compromise? | [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md), BloodHound, the diagnostic |
 | **Reachability** | Can the adversary actually get to it (internet-facing, one hop from T0, behind segmentation)? | Network topology, external scan, [Perimeter Scanning](../playbooks/perimeter-scanning-capability.md) |
 | **Exploit availability** | Is there a working exploit in the wild now? | CISA KEV, exploit databases, threat intel |
 The same CVE has a different quantum on different assets, because position, not severity, sets the clock. **A 9.8 on a segmented, unreachable, non-privileged host is a sprint quantum. A 7.5 on an internet-facing box one hop from a domain controller is an hours quantum.** This is the Book I principle — kill-chain position changes the priority, not the score — made operational.
 ---
 ## The four quanta
 | Quantum | Time budget | What's in it | The response | Lane character |
 |---------|-------------|--------------|--------------|----------------|
 | **Critical** | **Hours** | On the kill chain, reachable, exploit available now | **Compensating control, not the patch** — sever reachability, edge-block, isolate, disable feature. Patch follows later. | Must be partly **autonomous**; human at policy boundary |
 | **Severe** | **Days** | Material risk; reachable with friction, or partial compensating cover | Batched, completed and verified inside one short change window | Human-run, tightly scheduled |
 | **Standard** | **Sprint** | The long, real, non-urgent tail | Drained in sprint-sized batches that can actually be finished; this is where patch velocity is the right tool | Routine engineering rhythm |
 | **Dark** | **Unsized** | Can't see the asset, can't establish reachability, can't determine exploitability | **Route to discovery** — turn an uncharacterised risk into a sized quantum | Discovery, not remediation |
 ### Why "compensating control, not the patch" for the critical quantum
 You cannot meet an hours budget with a vendor patch cycle, and often the patch does not exist yet. So the critical quantum's job is **not to fix the vulnerability — it is to move the asset out of the hours-window** by the cheapest fast control available: cut the reachability, block at the edge, isolate the host, disable the vulnerable feature, pull it behind the WAF. A 4-hour time-to-impact becomes a non-urgent one, and the actual patch drops into the standard lane on the normal change calendar. Reachability is almost always faster to change than a patch is to ship — which makes **reachability the fastest remediation you own.**
 ### Why the dark quantum is the most dangerous
 The old model ignores the dark quantum because it has no score. That is exactly backwards: an uncharacterised risk on an unknown asset is how estates die. A *known* severe is safer than an *unknown* nothing, because you can plan around the known one. The antifragile move is to spend judgement converting dark quanta into sized ones — which is why discovery (the [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md), [zero-budget discovery](../playbooks/zero-budget-vulnerability-discovery.md), osquery) is part of vulnerability management, not separate from it.
 ---
 ## The barbell: contain fast or architect away — never the fragile middle
 ```
  CHEAP / FAST / REVERSIBLE                                 SLOW / STRUCTURAL / DURABLE
  Hours-lane compensating controls                          Segmentation, least privilege,
  (edge block, isolate, cut reachability)                   T0 protection, assume-breach
  ── wins the time race the patch can't ──                  ── makes ~90% of vulns not matter ──
            ◄──────────────  THE FRAGILE MIDDLE TO AVOID  ──────────────►
            The aging "critical patch backlog": carries hours-lane urgency,
            moves at sprint-lane speed. Max anxiety, min protection,
            and the attacker clears it for you one exploited host at a time.
 ```
 Both ends of the barbell are convex (small cost, large payoff — Pillar 5). The fragile middle is concave (maximum cost, minimum return). The rule: **contain it fast, or architect it away. Never let it age in the middle.**
 ---
 ## The ~90% subtraction — via negativa applied to the list
 The single highest-leverage move, and it is pure subtraction. Industry data suggests **roughly 90% of "critical" vulnerabilities are not exploitable in a given environment** once compensating controls, reachability, and segmentation are mapped. So before adding any work:
 1. Map, per asset: internet reachability, EDR coverage, WAF rules, segmentation distance from T0.
 2. Delete the false urgency on everything segmented, unreachable, or already neutralised.
 3. What remains — the genuinely reachable, genuinely exploitable ~10% — is the only thing the hours- and days-lanes ever touch.
 This turns "40,000 criticals" into a few hundred real findings and a few dozen on fire. The compensating-control map that makes it possible is **the single most valuable artefact in the programme** — build it before the incident, because during a zero-day it answers "are we actually exposed?" in minutes instead of days. The caveat (Book I): a mapped control that has rotted into a ghost is a false negative. **Test the controls you are counting on; do not trust the map.**
 ---
 ## The feedback loop — the antifragile difference
 A vulnerability that was exploited or nearly exploited is the cheapest penetration test you will ever get. Patching the CVE wastes the data. The antifragile move is to **sever the path** the attacker used — boundary the flat segment, collapse the over-privileged service account, pull the reachable management interface behind the bastion — so the *next* vulnerability that lands there is a non-event before it is even disclosed.
 **The metric is not MTTR. It is: did the kill chain get shorter?** Ten incidents that produce ten patches and zero severed paths mean you are merely fast. Ten incidents that produce six structurally shortened kill chains mean the estate is getting harder to compromise every time it is tested — the only honest definition of antifragile.
 ---
 ## Running it in an engagement — the sequence
 1. **Discover** — run the [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md) to map assets, reachability, and the shortest existential path. Anything you cannot characterise is a dark quantum; route it to deeper discovery.
 2. **Subtract** — apply the ~90% reduction using the compensating-control and reachability map. Delete false urgency.
 3. **Size** — place every remaining real finding into a quantum (critical / severe / standard) by time-to-existential-impact.
 4. **Contain the hours-lane** — apply compensating controls to the critical quantum *today*, autonomously where guardrails allow ([AI-Assisted TVM](../playbooks/ai-assisted-tvm.md)). Verify each closes with a signal.
 5. **Batch the rest** — days-lane in the next change window, sprint-lane in the engineering rhythm.
 6. **Architect away the middle** — feed the recurring paths into segmentation and least-privilege work (Books II–V) so the same class of vulnerability stops mattering.
 7. **Close the loop** — after every exploited-or-near finding, ask what path got shorter, and track that number over time.
 ---
 ## What to measure
 | Metric | Why it matters | Antifragile target |
 |--------|----------------|--------------------|
 | Critical-quantum containment time | The hours-lane is the race you must not lose | Hours, trending down |
 | % of "criticals" confirmed reachable | Proves the ~90% subtraction is real, not assumed | Known, not "unknown" |
 | Ghost-patch rate (closed-but-unverified) | Half-done remediation is hidden full exposure | Zero — every quantum closes with a signal |
 | Dark-quantum count | Uncharacterised risk is the dangerous kind | Shrinking; each one converted to sized |
 | **Kill-chain length after incidents** | The only measure of getting *stronger* | Shorter after each exploited-or-near event |
 | Items aging in the fragile middle | The concave zone the barbell forbids | Zero — contained or architected, never aging |
 ---
 ## Honest uncertainty
 The headline statistics (the 4-hour, 43-day, ~59,000-CVE, ~90%-not-exploitable, and "#1, ~2× phishing" figures) are point-in-time and churn annually — re-check them against the current DBIR, M-Trends, and FIRST/CVE data before putting them on a slide. The *direction* is the stable signal; the numbers move. The autonomous-execution tooling for the hours-lane is real but immature and fast-moving — verify current capability and failure modes, and start with reversible compensating controls, never irreversible change. What does not churn: kill-chain position beats CVSS, most criticals aren't reachable, a half-done remediation is a hidden full vulnerability, and every exploited path should shorten the chain.
 ---
 *See [Book VII — Vulnerability Management](../books/06-vulnerability-management.md) for the full philosophy, [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md) for sizing the quanta in unknown territory, and [AI-Assisted TVM Blueprint](../playbooks/ai-assisted-tvm.md) for automating the hours-lane.*
@@ -32,7 +32,7 @@ When you outsource a security function, you should retain three capabilities int
 | Retained Capability | Why It Cannot Be Outsourced | What It Produces |
 |--------------------|---------------------------|------------------|
-| **Detection Engineering** | Only you know what "normal" looks like in your environment. Only you can write rules that detect anomalies specific to your architecture, your applications, and your user behaviours. | Custom detection rules (KQL, Sigma, YARA, Wazuh) and M365-specific detections via AOC that catch threats generic rules miss |
+| **Detection Engineering** | Only you know what "normal" looks like in your environment. Only you can write rules that detect anomalies specific to your architecture, your applications, and your user behaviours. | Custom detection rules (KQL, Sigma, YARA, Wazuh) and M365-specific detections via PULSAR that catch threats generic rules miss |
 | **Threat Context & Prioritization** | Only you know which assets are crown jewels. Only you can prioritize a vulnerability on your payment gateway over a vulnerability on your marketing blog. | Risk-ranked remediation that aligns with business impact |
 | **Integration & Orchestration** | Only you can connect the SOC to your change management, your identity team, your OT engineers, and your executives. | Closed-loop incident response that produces structural improvement |
@@ -42,6 +42,7 @@ Operational and persuasion documents used in engagements. **Start every new clie
 | [Antifragile Manifest](core/antifragile-manifest.md) | Five pillars of antifragile enterprise | Executives, Architects, Consultants |
 | [AI Sovereignty Framework](core/ai-sovereignty-framework.md) | Strategic arguments and implementation for local AI | CISOs, CTOs, Security Architects |
 | [T0 Asset Framework](core/t0-asset-framework.md) | Tier 0 classification and protection for critical assets | Security Architects, Infrastructure Leads |
 | [Quantum Vulnerability Management](core/quantum-vulnerability-management.md) | Sizing remediation into time-budgeted quanta (hours/days/sprint/dark) for the exploitation-first era; companion to Book VII | CISOs, Vulnerability Management, Consultants |
 | [Spontaneous Order Principles](core/spontaneous-order-principles.md) | Philosophical foundation for the five pillars | Executives, Architects, Strategists |
 ## Playbooks
@@ -51,6 +52,7 @@ Operational and persuasion documents used in engagements. **Start every new clie
 | [Rapid Modernisation Plan](playbooks/rapid-modernisation-plan.md) | 30-60-90-180 day transformation roadmap | Program Managers, Consultants, CISOs |
 | [Endpoint Management Entry Vector](playbooks/endpoint-management-entry-vector.md) | Intune/device management as the ideal engagement entry point | M365 Consultants, Account Managers |
 | [AI-Assisted TVM Blueprint](playbooks/ai-assisted-tvm.md) | AI-powered vulnerability management for AI-powered adversaries | CTOs, CISOs, Vulnerability Management |
 | [Kill Chain Assessment App](playbooks/kill-chain-assessment-app.md) | Spec for the offline tool that maps unknown estates into an attack graph, computes the shortest existential path, and sizes quanta. Tool: [`tools/kill-chain-assessment.html`](tools/kill-chain-assessment.html) | Consultants, Assessors, Security Architects |
 | [Zero-Budget Vulnerability Discovery](playbooks/zero-budget-vulnerability-discovery.md) | Script-based and osquery-based server/container vuln discovery without Tenable/Qualys | Security Engineers, Consultants |
 | [Perimeter Scanning Capability](playbooks/perimeter-scanning-capability.md) | External attack surface strategy: build, partner, or hybrid | Security Architects, Consultants |
 | [Osquery: The Sovereign Discovery Platform](playbooks/osquery-custom-platform.md) | Build a custom vulnerability and asset inventory platform on osquery | Security Engineers, Consultants, CTOs |
@@ -59,7 +61,9 @@ Operational and persuasion documents used in engagements. **Start every new clie
 | [AD and Endpoint Hardening](playbooks/ad-endpoint-hardening.md) | On-prem AD, Windows endpoints, hybrid identity | Infrastructure Consultants, Security Engineers |
 | [Zero-Budget Hardening](playbooks/zero-budget-hardening.md) | Maximize existing tools, minimize new purchases | Consultants, CISOs, IT Managers |
 | [Implementation Playbook](playbooks/implementation-playbook.md) | Tactical step-by-step delivery guide | Technical Leads, Security Engineers |
-| [Sovereign Tool Stack](playbooks/sovereign-tool-stack.md) | Open-source arsenal: Prowler, BloodHound, CISO Assistant, ASTRAL, AOC, Wazuh, Shuffle | Consultants, CTOs, CISOs |
+| [Sample Engagement: Mid-Market Hybrid](playbooks/sample-engagement-mid-market.md) | Complete worked example: 500 employees, AD+M365 E3, NIS2 scope — findings, kill chain, module sequence, Day 30/90/180 deliverables, populated backlog | Consultants, New Hires |
 | [CQRE Product Suite](playbooks/cqre-product-suite.md) | ASTRAL, PULSAR, and AURORA: product details, framework alignment, deployment, and positioning | Consultants, Account Managers |
 | [Sovereign Tool Stack](playbooks/sovereign-tool-stack.md) | Full arsenal: Prowler, BloodHound, CISO Assistant, ASTRAL, PULSAR, AURORA, Wazuh, Shuffle | Consultants, CTOs, CISOs |
 | [Privileged Access Architecture](playbooks/privileged-access-architecture.md) | PAM design: Teleport, Tailscale/Headscale, JIT access, vendor access governance | Security Architects, Infrastructure Consultants, OT Leads |
 | [Sovereign Communications](playbooks/sovereign-communications.md) | Delta Chat chatmail relay, Matrix/Element, crisis out-of-band channels | CISOs, Operations Leads, Incident Response |
 | [Business Case Template](playbooks/business-case-template.md) | Financial justification, ROI, risk quantification | CFOs, Boards, Consultants |
@@ -83,6 +87,8 @@ Operational and persuasion documents used in engagements. **Start every new clie
 | Document | Purpose | Audience |
 |----------|---------|----------|
 | [Assessment Team Guide](assessment-templates/assessment-team-guide.md) | Technical execution guide for the Brownhat Diagnostic: tool sequence, what to run, what to look for, kill chain synthesis, report structure | Assessors, Technical Consultants |
 | [Findings Backlog](assessment-templates/findings-backlog.md) | Single source of truth for all findings across every engagement; input queue for the housekeeping stream; pragmatic alternative to a formal risk register | Consultants, IT Leads, Client Teams |
 | [NIST CSF 2.0 Baseline Assessment](assessment-templates/nist-csf-baseline.md) | The Brownhat Diagnostic: structured 2-half-day workshop, gap analysis, prioritised module roadmap | Consultants, CISOs, IT Managers |
 | [NIST CSF 2.0 — česká verze](assessment-templates/nist-csf-baseline-cs.md) | Brownhat Diagnostika: dotazníky a průvodce workshopem v češtině | Consultants running Czech-language workshops |
 | [Module Completion Report](assessment-templates/module-completion-report.md) | Template for the deliverable package at the end of every module | Consultants |
@@ -125,25 +131,26 @@ Operational and persuasion documents used in engagements. **Start every new clie
 8. [NIST CSF 2.0 Baseline Assessment](assessment-templates/nist-csf-baseline.md) — run this first with every new client (the Brownhat Diagnostic)
 9. [Modular Engagements](core/modular-engagements.md) — the full module menu (Modules 1–14) and platform adaptation guide
-10. [Sovereign Tool Stack](playbooks/sovereign-tool-stack.md) — the full arsenal: CQRE tools, open-source stack, commercial partnerships, and when to use each
+10. [CQRE Product Suite](playbooks/cqre-product-suite.md) — ASTRAL, PULSAR, and AURORA: what they do, how they fit the framework, and how to deploy them
-11. [M365 E3 Hardening](playbooks/m365-e3-hardening.md) — primary client environment for MS clients (most are E3)
+11. [Sovereign Tool Stack](playbooks/sovereign-tool-stack.md) — the full arsenal: CQRE tools, open-source stack, commercial partnerships, and when to use each
-12. [AD and Endpoint Hardening](playbooks/ad-endpoint-hardening.md) — on-premises identity and endpoint depth
+12. [M365 E3 Hardening](playbooks/m365-e3-hardening.md) — primary client environment for MS clients (most are E3)
-13. [Privileged Access Architecture](playbooks/privileged-access-architecture.md) — Module 13: Teleport, Tailscale/Headscale, JIT access, vendor remote access governance
+13. [AD and Endpoint Hardening](playbooks/ad-endpoint-hardening.md) — on-premises identity and endpoint depth
-14. [Sovereign Communications](playbooks/sovereign-communications.md) — Module 14: Delta Chat chatmail relay, Matrix/Element, crisis out-of-band channels
+14. [Privileged Access Architecture](playbooks/privileged-access-architecture.md) — Module 13: Teleport, Tailscale/Headscale, JIT access, vendor remote access governance
 15. [Sovereign Communications](playbooks/sovereign-communications.md) — Module 14: Delta Chat chatmail relay, Matrix/Element, crisis out-of-band channels
 **Reference when needed:**
-15. [AI Sovereignty Framework](core/ai-sovereignty-framework.md) — persuasive arguments and objection handling
+16. [AI Sovereignty Framework](core/ai-sovereignty-framework.md) — persuasive arguments and objection handling
-16. [AI Operations Inevitability](core/ai-operations-inevitability.md) — why defensive AI is not optional
+17. [AI Operations Inevitability](core/ai-operations-inevitability.md) — why defensive AI is not optional
-17. [Organizational Resilience](core/organizational-resilience.md) — shift left and Dev/Sec/Ops merger talking points
+18. [Organizational Resilience](core/organizational-resilience.md) — shift left and Dev/Sec/Ops merger talking points
-18. [Retained Capability](core/retained-capability.md) — what to keep in-house when outsourcing SOC, pentest, compliance
+19. [Retained Capability](core/retained-capability.md) — what to keep in-house when outsourcing SOC, pentest, compliance
-19. [Zero-Budget Hardening](playbooks/zero-budget-hardening.md) — extract value from existing tools in 30 days
+20. [Zero-Budget Hardening](playbooks/zero-budget-hardening.md) — extract value from existing tools in 30 days
-20. [Zero-Budget Vulnerability Discovery](playbooks/zero-budget-vulnerability-discovery.md) — script-based and osquery-based discovery before scanner procurement
+21. [Zero-Budget Vulnerability Discovery](playbooks/zero-budget-vulnerability-discovery.md) — script-based and osquery-based discovery before scanner procurement
-21. [Osquery: The Sovereign Discovery Platform](playbooks/osquery-custom-platform.md) — build owned vulnerability and asset inventory capability
+22. [Osquery: The Sovereign Discovery Platform](playbooks/osquery-custom-platform.md) — build owned vulnerability and asset inventory capability
-22. [Rapid Modernisation Plan](playbooks/rapid-modernisation-plan.md) — structured engagement roadmap
+23. [Rapid Modernisation Plan](playbooks/rapid-modernisation-plan.md) — structured engagement roadmap
-23. [Implementation Playbook](playbooks/implementation-playbook.md) — tactical delivery guidance
+24. [Implementation Playbook](playbooks/implementation-playbook.md) — tactical delivery guidance
-24. [Vertical: Power and Utilities](reference/vertical-power-utilities.md), [Vertical: Telco](reference/vertical-telco.md), or [Vertical: Banking](reference/vertical-banking.md) — sector-specific adaptations
+25. [Vertical: Power and Utilities](reference/vertical-power-utilities.md), [Vertical: Telco](reference/vertical-telco.md), or [Vertical: Banking](reference/vertical-banking.md) — sector-specific adaptations
-25. [CIS Controls Mapping](reference/cis-controls-mapping.md) and [NIST CSF Mapping](reference/nist-csf-mapping.md) — standards alignment for auditors and regulators
+26. [CIS Controls Mapping](reference/cis-controls-mapping.md) and [NIST CSF Mapping](reference/nist-csf-mapping.md) — standards alignment for auditors and regulators
 ---
@@ -70,7 +70,7 @@ AI-assisted TVM does not replace basic hygiene. It **accelerates it by an order
 | **Cloud security posture** (Defender for Cloud, Prisma, Wiz) | Cloud resource misconfigurations | AI identifies cloud-specific kill chains (e.g., overly permissive S3 → compromised IAM → lateral movement) |
 | **Zero-budget discovery** (PowerShell, SSH scripts, Syft/Grype, osquery) | Server inventory, SBOMs, package-level CVE correlation | AI aggregates script-based findings into unified risk view. See [Zero-Budget Vulnerability Discovery](zero-budget-vulnerability-discovery.md) |
 | **osquery + FleetDM** | Cross-platform endpoint inventory, real-time process/network data, policy compliance | AI queries live endpoint state for prioritization and kill chain simulation. See [Osquery: The Sovereign Discovery Platform](osquery-custom-platform.md) |
-| **AOC (Admin Operations Center)** | M365 audit log intelligence, anomalous admin behaviour, privilege escalation detection | AI enriches insider-threat context with external vulnerability data for complete kill chain picture. See [Sovereign Tool Stack](sovereign-tool-stack.md) |
+| **PULSAR (Platform for Unified Log Search, Alerting & Review)** | M365 audit log intelligence, anomalous admin behaviour, privilege escalation detection | AI enriches insider-threat context with external vulnerability data for complete kill chain picture. See [Sovereign Tool Stack](sovereign-tool-stack.md) |
 | **Prowler** | Multi-cloud security posture (AWS, Azure, GCP) | AI correlates cloud misconfigurations with endpoint and identity findings for cross-layer risk scoring. See [Sovereign Tool Stack](sovereign-tool-stack.md) |
 | **Attack surface management** (Cortex Xpanse, Shodan, Nuclei, Amass) | External-facing assets unknown to IT | AI maps shadow IT and forgotten assets faster than manual discovery. See [Perimeter Scanning Capability](perimeter-scanning-capability.md) |
 | **Software bill of materials (SBOM)** | Known vulnerable components in applications | AI monitors SBOMs against real-time CVE disclosure and exploit availability |
@@ -0,0 +1,292 @@
 # Assignment: Conditional Access Architecture
 > *CA policies are enforcement points, not audit tools. A policy in report-only mode is a sensor. A policy in enabled mode is a wall. Know which you're building before you start.*
 This is a **scoped assignment package** — a complete, principled delivery guide for one specific client brief. It can be delivered standalone or immediately after [Assignment: Identity Baseline](assignment-identity-baseline.md). If identity baseline has not been completed, the prerequisites section below applies first.
 ---
 ## The Brief
 Client requests that fall within this scope:
 - *"Review our Conditional Access policies — we're not sure they're right"*
 - *"We need to enforce MFA properly, not just per-user MFA"*
 - *"Our auditor wants evidence of access controls"*
 - *"We got a new employee and nobody knows how access actually works"*
 - *"We bought E5 and want to use the CA features"*
 - *"We need compliant devices to be required for access"* (if Intune baseline is already deployed)
 This assignment does not require executive sponsorship. It requires one named IT lead with Global Administrator access, tolerance for a 72-hour report-only period per policy before enforcement, and awareness that policy changes affect all users.
 ---
 ## Scope Boundary
 **In scope:**
 - Audit of all existing CA policies (coverage, gaps, naming, exclusions, mode)
 - Design and documentation of a complete CA policy set
 - Staged deployment of the baseline policy set (identity-level controls)
 - Device compliance integration if Intune compliance policies are already active
 - Named locations configuration
 - Authentication strengths configuration (phishing-resistant MFA for admins)
 **Out of scope:**
 - Intune compliance policy configuration → [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md)
 - Microsoft Defender for Cloud Apps session controls (app-enforced restrictions are in scope; MDCA-dependent session policies are not)
 - Privileged Identity Management configuration → privileged access engagement
 - Identity Baseline (MFA registration, legacy auth, admin account hygiene) → [Assignment: Identity Baseline](assignment-identity-baseline.md)
 **Dependency:** This assignment can configure device compliance as a CA signal, but only if Intune compliance policies are already active and returning compliance state for enrolled devices. If Intune is not deployed, the device-compliance policies in this assignment are designed in report-only mode and left for activation when Intune is ready. Do not activate device-compliance CA policies against an environment where device enrollment is incomplete — the result is a broad lockout.
 ---
 ## Before You Touch Anything
 **1. Break-glass confirmation.**
 Before touching any CA policy, confirm that two cloud-only break-glass Global Admin accounts exist and are excluded from all CA policies. If they do not exist, create them and configure sign-in alerts before proceeding. See [Assignment: Identity Baseline](assignment-identity-baseline.md) for the break-glass standard. This step is non-negotiable — a misconfigured CA policy with no break-glass is a full tenant lockout.
 **2. CAExporter baseline.**
 Export all existing CA policies using [CAExporter](https://github.com/merill/caexporter). Store the JSON export as the before-state. Every change is measurable against it. This is also the rollback reference.
 **3. Per-user MFA audit.**
 Run the per-user MFA state report (Entra admin center → Users → Per-user MFA). If per-user MFA is enabled for any accounts, document it. Per-user MFA and CA-enforced MFA operate on separate control planes and interact unpredictably: a user with per-user MFA *enforced* may bypass some CA policies. Resolution is part of Step 3 below.
 **4. Sign-in log baseline.**
 Export 30 days of sign-in logs. Note the distribution of authentication methods in use, client application types (modern vs. legacy), and any conditional access results (success, failure, report-only). This is the baseline against which policy impact is measured.
 ---
 ## Principles Applied
 **Automation over procedure.**
 A CA policy enforces MFA whether or not anyone remembers to ask for it. A checklist does not. Every identity control in this assignment is implemented as a CA policy — self-enforcing, continuous, requiring no human decision to operate after deployment.
 **Kill chain first.**
 The policy set in this assignment is sequenced by structural impact. Legacy auth block and universal MFA enforcement come first because they close the widest attack path. Device compliance, location controls, and session policies come after. If the engagement ends early, the first two policies are the ones that matter.
 **Explicit design, documented intent.**
 Every policy deployed in this assignment has a documented name, purpose, conditions, grant controls, exclusions, and the date it was set to enabled. A CA policy with no documented intent is a liability: nobody can safely modify it, nobody knows if it can be removed, and future administrators work around it rather than through it. The leave-behind package for this assignment is the policy design document — not just the JSON export.
 **Report-only before enforcement.**
 Every new policy goes to report-only mode for a minimum of 48–72 hours. Sign-in logs are reviewed during that window to confirm expected behavior before enforcement. This is not optional. The cost of a production lockout — even for 30 minutes — is higher than the cost of 72 hours' delay.
 ---
 ## Delivery Architecture
 ### Step 1 — Audit (no changes)
 Document the current state honestly. The finding is not a criticism of the IT team — it is the starting point.
 | Action | Output |
 |--------|--------|
 | CAExporter export | CA policy baseline JSON and human-readable summary |
 | Per-user MFA state export | Accounts with per-user MFA enforced vs. disabled vs. not configured |
 | Policy coverage matrix | Every policy: name, state (enabled/report-only/disabled), conditions, grant, exclusions, last modified, named owner |
 | Gap analysis | Conditions with no coverage; duplicate coverage; exclusion lists with individual accounts |
 | Sign-in log review | Authentication methods in use; legacy auth clients; CA policy results |
 | Named locations inventory | Trusted IPs and named locations configured, if any |
 Deliver the audit findings to the named client lead before writing any policies. The coverage matrix should be readable without technical background — each row is one policy, each column answers one question. Include a plain-language summary: "You have 14 policies. Three are disabled and appear forgotten. Two overlap in ways that may create gaps. Five have no named owner and no documented purpose. Legacy authentication is not blocked at the CA level."
 ---
 ### Step 2 — Design
 Before deploying anything, produce the complete policy set design on paper (or in a document). Every policy defined, every exclusion justified, every interaction between policies mapped. Review with the named client lead before deployment begins.
 The policy set is designed in three layers. Deploy them in order.
 **Layer 1 — Identity controls (no device dependency)**
 These work immediately, without Intune or any device management. Deploy first.
 **Layer 2 — Admin controls (elevated requirements for privileged roles)**
 Stricter controls applied specifically to accounts holding privileged roles. Deploy after Layer 1 is stable.
 **Layer 3 — Device and session controls (Intune dependency)**
 Require device compliance as a CA signal. Deploy only when Intune compliance policies are active and returning results. Design these policies now; activate them when the Intune assignment is complete.
 ---
 ### Step 3 — Deploy Layer 1 (staged)
 Each policy follows the same deployment sequence:
 1. Create policy in **report-only** mode
 2. Wait 48–72 hours; review sign-in logs for the policy's report-only results
 3. Identify any legitimate traffic that would be blocked; create exclusion groups or refine conditions
 4. Switch to **enabled**
 5. Monitor sign-in logs for 24 hours
 6. Only then move to the next policy
 Do not deploy multiple policies simultaneously. Each policy change has independent blast radius; sequential deployment makes causality clear when something breaks.
 **Legacy authentication block first.** This is the one control that cannot afford to be partially deployed. If legacy auth is blocked via CA but not via Entra authentication policies, a policy gap in CA can allow legacy auth through. Confirm after deployment that the sign-in log shows zero legacy auth sign-ins. Zero is the only acceptable result.
 **Per-user MFA resolution.** After CA-enforced MFA is active for all users, disable per-user MFA for all accounts except break-glass. Leaving both active creates a split control plane. The CA policy is the authoritative control; per-user MFA is the legacy mechanism. They should not coexist once CA is stable.
 ---
 ## The Baseline Policy Set
 This is the policy set to deploy on every engagement. Adapt scope and exclusions to the client's environment; do not adapt the design principles.
 **Naming convention:**
 `CA-[Audience]-[Condition or Trigger]-[Grant or Block]`
 Examples: `CA-AllUsers-LegacyAuth-Block`, `CA-Admins-AllApps-RequirePhishingResistantMFA`
 Consistent naming is not aesthetic preference — it is the difference between a policy set that can be maintained and one that accumulates technical debt.
 **Exclusion groups:**
 All exclusions use Entra ID security groups, never individual accounts (except break-glass, which is excluded by account). Group membership is reviewed as part of the leave-behind. A group named `CA-Exclusion-BreakGlass` is named and owned; an individual account exclusion is invisible in aggregate policy review.
 ---
 ### Layer 1 — Identity Controls
 | Policy | Conditions | Grant / Block | Notes |
 |--------|-----------|---------------|-------|
 | `CA-AllUsers-LegacyAuth-Block` | All users / All cloud apps / Legacy auth clients (Exchange ActiveSync + Other clients) | Block | Deploy first. Confirm zero legacy auth in sign-in logs post-enforce. |
 | `CA-AllUsers-AllApps-RequireMFA` | All users / All cloud apps / All platforms / Exclude break-glass group | Require MFA | Core enforcement. Deploy second. Resolve per-user MFA conflict after this is stable. |
 | `CA-GuestUsers-AllApps-RequireMFA` | Guest and external users / All cloud apps | Require MFA | Separate policy: guests often require different exclusion handling. |
 **E3 stops here for identity-layer controls.** Risk-based policies (sign-in risk, user risk) require Entra ID P2. If the client has P2 licensing, add:
 | Policy | Conditions | Grant / Block | Notes |
 |--------|-----------|---------------|-------|
 | `CA-AllUsers-HighUserRisk-RequirePasswordChange` | All users / High user risk | Require MFA + password change | P2 required. Requires Identity Protection enabled. |
 | `CA-AllUsers-MedHighSignInRisk-RequireMFA` | All users / Medium and High sign-in risk | Require MFA | P2 required. Step-up for risky sign-ins. |
 ---
 ### Layer 2 — Admin Controls
 | Policy | Conditions | Grant / Block | Notes |
 |--------|-----------|---------------|-------|
 | `CA-Admins-AllApps-RequirePhishingResistantMFA` | Directory roles (Global Admin, Privileged Role Admin, Security Admin, Exchange Admin, SharePoint Admin, User Admin, Conditional Access Admin, Application Admin) / All cloud apps | Require authentication strength: Phishing-resistant MFA | Phishing-resistant = FIDO2 security key, Windows Hello for Business, or certificate-based auth. Requires auth strength configured in Entra. Standard Authenticator push is not phishing-resistant. |
 | `CA-Admins-AllApps-RequireCompliantOrHybridDevice` | Same role scope / All cloud apps | Require compliant device OR hybrid Azure AD joined | Layer 3 control applied early to admins specifically. Activate this even before broad device compliance enforcement if Intune covers admin workstations. |
 **Why admins get a separate, stricter policy set:** Admin credentials are the highest-value target in the tenant. An attacker who can bypass MFA on an admin account owns the tenant. Standard Authenticator push MFA is bypassed by MFA fatigue attacks (request flooding until the user approves). Phishing-resistant MFA is not. The separation in the policy set makes it explicit that admin accounts have a different requirement — and makes it auditable.
 ---
 ### Layer 3 — Device Controls (activate when Intune is ready)
 Design these policies now. Activate them after [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md) is complete and device compliance results are stable.
 | Policy | Conditions | Grant / Block | Notes |
 |--------|-----------|---------------|-------|
 | `CA-AllUsers-AllApps-RequireCompliantDevice` | All users / All cloud apps / All platforms | Require compliant device OR require MFA | Start with OR (compliant device OR MFA) — gives unmanaged-device users a path via MFA. Once enrollment is high enough, switch to AND or compliant-only. |
 | `CA-AllUsers-SensitiveApps-RequireCompliantDevice` | All users / Exchange Online + SharePoint Online / All platforms | Require compliant device | Strict. Apply to sensitive apps first before all apps. |
 | `CA-AllUsers-UnmanagedDevice-AppEnforcedRestrictions` | All users / Exchange Online + SharePoint Online / Any platform / Filter: not compliant, not hybrid-joined | Session: app-enforced restrictions (use limited web access) | Limits download and sync on unmanaged devices accessing mail and documents. Requires Exchange Online and SharePoint to be configured for app-enforced restrictions. E3-compatible. |
 The `CA-AllUsers-UnmanagedDevice-AppEnforcedRestrictions` policy is the most immediately valuable Layer 3 control for E3 clients without full Intune enrollment — it degrades access rather than blocks it, which is easier to deploy without user disruption.
 ---
 ### Named Locations (supporting the policy set)
 Configure named locations before deploying any location-based policies.
 | Location | Purpose |
 |----------|---------|
 | **Trusted corporate networks** | Office IP ranges. Used to relax MFA requirements on trusted networks if the client explicitly requests it. Default recommendation: do not relax MFA on any network — trusted location is less durable than device compliance. |
 | **High-risk countries** (optional) | Countries from which the client has no operations and no expected sign-ins. Can be used to block access or require MFA as a step-up. Use carefully: VPN exit nodes and mobile roaming will trigger this. Document the decision. |
 Named locations are often requested but rarely worth the operational overhead unless the client has a specific use case (blocking sign-ins from a defined list of countries, or relaxing physical office controls). Include in the design document; deploy only if the client has a clear requirement.
 ---
 ## Structural Resilience Checklist
 Controls that hold without ongoing human willingness after this engagement closes.
 - [ ] `CA-AllUsers-LegacyAuth-Block` is **enabled** — not report-only — and sign-in logs confirm zero legacy auth clients
 - [ ] `CA-AllUsers-AllApps-RequireMFA` is **enabled** and covers all users including guests (separate guest policy)
 - [ ] `CA-Admins-AllApps-RequirePhishingResistantMFA` is **enabled** and authentication strength is configured
 - [ ] Per-user MFA has been disabled for all accounts after CA-enforced MFA is stable (except break-glass)
 - [ ] All exclusions use named Entra ID groups — no individual account exclusions except break-glass
 - [ ] Every policy has a documented name, intent, owner, and date of last review
 - [ ] CAExporter export (before and after) stored in client documentation
 - [ ] Layer 3 policies exist in **report-only** mode, ready for activation when Intune is complete
 ---
 ## Kill Chain Contribution
 **What this assignment closes:**
 | Attack vector | Control deployed |
 |---------------|-----------------|
 | Password spray with no MFA prompt | `CA-AllUsers-AllApps-RequireMFA` |
 | MFA fatigue attack against admin accounts (push flooding) | `CA-Admins-AllApps-RequirePhishingResistantMFA` |
 | Legacy protocol abuse (SMTP AUTH, IMAP, Basic Auth REST) | `CA-AllUsers-LegacyAuth-Block` |
 | Credential stuffing from breached credential lists | MFA enforcement |
 | Guest account lateral movement through weakly controlled external access | `CA-GuestUsers-AllApps-RequireMFA` |
 | Unmanaged device access to sensitive apps (if Layer 3 activated) | `CA-AllUsers-UnmanagedDevice-AppEnforcedRestrictions` |
 **What this assignment does not close:**
 | Remaining gap | Addressed by |
 |---------------|-------------|
 | Adversary-in-the-middle / session token theft post-MFA | Device compliance in CA + Entra token protection (P2) |
 | Unmanaged device as unrestricted access vector | [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md) + Layer 3 activation |
 | Standing admin privilege (long-lived sessions, no JIT) | Privileged access engagement (PIM) |
 | Sign-in risk and impossible travel detection | Entra ID P2 Layer 1 additions |
 | App permission abuse (OAuth consent phishing) | Service identity engagement |
 The residual gap the client is most likely to feel: a stolen session token (from phishing with AiTM proxy) bypasses MFA because it captures the token after MFA completes. This is the next-generation phishing technique. Mitigating it requires token binding to device compliance — a Layer 3 control — plus Entra token protection (P2 feature). Document this in the residual risk statement.
 ---
 ## Leave-Behind Package
 | Artifact | Description |
 |----------|-------------|
 | **CAExporter JSON (before)** | CA policy state at engagement start |
 | **CAExporter JSON (after)** | CA policy state at engagement close |
 | **Policy design document** | Every deployed policy: name, intent, conditions, grant/block, exclusion groups, owner, date enabled |
 | **Policy coverage matrix** | Human-readable: which users are covered by which policies, which apps, which platforms |
 | **Per-user MFA resolution record** | Confirmation that per-user MFA has been disabled post-CA deployment |
 | **Layer 3 design document** | Device compliance policies designed but not yet activated; activation prerequisites and checklist |
 | **Exclusion group inventory** | Every CA exclusion group: name, members, review cadence |
 | **Sign-in log confirmation** | Legacy auth: zero clients post-block. MFA: applied to >99% of sign-ins. |
 | **Named locations documentation** | Any configured named locations with business justification |
 | **Scope boundary log** | Every finding outside this scope, named and prioritized |
 | **Residual risk statement** | What this assignment did not close, specifically including AiTM/token theft risk |
 The Layer 3 design document is the explicit handoff to the Intune assignment. A CISO reading the leave-behind package can see exactly what was built, why, what it prevents, and what comes next — without needing to ask.
 ---
 ## Scope Boundary Signals
 | Signal | Points toward |
 |--------|--------------|
 | No Intune enrollment or compliance policies active | Intune Security Baseline assignment — activate Layer 3 after |
 | Global Admins have no phishing-resistant MFA method registered | Auth method enrollment drive; may need hardware key procurement |
 | Entra ID P2 not licensed; client has credential-stuffing exposure | Licensing recommendation: P2 for Identity Protection (cheaper than full E5) |
 | App registrations with broad Graph permissions visible in sign-in logs | Service identity engagement |
 | Service accounts authenticating with CA policies applied | Service account remediation — service accounts should use managed identities or workload identity federation, not user-like credential flows through CA |
 | Defender for Cloud Apps not licensed; session control requests needed | MDCA engagement for full session control |
 | Sign-in logs show access from unexpected geographies | Named location policy review; may warrant country block |
 | Audit log retention < 90 days | Detection baseline assignment |
 ---
 ## Buildable-On: What the Next Assignment Depends On
 The Intune Security Baseline assignment builds directly on the CA architecture deployed here. Specifically, it depends on:
 1. **`CA-AllUsers-AllApps-RequireCompliantDevice` exists in report-only mode.** The Intune assignment activates this policy as its final step — the point where device compliance becomes an access control, not just a reporting tool.
 2. **CA exclusion groups are using the right naming convention.** Device compliance policies deployed in Intune reference the same user groups used in CA. Consistent group naming prevents the Intune assignment from having to clean up CA policy exclusions mid-deployment.
 3. **Sign-in logs show MFA is enforced.** The Intune assignment cannot safely activate device-compliance CA policies if MFA enforcement is incomplete — an unmanaged device could otherwise use the compliance check as a bypass path.
 If all three conditions are true at handover, the Intune assignment can activate Layer 3 without revisiting the CA work. If any condition is false, the scope boundary log documents what needs to be resolved first.
 ---
 *For the identity foundation this builds on, see [Assignment: Identity Baseline](assignment-identity-baseline.md).*
 *For the device compliance integration that activates Layer 3, see [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md).*
 *For the technical depth on privileged access architecture that informs admin CA requirements, see [Book III — Privileged Access](../books/02-privileged-access.md).*
@@ -0,0 +1,443 @@
 # Assignment: Collaboration and Data Security
 > *Data is liquid. It leaves where you put it — copied, shared, forwarded, synced, linked. The question is never "is it locked down" but "where can it flow, who can reshare it, and can you see and reverse the flow?"*
 This is a **scoped assignment package** and the fourth in the M365 security sequence. It addresses the data and collaboration layer: how corporate data moves, where it leaks, and what structural controls reduce the blast radius when it does. It can be delivered standalone, but the device and identity controls from the preceding assignments are assumed in the residual risk analysis.
 This assignment completes the **"Secure M365"** engagement when delivered after Identity Baseline, CA Architecture, and Intune Security Baseline.
 ---
 ## The Brief
 Client requests that fall within this scope:
 - *"Secure our M365 / harden our Exchange and SharePoint"*
 - *"We're worried about data leaking through email or shared links"*
 - *"We got a phishing email and want to prevent it"*
 - *"Our auditor wants to see DLP controls"*
 - *"We need email authentication — DMARC / DKIM / SPF"*
 - *"We need to know what's being shared externally"*
 - *"Set up sensitivity labels"*
 This assignment does not require executive sponsorship. It requires one named IT lead with Global Administrator and Exchange Administrator access, tolerance for discovering that external sharing is significantly wider than assumed, and willingness to remove sharing types that users may push back on.
 ---
 ## Scope Boundary
 **In scope:**
 - External sharing exposure mapping ("Anyone" links, external guests, external shares)
 - Removal of anonymous sharing and external auto-forwarding
 - Exchange Online Protection (EOP) hardening: anti-phishing, anti-malware, anti-spam
 - Email authentication: SPF verification, DKIM enablement, DMARC deployment
 - SharePoint and OneDrive tenant-level sharing governance
 - Guest access governance: expiration, review cadence
 - Sensitivity label taxonomy and deployment (foundation: 3–4 labels)
 - DLP baseline: 3–5 known high-value patterns for Exchange, SharePoint, OneDrive
 - Audit logging verification and configuration
 - App consent governance: restrict user consent, enable admin consent workflow
 **Out of scope:**
 - Comprehensive data classification programme → separate Purview engagement
 - Defender for Office 365 P1/P2 advanced configuration (Safe Links, Safe Attachments, Attack Simulation) → E5 or add-on engagement
 - Microsoft Defender for Cloud Apps session controls → MDCA engagement
 - Retention policies and data lifecycle governance → separate Purview engagement
 - On-premises Exchange decommissioning → separate hybrid engagement
 - Cross-tenant access configuration (B2B direct connect) → out of scope unless specifically requested
 - Entitlement management and full guest lifecycle (P2 feature) → out of scope for E3
 When the client asks for comprehensive DLP — covering all data types across all services — scope it as a separate engagement. A DLP programme that attempts to cover everything produces alert fatigue that degrades the protection for the things that actually matter.
 ---
 ## Before You Touch Anything
 **1. Crown jewels question.**
 Before configuring any control, ask the named client lead one question: *"Which three data sets, if leaked, would cause the most harm to the organisation — regulatory, competitive, or reputational?"*
 If they cannot answer, that inability is finding #1. You cannot apply protection asymmetrically until you know what the asymmetry is for. Sensitivity labels, DLP policies, and restricted-site configurations all depend on this answer. If the organisation genuinely cannot identify its crown jewels, document it and apply the default framework (financial data, HR data, and strategic/M&A communications) as a starting point.
 **2. Surface map.**
 Before making any changes, enumerate the actual external exposure. The findings are almost always worse than the client assumes — and the enumeration itself, shared with the client lead, is often the moment that creates willingness for the removal steps that follow.
 Run these reports before touching configuration:
 | Report | Tool / Location |
 |--------|----------------|
 | "Anyone" (anonymous) links | SharePoint admin center → Reports → Sharing → or Graph API |
 | External shares (authenticated guest links) | SharePoint admin center → Sharing report |
 | Guest users with last sign-in date | Entra ID → External Identities → All users (filter: Guest) |
 | External auto-forwarding rules | Exchange admin center → Mail flow → Rules; or PowerShell: `Get-TransportRule` filtered for external redirect |
 | User-consented OAuth app grants | Entra ID → Enterprise applications → filter: User consent |
 | SPF, DKIM, DMARC status | MXToolbox or PowerShell DNS lookup per domain |
 | Unified Audit Log status | Compliance portal → Audit → or `Get-AdminAuditLogConfig` |
 Deliver the surface map to the named client lead before proceeding to any removal steps. State the findings plainly: "You have 847 anonymous sharing links. Fourteen mailboxes have active external forwarding rules. You have 312 guest accounts, 189 of whom have not signed in within 90 days. DMARC is not configured. Your Unified Audit Log has not been enabled."
 These are facts, not accusations. The client lead needs to see the actual exposure before approving the removal steps.
 ---
 ## Principles Applied
 **Remove first, then govern.**
 The highest-impact actions in this assignment are removals: anonymous links, external auto-forwarding, over-permissioned OAuth grants. These are not governance gaps — they are open doors. No amount of sensitivity labelling or DLP configuration compensates for an anonymous sharing link that routes around every identity control built in the preceding three assignments. Subtraction comes first.
 **Name the crown jewels before you protect them.**
 Even-spreading protection across all data is the concave failure: enormous maintenance cost, false positive noise that trains users to click through warnings, and the real exfiltration lost in the background. Sensitivity labels and DLP policies are applied to the crown jewels and known high-value patterns — not to everything. Three well-targeted DLP policies that fire reliably are worth more than thirty policies that nobody trusts.
 **Visibility before governance.**
 The surface map is the most valuable deliverable in this assignment. An organisation that has never seen its "Anyone" link count, its guest list with last sign-in dates, or its auto-forward rule inventory cannot govern what it has. The surface map creates visibility; governance follows from it.
 **Protection must travel with the data.**
 A sensitivity label with encryption is the only control that survives data leaving the tenant. Container controls — SharePoint permissions, CA policies, device compliance — stop working the moment the file is downloaded and forwarded. For the crown jewels, the protection must be bound to the file itself. Everything else is a gate on the way out, not a lock on the data.
 ---
 ## Delivery Architecture
 ### Step 1 — Surface Map (no changes)
 *Described above in "Before You Touch Anything." Complete and deliver before proceeding.*
 The surface map has a second purpose beyond informing the work: it is the before-state that makes the leave-behind measurable. "You had 847 anonymous links; you now have 0" is a concrete, auditable risk-reduction statement.
 ---
 ### Step 2 — Remove the Dangerous Paths
 These actions have the highest impact per unit of effort in the entire assignment. They should be completed before any additive control is deployed.
 **Kill anonymous "Anyone" links.**
 Set the tenant-level sharing policy to prohibit new "Anyone" links:
 - SharePoint admin center → Policies → Sharing
 - External sharing: set to **New and existing guests** (requires authentication) — not "Anyone"
 - This stops new anonymous links from being created. It does not revoke existing links.
 Existing anonymous links must be revoked separately. Use the SharePoint Sharing Report or a Graph API query to enumerate them, then decide with the client lead: bulk revoke all, or review and selectively revoke. Bulk revoke is correct for any link created more than 90 days ago with no documented business justification. Document the decision and the revocation count.
 **Block external auto-forwarding.**
 External auto-forwarding rules are the most reliable mailbox-compromise exfiltration technique. They should not exist.
 - Exchange admin center → Mail flow → Remote domains → Default domain → Uncheck "Allow automatic forwarding"
 - Or via the outbound anti-spam policy: set automatic forwarding to **Off**
 - After disabling, audit existing rules: `Get-TransportRule | Where-Object { $_.RedirectMessageTo -like "*@*" }` and `Get-Mailbox -ResultSize Unlimited | Get-InboxRule | Where-Object { $_.ForwardTo -or $_.RedirectTo -like "*@*" }`
 Any active external forwarding rule found during the audit is a potential incident indicator. Treat each one as suspicious until confirmed legitimate by the mailbox owner and the named client lead. Document the outcome for each.
 **Restrict user OAuth consent.**
 Users should not be able to grant arbitrary third-party applications access to tenant data.
 - Entra ID → Enterprise applications → Consent and permissions → User consent settings
 - Set to: **Allow user consent for apps from verified publishers, for selected permissions (classified as low impact)** — or **Do not allow user consent** (more restrictive; requires admin approval workflow to compensate)
 - Enable the **Admin consent workflow**: users can submit a request; named admins receive and review it
 Review existing user-consented grants. Flag any app with permissions in these categories:
 - `Mail.Read`, `Mail.ReadWrite`, `Mail.Send` — reads or sends all mail
 - `Files.ReadWrite.All`, `Sites.Read.All` — accesses all files and sites
 - `User.Read.All`, `Directory.Read.All` — reads full directory
 High-permission user-consented grants should be reviewed with the named client lead and revoked where the app is not recognised, not actively used, or not from a verified publisher. Revoke through Entra ID → Enterprise applications → [App] → Permissions → Revoke user consent.
 ---
 ### Step 3 — Exchange Online Protection Baseline
 EOP is included in E3 and M365 Business Premium. It handles anti-phishing, anti-malware, and anti-spam for Exchange Online. Default EOP configuration is functional but not optimal.
 **Email authentication (SPF, DKIM, DMARC):**
 | Protocol | What it does | Configuration |
 |----------|-------------|---------------|
 | **SPF** | Declares which servers may send email as your domain | DNS TXT record — verify it exists and is not over-broad (`+all` invalidates it) |
 | **DKIM** | Cryptographically signs outbound email | Enable in Exchange admin center → Email authentication → DKIM → Enable for each domain. Key rotation is handled automatically. |
 | **DMARC** | Specifies how receiving servers handle SPF/DKIM failures | DNS TXT record. Deploy in stages: `p=none` (monitoring) → verify no legitimate mail fails → `p=quarantine` → eventually `p=reject`. Minimum target for this assignment: `p=quarantine` after 30-day monitoring period shows no legitimate mail failing. |
 Without DMARC, your domain can be spoofed in inbound email to your users and in outbound email to others. SPF and DKIM without DMARC do not enforce — DMARC is the enforcement record.
 **Anti-phishing policy (EOP):**
 - Exchange admin center → Policies & rules → Threat policies → Anti-phishing
 - Enable impersonation protection for: the organisation's own domain(s), key users (CEO, CFO, board members, finance team)
 - Enable mailbox intelligence (learning sender patterns)
 - Set action for impersonation detections: **Quarantine** (not move to Junk — quarantine is reviewed; Junk is ignored)
 If the client has Defender for Office 365 P1 (included in M365 Business Premium or as an add-on): enable Safe Links and Safe Attachments. These are materially more effective than EOP baseline anti-phishing. Note the gap if E3 without the add-on.
 **Anti-malware policy:**
 - Threat policies → Anti-malware
 - Enable common attachment filter: block executable file types (.exe, .vbs, .js, .ps1, .bat, .cmd and others)
 - Zero-hour auto purge (ZAP): ensure it is enabled — retroactively quarantines malware found after delivery
 - Admin notifications: notify security team on malware detection
 **Anti-spam policy:**
 - Threat policies → Anti-spam
 - Bulk complaint level threshold: set to 6 (aggressive; default is 7)
 - Enable outbound spam notifications: alert the security team when a mailbox is detected sending spam (indicator of compromise)
 - Verify SPF hard fail is evaluated
 ---
 ### Step 4 — Sharing Governance
 Sharing governance operates at multiple levels in M365. The tenant setting is the ceiling — per-site can be more restrictive but never more permissive than the tenant setting.
 **Tenant-level settings (SharePoint admin center → Policies → Sharing):**
 | Setting | Target value | Notes |
 |---------|-------------|-------|
 | External sharing — SharePoint | New and existing guests | Requires guest authentication. "Anyone" was removed in Step 2. |
 | External sharing — OneDrive | New and existing guests | Match SharePoint setting or more restrictive. |
 | Require guests to sign in using the same account | Yes | Prevents link forwarding to a different account. |
 | Allow guests to share items they don't own | No | Prevents reshare chain from escaping first-hop control. |
 | Guest access expiration | 30 days (or per organisation policy) | Guests must be reviewed and re-invited; standing access expires. |
 | Link permissions default | View | Least privilege; users explicitly upgrade if edit is needed. |
 | Link expiry (new and existing guest links) | 30 days | Prevents permanent link accumulation. |
 **Per-site controls — crown jewel sites:**
 For sites identified in the crown jewels question (Step 1 of "Before You Touch Anything"):
 - Set external sharing to **Only people in your organization**
 - Remove broad internal permissions ("Everyone except external users", "All company")
 - Document the named owners of the site and the access review schedule
 Internal oversharing is often overlooked: a finance site accessible to "All company" means any compromised internal account reaches the financial data. Restrict sensitive sites to named groups with specific membership.
 ---
 ### Step 5 — Guest Governance
 Guest accounts are standing external blast radius. Every guest that has not been reviewed is an unknown with access to unknown data.
 **Immediate actions:**
 1. **Export the guest list with last sign-in date.** In Entra ID → Users → filter by User type: Guest. Export to CSV. Sort by last sign-in date.
 2. **Flag for removal:** guests who have not signed in within 90 days and have no active project sponsorship. Present the list to the named client lead for approval before removing.
 3. **Remove approved stale guests.** Document the count.
 **Ongoing governance (configure before handover):**
 | Control | Configuration |
 |---------|--------------|
 | Guest invitation restrictions | Restrict to Entra ID admins only (not all users can invite guests) |
 | Guest access expiration | Configure in Entra ID → External Identities → External collaboration settings: Guest user access expires after 180 days unless reviewed |
 | Access reviews | Entra ID → Identity Governance → Access reviews — create a quarterly review for all guests. Reviewer: IT lead or line-of-business owner. Action on no response: remove access. |
 Access reviews require Entra ID P2 for full automation. For E3, a manual quarterly review using the Entra guest export is the alternative — document the cadence in the leave-behind and assign an owner.
 ---
 ### Step 6 — Sensitivity Labels Foundation
 Sensitivity labels are the mechanism that makes protection travel with the data. A labelled document carries its permissions wherever it goes — downloaded, emailed, shared externally.
 **Label taxonomy — baseline (4 labels):**
 | Label | Meaning | Default protection |
 |-------|---------|-------------------|
 | **Public** | Intended for external distribution | No restrictions |
 | **Internal** | Default for internal business content | No external sharing by default |
 | **Confidential** | Business-sensitive; restricted distribution | Encrypt; restrict to organisation members; no external forwarding |
 | **Highly Confidential** | Crown jewels: financial, legal, M&A, HR | Encrypt; restrict to named group; no download on unmanaged device; watermark |
 Keep the taxonomy to four labels. More labels increase classification fatigue and reduce the percentage of content that gets labelled at all. A four-label taxonomy that users understand and apply is worth more than a twelve-label taxonomy that nobody uses.
 **Deployment:**
 1. Create labels in Microsoft Purview compliance portal → Information protection → Labels
 2. Publish labels to all users via a label policy
 3. Configure auto-labelling for the Highly Confidential label: define content patterns (e.g., project name, internal designation) that trigger auto-labelling in SharePoint and OneDrive
 4. Set the default label for SharePoint sites identified as crown jewel sites: Confidential
 **For Highly Confidential — encryption configuration:**
 - Rights Management encryption: Only organisation members can open; no external forwarding; no printing
 - Apply to: the named crown-jewel sites and document libraries
 The label is the escape hatch. A Highly Confidential document downloaded to an unmanaged device and forwarded externally is still encrypted — the attacker has ciphertext, not data. This is the only control in this assignment that holds after data leaves the tenant.
 ---
 ### Step 7 — DLP Baseline
 DLP policies intercept known sensitive information patterns transiting Exchange, SharePoint, and OneDrive. Deploy DLP as a scalpel: 3–5 specific, high-confidence patterns. Do not attempt comprehensive coverage.
 **Target patterns for most organisations:**
 | Policy | Pattern | Initial action |
 |--------|---------|---------------|
 | Payment card data | Credit card numbers (PCI scope) | Policy tip to user + admin alert |
 | National identity numbers | National ID / tax number format for the client's jurisdiction | Policy tip to user |
 | Crown jewel content | Sensitivity label: Highly Confidential (label-based DLP) | Block external sharing + admin alert |
 | External forwarding with attachments | Email to external recipients with attachments > threshold | Notify user |
 Start every DLP policy in **simulation mode** (test/audit) before enforcement. Review DLP activity reports after 48 hours of simulation. Identify false positives. Tune the policy. Then enable with **notify only** before moving to **block**.
 The sequence: simulation → notify → block. Never skip the simulation and notify stages.
 **What E3 DLP covers:** Exchange Online, SharePoint Online, OneDrive for Business. It does not cover Teams messages (requires Purview add-on) or endpoint DLP (requires Purview or E5 compliance).
 Note the gaps in the residual risk statement: DLP at this scope does not cover Teams conversations or files shared through channels. If Teams is a primary working environment for crown-jewel content, document this as a gap pointing toward a Purview engagement.
 ---
 ### Step 8 — Audit Logging
 Audit logging is the foundation of any post-incident forensics capability. If it is not enabled, every breach investigation starts with nothing.
 **Unified Audit Log:**
 ```powershell
 # Verify status
 Get-AdminAuditLogConfig | Select-Object UnifiedAuditLogIngestionEnabled
 # Enable if false
 Set-AdminAuditLogConfig -UnifiedAuditLogIngestionEnabled $true
 ```
 E3 default retention: 90 days. Verify actual retention in the Compliance portal → Audit. If the client has regulatory requirements for longer retention (NIS2, DORA, banking regulations typically require 1 year minimum), document the gap. The E3 upgrade path is the Audit (Premium) add-on or E5 compliance.
 **Mailbox audit logging:**
 ```powershell
 Get-Mailbox -ResultSize Unlimited | 
  Where-Object {$_.AuditEnabled -eq $false} | 
  Set-Mailbox -AuditEnabled $true
 ```
 Verify that key mailbox audit operations are captured: MailboxLogin, SendAs, SendOnBehalf, HardDelete, FolderBind.
 **Critical audit events to verify are captured:**
 | Event category | Why it matters |
 |---------------|---------------|
 | File and page activities | Accessed, downloaded, shared — the data exfiltration footprint |
 | Sharing and access request activities | External shares created; guest invitations sent |
 | Synchronization activities | Files synced to devices (OneDrive sync client) |
 | Exchange admin activities | Transport rule creation/modification; external forwarding |
 | Azure AD sign-in events | Anomalous sign-ins, MFA failures, conditional access decisions |
 | DLP rule matches | Evidence that DLP policies are firing |
 ---
 ## Structural Resilience Checklist
 Controls that hold without ongoing human willingness after this engagement closes.
 - [ ] Anonymous sharing blocked at tenant level — confirmed by SharePoint sharing settings
 - [ ] Existing anonymous links revoked — count documented
 - [ ] External auto-forwarding blocked at tenant level — confirmed by transport rule and outbound spam policy
 - [ ] Active external forwarding rules reviewed and removed
 - [ ] DKIM enabled for all domains
 - [ ] DMARC deployed at minimum `p=quarantine` after monitoring period
 - [ ] User OAuth consent restricted — admin consent workflow active
 - [ ] High-permission user-consented OAuth grants reviewed
 - [ ] Guest expiration configured — new guests expire by default
 - [ ] Stale guests removed (90+ days inactive, no active sponsorship)
 - [ ] Guest access review cadence documented with named owner
 - [ ] Sensitivity labels published to all users — Highly Confidential label with encryption
 - [ ] DLP baseline policies active (post-simulation and notify stages) — not in simulation only
 - [ ] Unified Audit Log enabled
 - [ ] Mailbox audit logging enabled for all mailboxes
 ---
 ## Kill Chain Contribution
 **What this assignment closes:**
 | Attack vector | Control deployed |
 |---------------|-----------------|
 | Data exfiltration via anonymous link (bypasses all identity controls) | Anonymous link prohibition + existing link revocation |
 | Business email compromise via mailbox forwarding rule | External auto-forwarding block + rule audit |
 | OAuth consent phishing (malicious app requesting mail/file access) | User consent restriction + high-permission grant review |
 | Domain spoofing (impersonation of the client's domain in email) | DMARC `p=quarantine` |
 | Phishing email impersonating known users or domain | Anti-phishing impersonation protection |
 | Crown-jewel document leaking outside the tenant | Sensitivity label encryption (Highly Confidential) — protection travels with file |
 | Known sensitive data patterns transiting email or SharePoint | DLP baseline policies |
 | Stale guest accounts as standing external foothold | Guest expiration + stale guest removal |
 **What this assignment does not close:**
 | Remaining gap | Addressed by |
 |---------------|-------------|
 | Advanced phishing: Safe Links, Safe Attachments | Defender for Office 365 P1 (E5 or add-on) |
 | Teams message DLP | Purview compliance add-on |
 | Endpoint DLP (data leaving via USB, local app) | Purview E5 compliance or endpoint DLP engagement |
 | Full data lifecycle governance (retention, disposal) | Purview engagement |
 | MDCA session controls (block download from browser on unmanaged device) | MDCA engagement |
 | Full guest lifecycle management (access packages, entitlement) | Entra ID Governance (P2) engagement |
 | Residual data on unmanaged/BYOD devices | App Protection Policies (Intune assignment) |
 ---
 ## Leave-Behind Package
 | Artifact | Description |
 |----------|-------------|
 | **Surface map report** | Before-state: "Anyone" link count, external shares, guest list with last sign-in, forwarding rules found, OAuth grant inventory, SPF/DKIM/DMARC status |
 | **Anonymous link revocation record** | Links revoked: count, method, date |
 | **External forwarding rule audit** | Rules found, disposition of each (removed / confirmed legitimate / flagged as suspicious) |
 | **OAuth grant review record** | Grants reviewed, grants revoked, grants retained with justification |
 | **EOP policy documentation** | Anti-phishing, anti-malware, anti-spam settings with rationale |
 | **DMARC monitoring report** | DMARC aggregate reports at `p=none` before moving to `p=quarantine`; confirmation of quarantine deployment |
 | **Sharing governance configuration** | Tenant sharing settings, crown-jewel site configurations |
 | **Guest governance documentation** | Expiration settings, access review configuration, stale guest removal count, review cadence with named owner |
 | **Sensitivity label documentation** | Label taxonomy, label policy, encryption configuration for Highly Confidential |
 | **DLP policy documentation** | Each policy: target pattern, scope, actions, simulation results before enforcement |
 | **Audit logging confirmation** | Unified Audit Log status, retention period, mailbox audit status |
 | **Scope boundary log** | Every finding outside this scope, named and prioritized |
 | **Residual risk statement** | What this assignment did not close: Teams DLP gap, endpoint exfil path, advanced phishing gap, guest lifecycle limitations |
 ---
 ## Scope Boundary Signals
 | Signal | Points toward |
 |--------|--------------|
 | Significant Teams usage for crown-jewel content; Teams DLP not covered | Purview compliance engagement |
 | No independent M365 backup — Microsoft recycle bin only | Recovery and detection engagement (Book VI) |
 | Audit log retention < regulatory requirement | Audit (Premium) add-on; or compliance-driven M365 upgrade |
 | On-premises Exchange still in the estate | Hybrid Exchange engagement — decommissioning path |
 | Advanced phishing; no Defender for Office 365 P1 | E5 / MDO add-on evaluation |
 | High volume of user-consented high-permission OAuth apps | Entitlement management engagement |
 | Crown-jewel data accessible to broad internal groups | Information architecture engagement (governance, IA, Purview classification) |
 | No independent M365 backup | Recovery and detection engagement |
 | No incident response plan | IR planning engagement |
 ---
 ## Completing the "Secure M365" Engagement
 When all four assignments are delivered, the client has:
 **Identity Baseline** — MFA enforced for all users and phishing-resistant MFA for admins. Legacy authentication blocked at the tenant level. Break-glass accounts established and monitored. Admin accounts separated and audited.
 **CA Architecture** — A named, documented, principled CA policy set. Layer 1 (identity) and Layer 2 (admin elevation) enforced. Layer 3 (device compliance) activated following the Intune assignment. Per-user MFA conflict resolved.
 **Intune Security Baseline** — Device compliance policies returning results for the enrolled fleet. Compliant device required for M365 access (CA Layer 3 active). BitLocker, patch compliance, and LAPS deployed. Update rings with canary. App Protection Policies for BYOD. The real device population is mapped and documented.
 **Collaboration and Data Security** — Anonymous links removed. External auto-forwarding blocked. Email authentication at DMARC quarantine. External sharing governed. Stale guests removed. Sensitivity labels deployed with crown-jewel encryption. DLP baseline active for known high-value patterns. Audit logging enabled.
 **What this engagement does not close** — and what the CISO has in writing:
 - Session token theft (AiTM phishing) → Entra ID P2 + token protection
 - EDR and post-compromise detection → Defender for Endpoint P2 or Wazuh augmentation
 - Standing privilege → PIM / PAM engagement
 - Active Directory on-premises hardening → hybrid identity and AD hardening engagement
 - Full data governance → Purview engagement
 - Backup and recovery → recovery and detection engagement
 - Incident response capability → IR planning and detection baseline engagement
 The residual risk statement across all four packages is the honest description of what has been built and what remains. It is not a sales document — it is the record that the client's security posture was improved deliberately, with full awareness of what was and was not in scope.
 ---
 *For the identity foundation, see [Assignment: Identity Baseline](assignment-identity-baseline.md).*
 *For the CA architecture, see [Assignment: CA Architecture](assignment-ca-architecture.md).*
 *For the device security baseline, see [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md).*
 *For the data and collaboration philosophy, see [Book V — Data & Collaboration](../books/04-data-and-collaboration.md).*
 *For the recovery and detection layer this engagement exposes as the next priority, see [Book VI — Recovery & Detection](../books/05-recovery-and-detection.md).*
@@ -0,0 +1,222 @@
 # Assignment: Identity Baseline
 > *Enforce what you already have. Every other M365 security control is downstream of this one.*
 This is a **scoped assignment package** — a complete, principled delivery guide for one specific client brief. It is designed to work with limited organizational engagement and to leave behind infrastructure that holds without anyone needing to want it.
 ---
 ## The Brief
 Client requests that fall within this scope:
 - *"Secure our M365 / our identities are a mess"*
 - *"We need MFA enforced — the auditor asked for it"*
 - *"We got phished and IT wants to prevent it happening again"*
 - *"Review our user accounts and admin accounts"*
 - *"Make sure only the right people have access"*
 This assignment does not require executive sponsorship. It requires one named IT lead with Global Administrator access and a tolerance for findings.
 ---
 ## Scope Boundary
 **In scope:**
 - Entra ID authentication configuration (MFA, legacy auth, auth methods)
 - Conditional Access policy review for existing policies (not full CA architecture)
 - Global Administrator and other privileged role audit
 - Break-glass account establishment
 - Entra ID Protection risk policy baseline
 - Authentication method registration and SSPR configuration
 - Service principal and app registration review (inventory and flag — not remediate)
 **Out of scope:**
 - Conditional Access policy design and architecture → [Assignment: CA Architecture](assignment-ca-architecture.md)
 - Device compliance and Intune → [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md)
 - Privileged Access Management (PIM, PAM, PAW) → separate privileged access engagement
 - Active Directory on-premises → hybrid identity engagement
 - Application permissions remediation → separate service identity engagement
 When the client asks for something adjacent, log it in the scope boundary signals section at the end of the engagement. Do not absorb it silently and do not pitch the next engagement. The log is the record.
 ---
 ## Before You Touch Anything
 These three steps happen before any change, on day one.
 **1. Break-glass accounts.**
 If the tenant has no cloud-only break-glass accounts excluded from all CA policies, create two before proceeding. Document their credentials out of band (not in the same tenant). Alert on their sign-in. This is the safety net. Without it, a misconfigured CA policy can lock the entire tenant — including you.
 **2. CAExporter baseline.**
 Export the current CA policy state using [CAExporter](https://github.com/merill/caexporter). This JSON export is the before-state. Every change made during this engagement is measurable against it. It is also the rollback reference if something breaks.
 **3. Authentication sign-in log baseline.**
 Export 30 days of Entra sign-in logs, filtered for legacy authentication clients. This is the baseline for measuring the impact of legacy auth block and the evidence that the block is complete. Without it, you cannot demonstrate that legacy auth is actually gone — only that a policy exists.
 ---
 ## Principles Applied
 **Automation over procedure.**
 Every control in this assignment is a policy, not a document. MFA enforcement is a CA policy, not a user awareness campaign. Legacy auth block is an authentication policy or CA rule, not a helpdesk notification. A procedure only works when someone follows it. A policy works when no one is looking.
 **Kill chain first.**
 There are two controls in this assignment that matter more than all others: MFA enforcement on all users, and legacy auth block. Everything else — admin hygiene, SSPR configuration, risk policies — is valuable but secondary. If the engagement ends early, these two must be complete.
 **Visibility as accountability.**
 Every export, every report, every baseline produced during this engagement exists in the client's own tenant and documentation system permanently. A sign-in log showing zero legacy auth clients is evidence that outlasts the engagement. An admin account inventory with a date on it creates accountability that does not require anyone to actively manage it.
 **Scope discipline.**
 Anything discovered outside scope goes into the scope boundary log — not into the work plan. A consultant who silently fixes adjacent problems during a scoped engagement creates unscoped liability and destroys the client's ability to understand what was done. Log it, name it, leave it.
 ---
 ## Delivery Architecture
 Sequenced by impact, not by calendar. Each step depends on the one before it.
 ### Step 1 — Baseline (no changes)
 | Action | Output |
 |--------|--------|
 | CAExporter export | CA policy baseline JSON |
 | Break-glass accounts created and monitored | Break-glass documentation (out of band) |
 | Sign-in log export: legacy auth clients | Legacy auth client list |
 | Global Administrator audit: who holds it, cloud-only vs synced, standing vs eligible | Admin account inventory |
 | Service principal inventory: client secrets expiry, Graph permissions, admin consent | Service principal risk log |
 | Authentication method registration report | Who has MFA registered, by method |
 | SSPR configuration review | Current state documented |
 At the end of Step 1, share the admin account inventory and legacy auth client list with the named client lead. No recommendations yet. Just findings, plainly stated.
 ---
 ### Step 2 — Kill Chain (two controls)
 **Legacy authentication block.**
 Deploy via Entra authentication policies (tenant-wide, preferred) or CA policy (targeted by legacy auth client type). Stage it: report mode for 48 hours, confirm zero legitimate legacy auth clients in sign-in logs, then enforce. The 48-hour window exists because there are always surprises — a printer, a shared mailbox script, an MFA-unregistered VIP. Find them before enforcement, not after.
 **MFA enforcement.**
 If the client has no CA policies at all: deploy one CA policy requiring MFA for all users, all cloud apps, excluding break-glass accounts. If the client has existing CA policies: review coverage gaps and close them. Staged: exclude a pilot group of 10 users for 24 hours, confirm no breakage, then enforce broadly.
 These two controls are the assignment's kill chain contribution. Legacy auth block plus MFA enforcement closes the most common attack path in the Microsoft ecosystem. Both should be complete before Step 3 begins.
 ---
 ### Step 3 — Admin Hygiene
 **Global Administrator audit.**
 Every account with Global Administrator should be cloud-only (not synced from on-premises AD — a synced account can be compromised on-prem to take the cloud). Count standing Global Admins. The target is zero standing Global Admins beyond break-glass and emergency access. If PIM is not in scope, document the gap and log it. If the client has PIM licensing (P2), note it — it is the correct next step.
 **Admin account separation.**
 Admins should have a dedicated admin account separate from their daily-use account. If they do not, log it as a scope boundary signal for a privileged access engagement. If the client will accept one quick win: rename or create dedicated admin accounts for any standing Global Admins. This is a short task with meaningful blast-radius reduction.
 **Service principal review.**
 Flag any service principal with:
 - Client secrets expiring in under 30 days (operational risk, not security risk — but surfaces the gap)
 - Tenant-wide admin consent granted
 - Graph permissions: `RoleManagement.ReadWrite.Directory`, `AppRoleAssignment.ReadWrite.All`, `Application.ReadWrite.All`, `Directory.ReadWrite.All`
 Log all flags in the scope boundary signals. Do not remediate service principals in this assignment — it requires application owner coordination and deserves its own scoped engagement.
 ---
 ### Step 4 — Risk Baseline
 **Entra ID Protection.**
 If the tenant has P2 licensing (included in E5, available separately), deploy:
 - User risk policy: require password change at High risk (Conditional Access, not legacy user risk policy)
 - Sign-in risk policy: require MFA step-up at Medium or High risk
 If no P2: document the gap. Log the licensing delta for the leave-behind.
 **SSPR.**
 If SSPR is not enabled: enable it for all users with a minimum of two authentication methods required. Default to Microsoft Authenticator + email or phone. SSPR with strong auth methods removes helpdesk dependency for password resets and is a prerequisite for a healthy MFA rollout.
 ---
 ## Structural Resilience Checklist
 Controls that hold without ongoing human willingness after this engagement closes.
 - [ ] MFA enforcement CA policy active — not in report mode
 - [ ] Legacy authentication blocked at tenant level — not just reported
 - [ ] Break-glass accounts exist, are cloud-only, are excluded from CA, are monitored with alerts
 - [ ] Break-glass credentials documented out of band
 - [ ] Sign-in risk and user risk policies active (if P2 licensed)
 - [ ] CAExporter export stored in client documentation
 - [ ] SSPR active for all users
 These are the controls that keep working after the engagement ends. If any item is not checked at handover, document why and log the residual risk.
 ---
 ## Kill Chain Contribution
 **What this assignment closes:**
 | Attack vector | Control deployed |
 |---------------|-----------------|
 | Password spray against cloud accounts | MFA enforcement |
 | Credential stuffing using breached passwords | MFA enforcement + Entra ID Protection |
 | Legacy authentication protocol abuse (SMTP, IMAP, MAPI) | Legacy auth block |
 | Basic phishing for MFA bypass via legacy clients | Legacy auth block |
 | Attacker using compromised admin account persistently | Break-glass monitoring, admin hygiene |
 **What this assignment does not close:**
 | Remaining gap | Addressed by |
 |---------------|-------------|
 | Device-based attacks (unmanaged device as access vector) | [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md) |
 | Adversary-in-the-middle / session token theft | Device compliance in CA + token protection |
 | Standing Global Administrator accounts | Privileged access engagement (PIM) |
 | Service principal over-permission | Service identity engagement |
 | Data exfiltration through sanctioned apps | Collaboration and data security assignment |
 | Persistence via application consent abuse | Service identity engagement |
 The kill chain contribution of this assignment is significant and real. The residual gaps are also real. Both belong in the leave-behind.
 ---
 ## Leave-Behind Package
 Every item below must be delivered at handover. The engagement is not complete until all items exist in the client's own documentation system.
 | Artifact | Description |
 |----------|-------------|
 | **CAExporter JSON (before)** | CA policy state at engagement start |
 | **CAExporter JSON (after)** | CA policy state at engagement close |
 | **Admin account inventory** | Every privileged role assignment: account name, role, cloud-only vs. synced, standing vs. eligible, last sign-in |
 | **Legacy auth sign-in confirmation** | Sign-in log export showing zero legacy auth clients post-block |
 | **MFA registration report** | Authentication method registration by user, at engagement close |
 | **Break-glass documentation** | Account names, monitoring alert confirmation, out-of-band credential storage reference |
 | **Service principal risk log** | Flagged principals with permissions and expiry dates |
 | **Scope boundary log** | Every finding outside this scope, named and prioritized |
 | **Residual risk statement** | Plain-language summary of what this assignment did not close and why |
 The residual risk statement is not optional. A client who receives a clean handover without a residual risk statement has been misled about their posture.
 ---
 ## Scope Boundary Signals
 Log these when you find them. Do not fix them. Do not pitch them. The log is the record.
 | Signal | Points toward |
 |--------|--------------|
 | No device compliance policies exist | Intune Security Baseline assignment |
 | CA policies exist but are poorly designed (overlapping, unnamed, undocumented) | CA Architecture assignment |
 | Global Admins have standing privilege with no PIM | Privileged access engagement |
 | Entra Connect / Cloud Sync server is domain-joined to production domain | Hybrid identity engagement — T0 isolation |
 | AD FS present | Hybrid identity engagement — Golden SAML risk, migration to PHS |
 | Service principals with tenant-wide admin consent | Service identity engagement |
 | No Defender for Office 365 baseline | Collaboration security assignment |
 | Audit logging not configured or retention < 90 days | Detection baseline assignment |
 ---
 *For the conditional access architecture built on top of this baseline, see [Assignment: CA Architecture](assignment-ca-architecture.md).*
 *For technical depth on hybrid identity and the sync server risk, see [Book II — Hybrid Identity](../books/01-hybrid-identity.md).*
 *For privileged access architecture, see [Book III — Privileged Access](../books/02-privileged-access.md).*
@@ -0,0 +1,384 @@
 # Assignment: Intune Security Baseline
 > *The device will be compromised. Compliant is not the same as secure, and the portal toggle is not the same as the device's behaviour. Build for the compromise, not against it.*
 This is a **scoped assignment package** — a complete, principled delivery guide for one specific client brief. It closes the device-layer gap and activates the CA Layer 3 policies designed in [Assignment: CA Architecture](assignment-ca-architecture.md). It can be delivered standalone, but its full structural value is realised when CA Layer 3 is activated at the end.
 ---
 ## The Brief
 Client requests that fall within this scope:
 - *"Deliver a security baseline for our Intune-managed endpoints"*
 - *"Set up Intune / we need device management"*
 - *"We need compliant devices to be required for M365 access"*
 - *"Our auditor wants evidence that devices are encrypted and patched"*
 - *"We have Intune but nobody set up the security policies"*
 - *"We're retiring SCCM and going cloud-native"* (if co-management migration is explicitly scoped)
 This assignment does not require executive sponsorship. It requires one named IT lead with Intune Administrator access, a tolerance for a grace-period before enforcement, and an understanding that the enrollment rate at the start is almost never what the CMDB says.
 ---
 ## Scope Boundary
 **In scope:**
 - Device population mapping (what is actually authenticating, vs. what is enrolled, vs. what the CMDB says)
 - Compliance policies: Windows, macOS, iOS, Android — as applicable to the fleet
 - Device configuration profiles: Windows security baseline settings
 - Windows Update rings (quality and feature updates)
 - Windows LAPS (local admin password management)
 - App Protection Policies for BYOD iOS and Android (MAM without MDM)
 - Enrollment review and gaps (not a new enrollment deployment unless scoped separately)
 - CA Layer 3 activation: connecting compliance state to Conditional Access
 **Out of scope:**
 - SCCM co-management migration → separate engagement (scope is complex and fleet-specific)
 - Autopilot setup and Autopilot-based provisioning → separate deployment engagement
 - EDR configuration: Defender for Endpoint advanced features, custom detection rules → separate or within E5 engagement
 - WDAC / Smart App Control / application allowlisting → advanced application control engagement
 - Driver and firmware update management → note as gap, recommend Windows Update for Business or third-party where Intune is insufficient
 - GPO conflict resolution for hybrid-joined estates → flag; recommend cloud-native migration path
 - Endpoint Privilege Management (JIT local admin elevation) → note as follow-on if standing local admin cannot be removed
 When the client asks about SCCM migration or Autopilot, scope it separately. Co-management is a legitimate transitional architecture but it adds complexity that deserves its own scoped engagement with its own completion criteria.
 ---
 ## Before You Touch Anything
 **1. Break-glass exclusion.**
 Confirm that break-glass accounts are excluded from all device-compliance CA policies. A flaky compliance signal must never lock out tenant recovery. If CA Layer 3 is not yet designed, this step ensures the door is open when it is deployed.
 **2. Four-population mapping.**
 The CMDB is a claim. Authentication logs are facts. Before configuring compliance policies, build the real device picture from four sources:
 | Population | Source |
 |-----------|--------|
 | **Enrolled (MDM)** | Intune device list |
 | **Registered (Entra)** | Entra ID → Devices → All devices |
 | **Authenticating** | Entra sign-in logs (30 days), filtered by device detail |
 | **CMDB** | Whatever the client has |
 Map the differences. Devices in sign-in logs but not in Intune are known-unmanaged — they reach data and you cannot apply compliance policies to them. Devices in the CMDB but not in sign-in logs may be retired equipment or offline devices that have never actually authenticated. The gap between enrolled and authenticating is the real finding, and it belongs in the leave-behind regardless of whether it is addressed in this engagement.
 **3. Existing Intune policy audit.**
 If Intune has been configured before — even partially — audit what exists before touching anything. Duplicate compliance policies, conflicting configuration profiles, and orphaned enrollment restrictions are common. A client who says "Intune is set up" often has one compliance policy created in 2021, three enrollment profiles nobody recognises, and a Windows security baseline applied to a group that no longer exists. Export the current state.
 **4. CA Layer 3 status.**
 Check whether `CA-AllUsers-AllApps-RequireCompliantDevice` exists in report-only mode from the CA Architecture assignment. If it does, this assignment ends by activating it. If it does not exist, design and deploy it in report-only mode as part of this assignment — but do not activate it until compliance coverage is proven.
 ---
 ## Principles Applied
 **Compliance is a signal, not a checkbox.**
 A device marked compliant in Intune carries a staleness window: compliance is evaluated on check-in cadence, not continuously. A device can fall out of compliance — lose encryption, miss patches, be rooted — and still hold a valid compliant token and access grant for hours. Design around this: the compliance requirement at CA is a meaningful control that raises the cost of attack, not a guarantee of device integrity. Document what it is and what it isn't.
 **Test on real devices, not portal configurations.**
 A Conditional Access policy can show a perfectly correct configuration in the portal and enforce nothing. The same applies to compliance policies: a policy assigned to a group can appear active and produce no compliance results for enrolled devices whose group membership has drifted. And MAM/App Protection enforcement has documented gaps between the toggle and the actual device behaviour — gaps that vary by platform, OS build, and companion app version. For every control that matters, confirm it with a real device producing the expected result. Write the expected result down before you test, not after.
 **Velocity with a brake.**
 Update rings exist not to slow patching but to make patching safe at speed. An unbraked push to the entire fleet is one bad update away from a mass outage — the kind that stops production, not the kind that stops attackers. A canary ring with a real halt-and-rollback capability is the mechanism that lets the rest of the fleet patch fast and safely. The canary must be tested — an untested canary is just the first domino with a friendly name.
 **The device is disposable; the data boundary is the protection.**
 Every design decision in this assignment should ask: if this device is wiped and reprovisioned in an hour, does anything important break? A device that can be reprovisioned in an hour is antifragile. A device whose compromise is a crisis is fragile, regardless of how many compliance policies are applied to it. Build for reprovisionability: Autopilot, LAPS, application deployment from Intune, user profile from OneDrive. The compliance baseline hardens the device; the reprovision capability makes its loss survivable.
 ---
 ## Delivery Architecture
 ### Step 1 — Population Mapping and Audit (no changes)
 | Action | Output |
 |--------|--------|
 | Four-population mapping (enrolled / registered / authenticating / CMDB) | Device population report: counts, deltas, known-unmanaged estimate |
 | Existing compliance policy audit | Policy inventory: assignments, settings, mode, last modified |
 | Existing configuration profile audit | Profile inventory: conflicts, orphaned assignments, platform coverage |
 | Update ring inventory | Current rings or absence of rings |
 | Sign-in log: device compliance state | What proportion of sign-ins carried a compliant device signal in the last 30 days |
 | LAPS status | Whether Windows LAPS is deployed or legacy LAPS or neither |
 Share the device population report with the named client lead before writing any policies. The finding is almost always the same: the managed fleet is smaller than assumed, the dark population is larger than assumed, and several CMDB entries have not authenticated in months. State it plainly.
 ---
 ### Step 2 — Compliance Policies (report mode first)
 Deploy all compliance policies in report mode. Review results for 72 hours before activating noncompliance actions. The goal at this step is to see the real compliance state of the fleet — not to block anyone.
 **Noncompliance action sequence (apply to all compliance policies):**
 | Day | Action |
 |-----|--------|
 | 0 | Mark noncompliant (reporting only — this is immediate and always on) |
 | 1 | Send email notification to user |
 | 7 | Block access (activates when `CA-AllUsers-AllApps-RequireCompliantDevice` is enabled) |
 | 30 | Retire device (for persistent noncompliance — confirm with client lead before activating) |
 The 7-day grace window is not leniency — it is the window in which IT can identify and remediate legitimate noncompliance (device in repair, device offline, missed check-in) before a user is blocked. Without it, the first enforcement wave produces a support ticket flood. With it, enforcement is gradual and explainable.
 **Windows compliance policy — baseline settings:**
 | Setting | Value | Rationale |
 |---------|-------|-----------|
 | BitLocker required | Yes | Unencrypted devices lose data on physical theft |
 | OS minimum version | Windows 10 22H2 / Windows 11 22H2 | Below this: no Windows LAPS; OS in extended support only |
 | Defender AV enabled | Yes | Baseline detection |
 | Defender real-time protection | Yes | |
 | Firewall enabled | Yes | |
 | Secure boot enabled | Yes | Blocks bootkit-level compromise |
 | TPM required | Yes (for new enrollments; consider exclusion group for legacy hardware) | PRT TPM-binding requires TPM |
 | Password required | Yes | Minimum complexity, minimum length 8 |
 | Maximum inactivity before screen lock | 15 minutes | |
 Do not configure the compliance policy to evaluate Microsoft Defender for Endpoint risk score unless Defender for Endpoint P2 (E5) is licensed. Misconfiguring this setting against an E3 tenant produces false noncompliance for all devices.
 **macOS compliance policy (if fleet includes Macs):**
 | Setting | Value |
 |---------|-------|
 | FileVault enabled | Yes |
 | OS minimum version | macOS 13 (Ventura) or later |
 | Password required | Yes |
 | Firewall enabled | Yes |
 | System Integrity Protection | Yes |
 **iOS compliance policy:**
 | Setting | Value |
 |---------|-------|
 | OS minimum version | iOS 16 or later |
 | Passcode required | Yes |
 | Jailbreak detection | Block jailbroken devices |
 | Device threat level | Secured (no threat level tolerance) |
 **Android compliance policy:**
 | Setting | Value |
 |---------|-------|
 | OS minimum version | Android 12 or later |
 | Device PIN required | Yes |
 | Rooted devices | Block |
 | Minimum security patch level | Within 90 days |
 **The honest note on jailbreak/root detection:** detection is an arms race. A motivated attacker with a current tool bypasses it. Treat root detection as a tripwire that raises the cost of the attack, never as a barrier that stops it. Document this in the residual risk statement.
 ---
 ### Step 3 — Device Configuration Baseline
 The Microsoft Windows Security Baseline (available in Intune → Endpoint security → Security baselines) is the starting point. It encodes Microsoft's recommended settings as an Intune profile that enforces continuously.
 **Deployment approach:**
 1. Deploy the Windows Security Baseline in **report mode** to a pilot group (10–20 devices, IT team first)
 2. Review conflicts and configuration gaps for 48 hours
 3. Resolve any conflicts with existing policies (overlapping profiles produce unpredictable results — Intune applies the stricter setting per-setting by default, but conflicting values create undefined behaviour)
 4. Expand to production groups
 5. Monitor Intune reports for policy conflicts and noncompliance
 **Additional configuration profiles (deploy after the security baseline is stable):**
 | Profile | Purpose | Notes |
 |---------|---------|-------|
 | **BitLocker configuration** | Enable BitLocker silently, escrow recovery keys to Entra | Separate from compliance (compliance requires BitLocker; this profile configures how it's applied) |
 | **Microsoft Defender AV** | Configure exclusions, scheduled scans, PUA protection | Do not configure AV exclusions broadly — each exclusion reduces coverage |
 | **Firewall configuration** | Block inbound connections, logging | Complements compliance requirement |
 | **Edge browser baseline** | SmartScreen, extension management, safe browsing, disable password manager sync | Applies to corporate Edge profile; test carefully — extension management can break legitimate workflows |
 | **Windows Hello for Business** | Phishing-resistant authentication at device layer | If deploying phishing-resistant MFA (required by CA-Admins policy), WHfB is the most practical path |
 ---
 ### Step 4 — Update Rings
 Update rings are the mechanism that makes patching fast and safe simultaneously. Deploy three rings minimum.
 **Ring structure:**
 | Ring | Assignment | Quality update deferral | Feature update deferral | Notes |
 |------|-----------|------------------------|------------------------|-------|
 | **Canary** | IT team (5–10 devices) | 0 days | 0 days | Takes every update immediately. Canary for production rings. Must include at least one machine that runs every critical business application. |
 | **Pilot** | 10–15% of fleet, varied roles | 7 days | 30 days | Broad business representation. If Canary is clear after 7 days, Pilot proceeds. |
 | **Production** | Remainder | 14 days | 90 days | Conservative deferral. If Pilot is clear after 7 days, Production proceeds. |
 **Pause and rollback configuration:**
 Configure Intune update rings with the pause capability enabled. Define in the client's runbook:
 - Who has authority to pause an update ring (named person, not a committee)
 - What the trigger is for pausing (Canary devices showing a known issue, not a vague "something might be wrong")
 - Maximum pause duration before the pause is reviewed (7 days)
 An untested pause capability is a fiction. Test it during the engagement: deploy an update to Canary, confirm it lands, pause the ring, confirm the pause holds, resume. This takes 30 minutes and is the only proof the mechanism works.
 ---
 ### Step 5 — Windows LAPS
 Standing local administrator accounts are the device-layer version of standing privilege. If the same local admin password is shared across the fleet (common in legacy environments), one compromised device yields lateral movement credentials for the entire estate.
 **Windows LAPS (cloud-native):**
 - Available on Windows 10 22H2+ and Windows 11 22H2+ with current patches
 - Configure backup target: Entra ID (cloud-native; no on-prem infrastructure required)
 - Rotation schedule: 30 days, plus rotate on device handoff
 - Requires Entra ID P1 (included in E3)
 **Deployment:**
 1. Enable LAPS in Entra ID (Entra admin center → Devices → Device settings → Enable Microsoft Entra Local Administrator Password Solution)
 2. Create an Intune LAPS policy (Endpoint security → Account protection → LAPS)
 3. Assign to a pilot group; confirm password backup to Entra after check-in
 4. Expand to production
 **For legacy LAPS (on-prem AD environments where Windows LAPS is not yet deployable):**
 Legacy LAPS (the original Microsoft LAPS MSI) remains deployable via Intune for hybrid-joined devices. Flag this as a transitional state — cloud-native Windows LAPS is the destination.
 **What this does not solve:** if standing Domain Admin or local admin is provided to specific IT staff outside of LAPS, that standing privilege is out of scope for this assignment. Log it in scope boundary signals.
 ---
 ### Step 6 — App Protection Policies (BYOD)
 App Protection Policies (MAM without MDM) manage the data layer on personal devices without enrolling the device. This is the correct model for BYOD: wall the corporate data, not the device.
 **The honest caveat, stated plainly:** App Protection Policy enforcement has gaps. The policy controls what managed apps should do; the actual enforcement is dependent on the app version, OS version, companion app (Company Portal on Android), and specific API support. "Block copy/paste to unmanaged apps" blocks in documented paths — it does not block screenshots, OS-level share sheet on some platforms, or every third-party clipboard manager. Test on real devices. Document what you verified and where the limits are.
 **Deploy separate policies per platform.** iOS and Android are not symmetric. A policy that works on iOS may not produce the same behaviour on Android. Test both independently.
 **iOS App Protection Policy — baseline settings:**
 | Setting | Value |
 |---------|-------|
 | Prevent "Save As" to personal storage | Block |
 | Restrict cut/copy/paste to managed apps only | Managed apps with paste in |
 | Require PIN for app access | Yes (after 5 minutes inactivity) |
 | Minimum OS version | iOS 16 |
 | Offline grace period before access blocked | 720 hours (30 days) |
 | Selective wipe after failed PIN attempts | Yes (after 10 attempts) |
 | Minimum app version | Latest − 1 (configure per app) |
 | Jailbroken/rooted devices | Block |
 Apply to: Outlook, Teams, Edge, OneDrive, SharePoint mobile. These are the apps through which corporate data flows on BYOD devices.
 **Android App Protection Policy — same baseline settings.** Test enforcement independently — behaviour on Android differs, particularly clipboard controls and "open in" restrictions.
 **Selective wipe verification:**
 Test selective wipe on a real BYOD device before the engagement closes. Confirm that corporate data (email, files, Teams content) is removed and personal data (photos, personal apps) is not. This is the capability that makes MAM politically viable — if the user doesn't trust that it won't touch their personal data, enrollment fails. Document the test.
 ---
 ### Step 7 — CA Layer 3 Activation
 This is the step that connects device compliance to access control. Everything before this point has been deploying and measuring; this step makes compliance matter for access.
 **Prerequisites before activating:**
 - [ ] Compliance policy deployed and returning results for ≥ 80% of the enrolled fleet
 - [ ] 72 hours of report-only compliance results reviewed — no widespread false noncompliance identified
 - [ ] Break-glass accounts confirmed excluded from device compliance CA policies
 - [ ] Named client lead has approved activation in writing
 - [ ] IT team briefed on noncompliance action timeline (users blocked after day 7 if noncompliant)
 - [ ] Helpdesk runbook written: what to do when a user is blocked due to noncompliance
 **Activation sequence:**
 1. Switch `CA-AllUsers-AllApps-RequireCompliantDevice` from report-only to **enabled**
 2. Monitor Intune compliance dashboard and Entra sign-in logs for 24 hours
 3. Confirm: compliant devices are signing in successfully; noncompliant devices are being blocked at CA
 4. Confirm: break-glass accounts are not blocked
 Do not activate device-compliance CA policies on a Monday or before a public holiday. An unexpected compliance failure during a period of low IT staffing is a bad outcome that a one-day wait entirely prevents.
 **After activation, the compliance signal is live.** A device that loses compliance — drops encryption, falls behind on patches, is rooted — will be blocked from M365 access within the 7-day noncompliance action window. This is the control working as designed.
 ---
 ## Structural Resilience Checklist
 Controls that hold without ongoing human willingness after this engagement closes.
 - [ ] Compliance policies deployed and returning results for enrolled devices
 - [ ] Noncompliance action timer active (day 7 block — not just report)
 - [ ] Windows Security Baseline profile active on production fleet
 - [ ] Update rings deployed with Canary, Pilot, and Production separation
 - [ ] Update ring pause tested at least once
 - [ ] Windows LAPS deployed; local admin passwords backing up to Entra
 - [ ] App Protection Policies active for iOS and Android BYOD (tested on real devices)
 - [ ] Selective wipe tested on BYOD device
 - [ ] `CA-AllUsers-AllApps-RequireCompliantDevice` **enabled** (not report-only)
 - [ ] Break-glass accounts excluded from device compliance CA policies — confirmed with a real sign-in
 ---
 ## Kill Chain Contribution
 **What this assignment closes (or significantly raises the cost of):**
 | Attack vector | Control deployed |
 |---------------|-----------------|
 | Stolen credentials used from unmanaged/unknown device | CA Layer 3: compliant device required |
 | Physical theft of unencrypted device | BitLocker compliance requirement |
 | Lateral movement via shared local admin credentials | Windows LAPS: unique per-device passwords |
 | Unpatched OS exploited at known CVE | Update rings: enforced patch cadence |
 | BYOD personal device accessing corporate data without controls | App Protection Policies: data container on unmanaged device |
 | Attacker persistence on device after credential reset | Compliance noncompliance action: device retired after persistent noncompliance |
 **What this assignment does not close:**
 | Remaining gap | Addressed by |
 |---------------|-------------|
 | Session token theft post-compliance check (AiTM phishing) | Entra token protection (P2) + continuous access evaluation |
 | Compromised but still-compliant device (stale signal window) | Defender for Endpoint device risk integration (E5) |
 | App-layer data exfiltration through sanctioned apps | Collaboration and data security assignment |
 | Advanced malware, post-exploitation on managed device | EDR: Defender for Endpoint P2 (E5) or Wazuh/Sysmon augmentation |
 | Standing privilege on servers accessed from managed devices | Privileged access engagement |
 | Dark access (legacy auth, long-lived tokens bypassing CA) | Legacy auth block (identity baseline) + token lifetime policies |
 The most important gap to document plainly: a managed, compliant device that carries a stolen session token (issued after legitimate MFA) still has access. The compliance signal does not re-evaluate session tokens retroactively. Continuous Access Evaluation (CAE) narrows this window for supported apps — verify which apps in the client's environment support CAE, and document the remainder as residual risk.
 ---
 ## Leave-Behind Package
 | Artifact | Description |
 |----------|-------------|
 | **Device population report** | Four-population map: enrolled, registered, authenticating, CMDB; delta analysis; known-unmanaged estimate |
 | **Compliance policy documentation** | Every policy: settings, assignments, noncompliance action timeline, rationale |
 | **Compliance dashboard export** | Compliance rates by policy and platform at engagement close |
 | **Configuration profile documentation** | Security baseline and supplemental profiles: settings, assignments, conflict analysis |
 | **Update ring documentation** | Ring structure, deferral schedule, pause/rollback procedure, pause test result |
 | **LAPS deployment confirmation** | Devices with LAPS active; Entra backup confirmed; rotation schedule |
 | **App Protection Policy documentation** | iOS and Android policies: settings, tested behaviours, documented gaps per platform |
 | **Selective wipe test record** | Device tested, result, personal data confirmed intact |
 | **CA Layer 3 activation confirmation** | Sign-in log showing compliant devices accessing successfully, noncompliant devices blocked |
 | **Scope boundary log** | Every finding outside this scope, named and prioritized |
 | **Residual risk statement** | What this assignment did not close: stale compliance signal, AiTM token theft, EDR gap, dark access |
 ---
 ## Scope Boundary Signals
 | Signal | Points toward |
 |--------|--------------|
 | Shadow IT apps visible in Intune application inventory | Collaboration and data security assignment; shadow AI discovery |
 | SCCM co-management active; GPO policies conflicting with Intune | Co-management migration engagement; AD hardening |
 | Hybrid-joined devices that depend on line-of-sight to DC | Cloud-native migration path; hybrid identity engagement |
 | No Defender for Endpoint P2; device risk signal not feeding CA | E5 licensing gap; E3 augmentation with Wazuh/Sysmon |
 | Standing local admin accounts for IT staff outside LAPS scope | Privileged access engagement (Endpoint Privilege Management) |
 | Autopilot not configured; device reprovision takes days not hours | Autopilot deployment engagement |
 | Legacy devices below Windows 10 22H2 in the compliance-excluded group | Accelerate OS refresh; document as known risk with timeline |
 | Audit log retention < 90 days | Detection baseline assignment |
 | MAM enforcement gaps found during BYOD testing | Document with vendor; consider MDM enrollment for corporate-issued mobile |
 ---
 ## Buildable-On: What the Next Assignment Depends On
 The Collaboration and Data Security assignment builds on the device posture deployed here. Specifically:
 1. **`CA-AllUsers-UnmanagedDevice-AppEnforcedRestrictions` behaviour** is now testable against the real unmanaged device population. With enrolled and unmanaged devices mapped, you know which users will be affected by app-enforced restrictions and can design the policy accurately.
 2. **The application inventory from Intune** surfaces the shadow IT picture that informs data security scope — what apps are running, what cloud storage is installed, whether consumer AI tools are present.
 3. **Managed device as a data exfiltration boundary** — with compliant devices required for access, the remaining data risk is through sanctioned apps on managed devices. That is the scope of the next assignment.
 ---
 *For the identity foundation, see [Assignment: Identity Baseline](assignment-identity-baseline.md).*
 *For the CA Layer 3 policies this assignment activates, see [Assignment: CA Architecture](assignment-ca-architecture.md).*
 *For the governing philosophy on device posture, see [Book IV — Devices & Endpoint](../books/03-devices-and-intune.md).*
@@ -14,9 +14,9 @@ This template provides a reusable structure for building financial justification
 | Element | Content |
 |---------|---------|
-| **Investment ask** | €[X] over 180 days, phase-gated with go/no-go decisions at days 30, 60, 90 |
+| **Investment ask** | €[X] over 180 days, phase-gated with go/no-go decisions at days 60, 120, 180 |
-| **Primary return** | Reduction of existential cyber risk; regulatory compliance evidence; competitive differentiation through AI sovereignty |
+| **Primary return** | Reduction of existential cyber risk; regulatory compliance evidence; operational resilience demonstrable to auditors and insurers |
-| **Break-even** | Day 90 (via avoided regulatory fine exposure, reduced insurance premiums, or operational resilience) |
+| **Break-even** | 12–18 months post-programme: insurance premium reductions take one renewal cycle; regulatory evidence value accumulates from day 1; incident avoidance value is probabilistic but compounding |
 | **Risk of inaction** | Quantified below; summary: [X]% probability of material incident within 24 months at estimated cost of €[Y] |
 ### Page 2: Cost of Inaction
@@ -27,11 +27,11 @@ This template provides a reusable structure for building financial justification
 | Risk Category | Probability (Client-Specific) | Average Industry Cost | Expected Value |
 |--------------|------------------------------|----------------------|----------------|
-| Ransomware incident (recovery + downtime) | [X]% | €4.5M | €[X * 4.5M] |
+| Ransomware incident (recovery + downtime) | [X]% | €4.5M average (IBM 2024) | €[X * 4.5M] |
-| Regulatory fine (DORA / NIS2 / national) | [X]% | 1-2% global turnover | €[X * % GT] |
+| Regulatory fine (DORA / NIS2 / national) | [X]% | Up to 2% global turnover (NIS2); up to 1% daily (DORA) | €[X * % GT] |
-| Data breach notification and remediation | [X]% | €3.8M (per IBM Cost of Data Breach Report) | €[X * 3.8M] |
+| Data breach notification and remediation | [X]% | €3.8M average (IBM Cost of Data Breach 2024) | €[X * 3.8M] |
-| Cloud AI vendor price increase / lock-in | [X]% | 200-500% price shock | €[X * shock] |
+| Incident response and forensics | [X]% | €150K–500K (external IR firm + legal + crisis comms, independent of breach cost) | €[X * 325K] |
-| Competitive intelligence loss (cloud AI training) | [X]% | Unquantifiable but existential | High |
+| Business interruption during recovery | [X]% | €[daily revenue] × [estimated downtime days] — client-specific | €[X * daily] |
 **Calculation**:
@@ -58,11 +58,11 @@ Present this as: *"Without intervention, the organization faces an expected loss
 | Phase | Timeline | Primary Activity | Estimated Cost | Go/No-Go Gate |
 |-------|----------|-----------------|----------------|---------------|
-| **1. Hygiene** | Days 0-30 | Configuration of existing tools; identity cleanse; visibility | €[X] (primarily labor) | Day 30: Demonstrate risk reduction or stop |
+| **1. Visibility** | Days 0–60 | Kill chain mapping; T0 identity hardening; ASTRAL/PULSAR deployment; T0 backup verified | €[X] (primarily labor) | Day 60: Kill chain documented and T0 hardening complete |
-| **2. Control** | Days 30-60 | ASR, MFA enforcement, network segmentation, vendor lockdown | €[X] (labor + minimal tooling) | Day 60: Validate control effectiveness |
+| **2. Control** | Days 60–120 | MFA for all users; CA baseline; attack surface reduction; vendor hardening | €[X] (labor + minimal tooling) | Day 120: MFA enforced 100%; P0/P1 vulnerabilities closed |
-| **3. Sovereignty** | Days 60-90 | Local AI pilot; recovery drills; T0 asset protection | €[X] (labor + local inference hardware if needed) | Day 90: Prove local AI viability |
+| **3. Signal** | Days 120–180 | Detection rules; alert runbooks; knowledge transfer; housekeeping stream operational | €[X] (labor) | Day 180: Client operates independently; housekeeping running |
-| **4. Antifragility** | Days 90-180 | Chaos engineering; red team; continuous improvement | €[X] (labor + external testing) | Day 180: Maturity assessment and next-phase planning |
+| **4. Retained capability** | Ongoing | Quarterly retained scope; detection engineering; housekeeping; structural improvements | €[X]/quarter | Ongoing: measurable queue reduction; annual BloodHound/Elysium |
-| **Total** | 180 days | | **€[X]** | |
+| **Total (180-day programme)** | 180 days | | **€[X]** | |
 #### Cost Categories
@@ -78,11 +78,11 @@ Present this as: *"Without intervention, the organization faces an expected loss
 | Alternative Approach | Cost | Timeline | Risk |
 |---------------------|------|----------|------|
-| **Do nothing** | €0 | — | Expected loss €[X] over 24 months |
+| **Do nothing** | €0 | — | Expected loss €[X] over 24 months; growing regulatory exposure |
-| **Traditional security audit** | €[X] | 90 days | Produces report; no structural change |
+| **Traditional security audit** | €[X] | 90 days | Produces report; no structural change; findings age immediately |
-| **Full E5 licensing upgrade** | €[X]/user/year | 30 days | Solves some gaps; does not address architecture or AI sovereignty |
+| **Full E5 licensing upgrade** | €[X]/user/year | 30 days | Solves tooling gaps; does not address architecture, process, or accumulated technical debt |
-| **Managed security service (MSSP)** | €[X]/month | Ongoing | Outsources detection; does not reduce structural fragility |
+| **Managed security service (MSSP)** | €[X]/month | Ongoing | Outsources detection; does not reduce structural fragility; dependency without capability transfer |
-| **Antifragile program (this proposal)** | €[X] | 180 days | Structural change, regulatory evidence, AI sovereignty, measurable resilience |
+| **Antifragile programme (this proposal)** | €[X] | 180 days + retained | Structural change, regulatory evidence, measurable kill chain closure, client operational independence |
 ---
@@ -97,7 +97,7 @@ Present this as: *"Without intervention, the organization faces an expected loss
 | Avoided ransomware recovery | Probability reduction × €4.5M | €[X] | €[Y] |
 | Avoided regulatory fine | Probability reduction × % GT | €[X] | €[Y] |
 | Insurance premium reduction | 10-20% reduction on cyber premium | €[X] | €[Y] |
-| Cloud AI cost stabilization | Shift from variable API costs to fixed infra | €[X] | €[Y] |
+| Audit preparation time reduction | ASTRAL Git trail replaces manual evidence gathering for ISO 27001, NIS2, DORA | €[X] | €[Y] |
 | Reduced incident response cost | Faster detection and containment | €[X] | €[Y] |
 | **Total Quantifiable Return** | | **€[X]** | **€[Y]** |
@@ -105,7 +105,7 @@ Present this as: *"Without intervention, the organization faces an expected loss
 | Return Category | Description |
 |----------------|-------------|
-| **Competitive moat** | Proprietary data improves only your models; competitors cannot replicate your operational intelligence |
+| **Regulatory agility** | Demonstrable continuous controls accelerate regulatory approvals, certification audits, and partnership due diligence |
 | **Regulatory agility** | Demonstrable resilience accelerates regulatory approvals, market entries, and partnership discussions |
 | **Talent retention** | Engineers and security professionals prefer organizations that invest in durability over firefighting |
 | **M&A readiness** | Clean identity architecture, tested recovery, and documented controls increase valuation and reduce due-diligence friction |
@@ -139,17 +139,18 @@ Present as: *"This program delivers a [X]% return in year one, rising to [Y]% in
 | Scenario | Investment Adjustment | Outcome |
 |----------|----------------------|---------|
-| **Best case** | No additional tooling needed | Program completes under budget; all value from configuration |
+| **Best case** | No additional tooling needed; client IT team engaged and responsive | Programme completes on timeline; all value from configuration; client operational independence achieved at day 180 |
-| **Base case** | Local AI hardware required for pilot | Slight budget increase; sovereign intelligence proven |
+| **Base case** | Minor tooling additions; moderate IT team availability; some change management friction | Programme completes with 2–4 week slippage on Phase 2 (MFA rollout change management is the usual bottleneck); strong kill chain closure and detection capability |
-| **Worst case** | Deeper technical debt than anticipated | Extend Phase 1 by 30 days; additional labor cost; still cheaper than incident |
+| **Challenging** | Significant technical debt discovered in Phase 1; IT team constrained; change windows infrequent | Phase 1 extended by 4–6 weeks; Phase 2 scope narrowed to kill chain critical path; programme value is still genuine — the findings alone are worth the investment; honest client conversation required at day 60 gate |
 | **Abort condition** | Executive sponsor departure; IT team fully occupied by another major project; scope fundamentally different from discovery call | Programme paused or stopped at the next gate. Partial phases produce partial value — ASTRAL/PULSAR deployed, kill chain documented. Better to stop honestly than to produce a report that nobody acts on. |
 ---
 ### Page 6: Recommendation and Next Steps
-**The Ask (Full Program)**:
+**The Ask (Full Programme)**:
-> *"We recommend approval of a 180-day antifragile enterprise program, structured in four 30-60-90-180 day phases with hard go/no-go gates. The initial 30-day investment is €[X] with a defined deliverable: identification and initial closure of the organizational kill chain. If measurable risk reduction is not demonstrated by Day 30, the program stops with no further obligation."*
+> *"We recommend approval of a 180-day antifragile enterprise programme with three hard milestones. By Day 30: your kill chain is documented, ASTRAL and PULSAR are live, and your most privileged accounts are hardened. By Day 90: MFA covers the entire organisation, your kill chain is closed, and you have detection capability on M365. By Day 180: your team operates the systems independently, housekeeping is running as a permanent stream, and everything we built is in your repository. That is the 180-day programme. What comes after is a retained scope — scoped separately, renewed quarterly."*
 **The Ask (Modular Alternative)**:
@@ -0,0 +1,213 @@
 # CQRE Product Suite: ASTRAL, PULSAR, and AURORA
 > *"Three questions every M365 administrator eventually asks: what does my configuration look like, what happened in my tenant, and what does it mean? The CQRE suite is built to answer all three — each product independently valuable, progressively more powerful in combination."*
 This document describes the three CQRE-built products, how they fit into the antifragile consulting framework, and how to position and deploy them in engagements.
 ---
 ## Suite Overview
 | Product | Full Name | What It Answers | Model | Repo |
 |---------|-----------|-----------------|-------|------|
 | **ASTRAL** | Admin Security: Tenant Review, Automation & Lifecycle | *What does my M365 configuration look like and what has changed?* | Free, open source | [github.com/cqrenet/astral](https://github.com/cqrenet/astral) |
 | **PULSAR** | Platform for Unified Log Search, Alerting & Review | *What happened in my tenant, when, and by whom?* | Free, open source | [github.com/cqrenet/pulsar](https://github.com/cqrenet/pulsar) |
 | **AURORA** | Audit, Unified Review, Observability & Remediation for Administrators | *What does it mean and what should I do?* | Paid | [aurora.cqre.net](https://aurora.cqre.net) |
 **The product narrative in one sentence**: PULSAR captures the signal, ASTRAL holds the baseline, AURORA makes sense of both.
 ---
 ## Framework Alignment
 Each product maps directly to specific antifragile pillars from the [Antifragile Manifest](../core/antifragile-manifest.md).
 ### ASTRAL → Pillar 1 (Structural Decoupling) + Pillar 5 (Asymmetric Payoff Design)
 ASTRAL treats M365 configuration as code — Git-tracked snapshots with PR-based review, drift detection, and baseline restore. Every Intune profile, Conditional Access policy, and Entra setting is versioned, auditable, and recoverable.
 **Pillar 1 alignment**: ASTRAL surfaces hidden coupling in M365 configuration. Conditional Access policies with undocumented exclusion groups, Intune profiles silently competing, admin roles accumulating without review — these are the dependencies that produce fragility. ASTRAL makes them visible and governable.
 **Pillar 5 alignment**: The deployment cost is low (one Azure DevOps project, one Entra app registration). The protection payoff is disproportionately large: compliance evidence produced automatically, configuration changes reviewed before they become incidents, rollback available in minutes.
 **Kill chain relevance**: A compromised admin account that modifies Conditional Access policies is a kill-chain event. ASTRAL detects this drift within minutes via the event-driven change probe and surfaces it as a PR requiring review.
 ### PULSAR → Pillar 3 (Stress-to-Signal Conversion)
 PULSAR ingests M365 audit events — Entra directory changes, Intune actions, Exchange/SharePoint/Teams operations — into a searchable, retained store with alerting and SIEM forwarding.
 **Pillar 3 alignment**: PULSAR is the instrumentation layer for M365. Without it, admin actions are visible for 90 days in the M365 portal and then gone. With it, every admin action becomes permanent, searchable signal. An incident that would have been un-investigable three months later becomes reconstructible in minutes.
 The antifragile principle is explicit: **every stress event produces a signal**. PULSAR ensures no signal is lost.
 **Kill chain relevance**: When a threat actor enumerates admin accounts, modifies authentication methods, or creates a new enterprise application for persistence, PULSAR captures these events. Combined with alerting rules, PULSAR converts audit noise into actionable detection.
 ### AURORA → Pillar 4 (Sovereign Intelligence) + Pillar 2 (Optionality Preservation)
 AURORA connects to PULSAR and ASTRAL via their MCP servers and exposes a unified AI-assisted interface for cross-tool diagnostics, multi-scope orchestration, and enriched SIEM forwarding.
 **Pillar 4 alignment**: AURORA is sovereign intelligence applied to M365 operations. The cross-tool diagnostic tools — correlating audit events with configuration state — produce intelligence that no commercial tool natively generates. This intelligence lives in your infrastructure (self-hosted) or in EU-hosted infrastructure (managed tier), not in a vendor platform you cannot control.
 **Pillar 2 alignment**: AURORA is designed for optionality at every layer. It stores no data itself — data lives in PULSAR's MongoDB and ASTRAL's Git repository, both under your control. The AI layer is pluggable (Azure OpenAI, Ollama, or the managed `llm.cqre.net` endpoint). Switching the underlying model requires one config line.
 ---
 ## Product Details
 ### ASTRAL
 **What it tracks**:
 *Intune*: App Configuration, App Protection, Applications, Compliance Policies, Device Configurations, Enrollment Configurations, Filters, Scope Tags, Scripts, Settings Catalog, and more.
 *Entra*: Named Locations, Authentication Strengths, Conditional Access, App Registrations, Enterprise Applications.
 **How it works**:
 1. An Azure DevOps pipeline runs daily (and on-demand via an event-driven change probe) to export the full tenant configuration.
 2. Drift from the committed baseline is committed to a drift branch and surfaced as a rolling PR.
 3. Reviewers approve or reject individual changes in the PR. Approved changes merge; rejected changes trigger an automated restore.
 4. The entire history lives in Git — indefinite, auditable, diff-able.
 **AI (optional)**: Bring your own Azure OpenAI endpoint to generate human-readable PR narratives. ASTRAL is complete without it — AI is supplementary, not required.
 **MCP server**: ASTRAL includes an MCP server (Azure Container Apps) that exposes tenant state and drift history to AI assistants via natural-language queries. This is what AURORA connects to.
 **Deployment**: Azure DevOps, one pipeline set per tenant. Full setup guide in [`deploy/onboarding-runbook.md`](https://github.com/cqrenet/astral/blob/main/deploy/onboarding-runbook.md).
 **Engagement module pairings**: Modules 1–5 (all M365 modules), Module 3 (M365 Hardening) as primary drift detection layer. Used in the first-week baseline checklist for every M365 engagement.
 ---
 ### PULSAR
 **What it ingests**:
 - Entra ID directory audit logs
 - Intune audit logs
 - Exchange Online, SharePoint, and Teams via the Office 365 Management Activity API
 **Core capabilities**:
 - Watermark-based incremental ingestion with MongoDB persistence
 - Search and filter UI with REST API
 - Alerting rules engine with webhook delivery *(see maturity note below)*
 - SIEM forwarding *(see maturity note below)*
 - MCP server (stdio and SSE): `search_events`, `get_event`, `get_summary`
 - Entra OIDC auth and Azure Key Vault integration
 > **Maturity note — alerting and SIEM forwarding**: Both features are functional but proof-of-concept quality. They are suitable for evaluation and non-critical environments. Alerting has no UI for rule management and webhook delivery has no retry logic. SIEM forwarding is basic with no delivery guarantees. Production hardening of both is on the roadmap. Do not recommend these features for production use in critical environments without documenting this caveat to the client.
 **MCP server**: PULSAR's MCP server exposes audit event search to AI assistants. AURORA connects to this endpoint for cross-tool diagnostics.
 **Deployment**: Docker Compose. Full quickstart in the [GitHub README](https://github.com/cqrenet/pulsar). Azure deployment guide in `DEPLOY-AZURE.md`.
 **Engagement module pairings**: Module 12 (Blue/Purple Team Foundation) as the M365 detection layer; Module 10 (AI-Assisted TVM) for audit-trail enrichment; any retained capability engagement where M365 log retention is a client requirement.
 ---
 ### AURORA
 **What it does**: A unified operations platform that sits in front of PULSAR and ASTRAL. Connects to both via MCP, exposes a single AI interface, and provides cross-tool diagnostics that neither product can answer alone.
 AURORA stores no data. All data lives in PULSAR (MongoDB) and ASTRAL (Git).
 **Cross-tool diagnostic tools**:
 | Tool | What It Answers |
 |------|----------------|
 | `diagnose_policy_errors` | "Why is this Intune compliance policy succeeding on most devices but erroring on some?" — pulls ASTRAL policy config and PULSAR audit events for the same policy |
 | `explain_device_compliance` | "Why did this device suddenly become non-compliant?" — combines ASTRAL assignment data with PULSAR event timeline |
 | `correlate_drift_with_audit` | "Who in the portal triggered this configuration drift commit?" — matches ASTRAL Git commits with PULSAR audit events by timestamp |
 | `tenant_security_summary` | "What happened in my tenant this week that I should know about?" — combines open ASTRAL drift PRs with PULSAR event summary, generates executive briefing |
 | `compare_scopes` | "What's different between my production and development Conditional Access policies?" — cross-scope comparison |
 **Multi-scope orchestration**: AURORA connects to multiple named ASTRAL instances. Production read-only and development read-write in the same interface. Directly useful for clients with strict prod/non-prod separation.
 **Enriched SIEM forwarding**: PULSAR forwards raw audit events. AURORA forwards enriched events — audit events correlated with ASTRAL configuration state at the time of the event. This produces materially higher-quality data for SIEM detection rules.
 **Pricing** (EUR, ex. VAT):
 | Tier | Self-hosted | Hosted (fully managed) |
 |------|-------------|----------------------|
 | Single tenant | €259/mo (€2,590/yr) | €389/mo (€3,890/yr) |
 | Up to 5 scopes | €429/mo (€4,290/yr) | €599/mo (€5,990/yr) |
 | Enterprise | Custom | Custom |
 Self-hosted customers bring their own Azure OpenAI endpoint (or any OpenAI-compatible API including Ollama for local models). Hosted tier includes managed AI (~500 queries/month fair use).
 ---
 ## Regulatory Alignment (EU)
 The CQRE suite was designed with EU regulatory requirements as primary constraints, not afterthoughts.
 | Regulation | Requirement | CQRE capability |
 |------------|-------------|-----------------|
 | **NIS2** Art. 21 | Configuration management, logging and monitoring, access control | ASTRAL (config), PULSAR (logging), AURORA (cross-tool analysis) |
 | **DORA** Art. 10 | ICT incident log retention and monitoring | PULSAR (permanent audit log retention, searchable) |
 | **DORA** Art. 11 | ICT change management records | ASTRAL Git trail (timestamped, reviewed, approved) |
 | **GDPR** Art. 5(2) | Accountability principle — demonstrate compliance | ASTRAL Git history is directly usable as audit evidence |
 | **GDPR** Art. 32 | Appropriate technical measures for data protection | Continuous config governance + audit log retention |
 | **GDPR** Art. 33 | 72-hour breach notification | PULSAR enables rapid incident reconstruction |
 | **ISO 27001** A.8.9 | Configuration management | ASTRAL |
 | **ISO 27001** A.8.15–16 | Logging and monitoring | PULSAR |
 **Consultant talking point**: For clients in NIS2-regulated sectors (health, finance, digital infrastructure, public sector), the CQRE suite is not a nice-to-have — it directly maps onto mandatory Article 21 measures. Frame the deployment cost against the supervisory authority's enforcement posture in their country, not against a generic security ROI.
 ---
 ## Positioning in Engagements
 ### Combination A — PULSAR + ASTRAL (free entry, any engagement)
 Deploy the free stack at the start of any M365 engagement. ASTRAL provides the baseline capture that week 1 requires. PULSAR provides the audit trail that retained capability clients need. Both are free — there is no procurement barrier.
 ### Combination B — ASTRAL only (compliance-driven clients)
 Clients with ISO 27001 in progress, DORA obligations, or NIS2 scope often need the config change governance story before they need event correlation. ASTRAL alone answers the auditor's question: "show me every M365 change in the last 12 months with evidence it was reviewed and approved."
 ### Combination C — PULSAR only (incident-response or log-retention clients)
 Clients who have had a recent incident and discovered their audit logs were gone, or clients facing insurance requirements for log retention, are natural PULSAR deployments. Value is immediate — longer retention, bulk search, alerting.
 ### Combination D — Full stack with AURORA (mature clients, retained relationships)
 Clients who have run PULSAR + ASTRAL for at least one module cycle are ready for AURORA. The upsell requires no education — they already know the cross-tool investigation pain that AURORA removes. AURORA self-hosted is the right recommendation for technically capable clients with data sovereignty requirements. AURORA hosted is the right recommendation for SMBs who want zero operational burden.
 ### What to avoid
 Do not lead with AURORA in a first engagement. The value of the cross-tool diagnostics is only legible to clients who have experienced the investigation friction of running PULSAR and ASTRAL separately. Clients who have not felt that pain will not pay for the solution.
 ---
 ## Deployment Prerequisites
 | Product | Prerequisites |
 |---------|--------------|
 | ASTRAL | Azure DevOps organisation, Entra app registration (provisioned by bootstrap script), write access to ADO Git repo |
 | PULSAR | Docker Compose capable host (or Azure Container Apps for cloud deploy), Entra app registration (provisioned by bootstrap script), MongoDB |
 | AURORA | Running PULSAR + ASTRAL with MCP servers enabled; AURORA licence key; Docker Compose or Azure Container Apps |
 ---
 ## Objection Handling
 **"We already have Microsoft Purview / Sentinel."**
 Purview and Sentinel are E5 features — €28+/user/month. The CQRE free stack provides comparable log retention and config governance for the entire engineering cost of deploying it. For clients already at E5, AURORA provides the correlation layer that Sentinel and Purview still do not natively deliver.
 **"We don't want to run our own infrastructure."**
 AURORA hosted solves this. CQRE manages the entire stack. Single tenant starts at €389/month — less than one day of external incident response.
 **"We tried open source tools before and found them too complex."**
 The complexity objection is usually a maintenance objection, not a deployment objection. Address it directly: who will own this after we leave? If the client cannot name a person, the sovereign stack requires a retained capability support arrangement. If they can, the deployment is a few hours of consultant time.
 **"Can we see the code?"**
 ASTRAL and PULSAR are fully open source (MIT licensed) on GitHub. AURORA is commercial source — clients can request a code review under NDA as part of enterprise procurement.
 ---
 *For the full sovereign tool stack including third-party open source tools, see [Sovereign Tool Stack](sovereign-tool-stack.md).*
 *For module pairings and engagement sequencing, see [Modular Engagements](../core/modular-engagements.md).*
 *For retained capability support arrangements, see [Retained Capability](../core/retained-capability.md).*
@@ -0,0 +1,93 @@
 # Kill Chain Assessment App
 > *"We say it in every engagement: find the kill chain first. But how do you find it in territory you've never seen? You don't start with the chain — you start with the questions that surface the edges, and you let the graph tell you where the shortest path to the end of the company actually runs."*
 This document specifies the **Kill Chain Assessment app** — a single-file, offline browser tool a consultant runs during the diagnostic to turn an unknown estate into a mapped attack graph, compute the shortest existential path (the kill chain), and size every node on it into a remediation [quantum](../core/quantum-vulnerability-management.md).
 **The tool:** [`tools/kill-chain-assessment.html`](../tools/kill-chain-assessment.html) — open it in any browser. No install, no network, no data leaves the machine. State persists locally and exports to `.json` (to resume) and `.md` (to drop straight into the report or the [Findings Backlog](../assessment-templates/findings-backlog.md)).
 ---
 ## Why this needed to be built
 The handbook and the [Move Fast and Fix Things](../core/move-fast-and-fix-things.md) posture both rest on a single instruction: *fix the kill chain first.* The [assessment team guide](../assessment-templates/assessment-team-guide.md) tells you what to run (BloodHound, Purple Knight, Elysium, Entra checks); the [sample engagement](sample-engagement-mid-market.md) shows a finished kill chain drawn as an ASCII path. But between "run the tools" and "here is the finished chain" there is a synthesis step that has always lived only in the consultant's head: **taking a pile of findings about an unfamiliar estate and working out which sequence of them actually ends the company.**
 In unknown territory that synthesis is hard, inconsistent between consultants, and easy to get wrong — the obvious 9.8 grabs attention while the cheap two-hop path to the backups goes unseen. The app makes the synthesis explicit and repeatable: capture what you find as nodes and attacker moves, and let a shortest-path computation surface the chain you'd otherwise have to spot by eye. It is the missing instrument for the first and most important act of every engagement.
 ---
 ## The model
 ### Nodes
 A **node** is any asset, foothold, identity, or system. Each carries the attributes that determine its position in the chain:
 | Attribute | Meaning | Drives |
 |-----------|---------|--------|
 | **Layer** | entry / identity / privilege / device / data / infra-OT / recovery | Orientation, report grouping |
 | **Tier** | T0 / T1 / T2 ([T0 Asset Framework](../core/t0-asset-framework.md)) | Blast-radius weighting |
 | **Entry point** | Internet-reachable or unauth foothold | Source of the chain |
 | **Crown jewel** | Existential — the org cannot operate without it | End of the chain |
 | **Reachable?** | Can the adversary actually get to it (yes/no/**unknown**) | Quantum sizing |
 | **Exploit available?** | Working path/exploit in the wild (yes/no/**unknown**) | Quantum sizing |
 | **Compensating control** | EDR / WAF / segmentation already in front | Quantum sizing (the ~90% subtraction) |
 The "unknown" values are first-class, not placeholders: a node you cannot characterise is a **dark quantum**, and capturing it honestly is the point.
 ### Moves (edges)
 A **move** is one directed attacker step — "from here, an attacker can reach there" — with a *mechanism* (how: DCSync, NTLM relay, password spray, reused credential, OAuth consent) and an *effort* weight from 1 (trivial) to 5 (very hard). Effort is the consultant's judgement of how hard that single hop is for the adversary.
 ### The computation
 The app runs a **multi-source Dijkstra** from every entry point across the move graph, and finds the **lowest-total-effort path to any crown jewel.** That path *is* the kill chain — the cheapest route from foothold to existential impact. The tool then classifies every node:
 - **P0** — on the shortest chain. Break any one link and the existential path is severed.
 - **P1** — on *some* path from an entry to a jewel (reachable-from-entry ∧ can-reach-a-jewel), but not on the cheapest one.
 - **P2 / off-chain** — not on any path to a crown jewel. Real, but not existential — housekeeping, not kill chain.
 This is the [Move Fast](../core/move-fast-and-fix-things.md) doctrine made computable: *kill-chain position sets priority, not CVSS.*
 ### Quantum sizing
 Each node on a chain is sized into a [quantum](../core/quantum-vulnerability-management.md) by the same logic the framework defines:
 | Quantum | Condition | Budget / action |
 |---------|-----------|-----------------|
 | **Critical** | On shortest chain, reachable **yes**, exploit **yes**, not compensated | **Hours** — sever reachability / compensating control now |
 | **Severe** | On a chain, reachable **or** exploit = yes | **Days** — one change window, verify enforcement |
 | **Standard** | On a chain, neither reachable nor exploitable yet | **Sprint** — batch; patch velocity fits here |
 | **Dark** | On a chain but reachability **or** exploit = unknown | **Unsized** — route to discovery; characterise first |
 ---
 ## How to run it in an engagement
 1. **Open the tool** and clear the sample (or keep it as a worked reference). Switch to the **Discovery** tab — it lists, per layer, the questions and commands that surface edges (external scan for entries, the Connect sync account for the cloud↔on-prem bridge, BloodHound `shortestPath` for privilege, "what stops the business operating?" for jewels, flat-network checks for blast radius). This is the unknown-territory protocol.
 2. **Capture as you go.** Every finding from the [assessment team guide](../assessment-templates/assessment-team-guide.md) becomes a node; every "an attacker could move from X to Y" becomes a move. Mark entries and jewels. Leave reachability/exploit as *unknown* when you genuinely don't know — that flags the dark quanta to chase.
 3. **Read the chain.** The centre panel draws the attack graph and highlights the shortest existential path in red. The right panel sizes the quanta. If no path is found, either the estate is genuinely segmented there (note it as a win) or you haven't mapped the connecting moves yet — in unknown territory, assume the latter until proven.
 4. **Export.** `Export report .md` produces a kill-chain section, quantum-bucketed remediation, and a priority table ready to paste into the diagnostic deliverable. `Save .json` lets you resume or hand off.
 5. **Close the loop.** After remediation, reload the `.json` and ask the antifragile question the framework demands: *did the chain get shorter?* A severed link or a collapsed privilege should visibly lengthen the shortest path or remove it entirely.
 ---
 ## What it is and is not
 It is a **synthesis and prioritisation instrument** — it makes the consultant's kill-chain judgement explicit, repeatable, and exportable, and it removes the human error of eyeballing the cheapest path. It is deliberately **offline and dependency-free** (Pillar 4, Sovereign Intelligence: the attack graph of a client estate must never leave the consultant's machine for a vendor cloud).
 It is **not** a scanner and not an autonomous agent. It does not discover assets for you — it structures what you discover. The discovery still comes from the tools in the [assessment team guide](../assessment-templates/assessment-team-guide.md) and the [zero-budget discovery](zero-budget-vulnerability-discovery.md) playbooks; the autonomous hours-lane execution lives in [AI-Assisted TVM](ai-assisted-tvm.md). This tool is the bridge between them: it turns raw discovery into a sized, prioritised chain that the rest of the programme acts on.
 ---
 ## Roadmap (build-later)
 The current tool is a self-contained synthesis instrument. Natural extensions, in priority order:
 1. **Import from BloodHound / Purple Knight** — ingest exported attack paths directly as nodes and moves, rather than hand-entry.
 2. **PULSAR / ASTRAL signal overlay** — pull live reachability and config-drift signal so "reachable?" is answered by observation, not assertion (Book I: validate by observation).
 3. **Chain-shortening tracker** — store successive `.json` snapshots and chart kill-chain length over time, making the antifragile feedback loop a number on a dashboard.
 4. **Multi-chain view** — surface the top-N existential paths, not just the cheapest, so secondary chains (the [sample engagement](sample-engagement-mid-market.md) on-prem path) aren't hidden behind the primary.
 ---
 *Specified for [Book VII — Vulnerability Management](../books/06-vulnerability-management.md) and the [Quantum Vulnerability Management](../core/quantum-vulnerability-management.md) framework. The tool: [`tools/kill-chain-assessment.html`](../tools/kill-chain-assessment.html).*
@@ -122,7 +122,7 @@ Set-AdminAuditLogConfig -UnifiedAuditLogIngestionEnabled $true
 - Retention: 90 days (E3 default); document the gap vs. 1-year requirement in some regulations
 - Export for analysis: `Search-UnifiedAuditLog` or use Microsoft Purview Audit (Standard) if available
- **AOC integration**: For clients with AOC deployed, unified audit logs are ingested automatically and correlated with Entra ID sign-in events to surface anomalous admin behaviour without manual PowerShell queries
+- **PULSAR integration**: For clients with PULSAR deployed, unified audit logs are ingested automatically and correlated with Entra ID sign-in events to surface anomalous admin behaviour without manual PowerShell queries
 **Enable Mailbox Auditing**
@@ -344,6 +344,6 @@ See [Vertical: Banking](../reference/vertical-banking.md) for full regulatory al
 *Previous: [Zero-Budget Hardening](zero-budget-hardening.md)*
 *Next: [AD and Endpoint Hardening](ad-endpoint-hardening.md)*
-*For the complete open-source tool arsenal including ASTRAL and AOC, see [Sovereign Tool Stack](sovereign-tool-stack.md)*
+*For the complete open-source tool arsenal including ASTRAL and PULSAR, see [Sovereign Tool Stack](sovereign-tool-stack.md)*
 For how Intune deployment becomes the natural entry point for broader security transformation, see [Endpoint Management Entry Vector](endpoint-management-entry-vector.md).
@@ -0,0 +1,251 @@
 # ORION — Technical Proposition
 > *"The kill chain exists before you have access to a single system. It's already drawn — in the org chart, the procurement history, the sector's threat landscape, and the things people will tell you in a room if you ask the right questions. ORION is the instrument for reading that chain on day zero, before a single tool has touched the estate."*
 **Codename:** ORION (the Hunter — it hunts the kill chain). Celestial, consistent with ASTRAL / PULSAR / AURORA. Rename freely.
 **Status:** Technical proposition — pre-build. This document exists to be argued with before any code is written.
 **One line:** ORION is the pre-engagement intake, interview, and threat-intelligence layer that produces the input the [Kill Chain Assessment app](kill-chain-assessment-app.md) (L1) consumes — turning structured human answers and public intelligence into a *hypothesised* attack graph, without ever touching client infrastructure.
 ---
 ## 1. Why this needs to exist
 The L1 [Kill Chain Assessment app](kill-chain-assessment-app.md) is a synthesis instrument: you feed it nodes and attacker moves you've already discovered, and it computes the shortest existential path and sizes the [quanta](../core/quantum-vulnerability-management.md). It assumes you already have findings — BloodHound paths, Entra checks, the [assessment team guide](../assessment-templates/assessment-team-guide.md) output.
 But on **day zero of a new engagement** you have none of that. You may not even have access yet — the contract may not permit infrastructure contact, the change-advisory board hasn't met, the client's legal team is still reviewing the scope. And yet this is exactly the moment the consultant most needs a hypothesis: *where is this company's kill chain likely to run, what should we ask, and what should we look at first when access arrives?*
 Today that reasoning lives entirely in the experienced consultant's head. It is the single least reproducible, least scalable part of the practice — a senior consultant walks in, asks fifteen sharp questions, and forms a mental model of the likely kill chain; a junior consultant asks the obvious questions and misses it. ORION makes that reasoning **explicit, structured, intel-informed, and repeatable** — and it does so in the window before fieldwork is even possible.
 ORION is, deliberately, the "What If" tool of the assessment world (Book I). It produces a *declared* picture — what the client says, what public intel suggests — which is precisely the picture the rest of the engagement exists to validate by observation. Naming that honestly is the whole design (see §7).
 ---
 ## 2. The hard boundary: ORION never touches client infrastructure
 This is the defining constraint and the primary selling point, not a limitation to apologise for.
 ORION works from exactly two input classes:
 1. **What humans tell it** — structured intake and questionnaire responses from the client.
 2. **Passive public intelligence** — sector threat landscape, CISA KEV, vendor advisories, exploited-CVE feeds, public OSINT about the named technology stack. **Passive only**: ORION reads public and threat-intelligence sources. It does *not* perform active external scanning — that is a separate, consented capability (see [Perimeter Scanning Capability](perimeter-scanning-capability.md)) and explicitly out of ORION's scope.
 What this buys:
 - **Zero onboarding friction.** No credentials, no agent, no firewall change, no data-processing agreement for telemetry. ORION can run during the sales conversation, in the pre-contract phase, or in a sector where the client cannot yet grant access.
 - **No incident risk.** A tool that touches nothing breaks nothing and triggers no alerts. It can never be the cause of an outage or a "who ran that scan?" conversation.
 - **Clean legal posture.** The only client data ORION holds is what the client deliberately typed into a questionnaire. That is a categorically simpler privacy and liability position than any tool that ingests infrastructure data.
 The boundary is also the honest limit: because ORION observes nothing, everything it produces is a hypothesis (§7).
 ---
 ## 3. The three-stage workflow
 ### Stage 1 — Intake (minutes)
 A short structured form establishes the engagement's shape. The consultant fills this, usually from the first call:
 - Sector and sub-sector (drives the threat-landscape lookup and the regulatory profile)
 - Size, geography, and regulatory exposure (NIS2 / DORA / GDPR / sector-specific)
 - Technology footprint at a coarse level: M365 (E3/E5/BP), hybrid AD vs cloud-only, major cloud, OT/ICS presence, internet-facing services they'll admit to
 - Business-level crown jewels: "what stops the company operating?" — ERP, payment rails, OT control, the customer database
 - Known history: prior incidents, prior pentest, known pain points
 ### Stage 2 — Generate the tailored questionnaire (the core trick)
 ORION's LLM expands the intake into a **detailed, role-targeted, adaptive questionnaire**, and this is where it earns its keep. The questionnaire is:
 - **Role-segmented** — separate tracks for the identity/AD admin, the M365 admin, the network/OT lead, and the business owner. Each person answers only what they'd know.
 - **Adaptive** — questions branch on prior answers. Hybrid AD declared → the Entra Connect sync-account and DCSync questions appear. OT declared → Purdue-model and remote-vendor-access questions appear. Cloud-only → the questionnaire skips on-prem forest-recovery questions entirely.
 - **Framed against the kill chain, not compliance** — every question maps to a candidate node or edge ("Do any standing Domain Admins log into normal workstations for email?" targets a known privilege-path edge), not to a control checkbox. This is the inversion the whole practice rests on.
 The client fills it via a shared per-engagement link, partially and over time, with their own people answering their own sections.
 ### Stage 3 — Synthesis → hypothesised kill chain → L1 export
 From the responses plus the threat intel, ORION proposes:
 - **Candidate entry points** (internet-facing services, legacy auth, the contractor-access pattern), each with the intel that suggests it.
 - **Candidate crown jewels** (from the business answers).
 - **Hypothesised moves** between them, each with a *mechanism*, a *confidence*, and a *rationale citing its source* ("hybrid AD + unrotated KRBTGT declared → likely Entra-Connect→on-prem DCSync edge").
 - **A prioritised "look here first" list** for when fieldwork begins — what to point BloodHound, the Entra review, and the L1 app at on day one.
 The synthesis exports directly to the **L1 Kill Chain Assessment app's `.json` schema**, so the consultant opens L1 with the hypothesised graph already drawn and spends fieldwork *validating and correcting* it rather than building from a blank canvas. ORION hypothesises; L1 plus fieldwork confirm or kill each hypothesis by observation.
 ---
 ## 4. Threat-intelligence layer
 ORION continuously contextualises the client against the *current* threat environment — the dimension a static questionnaire can't capture and the one that feeds the [quantum](../core/quantum-vulnerability-management.md) sort key's "exploit availability" axis:
 - **CISA KEV and exploited-CVE feeds** — for the client's named technologies, what is being exploited *now*.
 - **Vendor advisories** — current critical advisories for their declared stack (the VPN appliance, the mail gateway, the ERP).
 - **Sector threat landscape** — which actors and ransomware groups are currently targeting their vertical, drawn from public reporting.
 Each intel item carries **provenance** (source, date, URL) because ORION's output is advisory and the consultant must be able to trace and re-verify every claim. Threat intel ages fast; ORION timestamps everything and treats stale intel as a prompt to re-check, never as fact.
 ---
 ## 5. Architecture
 Deliberately mirrors CISO Assistant and the AURORA model so it's familiar to operate and fits the suite.
 ```
 ┌─────────────────────────────────────────────────────────────┐
 │  ORION (Docker Compose, consultant self-hosted)              │
 │                                                              │
 │  ┌────────────┐   ┌──────────────┐   ┌───────────────────┐  │
 │  │  Web UI    │   │  API backend │   │  PostgreSQL       │  │
 │  │ (SvelteKit │◄─►│ (FastAPI or  │◄─►│  engagements,     │  │
 │  │  or React) │   │  Django/DRF) │   │  responses,       │  │
 │  └────────────┘   └──────┬───────┘   │  hypotheses       │  │
 │   client fills           │           └───────────────────┘  │
 │   questionnaire          │                                   │
 │   via shared link        ▼                                   │
 │              ┌──────────────────────┐                        │
 │              │  LLM abstraction     │  pluggable backend     │
 │              │  layer               │──► Ollama (default)    │
 │              └──────────────────────┘──► Azure OpenAI (opt)  │
 │                         │           └──► llm.cqre.net (opt)  │
 │                         ▼                                     │
 │              ┌──────────────────────┐                        │
 │              │ Threat-intel         │  passive fetch only:   │
 │              │ connector module     │──► CISA KEV, advisories│
 │              └──────────────────────┘──► curated OSINT/search│
 │                         │                                     │
 │              ┌──────────┴───────────┐   ┌─────────────────┐  │
 │              │ L1 export adapter    │──►│ kill-chain .json│  │
 │              └──────────────────────┘   └─────────────────┘  │
 │              ┌──────────────────────┐                        │
 │              │ MCP server           │  AURORA / Claude can   │
 │              │ (query ORION)        │  query engagements     │
 │              └──────────────────────┘                        │
 └─────────────────────────────────────────────────────────────┘
            NO connection to client infrastructure
 ```
 Components:
 - **Backend** — FastAPI (Python) or Django REST, matching CISO Assistant's proven stack. Houses the questionnaire engine, synthesis orchestration, and export.
 - **Frontend** — SvelteKit or React. Two surfaces: the consultant console and the client-facing questionnaire (shareable per-engagement link, no client login burden beyond a token).
 - **LLM abstraction layer** — single internal interface, swappable backend. **Default: local Ollama** so sensitive intake data never leaves the box (§6). Optional: Azure OpenAI (EU) or managed `llm.cqre.net`, exactly as ASTRAL/AURORA offer.
 - **Questionnaire engine — questions-as-data** — adopting CISO Assistant's "frameworks as data, not code" principle: questionnaire templates, branching rules, and node/edge mappings live in the database as editable data, so new sector packs and question sets ship without code changes.
 - **Threat-intel connector** — passive fetchers for KEV, advisories, and curated search, each normalised into a provenance-tagged `ThreatIntelItem`.
 - **L1 export adapter** — emits the exact `.json` schema the L1 app imports.
 - **MCP server** — exposes ORION engagement state to AURORA and to AI assistants, consistent with the rest of the suite.
 ### Data model (sketch)
 | Entity | Holds | Notes |
 |--------|-------|-------|
 | `Engagement` | Client, scope, status | Per-engagement isolation boundary |
 | `IntakeProfile` | Stage-1 answers | Drives questionnaire generation |
 | `QuestionnaireTemplate` | Questions, branching rules, node/edge mappings | Questions-as-data; sector packs |
 | `Response` | Client answers, respondent role, timestamp | Sensitive — encrypted at rest |
 | `ThreatIntelItem` | Intel + source + date + URL | Provenance mandatory |
 | `Hypothesis` | Candidate node/edge + confidence + rationale + sources | The advisory output; never a "finding" |
 | `Export` | Generated L1 `.json` snapshots | Versioned, so you can diff intake-time vs post-fieldwork |
 ---
 ## 6. Sovereignty and data handling
 ORION holds something genuinely sensitive: a client's own description of where they are weak. That is a map of the kill chain drawn by the victim. The data posture must be uncompromising and is a direct expression of Pillar 4 (Sovereign Intelligence — never rent your ability to think) and Pillar 1.
 - **Local LLM by default.** Ollama runs in the same Compose stack; intake and responses never leave the consultant's host unless a backend is *explicitly* switched. The default must be the safe one.
 - **Encryption at rest** for `Response` and `Hypothesis` data; per-engagement key isolation.
 - **Retention and deletion.** Each engagement has a retention clock and a hard "right to delete" — when the engagement closes, the client's answers can be destroyed and the destruction evidenced (GDPR-friendly, and the right thing).
 - **No telemetry, no phone-home.** Consistent with the offline ethos of the L1 tool.
 - **Untrusted-content handling.** Threat-intel fetched from the web is untrusted input — treated as data, never as instructions to the LLM (prompt-injection defence, §8).
 ---
 ## 7. The epistemic honesty layer (the most important section)
 ORION's single greatest risk is that its confident, well-written output gets mistaken for fact. The repo's founding principle (Book I) is *validate by observation, never by inspection* — and ORION, by design, observes nothing. So the design must make its own uncertainty impossible to ignore:
 - **Everything ORION emits is a `Hypothesis`, never a `Finding`.** The vocabulary is enforced in the data model and the UI. A finding comes from the [assessment team guide](../assessment-templates/assessment-team-guide.md) fieldwork and lands in the [Findings Backlog](../assessment-templates/findings-backlog.md); a hypothesis comes from ORION and lands in L1 as something *to test*.
 - **Confidence and provenance on every claim.** No hypothesis without a stated confidence and the source(s) — the client answer or the intel item — that produced it.
 - **The "ghost-assessment" trap, named.** Just as a ghost CA policy displays correct config while enforcing nothing (Book I corollary), a client questionnaire can describe a control that has rotted into a ghost. ORION's hypotheses inherit the client's blind spots. The output must say so, loudly, and route every load-bearing claim to observation.
 - **The handoff is explicit.** ORION's deliverable is not "here is your kill chain." It is "here is the kill chain we *expect*, ranked by where to look first — now go and prove or disprove each link." That handoff into L1 and fieldwork is the product, not the hypothesis itself.
 Get this section right and ORION strengthens the practice. Get it wrong and it becomes the most dangerous thing in the toolkit: a confident map of a territory no one checked.
 ---
 ## 8. LLM guardrails
 - **Human-in-the-loop, always.** ORION proposes; the consultant disposes. No hypothesis auto-promotes to a finding, and ORION takes no action on anything.
 - **Prompt-injection defence.** Web/threat-intel content is wrapped and labelled as untrusted data; the system prompt instructs the model to treat fetched content as evidence to summarise, never as commands.
 - **Hallucination control.** Provenance is mandatory; a claim with no traceable source is flagged, not shown as fact. The consultant can click any hypothesis through to its sources.
 - **Quality floor.** Local models are weaker; the proposition should set an expectation that the default Ollama model is adequate for questionnaire generation and basic synthesis, with Azure OpenAI recommended where deeper reasoning materially helps — and the UI should make the active model and its limits visible.
 ---
 ## 9. How it fits the engagement
 | Phase | ORION's role |
 |-------|--------------|
 | Pre-contract / sales | Stage-1 intake during the first conversation; instant sector threat-landscape briefing as a credibility opener |
 | [Brownhat Diagnostic](../assessment-templates/nist-csf-baseline.md) intake | Generate and distribute the tailored questionnaire; collect responses before the on-site half-days |
 | Fieldwork ([assessment team guide](../assessment-templates/assessment-team-guide.md)) | Hand the consultant a hypothesised graph and a "look here first" list; fieldwork validates by observation |
 | L1 mapping | Import ORION's `.json`; correct and confirm; compute the real shortest existential path |
 | Reporting | Diff intake-time hypotheses against confirmed findings — a powerful "what you told us vs what we found" narrative for the client |
 ---
 ## 10. Regulatory alignment (EU)
 | Regulation | Requirement | ORION relevance |
 |------------|-------------|-----------------|
 | **NIS2** Art. 21 | Risk analysis, supply-chain and access governance | Structured intake produces documented evidence of risk-analysis scoping at engagement start |
 | **DORA** | ICT risk identification | The hypothesised kill chain is an ICT-risk-identification artefact (clearly marked as preliminary) |
 | **GDPR** Art. 5/32 | Data minimisation, appropriate measures, accountability | Local-LLM default, encryption, retention/deletion — minimal, sovereign handling of the only PII it holds |
 ---
 ## 11. Phased build (proposed MVP → product)
 1. **Phase 1 — MVP.** Stage-1 intake, LLM questionnaire generation (Ollama), manual-assisted synthesis, L1 `.json` export. No threat intel yet. Proves the core loop.
 2. **Phase 2 — Threat intel.** KEV / advisory / curated-search connectors with provenance; exploit-availability enrichment of hypotheses.
 3. **Phase 3 — Adaptive + integrated.** Full branching questionnaire engine (questions-as-data), MCP server, AURORA integration, sector question packs.
 4. **Phase 4 — Productisation.** Hosted tier, multi-engagement console, RBAC, retention automation.
 ---
 ## 12. Provisional commercial framing
 Positioned like AURORA — self-hosted and hosted tiers — though pricing is a placeholder pending the build decision:
 | Tier | Self-hosted | Hosted (managed) |
 |------|-------------|------------------|
 | Per-consultant / small practice | TBD | TBD |
 | Practice / multi-seat | TBD | TBD |
 Self-hosters bring their own LLM (Ollama / Azure OpenAI); hosted tier includes a managed model. Note the natural bundling: ORION (pre-engagement) → L1 Kill Chain Assessment (synthesis) → ASTRAL/PULSAR/AURORA (the operational layer once access exists).
 ---
 ## 13. What ORION is NOT
 - **Not a scanner and not an agent.** It touches no client system, active-scans nothing, and runs nothing in the client environment.
 - **Not autonomous.** It proposes hypotheses for a consultant; it never acts and never self-promotes a hypothesis to a finding.
 - **Not a replacement for fieldwork or for L1.** It is the layer *before* them — it tells you where to look, it does not tell you what is true.
 - **Not a compliance questionnaire tool.** The questions target the kill chain, not a control checklist; CISO Assistant covers the GRC/framework job and ORION should integrate with it, not duplicate it.
 ---
 ## 14. Open questions for the build decision
 1. **Backend choice** — FastAPI (lighter, our synthesis is bespoke) vs Django/DRF (matches CISO Assistant, more batteries). Leaning FastAPI.
 2. **Client-facing surface** — shared tokenised link (low friction) vs lightweight client login (more control). Leaning tokenised link with per-engagement expiry.
 3. **Where is the OSINT/active line drawn exactly?** Confirm ORION stays strictly passive and that any external scanning is deferred to the consented [Perimeter Scanning Capability](perimeter-scanning-capability.md).
 4. **CISO Assistant integration depth** — loose (export/import) vs deep (shared data model). Loose first.
 5. **Default Ollama model and the quality floor** — which local model is "good enough" for questionnaire generation, and where do we tell consultants to switch to Azure OpenAI.
 6. **Hypothesis accuracy expectations** — how do we measure and communicate that ORION's day-zero map is a starting hypothesis, and track how often it was right once fieldwork closed the loop?
 ---
 *Companion to the [Kill Chain Assessment app](kill-chain-assessment-app.md) (L1), [Book VII — Vulnerability Management](../books/06-vulnerability-management.md), and the [Quantum Vulnerability Management](../core/quantum-vulnerability-management.md) framework. Positioned in the suite alongside [ASTRAL, PULSAR, and AURORA](cqre-product-suite.md).*
@@ -17,6 +17,51 @@ The antifragile answer is a two-layer architecture: **network access** (Tailscal
 ---
 ## When overlay management networks help — and when they don't
 **Enterprises with their own data centres** already have the physical substrate for a proper management network: dedicated VLANs, hardware segmentation, jump boxes. Adding an overlay management network introduces a new Tier 0 component (the coordinator) on top of infrastructure that already solves the problem. The complexity cost outweighs the benefit. Traditional management VLAN segmentation, done properly, is the right answer.
 **SME clients with multi-cloud resources, containers, and DevOps workloads** have a different problem: there is no physical network to segment. Resources are scattered across Azure, AWS, a colo, and maybe on-prem. The management plane does not exist yet — you are building it. An overlay is how you build it, and it is the right answer for this context.
 **The T0/T1 split** — applying the tier model to the overlay itself:
 - **T0 systems** (domain controllers, ADCS, Entra Connect sync server — the identity control plane): use **Nebula**. No coordinator in the runtime path — once certificates are distributed, the overlay functions with zero external dependencies. The Nebula CA is the only Tier 0 component, and it can be kept offline. This means no coordinator to compromise, no external API call, no cloud service availability dependency for reaching your most critical systems.
 - **T1 systems** (member servers, cloud workloads, Kubernetes clusters, multi-cloud management): use **Tailscale** (or Headscale for sovereign requirements). Per-node ACLs, Entra OIDC integration, per-session MFA via key expiry and IdP enforcement. The coordinator trust concern is more acceptable at T1 — a compromised coordinator affects T1 access, not T0.
 **The T0 node count is not scary.** For a 5,000-person organisation, the realistic T0 Nebula population is:
 | Component | Count |
 |-----------|-------|
 | Domain Controllers | 4–8 |
 | Entra Connect / Cloud Sync server | 1–2 |
 | ADCS issuing CA | 1–2 |
 | AD FS servers (if not yet removed) | 0–4 |
 | Cloud admin VMs / PAWs | 5–10 |
 | **Total** | **~15–25 nodes** |
 Certificate management for 15–25 nodes is a documented procedure, not an operational burden. The CA signing ceremony happens a few times a year when a PAW is replaced or an admin leaves. This is tractable.
 ---
 ## The PAW problem and the cloud admin VM
 Physical PAWs are the right principle. They almost never get deployed. Hardware procurement, second device on the desk, behaviour change — the project dies before it starts.
 The **cloud-hosted admin workstation** preserves the essential security properties without the hardware problem:
 - A Windows 365 or Azure Virtual Desktop VM provisioned from a hardened template
 - Used only for privileged tasks (no email, no general browsing)
 - Connected to the Nebula T0 overlay (for DC access) and Tailscale T1 overlay (for server/cloud access)
 - Accessed by the admin from their normal device via browser or RDP client
 - Privileged credentials live in the cloud VM, not on the admin's local device
 - Compromise response: wipe the VM, reprovision from template in 20 minutes
 The security property that matters — privileged credentials do not touch the device used for email and browsing — is preserved. An attacker who compromises the admin's local device gets a browser session to a cloud VM that requires phishing-resistant MFA to reach. They do not get cached credentials, session tokens, or WireGuard keys for the management overlay.
 **When to use a physical PAW instead:** clients with a strong security culture and genuine appetite for the operational overhead, OT/ICS environments where the management workstation may need to be air-gapped, or engagements where the threat model includes a sophisticated attacker who would attempt to compromise the RDP session interactively.
 ---
 ## The Two Layers
 ### Layer 1: Network Access — Tailscale / Headscale + WireGuard
@@ -130,6 +175,30 @@ This catches more clients than it appears. A manufacturing company with 800 empl
 ---
 ### Nebula — T0 Management Overlay
 | Attribute | Detail |
 |-----------|--------|
 | **What it does** | WireGuard-based overlay mesh with no coordinator in the runtime path. Nodes authenticate via pre-distributed certificates signed by a local CA. Lighthouse nodes handle NAT traversal only — they are not in the authentication path. |
 | **Why it is right for T0** | No external runtime dependency. A compromised or unavailable coordinator cannot affect T0 access. The CA (the actual trust anchor) can be kept offline and brought up only for certificate issuance. |
 | **Trade-off vs Tailscale** | No dynamic node management (adding/removing a node requires a CA operation and cert redistribution); no cloud-managed control plane; higher initial setup complexity; certificate revocation requires distributing an updated blocklist |
 | **Why the trade-off is acceptable for T0** | T0 node population is small (15–25 nodes) and stable. Revocation events (lost PAW, departing admin) are rare and known immediately. The operational overhead is a documented ceremony run a few times a year, not a recurring burden. |
 | **Antifragile pillar** | Structural Decoupling, Sovereign Intelligence |
 | **When to deploy** | T0 systems (DCs, sync server, ADCS) in any estate; air-gapped or restricted environments; clients where the management plane must have zero external runtime dependencies |
 **Nebula CA management — the one non-trivial operation:**
 The Nebula CA private key is the trust anchor for the entire T0 overlay. It must be treated accordingly:
 - Air-gapped machine (a dedicated laptop that is never networked, or a hardware security module)
 - Documented signing ceremony: who is authorised to sign a new certificate, what approval is required, what the procedure is
 - Named individuals (minimum two) who know the procedure and can perform it
 - CA key backup: encrypted, stored separately from the signing machine, tested
 - Short certificate lifetimes (90–180 days) so revocation is handled implicitly by non-renewal as much as by explicit blocklist distribution
 This is the same discipline as an offline root CA — because that is functionally what it is.
 ---
 ### Smallstep — Certificate-Based SSH Access
 | Attribute | Detail |
@@ -145,20 +214,34 @@ This catches more clients than it appears. A manufacturing company with 800 empl
 ## The Decision Framework
 ```
-Does the client have legacy VPN sprawl or flat-network vendor access?
+Does the client have their own data centre with physical network infrastructure?
-├── YES → Deploy Layer 1 (network access) first
+├── YES → Traditional management VLAN segmentation + jump box
-│   ├── Wants managed service + commercial support → Tailscale (partnership)
+│          Overlay adds complexity without proportional benefit here
 └── NO / Multi-cloud / Scattered resources → Overlay is the right management plane
 Does the client need a T0 management overlay (DC, ADCS, sync server access)?
 ├── YES → Nebula (no external runtime dependency, CA offline)
 │   └── Admin workstation: cloud admin VM (W365/AVD) or physical PAW, enrolled in Nebula
 │
 Does the client need a T1 overlay (servers, cloud workloads, K8s, DevOps)?
 ├── YES → Layer 1 (network access)
 │   ├── Wants managed service + commercial support → Tailscale + Entra OIDC + key expiry MFA
 │   └── Wants full sovereignty / data residency → Headscale + WireGuard
 │
 Does the client need protocol-aware session recording / JIT / DB access?
 ├── YES → Add Layer 2 (PAM)
 │   ├── < 100 employees AND < $10M revenue → Teleport CE (free, self-hosted)
-│   ├── Larger org / needs support → Teleport Enterprise (commercial)
+│   ├── Larger org / needs support → Teleport Enterprise (commercial, verify current pricing)
-│   └── SSH-only, budget-constrained → Smallstep (certificates only)
+│   └── SSH-only, budget-constrained → Smallstep (certificates only, no session recording)
 │
-Does the client need both layers?
+Typical SME multi-cloud client:
-├── MOST CLIENTS → Tailscale (network) + Teleport CE/Enterprise (PAM)
+├── T0: Nebula + cloud admin VMs
-└── OT/CRITICAL INFRA → Headscale (sovereign network) + Teleport (recorded vendor access)
+├── T1: Tailscale + Entra OIDC
 └── Session recording: Teleport CE if eligible, otherwise accept the gap and compensate with
    cloud VM audit logging and Tailscale connection logs
 OT / Critical infrastructure:
 └── Headscale (sovereign T1) + Nebula (T0 where applicable) + Teleport (vendor session recording)
 ```
 ---
@@ -4,20 +4,104 @@
 ## For the Executive Reader
-This is not a three-year digital transformation. It is a **180-day strategic reset** with measurable business outcomes at each phase gate.
+This is not a three-year digital transformation. It is a **180-day foundation programme** with measurable progress at each phase gate.
 | Phase | Timeline | What the Board Sees |
 |-------|----------|---------------------|
-| **Hygiene** | Days 0-30 | Visibility. For the first time, we know every identity, asset, and gap that could end the company. |
+| **Visibility** | Days 0–60 | We know the kill chain. T0 assets are identified, critical privileges are mapped, and logging is operational. |
-| **Control** | Days 30-60 | Containment. The highest-risk exposures are closed using tools already owned. |
+| **Control** | Days 60–120 | The highest-risk kill chain nodes are closed. MFA is enforced on privileged accounts. Critical gaps have evidence-backed remediation. |
-| **Sovereignty** | Days 60-90 | Ownership. Proprietary intelligence is reclaimed. Recovery from disaster is proven, not assumed. |
+| **Signal** | Days 120–180 | Detection capability is built on the hardened foundation. Housekeeping is running as a permanent stream. The organisation can operate and maintain what was built. |
-| **Antifragility** | Days 90-180 | Advantage. The organization learns faster from disruption than competitors do. |
+| **Antifragility** | Ongoing | Structural improvement, retained capability, and progressive reduction of technical debt. This phase does not end. |
 **What 180 days delivers**: A hardened foundation, closed kill chain, operational detection capability, and the processes to sustain them. Not a complete transformation — a credible, maintained starting point.
 **What 180 days does not deliver**: Elimination of all technical debt (that takes years), full AI sovereignty (that is a multi-year journey), or zero vendor dependencies (that is an ongoing programme). Promising otherwise is dishonest and destroys client trust when reality arrives.
 **Investment principle**: Configuration first. Procurement only if justified. Most value is extracted from existing tools before any new purchase is discussed.
-**Governance**: Weekly steering committee. Monthly board update. Quarterly antifragility assessment. Hard go/no-go gates at days 30, 60, and 90.
+**Governance**: Weekly check-in with named client lead. Monthly steering committee. Hard go/no-go gates at days 30, 90, and 180.
-**Modularity**: While this document presents the full 180-day program, every phase can be delivered as an independent, fixed-scope module. See [Modular Engagements](../core/modular-engagements.md) for the menu of standalone engagements.
+**Modularity**: Every phase can be delivered as an independent, fixed-scope module. See [Modular Engagements](../core/modular-engagements.md) for the standalone engagement menu.
 ---
 ## Milestone Deliverables: What You Hold in Your Hands
 The three milestone dates — Day 30, Day 90, Day 180 — are not arbitrary progress checkpoints. Each produces a specific, verifiable set of deliverables. A client who stops at Day 30 still holds something of lasting value. A client who reaches Day 180 holds everything below in a form they can operate without us.
 ### Day 30: Intelligence and Immunity
 *Precondition: Brownhat Diagnostic complete, access provisioned by kickoff.*
 | # | Deliverable | Verified by |
 |---|-------------|-------------|
 | 1 | **Brownhat Diagnostic report** — kill chain identified, up to 5 immediate quick wins, prioritised module roadmap | Delivered document |
 | 2 | **ASTRAL deployed** — complete M365 tenant configuration snapshot committed to Git; drift detection running; event-driven change probe active | First drift PR visible in ADO |
 | 3 | **PULSAR deployed** — all M365 admin audit events ingesting; logs searchable from day 1 forward; 12-month retention accumulating | Oldest log entry confirmed in UI |
 | 4 | **T0 accounts hardened** — every Global Admin, Domain Admin, and high-privilege service principal identified; MFA enforced; documented with owner | CA sign-in logs show MFA enforced for T0 accounts |
 | 5 | **Public attack surface report** — all internet-facing assets enumerated; P0 findings (internet-exposed + critical CVE) identified and prioritised | Delivered report |
 | 6 | **Quick wins closed** — up to 5 immediate improvements from Brownhat findings, using existing tools, zero procurement | Closed items documented in change log |
 | 7 | **Findings backlog opened** — all Brownhat Diagnostic findings entered with P0/P1/P2 priority, owner assigned per item, monthly cadence confirmed; this is the input queue for the housekeeping stream | Backlog visible in agreed system; all findings from items 1–6 above entered |
 **The Day 30 value**: You know your kill chain. Your M365 configuration is under version control and your audit logs are being retained — permanently, from this day. Your most privileged accounts are hardened. This stands on its own regardless of what follows.
 *Day 30 is a hard gate. If ASTRAL and PULSAR are not deployed and T0 accounts are not confirmed as MFA-enforced by day 30, the engagement has a resourcing or access problem that must be resolved before proceeding.*
 ---
 ### Day 90: Kill Chain Closed
 *Everything from Day 30, plus:*
 | # | Deliverable | Verified by |
 |---|-------------|-------------|
 | 8 | **MFA enforced for all users** — not just enrolled; enforced via Conditional Access policy | CA sign-in logs: zero successful authentications without MFA for in-scope users |
 | 9 | **Legacy authentication blocked tenant-wide** | CAExporter export + sign-in logs: zero legacy auth sign-ins in past 7 days |
 | 10 | **Conditional Access baseline deployed** — device compliance, sign-in risk, location policies active and tested | CA policy set exported by CAExporter; test sign-in matrix documented |
 | 11 | **P0 and P1 vulnerabilities closed** — from Day 30 attack surface report | Rescan confirming closure; residual items in risk register |
 | 12 | **AD attack paths reduced** — BloodHound before/after comparison showing measurable reduction in paths to Domain Admin | BloodHound report: path count comparison |
 | 13 | **Vendor remote access hardened** — time-bounded, MFA-required, session-recorded for all third-party access | Vendor access log showing new controls enforced |
 | 14 | **T0 backup integrity verified** — at least one successful restore per T0 system, timed and documented | Backup test report per T0 system |
 | 15 | **ASTRAL: first restore drill** — a rejected drift PR has triggered the restore pipeline; restore validated against a real change | ADO restore pipeline run log |
 | 16 | **PULSAR: top 5 alert rules operational** — rules written, test-triggered, runbooks drafted for each | Alert rule set visible; test trigger documented |
 **The Day 90 value**: Your kill chain is closed. MFA covers the entire organisation. The highest-risk attack paths are measurably reduced. Any incident from this point has a detection and response capability behind it — and your configuration is auditable back to day 1.
 ---
 ### Day 180: Operational Independence
 *Everything from Day 90, plus:*
 | # | Deliverable | Verified by |
 |---|-------------|-------------|
 | 17 | **Alert runbooks complete** — documented response procedure for every active PULSAR alert rule; escalation paths defined | Runbook set reviewed and signed off by client IT lead |
 | 18 | **Custom detection rules** — at least 3 rules written for client-specific TTPs identified in Phase 1 kill chain | Rules deployed; test-triggered and confirmed |
 | 19 | **Client IT lead operational independence** — client IT lead demonstrates ability to: review ASTRAL drift PRs, search PULSAR events, trigger and verify an alert rule | Live walkthrough completed without consultant prompting |
 | 20 | **Housekeeping stream running** — 3 consecutive monthly cycles completed; accounts resolved per cycle tracked | Queue status report showing 3 cycles; measurable reduction |
 | 21 | **Module completion packages delivered** — every runbook, script, configuration file, and detection rule in client's own repository | Repository contents confirmed; client confirms ownership |
 | 22 | **Risk register closure evidence** — before/after comparison for every risk addressed during the programme; residual risks documented | Risk register delivered and reviewed with executive sponsor |
 | 23 | **Retained capability scope agreed** — written scope for continuation: cadence, activities, named owner | Signed retained scope or explicit decision to defer |
 **The Day 180 value**: You are no longer dependent on us. The systems run, the detections fire, the housekeeping happens on schedule. What continues in the retained scope is enhancement — not maintenance of what we built.
 ---
 ### What These Milestones Assume
 These deliverables are based on a typical M365-primary environment with:
 - A named client lead with 30–40% availability
 - Access provisioned before kickoff (accounts, MFA, Global Admin for ASTRAL/PULSAR bootstrap)
 - An IT admin who can execute configuration changes with guidance
 - Weekly check-ins not cancelled
 **What shifts the timeline:**
 - Large or complex AD environments add 2–4 weeks to Day 90 work
 - High user count (500+) adds 2–4 weeks to MFA rollout change management
 - Constrained IT team availability is the single most common cause of slippage — budget for it honestly in the scope
 - OT environments: see the [Critical Infrastructure Adaptation](move-fast-and-fix-things.md#the-critical-infrastructure-adaptation); Day 90 timelines for network segmentation work are longer
 **What does not shift the Day 30 milestone:** ASTRAL and PULSAR deploy in hours. The Brownhat Diagnostic is 2 half-day workshops. T0 account hardening is 1–2 weeks of focused work. If Day 30 deliverables are not met, the bottleneck is access provisioning or client availability — both of which are addressable before kickoff.
 *For the business case and financial justification, see [Business Case Template](business-case-template.md).*
 *For board conversation guidance, see [C-Suite Conversation Guide](../core/c-suite-conversation-guide.md).*
@@ -26,295 +110,290 @@ This is not a three-year digital transformation. It is a **180-day strategic res
 ## For the Practitioner
-This playbook provides a **time-boxed, phase-gated roadmap** for transforming a fragile enterprise into an antifragile one. It is designed for immediate deployment in consulting engagements and can be adapted to organizational size, industry, and regulatory context.
+### What This Plan Is Not
-The plan is structured in **four phases**: Hygiene (30 days), Control (60 days), Sovereignty (90 days), and Antifragility (180 days). Each phase builds on the previous. Skipping phases creates the illusion of progress while leaving structural fragility intact.
+Before using this roadmap with a client, be honest about what it commits to.
-> **Core tenet**: Before any new purchase is discussed, exhaust the capabilities of existing tooling. See the [Zero-Budget Hardening Playbook](zero-budget-hardening.md) for the tactical expression of this principle.
+**Not a sprint.** The most common failure mode is treating security modernisation as a project that ends. It does not end. The 180-day programme establishes processes and capabilities that must run permanently. If the client does not have the internal resources to continue what we build, we need to have that conversation before we start.
 **Not a full audit.** Phase 1 does not produce a complete identity inventory, a comprehensive vulnerability assessment, or an exhaustive compliance gap analysis. It produces a kill chain map and enough visibility to close existential risks. The full audit takes months and tends to produce reports that paralyse rather than mobilise.
 **Not compatible with staff paralysis.** Organisations dealing with active incidents, leadership changes, or major concurrent projects cannot execute this plan on the stated timeline. The timeline is predicated on a named client lead with 30–40% availability and access provisioned before day 1.
 **Not vendor-agnostic in execution.** The plan references Microsoft 365 environments as the primary context because that is most clients' reality. Non-Microsoft environments follow the same logic but require different specific tools. See the Platform Adaptation appendix in [Modular Engagements](../core/modular-engagements.md).
 ---
-## Phase 1: Hygiene (Days 0–30)
+## Phase 1: Visibility (Days 0–60)
-**Theme**: *You cannot defend what you cannot see.*
+**Theme**: *You cannot defend what you cannot see. You cannot fix what you cannot prioritise.*
-The first 30 days are aggressive, disruptive, and non-negotiable. The goal is not perfection; it is **visibility**. Every unknown identity, unmapped dependency, and unmonitored access path is a latent failure waiting to happen.
+The first 60 days are about **kill chain mapping and critical visibility** — not about fixing everything. The goal is a clear, ranked picture of what would end the organisation, and initial closure of the most accessible existential gaps.
-### Week 1-2: Identity and Access Blitz
+> **Why 60 days, not 30**: A 30-day identity blitz sounds fast. It is also the fastest path to disabling a service account that runs payroll at 2 AM on Friday. Week 1 is documentation and baseline. Fixes require understanding the environment first. See the engagement model's week 1 discipline — it applies to every phase of this plan.
-**Tool strategy**: Use existing AD / Entra ID / IAM. No new purchases.
+### Weeks 1–2: Baseline and Kill Chain Mapping
-| Action | Owner | Deliverable | Existing Tool Leverage |
+**No changes in week 1.** Document and understand.
 |--------|-------|-------------|------------------------|
 | Aggressive identity audit | IAM / Security | Complete inventory of all human and non-human identities | ADUC, Entra ID portal, AWS IAM console |
 | Disable all unknown / unused accounts | IAM | List of disabled accounts with business justification for exceptions | Existing IAM + PowerShell / CLI scripts |
 | Rotate all critical passwords and shared secrets | Security Ops | Rotation log with verification | Existing IAM + LAPS (free from Microsoft) |
 | Target: admin accounts, service accounts, krbtgt equivalents | AD / Cloud IAM | Documentation of every privileged account | Existing directory services |
 | Implement password hygiene (minimum: audit) | IAM | Baseline report on password policy compliance | Native password policies + audit logs |
-### Week 2-3: Perimeter and Communication Mapping
+| Action | Owner | Deliverable |
 |--------|-------|-------------|
 | Export current identity state: all accounts, groups, privilege assignments | IAM / Security | Identity inventory — stale, active, privileged, service |
 | Run BloodHound collection; run Elysium password audit | Security | AD attack path map; compromised credential list |
 | Run CAExporter for Conditional Access documentation | Security | Human-readable CA policy register with gaps highlighted |
 | Deploy ASTRAL for M365 configuration baseline | Security | Committed tenant baseline; first drift detection operational |
 | Map all public-facing assets | Security | External attack surface register with P0 classification |
 | Identify the kill chain: shortest path from "nothing bad" to "organisation fails" | Security Architect | Kill chain document — maximum 2 pages; reviewed with executive sponsor |
-**Tool strategy**: Use native firewall management, open-source scanners, and manual audit before purchasing new NDR/VM platforms.
+### Weeks 3–4: T0 Identity Hardening
-| Action | Owner | Deliverable | Existing Tool Leverage |
+Target: privileged accounts only. Not all accounts.
 |--------|-------|-------------|------------------------|
 | Audit all vendor / supplier access paths | Security / Procurement | Inventory of VPN, RDP, Citrix, SSH, FTP, SCP, API keys | Existing IAM, VPN logs, firewall logs |
 | Review and document firewall rules | Network Team | Rule set with business justification for each | Native firewall management interfaces |
 | Map public-facing assets from external perspective | Security | Attack surface report with P0 classification | Free/open-source: Shodan, certificate transparency logs, nmap |
 | Implement aggressive vulnerability scanning | Security | Weekly scan results with trending | Existing scanner, Microsoft Defender Vulnerability Management, or OpenVAS |
-### Week 3-4: Visibility and Monitoring Baseline
+| Action | Owner | Deliverable |
 |--------|-------|-------------|
 | Force-reset accounts identified as compromised by Elysium (P0) | IAM | Password reset log with verification |
 | Enforce MFA on all T0 accounts: Global Admins, Domain Admins, backup admins, service principals with high privilege | IAM | MFA coverage report for T0 accounts |
 | Review and disable accounts that are clearly orphaned: departed employees confirmed by HR | IAM | Disable log — only accounts with confirmed ownership resolution |
 | Rotate KRBTGT and critical service account passwords | AD | Rotation log; tested without service disruption |
 | Review and remove direct Global Admin assignments; move toward PIM or named individual accounts | IAM | Privilege assignment review |
-**Tool strategy**: Maximize existing EDR/SIEM before considering new platforms. A spreadsheet CMDB is infinitely better than no CMDB.
+> **What we do not do in weeks 3–4**: We do not attempt to disable all unknown accounts. We do not attempt to resolve all service account ownership. We do not attempt to achieve 100% MFA on all users. These are Phase 2 activities, started after the kill chain is closed and the environment is understood.
-| Action | Owner | Deliverable | Existing Tool Leverage |
+### Weeks 5–6: Logging, Perimeter, and Critical Asset Inventory
-|--------|-------|-------------|------------------------|
+
-| Deploy endpoint detection on all managed devices | SOC / MDE | Coverage report: % of estate monitored | Existing EDR (Defender, CrowdStrike, SentinelOne) |
+| Action | Owner | Deliverable |
-| Establish log aggregation for critical systems | Security | Centralized logging for T0 and T1 assets | Existing SIEM, syslog server, or cloud native logging (Sentinel, CloudWatch, Cloud Logging) |
+|--------|-------|-------------|
-| Create initial CMDB seed for critical systems | IT / Security | CMDB populated with crown jewels | Existing ITAM, ServiceNow, or spreadsheet |
+| Deploy PULSAR for M365 audit log ingestion | Security | Audit events ingested; watermarks established; search operational |
-| Document "kill chain": shortest path to organizational failure | Security Architect | Threat model and mitigation map | Manual analysis + stakeholder interviews |
+| Enable logging for T0 systems where it is missing | Security | Logging coverage report for T0/T1 assets |
 | Audit all vendor and third-party remote access paths | Security / Procurement | Vendor access inventory with remove/restrict list |
 | Scan public-facing assets for critical CVEs | Security | Prioritised findings: P0 (internet-facing, critical CVE), P1, P2 |
 | Seed CMDB with T0 assets | IT / Security | T0 asset register with ownership, backup status, recovery procedure |
 | Validate backup integrity for T0 assets | Backup Admin | Backup test report — at least one successful restore per T0 system |
 ### Weeks 7–8: Kill Chain Closure and Phase 1 Wrap
 | Action | Owner | Deliverable |
 |--------|-------|-------------|
 | Close P0 vulnerabilities identified in week 5–6 scan | Security | Remediation log with verification |
 | Restrict or close the highest-risk vendor access paths | Security / Procurement | Vendor access changes confirmed |
 | Implement basic network segmentation between IT and OT (if applicable) | Network / OT | Segmentation policy; validated firewall rules |
 | Phase 1 review: re-run BloodHound and Elysium against week 1 baseline | Security | Before/after comparison; revised kill chain assessment |
 | Establish findings backlog: all Phase 1 findings entered with priority, owner, target date | IAM / Security | [Findings Backlog](../assessment-templates/findings-backlog.md) populated; named owner; monthly housekeeping cadence confirmed |
 ### Phase 1 Exit Criteria
- [ ] 100% of identities known and validated
+- [ ] Kill chain documented and reviewed with executive sponsor
- [ ] 100% of privileged access reviewed
+- [ ] T0 accounts: MFA enforced, privilege reviewed, compromised credentials reset
- [ ] All public-facing assets identified and scanned
+- [ ] P0 vulnerabilities (internet-facing, critical CVE) closed
- [ ] Centralized logging operational for critical systems
+- [ ] ASTRAL deployed; M365 baseline committed
- [ ] CMDB seeded with T0/T1 assets
+- [ ] PULSAR deployed; M365 audit logs ingesting
- [ ] Initial "kill chain" documented
+- [ ] T0 asset CMDB complete with backup integrity verified
 - [ ] Vendor access inventory complete; highest-risk paths closed
 - [ ] Housekeeping stream established: named owner, cadence, populated queue
-### Phase 1 Mantra
+**What "complete" does not mean at day 60**: All identities validated. All shared accounts eliminated. MFA on 100% of users. Zero legacy protocols. These are legitimate targets — they belong in the housekeeping queue and Phase 2 work, tracked, resourced, and given realistic timescales.
 > *"Do not be afraid to break things temporarily. Disable first, justify second. Visibility before permission."*
 ---
-## Phase 2: Control (Days 30–60)
+## Phase 2: Control (Days 60–120)
-**Theme**: *What we have seen, we must now contain.*
+**Theme**: *Close the kill chain. Build on what is understood, not what is assumed.*
-With visibility established, the next 30 days focus on **closing the highest-risk gaps** without introducing operational paralysis. This is the phase of quick wins and surface reduction.
+Phase 2 takes the kill chain map from Phase 1 and systematically closes the structural gaps. The work is less about discovery and more about verified remediation with proper change management.
-### Week 5-6: Attack Surface Reduction (ASR)
+### Weeks 9–10: MFA and Identity Hardening (Broad Rollout)
-**Tool strategy**: ASR rules and PAWs are native Microsoft capabilities. For non-Microsoft environments, use existing endpoint management.
+Phase 1 hardened T0. Phase 2 extends to all users — with proper change management.
-| Action | Owner | Deliverable | Existing Tool Leverage |
+| Action | Owner | Deliverable |
-|--------|-------|-------------|------------------------|
+|--------|-------|-------------|
-| Eliminate shared accounts where possible | IAM | Reduction metric: % of shared accounts decommissioned | Existing IAM + access review process |
+| Enforce MFA on all remote access: not just T0, but all users | IAM | MFA coverage report (% of users) — target 100% enforced, not just enrolled |
-| Implement Attack Surface Reduction rules on endpoints | Endpoint Security | ASR policy deployed and compliance measured | Microsoft Defender ASR (already owned in E3/E5) |
+| Block legacy authentication protocols tenant-wide | IAM | Legacy auth block confirmed via CAExporter and sign-in log review |
-| Harden admin access: dedicated PAWs, no browsing, no email | Security | PAW architecture documented and deployed | Existing Windows / Intune / GPO |
+| Deploy Conditional Access baseline: device compliance, location, sign-in risk | IAM | CA policy set deployed and tested; rollback documented |
-| Review and minimize permissions across all platforms | IAM / App Owners | Permission matrix with least-privilege gaps identified | Native IAM interfaces + scripts |
+| Continue housekeeping queue: first monthly cycle | IAM | Accounts resolved this cycle; queue status report |
-### Week 6-7: Network and DNS Security
+> **Change management is the constraint here, not technical complexity.** MFA rollout for 500 users requires helpdesk preparation, communication, exception handling, and at minimum two weeks of lead time. Scope this honestly. A rollout that generates 200 support tickets and forces an exception for the CEO because his phone broke is a rollout that gets walked back.
-**Tool strategy**: Use existing DNS infrastructure, firewall segmentation, and open-source sensors (Zeek/Suricata) before buying NDR.
+### Weeks 11–12: Attack Surface Reduction
-| Action | Owner | Deliverable | Existing Tool Leverage |
+| Action | Owner | Deliverable |
-|--------|-------|-------------|------------------------|
+|--------|-------|-------------|
-| Deploy DNS security (filtering, logging, anomaly detection) | Network | DNS security coverage report | Existing DNS infrastructure, Quad9/Cloudflare free tiers, Microsoft DNS security |
+| Deploy Intune compliance policies; enforce device compliance in CA | Endpoint / IAM | Compliance policy set; non-compliant device access blocked |
-| Segment IT/OT networks where they intersect | Network / OT | Network segmentation diagram and policy | Existing firewalls and VLANs |
+| Harden admin access: dedicated admin accounts, PAW where feasible | Security | Admin account architecture; PAW deployed for T0 admins |
-| Deploy network sensors at critical boundaries | SOC | Sensor coverage map with alerting validated | Zeek or Suricata (open-source) or existing IDS/IPS |
+| Implement ASR rules on all managed endpoints | Endpoint Security | ASR policy deployed; compliance measured |
 | Review and remove excessive application permissions (OAuth grants, service principals) | IAM | App permission audit; high-risk grants reviewed and reduced |
-### Week 7-8: Multi-Factor Authentication and Conditional Access
+### Weeks 13–14: Network Hardening and Vendor Governance
-**Tool strategy**: MFA and conditional access are native capabilities of Entra ID, Okta, and cloud IAM. No additional purchase required.
+| Action | Owner | Deliverable |
 |--------|-------|-------------|
 | Implement DNS security: filtering and logging | Network | DNS security coverage report |
 | Harden vendor remote access: time-bounded, MFA, session recording | Security / Procurement | Vendor access gateway operational; access policy enforced |
 | Patch P1 vulnerabilities from Phase 1 scan | Security | Remediation log; rescan confirming closure |
 | Establish change window discipline: all production changes through approved process | IT / Security | Change management process documented and operational |
-| Action | Owner | Deliverable | Existing Tool Leverage |
+### Weeks 15–16: Verification and Phase 2 Wrap
-|--------|-------|-------------|------------------------|
+
-| Enforce MFA on all remote access paths | IAM | MFA coverage: 100% of remote access | Entra ID, Okta, Duo, or native cloud IAM MFA |
+| Action | Owner | Deliverable |
-| Implement conditional access policies | IAM / Cloud | Policy set: device compliance, location, risk score | Entra ID Conditional Access, AWS IAM, GCP IAM |
+|--------|-------|-------------|
-| Review and harden M365 / Google Workspace security | Cloud Team | Cloud security posture report | Microsoft Secure Score, Google Security Health Analytics |
+| Re-run BloodHound, Elysium, and CAExporter against Phase 1 baseline | Security | Attack path reduction report; before/after metrics |
 | Run Purple Knight / E8-CAT against AD and M365 | Security | Security score comparison; residual findings list |
 | Review ASTRAL drift log for Phase 1–2 period | Security | Configuration change audit; unauthorised drift incidents |
 | Review PULSAR audit log: anomalous events flagged, investigated, resolved | Security | Audit review report |
 | Update risk register: what Phase 1–2 closed, what remains open, what Phase 3 addresses | Security | Updated risk register signed off by client lead |
 | Housekeeping queue: second monthly cycle | IAM | Queue status; cumulative accounts resolved |
 ### Phase 2 Exit Criteria
- [ ] Shared accounts reduced by minimum 50%
+- [ ] MFA enforced for 100% of users (not just enrolled — enforced via CA policy)
 - [ ] Legacy authentication blocked tenant-wide
 - [ ] CA baseline deployed and tested
 - [ ] ASR rules active on all managed endpoints
- [ ] MFA enforced on 100% of remote and privileged access
+- [ ] P1 vulnerabilities from Phase 1 scan closed
- [ ] DNS security operational
+- [ ] Vendor remote access hardened and inventoried
- [ ] Network segmentation policy defined and initial segments implemented
+- [ ] Attack path reduction measurable against Phase 1 BloodHound baseline
- [ ] Conditional access policies active for cloud workloads
+- [ ] Housekeeping queue running; two cycles completed
 ### Phase 2 Mantra
 > *"The goal is not to block everything. It is to ensure that every allowed path is known, justified, and monitored."*
 ---
-## Phase 3: Sovereignty (Days 60–90)
+## Phase 3: Signal and Retained Capability (Days 120–180)
-**Theme**: *Reclaim what should never have been rented.*
+**Theme**: *Build detection on the hardened foundation. Build the capability to sustain what was built.*
-This is where the antifragile approach diverges sharply from conventional hardening. The focus shifts from defending the perimeter to **owning the intelligence** that drives the organization.
+Phase 3 starts only after Phase 2 exit criteria are met. Detection engineering on an unhardened environment is waste — the signal-to-noise ratio is too low to produce actionable intelligence.
-### Week 9-10: AI Sovereignty Assessment
+> **Why not AI Sovereignty in Phase 3**: AI sovereignty — local models, owned inference infrastructure, sovereign cognitive capability — is a multi-year programme, not a 30-day sprint. Hardware procurement alone typically takes 6–12 weeks. Claiming it as a Phase 3 deliverable sets up the engagement to fail. AI sovereignty begins with the audit work in Phase 1 (AI usage inventory, classification, assessment of vendor terms) and continues as a separate parallel initiative. The Azure OpenAI Sovereignty Bridge is the appropriate near-term stepping stone. See [AI Sovereignty Framework](../core/ai-sovereignty-framework.md) and [Azure OpenAI Sovereignty Bridge](../core/azure-openai-sovereignty-bridge.md).
-**Tool strategy**: Discovery requires interviews and proxy log analysis. No purchase needed for assessment.
+### Weeks 17–18: Detection Engineering Foundation
-| Action | Owner | Deliverable | Existing Tool Leverage |
+| Action | Owner | Deliverable |
-|--------|-------|-------------|------------------------|
+|--------|-------|-------------|
-| Inventory all AI usage: approved and shadow | Security / AI Lead | AI usage map with data classification | Proxy logs, SaaS billing review, employee interviews |
+| Write initial PULSAR alert rules: CA policy changes, new Global Admin assignments, bulk mailbox export, app permission grants outside change window | Security | Alert rule set deployed; test-triggered and validated |
-| Classify AI workloads by sovereignty requirement | Security Architect | T0/T1/T2 AI asset classification | Existing data classification framework |
+| Review SIEM coverage: which T0 events generate alerts, which do not | Security | Detection coverage map against MITRE ATT&CK top 10 for M365 |
-| Identify highest-value local AI pilot candidate | AI Lead / Business | Pilot scope document with success criteria | Business stakeholder interviews |
+| Tune ASTRAL rolling PRs: configure reviewer notification, test reject/restore flow | Security | ASTRAL review workflow operational; first restore test completed |
-| Assess vendor AI terms: data usage, training, termination | Legal / Security | Risk register for each AI provider | Legal review of existing contracts |
+| Establish alert response runbooks: who gets notified, what they do, what they escalate | Security / Client Lead | Runbooks for top 5 alert types |
-### Week 10-11: Local AI Infrastructure Deployment
+### Weeks 19–20: Endpoint and Identity Detection
-**Tool strategy**: Start with existing hardware or low-cost sovereign cloud. Use open-source inference servers (Ollama, vLLM, llama.cpp).
+| Action | Owner | Deliverable |
 |--------|-------|-------------|
 | Deploy Wazuh or verify existing EDR coverage for on-premise systems | Security | Endpoint detection coverage report |
 | Write custom detection rules for kill chain-specific TTPs identified in Phase 1 | Security | Custom rule set tuned to client environment |
 | Establish weekly threat review cadence: PULSAR event summary + ASTRAL drift review | Security / Client Lead | First weekly review completed; format agreed |
 | AI usage audit: classify current AI workflows by data sensitivity and vendor agreement | Security / Legal | AI usage register; high-risk workflows flagged for remediation |
-| Action | Owner | Deliverable | Existing / Low-Cost Tool Leverage |
+### Weeks 21–24: Knowledge Transfer and Handover
 |--------|-------|-------------|----------------------------------|
 | Deploy local inference infrastructure (on-prem or sovereign cloud) | Infrastructure | Operational inference cluster | Underutilized servers, retired workstations, or sovereign cloud VM |
 | Establish model versioning and artifact management | MLOps / Security | Model registry with provenance tracking | Git + DVC or simple artifact storage |
 | Implement access controls for model weights and training data | Security | T0-class protection for AI assets | Existing file servers, encryption, IAM |
 | Deploy initial pilot: RAG or fine-tuned model on proprietary data | AI Team | Working pilot with performance baseline | Ollama, llama.cpp, or vLLM (open-source) + quantized open models |
-### Week 11-12: Backup, Recovery, and Validation
+The most important deliverable of Phase 3 is **the client's ability to operate everything without us.**
-**Tool strategy**: Use existing backup and DR infrastructure. The goal is to test and document, not to buy.
+| Action | Owner | Deliverable |
-
+|--------|-------|-------------|
-| Action | Owner | Deliverable | Existing Tool Leverage |
+| Runbook completion: every system built or modified has an operating runbook | Security / Client Team | Runbook set reviewed and signed off by client IT lead |
-|--------|-------|-------------|------------------------|
+| Client training: ASTRAL drift review workflow, PULSAR event search, alert response | Security | Training delivered; client IT lead can demonstrate competency |
-| Perform full recovery drill of one critical system from backup | IT / Security | Recovery time documented, gaps identified | Existing backup solution |
+| Housekeeping queue: third and fourth monthly cycles | IAM | Queue status; cumulative resolution metrics |
-| Validate backup integrity for all T0 assets | Backup Admin | Integrity report with sample restorations | Existing backup solution + integrity scripts |
+| Document what was built: configuration baseline document for every module | Security | Module completion package delivered |
-| Test local AI pilot under degraded network conditions | AI / Infrastructure | Resilience validation report | Existing network infrastructure + manual testing |
+| Phase 3 review: risk register update, metrics summary, Phase 4 / retained capability recommendation | Security | Final 180-day programme review with executive sponsor |
 | Document and exercise incident response for AI-specific threats | SOC / Security | Runbook: model poisoning, data exfiltration, adversarial input | Existing IR framework + internal knowledge |
 ### Phase 3 Exit Criteria
- [ ] All AI usage inventoried and classified
+- [ ] PULSAR alert rules operational for top 5 M365 risk scenarios
- [ ] Local inference infrastructure operational
+- [ ] ASTRAL drift review workflow operational; first restore tested
- [ ] One high-value AI pilot deployed and measured
+- [ ] Custom detection rules written for client-specific TTPs
- [ ] T0 protection applied to model weights and training data
+- [ ] Weekly threat review cadence established and running
- [ ] Critical system recovery drill completed successfully
+- [ ] All runbooks completed and signed off by client IT lead
- [ ] AI-specific incident response runbook created
+- [ ] Client IT lead can operate ASTRAL and PULSAR without consultant support
-
+- [ ] AI usage registered and high-risk workflows flagged
-### Phase 3 Mantra
+- [ ] Housekeeping queue: four consecutive cycles completed
 > *"We are moving from being consumers of intelligence to manufacturers of our own. The vault is built; now we fill it."*
 ---
-## Phase 4: Antifragility (Days 90–180)
+## Phase 4: Antifragility (Ongoing)
-**Theme**: *Build systems that grow stronger from disruption.*
+**Theme**: *The programme does not end. The organisation learns faster from disruption than competitors do.*
-The final phase converts the hardened foundation into an adaptive, learning organization. This is where antifragility becomes operational reality.
+Phase 4 is not a 30-day sprint. It is an ongoing operational posture. The 180-day programme establishes the foundation; Phase 4 is what happens when that foundation is maintained and extended over months and years.
-### Month 4: Structural Decoupling and Optionality
+**Phase 4 activities** (initiated at 180 days; sustained indefinitely):
-**Tool strategy**: Documentation, architecture, and open-source chaos tools (Chaos Mesh, Gremlin free tier, custom scripts). Work, not purchases.
+- **Retained capability**: Monthly ASTRAL drift review, PULSAR event summaries, quarterly Elysium/BloodHound scans, housekeeping queue advancement
 - **Detection engineering**: Progressive extension of alert rule coverage; tuning based on real events; quarterly rule review
 - **Structural improvement**: Exit architectures for vendor dependencies, progressive elimination of legacy systems, planned OT technology refresh
 - **Chaos engineering**: Controlled failure exercises — starting with non-production, progressing to production once detection and recovery capability is confirmed
 - **Red team exercises**: Annual structured adversarial testing — not before Phase 2 is complete and detection is operational
 - **AI sovereignty programme**: Local inference infrastructure, where justified by workload and capability; AURORA deployment for M365 governance intelligence; sovereign AI as a parallel multi-year initiative
 - **Greenfield capability building**: Configuration as code for all managed systems; tested migration procedures; documented rebuild path
-| Action | Owner | Deliverable | Existing / Free Tool Leverage |
+**What makes Phase 4 real**: A named person who owns the housekeeping queue. A calendar-blocked weekly threat review. A quarterly retained capability scope. Without these, Phase 4 does not happen — and everything built in 180 days begins to rot.
 |--------|-------|-------------|------------------------------|
 | Document exit architecture for all major platform dependencies | Enterprise Architecture | 90-day exit plan per critical vendor | Architecture documentation, existing runbooks |
 | Implement abstraction layers for proprietary integrations | Engineering | Interface documentation and migration test | Existing development tools and frameworks |
 | Establish dual-vendor readiness for one critical category | Procurement / Engineering | Technical proof of capability | Existing engineering capacity, open standards |
 | Deploy chaos engineering: simulate critical dependency failure | Resilience Team | Chaos experiment report with findings | Chaos Mesh (open-source), custom scripts, Gremlin free tier |
 ### Month 5: Stress-to-Signal Conversion
 **Tool strategy**: Process and culture changes require no licensing. Use existing EDR/SIEM for detection validation.
 | Action | Owner | Deliverable | Existing Tool Leverage |
 |--------|-------|-------------|------------------------|
 | Implement blameless post-mortem process with structural mandates | Culture / Security | Post-mortem template and governance | Existing collaboration tools (Confluence, SharePoint, Notion) |
 | Deploy production chaos engineering with automated rollback | Resilience Team | Monthly chaos experiment schedule | Existing orchestration + open-source chaos tools |
 | Create feedback loop: incident findings → architecture changes | Security Architect | Closed-loop metrics: mean time to structural fix | Existing ticketing system (Jira, ServiceNow) |
 | Launch "red team as a service": continuous adversarial testing | Security | Monthly red team report | Internal team + existing EDR/SIEM for detection validation |
 ### Month 6: Defensive AI and Continuous Modernisation
 **Tool strategy**: Defensive AI runs on the local inference infrastructure already deployed. Posture measurement uses existing APIs and open-source dashboards.
 | Action | Owner | Deliverable | Existing / Low-Cost Tool Leverage |
 |--------|-------|-------------|----------------------------------|
 | Expand local AI to defensive use cases: anomaly detection, code review, vulnerability prioritization | AI / Security | Defensive AI capability map | Local AI cluster deployed in Phase 3 |
 | Implement automated security posture measurement | Security | Continuous compliance dashboard | Existing APIs (Microsoft Graph, AWS APIs) + Grafana or open-source dashboard |
 | Evaluate and migrate additional AI workloads to local infrastructure | AI Lead | Migration roadmap with quarterly targets | Local AI infrastructure + business case templates |
 | Conduct first antifragility maturity assessment | Consultant / Security | Baseline maturity score with gap analysis | Spreadsheet or existing GRC tool |
 | Pilot organizational integration: embed security in one product team | Consultant / Engineering | Shift-left pilot metrics | Existing team structure + collaboration tools |
 | **Deploy AI-assisted TVM operationalization** | AI / Security | AI TVM dashboard; <48h critical CVE response | Defender Exposure Management + Azure OpenAI or local LLM; see [AI-Assisted TVM Blueprint](ai-assisted-tvm.md) |
 ### Phase 4 Exit Criteria
 - [ ] Exit architectures documented for top 5 vendor dependencies
 - [ ] Chaos engineering operational in production
 - [ ] Mean time to structural fix < 14 days from incident
 - [ ] Defensive AI pilot operational
 - [ ] First antifragility maturity assessment completed
 - [ ] Quarterly antifragility review calendar established
 ### Phase 4 Mantra
 > *"We do not want fewer incidents. We want incidents that teach us something we could not have learned any other way."*
 ---
 ## Governance and Cadence
 ### Weekly Steering Committee
 - Review blockers and escalations
 - Validate phase exit criteria
 - Adjust scope based on organizational readiness
 ### Monthly Board Update
 - Risk reduction metrics
 - Antifragility maturity trend
 - Investment vs. risk-exposure reduction
 - Strategic narrative: "This is not a cost centre; it is optionality insurance"
 ### Quarterly Retrospective
 - What failed that taught us something?
 - What assumptions have been invalidated?
 - What new dependencies have emerged?
 - What can be simplified or removed?
 ---
 ## Success Metrics
-| Dimension | Metric | Target |
+| Dimension | Metric | Realistic Target |
-|-----------|--------|--------|
+|-----------|--------|-----------------|
-| **Visibility** | % of assets in CMDB | 100% of T0/T1 within 30 days |
+| **Kill chain** | Kill chain nodes closed | 100% of P0 nodes closed by day 120 |
-| **Control** | Mean time to contain new identity | < 1 hour |
+| **Identity** | MFA enforcement on privileged accounts | 100% of T0 accounts by day 60; 100% of all accounts by day 120 |
-| **Sovereignty** | % of proprietary AI workloads local | 100% of T0-class within 90 days |
+| **Configuration** | ASTRAL drift detected and reviewed | Weekly; 100% of unauthorised drift investigated within 48h |
-| **Resilience** | Recovery time for critical system | < 4 hours |
+| **Audit trail** | PULSAR retention operational | 12+ months of M365 audit events retained by day 60 |
-| **Learning** | Structural fixes per incident | ≥ 1 |
+| **Housekeeping** | Stale accounts resolved per quarter | Measurable queue reduction each cycle; not a fixed % target |
-| **Optionality** | Vendor dependencies without exit plan | 0 |
+| **Recovery** | T0 system recovery test completed | At least one per T0 system within 180 days |
 | **Handover** | Client IT lead operational independence | All built systems operable without consultant by day 180 |
 > **On metrics and honesty**: Avoid targets that sound like achievements but are not verifiable. "100% of identities validated" cannot be verified in 180 days in any organisation with meaningful history. "All T0 accounts with MFA enforced and verified via CA sign-in logs" is verifiable. Write metrics you can prove, not metrics that sound ambitious.
 ---
 ## Governance and Cadence
 ### Weekly Check-In (30 minutes, every week)
 - Change log review: what was completed, what is blocked
 - Client decisions required this week
 - Risks and open items
 *If this meeting is consistently cancelled by the client, the engagement pauses until it resumes.*
 ### Monthly Steering Committee (60 minutes)
 - Phase progress against exit criteria
 - Risk register review
 - Housekeeping queue status
 - Budget and scope review
 - Next phase / retained capability planning
 ### Phase Gate Reviews (Days 60, 120, 180)
 Hard go/no-go decisions. Not formalities. If phase exit criteria are not met, the programme does not advance — it addresses the gaps.
 ---
 ## Adaptation Guide
-### Small Organizations (< 100 employees)
+### Small Organisations (< 100 employees)
- Compress Phases 1-2 into 30 days
+- Phase 1 focus: kill chain, T0 accounts, ASTRAL/PULSAR deployment. Skip broad identity audit — it is not necessary for small populations.
- Use managed sovereign cloud for local AI instead of on-premises hardware
+- Phase 2 focus: MFA for all users (achievable quickly at small scale), basic CA, device compliance.
- Focus on identity, backup, and one high-value AI pilot
+- Phase 3 focus: runbooks and handover. Detection engineering is proportional to environment complexity.
- Leverage Microsoft Business Premium or Google Workspace security features fully before any additional purchase
+- **Do not compress the timeline further.** The bottleneck at small organisations is almost always IT resource availability and change management, not technical complexity.
 ### Regulated Industries (Finance, Healthcare, Critical Infrastructure)
- Extend Phase 1 to 45 days for compliance mapping
+- Extend Phase 1 to 90 days where regulatory mapping and OT inventory are required.
- Integrate regulatory requirements into T0 classification
+- Add compliance validation gates at each phase exit — specific evidence requirements for NIS2/DORA/GDPR.
- Add compliance validation gates at each phase exit
+- The housekeeping stream is non-negotiable for regulators who require demonstrable continuous control.
-### Highly Distributed Organizations
+### Organisations with Heavy Technical Debt
- Prioritize network segmentation and DNS security in Phase 1
+- Accept explicitly, in writing, that 20 years of debt will not be cleared in 180 days.
- Deploy edge inference nodes in Phase 3 instead of central cluster
+- Phase 1 focus is kill chain only. The full debt picture goes into the housekeeping queue and the Phase 4 backlog.
- Emphasize operational resilience and disconnected operations
+- The rapid modernisation plan addresses existential risk. The housekeeping stream addresses accumulated risk over time. Both are necessary; neither replaces the other.
 - Adjust Phase 2 exit criteria to reflect the realistic pace of MFA rollout in high-debt environments — legacy systems often require extended exception handling.
-### Organizations with Heavy Technical Debt
+### OT/Critical Infrastructure Environments
- Accept that 20 years of debt cannot be cleared in 180 days
+- Phase 1 must include OT asset inventory and IT/OT connection map.
- Use defensive AI in Phase 4 to accelerate debt identification and prioritization
+- Phase 2 segmentation work (IT/OT boundary) is the primary kill chain closure, not identity hardening.
- Focus on "kill chain" protection rather than comprehensive cleanup
+- See [Vertical: Power and Utilities](../reference/vertical-power-utilities.md) and the Critical Infrastructure Adaptation in [Move Fast and Fix Things](move-fast-and-fix-things.md#the-critical-infrastructure-adaptation).
 - Map every action to CIS IG1 to show standards alignment without additional framework investment
 ---
@@ -0,0 +1,334 @@
 # Sample Engagement: Mid-Market Hybrid Organisation
 > *This document is a calibration reference for consultants. It walks through a realistic engagement for a specific client profile from first contact through Day 180. Use it to calibrate your own scope estimates, find comparable findings for risk register entries, and understand what a complete engagement looks like for this type of organisation.*
 ---
 ## Client Profile: Nexus Operations s.r.o.
 **Fictional client. All details are representative of a real mid-market profile.**
 | Attribute | Detail |
 |-----------|--------|
 | **Size** | 500 employees, 10 IT/admin staff |
 | **Sector** | Professional services (management consulting + outsourced IT services) — NIS2 **important entity** under digital infrastructure provisions |
 | **Identity** | Active Directory (on-premises, single forest, two domains — legacy acquisition) + Entra ID (hybrid join, Azure AD Connect sync) |
 | **M365 licensing** | E3 — includes Entra ID P1 (Conditional Access), Defender for Endpoint Plan 1, Intune, Exchange Online, SharePoint, Teams. No E5 features: no PIM, no Defender for Identity, no Sentinel, no Purview advanced. |
 | **Endpoint management** | Intune deployed 18 months ago; ~70% Windows enrollment, ~30% macOS enrollment; no iOS/Android policy; Intune used primarily for app deployment, not compliance enforcement |
 | **Third-party tools** | Jira (cloud), GitHub (cloud, mix of org/personal accounts), Confluence (cloud), a legacy on-prem ERP (SAP), an on-prem file server (Windows Server 2016), a CRM (Salesforce), and approximately 12 other SaaS tools identified in procurement; shadow IT suspected |
 | **Infrastructure** | Three offices (Prague HQ, Brno, Warsaw); hybrid work standard; ~80 external contractors at any given time; site-to-site VPN between offices; split DNS; no SD-WAN |
 | **Current security** | No dedicated security tool beyond Defender AV. Microsoft Secure Score: 42%. No SIEM. No SOC. Previous pentest 2 years ago (report available). Previous ISO 27001 attempt abandoned 18 months ago. |
 | **NIS2 status** | In-scope as important entity; national transposition deadline passed; supervisory authority has sent initial questionnaire; response due in 90 days |
 | **Trigger** | NIS2 questionnaire received; CTO has seen the [Brownhat Diagnostic](../assessment-templates/nist-csf-baseline.md) approach referenced by a peer; CISO role vacant (they are looking) |
 ---
 ## Engagement Context
 ### Why They Called
 The NIS2 questionnaire is the proximate trigger but not the underlying problem. The CTO's real concern, surfaced in the discovery call: "We have been growing fast, the acquisition two years ago added a lot of mess, and I genuinely do not know what we would do if we had a serious incident. We have contractors everywhere and I am not sure all of them are properly offboarded when their engagement ends."
 This is a common and honest framing. The NIS2 deadline creates a compliance urgency, but the actual risk is operational — undocumented access, accumulated technical debt from the acquisition, and no detection capability.
 ### What the Discovery Call Revealed
 **The trigger question** ("What happened recently that made you call us?") produced: the NIS2 questionnaire, plus a near-miss three months ago — a contractor who had left six months previously used their still-active account to access a SharePoint site. Nobody noticed until the contractor themselves mentioned it to their former manager. No data exfiltration confirmed but not verified.
 **The accountability question**: Named IT lead is the senior sysadmin, Ondřej Blaha. CTO is the executive sponsor. CISO role vacant — the IT lead is acting as de facto security lead without the title or dedicated time.
 **The tools question**: E3 confirmed. Intune confirmed but underutilised. No SIEM. Previous pentest report available (2 years old). Defender AV on all Windows endpoints; coverage on macOS "mostly."
 **The success question**: "Pass the NIS2 questionnaire. Know that if something happens, we can respond. And if I hire a CISO in six months, I want there to be something to hand over."
 This is an excellent brief. Concrete, honest, achievable.
 ### What Disqualifies This Client?
 Nothing. All green lights:
 - Named executive sponsor with budget authority (CTO)
 - Named IT lead with operational access (Ondřej)
 - Real trigger with a deadline (NIS2 response in 90 days)
 - Honest assessment of current state
 - Realistic success criteria
 **One flag to manage**: The NIS2 questionnaire response is due in 90 days. This creates urgency that may pressure the client to skip the Brownhat Diagnostic and go straight to "give us a report for the regulator." Resist this. The diagnostic *is* the report — it produces evidence directly usable in the NIS2 response. Skipping it produces a worse outcome for both the client and the regulator.
 ---
 ## Brownhat Diagnostic Findings
 *What a competent two-day diagnostic would find in this environment. Presented as the consultant would present it to the CTO.*
 ### Kill Chain Assessment
 The shortest path from "nothing bad has happened yet" to "Nexus Operations cannot operate" runs through identity.
 ```
 Compromised contractor credential (still active after offboarding)
    → Access to M365 (no MFA enforced, or legacy auth bypasses MFA)
    → Access to SharePoint / Teams (all data)
    → Access to Exchange (all email, calendar, contacts)
    → Password spray against Entra ID → escalate to admin account
    → Domain Admin via Entra ID Connect sync account
    → Full AD compromise → all on-prem systems
    → ERP (SAP) → financial data, operational disruption
 ```
 This is not theoretical. The six-month-old contractor account near-miss is one credential spray away from the beginning of this chain.
 **Secondary kill chain** (on-prem):
 ```
 Internet-facing VPN endpoint (legacy firmware, no MFA)
    → Internal network access
    → Lateral movement via NTLM relay (expected: NTLM not disabled)
    → File server → ERP → AD
 ```
 ### Findings by Priority
 #### P0 — Kill Chain Nodes
 | ID | Finding | Evidence |
 |----|---------|----------|
 | P0-001 | **No MFA enforced for remote access or M365** | Entra ID sign-in logs show 34% of sign-ins in past 30 days without MFA; Conditional Access policies exist but are in Report-Only mode, never activated |
 | P0-002 | **Active contractor accounts: 23 confirmed stale** | Elysium identifies 23 accounts with last login > 90 days owned by contractors whose engagements are confirmed ended in HR system; 6 have been inactive for > 6 months |
 | P0-003 | **KRBTGT password never rotated** | Last rotation: 847 days (default since domain creation). Any Golden Ticket attack persists across credential resets until KRBTGT is rotated. |
 | P0-004 | **Azure AD Connect sync account has excessive privilege** | The sync service account has DCSync rights on the on-premises domain. Compromise of Entra ID admin → on-prem domain compromise via this account. |
 | P0-005 | **VPN endpoint: no MFA, outdated firmware** | Cisco ASA, firmware 18 months out of date; no MFA for VPN authentication; used by all contractors and remote employees |
 | P0-006 | **No tested backup restore** | Backups run nightly (confirmed); no restore has ever been tested; ERP backup destination is on the same network segment as the ERP server |
 #### P1 — Material Risk
 | ID | Finding | Evidence |
 |----|---------|----------|
 | P1-001 | **Legacy authentication not blocked** | Sign-in logs: 847 legacy auth attempts in past 30 days from 34 unique accounts; these bypass MFA regardless of CA policy |
 | P1-002 | **Domain Admins using workstations for email and browsing** | BloodHound: 4 of 5 Domain Admin accounts have interactive logon events from standard workstations; no PAW architecture |
 | P1-003 | **Service accounts: 31 with non-expiring passwords, 12 with unknown owners** | AD audit; 7 service accounts have Domain Admin-equivalent rights with no documented purpose |
 | P1-004 | **Intune compliance not enforced in Conditional Access** | Compliant device requirement is in CA policy but excluded for all users via the "AllUsers_ExceptionGroup" group containing 489 of 500 users |
 | P1-005 | **Third-party SaaS access not reviewed** | 12 known SaaS tools; Entra ID app registrations show 47 enterprise applications with consent grants; 11 have "Mail.ReadWrite" or equivalent scopes from unidentified sources |
 | P1-006 | **No MFA on GitHub** | GitHub org admin accounts without MFA enforced at org level; mix of personal and managed accounts; no SSO integration with Entra |
 | P1-007 | **SAP ERP on-prem: default admin credentials not changed on secondary instance** | Confirmed during document review of previous pentest report |
 | P1-008 | **No logging beyond M365 default 90-day retention** | No SIEM; no secondary log retention; M365 audit log at 90-day E3 default; ERP and file server logs local only, 30-day retention |
 #### P2 — Housekeeping Queue
 | ID | Finding |
 |----|---------|
 | P2-001 | NTLM not disabled; NTLMv1 still permitted in GPO |
 | P2-002 | Basic authentication still enabled for Exchange (in addition to legacy auth block needed above) |
 | P2-003 | 89 stale AD accounts (not contractors — former employees; some date to 2019) |
 | P2-004 | DNS records for 14 decommissioned services still exist |
 | P2-005 | Firewall ruleset last reviewed 3 years ago; 23 rules with "any/any" destination |
 | P2-006 | macOS endpoints: Defender coverage patchy; 31 devices not enrolled in Intune |
 | P2-007 | No documented vendor access procedure; contractors provisioned ad hoc |
 | P2-008 | Windows Server 2016 file server: extended support ends October 2026 |
 | P2-009 | Jira/Confluence: 67 former employee accounts still active |
 | P2-010 | SharePoint external sharing enabled globally with no policy; 14 sites have external links active |
 ### Quick Wins (Closeable Before Day 30)
 1. **Activate CA policies** — already in Report-Only; switch to Enabled. MFA enforcement for all sign-ins with zero new tooling. (2 hours)
 2. **Disable 23 confirmed stale contractor accounts** — HR-confirmed departures; disable immediately. (1 hour, needs HR sign-off already obtained)
 3. **Remove AllUsers_ExceptionGroup from CA compliance policy** — 489 users are excepted from device compliance for no documented reason. Remove the exception. (30 minutes)
 4. **Block legacy authentication** — CA policy for legacy auth block already exists in the tenant (Microsoft provides a template); activate it. Test first with sign-in log review. (4 hours including testing)
 5. **Enforce MFA on GitHub org** — Organisation setting, 2 minutes to enable; will force any admin without MFA to enrol at next login. (5 minutes)
 ---
 ## Module Recommendation and Rationale
 ### Recommended Sequence
 ```
 Brownhat Diagnostic + Quick Wins        (Weeks 1-4)
        ↓
 Module 2: M365 Identity Security        (Weeks 4-10)  ← Primary kill chain
        ↓
 Module 6: On-Premise AD Hardening       (Weeks 8-14)  ← Runs in parallel from week 8
        ↓
 Module 1: Endpoint Management           (Weeks 14-18) ← Hardens existing Intune
        ↓
 Module 7: Recovery & Resilience         (Weeks 16-20) ← Runs in parallel from week 16
 ```
 ### Rationale
 **Why Module 2 first**: The kill chain runs through identity. P0-001 (no MFA enforced), P0-002 (stale contractor accounts), and P1-001 (legacy auth) are all Module 2 work. These are also the fastest path to demonstrable NIS2 evidence — Article 21 explicitly requires MFA and access control measures.
 **Why Module 6 second, partially parallel**: P0-003 (KRBTGT rotation), P0-004 (AD Connect privilege), and P1-002 (Domain Admins on standard workstations) require AD access and change windows. This work can start in week 8 as Module 2 is closing — the identity team has already been engaged, the change management process is established.
 **Why Module 1 third, not first**: Intune is already deployed and roughly functional. It is not the kill chain. Hardening Intune (compliance policies, CA integration, full macOS enrollment) is important but secondary to closing the identity gaps. It belongs in Week 14 when identity work is complete.
 **Why Module 7 matters here**: The ERP backup (P0-006) is a kill chain node. Recovery and Resilience validates backup integrity and produces the restore test evidence that NIS2 business continuity requirements directly demand. Starting Module 7 in parallel with Module 1 from Week 16 gets this done within 180 days.
 **Not recommended in this engagement**:
 - Module 5 (AI Sovereignty Bridge): not in the kill chain; deferred to Phase 4
 - Module 10 (Red Team): requires a hardened foundation; schedule at 12 months post-engagement
 - Module 12 (Blue/Purple Team): requires detection infrastructure not yet deployed; follow-on engagement
 - Module 8 (OT): not applicable — no OT environment
 ---
 ## Day 30 / Day 90 / Day 180: This Specific Client
 ### Day 30 Deliverables
 | # | Deliverable | Nexus-specific detail |
 |---|-------------|----------------------|
 | 1 | Brownhat Diagnostic report | Kill chain documented (identity → AD → ERP); 5 quick wins; module roadmap |
 | 2 | ASTRAL deployed | Intune + Entra ID baseline committed; Azure DevOps project `ASTRAL-Nexus` created; drift detection live |
 | 3 | PULSAR deployed | M365 audit events ingesting; Ondřej confirmed as reviewer; Teams tab pinned in IT channel |
 | 4 | T0 accounts hardened | 3 Global Admins: MFA enforced, dedicated admin accounts separated from daily-use accounts |
 | 5 | Attack surface report | VPN endpoint flagged (P0-005); external-facing services enumerated |
 | 6 | Quick wins closed | CA policies activated; 23 contractor accounts disabled; legacy auth blocked; GitHub MFA enforced; Intune compliance exception removed |
 | 7 | Findings backlog opened | All diagnostic findings entered in ADO Work Items; Ondřej named as owner for P0/P1; CTO briefed on P0 count (6) and quick wins status |
 > **NIS2 value at Day 30**: The Brownhat Diagnostic report and the quick wins closure log constitute direct evidence for NIS2 Article 21 (access control, MFA, asset management). PULSAR starts accumulating the audit log retention the questionnaire will ask about.
 ---
 ### Day 90 Deliverables
 | # | Deliverable | Nexus-specific detail |
 |---|-------------|----------------------|
 | 8 | MFA for all users enforced | CA policy covering all 500 users; verified via sign-in logs; helpdesk prepared for exceptions (expected: ~15 users requiring assisted enrolment) |
 | 9 | Legacy auth blocked | Verified: zero legacy auth sign-ins in past 7 days in PULSAR |
 | 10 | CA baseline deployed | Device compliance required; location-based policies for Warsaw office (different risk profile); sign-in risk policy active |
 | 11 | P0 vulnerabilities closed | P0-002 (contractors) ✓ Day 30; P0-003 (KRBTGT) rotated with two-rotation process; P0-004 (AD Connect account) de-privileged; P0-005 (VPN MFA) enforced |
 | 12 | AD attack path reduction | BloodHound before/after: paths to Domain Admin reduced from 847 to <50; service accounts with Domain Admin rights reduced from 7 to 0 |
 | 13 | Vendor access hardened | Contractor provisioning procedure documented; offboarding checklist created and linked to HR process; Ondřej named as monthly reviewer |
 | 14 | T0 backup integrity | ERP backup tested and restored to isolated environment; restore time documented (target: <4 hours); backup destination moved off same network segment |
 | 15 | ASTRAL: first restore drill | Intentional test change made and restored via pipeline; process documented |
 | 16 | PULSAR: top 5 alert rules | CA policy modification; new Global Admin assignment; bulk mailbox export; new high-privilege app consent; VPN authentication failure spike |
 > **NIS2 value at Day 90**: MFA enforcement (Article 21c), access control and account management (Article 21i), audit log retention accumulating since Day 30 (Article 21j), backup integrity evidence (Article 21c business continuity). Sufficient to respond to the NIS2 questionnaire with evidence, not assertions.
 ---
 ### Day 180 Deliverables
 | # | Deliverable | Nexus-specific detail |
 |---|-------------|----------------------|
 | 17 | Alert runbooks | 5 PULSAR alert runbooks signed off by Ondřej; escalation path to CTO documented |
 | 18 | Custom detection rules | Contractor account creation outside HR-approved window; SAP admin login outside business hours; bulk SharePoint download |
 | 19 | Client independence | Ondřej completes live walkthrough: reviews ASTRAL PR, investigates a PULSAR event, resets a compromised Elysium-flagged account |
 | 20 | Housekeeping: 3 cycles | Cycles 1–3 completed; 67 Jira/Confluence accounts resolved; 89 stale AD accounts processed (disabled with justification per account); DNS cleanup in progress |
 | 21 | Module completion packages | Module 2, Module 6, Module 1 completion packages delivered to `nexus-security` ADO repository |
 | 22 | Risk register closure | Before/after comparison: P0 count 6 → 0; P1 count 8 → 2 (P1-007 SAP default credentials and P1-005 app consent review in housekeeping queue) |
 | 23 | Retained capability scope | Agreed quarterly scope: monthly ASTRAL drift review, quarterly BloodHound + Elysium run, PULSAR health check, housekeeping queue advancement |
 ---
 ## Findings Backlog — Initial Population
 *Pre-populated from the Brownhat Diagnostic. Consultants: adapt IDs and details to your actual findings.*
 **ADO Work Items project**: `ASTRAL-Nexus` (same project as ASTRAL deployment)
 **Owner**: Ondřej Blaha
 **Cadence**: Monthly housekeeping review, first Thursday of each month
 ### P0 — Kill Chain (all closed by Day 90)
 | ID | Finding | Source | Owner | Status | Target |
 |----|---------|--------|-------|--------|--------|
 | B-001 | No MFA enforced: 34% of sign-ins without MFA | Brownhat | Ondřej | **Closed** Day 30 | Day 30 |
 | B-002 | 23 stale contractor accounts with valid credentials | Elysium | Ondřej | **Closed** Day 30 | Day 30 |
 | B-003 | KRBTGT password 847 days old | BloodHound | Ondřej | **Closed** Day 75 | Day 60 |
 | B-004 | AD Connect sync account has DCSync rights | BloodHound | Ondřej | **Closed** Day 70 | Day 60 |
 | B-005 | VPN: no MFA, firmware 18 months outdated | Brownhat | Ondřej | **Closed** Day 80 | Day 90 |
 | B-006 | No tested ERP backup restore | Brownhat | Ondřej | **Closed** Day 85 | Day 90 |
 ### P1 — Material Risk
 | ID | Finding | Source | Owner | Status | Target |
 |----|---------|--------|-------|--------|--------|
 | B-010 | Legacy auth not blocked: 847 sign-ins in 30 days | PULSAR | Ondřej | **Closed** Day 30 | Day 30 |
 | B-011 | Domain Admins using standard workstations | BloodHound | Ondřej | **Closed** Day 65 | Day 60 |
 | B-012 | 7 service accounts with Domain Admin rights, no documented purpose | AD audit | Ondřej | **Closed** Day 72 | Day 60 |
 | B-013 | Intune compliance exception covers 489/500 users | ASTRAL | Ondřej | **Closed** Day 30 | Day 30 |
 | B-014 | 47 Entra app registrations with Mail.ReadWrite or higher scope | Entra audit | Ondřej | In Progress | Day 120 |
 | B-015 | GitHub org: no MFA enforcement, personal/managed account mix | Brownhat | Ondřej | **Closed** Day 30 | Day 30 |
 | B-016 | SAP secondary instance: default admin credentials not changed | Pentest report | IT Lead (SAP) | Open | Day 90 |
 | B-017 | No audit log retention beyond 90 days | Brownhat | Ondřej | **Closed** Day 1 (PULSAR) | Day 30 |
 ### P2 — Housekeeping Queue
 | ID | Finding | Source | Owner | Status | Target |
 |----|---------|--------|-------|--------|--------|
 | B-100 | NTLM not disabled; NTLMv1 permitted | AD audit | Ondřej | Open | Q3 |
 | B-101 | 89 stale AD accounts from former employees | Elysium | Ondřej | In Progress (Cycle 2) | Q3 |
 | B-102 | 14 DNS records for decommissioned services | AD audit | Ondřej | Open | Q3 |
 | B-103 | 23 firewall rules with any/any destination | Firewall review | Network | Open | Q4 |
 | B-104 | 31 macOS devices not enrolled in Intune | ASTRAL/Intune | Ondřej | In Progress (Module 1) | Day 180 |
 | B-105 | No documented vendor access procedure | Brownhat | Ondřej | **Closed** Day 85 | Day 90 |
 | B-106 | Windows Server 2016 file server: EOL Oct 2026 | Brownhat | CTO | Open | Oct 2026 |
 | B-107 | 67 former employee accounts in Jira/Confluence | Brownhat | Ondřej | In Progress (Cycle 1) | Q3 |
 | B-108 | SharePoint external sharing: 14 sites with active external links | ASTRAL | Ondřej | Open | Q3 |
 | B-109 | Basic auth still enabled for Exchange | Brownhat | Ondřej | Open | Q2 |
 ---
 ## NIS2 Article 21 Compliance Map
 *Evidence produced by this engagement against the Article 21 measures. Use this table in the NIS2 questionnaire response.*
 | Article 21 Measure | Requirement | Evidence from this engagement |
 |--------------------|-------------|-------------------------------|
 | **21(2)(a)** Policies on risk analysis and information security | Documented policies | Brownhat Diagnostic report; module completion packages; risk register |
 | **21(2)(b)** Incident handling | Detection and response capability | PULSAR alert rules + runbooks; incident escalation procedure |
 | **21(2)(c)** Business continuity, backup, DR | Tested backup and recovery | Module 7: ERP backup restore test report; Recovery Time documented |
 | **21(2)(d)** Supply chain security | Vendor/supplier risk management | Contractor access procedure; vendor access inventory; offboarding checklist |
 | **21(2)(e)** Security in acquisition, development | Secure development and procurement | (Partial — addressed in Phase 4; not covered in 180-day programme) |
 | **21(2)(f)** Policies to assess effectiveness | Metrics and review cadence | ASTRAL drift history; PULSAR event summaries; quarterly BloodHound/Elysium; housekeeping cycle reports |
 | **21(2)(g)** Cyber hygiene and training | Basic hygiene and awareness | MFA enforcement; CA policies; device compliance; housekeeping stream |
 | **21(2)(h)** Cryptography and encryption | Encryption standards | (Addressed via CA device compliance and baseline — documented) |
 | **21(2)(i)** HR security, access control, asset management | Identity governance, privileged access | Module 2: MFA, CA, privileged account management; Module 6: AD hardening; stale account process |
 | **21(2)(j)** Authentication, MFA | MFA for all users | CA policy enforced for all 500 users; verified via sign-in log (Day 90 deliverable #8) |
 **For the supervisory authority questionnaire**: The strongest evidence package is: (1) the Brownhat Diagnostic report showing risk analysis was conducted, (2) the ASTRAL baseline showing configuration management is operational, (3) the PULSAR deployment showing logging and monitoring is in place, and (4) the Day 90 MFA enforcement verification via sign-in logs. These four items directly answer the most common questions in NIS2 supervisory questionnaires.
 ---
 ## Investment Estimate
 *Effort ranges using the module investment levels from [Modular Engagements](../core/modular-engagements.md). Day rates applied per engagement proposal.*
 | Phase | Activity | Estimated Effort |
 |-------|----------|-----------------|
 | Brownhat Diagnostic | 2-day workshop + report | 16–20 consultant hours |
 | Quick wins implementation | CA policies, account disables, GitHub MFA | 8–12 hours (same week as diagnostic) |
 | Module 2: M365 Identity Security | MFA rollout (500 users, 10 admins, contractors), CA baseline, legacy auth block, app consent review, ASTRAL/PULSAR deployment | **Low to medium** (20–30 consultant days) |
 | Module 6: On-Premise AD Hardening | KRBTGT rotation, service account cleanup, PAW for admins, BloodHound remediation, AD Connect de-privilege | **Low to medium** (15–25 consultant days) |
 | Module 1: Endpoint Management | Intune compliance baseline, macOS enrollment, CA integration, ASTRAL hardening | **Low** (8–15 consultant days) |
 | Module 7: Recovery & Resilience | Backup integrity testing, ERP restore drill, DR runbooks | **Low** (8–12 consultant days) |
 | **Total 180-day programme** | | **~55–80 consultant days** |
 **Infrastructure costs** (one-time, at cost):
 - PULSAR hosting: €10–20/month (VPS or Azure Container Apps) — or on the client's existing infrastructure
 - ASTRAL: no additional cost (Azure DevOps pipelines within E3/Microsoft Partner allocation)
 **Retained capability** (post-180 days, quarterly):
 - Monthly ASTRAL drift review and PULSAR health check
 - Quarterly BloodHound + Elysium run + housekeeping cycle
 - Estimated: 3–5 consultant days per quarter
 ---
 ## Consultant Notes
 **The CISO handover opportunity**: The CTO mentioned they want something to hand over when they hire a CISO. Structure the Day 180 deliverables explicitly as a CISO onboarding package: the backlog, the ASTRAL history, the PULSAR event summary, the module completion packages, and the retained scope. A new CISO who inherits a cleaned AD, enforced MFA, running detection, and a maintained backlog is in a position to build — not to firefight.
 **Managing the NIS2 timeline pressure**: The questionnaire is due in 90 days. The Day 90 deliverables are specifically designed to produce the four evidence items (diagnostic, ASTRAL, PULSAR, MFA enforcement) needed to answer the questionnaire. Do not let the regulatory deadline distort the sequence — the diagnostic first, then module work. A questionnaire answered with ASTRAL drift logs and CA sign-in evidence is stronger than one answered with a Word document and good intentions.
 **The two-domain AD**: The acquisition-created second domain adds complexity to Module 6. Scope it explicitly in the kickoff: which domain gets the KRBTGT rotation first? Are there forest-level trusts? BloodHound collection needs to cover both. Add 5–7 days to the Module 6 estimate if the trust relationship is poorly documented.
 **SAP credentials (P1-016)**: This finding is outside the standard M365/AD scope. It requires SAP admin access and coordination with the ERP team (who may not report to Ondřej). Flag it as an explicit dependency at kickoff — it will slip past Day 90 without an owner from the ERP side.
 **Contractors**: 80 contractors at any given time means the offboarding process is a permanent operational concern, not a one-time fix. The contractor provisioning and offboarding procedure (B-105) must name an owner in HR, not just IT. If HR does not send a termination notification, IT cannot offboard. This is a process dependency that the engagement alone cannot fix — it requires a management conversation.
 ---
 *This sample engagement is based on composite real-world findings from mid-market AD+M365 environments. All company names and individual details are fictional.*
 *Related: [Brownhat Diagnostic](../assessment-templates/nist-csf-baseline.md) · [Module Menu](../core/modular-engagements.md) · [Findings Backlog](../assessment-templates/findings-backlog.md) · [NIS2 Mapping](../reference/nist-csf-mapping.md) · [Risk Register Example](../assessment-templates/risk-register-example.md)*
@@ -6,7 +6,7 @@ This document provides the complete capability map for our consulting practice:
 1. **Clients** who want to understand what we bring to an engagement
 2. **Consultants** who need to select the right tool for the right module
-3. **Our own product team** who are building ASTRAL and AOC to close the M365-native gap
+3. **Our own product team** who are building ASTRAL and PULSAR to close the M365-native gap
 ---
@@ -115,11 +115,11 @@ This document provides the complete capability map for our consulting practice:
 | **Antifragile pillar** | Sovereign Intelligence, Asymmetric Payoff Design |
 | **Engagement modules** | Module 4 (Data Governance); Module 11 (Embedded Quality); all compliance-driven clients |
 | **Typical output** | Live compliance dashboard: "DORA Article 12: 14 of 17 controls evidence-complete; 3 gaps assigned to owners with due dates" |
-| **Integration** | Pulls findings from Prowler, osquery, BloodHound, and AOC into unified evidence packages |
+| **Integration** | Pulls findings from Prowler, osquery, BloodHound, and PULSAR into unified evidence packages |
 **The conversation**:
-> *"Your auditor wants evidence that you monitor privileged access. CISO Assistant links the BloodHound scan, the Purple Knight score, the AOC admin activity report, and the osquery listening-ports query into a single evidence package for DORA Article 8. No scrambling for screenshots the night before the audit."*
+> *"Your auditor wants evidence that you monitor privileged access. CISO Assistant links the BloodHound scan, the Purple Knight score, the PULSAR admin activity report, and the osquery listening-ports query into a single evidence package for DORA Article 8. No scrambling for screenshots the night before the audit."*
 ---
@@ -129,16 +129,29 @@ This document provides the complete capability map for our consulting practice:
 | Attribute | Detail |
 |-----------|--------|
-| **What it does** | Intelligent backup, configuration drift detection, and change management for Microsoft Intune, Entra ID, and M365 tenant configurations. Captures baseline state, detects unauthorised or accidental changes, and enables rapid rollback. |
+| **What it does** | Git-tracked snapshots of Microsoft Intune and Entra ID configuration with Azure DevOps pipeline-driven drift detection, PR-based review and approval workflow, and baseline restore capability. Answers: *"what does my tenant configuration look like, what changed, and can we revert it?"* |
-| **Why we built it** | No existing tool treats M365 configuration as code. A tenant with 500 conditional access policies, 200 Intune profiles, and 50 compliance policies is unmanageable without version control and drift detection. ASTRAL provides GitOps for M365. |
+| **Why we built it** | No existing tool treats M365 configuration as code. A tenant with 200 CA policies, 500 Intune profiles, and dozens of authentication methods is unmanageable without version control and drift detection. ASTRAL provides GitOps for M365. |
-| **Antifragile pillar** | Structural Decoupling, Stress-to-Signal Conversion |
+| **Antifragile pillar** | Structural Decoupling (surface hidden dependencies), Asymmetric Payoff Design (high protection from low deployment cost) |
 | **Engagement modules** | Module 1 (Endpoint Management); Module 2 (Identity Security); Module 3 (M365 Security Hardening); retained capability engagements |
-| **Typical output** | "Configuration drift detected: 3 conditional access policies modified outside change window; 1 Intune profile deleted; all changes attributable to [admin account]; rollback initiated automatically" |
+| **Typical output** | Rolling PR: "Drift detected: 3 Conditional Access policies modified outside change window; 1 Intune profile deleted; changes attributed to admin@contoso.com via audit log. Reviewer decision: /accept or /reject." |
-| **Integration** | Feeds change logs into AOC for audit intelligence; exports configuration state to CISO Assistant for compliance evidence |
+| **Repository** | [github.com/cqrenet/astral](https://github.com/cqrenet/astral) — free, open source (MIT) |
 | **Integration** | Entra ID and Intune baseline; feeds CISO Assistant for compliance evidence; AURORA connects to ASTRAL's MCP server for cross-tool diagnostics |
 **What it tracks** (current scope):
 *Intune*: App Configuration, App Protection, Applications, Compliance Policies, Device Configurations, Enrollment Configurations, Filters, Scope Tags, Scripts, Settings Catalog.
 *Entra*: Named Locations, Authentication Strengths, Conditional Access, App Registrations, Enterprise Applications. Admin role assignments and auth methods policies in development (Phase 1 roadmap).
 **Key capabilities**:
 - Event-driven change probe (Azure Function App) triggers backup within minutes of a tenant change — no more hourly polling
 - Reviewer `/accept` and `/reject` commands in ADO PR threads; auto-queued restore on rejection
 - MCP server (Azure Container Apps) exposes tenant state and drift history to AI assistants
 - Optional Azure OpenAI PR narratives — BYOAI, fully optional, ASTRAL is complete without it
 **The conversation**:
-> *"Your M365 tenant has 400 configuration objects and no version control. When an admin accidentally deletes a conditional access policy at 2 AM, you discover it 6 hours later because users are complaining. ASTRAL detects the deletion in 60 seconds, attributes it to the specific admin session, and offers one-click rollback. This is not backup. This is configuration immunity."*
+> *"Your M365 tenant has 400 configuration objects and no version control. When an admin accidentally deletes a Conditional Access policy at 2 AM, you discover it 6 hours later because users are complaining. ASTRAL detects the deletion within minutes via its event-driven change probe, attributes it to the specific admin session, and offers one-click rollback through the restore pipeline. This is not backup. This is configuration governance."*
 **ASTRAL companion utilities (CQRE)**:
@@ -152,20 +165,63 @@ This document provides the complete capability map for our consulting practice:
 ### M365 Audit Log Intelligence
-#### AOC — Admin Operations Center (Our Platform)
+#### PULSAR (Our Platform)
 | Attribute | Detail |
 |-----------|--------|
-| **What it does** | Correlates Microsoft 365 Unified Audit Log, Entra ID sign-in logs, and Intune operational logs into actionable intelligence. Detects anomalous admin behaviour, privilege escalation, shadow IT creation, and data exfiltration patterns. |
+| **What it does** | Ingests Microsoft 365 admin audit events (Entra, Intune, Exchange, SharePoint, Teams) into MongoDB and exposes a UI, REST API, and MCP server for search, filtering, alerting, and SIEM forwarding. Answers: *"what happened in my tenant, when, and by whom?"* |
-| **Why we built it** | The native M365 audit log is a firehose: 10,000+ events per day in a typical tenant, searchable only via slow PowerShell or expensive Sentinel. AOC extracts the 50 events that matter and enriches them with identity context, device state, and business impact. |
+| **Why we built it** | Native M365 audit log retention is capped at 90 days (E3) or 180 days (E5) — searchable only via slow PowerShell or expensive Sentinel. PULSAR provides permanent retention, fast search, and an MCP interface so AI assistants can query audit history directly. |
-| **Antifragile pillar** | Sovereign Intelligence, Stress-to-Signal Conversion |
+| **Antifragile pillar** | Stress-to-Signal Conversion — every admin action becomes permanent, searchable signal |
-| **Engagement modules** | Module 12 (Blue/Purple Team Foundation); retained capability (Detection Engineering); all M365 hardening engagements |
+| **Engagement modules** | Module 12 (Blue/Purple Team Foundation); retained capability (Detection Engineering); any engagement with log retention requirements |
-| **Typical output** | Daily brief: "3 anomalous events flagged: Global Admin [X] added external user at 03:14; Exchange Admin [Y] exported 12,000 mailboxes; Service Principal [Z] granted Mail.Read to unverified publisher. All require validation within 4 hours." |
+| **Typical output** | UI search: "Show me all Conditional Access policy changes by GlobalAdmin@contoso.com in the last 30 days." MCP query: `search_events(actor="globaladmin@contoso.com", operation="Update conditional access policy", days=30)` |
-| **Integration** | Receives alerts from osquery/FleetDM, Wazuh, and Prowler; pushes cases to CISO Assistant for risk register tracking; enriches AI-assisted TVM with insider-threat context; **MCP server** enables Claude and other AI clients to query audit logs in natural language directly from the analyst's desktop |
+| **Repository** | [github.com/cqrenet/pulsar](https://github.com/cqrenet/pulsar) — free, open source (MIT) |
 | **Integration** | AURORA connects to PULSAR's MCP server for cross-tool diagnostics; alerting rules forward to webhook endpoints; SIEM forwarding to Sentinel/Splunk *(see maturity note)* |
 **Sources ingested**:
 - Entra ID directory audit logs
 - Intune audit logs
 - Exchange Online, SharePoint, and Teams via the Office 365 Management Activity API
 **MCP tools**: `search_events`, `get_event`, `get_summary` — available over stdio (local) or SSE (remote, with API key or Entra OIDC auth).
 > **Maturity note — alerting and SIEM forwarding**: Both features are functional but proof-of-concept quality, suitable for evaluation and non-critical environments. Alerting has no rule management UI and webhook delivery has no retry logic. SIEM forwarding is basic with no delivery guarantees and is not tested at volume. Do not recommend these features for production use in environments where reliability is required — hardening is on the roadmap. AURORA provides production-grade enriched SIEM forwarding for clients who need it now.
 **The conversation**:
-> *"Microsoft gives you the audit log. They do not give you the story. AOC reads 50,000 events per night and tells you the three that need human attention: an admin added an external user at 3 AM, another exported 12,000 mailboxes, and a service principal granted Mail.Read to an unverified app. These are not false positives. These are the events that precede breaches."*
+> *"Microsoft gives you the audit log. They also take it away after 90 days. PULSAR keeps it forever. When you have an incident six months from now — and you will — and you need to know who added that external user, who modified that CA policy, and what that service principal was doing at 3 AM the week before the breach — PULSAR answers in seconds. Without it, the question is unanswerable."*
 ---
 ### M365 Governance Intelligence
 #### AURORA (Our Platform — Paid)
 | Attribute | Detail |
 |-----------|--------|
 | **What it does** | A unified operations platform connecting PULSAR and ASTRAL via their MCP servers. Provides AI-assisted cross-tool diagnostics, multi-scope orchestration, and enriched SIEM forwarding that neither product can produce alone. Answers: *"what does it mean and what should I do?"* |
 | **Why we built it** | Running PULSAR and ASTRAL separately leaves an investigation gap: audit events and configuration state live in different places with no correlation layer. AURORA closes that gap. |
 | **Antifragile pillar** | Sovereign Intelligence (owned observability and reasoning infrastructure), Optionality Preservation (data stays yours; AI layer is pluggable) |
 | **Engagement modules** | Retained capability engagements; any client running the full PULSAR + ASTRAL stack |
 | **Pricing** | Self-hosted: €259/mo (single tenant), €429/mo (≤5 scopes). Hosted: €389/mo, €599/mo. Enterprise: custom. |
 | **Repository** | [aurora.cqre.net](https://aurora.cqre.net) — commercial, self-hosted or CQRE-managed |
 **Cross-tool diagnostic tools**:
 | Tool | What it answers |
 |------|----------------|
 | `diagnose_policy_errors` | "Why is this Intune compliance policy erroring on some devices but not others?" — pulls ASTRAL policy config and PULSAR audit events for the same policy |
 | `explain_device_compliance` | "Why did this device suddenly become non-compliant?" — combines ASTRAL assignment data with PULSAR event timeline |
 | `correlate_drift_with_audit` | "Who triggered this configuration drift commit?" — matches ASTRAL Git commits with PULSAR audit events by timestamp |
 | `tenant_security_summary` | "What happened this week that I should know about?" — combines open ASTRAL drift PRs with PULSAR event summary |
 | `compare_scopes` | "What's different between my production and development CA policies?" |
 **AURORA stores no data.** All data lives in PULSAR (MongoDB) and ASTRAL (Git) under the client's control. AURORA is purely the query, orchestration, and intelligence layer.
 **When to recommend**: After at least one module cycle with PULSAR + ASTRAL deployed. The upsell is natural — clients who have investigated an incident using both tools separately will immediately understand AURORA's value.
 **The conversation**:
 > *"You have ASTRAL showing you what changed and PULSAR showing you who did what. AURORA answers the question neither product answers alone: are those two things related? Did the admin action in PULSAR trigger the drift commit in ASTRAL? Was that a legitimate change or a compromise? That correlation currently takes 20 minutes of manual investigation. AURORA does it in 30 seconds."*
 ---
@@ -180,7 +236,7 @@ This document provides the complete capability map for our consulting practice:
 | **Antifragile pillar** | Structural Decoupling, Stress-to-Signal Conversion |
 | **Engagement modules** | Module 2 (M365 Identity Security); Module 3 (M365 Security Hardening); compliance audits requiring CA policy evidence (NIS2, ISO 27001, DORA) |
 | **Typical output** | Excel workbook with one row per policy: policy name, conditions, controls, named groups and apps (not object IDs), assignment scope, current state (enabled/disabled/report-only), and export timestamp. Audit-ready without a single screenshot. |
-| **Integration** | Export feeds into ASTRAL as the human-readable CA policy baseline (state at engagement start); CISO Assistant links the workbook as evidence for Entra ID hardening controls; AOC change alerts are cross-referenced against the export to identify which named policy changed |
+| **Integration** | Export feeds into ASTRAL as the human-readable CA policy baseline (state at engagement start); CISO Assistant links the workbook as evidence for Entra ID hardening controls; PULSAR change alerts are cross-referenced against the export to identify which named policy changed |
 **The conversation**:
@@ -199,7 +255,7 @@ This document provides the complete capability map for our consulting practice:
    ┌───────────────┬───────────────┼───────────────┬───────────────┐
    ▼               ▼               ▼               ▼               ▼
 ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
-│ Prowler │   │BloodHound│   │ ASTRAL  │   │  AOC    │   │ osquery │
+│ Prowler │   │BloodHound│   │ ASTRAL  │   │ PULSAR  │   │ osquery │
 │(Cloud)  │   │  (AD)   │   │ (M365)  │   │(Audit)  │   │(Endpoint)│
 └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘
     │             │             │             │             │
@@ -221,7 +277,7 @@ This document provides the complete capability map for our consulting practice:
 **Data flow**:
 1. **Discovery layer** (Prowler, BloodHound, osquery, ASTRAL) collects raw security state
-2. **Intelligence layer** (AOC, AI-assisted TVM) correlates, enriches, and prioritises
+2. **Intelligence layer** (PULSAR, AI-assisted TVM) correlates, enriches, and prioritises
 3. **Governance layer** (CISO Assistant) maps findings to compliance frameworks and tracks remediation
 4. **Validation layer** (Purple Knight, Forest Druid, purple team exercises) proves fixes work
@@ -233,7 +289,7 @@ Our current stack covers cloud posture, AD security, GRC, M365 configuration, an
 ### Gap 1: Endpoint Detection and Response (EDR) — The Visibility Gap
-**Current state**: osquery provides structured endpoint inventory and compliance. AOC ingests M365 audit logs. What is missing is real-time behavioural detection on the endpoint itself.
+**Current state**: osquery provides structured endpoint inventory and compliance. PULSAR ingests M365 audit logs. What is missing is real-time behavioural detection on the endpoint itself.
 **Recommended close**: **Wazuh + Sysmon** (open-source EDR stack)
@@ -252,7 +308,7 @@ Our current stack covers cloud posture, AD security, GRC, M365 configuration, an
 ### Gap 2: Security Orchestration and Automated Response (SOAR) — The Response Gap
-**Current state**: AOC detects anomalous admin behaviour. ASTRAL detects configuration drift. What is missing is automated response: disabling a compromised account, isolating a device, or revoking an OAuth grant at machine speed.
+**Current state**: PULSAR detects anomalous admin behaviour. ASTRAL detects configuration drift. What is missing is automated response: disabling a compromised account, isolating a device, or revoking an OAuth grant at machine speed.
 **Recommended close**: **Shuffle** (open-source SOAR)
@@ -263,7 +319,7 @@ Our current stack covers cloud posture, AD security, GRC, M365 configuration, an
 | Self-hosted: data never leaves client infrastructure |
 | Replaces €100,000+/year commercial SOAR platforms |
-**Example playbook**: AOC detects impossible-travel sign-in → Shuffle disables account → ASTRAL revokes all active sessions → Slack alerts SOC → CISO Assistant logs incident → Ticket created in client ITSM.
+**Example playbook**: PULSAR detects impossible-travel sign-in → Shuffle disables account → ASTRAL revokes all active sessions → Slack alerts SOC → CISO Assistant logs incident → Ticket created in client ITSM.
 **When to deploy**: Module 12 (Blue/Purple Team Foundation); retained capability engagements.
@@ -271,7 +327,7 @@ Our current stack covers cloud posture, AD security, GRC, M365 configuration, an
 ### Gap 3: Incident Response Case Management — The Coordination Gap
-**Current state**: Findings are scattered across Prowler, BloodHound, AOC, and osquery. What is missing is a single case management system that tracks incidents from detection through remediation to post-mortem.
+**Current state**: Findings are scattered across Prowler, BloodHound, PULSAR, and osquery. What is missing is a single case management system that tracks incidents from detection through remediation to post-mortem.
 **Recommended close**: **TheHive + Cortex** (open-source SOC case management)
@@ -330,7 +386,7 @@ Our current stack covers cloud posture, AD security, GRC, M365 configuration, an
 |----------|--------------|
 | Protocol analysis: extracts metadata from HTTP, DNS, TLS, SMB without full packet storage | IDS/IPS with 30,000+ signatures and emerging threat rules |
 | Scales to 10 Gbps+ on commodity hardware | Can drop malicious traffic inline (IPS mode) |
-| Output is structured JSON—easy to feed into Wazuh or AOC | Native file extraction and malware detection |
+| Output is structured JSON—easy to feed into Wazuh or PULSAR | Native file extraction and malware detection |
 **When to deploy**: Module 8 (OT Security Assessment) for industrial network segmentation validation; Module 12 (Blue/Purple Team) for detection engineering.
@@ -345,7 +401,7 @@ Our current stack covers cloud posture, AD security, GRC, M365 configuration, an
 | AD security assessment | **Purple Knight / Forest Druid** | PingCastle, ADRecon | Semperis Directory Services Protector | AD hardening engagements |
 | GRC and compliance | **CISO Assistant** | OpenGRC, SimpleRisk | ServiceNow GRC, RSA Archer | DORA, NIS2, SOC 2 clients |
 | M365 backup/change mgmt | **ASTRAL** | — (no open-source equivalent) | Veeam, AvePoint, SkyKick | All M365 clients; retained capability |
-| M365 audit intelligence | **AOC** | — (no open-source equivalent) | Microsoft Sentinel, ManageEngine | All M365 clients; SOC co-management |
+| M365 audit intelligence | **PULSAR** | — (no open-source equivalent) | Microsoft Sentinel, ManageEngine | All M365 clients; SOC co-management |
 | CA policy documentation | **CAExporter** | — (no equivalent) | — | Every Module 2 engagement; CA audits |
 | AD password audit | **Elysium** | — (DSInternals manual use) | Netwrix Password Policy, Specops | Every AD engagement; Module 6 |
 | Intune baseline deployment | **macOS_IntuneManagement** | — (no cross-platform equivalent) | — | Tenant migrations; brownfield baseline |
@@ -382,13 +438,13 @@ Our current stack covers cloud posture, AD security, GRC, M365 configuration, an
 **CQRE utilities**: macOS_IntuneManagement (baseline deployment, cross-tenant migration); IntunePolicyParser (policy audit register); M365-Scripts (MDE device lifecycle); E8-CAT (pre/post hardening Essential Eight score)
 ### Module 2: M365 Identity Security
-**Primary**: AOC (audit log intelligence) + BloodHound (hybrid identity attack paths)
+**Primary**: PULSAR (audit log intelligence) + BloodHound (hybrid identity attack paths)
 **Augmentation**: Purple Knight (AD security baseline)
 **CQRE utilities**: CAExporter (CA policy documentation baseline — run first, before any CA hardening)
 ### Module 3: M365 Security Hardening
 **Primary**: ASTRAL (configuration state) + Prowler (Azure posture)
-**Augmentation**: AOC (continuous monitoring of security control changes)
+**Augmentation**: PULSAR (continuous monitoring of security control changes)
 **CQRE utilities**: CAExporter (CA policy register as audit evidence); E8-CAT (macro restriction and application hardening verification)
 ### Module 6: On-Premise AD Hardening
@@ -406,10 +462,10 @@ Our current stack covers cloud posture, AD security, GRC, M365 configuration, an
 ### Module 12: Blue/Purple Team Foundation
 **Primary**: Wazuh + Sysmon + TheHive + Cortex + Shuffle
-**Augmentation**: AOC (M365-specific detections) + Velociraptor (endpoint forensics) + OpenCanary (deception) + OpenCTI (threat intel correlation)
+**Augmentation**: PULSAR (M365-specific detections) + Velociraptor (endpoint forensics) + OpenCanary (deception) + OpenCTI (threat intel correlation)
 ### Retained Capability: Detection Engineering
-**Primary**: Wazuh (rule authoring) + AOC (M365 detections) + Shuffle (response playbooks)
+**Primary**: Wazuh (rule authoring) + PULSAR (M365 detections) + Shuffle (response playbooks)
 **Augmentation**: Zeek + Suricata (network detection rules)
 ---
@@ -423,7 +479,7 @@ Our current stack covers cloud posture, AD security, GRC, M365 configuration, an
 | Purple Knight | 30 minutes | None | Low | Medium (AD scan) |
 | CISO Assistant | 1 day | Docker host or VM | Low | Low-Medium (compliance data) |
 | ASTRAL | 2 hours | SaaS or client-hosted | Low | High (M365 configuration) |
-| AOC | 4 hours | SaaS or client-hosted | Medium | High (audit logs, identity data) |
+| PULSAR | 4 hours | SaaS or client-hosted | Medium | High (audit logs, identity data) |
 | CAExporter | 30 minutes | None (runs from PowerShell) | Low | Low (read-only CA policy export) |
 | Elysium | 1–2 hours | Dedicated secure host (on-premises) | Medium | High (domain password hashes — stays on-prem) |
 | macOS_IntuneManagement | 1 hour | None (PowerShell 7+) | Low | Medium (Intune policy data) |
@@ -463,7 +519,7 @@ Beyond the core stack, these tools address specific niches that arise in sophist
 | **What it does** | Open-source cross-platform adversary simulation and command-and-control (C2) framework. Replaces Cobalt Strike for red team engagements at zero licensing cost. |
 | **Why we use it** | Cobalt Strike costs €30,000+/year and is fingerprinted by most EDR. Sliver is free, actively maintained by Bishop Fox, and supports DNS, HTTPS, mutual TLS, and WireGuard C2 channels. It generates implants for Windows, macOS, and Linux. |
 | **When to deploy** | Module 10 (Red Team & Validation); purple team exercises; EDR efficacy testing |
-| **Integration** | Red team activity detected by Wazuh + Sysmon feeds into TheHive cases; AOC correlates any M365 session anomalies with red team timing |
+| **Integration** | Red team activity detected by Wazuh + Sysmon feeds into TheHive cases; PULSAR correlates any M365 session anomalies with red team timing |
 **The conversation**:
@@ -502,7 +558,7 @@ Beyond the core stack, these tools address specific niches that arise in sophist
 | **What it does** | Runtime security detection for containers, Kubernetes, and Linux hosts. Uses system call monitoring to detect anomalous behaviour: unexpected outbound connections, privileged container escapes, sensitive file access. |
 | **Why we use it** | Syft + Grype find vulnerable packages at build time. Falco detects exploitation at runtime. Without Falco, a container with a CVE can be exploited silently. |
 | **When to deploy** | Any client with Kubernetes or containerised workloads; Module 9 (Organisational Resilience) for CI/CD security gates |
-| **Integration** | Falco alerts feed into Wazuh or directly to TheHive; AOC correlates container events with M365 identity context for supply-chain attack detection |
+| **Integration** | Falco alerts feed into Wazuh or directly to TheHive; PULSAR correlates container events with M365 identity context for supply-chain attack detection |
 ---
@@ -568,7 +624,7 @@ Beyond the core stack, these tools address specific niches that arise in sophist
 | **What it does** | Scans Git repositories for hardcoded secrets: API keys, passwords, tokens, private keys. Supports pre-commit hooks and CI/CD integration. |
 | **Why we use it** | The most common cloud breach vector is not zero-day exploitation. It is a developer committing an AWS access key to GitHub. GitLeaks finds it before the commit—or scans historical commits for existing leakage. |
 | **When to deploy** | Module 9 (Organisational Resilience); DevSecOps engagements; any client with active software development |
-| **Integration** | CI/CD pipeline integration; findings fed into CISO Assistant for evidence tracking; AOC monitors for any M365 session using leaked credentials |
+| **Integration** | CI/CD pipeline integration; findings fed into CISO Assistant for evidence tracking; PULSAR monitors for any M365 session using leaked credentials |
 ---
@@ -592,7 +648,7 @@ Beyond the core stack, these tools address specific niches that arise in sophist
 | **What it does** | Open-source phishing simulation framework. Build campaigns, track click rates, capture credentials (in training mode), and measure user susceptibility over time. |
 | **Why we use it** | Commercial phishing platforms cost €5-15/user/year. GoPhish is free, self-hosted, and produces equivalent metrics. It integrates with LDAP for realistic email targeting. |
 | **When to deploy** | Module 3 (M365 Security Hardening); security awareness programmes; post-incident user training |
-| **Integration** | Results feed into CISO Assistant for training evidence; high-risk users flagged in AOC for enhanced monitoring |
+| **Integration** | Results feed into CISO Assistant for training evidence; high-risk users flagged in PULSAR for enhanced monitoring |
 ---
@@ -682,7 +738,7 @@ These are partnerships we invest in deeply. We train the team, build integration
 | **What they provide** | Managed EDR for SMBs and mid-market: 24/7 threat hunting, incident response, ransomware rollback. Agent deployment via RMM or Intune. |
 | **Why we partner** | Our open-source EDR stack (Wazuh + Sysmon) is excellent for clients who want sovereignty. But it requires us to tune rules, investigate alerts, and respond to incidents. Huntress provides the 24/7 layer we cannot staff at 5-20 people. We bring the strategic context; they bring the night shift. |
 | **Client archetype** | E3 clients without Defender P2; municipalities; professional services; any client who needs EDR but cannot justify CrowdStrike or SentinelOne |
-| **Engagement model** | We deploy and configure Huntress as part of Module 1 or 3. We retain the relationship and add our own detection rules via AOC for M365 context. Huntress handles the endpoint. We handle the narrative. |
+| **Engagement model** | We deploy and configure Huntress as part of Module 1 or 3. We retain the relationship and add our own detection rules via PULSAR for M365 context. Huntress handles the endpoint. We handle the narrative. |
 | **Financial model** | Per-endpoint licensing with partner margin. We bill labour for deployment, tuning, and quarterly reviews. The recurring license revenue funds our growth without proportional labour increase. |
 | **When NOT to use** | Clients who require air-gapped networks; clients with sovereign-data mandates that prohibit third-party agent telemetry; clients who explicitly want to own their detection logic (then we deploy Wazuh) |
@@ -748,7 +804,7 @@ These are tools we purchase for our own team to deliver services more effectivel
 | **Burp Suite Professional** | Web application penetration testing | The industry standard. Community edition is too limited for professional engagements. |
 | **Cobalt Strike** (or **Sliver** for budget-conscious) | Red team C2 and adversary simulation | When clients specifically require Cobalt Strike for insurance or compliance validation. Sliver is our default; Cobalt Strike is the enterprise alternative. |
 | **Offensive Security / SANS training** | Consultant skill development | Our team must maintain current certifications. Training is a cost of doing business, not a partnership. |
-| **Microsoft Action Pack / CSP** | Internal M365 licensing for testing | We need sandbox tenants to test ASTRAL and AOC before client deployment. Microsoft's partner programme provides this at low cost. |
+| **Microsoft Action Pack / CSP** | Internal M365 licensing for testing | We need sandbox tenants to test ASTRAL and PULSAR before client deployment. Microsoft's partner programme provides this at low cost. |
 ---
@@ -757,9 +813,9 @@ These are tools we purchase for our own team to deliver services more effectivel
 | Category | Example | Why We Refuse |
 |----------|---------|---------------|
 | **All-in-one security platforms** | CrowdStrike, Palo Alto, SentinelOne | They replace our entire stack with a black box. We become a reseller, not a consultant. The client loses sovereignty. We lose differentiation. |
-| **Generic SIEM** | Splunk, Datadog, Elastic Cloud | Wazuh + TheHive + AOC covers 90% of client needs. Splunk requires a €100K+ commitment and a dedicated engineer. We refer complex SIEM needs to specialists rather than pretending to be one. |
+| **Generic SIEM** | Splunk, Datadog, Elastic Cloud | Wazuh + TheHive + PULSAR covers 90% of client needs. Splunk requires a €100K+ commitment and a dedicated engineer. We refer complex SIEM needs to specialists rather than pretending to be one. |
 | **AI security startups** | Any vendor claiming "AI-powered" threat detection with no transparent model | Our AI strategy is sovereign: Azure OpenAI bridge and local LLMs. We do not resell opaque AI tools that we cannot explain to a board. |
-| **M365 management competitors** | CoreView, AdminDroid, Quest | ASTRAL and AOC are our proprietary differentiators. Partnering here would undermine our own product investment. |
+| **M365 management competitors** | CoreView, AdminDroid, Quest | ASTRAL and PULSAR are our proprietary differentiators. Partnering here would undermine our own product investment. |
 ---
@@ -775,7 +831,7 @@ These are tools we purchase for our own team to deliver services more effectivel
 - Tier 1: Huntress + Thinkst + Tenable (full enterprise VM partnership)
 - Tier 2: Delinea, KnowBe4, Veeam, Proofpoint (active partner status, trained engineers)
 - Tier 3: Cobalt Strike license for red team; additional SANS/training budget
- ASTRAL and AOC monetised as SaaS products with their own revenue stream
+- ASTRAL and PULSAR monetised as SaaS products with their own revenue stream
 **The rule**: Every commercial partnership must either (a) provide a capability we cannot build, (b) generate recurring revenue without proportional labour, or (c) satisfy a compliance requirement that open-source cannot meet. If it does none of these, we decline.
@@ -802,11 +858,11 @@ These are tools we purchase for our own team to deliver services more effectivel
 | Document | Integration |
 |----------|-------------|
 | [Zero-Budget Vulnerability Discovery](zero-budget-vulnerability-discovery.md) | Syft + Grype container pipeline; osquery endpoint discovery; Prowler cloud-native discovery; GitLeaks secrets scanning |
-| [AI-Assisted TVM Blueprint](ai-assisted-tvm.md) | All discovery tools feed the AI prioritisation engine; AOC provides insider-threat context; OpenCTI enriches with threat actor context |
+| [AI-Assisted TVM Blueprint](ai-assisted-tvm.md) | All discovery tools feed the AI prioritisation engine; PULSAR provides insider-threat context; OpenCTI enriches with threat actor context |
 | [Perimeter Scanning Capability](perimeter-scanning-capability.md) | Nuclei + Amass + Naabu form the open-source active scanning layer; Prowler covers cloud perimeter; CertStream monitors for new subdomains |
 | [Osquery: The Sovereign Discovery Platform](osquery-custom-platform.md) | osquery + FleetDM is the endpoint discovery layer; Wazuh extends to behavioural detection; Velociraptor adds forensic hunting |
-| [Blue/Purple Team Foundation](../core/blue-purple-team-foundation.md) | Wazuh + TheHive + Cortex + Shuffle form the open-source SOC stack; AOC adds M365-specific detection; Sliver enables adversary simulation; OpenCanary provides deception |
+| [Blue/Purple Team Foundation](../core/blue-purple-team-foundation.md) | Wazuh + TheHive + Cortex + Shuffle form the open-source SOC stack; PULSAR adds M365-specific detection; Sliver enables adversary simulation; OpenCanary provides deception |
-| [Retained Capability](../core/retained-capability.md) | Detection Engineering retained capability is built on Wazuh + AOC + Shuffle; Threat Context on TheHive + Cortex + OpenCTI |
+| [Retained Capability](../core/retained-capability.md) | Detection Engineering retained capability is built on Wazuh + PULSAR + Shuffle; Threat Context on TheHive + Cortex + OpenCTI |
 | [Modular Engagements](../core/modular-engagements.md) | Each module has a recommended tool pairing in the matrix above; partnership doctrine defines when commercial tools supplement open-source |
 | [AD and Endpoint Hardening](ad-endpoint-hardening.md) | BloodHound maps attack paths; Purple Knight / Forest Druid score AD security; Velociraptor hunts for indicators of compromise on domain controllers |
 | [Business Case Template](business-case-template.md) | Partnership financial models (Huntress recurring, Thinkst margin, Tenable compliance) feed into client ROI calculations |
@@ -38,7 +38,7 @@ IG1 is the **safeguards that every organization should implement to protect agai
 | Sovereignty (Days 60-90) | Ensure proprietary AI data never leaves perimeter | Local AI infrastructure |
 | Antifragility (Days 90-180) | Automated data loss prevention | Existing CASB or DLP |
-**Antifragile Angle**: Data protection is not encryption at rest. It is **ensuring your proprietary signal does not train your competitor's model**. Local AI is a data protection control.
+**Antifragile Angle**: Data protection is not encryption at rest. It is **ensuring your proprietary operational data stays under your control, with audit rights and data residency you can verify**. Local or sovereign AI is a data protection control.
 ### Control 4: Secure Configuration of Enterprise Assets and Software
@@ -279,6 +279,53 @@ See [M365 E3 Hardening](../playbooks/m365-e3-hardening.md) for tactical hardenin
 ---
 ## The Controlled Burn Adaptation: When Greenfield Is Not an Option
 The antifragile framework holds that organisations should build toward the ability to deploy greenfield — rebuild from scratch, on clean infrastructure, from version-controlled configuration. This is the ultimate expression of structural decoupling: if you can rebuild the environment, no adversary and no vendor holds you hostage.
 Power utilities, water suppliers, and telecom network operators frequently view this principle as inapplicable. The grid does not go dark for a rebuild exercise. Protection relays cannot be factory-reset during a fault. OT systems operate under safety cases that require regulatory approval for any configuration change. The controlled burn, taken literally, cannot happen.
 This is correct. It is also not the end of the conversation.
 **The goal of greenfield capability is to eliminate inherited compromise and return to a known-good operational state.** For IT environments, the method is rebuild. For OT/NT environments, the method is different — but the goal is identical, and it is achievable. The absence of a literal rebuild path does not justify the absence of a recovery plan.
 ### The OT-Adapted Greenfield Stack
 **Layer 1: IT greenfield protects OT.** The corporate IT environment, M365 tenant, SCADA servers, historian, engineering workstations, and HMI layer can almost always be made greenfield-capable even when OT hardware cannot. An adversary who compromises the IT layer and finds a clean rebuild path loses their persistence and pivot path without a single OT device being touched. IT greenfield is the outer perimeter of an OT environment that cannot be rebuilt itself. This is the first investment.
 **Layer 2: OT configuration as code.** PLC logic, IED settings files, protection relay configuration archives, SCADA database snapshots, DCS export files — all of these belong in version-controlled backups with integrity verification. The ability to restore a known-good configuration to existing hardware is the OT equivalent of greenfield: the hardware remains, but the software state is wiped and rebuilt from a verified baseline. This is not a backup exercise. It is a discipline — with the same rigour that ASTRAL applies to M365 configuration, applied to OT configuration archives. Every piece of OT configuration that exists only in the device and nowhere else is a single point of failure.
 **Layer 3: Manual operation as the fallback layer.** The ability to operate critical systems without the automation layer is, in practice, the ability to drop the compromised layer and continue service. A power utility that can maintain 70–80% of service from manual procedures during a SCADA compromise has a fundamentally different risk profile than one that cannot. Manual override procedures must be:
 - Documented in detail, not just referenced in an emergency plan
 - Tested under realistic conditions, not just reviewed in a tabletop
 - Known by currently assigned operations staff, not just veterans who may have left
 - Validated at least annually — capability that is not practised does not exist when it is needed
 **Layer 4: Compartmentalisation as partial burn.** OT environments are typically sectionable. Grid islanding, substation isolation, plant-level control separation, and control centre failover allow the operator to sacrifice and rebuild one section while maintaining critical service in others. This is the OT equivalent of the controlled burn: localised rather than total, sequential rather than simultaneous, but governed by the same principle — designed-in ability to contain, recover, and restore without waiting for a complete environment to be clean.
 **Layer 5: Planned long-cycle refresh.** OT systems have 20–40 year operational lifetimes, but those lifetimes should be a programme, not an accident. Organisations without a documented OT refresh schedule — with component-by-component replacement milestones, firmware escrow requirements, spare parts inventory targets, and vendor succession planning — are not avoiding greenfield. They are deferring it until a crisis forces it under the worst possible conditions: compromised hardware, unavailable vendors, missing documentation, and no tested procedures.
 ### The Acceptance Statement
 Some OT components in critical infrastructure genuinely cannot be replaced on any timescale that security planning can influence. Legacy protection relays on operational transmission lines. Nuclear instrumentation systems under active safety cases. Water treatment chemical dosing controllers that predate the organisation's current IT function.
 For these systems, the correct position is explicit acceptance, not avoidance:
 1. **Name them.** Identify specifically which systems are outside the rebuild envelope and why.
 2. **Isolate them.** The isolation must be proportional to the acknowledged unrepairability. A system that cannot be patched, cannot be replaced, and cannot be rebuilt must be surrounded by compensating controls so thorough that its compromise cannot propagate.
 3. **Monitor them obsessively.** Configuration integrity monitoring, network traffic baselining, and anomaly detection for these specific systems — because when you cannot fix the asset, detection and containment are the only remaining defences.
 4. **Plan their eventual replacement.** "This system cannot be replaced in the current operational context" is acceptable. "This system will never be replaced" is not a security posture — it is a deferred decision that will be made under worse conditions later.
 The acceptance statement is not a sign of weakness. It is the honest foundation of a credible security programme. Regulators, insurers, and incident responders all prefer an organisation that knows exactly where its limits are and has compensating controls in place over one that claims no limits and has no plan.
 ### The OT Greenfield Test
 *"If our IT and SCADA layers were fully compromised tonight: could we maintain critical service from manual procedures within 4 hours? Rebuild the IT layer from clean baselines within 48 hours? Restore full automated operation from verified OT configuration backups within two weeks? And have we actually tested each of these in the past 12 months?"*
 If any answer is no, the gap is in manual procedures, IT rebuild capability, OT configuration management, or test cadence — not in the impossibility of the OT environment itself.
 ---
 ## Evidence Package for Regulators
 | Requirement | Evidence from Antifragile Program |
@@ -0,0 +1,15 @@
 # Tools
 Standalone, runnable instruments that support the engagement — as distinct from the markdown frameworks and playbooks elsewhere in the repository.
 | Tool | What it does | How to run |
 |------|--------------|------------|
 | [`kill-chain-assessment.html`](kill-chain-assessment.html) | Maps an unknown estate into an attack graph, computes the shortest existential path (the kill chain), and sizes every node into a remediation quantum. The synthesis instrument for the first act of every engagement. | Open in any browser. Offline, no install, no network. State persists locally; exports to `.json` and `.md`. |
 ## Design constraints for tools in this directory
 - **Offline and sovereign.** Client attack-surface data must never leave the consultant's machine for a vendor cloud (Antifragile Manifest, Pillar 4). Tools here are single-file and dependency-free wherever possible.
 - **Exportable.** Output drops into the engagement deliverables — the [diagnostic report](../assessment-templates/nist-csf-baseline.md) and the [Findings Backlog](../assessment-templates/findings-backlog.md) — not into a proprietary format.
 - **Explicit, not magic.** A tool makes the consultant's judgement repeatable; it does not replace it.
 See the [Kill Chain Assessment App spec](../playbooks/kill-chain-assessment-app.md) for the model behind the first tool.
@@ -0,0 +1,642 @@
 <!DOCTYPE html>
 <html lang="en">
 <head>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
 <title>Kill Chain Assessment — Brownhat / CQRE</title>
 <style>
  :root{
    --bg:#0d1117; --panel:#161b22; --panel2:#1c2330; --line:#30363d; --line2:#3d4654;
    --ink:#e6edf3; --muted:#9aa6b2; --faint:#6e7781;
    --p0:#ff4d4f; --p1:#ff9f0a; --p2:#3fb950; --dark:#a371f7; --entry:#58a6ff; --jewel:#f7c948;
    --accent:#58a6ff; --accent2:#1f6feb;
    --crit:#ff4d4f; --sev:#ff9f0a; --std:#3fb950; --darkq:#a371f7; --house:#6e7781;
  }
  *{box-sizing:border-box}
  body{margin:0;background:var(--bg);color:var(--ink);font:14px/1.5 -apple-system,BlinkMacSystemFont,"Segoe UI",Roboto,Helvetica,Arial,sans-serif}
  header{padding:16px 22px;border-bottom:1px solid var(--line);display:flex;align-items:center;gap:16px;flex-wrap:wrap;background:linear-gradient(180deg,#11161d,#0d1117)}
  header h1{font-size:18px;margin:0;letter-spacing:.3px}
  header .tag{font-size:11px;color:var(--faint);border:1px solid var(--line);padding:2px 8px;border-radius:20px}
  header .sub{color:var(--muted);font-size:12.5px;margin-left:auto;max-width:520px;text-align:right}
  .wrap{display:grid;grid-template-columns:340px 1fr 360px;gap:0;height:calc(100vh - 59px)}
  .col{overflow-y:auto;padding:16px}
  .col.left{border-right:1px solid var(--line)}
  .col.right{border-left:1px solid var(--line);background:#0b0f14}
  h2{font-size:12px;text-transform:uppercase;letter-spacing:1px;color:var(--muted);margin:4px 0 10px;font-weight:600}
  h2 .hint{text-transform:none;letter-spacing:0;font-weight:400;color:var(--faint);display:block;font-size:11.5px;margin-top:3px}
  .panel{background:var(--panel);border:1px solid var(--line);border-radius:10px;padding:13px;margin-bottom:14px}
  label{display:block;font-size:11.5px;color:var(--muted);margin:9px 0 3px}
  input,select,textarea,button{font:inherit;color:var(--ink)}
  input[type=text],select,textarea{width:100%;background:var(--panel2);border:1px solid var(--line2);border-radius:7px;padding:7px 9px}
  input[type=text]:focus,select:focus,textarea:focus{outline:none;border-color:var(--accent)}
  textarea{resize:vertical;min-height:34px}
  .row{display:flex;gap:8px}
  .row>*{flex:1}
  .chk{display:flex;align-items:center;gap:7px;margin:8px 0;font-size:12.5px;color:var(--ink)}
  .chk input{width:auto}
  button{cursor:pointer;background:var(--panel2);border:1px solid var(--line2);border-radius:7px;padding:8px 12px;transition:.12s}
  button:hover{border-color:var(--accent);color:#fff}
  button.primary{background:var(--accent2);border-color:var(--accent2);color:#fff;font-weight:600}
  button.primary:hover{background:#388bfd}
  button.ghost{background:transparent}
  button.danger:hover{border-color:var(--p0);color:var(--p0)}
  .btnrow{display:flex;gap:8px;flex-wrap:wrap;margin-top:10px}
  .btnrow button{flex:1;min-width:0}
  .pill{display:inline-block;font-size:10px;font-weight:700;letter-spacing:.5px;padding:2px 7px;border-radius:20px;text-transform:uppercase}
  .pill.entry{background:rgba(88,166,255,.16);color:var(--entry);border:1px solid var(--entry)}
  .pill.jewel{background:rgba(247,201,72,.14);color:var(--jewel);border:1px solid var(--jewel)}
  .node-item{background:var(--panel2);border:1px solid var(--line);border-radius:8px;padding:9px 10px;margin-bottom:7px;cursor:pointer}
  .node-item:hover{border-color:var(--accent)}
  .node-item.sel{border-color:var(--accent);box-shadow:0 0 0 1px var(--accent) inset}
  .node-item .nm{font-weight:600;display:flex;justify-content:space-between;align-items:center;gap:6px}
  .node-item .meta{font-size:11px;color:var(--faint);margin-top:3px;display:flex;gap:6px;flex-wrap:wrap}
  .edge-item{font-size:12px;background:var(--panel2);border:1px solid var(--line);border-radius:7px;padding:7px 9px;margin-bottom:6px;display:flex;justify-content:space-between;gap:8px;align-items:flex-start}
  .edge-item .x{cursor:pointer;color:var(--faint);flex-shrink:0}
  .edge-item .x:hover{color:var(--p0)}
  .tabs{display:flex;gap:4px;margin-bottom:12px;border-bottom:1px solid var(--line)}
  .tabs button{border:none;border-bottom:2px solid transparent;border-radius:0;background:none;color:var(--muted);padding:8px 12px}
  .tabs button.on{color:#fff;border-bottom-color:var(--accent)}
  svg{width:100%;display:block}
  .empty{color:var(--faint);font-size:12.5px;text-align:center;padding:30px 10px;border:1px dashed var(--line2);border-radius:10px}
  .kc-box{background:var(--panel);border:1px solid var(--line);border-radius:10px;padding:14px;margin-bottom:14px}
  .kc-step{display:flex;align-items:center;gap:10px;padding:7px 0}
  .kc-arrow{color:var(--p0);font-size:18px;text-align:center;margin:-2px 0}
  .kc-node{flex:1;background:var(--panel2);border:1px solid var(--line2);border-left:3px solid var(--p0);border-radius:6px;padding:7px 10px}
  .kc-node .n{font-weight:600;font-size:13px}
  .kc-node .m{font-size:11px;color:var(--muted)}
  .kc-mech{font-size:11px;color:var(--faint);font-style:italic;padding-left:14px}
  .stat{display:flex;justify-content:space-between;padding:5px 0;border-bottom:1px solid var(--line);font-size:13px}
  .stat:last-child{border:none}
  .stat b{font-variant-numeric:tabular-nums}
  .q{border-radius:8px;border:1px solid var(--line);padding:10px 12px;margin-bottom:9px;background:var(--panel)}
  .q .qh{display:flex;justify-content:space-between;align-items:center;font-weight:700;font-size:12px;letter-spacing:.5px;text-transform:uppercase}
  .q.crit{border-left:4px solid var(--crit)} .q.crit .qh{color:var(--crit)}
  .q.sev{border-left:4px solid var(--sev)}  .q.sev .qh{color:var(--sev)}
  .q.std{border-left:4px solid var(--std)}  .q.std .qh{color:var(--std)}
  .q.darkq{border-left:4px solid var(--darkq)} .q.darkq .qh{color:var(--darkq)}
  .q .ql{font-size:12.5px;margin-top:7px}
  .q .qi{padding:4px 0;border-top:1px solid var(--line);margin-top:5px}
  .q .qi:first-of-type{border:none}
  .q .qi .qn{font-weight:600}
  .q .qi .qd{font-size:11px;color:var(--muted)}
  .q .budget{font-size:10.5px;color:var(--faint);font-weight:400;text-transform:none;letter-spacing:0}
  .discovery h3{font-size:12.5px;margin:12px 0 5px;color:var(--accent)}
  .discovery ul{margin:0 0 6px;padding-left:18px;color:var(--muted);font-size:12px}
  .discovery li{margin-bottom:3px}
  .discovery code{background:var(--panel2);border:1px solid var(--line);border-radius:4px;padding:1px 5px;color:#e6edf3;font-size:11px}
  .note{font-size:11.5px;color:var(--faint);margin-top:6px}
  .legend{display:flex;gap:12px;flex-wrap:wrap;font-size:11px;color:var(--muted);margin-bottom:8px}
  .legend span{display:flex;align-items:center;gap:5px}
  .dot{width:10px;height:10px;border-radius:50%}
  .topbtns{display:flex;gap:8px}
  .file-in{display:none}
  ::-webkit-scrollbar{width:10px;height:10px}
  ::-webkit-scrollbar-thumb{background:#222b36;border-radius:6px}
  ::-webkit-scrollbar-track{background:transparent}
  .muted{color:var(--muted)} .small{font-size:11.5px}
 </style>
 </head>
 <body>
 <header>
  <h1>⛓ Kill Chain Assessment</h1>
  <span class="tag">Brownhat · CQRE</span>
  <div class="topbtns">
    <button class="ghost" onclick="loadSample()">Load sample</button>
    <button class="ghost" onclick="exportJSON()">Save .json</button>
    <button class="ghost" onclick="document.getElementById('imp').click()">Open .json</button>
    <button class="primary" onclick="exportMD()">Export report .md</button>
    <input type="file" id="imp" class="file-in" accept=".json" onchange="importJSON(event)">
  </div>
  <div class="sub">Map unknown territory into nodes and attacker moves. The tool finds the shortest path from a foothold to an existential asset — that path <b>is</b> the kill chain — and sizes each node into a remediation quantum.</div>
 </header>
 <div class="wrap">
  <!-- LEFT: capture -->
  <div class="col left">
    <div class="tabs">
      <button id="t-node" class="on" onclick="tab('node')">Nodes</button>
      <button id="t-edge" onclick="tab('edge')">Moves</button>
      <button id="t-disc" onclick="tab('disc')">Discovery</button>
    </div>
    <!-- NODE form -->
    <div id="pane-node">
      <div class="panel">
        <h2>Add / edit node<span class="hint">An asset, foothold, identity, or system in the estate.</span></h2>
        <label>Name</label>
        <input type="text" id="n-name" placeholder="e.g. Entra ID Connect sync server">
        <div class="row">
          <div>
            <label>Layer</label>
            <select id="n-type">
              <option value="entry">Entry / exposure</option>
              <option value="identity">Identity</option>
              <option value="privilege">Privilege</option>
              <option value="device">Device / endpoint</option>
              <option value="data">Data / collaboration</option>
              <option value="infra">Infrastructure / OT</option>
              <option value="recovery">Recovery / backup</option>
            </select>
          </div>
          <div>
            <label>Tier</label>
            <select id="n-tier">
              <option value="">— unknown —</option>
              <option value="T0">T0 (control plane)</option>
              <option value="T1">T1 (servers/apps)</option>
              <option value="T2">T2 (workstations)</option>
            </select>
          </div>
        </div>
        <div class="chk"><input type="checkbox" id="n-entry"><label style="margin:0;color:var(--entry)">Adversary entry point (internet-reachable / unauth foothold)</label></div>
        <div class="chk"><input type="checkbox" id="n-jewel"><label style="margin:0;color:var(--jewel)">Crown jewel (existential — org cannot operate if lost)</label></div>
        <div class="row">
          <div>
            <label>Reachable by adversary?</label>
            <select id="n-reach"><option value="unknown">Unknown</option><option value="yes">Yes</option><option value="no">No</option></select>
          </div>
          <div>
            <label>Exploit / path available?</label>
            <select id="n-expl"><option value="unknown">Unknown</option><option value="yes">Yes</option><option value="no">No</option></select>
          </div>
        </div>
        <div class="chk"><input type="checkbox" id="n-comp"><label style="margin:0">Compensating control already in front of it (EDR, WAF, segmentation)</label></div>
        <label>Finding / note (optional)</label>
        <textarea id="n-note" placeholder="What's wrong here, evidence, CVE…"></textarea>
        <div class="btnrow">
          <button class="primary" onclick="saveNode()">Save node</button>
          <button class="ghost" onclick="clearNodeForm()">Clear</button>
        </div>
      </div>
      <h2>Nodes <span id="n-count" class="muted small"></span></h2>
      <div id="node-list"></div>
    </div>
    <!-- EDGE form -->
    <div id="pane-edge" style="display:none">
      <div class="panel">
        <h2>Add attacker move<span class="hint">A directed step: "from here, an attacker can reach there."</span></h2>
        <label>From</label>
        <select id="e-from"></select>
        <label>To</label>
        <select id="e-to"></select>
        <label>Mechanism (how)</label>
        <input type="text" id="e-mech" placeholder="e.g. DCSync via sync-account rights">
        <label>Adversary effort: <span id="e-wlabel">3 — moderate</span></label>
        <input type="range" id="e-weight" min="1" max="5" value="3" style="width:100%" oninput="document.getElementById('e-wlabel').textContent=effortLabel(this.value)">
        <div class="note">Lower effort = easier for the attacker. The kill chain is the <i>lowest-effort</i> path to a crown jewel.</div>
        <div class="btnrow"><button class="primary" onclick="saveEdge()">Add move</button></div>
      </div>
      <h2>Moves <span id="e-count" class="muted small"></span></h2>
      <div id="edge-list"></div>
    </div>
    <!-- DISCOVERY -->
    <div id="pane-disc" style="display:none">
      <div class="panel discovery">
        <h2>Discovering the chain in unknown territory<span class="hint">What to ask and run to surface the edges you can't see yet. Each answer becomes a node or a move.</span></h2>
        <h3>1 · Find the entry points (reachability)</h3>
        <ul>
          <li>What does the internet see? External scan / Shodan / attack-surface mapping → every internet-facing service is a candidate entry node.</li>
          <li>Internet-facing VPN, RDP, mail, web apps, appliances — firmware current? MFA enforced?</li>
          <li>Legacy auth still enabled? (bypasses MFA — a silent entry edge)</li>
        </ul>
        <h3>2 · Find the identity bridges (Book II)</h3>
        <ul>
          <li><code>Entra Connect sync account</code> — does it hold DCSync rights on-prem? That's a cloud→on-prem edge.</li>
          <li>Federation / PTA / PHS path, writeback, seamless SSO — map the bridge.</li>
        </ul>
        <h3>3 · Find privilege paths (Book III)</h3>
        <ul>
          <li>BloodHound: <code>shortestPath</code> to Domain Admins from non-admins — every path is a chain of edges.</li>
          <li>Kerberoastable / AS-REP-roastable high-priv accounts; KRBTGT last-set date.</li>
          <li>App registrations with <code>RoleManagement.ReadWrite.Directory</code>, <code>Mail.ReadWrite</code> — OAuth consent edges.</li>
        </ul>
        <h3>4 · Find the crown jewels (existential nodes)</h3>
        <ul>
          <li>Ask the business, not IT: "what stops the company operating?" ERP, payment rails, OT control, the customer DB.</li>
          <li>Backups & recovery — are they reachable from the estate they protect? If yes, that's an edge into your lifeboat.</li>
        </ul>
        <h3>5 · Map blast radius (the edges between)</h3>
        <ul>
          <li>Flat network? NTLM relay, lateral movement → dense edges, short chains.</li>
          <li>Segmentation, least privilege, T0 isolation → sparse edges, long chains. Note where they're <i>missing</i>.</li>
        </ul>
        <p class="note">Anything you can't characterise (reachable? unknown) becomes a <span style="color:var(--darkq)">dark quantum</span> — capture the node anyway and mark reachability/exploit "unknown". An uncharacterised asset is the dangerous kind.</p>
      </div>
    </div>
  </div>
  <!-- CENTER: graph + chain -->
  <div class="col center">
    <h2>Attack graph &amp; kill chain</h2>
    <div class="legend">
      <span><span class="dot" style="background:var(--entry)"></span>entry</span>
      <span><span class="dot" style="background:var(--jewel)"></span>crown jewel</span>
      <span><span class="dot" style="background:var(--p0)"></span>on shortest chain (P0)</span>
      <span><span class="dot" style="background:var(--p1)"></span>on a chain (P1)</span>
      <span><span class="dot" style="background:var(--p2)"></span>off-chain (P2)</span>
    </div>
    <div class="panel" style="padding:6px"><div id="graph"></div></div>
    <div id="chain-out"></div>
  </div>
  <!-- RIGHT: results -->
  <div class="col right">
    <h2>Assessment</h2>
    <div class="panel" id="summary"></div>
    <h2>Remediation quanta<span class="hint">Sized by time-to-existential-impact, not CVSS.</span></h2>
    <div id="quanta"></div>
  </div>
 </div>
 <script>
 /* ---------------- state ---------------- */
 let nodes = [];   // {id,name,type,tier,entry,jewel,reach,expl,comp,note}
 let edges = [];   // {id,from,to,mech,w}
 let editingId = null;
 let uid = () => 'n'+Math.random().toString(36).slice(2,8);
 const STORE='brownhat-killchain-v1';
 function persist(){ try{localStorage.setItem(STORE,JSON.stringify({nodes,edges}));}catch(e){} }
 function restore(){ try{const s=JSON.parse(localStorage.getItem(STORE));if(s&&s.nodes){nodes=s.nodes;edges=s.edges||[];}}catch(e){} }
 function effortLabel(v){return {1:'1 — trivial',2:'2 — easy',3:'3 — moderate',4:'4 — hard',5:'5 — very hard'}[v];}
 /* ---------------- tabs ---------------- */
 function tab(t){
  ['node','edge','disc'].forEach(x=>{
    document.getElementById('pane-'+x).style.display = x===t?'block':'none';
    document.getElementById('t-'+x).classList.toggle('on',x===t);
  });
  if(t==='edge') refreshEdgeSelects();
 }
 /* ---------------- node CRUD ---------------- */
 function saveNode(){
  const name=document.getElementById('n-name').value.trim();
  if(!name){alert('Name the node first.');return;}
  const data={
    name,
    type:document.getElementById('n-type').value,
    tier:document.getElementById('n-tier').value,
    entry:document.getElementById('n-entry').checked,
    jewel:document.getElementById('n-jewel').checked,
    reach:document.getElementById('n-reach').value,
    expl:document.getElementById('n-expl').value,
    comp:document.getElementById('n-comp').checked,
    note:document.getElementById('n-note').value.trim()
  };
  if(editingId){ Object.assign(nodes.find(n=>n.id===editingId),data); }
  else { nodes.push(Object.assign({id:uid()},data)); }
  clearNodeForm(); render();
 }
 function editNode(id){
  const n=nodes.find(x=>x.id===id); if(!n)return;
  editingId=id;
  document.getElementById('n-name').value=n.name;
  document.getElementById('n-type').value=n.type;
  document.getElementById('n-tier').value=n.tier||'';
  document.getElementById('n-entry').checked=n.entry;
  document.getElementById('n-jewel').checked=n.jewel;
  document.getElementById('n-reach').value=n.reach;
  document.getElementById('n-expl').value=n.expl;
  document.getElementById('n-comp').checked=n.comp;
  document.getElementById('n-note').value=n.note||'';
  tab('node'); window.scrollTo(0,0);
 }
 function delNode(id){
  if(!confirm('Delete this node and its moves?'))return;
  nodes=nodes.filter(n=>n.id!==id);
  edges=edges.filter(e=>e.from!==id&&e.to!==id);
  if(editingId===id)clearNodeForm();
  render();
 }
 function clearNodeForm(){
  editingId=null;
  ['n-name','n-note'].forEach(i=>document.getElementById(i).value='');
  document.getElementById('n-type').value='entry';
  document.getElementById('n-tier').value='';
  ['n-entry','n-jewel','n-comp'].forEach(i=>document.getElementById(i).checked=false);
  document.getElementById('n-reach').value='unknown';
  document.getElementById('n-expl').value='unknown';
 }
 /* ---------------- edge CRUD ---------------- */
 function refreshEdgeSelects(){
  const opts=nodes.map(n=>`<option value="${n.id}">${esc(n.name)}</option>`).join('');
  document.getElementById('e-from').innerHTML=opts;
  document.getElementById('e-to').innerHTML=opts;
 }
 function saveEdge(){
  const from=document.getElementById('e-from').value, to=document.getElementById('e-to').value;
  if(!from||!to){alert('Add at least two nodes first.');return;}
  if(from===to){alert('A move must go between two different nodes.');return;}
  edges.push({id:uid(),from,to,mech:document.getElementById('e-mech').value.trim(),w:+document.getElementById('e-weight').value});
  document.getElementById('e-mech').value='';
  render();
 }
 function delEdge(id){ edges=edges.filter(e=>e.id!==id); render(); }
 /* ---------------- analysis: Dijkstra shortest existential path ---------------- */
 function analyse(){
  const entryIds=nodes.filter(n=>n.entry).map(n=>n.id);
  const jewelIds=new Set(nodes.filter(n=>n.jewel).map(n=>n.id));
  const adj={}; nodes.forEach(n=>adj[n.id]=[]);
  edges.forEach(e=>{ if(adj[e.from]) adj[e.from].push(e); });
  // multi-source Dijkstra from all entry points
  const dist={}, prev={}, prevEdge={};
  nodes.forEach(n=>dist[n.id]=Infinity);
  const pq=[];
  entryIds.forEach(id=>{dist[id]=0; pq.push([0,id]);});
  while(pq.length){
    pq.sort((a,b)=>a[0]-b[0]);
    const [d,u]=pq.shift();
    if(d>dist[u])continue;
    (adj[u]||[]).forEach(e=>{
      const nd=d+e.w;
      if(nd<dist[e.to]){dist[e.to]=nd;prev[e.to]=u;prevEdge[e.to]=e;pq.push([nd,e.to]);}
    });
  }
  // best jewel = reachable jewel with min dist
  let best=null;
  jewelIds.forEach(j=>{ if(dist[j]<Infinity && (!best||dist[j]<dist[best])) best=j; });
  // reconstruct shortest chain
  let chain=[],chainEdges=[];
  if(best!=null){
    let cur=best;
    while(cur!=null){ chain.unshift(cur); if(prevEdge[cur]){chainEdges.unshift(prevEdge[cur]);cur=prev[cur];} else cur=null; }
  }
  const onShortest=new Set(chain);
  // nodes on ANY existential path: reachable from entry AND can reach a jewel
  const reachFromEntry=new Set();
  (function(){const st=[...entryIds];entryIds.forEach(i=>reachFromEntry.add(i));
    while(st.length){const u=st.pop();(adj[u]||[]).forEach(e=>{if(!reachFromEntry.has(e.to)){reachFromEntry.add(e.to);st.push(e.to);}});}})();
  // reverse reachability to a jewel
  const radj={}; nodes.forEach(n=>radj[n.id]=[]); edges.forEach(e=>{if(radj[e.to])radj[e.to].push(e.from);});
  const canReachJewel=new Set();
  (function(){const st=[...jewelIds];jewelIds.forEach(i=>canReachJewel.add(i));
    while(st.length){const u=st.pop();(radj[u]||[]).forEach(f=>{if(!canReachJewel.has(f)){canReachJewel.add(f);st.push(f);}});}})();
  const onAnyChain=new Set(nodes.filter(n=>reachFromEntry.has(n.id)&&canReachJewel.has(n.id)).map(n=>n.id));
  return {chain,chainEdges,onShortest,onAnyChain,dist,best,entryIds,jewelIds,reachable:reachFromEntry};
 }
 /* priority + quantum per node */
 function priority(n,a){
  if(a.onShortest.has(n.id))return 'P0';
  if(a.onAnyChain.has(n.id))return 'P1';
  return 'P2';
 }
 function quantum(n,a){
  const onChain = a.onShortest.has(n.id)||a.onAnyChain.has(n.id);
  if(!onChain) return 'house';
  if(n.reach==='unknown'||n.expl==='unknown') return 'dark';
  if(a.onShortest.has(n.id) && n.reach==='yes' && n.expl==='yes' && !n.comp) return 'crit';
  if(n.reach==='yes' || n.expl==='yes') return 'sev';
  return 'std';
 }
 const QMETA={
  crit:{label:'Critical quantum',budget:'hours · compensating control, not the patch',cls:'crit'},
  sev:{label:'Severe quantum',budget:'days · batched into one change window',cls:'sev'},
  std:{label:'Standard quantum',budget:'sprint · drained in finishable batches',cls:'std'},
  dark:{label:'Dark quantum',budget:'unsized · route to discovery',cls:'darkq'},
  house:{label:'Housekeeping',budget:'off every kill chain — not urgent',cls:'std'}
 };
 /* ---------------- render ---------------- */
 function esc(s){return (s||'').replace(/[&<>"]/g,c=>({'&':'&amp;','<':'&lt;','>':'&gt;','"':'&quot;'}[c]));}
 const TYPELBL={entry:'Entry',identity:'Identity',privilege:'Privilege',device:'Device',data:'Data',infra:'Infra/OT',recovery:'Recovery'};
 function render(){
  persist();
  renderNodeList(); renderEdgeList(); refreshEdgeSelects();
  const a = analyse();
  renderGraph(a); renderChain(a); renderSummary(a); renderQuanta(a);
 }
 function renderNodeList(){
  document.getElementById('n-count').textContent = nodes.length?`(${nodes.length})`:'';
  const el=document.getElementById('node-list');
  if(!nodes.length){el.innerHTML='<div class="empty">No nodes yet. Add the footholds and assets you find — or “Load sample”.</div>';return;}
  const a=analyse();
  el.innerHTML=nodes.map(n=>{
    const p=priority(n,a);
    const pc=p==='P0'?'var(--p0)':p==='P1'?'var(--p1)':'var(--p2)';
    return `<div class="node-item ${editingId===n.id?'sel':''}" onclick="editNode('${n.id}')">
      <div class="nm"><span>${esc(n.name)}</span>
        <span style="display:flex;gap:5px;align-items:center">
          ${n.entry?'<span class="pill entry">entry</span>':''}
          ${n.jewel?'<span class="pill jewel">jewel</span>':''}
          <span style="color:${pc};font-weight:700;font-size:11px">${(a.onShortest.has(n.id)||a.onAnyChain.has(n.id))?p:'—'}</span>
          <span class="x" onclick="event.stopPropagation();delNode('${n.id}')" style="cursor:pointer;color:var(--faint)">✕</span>
        </span>
      </div>
      <div class="meta"><span>${TYPELBL[n.type]||n.type}</span>${n.tier?`<span>· ${n.tier}</span>`:''}
        <span>· reach:${n.reach}</span><span>· exploit:${n.expl}</span>${n.comp?'<span>· compensated</span>':''}</div>
    </div>`;
  }).join('');
 }
 function renderEdgeList(){
  document.getElementById('e-count').textContent = edges.length?`(${edges.length})`:'';
  const el=document.getElementById('edge-list');
  if(!edges.length){el.innerHTML='<div class="empty">No moves yet. A move is one attacker step from one node to another.</div>';return;}
  const nm=id=>{const n=nodes.find(x=>x.id===id);return n?esc(n.name):'?';};
  el.innerHTML=edges.map(e=>`<div class="edge-item">
    <div><b>${nm(e.from)}</b> → <b>${nm(e.to)}</b><br>
      <span class="muted small">${esc(e.mech)||'(mechanism unspecified)'} · effort ${e.w}</span></div>
    <span class="x" onclick="delEdge('${e.id}')">✕</span></div>`).join('');
 }
 function renderGraph(a){
  const g=document.getElementById('graph');
  if(!nodes.length){g.innerHTML='<div class="empty" style="margin:10px">The attack graph renders here.</div>';return;}
  // simple layered layout by distance-from-entry (BFS depth), entries left → jewels right
  const depth={}; nodes.forEach(n=>depth[n.id]=n.entry?0:null);
  const adj={};nodes.forEach(n=>adj[n.id]=[]);edges.forEach(e=>{if(adj[e.from])adj[e.from].push(e.to);});
  let q=nodes.filter(n=>n.entry).map(n=>n.id),guard=0;
  while(q.length&&guard++<999){const u=q.shift();(adj[u]||[]).forEach(v=>{if(depth[v]==null||depth[v]>depth[u]+1){depth[v]=depth[u]+1;q.push(v);}});}
  let maxd=0;nodes.forEach(n=>{if(depth[n.id]==null)depth[n.id]=999;maxd=Math.max(maxd,depth[n.id]===999?0:depth[n.id]);});
  // orphans (no depth) put in a trailing column
  const cols={};nodes.forEach(n=>{const d=depth[n.id]===999?maxd+1:depth[n.id];(cols[d]=cols[d]||[]).push(n);});
  const colKeys=Object.keys(cols).map(Number).sort((x,y)=>x-y);
  const W=Math.max(640,colKeys.length*180), colW=W/colKeys.length;
  let maxRows=0;colKeys.forEach(k=>maxRows=Math.max(maxRows,cols[k].length));
  const H=Math.max(220,maxRows*72+40);
  const pos={};
  colKeys.forEach((k,ci)=>{cols[k].forEach((n,ri)=>{const rows=cols[k].length;
    pos[n.id]={x:colW*ci+colW/2,y:H/(rows+1)*(ri+1)};});});
  const col=n=>{if(a.onShortest.has(n.id))return'var(--p0)';if(a.onAnyChain.has(n.id))return'var(--p1)';if(n.jewel)return'var(--jewel)';if(n.entry)return'var(--entry)';return'#3fb95066';};
  const onChainEdge=new Set(a.chainEdges.map(e=>e.id));
  let svg=`<svg viewBox="0 0 ${W} ${H}" preserveAspectRatio="xMidYMid meet">
    <defs><marker id="arr" markerWidth="9" markerHeight="9" refX="8" refY="3" orient="auto"><path d="M0,0 L8,3 L0,6 Z" fill="#5b6675"/></marker>
    <marker id="arrR" markerWidth="10" markerHeight="10" refX="8" refY="3" orient="auto"><path d="M0,0 L8,3 L0,6 Z" fill="var(--p0)"/></marker></defs>`;
  edges.forEach(e=>{const a1=pos[e.from],b=pos[e.to];if(!a1||!b)return;
    const hot=onChainEdge.has(e.id);
    const mx=(a1.x+b.x)/2,my=(a1.y+b.y)/2-18;
    svg+=`<path d="M${a1.x},${a1.y} Q${mx},${my} ${b.x},${b.y}" fill="none" stroke="${hot?'var(--p0)':'#39414d'}" stroke-width="${hot?2.4:1.2}" marker-end="url(#${hot?'arrR':'arr'})" opacity="${hot?1:.7}"/>`;
  });
  nodes.forEach(n=>{const p=pos[n.id];if(!p)return;const c=col(n);
    const r=n.jewel||n.entry?20:16;
    svg+=`<g>
      <circle cx="${p.x}" cy="${p.y}" r="${r}" fill="${c}" fill-opacity="${a.onShortest.has(n.id)?0.95:0.18}" stroke="${c}" stroke-width="2"/>
      ${n.jewel?`<text x="${p.x}" y="${p.y+4}" text-anchor="middle" font-size="14">★</text>`:''}
      ${n.entry?`<text x="${p.x}" y="${p.y+4}" text-anchor="middle" font-size="12">▶</text>`:''}
      <text x="${p.x}" y="${p.y+r+13}" text-anchor="middle" font-size="11" fill="#c9d4df">${esc(n.name.length>22?n.name.slice(0,21)+'…':n.name)}</text>
    </g>`;});
  svg+='</svg>';
  g.innerHTML=svg;
 }
 function renderChain(a){
  const el=document.getElementById('chain-out');
  if(!a.entryIds.length||!a.jewelIds.size){
    el.innerHTML=`<div class="kc-box"><b>No kill chain yet.</b><div class="note">Mark at least one node as an <span style="color:var(--entry)">entry point</span> and one as a <span style="color:var(--jewel)">crown jewel</span>, then connect them with moves.</div></div>`;return;}
  if(!a.chain.length){
    el.innerHTML=`<div class="kc-box"><b style="color:var(--p2)">No path found from any entry point to a crown jewel.</b><div class="note">Either the estate is genuinely segmented here (good — note it), or you haven't mapped the connecting moves yet. In unknown territory, assume the latter until proven.</div></div>`;return;}
  const nm=id=>nodes.find(n=>n.id===id);
  let html=`<div class="kc-box"><h2 style="color:var(--p0);margin-top:0">⛓ The kill chain<span class="hint">Lowest-effort path from foothold to existential impact. Total adversary effort: ${a.dist[a.best]}.</span></h2>`;
  a.chain.forEach((id,i)=>{
    const n=nm(id);
    html+=`<div class="kc-step"><div class="kc-node">
      <div class="n">${esc(n.name)} ${n.entry?'<span class="pill entry">entry</span>':''} ${n.jewel?'<span class="pill jewel">jewel</span>':''}</div>
      <div class="m">${TYPELBL[n.type]||n.type}${n.tier?' · '+n.tier:''}${n.note?' · '+esc(n.note):''}</div>
    </div></div>`;
    if(i<a.chainEdges.length){const e=a.chainEdges[i];
      html+=`<div class="kc-arrow">↓</div><div class="kc-mech">${esc(e.mech)||'move'} · effort ${e.w}</div>`;}
  });
  html+=`<div class="note" style="margin-top:10px">Every node on this path is a <b style="color:var(--p0)">P0</b>. Fix the chain first — break any single link and the existential path is severed. After the incident, ask: did this chain get <i>shorter</i>?</div></div>`;
  el.innerHTML=html;
 }
 function renderSummary(a){
  const counts={P0:0,P1:0,P2:0};
  nodes.forEach(n=>{counts[priority(n,a)]++;});
  const qc={crit:0,sev:0,std:0,dark:0,house:0};
  nodes.forEach(n=>qc[quantum(n,a)]++);
  document.getElementById('summary').innerHTML=`
    <div class="stat"><span>Nodes mapped</span><b>${nodes.length}</b></div>
    <div class="stat"><span>Attacker moves</span><b>${edges.length}</b></div>
    <div class="stat"><span>Entry points</span><b>${a.entryIds.length}</b></div>
    <div class="stat"><span>Crown jewels</span><b>${a.jewelIds.size}</b></div>
    <div class="stat"><span style="color:var(--p0)">Kill-chain length</span><b style="color:var(--p0)">${a.chain.length||'—'}</b></div>
    <div class="stat"><span style="color:var(--p0)">P0 nodes (on shortest chain)</span><b style="color:var(--p0)">${counts.P0}</b></div>
    <div class="stat"><span style="color:var(--p1)">P1 nodes (on a chain)</span><b style="color:var(--p1)">${counts.P1}</b></div>
    <div class="stat"><span style="color:var(--darkq)">Dark quanta (unsized)</span><b style="color:var(--darkq)">${qc.dark}</b></div>`;
 }
 function renderQuanta(a){
  const buckets={crit:[],sev:[],std:[],dark:[]};
  nodes.forEach(n=>{const q=quantum(n,a);if(buckets[q])buckets[q].push(n);});
  const order=['crit','sev','std','dark'];
  let html='';
  order.forEach(k=>{
    const list=buckets[k];if(!list.length)return;
    const m=QMETA[k];
    html+=`<div class="q ${m.cls}"><div class="qh"><span>${m.label}</span><span class="budget">${m.budget}</span></div>`;
    list.forEach(n=>{
      const action = k==='crit'?'Sever reachability / compensating control now'
        : k==='sev'?'Remediate in next change window, verify enforcement'
        : k==='std'?'Batch into sprint; this is where patch velocity fits'
        : 'Characterise: establish reachability & exploitability';
      html+=`<div class="qi"><div class="qn">${esc(n.name)}</div><div class="qd">${action}${n.note?' — '+esc(n.note):''}</div></div>`;
    });
    html+='</div>';
  });
  if(!html) html='<div class="empty">Quanta appear once nodes sit on a kill chain. Map entries, jewels, and the moves between.</div>';
  document.getElementById('quanta').innerHTML=html;
 }
 /* ---------------- import / export ---------------- */
 function exportJSON(){
  dl('kill-chain-assessment.json', JSON.stringify({nodes,edges,exported:new Date().toISOString()},null,2));
 }
 function importJSON(ev){
  const f=ev.target.files[0];if(!f)return;
  const r=new FileReader();
  r.onload=()=>{try{const s=JSON.parse(r.result);nodes=s.nodes||[];edges=s.edges||[];clearNodeForm();render();}catch(e){alert('Could not read that file.');}};
  r.readAsText(f); ev.target.value='';
 }
 function exportMD(){
  const a=analyse();const nm=id=>{const n=nodes.find(x=>x.id===id);return n?n.name:'?';};
  let md=`# Kill Chain Assessment\n\n_Generated ${new Date().toLocaleString()} · Brownhat / CQRE_\n\n`;
  md+=`## Summary\n\n- Nodes mapped: ${nodes.length}\n- Attacker moves: ${edges.length}\n- Entry points: ${a.entryIds.length}\n- Crown jewels: ${a.jewelIds.size}\n- Kill-chain length: ${a.chain.length||'—'}\n\n`;
  if(a.chain.length){
    md+=`## The kill chain (shortest existential path)\n\nLowest-effort path from foothold to existential impact (total adversary effort ${a.dist[a.best]}):\n\n\`\`\`\n`;
    a.chain.forEach((id,i)=>{md+=`${nm(id)}`;if(i<a.chainEdges.length)md+=`\n    → [${a.chainEdges[i].mech||'move'} · effort ${a.chainEdges[i].w}]\n`;});
    md+=`\n\`\`\`\n\nEvery node on this path is a **P0**. Break any single link to sever the existential path.\n\n`;
  } else {
    md+=`## The kill chain\n\nNo path from an entry point to a crown jewel was mapped. Either the estate is segmented here, or the connecting moves are not yet discovered.\n\n`;
  }
  // quanta
  const buckets={crit:[],sev:[],std:[],dark:[]};nodes.forEach(n=>{const q=quantum(n,a);if(buckets[q])buckets[q].push(n);});
  md+=`## Remediation quanta\n\n`;
  [['crit','Critical quantum — hours (compensating control, not the patch)'],
   ['sev','Severe quantum — days (one change window)'],
   ['std','Standard quantum — sprint (patch velocity fits here)'],
   ['dark','Dark quantum — unsized (route to discovery)']].forEach(([k,t])=>{
     if(!buckets[k].length)return;
     md+=`### ${t}\n\n`;
     buckets[k].forEach(n=>{md+=`- **${n.name}**${n.tier?` (${n.tier})`:''}${n.note?` — ${n.note}`:''} _(reach:${n.reach}, exploit:${n.expl}${n.comp?', compensated':''})_\n`;});
     md+=`\n`;
   });
  // findings table
  md+=`## All nodes by priority\n\n| Node | Layer | Tier | Priority | Quantum | Reach | Exploit |\n|---|---|---|---|---|---|---|\n`;
  const pri=n=>priority(n,a);
  nodes.slice().sort((x,y)=>({P0:0,P1:1,P2:2}[pri(x)]-{P0:0,P1:1,P2:2}[pri(y)])).forEach(n=>{
    md+=`| ${n.name} | ${TYPELBL[n.type]||n.type} | ${n.tier||'—'} | ${(a.onShortest.has(n.id)||a.onAnyChain.has(n.id))?pri(n):'off-chain'} | ${QMETA[quantum(n,a)].label} | ${n.reach} | ${n.expl} |\n`;
  });
  md+=`\n---\n\n_See Book VII — Vulnerability Management and the Quantum Vulnerability Management framework for how to size and drain these quanta._\n`;
  dl('kill-chain-assessment.md', md);
 }
 function dl(name,content){
  const b=new Blob([content],{type:'text/plain'});const u=URL.createObjectURL(b);
  const a=document.createElement('a');a.href=u;a.download=name;a.click();URL.revokeObjectURL(u);
 }
 /* ---------------- sample (repo: mid-market engagement) ---------------- */
 function loadSample(){
  if(nodes.length && !confirm('Replace current assessment with the sample engagement?'))return;
  nodes=[
    mk('Stale contractor credential','identity','',{entry:1,reach:'yes',expl:'yes',note:'Active 6 months after offboarding; no MFA'}),
    mk('Internet-facing VPN (legacy firmware)','entry','',{entry:1,reach:'yes',expl:'yes',note:'Cisco ASA, firmware 18mo stale, no MFA'}),
    mk('M365 / Entra ID','identity','T1',{reach:'yes',expl:'yes',note:'34% sign-ins without MFA; CA in report-only'}),
    mk('SharePoint / Teams / Exchange','data','T1',{reach:'yes',expl:'no',note:'All collaboration data + email'}),
    mk('Entra admin account','privilege','T0',{reach:'yes',expl:'yes',note:'Reachable via password spray'}),
    mk('Entra Connect sync account','privilege','T0',{reach:'yes',expl:'yes',note:'Has DCSync rights on-prem'}),
    mk('On-prem Active Directory','privilege','T0',{jewel:0,reach:'yes',expl:'yes',note:'KRBTGT never rotated (847d)'}),
    mk('SAP ERP','infra','T1',{jewel:1,reach:'unknown',expl:'unknown',note:'Financial + operational; default creds on secondary instance'}),
    mk('Backups (same segment as ERP)','recovery','T1',{jewel:1,reach:'yes',expl:'yes',comp:0,note:'Never restore-tested; reachable from estate'})
  ];
  const id=n=>nodes.find(x=>x.name.startsWith(n)).id;
  edges=[
    ed('Stale contractor','M365','Credential valid, no MFA',1),
    ed('Internet-facing VPN','On-prem','VPN auth → internal network',1),
    ed('M365','SharePoint','Token grants data access',1),
    ed('M365','Entra admin','Password spray → privilege escalation',2),
    ed('Entra admin','Entra Connect','Admin controls sync identity',2),
    ed('Entra Connect','On-prem','DCSync via sync-account rights',2),
    ed('On-prem','SAP ERP','Domain creds reused on ERP',3),
    ed('On-prem','Backups','Backups reachable from domain',1),
    ed('SAP ERP','Backups','Same network segment',1)
  ];
  function mk(name,type,tier,o){return Object.assign({id:uid(),name,type,tier,entry:!!o.entry,jewel:!!o.jewel,reach:o.reach||'unknown',expl:o.expl||'unknown',comp:!!o.comp,note:o.note||''},{});}
  function ed(a,b,mech,w){return {id:uid(),from:id(a),to:id(b),mech,w};}
  clearNodeForm();render();
 }
 /* ---------------- boot ---------------- */
 restore();
 if(!nodes.length) loadSample(); else render();
 </script>
 </body>
 </html>
Author	SHA1	Message	Date
tomas.kracmar	173704eca5	feat: Add vulnerability-management arc — Book VII, quantum framework, ORION, and kill-chain assessment tool	2026-06-15 07:56:50 +02:00
tomas.kracmar	633f82c5a7	feat: Add four consultant assignments (identity, CA, Intune, collaboration)	2026-06-09 16:56:48 +02:00
tomas.kracmar	7ff4fad953	feat: Add management overlay pattern (Nebula T0 / Tailscale T1) and cloud admin VM guidance	2026-06-09 14:40:34 +02:00
tomas.kracmar	5264f7b439	feat: Add Antifragile Handbook for M365 & AD (6 books + 2 field guides)	2026-06-09 11:48:11 +02:00
tomas.kracmar	3226e53f95	feat: Add engagement checklist, adversarial validation, and self-service cadence	2026-06-09 11:48:07 +02:00
tomas.kracmar	0d52474c30	chore: Add missing deliverable files (team guide, backlog, sample engagement)	2026-06-05 12:54:44 +02:00
Claude Sonnet 4.6	dc83336567	feat: Add assessment team guide for Brownhat Diagnostic execution New: assessment-templates/assessment-team-guide.md Pre-engagement: access checklist (M365, AD, docs); tool preparation with deployment times; what to do if access is not ready. Day 1 discipline: deploy ASTRAL and PULSAR before workshops start. Step-by-step ASTRAL and PULSAR deployment commands. Passive external scan in background. Microsoft Secure Score baseline. Workshop signals: table of client statements -> likely findings -> what to check on Day 2. Feeds technical assessment planning. Day 2-3 tool runs in sequence: 1. CAExporter (30 min) - CA policy reality check; report-only mode; exclusion groups defeating the purpose 2. BloodHound (1-2h) - 5 required queries; KRBTGT last set check; Domain Admins on workstations; service account attack paths 3. Elysium (2-4h) - privilege requirements noted; privacy model explanation; what to document 4. Purple Knight (30 min) - indicators to focus on; cross-reference with BloodHound 5. Entra ID manual checks (1h) - app registrations, guest accounts, MFA registration status, AD Connect sync account 6. Intune/endpoint check (30 min) - via ASTRAL output 7. External attack surface (30-60 min) - Nmap, Shodan, crt.sh 8. Firewall rule review (30-60 min) - what to look for 9. Backup spot check (30 min) - the 'green tick' test Kill chain synthesis: explicit step-by-step method for tracing from outside to organisational failure. Finding triage: kill chain test table; common priority inflation mistakes. Quick wins: 8-item checklist; three tests a quick win must pass. Report structure: 5 sections, target 15-25 pages, specific guidance per section including what makes a weak vs strong finding. ASERAL/PULSAR handover requirements before leaving site. 9 common assessment mistakes named explicitly. Post-assessment checklist: 10 items before submitting the report. index.md and assessment-templates/README.md updated. Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 10:42:18 +00:00
Claude Sonnet 4.6	097e93a431	feat: Add sample engagement for mid-market hybrid organisation New: playbooks/sample-engagement-mid-market.md Client profile: 500 employees, 10 admins, AD+M365 E3, Intune, 3rd party on-prem/cloud mix, NIS2 important entity, 3 offices, hybrid work, 80 external contractors. Fictional: Nexus Operations s.r.o. Sections: - Client profile and engagement context - Discovery call findings and disqualifier check - Brownhat Diagnostic: kill chain analysis, P0/P1/P2 findings table - 5 quick wins closeable before Day 30 - Module recommendation and rationale (Modules 2, 6, 1, 7) - Day 30/90/180 deliverables specific to this client - Findings backlog pre-populated (23 items, P0 all closed by Day 90) - NIS2 Article 21 compliance map with evidence per measure - Investment estimate (55-80 consultant days) - Consultant notes: CISO handover, NIS2 pressure, two-domain AD, SAP credentials scope, contractor offboarding process dependency index.md: Sample engagement added to playbooks table Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 10:26:20 +00:00
Claude Sonnet 4.6	10f9a9bded	fix: Correct ADO/M365 integration claim in findings backlog No native ADO -> Planner/To Do sync exists. Replace with accurate options: - Teams tab: pin ADO board into Teams channel (built-in, no setup) - Power Automate: available for notifications/Planner push but adds complexity; not recommended as default Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 10:14:20 +00:00
Claude Sonnet 4.6	486c092c32	feat: Add three concrete deployment options to findings backlog Replace vague 'live where client works' with three ordered options: Option 1 (default): ADO Work Items ASTRAL is already in ADO; Work Items are built in, zero additional tooling. Board setup guidance included. M365 Planner/To Do sync via ADO connector or Power Automate: non-technical owners see assigned findings in their daily task list without opening ADO. ASTRAL integration: link Work Items to drift PRs directly. Option 2 (upgrade): CISO Assistant For clients building toward formal GRC. Bridges backlog to risk register: findings promoted from operational backlog to documented risks with treatment plans and compliance evidence links. Docker Compose, self-hosted, 30 minutes to deploy. Option 3 (fallback): Git flat file For clients with technical capability and preference for minimal tooling. Template retained. Limitation noted: no notifications, no Planner sync - if the IT lead needs nudging, use ADO instead. Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 10:12:03 +00:00
Claude Sonnet 4.6	5c4e91179d	feat: Add findings backlog as pragmatic alternative to risk register New: assessment-templates/findings-backlog.md Design principles: lives where client works, every finding has an owner, feeds the housekeeping stream, accumulates from all sources. Format: 6-field minimal entry (ID, finding, source, priority, owner, status) with optional target date/effort/notes/closed date. P0/P1/P2 priority using kill chain test. Flat file template for Git-based clients. Population guide: Day 30 (from Brownhat), subsequent modules, continuous tools (ASTRAL drift, PULSAR alerts, Elysium, BloodHound). Monthly housekeeping cycle structure. Relationship to formal risk register explained. Backlog health indicators (warning signs it is not functioning). Wired into existing framework: move-fast-and-fix-things.md: Rule 4 now names the backlog as the queue rapid-modernisation-plan.md: Day 30 item 7 and Phase 1 action updated engagement-model.md: Section 4 deliverables table updated at all stages assessment-templates/README.md: Production-ready templates section added index.md: Findings Backlog added to Assessment and Tools table Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 10:09:08 +00:00
Claude Sonnet 4.6	6162bb474f	fix: Replace cloud AI cost rows in business case direct costs table Remove 'Cloud AI vendor price shock' (not a security risk; unverifiable number) and 'Competitive intelligence loss from AI training' (inaccurate claim that contradicts corrections made throughout the framework). Replace with: - Incident response and forensics (EUR 150-500K, real range) - Business interruption during recovery (client-specific daily revenue) All five rows now map directly to risks the programme addresses and are quantifiable in a CFO conversation. Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 09:59:12 +00:00
Claude Sonnet 4.6	3b69f255ec	feat: Add concrete milestone deliverables at Day 30/90/180 rapid-modernisation-plan.md: New 'Milestone Deliverables' section with 23 numbered, verifiable deliverables across three milestones. Day 30 (7 deliverables): Brownhat Diagnostic, ASTRAL deployed, PULSAR deployed, T0 accounts hardened, attack surface report, quick wins closed, stale account queue opened. Hard gate: if ASTRAL/PULSAR not deployed, the bottleneck is access provisioning not scope. Day 90 (9 more deliverables): MFA for all users enforced (not enrolled), legacy auth blocked, CA baseline, P0/P1 vulns closed, BloodHound before/ after, vendor access hardened, T0 backup verified, ASTRAL restore drill, PULSAR top 5 alert rules with runbooks. Day 180 (7 more deliverables): Alert runbooks, custom detection rules, client IT lead independence (live walkthrough), housekeeping 3 cycles, module completion packages, risk register closure evidence, retained scope. Each milestone includes the verifiable evidence column and a 'what this value stands alone' statement. Section closes with honest timeline modifiers (large AD, high user count, OT environments). business-case-template.md: The Ask updated to quote the three milestones explicitly. Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 09:54:49 +00:00
Claude Sonnet 4.6	878fca3f0b	feat: Rewrite rapid-modernisation-plan and business-case for realism rapid-modernisation-plan.md: - Add honest framing section: what 180 days delivers vs. what takes 2-3 years - Extend Phase 1 from 30 to 60 days; rename to Visibility - Remove dangerous 'disable all unknown accounts in week 1-2' instruction - Replace Phase 3 (AI Sovereignty) with Signal and Retained Capability - Phase 3 now: detection engineering, alert runbooks, knowledge transfer - Phase 4 made explicitly open-ended (not complete at day 180) - Fix success metrics: remove unverifiable targets, replace with honest ones - Remove 'compress Phases 1-2 into 30 days for small orgs' adaptation - Add 'What This Plan Is Not' practitioner section - ASTRAL and PULSAR integrated as Phase 1 deliverables - AI Sovereignty moved to multi-year parallel initiative business-case-template.md: - Break-even corrected: Day 90 -> 12-18 months post-programme - Phase budget table updated: 30/30/30/90 -> 60/60/60/ongoing - Phase names and deliverables aligned with revised RMP - AI sovereignty removed from core deliverables - Sensitivity analysis: 3 scenarios -> 4 including abort condition - Alternatives table: AI sovereignty removed from Antifragile programme description - ROI table: cloud AI cost line replaced with audit preparation time saving - The Ask: 30-day first gate -> 60-day first gate Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 09:47:25 +00:00
Claude Sonnet 4.6	3062e435ca	chore: Full consistency scan — AOC->PULSAR, fix training-data claims, fix 90% claim AOC -> PULSAR across 10 files (engagement-model, retained-capability, modular-engagements, blue-purple-team-foundation, about-cqre, about-cqre-cs, consultant-field-guide, ai-assisted-tvm, m365-e3-hardening, sovereign-tool-stack, risk-register-example). Training-data framing corrected in: - executive-summary.md: opening paragraph and risk table - README.md: 90% solution claim -> 30-60% in 180 days - modular-engagements.md: public API data use claim - cis-controls-mapping.md: data protection framing - antifragile-risk-register.md: risk entry softened to accurate framing - azure-openai-sovereignty-bridge.md: consumer vs enterprise API distinction Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 07:05:13 +00:00
Claude Sonnet 4.6	bcebf8ebb3	feat: Add critical infrastructure adaptation for Rule 5 (greenfield) move-fast-and-fix-things.md: 'The Critical Infrastructure Adaptation' section in Rule 5. OT/NT environments where full greenfield is impossible. Five-layer adapted stack: IT greenfield protects OT, OT config as code, manual operation as fallback, compartmentalisation as partial burn, long-cycle planned refresh. OT greenfield test with 4h/48h/2w targets. vertical-power-utilities.md: New 'The Controlled Burn Adaptation' section. Full treatment of when greenfield is not an option. Five-layer OT-adapted stack. Explicit acceptance statement framework for genuinely irreplaceable OT components (name, isolate, monitor, plan replacement). The OT greenfield test. Reference back to Rule 5. Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 06:58:07 +00:00
Claude Sonnet 4.6	a337af7ddf	feat: Add housekeeping stream and greenfield capability as Rules 4 and 5 move-fast-and-fix-things.md: Three Rules -> Five Rules. Rule 4: Housekeeping as a permanent stream (named owner, cadence, queue). Rule 5: Greenfield capability as standard operational activity every 5 years. Updated pillar mapping table. antifragile-manifest.md: Pillar 1 Antifragile Moves: greenfield capability as the ultimate expression of structural decoupling. Controlled burn framing. Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 06:53:31 +00:00
Claude Sonnet 4.6	6e86f0844e	fix: Correct speed claim and add infinite vulnerability surface section Speed Is a Security Control: Replace overconfident '90% solution today' with honest target: 30-60% in 180 days. Real comparison is progress vs. the 0% that stays when waiting for the perfect plan. New section 'When the Vulnerability Surface Is Effectively Infinite': AI-scale vulnerability discovery (e.g. Project Glasswing) does not call for AI-assisted patching. It calls for architecture that makes most vulnerabilities matter less: kill chain prioritisation, blast radius limitation, assume-breach posture, known-good baseline. Architecture beats velocity in the vulnerability race. Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 06:44:32 +00:00
Claude Sonnet 4.6	46a1f7e005	feat: Add AI Mythos counter-narrative; rewrite ai-sovereignty-framework move-fast-and-fix-things.md: 'The AI Distraction' section. Multiplier principle, CIS IG1 sequencing, client redirect script. antifragile-manifest.md: Pillar sequencing note (Pillar 4 after 1-3). consultant-field-guide.md: Mistake #11 + AOC->PULSAR rename. ai-sovereignty-framework.md: Full rewrite with regulatory framing, sovereignty spectrum, updated objections, CQRE product examples. Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 05:19:21 +00:00
Claude Sonnet 4.6	48f891db36	feat: Fix review issues and integrate ASTRAL, PULSAR, AURORA product suite Framework fixes: - antifragile-manifest.md: Correct AI Sovereignty pillar (data residency/audit rights framing); add consultant note - executive-summary.md: Same AI sovereignty correction; add EU Regulatory Context (NIS2, DORA, GDPR) - README.md: Add Brownhat brand explanation; expand Standards Alignment with NIS2/DORA/GDPR - core/about-cqre.md: Prominent TEMPLATE WARNING banner to prevent accidental sharing - index.md: Add CQRE Product Suite; renumber consultant nav 1-26 consistently New: playbooks/cqre-product-suite.md - ASTRAL/PULSAR/AURORA product reference with antifragile pillar alignment, regulatory mapping, deployment prerequisites, and objection handling Updated: sovereign-tool-stack.md - ASTRAL updated to GitHub product spec; AOC replaced with PULSAR; AURORA section added Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>	2026-06-05 04:59:20 +00:00