Compare commits
5 Commits
0d52474c30
...
main
| Author | SHA1 | Date | |
|---|---|---|---|
| 173704eca5 | |||
| 633f82c5a7 | |||
| 7ff4fad953 | |||
| 5264f7b439 | |||
| 3226e53f95 |
@@ -34,11 +34,13 @@ Most security and resilience frameworks optimize for **robustness**—the abilit
|
||||
│ ├── executive-summary.md # One-page board brief
|
||||
│ ├── executive-summary-cs.md # Czech version of board brief (Výkonné shrnutí)
|
||||
│ ├── c-suite-conversation-guide.md # Persuasion scripts for top management
|
||||
│ └── t0-asset-framework.md # Tier 0 asset classification and protection
|
||||
│ ├── t0-asset-framework.md # Tier 0 asset classification and protection
|
||||
│ └── quantum-vulnerability-management.md # Time-budgeted quanta model for the exploitation-first era (Book VII companion)
|
||||
├── playbooks/ # Executable modernisation and response plans
|
||||
│ ├── rapid-modernisation-plan.md # 30-60-90-180 day transformation roadmap
|
||||
│ ├── endpoint-management-entry-vector.md # Intune/device management as engagement entry point
|
||||
│ ├── ai-assisted-tvm.md # AI-powered vulnerability management blueprint
|
||||
│ ├── kill-chain-assessment-app.md # Spec for the offline kill-chain mapping tool (tools/kill-chain-assessment.html)
|
||||
│ ├── zero-budget-vulnerability-discovery.md # Script-based vuln discovery without commercial scanners
|
||||
│ ├── perimeter-scanning-capability.md # External attack surface scanning strategy
|
||||
│ ├── osquery-custom-platform.md # Build a sovereign vuln/asset discovery platform on osquery
|
||||
@@ -66,6 +68,10 @@ Most security and resilience frameworks optimize for **robustness**—the abilit
|
||||
│ ├── vertical-power-utilities.md # Power generation, transmission, water utilities
|
||||
│ ├── vertical-telco.md # Telecommunications and mobile operators
|
||||
│ └── vertical-banking.md # Financial services regulatory alignment
|
||||
├── tools/ # Standalone runnable instruments (offline, single-file)
|
||||
│ ├── README.md # Tool index and design constraints
|
||||
│ └── kill-chain-assessment.html # Maps unknown estates → shortest existential path → quanta
|
||||
├── books/ # The Antifragile Handbook (Books I–VII + field guides)
|
||||
└── assets/ # Diagrams, visuals, and presentation materials
|
||||
```
|
||||
|
||||
|
||||
@@ -8,6 +8,9 @@ This directory contains diagnostic tools, maturity models, and assessment resour
|
||||
|
||||
| Template | Purpose |
|
||||
|----------|---------|
|
||||
| [Engagement Checklist](engagement-checklist.md) | **Point-in-time, regularly updated.** Controls to inspect on every M365+AD engagement, organized by domain. Not scored — a structured inspection list. Review January 2027. |
|
||||
| [Adversarial Validation Checklist](adversarial-validation-checklist.md) | **Phase 2 — mature estates.** Every item is a test, not an inspection. Opening/closing metrics, eight detection simulations, CA ghost policy tests, attack path verification. Review January 2027. |
|
||||
| [Self-Service Cadence](self-service-cadence.md) | **Client leave-behind.** Monthly portal checks and quarterly tool runs (PingCastle, Purple Knight, CAExporter, PowerShell scripts) an admin can run between engagements. Includes "call us" triggers. Customise per client before handing over. |
|
||||
| [Assessment Team Guide](assessment-team-guide.md) | Technical execution guide for the Brownhat Diagnostic: tool sequence (ASTRAL, PULSAR, BloodHound, Elysium, Purple Knight, CAExporter), what to look for, kill chain synthesis, report structure, common mistakes. |
|
||||
| [Findings Backlog](findings-backlog.md) | Single source of truth for all findings across every module and diagnostic. The input queue for the housekeeping stream. Pragmatic alternative to a formal risk register for organisations that do not have one. |
|
||||
| [NIST CSF 2.0 Baseline Assessment](nist-csf-baseline.md) | The Brownhat Diagnostic: structured 2-half-day workshop, gap analysis, kill chain identification |
|
||||
|
||||
@@ -0,0 +1,319 @@
|
||||
# Adversarial Validation Checklist
|
||||
|
||||
> *For clients who have done the foundational work. Everything here is tested, not inspected.*
|
||||
|
||||
**Last updated:** June 2026
|
||||
**Engagement type:** Phase 2 — mature estates
|
||||
**Field guide:** [Adversarial Validation Field Guide](../books/field-guide-adversarial-validation.md)
|
||||
**Next review:** January 2027
|
||||
|
||||
---
|
||||
|
||||
## How to use this
|
||||
|
||||
This checklist assumes the foundational controls are in place. The question is not "does this control exist" — it is "does this control work." Every item is a test. If an item cannot be tested in the current engagement window, mark it as untested and note it as a finding: **an untested control is a broken control, you simply do not know it yet.**
|
||||
|
||||
Before any test: confirm written authorization. Before the first test: capture baseline metrics (BloodHound path count, Entra role assignment export, CA policy JSON export). After the engagement: record the "after" metrics.
|
||||
|
||||
**Notation:**
|
||||
`[VERIFY]` — confirm the claim against observed behavior
|
||||
`[SIMULATE]` — run the attack or failure scenario, authorized and controlled
|
||||
`[MEASURE]` — produce a number; the number is the finding, not pass/fail
|
||||
|
||||
---
|
||||
|
||||
## Opening metrics (capture before first test)
|
||||
|
||||
- `[MEASURE]` BloodHound paths to Domain Admin (all paths; then filtered to paths reachable from standard user compromise)
|
||||
- `[MEASURE]` Count of active (non-eligible) Global Admin assignments excluding break-glass
|
||||
- `[MEASURE]` Count of active (non-eligible) Domain Admin assignments
|
||||
- `[MEASURE]` Service principals with escalation-grade Graph permissions (application permissions)
|
||||
- `[MEASURE]` CA policies verified to enforce (by prior observation) vs. total CA policies in scope
|
||||
- `[MEASURE]` Distinct device IDs in sign-in logs (last 30 days) vs. Intune enrolled device count
|
||||
- `[MEASURE]` Alert volume per day (last 30 days) vs. alerts with documented human response
|
||||
- `[MEASURE]` Structural changes produced by the last five closed security incidents or alerts
|
||||
- `[MEASURE]` Anonymous link count across SharePoint/OneDrive (existing, regardless of current tenant setting)
|
||||
- `[MEASURE]` Backup MTTR from last documented restore (if any; if none, record "never tested")
|
||||
|
||||
---
|
||||
|
||||
## Section 1 — Identity: the wall
|
||||
|
||||
### 1.1 Firebreak integrity
|
||||
|
||||
- `[VERIFY]` Pull all Global Admin members and check `onPremisesSyncEnabled` for each. Any `true` value is a P0. "We moved them to cloud-only" is the claim; this is the verification.
|
||||
- `[VERIFY]` Trace every path from a simulated on-prem compromise (sync server connector account) to a cloud privileged role. Draw the graph. Each path is a hole in the wall.
|
||||
- `[VERIFY]` For each cloud admin: what MFA device are they using, and is that device also used for email and browsing? A Tier 2 device authenticating a Tier 0 role is a tier violation through the MFA layer.
|
||||
- `[VERIFY]` Does any admin's MFA authenticator app depend on a phone number or device that is outside the client's MDM? (MFA backup codes stored in iCloud are a personal device dependency for a privileged role.)
|
||||
|
||||
### 1.2 Break-glass: real test
|
||||
|
||||
- `[SIMULATE]` Sign in to the break-glass Global Admin account.
|
||||
- `[MEASURE]` Time from sign-in to alert received by named responder.
|
||||
- `[VERIFY]` Alert reaches the named responder (not just fires into a queue). Responder acknowledges.
|
||||
- `[VERIFY]` Break-glass sign-in works with zero on-prem dependency (test while sync is stopped, or while on a network with no DC visibility).
|
||||
- `[VERIFY]` Break-glass credentials can be retrieved from their storage location without the systems they are recovering (test retrieval physically or procedurally).
|
||||
|
||||
### 1.3 PIM enforcement
|
||||
|
||||
- `[VERIFY]` For Global Administrator role PIM settings: what is the MFA method required on activation? Confirm it is phishing-resistant (FIDO2 or certificate). Push-approve is a finding.
|
||||
- `[SIMULATE]` Activate an eligible GA role from a personal device or a non-compliant device. Is it blocked by a CA policy scoped to role activation?
|
||||
- `[SIMULATE]` Request activation requiring approval. Does the approval notification reach the approver with meaningful context (what role, for whom, what justification)? Does the approver act within SLA?
|
||||
- `[MEASURE]` Maximum activation time box for GA and Privileged Role Admin. Record in hours. 24-hour window = functionally standing privilege during business hours.
|
||||
- `[VERIFY]` Are there any GA assignments that are active (permanent) and are not break-glass accounts? Pull the list; any result is a PIM compliance gap from configuration drift.
|
||||
|
||||
### 1.4 AD FS (if still running)
|
||||
|
||||
- `[MEASURE]` Token-signing certificate age in days since last rotation.
|
||||
- `[SIMULATE]` Golden SAML tabletop: if the private key were obtained, what alert (if any) would fire? Walk through the detection path. Document what is visible and what is not.
|
||||
- `[VERIFY]` Is there a signed migration plan with a named date? If not, document as P0 finding — migration tooling is mature; absence of a plan is a decision, not a default.
|
||||
|
||||
### 1.5 Connector account monitoring
|
||||
|
||||
- `[SIMULATE]` Authenticate as the Entra connector account (Directory Synchronization Accounts) from a host other than the sync server. Does an alert fire?
|
||||
- `[MEASURE]` Time from test authentication to alert receipt.
|
||||
- `[VERIFY]` If no alert fires: the most DCSync-capable account in the estate is unmonitored. Document as P0.
|
||||
|
||||
### 1.6 Seamless SSO / AZUREADSSOACC
|
||||
|
||||
- `[VERIFY]` `Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet` — compare to approximate tenant go-live date. If matching: never rotated.
|
||||
- `[VERIFY]` If Seamless SSO is not needed for the current device estate (Entra-joined devices on modern auth): document removal as a quick win.
|
||||
|
||||
---
|
||||
|
||||
## Section 2 — Privilege: attack paths
|
||||
|
||||
### 2.1 BloodHound / attack path analysis
|
||||
|
||||
- `[MEASURE]` Total BloodHound paths to Domain Admin.
|
||||
- `[MEASURE]` Shortest path (fewest hops) to Domain Admin from a standard user account. Enumerate the specific path.
|
||||
- `[MEASURE]` Number of paths involving Kerberoastable service accounts.
|
||||
- `[MEASURE]` Number of paths involving ADCS templates (add ACL collection to BloodHound run).
|
||||
- `[VERIFY]` Has anyone on the client team reviewed BloodHound output in the last 90 days? If not, the path count from the last review is the stale baseline, not the current state.
|
||||
|
||||
### 2.2 Kerberoasting: attack and detection
|
||||
|
||||
- `[SIMULATE]` Run Invoke-Kerberoast or Rubeus kerberoast (authorized, test account as origin).
|
||||
- `[VERIFY]` Did Defender for Identity, Sentinel, or any SIEM alert on the TGS request pattern?
|
||||
- `[MEASURE]` Time from attack to alert receipt (if alert fires).
|
||||
- `[SIMULATE]` Attempt to crack the harvested hashes offline. Record which accounts crack and approximate crack time.
|
||||
- Finding: accounts that crack quickly + no detection = P0 on both the account and the detection gap.
|
||||
|
||||
### 2.3 ADCS
|
||||
|
||||
- `[VERIFY]` Run `certipy find` or `Certify.exe find /vulnerable` against the CA. Document any ESC findings.
|
||||
- `[VERIFY]` Is the ADCS server on a dedicated Tier 0 or hardened host, or on a standard server? Check who has local admin access.
|
||||
- `[VERIFY]` Are there published certificate templates with "Supply subject in request" and enrollment permissions broader than the intended service? (ESC1 pattern)
|
||||
- `[SIMULATE]` If ESC1 is found: demonstrate the exploit path (in authorized test context — enroll a cert for a test admin account using the vulnerable template). Show the client the domain admin cert in hand.
|
||||
|
||||
### 2.4 Service principal dark matter
|
||||
|
||||
- `[VERIFY]` For each service principal with escalation-grade application permissions: ask the room to identify the current owner and current use case. Document every "I don't know."
|
||||
- `[VERIFY]` For each: check `lastSignInDateTime` for the service principal. Unused principal + dangerous permissions + non-expiring secret = standing credential that can be activated any time.
|
||||
- `[VERIFY]` Are there app registrations with admin consent granted for `Mail.Read`, `Files.ReadWrite.All`, or equivalent — where the granting user or admin is no longer at the organization?
|
||||
- `[SIMULATE]` Attempt to use a service principal with dangerous Graph permissions to escalate: assign a role, add an app role assignment, or read all users. Confirm the permission is real and enforced (not just declared).
|
||||
|
||||
### 2.5 Standing privilege beyond PIM
|
||||
|
||||
- `[VERIFY]` Pull active (not eligible) role assignments for GA, PRA, Security Admin, Exchange Admin. Any active assignment not in the break-glass inventory is a drift finding.
|
||||
- `[VERIFY]` Pull Domain Admins and Enterprise Admins. Count them. Ask the client how many they believe exist. Present the actual count. In most estates, the actual count exceeds the belief.
|
||||
- `[VERIFY]` Are there administrator accounts with no associated human — service accounts running with Domain Admin because "it was easier at the time"?
|
||||
|
||||
### 2.6 Local privilege on endpoints
|
||||
|
||||
- `[VERIFY]` Pull local Administrators group membership across a sample of endpoints (10+ devices). Are there accounts beyond the expected (LAPS-managed local admin, Entra-joined device admin, EPM)?
|
||||
- `[VERIFY]` Is Windows LAPS deployed and confirmed working? Retrieve a LAPS password for a test device through Intune or the AD attribute. Confirm rotation has occurred (password age < 30 days or per policy).
|
||||
- `[VERIFY]` If EPM is deployed: test an elevation request for a controlled binary. Is it logged? Is the log reviewed by anyone?
|
||||
|
||||
---
|
||||
|
||||
## Section 3 — Devices: compliance signal gap
|
||||
|
||||
### 3.1 CA policy enforcement (test each separately)
|
||||
|
||||
For each CA policy in scope, write the expected outcome before looking at the configuration. Then test:
|
||||
|
||||
- `[SIMULATE]` **Legacy auth block:** Authenticate using Basic Auth from a test account (Exchange ActiveSync, SMTP auth, or equivalent). Expected: blocked. Result: ___
|
||||
- `[SIMULATE]` **Compliant device gate:** Sign in from a known non-compliant device (personal device, or a managed device taken out of compliance). Expected: blocked from sensitive workloads. Result: ___
|
||||
- `[SIMULATE]` **Admin sign-in location gate:** Attempt a PIM role activation from a device outside the named compliant/PAW scope. Expected: blocked. Result: ___
|
||||
- `[SIMULATE]` **MFA enforcement:** Sign in as a test user from a new device with no registered session. Expected: MFA challenged. Confirm the MFA method that fires (push-approve vs. FIDO2). Result: ___
|
||||
- `[VERIFY]` For any policy that fails to enforce despite correct displayed configuration: recreate from scratch, re-test. Document if ghost policy confirmed.
|
||||
- `[VERIFY]` Are there CA policies in report-only mode that should be enabled? Report-only is a test state, not a permanent posture.
|
||||
- `[VERIFY]` Break-glass accounts excluded from blocking policies — test the break-glass sign-in path specifically under the conditions a blocking policy would normally fire.
|
||||
|
||||
### 3.2 Compliance signal quality
|
||||
|
||||
- `[SIMULATE]` Induce a non-compliant state on a test managed device. Record the timestamp.
|
||||
- `[MEASURE]` Time from non-compliance induction to Intune state update.
|
||||
- `[MEASURE]` Time from non-compliance induction to CA token revocation / session block.
|
||||
- `[VERIFY]` Is CAE (Continuous Access Evaluation) active for critical workloads? If yes, measure revocation time for a CAE-supported app vs. a non-CAE app. Present the gap.
|
||||
- `[SIMULATE]` Root / jailbreak a test device. Does the jailbreak detection in the compliance policy trigger? How long?
|
||||
|
||||
### 3.3 Fleet reality check
|
||||
|
||||
- `[MEASURE]` Distinct device IDs in sign-in logs (last 30 days).
|
||||
- `[MEASURE]` Intune enrolled device count.
|
||||
- `[MEASURE]` Devices in sign-in logs with device compliance state "non-compliant" or "unknown."
|
||||
- `[VERIFY]` Are there legacy-auth sign-ins in the logs that bypass device compliance evaluation entirely? Filter by Client App = non-modern entries. Each entry is a device control bypass.
|
||||
- `[VERIFY]` Pick 5 devices from the sign-in log that are not in Intune. What data do they have access to? What CA policy, if any, applies to them?
|
||||
|
||||
### 3.4 Update rings and rollback
|
||||
|
||||
- `[VERIFY]` Are update rings configured with a named pilot group and a broad group with deferral?
|
||||
- `[VERIFY]` Is there a named person with the process to halt a broad ring update push? Do they know the procedure? Have they tested it?
|
||||
- `[SIMULATE]` (If authorized and non-disruptive) Push a test configuration change to the pilot ring only. Confirm it stays in the pilot ring and does not propagate to broad without explicit promotion.
|
||||
|
||||
### 3.5 MAM boundary (per platform)
|
||||
|
||||
- `[SIMULATE]` On iOS: copy text from managed Outlook to an unmanaged app. Blocked or not?
|
||||
- `[SIMULATE]` On Android: same test. (Do separately — behavior is not symmetric.)
|
||||
- `[SIMULATE]` On iOS: "Open in" from a managed email attachment to Files app or an unmanaged viewer.
|
||||
- `[SIMULATE]` On either platform: save to local storage or backup to iCloud/Google Drive.
|
||||
- `[VERIFY]` For any gap found: confirm it reproduces after device reset. If it does, escalate to vendor. If it does not, investigate configuration.
|
||||
|
||||
---
|
||||
|
||||
## Section 4 — Data: does protection travel
|
||||
|
||||
### 4.1 Label encryption in the wild
|
||||
|
||||
- `[SIMULATE]` Forward a Highly Confidential test document to an external test email address. Open it from a mail client with no tenant authentication. Does encryption prevent access?
|
||||
- `[SIMULATE]` Download the same document to an unmanaged device. Does encryption require re-authentication to the tenant?
|
||||
- `[SIMULATE]` Share the document via an anonymous link. Access from an unauthenticated browser. Does it open?
|
||||
- `[SIMULATE]` Copy/paste content from the document on a managed device under a MAM policy. Is it blocked?
|
||||
- `[VERIFY]` For any path where the document opens without authentication: this is an exfiltration route. Document the specific path, the expected control that should have blocked it, and the observed result.
|
||||
|
||||
### 4.2 DLP enforcement
|
||||
|
||||
- `[SIMULATE]` Send an email from a test account containing content matching a high-value DLP rule (credit card number pattern, national ID format, or the client's custom regex for crown-jewel content). Does DLP intercept it? What action fires (block, override, audit-only)?
|
||||
- `[SIMULATE]` Upload the same content to a personal OneDrive or cloud storage from a managed device. Does DLP fire?
|
||||
- `[VERIFY]` For DLP rules that fire in audit-only mode: what happens to the audit events? Are they reviewed? By whom? How often?
|
||||
- `[VERIFY]` What is the false positive rate for high-sensitivity DLP rules? High false positive rates mean users have learned to override; the rule is not a control.
|
||||
|
||||
### 4.3 Anonymous links (existing population)
|
||||
|
||||
- `[MEASURE]` Full count of anonymous links across the tenant. (Not the current sharing setting — the existing links that predate any restriction.)
|
||||
- `[VERIFY]` Confirm at least one existing anonymous link resolves from an unauthenticated browser. It does — almost certainly. This proves the declared sharing restriction is forward-looking, not retroactive.
|
||||
- `[VERIFY]` Can the client produce the anonymous link list and revoke all entries in under 30 minutes? Test the revocation capability, not just the list.
|
||||
|
||||
### 4.4 Email exfiltration paths
|
||||
|
||||
- `[SIMULATE]` Create a test Inbox rule on a test account forwarding to an external test address. Does anything alert? When?
|
||||
- `[VERIFY]` `Get-RemoteDomain Default | Select-Object AutoForwardEnabled` — if False, test whether the Inbox rule still forwards. Document the result (transport-level and client-rule forwarding behave differently).
|
||||
- `[VERIFY]` `Get-TransportRule` for any rules with external redirect or blind copy. For each: who created it, when, and is there a documented owner?
|
||||
- `[MEASURE]` Time from Inbox rule creation to detection alert (if any).
|
||||
|
||||
### 4.5 Guest access and reshare chain
|
||||
|
||||
- `[MEASURE]` Total guest count. Guests not signed in for 90+ days. Ratio of stale to active.
|
||||
- `[VERIFY]` Do guests have access beyond their original project scope? Pick 5 random active guests and enumerate their group and site memberships.
|
||||
- `[SIMULATE]` Share a test document to a test external guest. Have the guest reshare to a second external test account. Can the client observe the second hop? Can they revoke it?
|
||||
- `[VERIFY]` Are access reviews running for guests? What is the default action on reviewer non-response?
|
||||
|
||||
### 4.6 Audit log forensics readiness
|
||||
|
||||
- `[VERIFY]` Confirm audit logging is enabled (Purview > Audit — look for the "Start recording" banner; if it appears, logging is off).
|
||||
- `[SIMULATE]` Run a forensic reconstruction: given a specific test user account, reconstruct everywhere they accessed data in the last 7 days. Can you produce a coherent picture from the audit log alone?
|
||||
- `[MEASURE]` How far back does the audit log extend for the current licensing tier? Test by querying for a known event at the boundary date.
|
||||
- `[VERIFY]` Are admin operations (CA policy changes, role assignments, app consent grants) present in the audit log? Run a query for admin events from the last 30 days and spot-check for completeness.
|
||||
|
||||
---
|
||||
|
||||
## Section 5 — Detection: the eight simulations
|
||||
|
||||
For each simulation: run it, record whether the alert fired, record the time from event to human acknowledgment, and record whether the responder acted. The SLA comparison is the finding.
|
||||
|
||||
| Simulation | Alert fires? | Time to human | Action taken | Finding |
|
||||
|---|---|---|---|---|
|
||||
| Break-glass sign-in | | | | |
|
||||
| New Global Admin assigned | | | | |
|
||||
| DCSync from non-DC host | | | | |
|
||||
| Kerberoasting (TGS pattern) | | | | |
|
||||
| Impossible travel (admin account) | | | | |
|
||||
| External auto-forward rule created | | | | |
|
||||
| Mass download from SharePoint | | | | |
|
||||
| OAuth consent grant (sensitive scope) | | | | |
|
||||
|
||||
### 5.1 Alert queue health
|
||||
|
||||
- `[MEASURE]` Alert volume per day (last 30 days).
|
||||
- `[MEASURE]` Alerts with documented human response.
|
||||
- `[MEASURE]` Alerts suppressed or auto-closed without human review.
|
||||
- `[MEASURE]` Alerts open for more than 48 hours.
|
||||
- `[VERIFY]` For every alert category: is there a named owner? An alert category with no named owner is an unread alert category.
|
||||
- `[VERIFY]` Pick 5 alerts from the last 30 days that were closed. For each: what action was taken, and what structural change resulted?
|
||||
|
||||
### 5.2 The feedback loop test
|
||||
|
||||
- `[MEASURE]` Last 5 closed security incidents: structural changes produced (count removals, access reductions, severed couplings — not reminders, training, or "noted in risk register").
|
||||
- `[VERIFY]` Is there a post-incident process that explicitly asks: "what structural thing changes as a result of this?"
|
||||
- `[VERIFY]` Is the post-incident process blameless on people (encouraging surfacing) and ruthless on structure (demanding a removal or change)?
|
||||
|
||||
---
|
||||
|
||||
## Section 6 — Recovery
|
||||
|
||||
### 6.1 Backup: restore something
|
||||
|
||||
- `[SIMULATE]` Restore a mailbox (or a mailbox item set) from the third-party backup. Time the operation.
|
||||
- `[MEASURE]` Actual MTTR from test restore vs. policy-declared RTO.
|
||||
- `[VERIFY]` If the actual MTTR exceeds the policy RTO: the policy is a fiction. Document the observed time as the operative figure.
|
||||
- `[VERIFY]` Are backups isolated from the estate they protect? Can a Global Admin delete the backup copies?
|
||||
- `[VERIFY]` Is there a third-party M365 backup at all? If not: M365 native recycle bin + version history is the only recovery mechanism, and this is a P0 for any organization with business-critical M365 data.
|
||||
|
||||
### 6.2 AD forest recovery
|
||||
|
||||
- `[VERIFY]` Does a written AD forest recovery runbook exist?
|
||||
- `[VERIFY]` Is it stored where it can be retrieved when AD is down? (Not SharePoint. Not AD-authenticated storage.)
|
||||
- `[VERIFY]` Has anyone on the team run the procedure — not a tabletop, an actual restore, even in a lab?
|
||||
- `[VERIFY]` Does the runbook include: DC restore sequence, metadata cleanup, double KRBTGT rotation, trust resets?
|
||||
- Finding if all above are no: the first time AD forest recovery is performed will be during the real disaster. Document as a rehearsal scope item.
|
||||
|
||||
### 6.3 Configuration known-good
|
||||
|
||||
- `[VERIFY]` Export current CA policies to JSON. Diff against the opening-of-engagement export. For every difference: is there a change record?
|
||||
- `[VERIFY]` Are there CA policies that changed since the last documented review without a corresponding change order?
|
||||
- `[VERIFY]` If a CA policy was silently modified (intentionally or not), what mechanism would have detected it and when?
|
||||
|
||||
### 6.4 Break-glass independence
|
||||
|
||||
- `[VERIFY]` Cloud admin recovery path works with no on-prem dependency — confirm by testing while sync is stopped or from a network with no DC visibility.
|
||||
- `[VERIFY]` If the primary MFA infrastructure (Microsoft Authenticator, FIDO2 key) is unavailable, is there a recovery path for privileged access that does not itself require privileged access?
|
||||
|
||||
---
|
||||
|
||||
## Closing metrics (capture after engagement)
|
||||
|
||||
| Metric | Before | After | Delta |
|
||||
|--------|--------|-------|-------|
|
||||
| BloodHound paths to DA (from standard user) | | | |
|
||||
| Active (non-break-glass) Global Admin assignments | | | |
|
||||
| Active (non-break-glass) Domain Admin assignments | | | |
|
||||
| CA policies verified by observation (working) | | | |
|
||||
| Detection signals tested end-to-end (working) | | | |
|
||||
| Anonymous link count | | | |
|
||||
| Unmanaged device sign-in % of total | | | |
|
||||
| Actual backup MTTR (minutes) | | | |
|
||||
| Structural changes from last 5 incidents (before) | | | |
|
||||
| Structural changes produced this engagement | | | |
|
||||
|
||||
---
|
||||
|
||||
## Engagement close verification
|
||||
|
||||
Before marking the engagement complete:
|
||||
|
||||
- Every finding that was verified by observation has a structural change attached (not a risk register entry — a change).
|
||||
- The closing metrics have been calculated and compared to the opening metrics.
|
||||
- The break-glass has been tested and works.
|
||||
- At least one backup restore has been timed and the MTTR recorded.
|
||||
- At least one CA policy has been verified to enforce by a real sign-in with pre-written expected outcomes.
|
||||
- At least one detection signal has been tested end-to-end to a human responder.
|
||||
- The configuration-as-code export (CA policies, role assignments) has been stored and the client has it.
|
||||
- A named date exists for the next adversarial validation cycle.
|
||||
|
||||
The engagement is not complete when the list is walked. It is complete when every finding from observation has become a structural change or a named, dated, owned commitment.
|
||||
|
||||
---
|
||||
|
||||
*Adversarial Validation Checklist. Updated June 2026. Review alongside the field guide — January 2027.*
|
||||
@@ -0,0 +1,389 @@
|
||||
# M365 + AD Engagement Checklist
|
||||
|
||||
> *Not a benchmark. Not scored. A structured inspection list for consultants on active engagements.*
|
||||
|
||||
**Last updated:** June 2026
|
||||
**Companion to:** [Field Guide 2026](../books/field-guide-2026.md) · [Books I–VI](../books/)
|
||||
**Next review:** January 2027
|
||||
|
||||
---
|
||||
|
||||
## How to use this
|
||||
|
||||
Work through the relevant sections during the Brownhat Diagnostic or at the start of a module engagement. Each item is a control area — something to inspect and a question to answer honestly. Mark items that surface findings. Mark items that are verified clean. If an item is not applicable, note why.
|
||||
|
||||
This is not a scoring tool. "Found" and "clean" are the only states that matter. A clean item with no evidence of testing is the same as not checked.
|
||||
|
||||
**Notation used below:**
|
||||
- `[LOOK AT]` — inspect and document current state
|
||||
- `[TEST]` — verify by observation, not by reading the config
|
||||
- `[ASK]` — a question that requires a conversation, not just a portal check
|
||||
|
||||
Nothing here replaces the governing question from Book I:
|
||||
> **If this is owned tonight, what is the largest thing an attacker reaches before hitting a wall — and can I draw that wall?**
|
||||
|
||||
---
|
||||
|
||||
## Section A — Hybrid Identity
|
||||
|
||||
### A1. Authentication Method
|
||||
|
||||
- `[LOOK AT]` Which authentication method is actually in use: PHS, PTA, or Federation (AD FS)?
|
||||
- `[LOOK AT]` Does the method shown in the Entra portal match what is documented and what IT staff believe to be true?
|
||||
- `[TEST]` If on-prem AD is simulated as unavailable (pull the sync server), does cloud authentication survive? Which auth method does this actually prove?
|
||||
- `[LOOK AT]` Is PHS running alongside PTA as a failover? (Optionality — cheap insurance)
|
||||
- `[LOOK AT]` If on PTA: how many PTA agents are deployed, and what host/network tier are they on?
|
||||
|
||||
### A2. Sync Engine (Entra Connect / Cloud Sync)
|
||||
|
||||
- `[LOOK AT]` Which sync engine is running: Entra Connect Sync or Entra Cloud Sync?
|
||||
- `[LOOK AT]` What server hosts the sync engine, and what domain/tier is it joined to?
|
||||
- `[LOOK AT]` What account runs the on-prem connector service, and does it have `Replicate Directory Changes All` (DCSync capability)?
|
||||
- `[LOOK AT]` What is the patch / update level of the sync server (OS and sync software)?
|
||||
- `[LOOK AT]` Who has local administrator rights on the sync server?
|
||||
- `[LOOK AT]` What does the Entra connector account (Directory Synchronization Accounts role) have permission to do in the cloud?
|
||||
- `[TEST]` If the connector account is monitored: does an alert fire when it authenticates from an unexpected host?
|
||||
- `[LOOK AT]` Are there active alerts or errors in the sync engine health dashboard?
|
||||
|
||||
### A3. AD FS
|
||||
|
||||
- `[LOOK AT]` Is AD FS deployed and active?
|
||||
- `[ASK]` If yes: why is it still running? What relying party trusts require it, and is there a migration plan?
|
||||
- `[LOOK AT]` When was the token-signing certificate last rotated? Where is the private key stored?
|
||||
- `[LOOK AT]` Is the rollover certificate about to expire?
|
||||
- `[LOOK AT]` Which servers host AD FS, and what network tier and patching cadence do they have?
|
||||
- `[TEST]` Golden SAML tabletop: if the token-signing key were obtained, what would detection see, and how fast could the cert be rotated? Is the procedure written and tested?
|
||||
- `[ASK]` Is there a Entra staged rollout in progress or planned to migrate away from federation?
|
||||
|
||||
### A4. Privileged Account Sync
|
||||
|
||||
- `[LOOK AT]` Are any Domain Admins, Enterprise Admins, or other Tier 0 accounts synced to Entra ID (i.e., present as cloud objects)?
|
||||
- `[LOOK AT]` Are Global Admins or other Entra privileged role holders cloud-only accounts, or synced from on-prem?
|
||||
- `[LOOK AT]` Are admin accounts (on-prem or cloud) using the same device for privileged work as for daily tasks (email, browsing)?
|
||||
|
||||
### A5. Writebacks
|
||||
|
||||
- `[LOOK AT]` Which writebacks are enabled: password writeback, group writeback, device writeback?
|
||||
- `[ASK]` For each: who owns the decision, and is the reverse blast radius (cloud compromise → on-prem impact) documented?
|
||||
- `[LOOK AT]` Is group writeback (v2) enabled? If so, which cloud groups write into AD, and what on-prem resources do they gate?
|
||||
|
||||
### A6. Seamless SSO
|
||||
|
||||
- `[LOOK AT]` Is Seamless SSO enabled?
|
||||
- `[LOOK AT]` When was the `AZUREADSSOACC` Kerberos key last rotated? (`Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet`)
|
||||
- `[ASK]` Is Seamless SSO actually needed, or can it be removed (Entra-joined devices + modern auth typically do not require it)?
|
||||
|
||||
### A7. Sync Scope
|
||||
|
||||
- `[LOOK AT]` Is sync scoped to specific OUs, or is "sync everything" the default?
|
||||
- `[LOOK AT]` Are there synced objects that serve no cloud purpose (decommissioned systems, service accounts, administrative accounts)?
|
||||
|
||||
### A8. Breach Optionality
|
||||
|
||||
- `[ASK]` Is there a written, accessible runbook for severing the AD↔Entra bridge under breach conditions?
|
||||
- `[TEST]` Is the runbook stored somewhere accessible when both AD and SharePoint are unavailable?
|
||||
- `[ASK]` Has anyone walked through the "kill the sync" procedure, and does the team know what breaks per auth method?
|
||||
- `[LOOK AT]` Does the cloud admin path (break-glass Global Admin) work with zero on-prem dependency?
|
||||
|
||||
---
|
||||
|
||||
## Section B — Privileged Access
|
||||
|
||||
### B1. Standing Privilege Inventory
|
||||
|
||||
- `[LOOK AT]` How many identities hold standing (permanent, active) privilege: Global Admin, Privileged Role Admin, Domain Admin, Enterprise Admin?
|
||||
- `[LOOK AT]` Are there any standing Global Admin assignments that are not break-glass accounts? (Should be zero)
|
||||
- `[LOOK AT]` How many Domain Admins and Enterprise Admins exist, and are they all justified with named owners?
|
||||
- `[ASK]` When was the privileged account list last reviewed, and by whom?
|
||||
|
||||
### B2. Admin workstations and management plane
|
||||
|
||||
- `[ASK]` What do admins use to reach a domain controller remotely? Is that path independent of the AD it manages, or does it depend on AD for authentication?
|
||||
- `[LOOK AT]` Do admins use the same device for privileged work (DC management, PIM activation) and daily tasks (email, browsing)?
|
||||
- `[ASK]` Is there a dedicated admin workstation — physical PAW or cloud admin VM (Windows 365 / AVD) — that is used only for privileged tasks?
|
||||
- `[LOOK AT]` If a cloud admin VM exists: is it enrolled in Intune with a hardened profile? Is it excluded from email and general browsing? Is it the device scoped in the CA policy restricting privileged role access?
|
||||
- `[LOOK AT]` Is there a management overlay (Nebula, Tailscale, Headscale) providing the admin access path to on-prem Tier 0 systems?
|
||||
- `[ASK]` If a Nebula T0 overlay exists: where is the CA key stored? Who can sign new node certificates? When was the last signing ceremony?
|
||||
- `[ASK]` If a Tailscale T1 overlay exists: is key expiry configured? Does re-authentication require phishing-resistant MFA via Entra?
|
||||
- `[LOOK AT]` For multi-cloud clients without a physical data centre: is the management plane explicitly designed, or is access to cloud management consoles and on-prem servers done ad hoc (VPN, direct RDP, per-cloud bastion, no unified plane)?
|
||||
|
||||
### B3. PIM / JIT
|
||||
|
||||
- `[LOOK AT]` Is Entra PIM deployed and enforced for Entra administrative roles?
|
||||
- `[LOOK AT]` Are Entra roles set to eligible (not active) by default?
|
||||
- `[LOOK AT]` Does PIM activation require phishing-resistant MFA (FIDO2 / certificate), or just push-approve?
|
||||
- `[LOOK AT]` Do crown roles (Privileged Role Administrator, Global Administrator) require approval workflow on PIM activation?
|
||||
- `[LOOK AT]` What is the maximum activation time-box configured? (Should be justified and bounded — 8 hours maximum for a working day)
|
||||
- `[LOOK AT]` Is PIM alert configuration enabled (Roles activated without MFA, Redundant assignments, etc.)?
|
||||
- `[ASK]` For on-prem DA/EA: is there any JIT or time-limited elevation mechanism in place?
|
||||
|
||||
### B4. Service Accounts (On-Prem)
|
||||
|
||||
- `[LOOK AT]` Are there service accounts with SPNs and static passwords older than 12 months? (Kerberoastable)
|
||||
- `[LOOK AT]` Which service accounts are over-permissioned (e.g., Domain Admin, local admin on all servers)?
|
||||
- `[LOOK AT]` Which service accounts have been migrated to gMSA?
|
||||
- `[LOOK AT]` Are there service accounts nobody can identify a current owner for?
|
||||
- `[TEST]` Run a Kerberoast simulation: do ticket requests for service account SPNs generate any detection?
|
||||
|
||||
### B5. Service Principals & App Registrations (Cloud)
|
||||
|
||||
- `[LOOK AT]` Which app registrations hold escalation-grade Graph permissions (application permissions): `RoleManagement.ReadWrite.Directory`, `AppRoleAssignment.ReadWrite.All`, `Application.ReadWrite.All`, `Directory.ReadWrite.All`?
|
||||
- `[LOOK AT]` Which app registrations have non-expiring client secrets?
|
||||
- `[LOOK AT]` Are there orphaned app registrations with no current owner?
|
||||
- `[LOOK AT]` Which apps have tenant-wide admin consent, and is each justified and reviewed?
|
||||
- `[LOOK AT]` Which Azure workloads use client secrets instead of managed identities where managed identities are available?
|
||||
|
||||
### B6. Tier Model / Clean Source
|
||||
|
||||
- `[LOOK AT]` Do Domain Admins / Enterprise Admins authenticate from standard workstations used for email and browsing?
|
||||
- `[LOOK AT]` Is ADCS (Active Directory Certificate Services) deployed? If so, is it on a Tier 0 or hardened host, or on a standard server?
|
||||
- `[LOOK AT]` Are there shared administrative jump boxes that cross tier boundaries (used for both Tier 0 and Tier 1 work)?
|
||||
- `[LOOK AT]` Do cloud admins use the same device for privileged Entra work as for daily activity?
|
||||
|
||||
### B7. Escalation Paths
|
||||
|
||||
- `[LOOK AT]` Are there accounts with `GenericAll`, `WriteDACL`, or `WriteOwner` on high-value AD objects (domain root, DCs, admin groups) that are not themselves Tier 0?
|
||||
- `[LOOK AT]` Are there computers with unconstrained delegation enabled (excluding DCs)?
|
||||
- `[LOOK AT]` When was KRBTGT last rotated? (`Get-ADUser krbtgt -Properties PasswordLastSet`)
|
||||
- `[LOOK AT]` Is LAPS (Windows LAPS preferred) deployed across all workstations and servers? What is the coverage percentage?
|
||||
- `[TEST]` Run BloodHound (or equivalent) and count attack paths to Domain Admin. Note the number as a baseline. Is it going up or down over time?
|
||||
|
||||
### B8. Break-Glass
|
||||
|
||||
- `[LOOK AT]` Do cloud-only break-glass Global Admin accounts exist?
|
||||
- `[LOOK AT]` Is phishing-resistant authentication (FIDO2 or certificate) configured on break-glass accounts?
|
||||
- `[LOOK AT]` Are break-glass accounts excluded from the CA policies that would otherwise enforce device compliance or block sign-in?
|
||||
- `[LOOK AT]` Does any use of the break-glass account trigger an immediate, monitored alert?
|
||||
- `[TEST]` Sign in to the break-glass account in a controlled drill. Does it work? Does the alert fire? Does someone respond?
|
||||
- `[ASK]` Where are the break-glass credentials stored, and can they be retrieved without the systems they recover?
|
||||
|
||||
### B9. Phishing-Resistant MFA for Admins
|
||||
|
||||
- `[LOOK AT]` What MFA method is enforced for Global Admins: FIDO2, certificate-based auth, or push/SMS?
|
||||
- `[LOOK AT]` Push-approve and SMS are not acceptable for administrative accounts. If they are in use, that is a P0.
|
||||
- `[LOOK AT]` Is there a CA policy restricting privileged role activation to compliant/managed devices or named PAWs?
|
||||
|
||||
---
|
||||
|
||||
## Section C — Devices & Endpoint
|
||||
|
||||
### C1. Fleet Reality
|
||||
|
||||
- `[LOOK AT]` Reconcile: Intune enrolled devices vs. Entra registered devices vs. sign-in log device population. What is the gap?
|
||||
- `[LOOK AT]` How many sign-in events in the last 30 days came from non-compliant or unmanaged devices (device compliance state = unknown or non-compliant in sign-in logs)?
|
||||
- `[LOOK AT]` Are there legacy-protocol sign-ins (Basic Auth) that bypass Conditional Access entirely? (Sign-in logs, filter Client App = "Exchange ActiveSync," "Other clients")
|
||||
- `[LOOK AT]` How many BYOD / personal devices are accessing corporate data through the web client or OWA (known-unmanaged population)?
|
||||
|
||||
### C2. Join State and Management Mode
|
||||
|
||||
- `[LOOK AT]` Are devices Entra-joined, hybrid Entra-joined, or Entra-registered (BYOD)?
|
||||
- `[LOOK AT]` Is hybrid Entra join still in use? If so, which on-prem dependencies actually require it?
|
||||
- `[LOOK AT]` Is there a roadmap to go cloud-native (Entra join + Intune) for devices currently on hybrid join?
|
||||
- `[LOOK AT]` Are there GPO and Intune co-management conflicts producing inconsistent configuration?
|
||||
|
||||
### C3. Conditional Access Enforcement
|
||||
|
||||
- `[TEST]` For every CA policy that enforces device compliance or blocks legacy auth: run real sign-ins with expected outcomes written down beforehand. Does the observed result match?
|
||||
- `[TEST]` If a policy looks correct but does not enforce: recreate from scratch, re-test. Document ghost policy findings.
|
||||
- `[LOOK AT]` Is there a CA policy blocking legacy authentication protocols across all apps? (This is the single highest-leverage CA policy — if not in place, that is P0)
|
||||
- `[LOOK AT]` Is there a CA policy requiring MFA for all admin role activations?
|
||||
- `[LOOK AT]` Is there a CA policy requiring compliant or managed device for access to sensitive workloads?
|
||||
- `[LOOK AT]` Are break-glass accounts and emergency service accounts correctly excluded from blocking CA policies?
|
||||
- `[TEST]` Lock yourself out in report-only mode (simulate a compliance failure on an admin account). Confirm break-glass bypasses the policy. Confirm a legitimate admin gets the expected failure and knows the escalation path.
|
||||
|
||||
### C4. Compliance Signal Quality
|
||||
|
||||
- `[LOOK AT]` What is the compliance check-in cadence? (The window where a fallen-out device still holds a "compliant" token)
|
||||
- `[LOOK AT]` Is Continuous Access Evaluation (CAE) enabled for workloads that support it? (Narrows the stale-token window)
|
||||
- `[ASK]` Is root/jailbreak detection in compliance policy, and how is it treated — as a hard block or a risk signal? Is it believed to be a wall or a tripwire?
|
||||
- `[TEST]` Spoof compliance on a test device (root a test device). How long until the signal flips? Does CA revoke access?
|
||||
|
||||
### C5. Endpoint Privilege
|
||||
|
||||
- `[LOOK AT]` Do standard users have standing local admin on their endpoints?
|
||||
- `[LOOK AT]` Is Endpoint Privilege Management (EPM) deployed, or is there a JIT elevation mechanism for tasks requiring admin rights?
|
||||
- `[LOOK AT]` Is Windows LAPS deployed across the fleet? Is legacy LAPS still in use (to be migrated)?
|
||||
- `[LOOK AT]` Are there shared local admin accounts with common passwords across multiple machines?
|
||||
|
||||
### C6. Update and Patch Velocity
|
||||
|
||||
- `[LOOK AT]` Is Windows Autopatch in use (for update ring management)?
|
||||
- `[LOOK AT]` Are Intune update rings configured with pilot, broad, and deferral stages?
|
||||
- `[ASK]` Is there a named person with the authority and procedure to halt a broad update ring push? Has this been tested?
|
||||
- `[LOOK AT]` What is the current patch lag for the fleet (how many devices are 30+ days behind on OS updates)?
|
||||
|
||||
### C7. MAM / App Protection (BYOD)
|
||||
|
||||
- `[TEST]` On iOS: attempt copy/paste from managed Outlook/Teams to an unmanaged app. Does it block?
|
||||
- `[TEST]` On Android: same test, separately — behavior is not symmetric with iOS.
|
||||
- `[TEST]` Attempt to "Open in" from a managed attachment to an unmanaged app on each platform.
|
||||
- `[TEST]` Attempt to save to local storage or sync to a personal cloud (iCloud, Google Drive).
|
||||
- `[LOOK AT]` Are managed browsers enforced for SharePoint/OWA access on BYOD, or can users access via any browser?
|
||||
|
||||
### C8. Autopilot and Enrollment Trust
|
||||
|
||||
- `[LOOK AT]` Is the Autopilot device list audited? Are there stale or unknown device registrations?
|
||||
- `[LOOK AT]` Are enrollment restrictions in place to prevent unauthorized device enrollment?
|
||||
- `[TEST]` Time a wipe-and-reprovision on a corporate device via Autopilot. Is the "replaceable in an hour" claim accurate?
|
||||
- `[LOOK AT]` Is the PRT (Primary Refresh Token) TPM-bound on Windows devices?
|
||||
|
||||
---
|
||||
|
||||
## Section D — Data & Collaboration
|
||||
|
||||
### D1. Sharing Posture
|
||||
|
||||
- `[LOOK AT]` What is the tenant-level external sharing setting in SharePoint Admin Center?
|
||||
- `[LOOK AT]` Are "Anyone with the link" anonymous shares enabled at the tenant level?
|
||||
- `[TEST]` Enumerate existing anonymous links across the tenant. Can you produce the list? How large is it?
|
||||
- `[LOOK AT]` Are per-site sharing settings more permissive than the tenant default? (Sites can override upward)
|
||||
- `[LOOK AT]` Are sharing expiration policies configured for anonymous and external links?
|
||||
- `[TEST]` Share a document to a test external guest and attempt to reshare onward. Can you track the second-hop share?
|
||||
|
||||
### D2. Guest Access
|
||||
|
||||
- `[LOOK AT]` How many active guests exist in the tenant?
|
||||
- `[LOOK AT]` How many guests have not signed in for 90+ days?
|
||||
- `[LOOK AT]` Are access reviews configured for guest accounts? What is the review cadence and the default action on non-response?
|
||||
- `[LOOK AT]` Do guests have broader access than the project they were invited for (i.e., access to Teams/channels beyond their original scope)?
|
||||
- `[LOOK AT]` Are external identities governed by specific B2B collaboration settings, or is the default (all external domains) allowed?
|
||||
|
||||
### D3. Email Security
|
||||
|
||||
- `[TEST]` Enumerate external auto-forwarding rules at the transport level (`Get-TransportRule`). Are there any active rules forwarding externally without a documented business owner?
|
||||
- `[TEST]` Enumerate Inbox rules on executive / privileged user mailboxes forwarding externally. (`Get-InboxRule`)
|
||||
- `[LOOK AT]` Is the global "allow automatic forwarding" setting disabled in Remote Domains for the Default domain?
|
||||
- `[LOOK AT]` Are anti-phishing policies configured? Is impersonation protection enabled for executives and key domains?
|
||||
- `[LOOK AT]` Is DKIM signing enabled for all sending domains?
|
||||
- `[LOOK AT]` Is DMARC configured (policy `reject` or `quarantine`), and is the SPF record current?
|
||||
|
||||
### D4. Crown Jewels
|
||||
|
||||
- `[ASK]` Can the client name the five data sets that, if exfiltrated, would cause the most damage?
|
||||
- `[LOOK AT]` Where do the crown jewels live (SharePoint sites, mailboxes, OneDrive, Teams channels)?
|
||||
- `[LOOK AT]` Who has access to the crown-jewel locations? Is access reviewed periodically?
|
||||
- `[LOOK AT]` Are the crown-jewel locations labeled with sensitivity labels that carry encryption?
|
||||
- `[LOOK AT]` Are audit logs turned on and retained long enough to reconstruct access to crown-jewel locations?
|
||||
|
||||
### D5. Sensitivity Labels and DLP
|
||||
|
||||
- `[LOOK AT]` Are sensitivity labels deployed in the tenant? What is the coverage across the most-used content types (email, files)?
|
||||
- `[LOOK AT]` Are labels configured with encryption for the highest sensitivity tiers?
|
||||
- `[LOOK AT]` Is auto-labeling deployed for known crown-jewel content types (if licensed for M365 E5 Compliance)?
|
||||
- `[LOOK AT]` Is DLP deployed? Is it scoped to specific known-value patterns (regulated data, PII, crown-jewel keywords) or applied as a broad dragnet generating noise?
|
||||
- `[TEST]` Exfiltrate a labeled test document via email to an external address. Does DLP fire? Does the label encryption hold on the received document?
|
||||
|
||||
### D6. Collaboration Sprawl
|
||||
|
||||
- `[LOOK AT]` Is there ungoverned self-service creation of Teams and SharePoint sites?
|
||||
- `[LOOK AT]` Are there orphaned or inactive Teams/sites that still hold data and have no active owner?
|
||||
- `[LOOK AT]` Are there Teams channels or SharePoint sites with "Everyone" or broad internal membership grants on sensitive data?
|
||||
- `[LOOK AT]` Is late-joiners' access to Team history governed (a user joining a Team today can read all prior messages by default)?
|
||||
|
||||
### D7. OAuth App Consent
|
||||
|
||||
- `[LOOK AT]` Is user consent for OAuth apps restricted (users cannot consent to app permission requests without admin approval)?
|
||||
- `[LOOK AT]` Are there existing grants for apps holding `Mail.Read`, `Files.ReadWrite.All`, or equivalent sensitive scopes by non-first-party apps?
|
||||
- `[LOOK AT]` Is Microsoft's app governance module (Purview) enabled? Are risky app alerts configured?
|
||||
|
||||
### D8. Audit Logging
|
||||
|
||||
- `[LOOK AT]` Is Unified Audit Logging enabled (confirm in Purview Compliance Center > Audit)?
|
||||
- `[LOOK AT]` What is the audit retention period, given the client's licensing?
|
||||
- `[TEST]` Run a sample audit query on a known recent activity and verify log entries are present. Do not assume the log is on without testing it.
|
||||
- `[LOOK AT]` Are admin operations (role assignment changes, app consent, CA policy changes) captured in the audit log?
|
||||
|
||||
---
|
||||
|
||||
## Section E — Recovery & Detection
|
||||
|
||||
### E1. Backup and Recovery
|
||||
|
||||
- `[ASK]` What is the recovery path if a Global Admin deletes all Exchange Online mailboxes and SharePoint sites? Be specific about process, tool, and time estimate.
|
||||
- `[LOOK AT]` Is there a third-party M365 backup solution covering Exchange, SharePoint, OneDrive, and Teams?
|
||||
- `[LOOK AT]` Are M365 backups isolated from the estate they protect (immutable, separate authentication domain)?
|
||||
- `[TEST]` When was the last successful restore from backup, and how long did it take? Restore a test mailbox or a file share and time it. This is the MTTR.
|
||||
- `[LOOK AT]` Are on-prem AD backups (System State) taken regularly, stored offline, and verified?
|
||||
- `[TEST]` Can the current backup restore an AD domain if all DCs are destroyed? Has anyone run the forest recovery procedure, even in a lab?
|
||||
|
||||
### E2. Configuration-as-Code (Known-Good Baseline)
|
||||
|
||||
- `[LOOK AT]` Have CA policies been exported to code/JSON (e.g., using CAExporter)?
|
||||
- `[LOOK AT]` Has the Entra role assignment state been captured as a document?
|
||||
- `[LOOK AT]` Has the Intune baseline configuration been exported?
|
||||
- `[LOOK AT]` Is there a diff between the opening state and current state for any changes made during the engagement?
|
||||
- `[ASK]` If the tenant CA policies were silently modified by an attacker, would anyone know? Is there drift detection against the known-good?
|
||||
|
||||
### E3. Recovery Path Independence
|
||||
|
||||
- `[LOOK AT]` Does any part of the recovery runbook depend on the system it recovers (e.g., runbook stored in SharePoint, backup auth via the compromised AD)?
|
||||
- `[LOOK AT]` Are recovery credentials (break-glass, backup admin accounts) accessible independently of the estate?
|
||||
- `[LOOK AT]` Is the AD forest recovery runbook stored offline or in a location that survives domain destruction?
|
||||
- `[ASK]` If both AD and M365 were simultaneously unavailable, what is the recovery sequencing? Is that decision documented?
|
||||
|
||||
### E4. Detection: Signal Quality
|
||||
|
||||
- `[LOOK AT]` Break-glass account use: is there an alert? Is it monitored by a named person?
|
||||
- `[LOOK AT]` New Global Admin assignment: does an alert fire?
|
||||
- `[LOOK AT]` DCSync from a non-DC host: is this detected (Defender for Identity or SIEM rule)?
|
||||
- `[LOOK AT]` Impossible-travel sign-in for admin accounts: is Entra ID Protection user risk policy configured and alerting?
|
||||
- `[LOOK AT]` External auto-forward rule creation: is this generating an alert?
|
||||
- `[LOOK AT]` Mass download from SharePoint/OneDrive: is there a Defender for Cloud Apps or Purview policy detecting it?
|
||||
- `[LOOK AT]` New OAuth consent grant to sensitive scopes: is this alerting?
|
||||
- `[LOOK AT]` PIM activation outside business hours: is this logged and reviewed?
|
||||
- `[TEST]` For each configured detection: simulate the event (in a controlled, authorized test context) and confirm the alert fires, is received by a named person, and generates a response within the expected SLA.
|
||||
|
||||
### E5. Detection: Noise and Action
|
||||
|
||||
- `[ASK]` How many alerts does the monitoring system generate per day? How many are triaged vs. suppressed vs. missed?
|
||||
- `[ASK]` For the last three security incidents or notable alerts: what structural change resulted? If the answer is "we sent an awareness email" or "we noted it," the feedback loop is broken.
|
||||
- `[LOOK AT]` Is there a named owner for each alert category? An alert without a named owner is an unread alert.
|
||||
- `[ASK]` Is there a blameless post-incident process? Do people surface incidents, or do they bury them to avoid blame?
|
||||
|
||||
### E6. Game-Days and Drills
|
||||
|
||||
- `[ASK]` When was the last deliberate test of recovery or detection (a drill, tabletop, or game-day)?
|
||||
- `[TEST]` Break-glass drill: sign in, confirm it works, confirm the alert fires. Document the test and the result.
|
||||
- `[TEST]` CA policy enforcement drill: force a non-compliant state on a test user. Confirm the expected outcome and that break-glass bypasses the gate.
|
||||
- `[ASK]` Has the client ever run a ransomware tabletop that assumes Tier 0 is owned? What did they find?
|
||||
|
||||
---
|
||||
|
||||
## Section F — Quick-Win Inventory
|
||||
|
||||
Use this section to capture findings that can be addressed in the same session or within the engagement without additional scoping.
|
||||
|
||||
Each of the following, if found to be the case, is a fix that typically takes under an hour and has immediate blast-radius reduction. Do not leave these open for the next engagement.
|
||||
|
||||
| Control | Condition that makes it a quick win |
|
||||
|---------|-------------------------------------|
|
||||
| Tenant-level anonymous sharing | "Anyone" links enabled at tenant level — one toggle |
|
||||
| External auto-forwarding | Global block not set — one Exchange setting |
|
||||
| Legacy auth CA policy | No policy blocking legacy auth — deploy baseline CA policy |
|
||||
| Break-glass alert | Break-glass use not alerting — configure alert rule |
|
||||
| Global admins audit | Standing synced GAs — identify and initiate migration |
|
||||
| KRBTGT age | Password not set in 365+ days — document and schedule rotation |
|
||||
| Stale admin accounts | Disabled or unchecked admin accounts — disable and document |
|
||||
| Audit log | Not enabled — turn on (one click in Purview) |
|
||||
| PIM not deployed | P2 licensed but PIM off — scope activation as P1 |
|
||||
| No CA blocking admin sign-in from personal devices | Missing policy — create report-only immediately, test and enable |
|
||||
|
||||
---
|
||||
|
||||
## Engagement Close — Structural Change Verification
|
||||
|
||||
At the close of each engagement or module, confirm:
|
||||
|
||||
1. Which items above were found to be fragile?
|
||||
2. For each: what **structural change** was made (not documented, not accepted, but changed)?
|
||||
3. Which items were tested by observation (not just inspected)?
|
||||
4. Which items are open and in the risk register with a named owner and a timeline?
|
||||
5. Has the configuration-as-code baseline been exported and stored?
|
||||
6. Has the break-glass been tested?
|
||||
7. Is there a named date for the next review of this checklist?
|
||||
|
||||
The work is not complete when the list is walked. It is complete when fragility found has become structure changed.
|
||||
|
||||
---
|
||||
|
||||
*Engagement Checklist. Updated June 2026. Review and update alongside the Field Guide — January 2027.*
|
||||
@@ -0,0 +1,380 @@
|
||||
# Self-Service Security Cadence
|
||||
|
||||
> *What you run between our engagements. When something in here surprises you, that's when you call us.*
|
||||
|
||||
**Last updated:** June 2026
|
||||
**Produced by:** [engagement name / consultant name]
|
||||
**For:** [client name] — [named admin / IT lead]
|
||||
**Next full engagement:** [date or "TBD"]
|
||||
**Next review of this document:** January 2027
|
||||
|
||||
---
|
||||
|
||||
## What this is
|
||||
|
||||
We ran the adversarial validation. We fixed the structural issues we found. The work does not stop when we leave.
|
||||
|
||||
This document is your recurring checklist — things you can run yourself, with the tools we set up, on a regular cadence. None of it requires a security background. Most of it takes under an hour per month. The point is to catch drift before it becomes a problem, and to know when to call us before it becomes a crisis.
|
||||
|
||||
**The most important thing:** when something in here produces a result that surprises you, do not sit on it. Log it, screenshot it, and send it to us. The earlier we see a problem the cheaper it is to fix.
|
||||
|
||||
---
|
||||
|
||||
## Tools you need (all installed during the engagement)
|
||||
|
||||
| Tool | What it does | Where to get it |
|
||||
|------|-------------|-----------------|
|
||||
| **PingCastle** | Scans Active Directory and produces a security report with a score and specific findings | [pingcastle.com](https://www.pingcastle.com) — free Community edition |
|
||||
| **Purple Knight** | Scans Active Directory for indicators of exposure — simpler output than PingCastle, good complement | [purple-knight.com](https://www.purple-knight.com) — free |
|
||||
| **CAExporter** | Exports all Conditional Access policies to JSON files you can compare over time | [github.com/vibecoding/CAExporter](https://github.com/vibecoding/CAExporter) |
|
||||
| **Microsoft Graph PowerShell** | The PowerShell module for the scripts in this document | `Install-Module Microsoft.Graph` |
|
||||
| **Microsoft 365 Defender portal** | alerts.microsoft.com — your alert queue and Secure Score | |
|
||||
| **Microsoft Entra portal** | entra.microsoft.com — your identity dashboard | |
|
||||
|
||||
The scripts in this document are saved in `[location agreed during engagement — e.g., C:\SecurityRunbook\Scripts\]`.
|
||||
|
||||
---
|
||||
|
||||
## Monthly checks — 30 to 45 minutes, portal-based
|
||||
|
||||
Do these on the first working day of each month. They require no special tools — just a browser logged in as a Global Admin or Security Reader.
|
||||
|
||||
---
|
||||
|
||||
### M1. Microsoft Secure Score
|
||||
|
||||
**Where:** [Microsoft 365 Defender portal](https://security.microsoft.com) > Secure Score
|
||||
|
||||
**What to do:**
|
||||
1. Note the current score.
|
||||
2. Compare to last month's score (the history graph shows it).
|
||||
3. Look at the "Recommended actions" tab — filter to "Not addressed."
|
||||
4. Any new items that appeared since last month? Note them.
|
||||
|
||||
**What you are looking for:** Score going down month-over-month without a known reason. New recommended actions you did not create. Completed actions that have reverted to "not addressed" (this means configuration drifted back).
|
||||
|
||||
**Call us if:** Score drops more than 5 points in a month without a documented reason, or if a completed action you remember implementing shows as "not addressed."
|
||||
|
||||
---
|
||||
|
||||
### M2. Entra ID Recommendations
|
||||
|
||||
**Where:** [Entra portal](https://entra.microsoft.com) > Overview > Recommendations
|
||||
|
||||
**What to do:**
|
||||
1. Look at all open recommendations.
|
||||
2. Note any that are new since last month.
|
||||
3. Note the impact rating (High / Medium / Low) on new ones.
|
||||
|
||||
**What you are looking for:** New high-impact recommendations that appeared since last month. Specifically watch for anything related to admin accounts, Conditional Access, legacy authentication, or risky sign-ins.
|
||||
|
||||
**Call us if:** Any new High-impact recommendation appears. We will help you assess whether to act immediately or schedule it.
|
||||
|
||||
---
|
||||
|
||||
### M3. Sign-in risk review
|
||||
|
||||
**Where:** Entra portal > Identity Protection > Risky sign-ins
|
||||
|
||||
**What to do:**
|
||||
1. Filter to the last 30 days.
|
||||
2. Look at sign-ins with risk level "High" that were not dismissed or remediated.
|
||||
3. For any admin account (Global Admin, Exchange Admin, Security Admin) with any risky sign-in event — investigate before dismissing.
|
||||
|
||||
**What you are looking for:** Admin accounts appearing in the risky sign-in list. Any high-risk sign-in that auto-remediated (meaning the user passed an MFA challenge) where the geography or device does not make sense.
|
||||
|
||||
**Call us if:** Any admin account has a risky sign-in event. Any high-risk event that was remediated from an unexpected location.
|
||||
|
||||
---
|
||||
|
||||
### M4. Alert queue health
|
||||
|
||||
**Where:** Microsoft 365 Defender portal > Incidents & alerts > Alerts
|
||||
|
||||
**What to do:**
|
||||
1. Filter to "New" and "In progress" alerts.
|
||||
2. How many are sitting open for more than 48 hours?
|
||||
3. Are there categories of alert that appear repeatedly? (Recurring alerts on the same user or asset are a pattern, not noise.)
|
||||
|
||||
**What you are looking for:** Alert queue growing over time without being worked. The same alert firing repeatedly on the same account or resource. Any alert tagged as "High severity" that is more than 24 hours old without assignment.
|
||||
|
||||
**Call us if:** A High-severity alert is more than 24 hours old and you do not know what to do with it. Or if the same alert keeps firing on the same account.
|
||||
|
||||
---
|
||||
|
||||
### M5. New admin assignments
|
||||
|
||||
**Where:** Entra portal > Identity > Roles & admins > All roles > Global Administrator > Assignments
|
||||
|
||||
**What to do:**
|
||||
1. Check the current member list against last month's.
|
||||
2. Any new members? Were they expected?
|
||||
3. Check at minimum: Global Administrator, Exchange Administrator, Security Administrator, SharePoint Administrator.
|
||||
|
||||
**What you are looking for:** Anyone in a privileged role who should not be, or who appeared without a formal request.
|
||||
|
||||
**Call us if:** Any new privileged role assignment you did not authorize or do not recognize.
|
||||
|
||||
---
|
||||
|
||||
### M6. Break-glass confirmation (30 seconds)
|
||||
|
||||
**What to do:**
|
||||
1. Confirm the break-glass account credentials are still in the agreed storage location.
|
||||
2. Confirm the contact for "break-glass alert fired" is still the right person.
|
||||
|
||||
Do not log in to the break-glass account during this check — any sign-in triggers an alert. Just confirm the credentials are accessible.
|
||||
|
||||
**Call us if:** Credentials cannot be found. Or if the break-glass alert fires without a drill scheduled.
|
||||
|
||||
---
|
||||
|
||||
## Quarterly checks — 2 to 3 hours, tools required
|
||||
|
||||
Do these in the first week of each quarter (January, April, July, October). These require running the installed tools and saving the output.
|
||||
|
||||
---
|
||||
|
||||
### Q1. PingCastle AD scan
|
||||
|
||||
**How to run:**
|
||||
1. Log in to the domain controller (or any domain-joined machine) as a Domain Admin.
|
||||
2. Run `PingCastle.exe --healthcheck --server <your-domain-FQDN>`.
|
||||
3. It produces an HTML report. Save it to `[agreed location]` with the date in the filename: `PingCastle-2026-Q3.html`.
|
||||
4. Open the report and note the score and any findings marked "Critical" or "High."
|
||||
5. Compare to the previous quarter's report — is the score going up or down?
|
||||
|
||||
**What you are looking for:** Score trending down quarter-over-quarter. New Critical or High findings that were not present last quarter. Specifically watch the "Stale Objects" section (accounts nobody uses) and the "Privileged Access" section.
|
||||
|
||||
**Call us if:** The score drops more than 10 points since last quarter. Any new Critical finding. Any finding in the "Privileged Access" category that was clean last quarter.
|
||||
|
||||
---
|
||||
|
||||
### Q2. Purple Knight AD scan
|
||||
|
||||
**How to run:**
|
||||
1. Download and run Purple Knight on a domain-joined machine with Domain Admin credentials.
|
||||
2. It is a GUI tool — click through the scan, wait for it to finish.
|
||||
3. Save the PDF report with the date: `PurpleKnight-2026-Q3.pdf`.
|
||||
4. Look at the "Identity Security Indicators" with status "Exposed" or "Critical."
|
||||
5. Compare to the previous quarter.
|
||||
|
||||
**What you are looking for:** New exposed indicators that did not appear last quarter. Any indicator flagged as Critical. The tool is organized by MITRE ATT&CK category — pay particular attention to "Credential Access" and "Privilege Escalation."
|
||||
|
||||
**Call us if:** Any new Critical indicator. Or if the same Medium indicators keep appearing quarter after quarter without being resolved (this means the fix did not stick).
|
||||
|
||||
---
|
||||
|
||||
### Q3. KRBTGT and AZUREADSSOACC age check
|
||||
|
||||
**How to run:** Open PowerShell as Domain Admin and run the following:
|
||||
|
||||
```powershell
|
||||
Write-Host "=== KRBTGT ===" -ForegroundColor Cyan
|
||||
Get-ADUser krbtgt -Properties PasswordLastSet |
|
||||
Select-Object @{N="Account";E={"krbtgt"}},
|
||||
PasswordLastSet,
|
||||
@{N="AgeDays";E={((Get-Date) - $_.PasswordLastSet).Days}}
|
||||
|
||||
Write-Host "=== AZUREADSSOACC ===" -ForegroundColor Cyan
|
||||
Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet -ErrorAction SilentlyContinue |
|
||||
Select-Object @{N="Account";E={"AZUREADSSOACC"}},
|
||||
PasswordLastSet,
|
||||
@{N="AgeDays";E={((Get-Date) - $_.PasswordLastSet).Days}}
|
||||
```
|
||||
|
||||
Record the age in days in your tracking spreadsheet.
|
||||
|
||||
**What you are looking for:** KRBTGT older than 365 days = P1 (schedule rotation with us). KRBTGT older than 180 days = note and plan. AZUREADSSOACC never rotated since initial sync setup = note.
|
||||
|
||||
**Call us if:** KRBTGT is over 365 days old and there is no scheduled rotation. Or if either account shows a password age younger than expected (meaning someone rotated it without telling you — that is a finding too).
|
||||
|
||||
---
|
||||
|
||||
### Q4. Cloud-only Global Admins check
|
||||
|
||||
**How to run:**
|
||||
|
||||
```powershell
|
||||
Connect-MgGraph -Scopes "Directory.Read.All"
|
||||
|
||||
$gaRoleId = (Get-MgDirectoryRole -Filter "displayName eq 'Global Administrator'").Id
|
||||
$gaMembers = Get-MgDirectoryRoleMember -DirectoryRoleId $gaRoleId
|
||||
|
||||
Write-Host "=== Global Admins ===" -ForegroundColor Cyan
|
||||
$gaMembers | ForEach-Object {
|
||||
$user = Get-MgUser -UserId $_.Id -Property DisplayName,UserPrincipalName,OnPremisesSyncEnabled
|
||||
[PSCustomObject]@{
|
||||
Name = $user.DisplayName
|
||||
UPN = $user.UserPrincipalName
|
||||
SyncedFromAD = $user.OnPremisesSyncEnabled
|
||||
}
|
||||
} | Format-Table -AutoSize
|
||||
```
|
||||
|
||||
Any row where `SyncedFromAD` is `True` is a P0 — call us immediately.
|
||||
|
||||
**What you are looking for:** Any Global Admin that is synced from on-prem AD. Any new GA you did not create.
|
||||
|
||||
**Call us if:** Any synced GA appears. Any GA you do not recognize.
|
||||
|
||||
---
|
||||
|
||||
### Q5. Service principal secrets check — expiring and never-expiring
|
||||
|
||||
**How to run:**
|
||||
|
||||
```powershell
|
||||
Connect-MgGraph -Scopes "Application.Read.All"
|
||||
|
||||
$today = Get-Date
|
||||
$warningDays = 60
|
||||
|
||||
Write-Host "=== Non-expiring secrets ===" -ForegroundColor Red
|
||||
Get-MgApplication -All | ForEach-Object {
|
||||
$app = $_
|
||||
$app.PasswordCredentials | Where-Object { $_.EndDateTime -eq $null } | ForEach-Object {
|
||||
[PSCustomObject]@{ App = $app.DisplayName; Secret = $_.DisplayName; Expires = "NEVER" }
|
||||
}
|
||||
} | Format-Table
|
||||
|
||||
Write-Host "=== Secrets expiring within $warningDays days ===" -ForegroundColor Yellow
|
||||
Get-MgApplication -All | ForEach-Object {
|
||||
$app = $_
|
||||
$app.PasswordCredentials | Where-Object {
|
||||
$_.EndDateTime -ne $null -and $_.EndDateTime -lt $today.AddDays($warningDays)
|
||||
} | ForEach-Object {
|
||||
[PSCustomObject]@{ App = $app.DisplayName; Secret = $_.DisplayName; Expires = $_.EndDateTime }
|
||||
}
|
||||
} | Sort-Object Expires | Format-Table
|
||||
```
|
||||
|
||||
**What you are looking for:** Non-expiring secrets on any app registration. Secrets about to expire (these will break an application if not rotated — but they also need reviewing: is the app still needed?).
|
||||
|
||||
**Call us if:** You find a non-expiring secret on an app you do not recognize. Or if you find an expiring secret and do not know which application or service it belongs to.
|
||||
|
||||
---
|
||||
|
||||
### Q6. Stale guest review
|
||||
|
||||
**How to run:**
|
||||
|
||||
```powershell
|
||||
Connect-MgGraph -Scopes "User.Read.All", "AuditLog.Read.All"
|
||||
|
||||
$cutoff = (Get-Date).AddDays(-90)
|
||||
|
||||
Get-MgUser -Filter "userType eq 'Guest'" -All -Property DisplayName,Mail,CreatedDateTime,SignInActivity |
|
||||
ForEach-Object {
|
||||
$lastSignIn = $_.SignInActivity.LastSignInDateTime
|
||||
[PSCustomObject]@{
|
||||
Name = $_.DisplayName
|
||||
Email = $_.Mail
|
||||
Created = $_.CreatedDateTime
|
||||
LastSignIn = $lastSignIn
|
||||
DaysSinceSignIn = if ($lastSignIn) { ((Get-Date) - $lastSignIn).Days } else { "Never" }
|
||||
}
|
||||
} |
|
||||
Sort-Object DaysSinceSignIn -Descending |
|
||||
Format-Table -AutoSize
|
||||
```
|
||||
|
||||
**What you are looking for:** Guests who have not signed in for 90+ days. Guests you do not recognize (external parties from concluded projects or former vendors).
|
||||
|
||||
**Call us if:** The count of stale guests is growing quarter-over-quarter and nobody is pruning them. Or if a guest account appears that belongs to an external party from a concluded engagement and still has active access.
|
||||
|
||||
---
|
||||
|
||||
### Q7. Anonymous link count
|
||||
|
||||
**How to run:** Connect using PnP PowerShell (installed during engagement):
|
||||
|
||||
```powershell
|
||||
Connect-PnPOnline -Url "https://[tenant]-admin.sharepoint.com" -Interactive
|
||||
|
||||
$sites = Get-PnPTenantSite -IncludeOneDriveSites
|
||||
|
||||
$anonLinks = foreach ($site in $sites) {
|
||||
Connect-PnPOnline -Url $site.Url -Interactive
|
||||
Get-PnPSharingLinks | Where-Object { $_.SharingLinkType -eq "Anonymous" } |
|
||||
ForEach-Object { [PSCustomObject]@{ Site = $site.Url; Link = $_.ShareLink; Expires = $_.ExpirationDateTime } }
|
||||
}
|
||||
|
||||
Write-Host "Total anonymous links: $($anonLinks.Count)" -ForegroundColor Yellow
|
||||
$anonLinks | Sort-Object Site | Format-Table
|
||||
```
|
||||
|
||||
Record the count. Save the export.
|
||||
|
||||
**What you are looking for:** Count increasing quarter-over-quarter (means new anonymous links are being created despite the policy). Links with no expiration date.
|
||||
|
||||
**Call us if:** Count is increasing despite the restriction we put in place. Or if you find anonymous links on sites that hold sensitive data (HR, Finance, M&A).
|
||||
|
||||
---
|
||||
|
||||
### Q8. CA policy diff — detect drift
|
||||
|
||||
**How to run:**
|
||||
|
||||
```powershell
|
||||
# CAExporter is set up from the engagement — run from its directory
|
||||
.\CAExporter.ps1 -ExportPath "C:\SecurityRunbook\CA-Exports\CA-$(Get-Date -Format 'yyyy-MM-dd')"
|
||||
```
|
||||
|
||||
Then compare this quarter's export folder to last quarter's using any file diff tool (WinMerge, VS Code with the "compare folders" extension, or simply `Compare-Object` in PowerShell):
|
||||
|
||||
```powershell
|
||||
$old = Get-ChildItem "C:\SecurityRunbook\CA-Exports\CA-2026-04-01" -File | Select-Object -ExpandProperty Name
|
||||
$new = Get-ChildItem "C:\SecurityRunbook\CA-Exports\CA-2026-07-01" -File | Select-Object -ExpandProperty Name
|
||||
|
||||
Compare-Object $old $new
|
||||
```
|
||||
|
||||
Then for any policy that changed, open the JSON files and compare manually. The changed lines are the configuration drift.
|
||||
|
||||
**What you are looking for:** Policies deleted since last quarter. Policies whose parameters changed (exclusions added, scope narrowed, MFA grant changed to "grant without controls"). New policies in report-only mode that should have been enabled.
|
||||
|
||||
**Call us if:** Any CA policy has changed without a corresponding change record. A policy that was enforcing is now in report-only mode. A new exclusion was added to a critical policy (legacy auth block, admin MFA, device compliance).
|
||||
|
||||
---
|
||||
|
||||
## "Call us" trigger list
|
||||
|
||||
These are the situations where you stop, take a screenshot, and contact us — even outside a scheduled check:
|
||||
|
||||
| What you see | How urgent | What to do first |
|
||||
|---|---|---|
|
||||
| Break-glass alert fires unexpectedly | Immediate | Disable any active sessions for the break-glass account, then call us |
|
||||
| New Global Admin you did not create | Immediate | Do not remove it yet — screenshot first, then call us |
|
||||
| Synced account in Global Admin role | Same day | Do not change anything — screenshot and call us |
|
||||
| DCSync alert from Defender for Identity | Immediate | Isolate the source host from the network if possible, then call us |
|
||||
| External auto-forward rule found on any executive mailbox | Same day | Disable the rule, check for mail forwarded, call us |
|
||||
| PingCastle score drops more than 10 points | Within 48 hours | Send us the report alongside the previous quarter's |
|
||||
| Any alert sitting at High severity for more than 24 hours you do not know how to triage | Within 24 hours | Screenshot, note what the alert says, call us |
|
||||
| Backup restore fails or produces corrupt data | Same day | Do not delete anything — call us |
|
||||
| Something that feels wrong but is not on this list | Use your judgement | A wrong feeling is data. Document what you noticed and send it. We will tell you if it is nothing. |
|
||||
|
||||
---
|
||||
|
||||
## Tracking spreadsheet columns
|
||||
|
||||
Keep a simple spreadsheet (Excel or SharePoint list) with one row per check per quarter:
|
||||
|
||||
| Date | Check | Result / Count | vs. Last Quarter | Action taken | Escalated to consultant? |
|
||||
|------|-------|---------------|-----------------|--------------|--------------------------|
|
||||
|
||||
The trend matters more than any individual value. A metric that is consistently getting worse is a finding even if no single value crosses a threshold.
|
||||
|
||||
---
|
||||
|
||||
## When to schedule the next full engagement
|
||||
|
||||
Use this as a rule of thumb:
|
||||
|
||||
- **Annual:** Full adversarial validation (the engagement that produced this document). Recommended even if the monthly and quarterly checks are clean — they catch drift, not adversarial paths.
|
||||
- **Triggered:** Any time a "call us immediately" event fires, or PingCastle / Purple Knight produces a new Critical finding.
|
||||
- **Project-triggered:** Before any major change to the estate — AD migration, new cloud service onboarding, M365 license change, acquisition or merger, significant IT staff change.
|
||||
|
||||
---
|
||||
|
||||
*Self-service cadence for [client name]. Produced June 2026. Review and update January 2027 alongside the field guide update.*
|
||||
@@ -0,0 +1,194 @@
|
||||
# The Antifragile Handbook for M365 & Active Directory
|
||||
|
||||
## Book I — Principles & Judgement
|
||||
|
||||
> *Move fast and fix things.*
|
||||
|
||||
---
|
||||
|
||||
## Why this book exists
|
||||
|
||||
This is not a benchmark. It will not give you a number to report to a steering committee. It will not tell you that your tenant is 87% compliant, because that number is a lie that makes everyone feel safe while the building burns. Compliance frameworks — CIS, NIST, ISO, the lot — answer one question: *did you do the things on the list?* That is a useful question. It is not the important one. The important question is: **when this gets attacked, does it get weaker, stay the same, or get stronger?** A system that gets stronger from being stressed is antifragile. Almost no M365 + AD estate is antifragile by default. Most are the opposite: a flat domain synced to a cloud tenant, where one phished helpdesk account quietly becomes domain dominance becomes Global Admin. That is fragility wearing a compliance certificate. A consultant trained on benchmarks knows *what* the settings should be. A consultant trained on this book knows *which settings matter, why, and what breaks if they're wrong* — and can walk into a tenant they've never seen and find the thing that will actually kill the client. That is the difference between a technician and an independent professional. We are trying to raise the second kind.
|
||||
|
||||
### What "move fast and fix things" actually means
|
||||
|
||||
It is a deliberate edit of the old Silicon Valley creed. The original assumed things were whole and that breaking them was the cost of speed. Our world is the reverse: **the things are already broken.** Legacy auth is still on. Service accounts from 2014 still have domain admin. Nobody has tested the break-glass account since it was created. Speed, here, is not recklessness — it is refusing to let a thirty-page risk-acceptance process protect a fragility that a teenager with a phishing kit will remove for free. So:
|
||||
|
||||
- **Fast** — bias to action. A fix shipped this week beats a perfect fix discussed for a quarter. Fragility compounds while you deliberate.
|
||||
- **Fix** — actually change the structure, not the documentation. A risk you *accepted* is a risk you still have.
|
||||
- **Things that matter** — and this is the whole craft — the discrimination to know that disabling legacy auth outranks renaming forty GPOs to match a naming standard. Most of the checklist is noise. Find the signal.
|
||||
|
||||
### How compliance still fits (read this before you get smug)
|
||||
|
||||
We are not anti-compliance. We are anti-*thoughtless* compliance. Your clients have auditors, contracts, and regulators, and you will still help them pass. The relationship is this:
|
||||
|
||||
> **Compliance is a floor and a by-product. It is never the target.**
|
||||
|
||||
If you build an antifragile estate, you will pass CIS almost by accident, and you will be able to explain *why* every control exists — which is more than most auditors can. But you will also do things no benchmark asks for (game-days, kill-switch drills, deliberate removal of features) and you will *skip* things benchmarks demand when they add fragility or cost without reducing blast radius. When you skip, you skip **on the record, with a written reason**. That is the difference between independent judgement and laziness.
|
||||
|
||||
---
|
||||
|
||||
## The governing question
|
||||
|
||||
Before the principles, the one question that sits above all of them. Ask it of every account, every trust, every sync, every app registration:
|
||||
|
||||
> **If this is owned tonight, what is the largest thing an attacker reaches before hitting a wall — and can I draw that wall?**
|
||||
|
||||
If you cannot draw the wall, there is no wall. In M365 + AD the wall is almost always missing in the same place: the **identity bridge** between on-prem AD and Entra ID. Internalise this and half the job is done.
|
||||
|
||||
---
|
||||
|
||||
## The Principles
|
||||
|
||||
Nine of them. They overlap on purpose — antifragility is a way of seeing, not a checklist (the irony would be unbearable). Each comes with **judgement prompts**: the questions an independent consultant asks instead of looking up the "correct" value. Learn the questions, not the answers. The answers change with every tenant; the questions don't.
|
||||
|
||||
---
|
||||
|
||||
### 1. Via Negativa — subtract before you add
|
||||
|
||||
The strongest control is the thing that no longer exists. It cannot be misconfigured, cannot be exploited, cannot drift, and costs nothing to maintain. Benchmarks are addition machines — every control is something *more* to deploy and watch. Start the other way: what can we **delete**? In M365 + AD, the highest-leverage deletions are usually: legacy/basic auth, NTLM and unconstrained delegation, standing privileged role assignments, dormant service accounts and their static secrets, unused federation, public folders, orphaned app registrations with tenant-wide consent, and "temporary" firewall or CA exclusions that became permanent. **Judgement prompts**
|
||||
|
||||
- If I removed this control/feature/account, would *anyone* notice within 90 days? If not, why does it exist?
|
||||
- What is the oldest thing here still running, and who decided it should keep running — or did nobody decide?
|
||||
- Every exclusion is a tiny hole punched in a wall. List the exclusions. Who asked for each, and is that person still here?
|
||||
- Am I about to *add* a control to compensate for something I could *remove* instead?
|
||||
|
||||
---
|
||||
|
||||
### 2. The Barbell — protect the irreplaceable, let the rest stay cheap
|
||||
|
||||
Compliance scoring spreads effort evenly: every control worth the same point. Reality is not evenly distributed. A handful of things are irreplaceable — tenant root, Tier 0 / domain controllers, break-glass accounts, backups, the sync engine. Everything else is, in principle, rebuildable. Put **paranoid, expensive, redundant** protection on the irreplaceable few. Let everything else be **cheap, fast, and replaceable** — even disposable. Do not spend your political capital hardening a kiosk laptop while a Global Admin has no phishing-resistant MFA. The middle — moderate protection spread thinly over everything — is where budgets and attention go to die. **Judgement prompts**
|
||||
|
||||
- Name the five things in this estate that, if lost, cannot be rebuilt. Are they protected differently from everything else, or the same?
|
||||
- Where is effort being spent evenly that should be spent asymmetrically?
|
||||
- Is anything in the "cheap and replaceable" bucket actually load-bearing in disguise? (The "temporary" script on someone's laptop that runs payroll.)
|
||||
- Could I afford to let this thing be *destroyed* and just rebuild it? If yes, stop gold-plating it.
|
||||
|
||||
---
|
||||
|
||||
### 3. Blast Radius is the metric — not the control count
|
||||
|
||||
This is the governing question turned into a habit. Compliance counts inputs (controls present). Antifragility measures **propagation** (how far a compromise travels). A tenant with 200 controls and a flat AD→Entra trust is more fragile than a tenant with 50 controls and a real tier boundary. The defining fragility of hybrid M365 is **coupling**: Password Hash Sync or PTA, Entra Connect running as a quasi-Tier-0 service, AD admins who are also cloud admins, devices that are both domain-joined and the user's MFA device. Each coupling means one compromise becomes two. Antifragile design **decouples** — it turns the identity bridge from a conduit into a firebreak. **Judgement prompts**
|
||||
|
||||
- Draw the attack path from a single phished standard user to Global Admin. How many *independent* barriers are there? Independent, not "two MFA prompts from the same provider."
|
||||
- Which single account, if compromised, ends the engagement? How many are there? (If the answer is more than zero, that's the project.)
|
||||
- If on-prem AD fell completely, would the cloud survive — and vice versa? Or are they one organism wearing two badges?
|
||||
- What runs the sync, and what could that identity reach? Trace it.
|
||||
|
||||
---
|
||||
|
||||
### 4. Optionality — buy cheap escape hatches
|
||||
|
||||
Pay a small, certain cost now for the *option* to survive an uncertain disaster later. Break-glass accounts, a tested "kill the sync" runbook, a way to revoke all tokens at once, an offline copy of recovery keys, a documented path to a clean tenant. These look like waste to an auditor and like wisdom on the worst day of the client's year. Optionality is the opposite of optimisation. An optimised system has no slack and shatters at the first surprise. Deliberately keep some slack. **Judgement prompts**
|
||||
|
||||
- When the primary path fails, what's the second path — and has anyone walked it?
|
||||
- If we had to sever AD from Entra in the next 30 minutes to contain a breach, *how*? Is that written down where someone panicking can find it?
|
||||
- Break-glass: does it exist, is it phishing-resistant, is it excluded from the CA policy that would otherwise lock it out, and when was it last *used* in a drill (not just created)?
|
||||
- What are we optimising so hard that we've removed all room to manoeuvre?
|
||||
|
||||
---
|
||||
|
||||
### 5. Stress it on purpose — hormesis, not hope
|
||||
|
||||
Muscle, bone, and immune systems get stronger from controlled stress and weaker from protection. Systems are the same. **An untested control is a broken control** — you simply don't know it yet. The benchmark says "the setting is configured." The antifragile consultant says "we revoked the token at 14:00 on a Tuesday and watched what actually happened." Run game-days. Disable a CA policy and observe the fallout in a controlled window. Simulate Entra Connect failure. Pull a Global Admin's session. Kill a DC. You *want* to discover brittleness on a quiet afternoon, cheaply, with the right people watching — not at 3 a.m. during a real intrusion. **The corollary: declared state is not enforced state.** Underneath "untested = broken" sits a harder truth about *why* you must test — every representation the platform hands you (a config blade, an inventory record, a compliance dashboard, a green tick) is a **claim about reality, not reality itself**, and the two diverge silently and routinely. Two examples that should haunt you:
|
||||
|
||||
- A Conditional Access policy can display a flawless configuration and **enforce nothing** — the evaluated object has desynced from the one you're looking at. Every config review, export-diff, and benchmark audit passes. Only a real sign-in reveals it fails open. (Worked example in Book IV.)
|
||||
- A CMDB or device inventory shows a clean, managed fleet while the sign-in logs show a different, larger, partly-unknown population actually touching the data. The inventory is a wish; the authentication record is the fact. (Worked example in Book IV.)
|
||||
|
||||
So the rule that governs the whole craft: **verify by observation, never by inspection.** Trust what the system *does* under test over what any artefact *says* it does. Reading the config is not knowing the behaviour; counting the inventory is not knowing the fleet. Where the representation and the observed behaviour disagree, the behaviour is the truth and the representation is the bug. **Judgement prompts**
|
||||
|
||||
- What here has never once been tested by actually breaking it?
|
||||
- What do we *believe* is true about this estate that we've never verified by observation? (Belief is not evidence. The portal showing a green tick is not the same as the control firing under attack.)
|
||||
- Which "facts" about this estate come from a *representation* (config screen, CMDB, dashboard) rather than from *observed behaviour*? Which have we confirmed the system actually does, versus merely says?
|
||||
- Where would a silent divergence between declared and enforced state hurt most — and how would we even notice it?
|
||||
- When did this client last deliberately break something to learn from it? If "never," that's the most important finding in your report.
|
||||
- What's the smallest, safest experiment that would tell us whether X is real?
|
||||
|
||||
---
|
||||
|
||||
### 6. Every incident must change the structure
|
||||
|
||||
This is the actual definition of antifragile — *gaining from disorder.* A robust system survives a shock unchanged. An antifragile system comes out **structurally different and harder to hit the same way twice.** Pain that closes a ticket without changing the architecture is wasted pain, and it guarantees the same incident again. After every incident, near-miss, failed game-day, or even a noisy false positive: what *structural* thing changes? Not "we reminded users to be careful." A removed permission, a severed coupling, a new firebreak, a deleted feature. **Judgement prompts**
|
||||
|
||||
- For the last three incidents (or alerts) here — what changed in the *structure* afterwards? If the answer is "a training reminder," nothing changed.
|
||||
- Does this organisation treat incidents as embarrassments to bury or as fuel? (Blameless on people, ruthless on structure.)
|
||||
- Are we fixing the instance or the class? Patching this account, or removing the pattern that made it possible?
|
||||
- What did the last false positive *teach* us that we threw away?
|
||||
|
||||
---
|
||||
|
||||
### 7. Convexity — prefer bounded cost, unbounded upside
|
||||
|
||||
Choose controls whose downside is small and known, and whose upside is large and broad. Conditional Access is convex: cheap to run, fails gently, and one good policy blocks whole classes of attack. A sprawling, hand-tuned DLP ruleset is concave: expensive to maintain, brittle, and it fails in surprising, expensive ways at the worst moment. Favour the convex. Be deeply suspicious of any control that needs constant tending to keep working. **Judgement prompts**
|
||||
|
||||
- When this control fails, does it fail *safe and quietly*, or *open and catastrophically*? (Fail-open is concave and usually a trap.)
|
||||
- How much ongoing care does this need to keep working? High-maintenance controls rot the moment attention moves on.
|
||||
- Does this control block a *class* of attacks or just one specific instance? Prefer the class.
|
||||
- Are we buying a complex product to solve a problem that one CA policy and a deletion would solve?
|
||||
|
||||
---
|
||||
|
||||
### 8. Lindy — trust what has survived
|
||||
|
||||
The longer a mechanism has survived, the longer it's likely to keep working. Boring, time-tested controls (least privilege, network segmentation done right, hardware-backed keys, tiered admin) beat the newest preview blade in the portal. New features arrive with unknown failure modes and unknown attack surface; they have not yet been stress-tested by the world. Use them when they earn it, not because they're new. Equally: an attack technique that has worked for fifteen years (NTLM relay, Kerberoasting, consent phishing) will probably work next year — prioritise accordingly. **Judgement prompts**
|
||||
|
||||
- Is this control time-tested, or are we the QA team for a feature that shipped last month?
|
||||
- What are the oldest, most reliable attacks against this estate — and have we actually closed them, or chased novel ones while the classics stay open?
|
||||
- If this shiny feature vanished tomorrow, would we be exposed? If yes, we built on sand.
|
||||
- Are we solving a 2015 problem with a 2026 product because the product is new?
|
||||
|
||||
---
|
||||
|
||||
### 9. Skin in the game — whoever designs it, lives with it
|
||||
|
||||
Security theatre is what happens when the people imposing controls never carry the pager. A consultant who recommends a control they'd never have to operate is selling fragility dressed as diligence. The person who designs the break-glass process should be woken up by the drill. The architect who couples AD to Entra should be the one who has to uncouple it under fire. This applies to you. Don't recommend what you wouldn't run. Don't hand a client a 40-page hardening guide you've never operated. Your reputation is your skin in the game — stake it on advice that survives contact with reality. **Judgement prompts**
|
||||
|
||||
- Does the person who designed this control have to live with its consequences? If not, expect theatre.
|
||||
- Am I recommending this because it's right, or because it's defensible if something goes wrong? (Defensive medicine is fragility you can bill for.)
|
||||
- Would I bet my own reputation that this works under real attack? If I hesitate, why am I asking the client to bet theirs?
|
||||
- Who gets the 3 a.m. call when this fails — and were they in the room when it was designed?
|
||||
|
||||
---
|
||||
|
||||
## How to spot fragility (the field skill)
|
||||
|
||||
You will walk into estates with no documentation and no time. Fragility has a smell. Train your nose on these tells:
|
||||
|
||||
- **Folklore.** Configurations only one person understands, justified by "we've always done it that way." If they leave, it becomes un-auditable. Folklore is fragility with tenure.
|
||||
- **Single points of failure wearing a uniform.** One service account that runs everything. One admin who holds all the keys. One unreplicated DC. One sync server treated as cattle but actually a pet.
|
||||
- **Tight coupling.** Compromise one thing → automatically own a second. AD↔Entra, identity-device-MFA all on one phone, prod and admin in one forest.
|
||||
- **Things never tested.** Backups never restored. Break-glass never used. DR plans never run. "It should work" is the sound of a fragile system.
|
||||
- **Permanent "temporary."** Exclusions, exceptions, pilot configs, and risk acceptances older than 18 months.
|
||||
- **Even spreading.** Effort distributed uniformly is a sign nobody asked what matters. The barbell is missing.
|
||||
- **Green dashboards, untested reality.** Everything compliant, nothing ever stress-tested. The most dangerous estate of all, because it feels safe.
|
||||
|
||||
---
|
||||
|
||||
## The anti-benchmark: what we measure instead of compliance %
|
||||
|
||||
We don't score controls passed. If the client needs a number, give them these — and explain why each beats a compliance percentage:
|
||||
|
||||
- **Blast radius** — from a single phished standard user, how many independent barriers to tenant/domain dominance? (Higher is better. Most estates: zero or one.)
|
||||
- **Mean time to recover** — measured by *actually doing it* in a drill, not by the RTO written in a policy.
|
||||
- **Single points of failure** — counted, named, and owned. The goal is a shrinking list, not a green tick.
|
||||
- **Untested assumptions** — the number of load-bearing beliefs never verified by observation. The goal is to drive this toward zero.
|
||||
- **Time-to-remove** — how fast can we delete a fragilizer (legacy auth, a standing admin) once found? Velocity *is* a security metric.
|
||||
|
||||
None of these are easy to fake, which is exactly why they're worth measuring.
|
||||
|
||||
---
|
||||
|
||||
## How to use this handbook
|
||||
|
||||
Book I is the lens. The domain books that follow — Hybrid Identity, Privileged Access, Devices, Data & Collaboration, Recovery, Detection-as-feedback — each apply this same lens in the same shape:
|
||||
1. **Fragility inventory** — where does this domain break, and what's the blast radius?
|
||||
2. **Via negativa** — what do we remove first?
|
||||
3. **The barbell** — what gets paranoid protection, what stays cheap?
|
||||
4. **Optionality & recovery** — what are the escape hatches, and are they tested?
|
||||
5. **Stressor** — how do we deliberately break this to learn?
|
||||
|
||||
If you ever find yourself reaching for "because the benchmark says so," stop. Go back to the governing question. Draw the wall. If you can't draw it, you've found your work.
|
||||
|
||||
---
|
||||
|
||||
*Book I of the Antifragile Handbook. Principles over checklists. Judgement over obedience. Move fast and fix things.*
|
||||
@@ -0,0 +1,167 @@
|
||||
# The Antifragile Handbook for M365 & Active Directory
|
||||
|
||||
## Book II — Hybrid Identity
|
||||
|
||||
> *Draw the wall between on-prem and cloud. In most estates there isn't one — there's a hallway with the door propped open.*
|
||||
|
||||
---
|
||||
|
||||
## Why this is the keystone
|
||||
|
||||
If you only ever fix one domain, fix this one. Every other book — privileged access, devices, data — assumes identity holds. In a hybrid M365 + AD estate, identity usually doesn't hold, and the reason is always the same: on-prem AD and Entra ID are not two systems with a guarded border. They are **one organism wearing two badges**, joined by a bridge that most organisations cannot draw, do not monitor, and have never tested severing.
|
||||
|
||||
The governing question, applied here:
|
||||
|
||||
> **If on-prem AD is ransomwared or domain-dominated tonight, does the cloud survive — or is it already poisoned by inheritance?**
|
||||
|
||||
For the overwhelming majority of estates the honest answer is "poisoned," and nobody has ever said it out loud. Your job is to say it out loud, then build the wall.
|
||||
|
||||
---
|
||||
|
||||
## 1. Fragility inventory — anatomy of the bridge
|
||||
|
||||
You cannot harden what you can't draw. Here is the bridge, piece by piece, with the blast radius of each. Learn to find all of these on day one of an engagement.
|
||||
|
||||
### The sync engine (the single most dangerous server you'll forget about)
|
||||
|
||||
Entra Connect Sync (the old Azure AD Connect) or Entra Cloud Sync runs the synchronisation. Whatever the diagram says, **this server is Tier 0** — because of the accounts it holds:
|
||||
|
||||
- **The on-prem connector account.** Under the old "express" install, this account was granted *Replicate Directory Changes* and *Replicate Directory Changes All* — which is **DCSync**. That means the sync server holds an identity that can pull every password hash in the domain. Read that again. The box your infra team treats as a middling utility VM can dump the entire domain.
|
||||
- **The Entra connector account** (Directory Synchronization Accounts role) — can manipulate synced objects in the cloud.
|
||||
|
||||
So: compromise the sync server → DCSync on-prem **and** tamper with cloud objects. One box, both kingdoms. If this server is domain-joined to the production domain (it usually is), then anything that reaches prod-tier reaches your DCSync machine. That is the central coupling of the entire estate.
|
||||
|
||||
**Where it's worse than you think:** the sync server is often internet-facing for updates, runs a local SQL Express nobody patches, sits on an OS build from the project that installed it, and has not had its connector account rights reviewed since go-live.
|
||||
|
||||
### The authentication method (decides whether the cloud lives or dies with AD)
|
||||
|
||||
Three options, three completely different fragility profiles. Know which one you're actually on before you say anything — the diagram and the reality often disagree.
|
||||
|
||||
- **Password Hash Sync (PHS).** A hash-of-a-hash is synced to Entra; the cloud can authenticate on its own. *This is the most resilient for availability* — if on-prem dies, cloud auth keeps working. The transport is fine and not trivially reversible to the plaintext password; the risk is **not** "PHS leaks passwords," it's that the connector account doing the sync can DCSync. Don't let anyone fragilise availability to "fix" a risk that lives in the connector account, not the hash.
|
||||
- **Pass-through Authentication (PTA).** Credentials are validated against on-prem AD in real time by PTA agents. **Coupling: on-prem outage = cloud auth outage.** Worse, the agent must handle the credential to validate it, so a compromised PTA agent is a plaintext-credential harvesting position. PTA agents are Tier 0 and a juicy target, and PTA is a conduit, not a firebreak. (You can enable PHS *alongside* PTA as failover — cheap optionality, see §4.)
|
||||
- **Federation / AD FS.** The catastrophe. See below — it gets its own treatment because it's usually the single largest fragility in the estate.
|
||||
|
||||
### AD FS and Golden SAML (the thing that ends careers)
|
||||
|
||||
If AD FS issues tokens, then the **token-signing key** can forge a SAML assertion for *any* user — including bypassing MFA when MFA is enforced at the federation layer — and the cloud will trust it because it's validly signed. This is **Golden SAML**. It is how nation-state actors turned a single on-prem foothold into silent, total, persistent cloud impersonation (the SolarWinds intrusions). It is nearly invisible: the IdP is forging legitimate tokens, so there's no failed login, no anomalous password, nothing for a benchmark to catch.
|
||||
|
||||
The token-signing certificate is a single catastrophic point of failure that most orgs never rotate, store poorly, and don't monitor. If you take one thing from this book: **AD FS is fragility incarnate, and the correct long-term answer is to remove it** (§2), not to harden it.
|
||||
|
||||
### Seamless SSO (the forgotten Kerberos key)
|
||||
|
||||
Seamless SSO creates the `AZUREADSSOACC` computer account in AD. Its Kerberos decryption key, if never rotated (it usually never is), is a silver-ticket / token-forging exposure. Classic Lindy fragility: old, unrotated, forgotten, exploitable.
|
||||
|
||||
### The writebacks (reverse conduits nobody counts)
|
||||
|
||||
Every writeback turns the bridge two-way and creates *reverse* blast radius:
|
||||
|
||||
- **Password writeback** — cloud SSPR can change on-prem passwords. Useful; also a path from cloud to on-prem.
|
||||
- **Device writeback / group writeback** — cloud objects written into AD. Group writeback (v2), where cloud security groups become AD objects that gate on-prem resource access, means a **cloud group compromise now affects on-prem access** — a coupling people rarely diagram.
|
||||
|
||||
Each writeback may be justified. None should be silent. Count them, name the blast radius of each.
|
||||
|
||||
### The admin coupling (one organism, two badges)
|
||||
|
||||
The deepest fragility isn't a setting, it's the people and accounts:
|
||||
|
||||
- The same humans are Domain Admins **and** Global Admins.
|
||||
- Cloud admin accounts are **synced from on-prem**, so on-prem compromise → harvest → cloud admin.
|
||||
- Admins use the same workstation for AD and Entra, and that workstation is also their email/MFA device.
|
||||
|
||||
If on-prem privilege flows into cloud privilege through any of these, there is no wall. There's a hallway.
|
||||
|
||||
### Source of authority (why you can't fix it in the cloud)
|
||||
|
||||
For synced objects, **on-prem is authoritative**. You cannot durably fix a synced object purely cloud-side; the next sync cycle overwrites you. This matters enormously in incident response: if AD is owned, your cloud objects are downstream of poison and "just fix it in Entra" doesn't hold.
|
||||
|
||||
---
|
||||
|
||||
## 2. Via negativa — what to remove (in priority order)
|
||||
|
||||
Hybrid identity is where subtraction pays the highest dividend in the whole estate. In rough order of leverage:
|
||||
1. **Remove AD FS. Migrate to cloud authentication** (PHS, or PTA if you have a hard real-time-validation requirement), and move MFA and access decisions to Conditional Access in Entra where they belong. This deletes Golden SAML as a class, shrinks attack surface massively, and removes a SPOF you were never rotating anyway. This is the single highest-leverage deletion in this book.
|
||||
2. **Stop syncing privileged on-prem accounts to the cloud.** Domain Admins, Enterprise Admins, Tier 0 — filter them *out* of sync scope. They have no business being cloud objects. A synced privileged account is a free bridge for the attacker.
|
||||
3. **Make cloud admins cloud-only.** Global Admins and other Entra privileged roles should be cloud-only accounts (`.onmicrosoft.com`), phishing-resistant, never derived from or synced with on-prem identity. This is the firebreak in one move (see §3).
|
||||
4. **Trim the writebacks.** Keep only the ones with a named owner and a justified reverse blast radius. Delete the rest.
|
||||
5. **Rotate or remove Seamless SSO.** If you don't need it, remove the `AZUREADSSOACC` account. If you keep it, rotate the key on a schedule — and the fact that nobody has is itself a finding.
|
||||
6. **Reduce sync scope.** OU-filter aggressively. Don't sync what the cloud doesn't need. Every synced object is attack surface and a potential bridge. The default "sync everything" is laziness, not architecture.
|
||||
|
||||
For each deletion the test from Book I applies: *if I removed this, would anyone notice in 90 days?* For AD FS the honest answer, after migration, is usually "no — and the attackers will notice it's gone."
|
||||
|
||||
---
|
||||
|
||||
## 3. The barbell — what gets paranoia, what stays cheap
|
||||
|
||||
**The irreplaceable few (paranoid protection, redundancy, monitoring):**
|
||||
|
||||
- **The sync server.** Treat it as Tier 0 *in practice*, not just on the diagram: dedicated admin tier, no internet browsing, hardened OS, least-privileged connector account (use a gMSA; strip DCSync rights if your topology allows the scoped permission model), restricted logon, alerting on the connector account's behaviour.
|
||||
- **The connector accounts.** Least privilege, gMSA where supported, monitored. An account that can DCSync should scream in your SIEM if it ever behaves like a domain controller from the wrong host.
|
||||
- **The AD FS token-signing key** — if AD FS still exists, the key belongs in an HSM, monitored, rotated on a real schedule (remember the rollover cert). But the better barbell move is §2.1: don't own this liability at all.
|
||||
- **Cloud-only break-glass Global Admins** (from Book I) — phishing-resistant, excluded from the CA policy that would lock them out, tested.
|
||||
|
||||
**The firebreak — the one design decision that builds the wall:**
|
||||
|
||||
> **Cloud privilege must not be reachable from on-prem compromise.**
|
||||
|
||||
Cloud-only admin accounts + not syncing privileged on-prem accounts + separate privileged workstations = on-prem can fall completely and the attacker still hits a wall at the cloud admin boundary. *That wall is the entire point of this book.* Draw it, then verify an attacker can't walk around it through the sync server (which is why the sync server is in the paranoid bucket).
|
||||
|
||||
**Everything else stays cheap.** Standard user sync, normal device registration, the bulk of the directory — these are replaceable and don't deserve the attention that the sync server and the admin boundary demand. Don't gold-plate the directory while the connector account can dump it.
|
||||
|
||||
---
|
||||
|
||||
## 4. Optionality & recovery — escape hatches, tested
|
||||
|
||||
- **The "kill the sync" runbook.** A written, rehearsed procedure to stop sync fast when on-prem is compromised, so poison stops flowing cloud-ward. Know the nuance per auth method, because severing behaves differently:
|
||||
- *PHS:* disabling sync stops new changes flowing, but already-synced hashes remain — containment of *propagation*, not instant revocation. Pair with token revocation and credential resets.
|
||||
- *PTA / Federation:* severing the bridge can take cloud auth down with it unless you've pre-staged a fallback. Which is why —
|
||||
- **Pre-stage the federated-to-managed conversion.** Know, in advance, how to convert the domain from federated (or PTA) to managed/cloud auth (PHS) *fast*, so that during an on-prem incident you can cut the dependency and keep the cloud alive on its own. Rehearse it. "We think we could" is not a plan.
|
||||
- **PHS as failover under PTA.** Cheap optionality: run PHS alongside PTA so a PTA-agent or on-prem outage doesn't lock everyone out of the cloud. Small certain cost now, large uncertain payoff later. Classic Book I optionality.
|
||||
- **Cloud-only admin path that survives AD death.** Because cloud admins are cloud-only (§3), you retain full control of the tenant even if AD is gone. This *is* the recovery path — verify it actually works without any on-prem dependency (including MFA that doesn't secretly route through on-prem).
|
||||
- **Accept the source-of-authority reality.** Your IR plan must account for the fact that synced objects are downstream of on-prem. Decide *in advance* whether, during a domain-dominance incident, you sever first and rebuild authority cloud-side. Discovering this mid-incident is how recoveries fail.
|
||||
|
||||
---
|
||||
|
||||
## 5. Stressor — break it on purpose
|
||||
|
||||
Untested = broken. Game-days for hybrid identity, smallest/safest first:
|
||||
|
||||
- **Pull the sync server** (planned window). Does cloud auth survive? The answer *proves* which auth method you're really on and whether your availability assumptions are true. Most teams are surprised. That surprise is the point.
|
||||
- **Revoke / disable the connector account and watch your SIEM.** Did anything alert? An account that can DCSync going dark, or behaving oddly, should be the loudest alarm you own. If nothing fired, you've found a detection gap worth more than any control you could add.
|
||||
- **Golden SAML tabletop** (if AD FS exists). Walk through: attacker has the token-signing key — what do you detect, how do you contain, how fast can you rotate, and could you tell at all? If the honest answer is "we couldn't tell," escalate the §2.1 removal from "roadmap" to "now."
|
||||
- **Break-glass under sync-down.** Test the cloud-only break-glass account *while the bridge is severed*. It must work with zero on-prem dependency. If it silently relied on something on-prem, you just found it on a Tuesday instead of during the breach.
|
||||
- **DCSync detection drill.** Have someone simulate DCSync from an unexpected host and confirm detection fires. The connector account is the one place DCSync is "normal," which is exactly why attackers love to look like it.
|
||||
|
||||
Every one of these, per Book I principle 6: whatever breaks must produce a **structural** change, not a calendar reminder.
|
||||
|
||||
---
|
||||
|
||||
## Honest uncertainty (read this, don't trust a handbook on moving parts)
|
||||
|
||||
This book teaches stable mechanisms — the coupling between AD and Entra, Golden SAML, the DCSync-via-connector path, the PHS/PTA/federation trade-offs. Those don't change much; they're Lindy.
|
||||
|
||||
What **does** move, and what you must verify against current Microsoft documentation rather than trusting any 2026-vintage handbook:
|
||||
|
||||
- **Connect Sync vs Cloud Sync feature parity.** Microsoft has been steering new deployments toward the lighter Cloud Sync agent (no SQL, multiple agents for HA — better optionality), but parity for specific scenarios (certain writebacks, device sync, large/complex topologies, passthrough nuances) has been evolving. **Check the current parity matrix before you recommend a migration.** Don't let me, or any document, freeze this for you.
|
||||
- **AD FS deprecation / migration tooling.** Direction of travel is clearly away from AD FS toward Entra-native auth, with staged-rollout and migration tooling to ease it. Exact timelines, tool capabilities, and supported paths shift — verify current state when you scope the work.
|
||||
- **Connector account hardening guidance** (gMSA support, least-privilege permission models, the scoped alternative to full DCSync rights) continues to improve — confirm what's available for your topology and version.
|
||||
|
||||
If a client's safety depends on a current-version specific, **look it up and cite it**, don't quote your memory or this book. Honest "I need to verify the current parity" beats confident and wrong every time. That's not weakness; that's the job.
|
||||
|
||||
---
|
||||
|
||||
## Consolidated judgement prompts
|
||||
|
||||
The questions to carry into any hybrid estate:
|
||||
|
||||
- Which auth method are we *actually* on — and does the cloud survive on-prem death? (Verify by testing, not by asking.)
|
||||
- Is the sync server Tier 0 in practice or only on the diagram? What can its connector account reach? Can it DCSync?
|
||||
- Are any privileged on-prem accounts synced to the cloud? Are Global Admins cloud-only or synced?
|
||||
- Can on-prem privilege reach cloud privilege by *any* path — accounts, workstations, the sync server, writebacks? Draw every path. Each one is a hole in the wall.
|
||||
- Do we have AD FS? *Why?* What exactly would removing it take, and what's the honest reason it hasn't happened?
|
||||
- When was the Seamless SSO key / AD FS token-signing cert last rotated? ("Never" is a finding, not an answer.)
|
||||
- Which writebacks are on, and what reverse blast radius does each create?
|
||||
- If we severed the bridge in the next 30 minutes, what breaks, and is the procedure written where someone panicking can run it?
|
||||
|
||||
---
|
||||
|
||||
*Book II of the Antifragile Handbook. The wall between on-prem and cloud is the most important structure you will ever draw — because in most estates, it isn't there. Move fast and fix things.*
|
||||
@@ -0,0 +1,147 @@
|
||||
# The Antifragile Handbook for M365 & Active Directory
|
||||
|
||||
## Book III — Privileged Access
|
||||
|
||||
> *Privilege is blast radius with a time axis. Standing privilege reaches everything, forever. The whole job is to collapse both: less reach, less time.*
|
||||
|
||||
---
|
||||
|
||||
## The governing question
|
||||
|
||||
Book I asked you to draw the wall. Book II built it between on-prem and cloud. This book is about the credentials that can knock any wall down. Ask of every privileged identity — human, service account, or app:
|
||||
|
||||
> **If this credential leaks tonight, how long does it stay useful, and how far does it reach?**
|
||||
|
||||
A permanent Domain Admin answers *"forever, everything."* A permanent Global Admin answers *"forever, the whole tenant."* A JIT, scoped, time-boxed role answers *"for one hour, for one task."* Every technique in this book exists to turn the first kind of answer into the second. That's it. That's the whole craft of privileged access: **shrink the reach, shrink the time.**
|
||||
|
||||
Compliance counts whether you "have a PAM solution." Wrong question. The question is whether privilege *evaporates when not in use* and whether a leaked credential hits a wall in minutes instead of owning the estate forever.
|
||||
|
||||
---
|
||||
|
||||
## 1. Fragility inventory — where privilege rots
|
||||
|
||||
### Standing privilege (the original sin)
|
||||
|
||||
An account that is *always* an admin is a loaded gun left on the table, every hour of every day, whether anyone's using it or not. Its blast radius is constant and maximal. Permanent Domain Admins, permanent Enterprise Admins, permanent Global Admins — every one of them is a credential whose value to an attacker never drops to zero. **The single most important number in this book is: how many identities hold standing privilege?** In most estates it's an order of magnitude too high, and nobody has ever counted.
|
||||
|
||||
### Service accounts and service principals (the dark matter)
|
||||
|
||||
This is where the bodies are buried, on both sides of the wall:
|
||||
|
||||
- **On-prem service accounts** — over-permissioned ("we made it Domain Admin to make it work"), static passwords that haven't changed since 2016, an SPN attached so they're **Kerberoastable** (request the ticket offline, crack the weak password at leisure), owned by nobody, documented nowhere, and impossible to turn off because something unknown will break.
|
||||
- **Cloud service principals / app registrations** — the same disease in a new body. Client secrets that never expire, **tenant-wide admin consent**, and Microsoft Graph permissions that are quietly catastrophic: `RoleManagement.ReadWrite.Directory`, `AppRoleAssignment.ReadWrite.All`, `Application.ReadWrite.All` — any of which is a privilege-escalation path to Global Admin. Service principals **cannot do MFA**, usually hold **standing** privilege, and live in a blind spot no benchmark looks at hard enough.
|
||||
|
||||
Service identities are dark matter: most of the privileged mass of the estate, invisible in the usual diagrams, and gravitationally dominant when something goes wrong.
|
||||
|
||||
### Tier violations (the wall with a hole kicked in it)
|
||||
|
||||
The Lindy core of on-prem security is the tier model (Tier 0 = identity control plane: DCs, AD, ADCS, the sync server from Book II; Tier 1 = servers; Tier 2 = workstations). Microsoft has since reframed it as the Enterprise Access Model reaching into the cloud, but the rule never changed:
|
||||
|
||||
> **A higher-tier credential must never be exposed on a lower-tier system.**
|
||||
|
||||
Every Domain Admin who RDPs into a workstation, every admin whose daily-driver laptop also touches a DC, every shared jump box used for both Tier 0 and Tier 1 — that's a tier violation, and it's how `pass-the-hash` / `pass-the-ticket` turns one phished workstation into domain dominance. The clean-source principle is absolute: **you cannot securely manage a system from a less-secure one.**
|
||||
|
||||
### The escalation plumbing nobody maps
|
||||
|
||||
- **AD ACL backdoors** — who can reset whose password, who has `WriteDACL` / `GenericAll` on what. Privilege hides in object permissions, not just group membership. Attackers map this in minutes; defenders rarely map it at all.
|
||||
- **Delegation** — unconstrained delegation is a standing golden-ticket risk; constrained/RBCD misconfigurations are escalation paths.
|
||||
- **ADCS** — the certificate services escalation paths (the ESC-series misconfigurations) turn a forgotten CA template into domain compromise. ADCS is **Tier 0** and is almost always treated as Tier 1 or forgotten entirely.
|
||||
- **KRBTGT** — the master key behind golden tickets. Rarely rotated; if an attacker ever had it, they may still have it.
|
||||
- **LAPS absent** — without per-machine local admin password randomisation, one cracked local admin hash unlocks lateral movement across every machine sharing it.
|
||||
|
||||
### The recovery paradox
|
||||
|
||||
The accounts that can rebuild the estate after a disaster are, by definition, the most powerful — and therefore the most valuable to an attacker. Break-glass done carelessly is just standing privilege with a heroic name. (Handled in §4.)
|
||||
|
||||
---
|
||||
|
||||
## 2. Via negativa — what to remove (in priority order)
|
||||
|
||||
Privilege is the domain where deletion is the entire strategy. Adding "privileged access controls" on top of unmanaged standing privilege is rearranging furniture in a burning room.
|
||||
1. **Eliminate standing privilege.** Roles become *eligible*, not *active*. Cloud-side this is PIM (§3). On-prem it's harder and the tooling is weaker — be honest about that (§ honest uncertainty) — but time-bound group membership and JIT elevation tooling exist; use them. The target state: at rest, almost nobody is an admin.
|
||||
2. **Empty the top groups toward the irreducible minimum.** Drive Domain Admins, Enterprise Admins, and standing Global Admins down to the smallest number that reality permits (plus break-glass). Delegate specific rights instead of handing out god-mode. "Empty Domain Admins" is an achievable goal, not a fantasy.
|
||||
3. **Kill, convert, or constrain service identities.** Remove the ones nobody can justify (apply the 90-day-scream test). Convert the rest to managed identities — **gMSA** on-prem (the established, Lindy fix: automatic password rotation, no static secret, not Kerberoastable in the same way), **managed identities** in Azure where possible. Strip every excess right. For app registrations: remove the dangerous Graph permissions, expire and rotate secrets, prefer certificate credentials or managed identities over secrets, and delete unused registrations and stale consent grants.
|
||||
4. **Remove tier violations.** No high-tier credential on a low-tier box, ever. This is mostly subtraction — taking admin rights *off* daily-driver machines and shared boxes.
|
||||
5. **Fix the escalation plumbing by removal.** Decommission unused ADCS templates, remove unconstrained delegation, prune dangerous ACLs, deploy LAPS so standing shared local admin passwords cease to exist.
|
||||
6. **Remove standing local admin from users.** Most don't need it. The ones who think they do usually need it for ten minutes a month — which is a JIT problem, not a standing-rights problem.
|
||||
|
||||
---
|
||||
|
||||
## 3. The barbell — paranoia for the control plane, cheap for the rest
|
||||
|
||||
**The irreplaceable few (paranoid, redundant, monitored):**
|
||||
|
||||
- **Tier 0** — DCs, AD, ADCS, KRBTGT, and the sync server from Book II. This is the control plane; if it falls, everything falls.
|
||||
- **The handful of break-glass Global Admins** (§4).
|
||||
- **The PIM / role-management configuration itself** — because whoever controls *who can become admin* is effectively admin. Privileged Role Administrator and Privileged Authentication Administrator are crown roles; treat them as such.
|
||||
|
||||
**Paranoid protection for privileged work means, non-negotiably:**
|
||||
|
||||
- **PAWs — the principle and the practical reality.** The principle: all Tier 0 / Global Admin work from a clean, hardened, single-purpose device that never reads email or browses the web. The admin's normal laptop is Tier 2. This is right. The practical reality: physical PAWs almost never get deployed. The hardware procurement, the second device on the desk, the behaviour change — all of it defeats the project before it starts. The deployable alternative that preserves the essential properties is a **cloud-hosted admin workstation** — a Windows 365 or Azure Virtual Desktop VM provisioned from a hardened template, enrolled in the management overlay, used only for privileged tasks. The admin connects from their normal device via browser or RDP. Privileged credentials live in the cloud VM, not on the admin's local device. If the VM is compromised: wipe it, reprovision from template in 20 minutes. The security property is the same — credentials isolated from the daily-use device — without the hardware problem. This is the practical PAW. Recommend it before recommending a dedicated physical device; it will actually get deployed.
|
||||
- **The management overlay** connects the admin workstation (cloud VM or physical PAW) to the systems it manages without exposing those systems to the general network. The T0/T1 split matters here and maps directly to the tier model: T0 systems (DCs, ADCS, sync server) get an overlay with no external runtime dependency (Nebula with pre-distributed certificates); T1 systems (member servers, cloud workloads, multi-cloud resources) get an overlay with identity-aware access and per-session MFA (Tailscale with Entra OIDC). The realistic T0 node count for a 5,000-person organisation is 15–25 nodes — small enough to manage with a documented certificate ceremony and a spreadsheet, not a full PKI team. The management overlay is what makes remote and hybrid admin work possible without either a traditional VPN's flat-network problem or physical-presence-only access.
|
||||
- **Phishing-resistant MFA only** for admins — FIDO2 / passkeys / certificate-based. SMS and push-approve are not admin-grade; they're phishable, and admins are the phishing prize. For the management overlay, this means Tailscale configured with key expiry and an Entra OIDC IdP enforcing FIDO2 — so the WireGuard device trust and a per-session identity assertion are both present, not just the device key.
|
||||
- **Separate, cloud-only privileged identities** for cloud admin (the Book II firebreak, enforced here). On-prem admin identity must not be the cloud admin identity.
|
||||
- **JIT for everything** via PIM: eligible-not-active, time-boxed, MFA on activation, justification logged, and **approval workflow on the crown roles**.
|
||||
- **Conditional Access scoped to admins** — privileged roles usable only from PAWs / compliant devices / named locations.
|
||||
|
||||
**Everything else stays cheap.** Standard RBAC, normal user access, ordinary app permissions — don't pour the privileged-access budget evenly across the whole directory. Concentrate it ferociously on the tiny set of identities that own the control plane. A thousand hardened standard users won't save you if one permanent Domain Admin uses `Password1!` on a Kerberoastable SPN.
|
||||
|
||||
---
|
||||
|
||||
## 4. Optionality & recovery — escape hatches, tested
|
||||
|
||||
- **Break-glass done right.** This is the deliberate exception to "no standing privilege" — you *need* an account that works when PIM, MFA infrastructure, or the IdP is down. So it's standing by necessity, which means it is protected differently: cloud-only, phishing-resistant credential stored offline/split, excluded from the CA policy that would otherwise lock it out, and **wired so that any use at all triggers a screaming alert.** Standing privilege you can't remove, you watch like a hawk. And you **test it** — an untested break-glass account is Schrödinger's recovery.
|
||||
- **KRBTGT rotation on demand.** Can you rotate KRBTGT (twice, with the required interval) the moment you suspect golden tickets — without taking the forest down? Is it rehearsed? If not, you have a theoretical control, not a real one.
|
||||
- **Fast session revocation / admin disable.** A one-move way to kill a compromised admin's sessions and tokens and disable the account, on both sides of the wall. Rehearse it; the breach is not the time to discover the command.
|
||||
- **No single human as the only recovery path** — balanced against blast radius. You want enough redundancy that one person under a bus (or under coercion) doesn't end recovery, without so many standing admins that you've recreated the problem. The barbell, again.
|
||||
- **Tier 0 / forest rebuild path** — links forward to Book V (Recovery). Know it exists, know it's been tested, know it doesn't secretly depend on a credential that the incident just compromised.
|
||||
|
||||
---
|
||||
|
||||
## 5. Stressor — break it on purpose
|
||||
|
||||
- **Pull an admin's standing access and route them through PIM for a week.** Does real work still flow? If JIT activation is too slow or broken, people will route around it — and you'll have found that in a drill instead of discovering the shadow standing-admin account they created in revenge.
|
||||
- **Kerberoast yourself.** Run the attack against your own directory. Which service accounts crack? Did anything *detect* the ticket requests? Two findings in one cheap test.
|
||||
- **Attempt a tier violation in a test window.** Try to use a Tier 0 credential on a Tier 2 box. Is it blocked? Detected? Silent? Silence is the worst answer and the most common.
|
||||
- **Run attack-path analysis as routine, not as a once-a-year pentest.** Tools that map "who can reach Domain Admin / Global Admin in N hops" turn privilege escalation into a number you can track over time. **The count of paths to domain/tenant dominance is a better security metric than any compliance percentage.** Drive it down; watch it not creep back up.
|
||||
- **Simulate a malicious consent grant / over-permissioned app.** Register an app requesting a dangerous Graph scope. Does anything flag it? Can you find every existing app holding those scopes today? (You should be able to. Most can't.)
|
||||
- **Break-glass drill** — yes, again, and on a schedule. The recurring test in this whole handbook.
|
||||
|
||||
Per Book I principle 6: each of these must yield a **structural** change — a removed right, a severed path, a new alert — not a note that says "be careful."
|
||||
|
||||
---
|
||||
|
||||
## Honest uncertainty (the moving parts — verify, don't trust this book)
|
||||
|
||||
Stable and Lindy (teach with confidence): standing privilege is the core risk; the tier / clean-source model; Kerberoasting, pass-the-hash, golden/silver tickets, DCSync; the gMSA pattern; JIT/eligibility as the goal. These don't churn.
|
||||
|
||||
What moves, and what you must verify against current Microsoft documentation:
|
||||
|
||||
- **The management overlay pattern** (covered in §3 above) is stable in principle — the T0/T1 split, the clean-source reasoning for isolating the management plane, the cloud admin VM as the deployable PAW substitute. What moves: the specific tooling. Nebula's CA and ACL model, Tailscale's per-session MFA configuration and OIDC integration, and the Windows 365 / AVD provisioning model all evolve. Verify current implementation guidance before deploying, and confirm Tailscale's key-expiry and IdP enforcement behaviour is still available as described.
|
||||
|
||||
- **PIM capabilities, role definitions, and the risk classification of specific Graph permissions** evolve continually. Confirm which scopes are escalation-grade *today* rather than trusting a 2026 list.
|
||||
- **On-prem JIT/PAM tooling is genuinely weaker and more fragmented than the cloud story.** Native time-bound group membership, MIM PAM, and third-party PAM all have trade-offs that shift. Don't promise a client a clean AD-native JIT experience without checking current reality — and be honest that on-prem eligibility is harder than PIM makes cloud look.
|
||||
- **gMSA vs dMSA.** gMSA is the established, Lindy answer for managed service accounts. **dMSA** (delegated managed service accounts, introduced with the Windows Server 2025 generation) targets the real gap — migrating a standing service account and disabling the original — but newer mechanisms carry newer attack surface, and there has been published privilege-escalation research against the dMSA migration path. **Verify current patch and hardening guidance before you recommend dMSA**; this is exactly the kind of new-and-shiny that Book I principle 8 warns about. gMSA until you've checked dMSA's current state.
|
||||
- **Enterprise Access Model vs the classic three-tier model** — same logic, evolving names and cloud extensions. Use whichever vocabulary the client knows; don't get religious about the label.
|
||||
|
||||
If a client's safety hinges on a current specific, look it up and cite it. "I need to verify the current Graph permission classification" beats confidently quoting a stale one. That posture *is* the independence this handbook is trying to build.
|
||||
|
||||
---
|
||||
|
||||
## Consolidated judgement prompts
|
||||
|
||||
- How many identities hold **standing** privilege — human, service account, and service principal — counted, named, and owned? (If you can't produce the number, that's finding #1.)
|
||||
- For each privileged credential: leaked tonight, how long is it useful and how far does it reach? Where's the wall?
|
||||
- Where are the tier violations? Which high-tier credentials touch low-tier systems? Does any admin's daily laptop reach Tier 0?
|
||||
- Which service accounts are Kerberoastable? Which app registrations hold escalation-grade Graph permissions or non-expiring secrets?
|
||||
- Are cloud admins cloud-only and phishing-resistant, or synced and push-MFA'd? (Book II firebreak — verify it's actually enforced here.)
|
||||
- Does privilege **evaporate when idle** (PIM/JIT) or sit loaded on the table?
|
||||
- Is ADCS treated as Tier 0? When was KRBTGT last rotated? Is LAPS deployed?
|
||||
- Break-glass: does it exist, is it monitored to scream on use, and when was it last *tested* — not created, tested?
|
||||
- How many paths to Domain Admin / Global Admin exist right now, and is that number going up or down?
|
||||
- What does an admin use to reach a domain controller remotely — and if that path is compromised, what does the attacker get? Is the management access path independent of the estate it manages?
|
||||
- Are privileged credentials ever typed into or stored on a device that is also used for email and browsing? If yes, the session isolation that PAWs are meant to provide does not exist, regardless of what the policy says.
|
||||
|
||||
---
|
||||
|
||||
*Book III of the Antifragile Handbook. Privilege is blast radius with a clock on it. Shrink the reach, shrink the time, and watch the credentials that can rebuild the world. Move fast and fix things.*
|
||||
@@ -0,0 +1,172 @@
|
||||
# The Antifragile Handbook for M365 & Active Directory
|
||||
|
||||
## Book IV — Devices & Endpoint (Intune)
|
||||
|
||||
> *The device will be compromised. Compliant is not the same as secure, and the portal toggle is not the same as the device's behaviour. Build for the compromise, not against it.*
|
||||
|
||||
---
|
||||
|
||||
## The governing question
|
||||
|
||||
Most endpoint programmes are built on a wish: *make the device trusted.* That wish is unwinnable — a device in a user's hand, on a network you don't control, running an OS you didn't write, will eventually be compromised, and no amount of hardening changes that. So flip the question:
|
||||
|
||||
> **Assume every device is already compromised. What still holds?**
|
||||
|
||||
If the answer is "nothing, because a compromised-but-compliant device gets full access," you've built fragility with a green tick on it. The antifragile endpoint posture stops trying to own the device and instead builds a boundary that **survives an untrusted device**: the data lives behind a wall, the device is cheap and disposable, and "compliant" is treated as what it actually is — a *signal that can be wrong*, not a guarantee.
|
||||
|
||||
That reframe — **compliance is a signal, not a checkbox** — is the spine of this whole book.
|
||||
|
||||
---
|
||||
|
||||
## 1. Fragility inventory — where the endpoint betrays you
|
||||
|
||||
### The fleet is a fiction: managed, unmanaged, shadow, dark
|
||||
|
||||
Before any of the controls below mean anything, confront the foundational lie of endpoint security: **you do not know your fleet.** The whole book so far has said "the managed devices" as if that set is the fleet. It isn't. The managed devices are the part you *chose to count* — and in most estates they're the bigger part only *if you're lucky.* The blast radius lives in everything else.
|
||||
|
||||
The honest spectrum of what touches your data:
|
||||
|
||||
- **Managed** — enrolled (MDM) or app-managed (MAM). The devices you can see and control. The part the programme is about, and the part everyone fixates on.
|
||||
- **Known-but-unmanaged** — devices that authenticate and reach data but aren't managed. Entra-registered-but-not-compliant, BYOD that hit OWA or a SharePoint link in a browser. They're in the sign-in logs; they're not under your control.
|
||||
- **Shadow** — devices the org never sanctioned but users brought anyway: a personal phone, a contractor's laptop, a home PC pulling files through the web client. Shadow IT at the device layer.
|
||||
- **Dark** — access you have *no device-level visibility into at all.* Legacy- protocol sign-ins that bypass Conditional Access and never produce a clean device signal. Long-lived tokens issued once and never re-evaluated. App passwords. Service principals and automation that aren't devices but reach data like one (the "dark matter" of Book III, wearing a different hat). This is the end of the spectrum that should frighten you, because it never trips a sensor.
|
||||
|
||||
And the inventory of record — the CMDB — is almost always **more wish than reality.** It's populated by *process* (someone files a ticket), and process decays the moment attention moves on. The real device population is populated by *behaviour* — what is actually authenticating right now. The gap between those two is precisely your shadow and dark population, and it's invisible exactly where it matters most.
|
||||
|
||||
This is the Book I corollary made flesh: **the inventory is a claim; the sign-in log is the fact.** Stop deriving your fleet from the CMDB (declarative, decaying, wishful) and start deriving it from observed authentication (behavioural, current, honest). You can't manage what you can't see, and you can't see what you decided not to look at.
|
||||
|
||||
The reframe that saves you is the same barbell from §3: the goal is **not** to manage every device — that's impossible, and chasing it is fragile. The goal is (a) to *know the real population* by observation, and (b) to *gate the data* so that an unmanaged or unknown device gets limited, app-contained, or no access. The question was never "is this device managed." It's **"can a device I don't control reach the data, and what happens when it does?"** An unmanaged device forced through an app-protection boundary in a browser session control is contained. An unmanaged device holding a fat client and a never-re-evaluated token is a hole in the wall you didn't know was open.
|
||||
|
||||
### The compliance signal lies (in both directions)
|
||||
|
||||
"Require compliant device" in Conditional Access is the real control. But the compliance signal underneath it is softer than the toggle suggests:
|
||||
|
||||
- **It's stale.** Compliance is evaluated on a check-in cadence, not continuously. There's a window where a device falls out of compliance — gets rooted, drops encryption, falls behind on patches — and still carries a "compliant" state and a valid token. The signal lags reality.
|
||||
- **It's spoofable.** Root/jailbreak detection is an arms race, not a wall. A motivated attacker (or a determined user with a YouTube tutorial) steps over the tripwire. Treat detection as a tripwire, never as a barrier.
|
||||
- **It's shallow.** "Compliant" usually means a handful of boxes — PIN set, encrypted, OS version, not-jailbroken. None of those stop malware running with the user's own token on a device that passed every check.
|
||||
- **It fails both ways.** A false *compliant* over-trusts a hostile device. A false *non-compliant* locks a legitimate user out at the worst possible moment — and anyone who's run endpoint at scale has watched a flaky signal brick access for someone important mid-flight. Both failure modes are real; design for both.
|
||||
|
||||
### The ghost policy: displayed config ≠ enforced config
|
||||
|
||||
This one is field-earned and genuinely frightening, because it defeats every form of inspection there is. A Conditional Access policy can show a **perfectly correct configuration in the portal** — every condition, assignment, and grant exactly as intended — and yet **never enforce anything.** The backend state has desynced or corrupted; the object you're *looking at* is not the object being *evaluated*. Recreating the policy from scratch with byte-identical parameters restores enforcement. Nothing in the displayed config ever told you it was broken.
|
||||
|
||||
Sit with what that means. A config review passes. An export-and-diff passes. A CIS audit ticks it green. Every parameter is "correct." And the control is doing nothing — a CA policy that **fails open, silently.** This is the worst failure on the convexity axis: the control you trusted to be convex (fails safe, blocks a class) is quietly behaving concave (fails open, protects nothing), and *no artefact you can read reveals it.* A benchmark cannot catch this. It is invisible to inspection by construction.
|
||||
|
||||
There is exactly one thing on earth that detects it: **observed enforcement under test.** This is not an edge case to file away — it is the single hardest piece of evidence for why the entire stressor discipline in this handbook exists. The iron rule that follows (and it is non-negotiable):
|
||||
|
||||
> **A CA policy's displayed configuration is a claim, never proof. The only proof is a real sign-in producing the expected outcome. Define the expected results *before* you build or change the policy, and test against them every time.**
|
||||
|
||||
Concretely: for the users and conditions that matter, write down the required outcome first — *user X, condition Y → MUST be blocked / granted / MFA-prompted* — so you're testing against a pre-committed expectation, not rationalising whatever you observe. Use the What If tool as a first pass, but understand its limit: What If evaluates the *configuration logic*, so it will happily tell you a ghost policy "applies" while the live evaluator ignores it. **Only a real authentication attempt is proof.** And when behaviour and config disagree, **recreate the policy from scratch — do not re-edit it**, because editing a corrupt object can carry the corruption forward. Re-test after tenant-level changes too, not just after policy edits; the desync can appear without you having touched the policy at all.
|
||||
|
||||
### The join-state coupling (Book II reaches the desktop)
|
||||
|
||||
Entra hybrid join drags the Book II fragility down to the device: the device identity now depends on on-prem AD, the SCP, the sync, and line-of-sight to a DC for some flows. It's the device-layer version of "one organism, two badges," and it exists almost entirely to service legacy app/auth dependencies. Pure Entra join + Intune is the cloud-native path that severs that coupling.
|
||||
|
||||
### The PRT is the device's golden ticket
|
||||
|
||||
The Primary Refresh Token on a managed device is its key to seamless cloud SSO. A compromised endpoint with a live PRT is a serious blast-radius problem. TPM binding (the session key sealed in hardware) is what raises the cost of stealing it — so "is the PRT TPM-bound?" is a real question, not a checkbox.
|
||||
|
||||
### MAM / App Protection is a *porous* boundary
|
||||
|
||||
Managing the data layer without owning the device (MAM-WE / App Protection Policies) is the right idea — wall the data, don't try to own a personal phone. But the wall has seams, and the data leaks through them: the OS share sheet, copy/paste where it isn't blocked, screenshots, "open in unmanaged app," local save paths, backups and cloud sync, and unmanaged browsers. A **"Block" in the policy is a claim, not a guarantee** — there are documented cases where the data goes out a path the policy was supposed to close. And enforcement is **not symmetric across iOS and Android**: different OS capabilities, different companion app requirements, different gaps that shift release to release. Never assume parity, and never trust the toggle without watching the device.
|
||||
|
||||
### Enrollment is a trust-establishment moment
|
||||
|
||||
Autopilot and enrollment are when a device becomes "trusted." That makes the enrollment path — tokens, the Autopilot device list, enrollment restrictions — a target: hijack it and you enrol a hostile device as a friend. Most programmes harden the device after enrollment and never look hard at the enrollment trust itself.
|
||||
|
||||
### The legacy and standing-privilege drag
|
||||
|
||||
- **GPO + co-management overlap** — on-prem-coupled config (Book II again), conflicts with Intune, and a migration most estates have half-finished for years.
|
||||
- **Standing local admin** on endpoints — the device-layer version of Book III's original sin; one cracked local admin path = lateral movement.
|
||||
- **Legacy auth that bypasses CA entirely** — the device controls are irrelevant on a protocol that never consults Conditional Access.
|
||||
|
||||
### Patch velocity, and its evil twin
|
||||
|
||||
A fleet you can patch in 24 hours is antifragile; one that takes six weeks of change control is fragile, and the attackers know your patch latency better than you do. But the *opposite* failure is just as real: a fast push to **everything at once** with no staging is how a single bad update bricks an entire fleet — the 2024 CrowdStrike mass-BSOD event was exactly this, a security vendor's own update shipped fast to everyone with no canary. Velocity without an escape hatch is concave (see §4).
|
||||
|
||||
---
|
||||
|
||||
## 2. Via negativa — what to remove
|
||||
|
||||
1. **Go cloud-native.** Move to Entra join + Intune + Autopilot and retire hybrid join, domain join, and GPO wherever the legacy dependency can actually be killed. This severs the Book II coupling at the device layer and deletes a whole class of "the desktop broke because the DC/sync/SCP did" failures.
|
||||
2. **Stop trying to trust the device.** This is a *deletion* — stop pouring effort into making BYOD a trusted device. Wall the data instead (MAM/App Protection) and treat the device as untrusted by default. Subtracting the impossible goal is the move.
|
||||
3. **Remove data from the endpoint.** If the data lives in managed apps and the cloud, there's less on the device to leak or lose. Shrink the local footprint and the compromise gets cheaper to absorb.
|
||||
4. **Remove standing local admin.** JIT elevation (Endpoint Privilege Management) instead — Book III's "shrink the time" at the desktop.
|
||||
5. **Kill legacy auth and the protocols that bypass CA.** A device control you can route around isn't a control.
|
||||
6. **Prune the cruft** — conflicting/duplicate config profiles, dead enrollment profiles, stale Autopilot registrations, orphaned compliance policies nobody can explain. Each one is drift waiting to surprise you.
|
||||
|
||||
---
|
||||
|
||||
## 3. The barbell — cheap devices, protected boundary
|
||||
|
||||
**The device is cattle, not a pet.** This is the central barbell of the book. A lost, stolen, or compromised endpoint should be a **shrug**: selective-wipe the corporate data (BYOD) or full-wipe and re-provision via Autopilot in about an hour (corporate). If losing a laptop is a crisis, you've made the device irreplaceable — which means you protected the wrong thing.
|
||||
|
||||
**Protect the irreplaceable boundary instead:**
|
||||
|
||||
- **The access decision** — Conditional Access. This is the convex control of the endpoint world (Book I): one well-built policy blocks whole classes of attack, cheaply. It is also one of the few things that can brick an entire tenant if misconfigured, so it gets paranoid change discipline (§4).
|
||||
- **The data boundary** — the managed-app container / App Protection policy set, tested at the seams (§5), not trusted at the toggle.
|
||||
- **The PRT and enrollment trust** — TPM-bound credentials, hardened enrollment restrictions, device-bound phishing-resistant auth (links Book III).
|
||||
|
||||
**Don't gold-plate the disposable.** Spending weeks locking down a kiosk's wallpaper policy while the CA policy set has a legacy-auth hole is the endpoint version of even-spreading. Concentrate on the decision and the data wall.
|
||||
|
||||
---
|
||||
|
||||
## 4. Optionality & recovery — escape hatches, tested
|
||||
|
||||
- **Wipe-and-reprovision as the recovery primitive.** Autopilot makes the device replaceable; *that* is your endpoint recovery plan. But "replaceable in an hour" is a slide claim until you've timed it on a real device. Drill it.
|
||||
- **Selective wipe for BYOD** — the clean escape hatch that pulls corporate data without touching the user's photos. The thing that makes MAM politically survivable.
|
||||
- **Update rings and canaries — velocity *with* a brake.** The answer to the CrowdStrike failure mode isn't "patch slowly," it's "patch fast through rings with a real canary, and keep the ability to **halt or roll back** a bad push before it reaches everyone." Fast *and* reversible. This is the barbell and optionality fused: speed on the upside, a bounded blast radius on the downside.
|
||||
- **Break-glass exclusion from device requirements.** A flaky compliance signal must never lock out recovery. The break-glass accounts (Book I/III) sit outside the "require compliant device" gate — and that exclusion is monitored, not forgotten.
|
||||
- **Fast device-trust revocation.** A one-move way to disable a device, revoke its tokens, and drop it from CA trust. Rehearse it.
|
||||
- **Continuous Access Evaluation** is the mechanism shrinking the stale-token window — near-real-time response to critical events instead of waiting for token expiry. It narrows §1's "the signal is stale" gap. Coverage is not universal across every app and flow (verify current state, §honest uncertainty).
|
||||
|
||||
---
|
||||
|
||||
## 5. Stressor — break it on purpose
|
||||
|
||||
This domain rewards hands-on stress more than any other, because the gap between *policy* and *behaviour* only shows up on a real device.
|
||||
|
||||
- **Reconcile the four lists and hunt the deltas.** Pull Intune-enrolled devices, Entra-registered devices, devices appearing in sign-in logs, and the CMDB. None of them will agree. The **disagreements are the findings**: devices authenticating that nobody manages, CMDB entries that never sign in, registered devices that fell out of management. Then go further — count legacy-auth sign-ins and long-lived sessions (the dark end), and run network device discovery for the unmanaged things on the wire. The size of the gap between "the fleet we think we have" and "the population actually touching data" is one of the most honest metrics you can put in a report.
|
||||
- **Attack your own MAM boundary, per platform.** Try to get corporate data out through every seam: share sheet, copy/paste, screenshot, save-as-local, open-in- unmanaged-app, backup/sync, an unmanaged browser. Find where "Block" doesn't actually block. Do it **separately on iOS and Android** — they will not behave the same, and the difference is the finding. (When you find a gap that survives reinstall and reset, that's an escalation to the vendor, not a config you missed.)
|
||||
- **Spoof the compliance signal.** Root/jailbreak a test device. Is it caught? How long until the signal flips and CA reacts? That latency is your real exposure window.
|
||||
- **Prove every CA policy actually enforces.** Never sign off a policy on its displayed config. With expected results written down beforehand, drive real sign-ins for each user/condition that matters and confirm the *observed* outcome matches. Treat What If as a hint, not proof. If a policy that looks correct doesn't enforce, recreate it from scratch rather than editing — the displayed object and the evaluated object can diverge silently, and a ghost policy fails open without ever telling you.
|
||||
- **Lock yourself out on purpose.** In report-only mode, simulate a false non-compliant on a privileged user. Watch the CA decision. Confirm break-glass sails through. Better to find the lockout in a drill than during an outage.
|
||||
- **Push a deliberately bad config/update to the canary ring.** Confirm the ring *contains* it and that halt/rollback works. An untested canary is just the first domino with a friendly name.
|
||||
- **Time a wipe-and-reprovision.** Is the device truly replaceable in an hour, or is that a fiction the recovery plan rests on?
|
||||
- **Compromise a test endpoint.** What does its PRT reach? Does EDR detect it? Does the device-risk signal actually flow into CA and revoke access — or does it stop at a dashboard nobody watches?
|
||||
|
||||
Per Book I principle 6: every gap found becomes a **structural** change — a closed seam, a tightened ring, a severed coupling, an escalation raised — not a line in a test log that dies there.
|
||||
|
||||
---
|
||||
|
||||
## Honest uncertainty (endpoints are the worst offender — verify on a real device)
|
||||
|
||||
Stable and Lindy (teach with confidence): the device will be compromised; trust the boundary, not the device; cheap-and-reprovisionable beats hardened-and- precious; compliance is a signal; velocity needs a brake. None of that churns.
|
||||
|
||||
What moves — and on the endpoint, it moves *faster and more quietly* than anywhere else in this handbook:
|
||||
|
||||
- **MAM / App Protection enforcement is version-, platform-, and OS-build- dependent, and it has gaps that shift release to release.** iOS and Android are not symmetric and never have been; companion app requirements and managed- browser support change. The portal will tell you a policy is enforced while the device quietly does something else. **The only reliable test is on a real device, on the current OS build, every release** — the documentation and the hardware disagree more than Microsoft likes to admit. If you live anywhere in this handbook, live here.
|
||||
- **Continuous Access Evaluation coverage** is expanding but not universal — which apps and flows honour near-real-time revocation changes; verify current coverage before you promise it closes the stale-token window.
|
||||
- **Windows LAPS, Endpoint Privilege Management, Autopatch, Smart App Control / WDAC** capabilities and management surfaces all evolve; confirm current state and licensing before recommending.
|
||||
- **Cloud-native vs hybrid-join guidance and the GPO→Settings-Catalog migration tooling** keep shifting toward cloud-native; check what's actually supported for the client's app estate before promising the coupling can be cut.
|
||||
|
||||
If a client's safety hinges on a specific enforcement behaviour, **test it on the device and, if needed, cite the current Microsoft doc** — and when the device behaviour contradicts the doc, believe the device. Confident-but-wrong about an endpoint control is how data walks out a seam everyone swore was closed.
|
||||
|
||||
---
|
||||
|
||||
## Consolidated judgement prompts
|
||||
|
||||
- If this device is compromised right now, what does the attacker get, how fast do we know, and how fast is it gone? Is the device a shrug or a crisis?
|
||||
- Do we know our *real* device population — derived from what's authenticating — or are we trusting a CMDB that's more wish than reality? How big is the gap between managed, known-unmanaged, shadow, and dark? What dark access bypasses CA entirely?
|
||||
- Is "compliant" being treated as a guarantee or as a signal that can be stale, spoofed, or shallow? What happens when it's wrong — in *both* directions?
|
||||
- Is the boundary the data (MAM/CA) or the device? Have we tested the data wall at every seam, on every platform, on the current OS build — or just toggled it?
|
||||
- Are devices hybrid-joined out of genuine need, or out of habit? What would it take to go cloud-native and cut the Book II coupling?
|
||||
- Can we patch the fleet fast — and can we *halt* a bad push before it reaches everyone? Do we have rings and a real canary, or hope?
|
||||
- Is the PRT TPM-bound? Is enrollment trust hardened, or can a hostile device enrol as a friend?
|
||||
- Does standing local admin still exist? Does legacy auth still bypass CA?
|
||||
- For every CA policy that matters: has it been proven to enforce by a *real sign-in* against pre-written expected results — or are we trusting the displayed config of a policy that might be a ghost?
|
||||
- Has anyone timed a wipe-and-reprovision, tested break-glass against the device gate, or watched the device-risk signal actually reach a CA decision?
|
||||
|
||||
---
|
||||
|
||||
*Book IV of the Antifragile Handbook. Stop defending the device; assume it's already lost and build the boundary that survives it. Trust the device behaviour over the portal toggle, every time. Move fast and fix things.*
|
||||
@@ -0,0 +1,140 @@
|
||||
# The Antifragile Handbook for M365 & Active Directory
|
||||
|
||||
## Book V — Data & Collaboration (Exchange, SharePoint, Teams, OneDrive)
|
||||
|
||||
> *Data is liquid. It leaves where you put it — copied, shared, forwarded, synced, linked. The question is never "is it locked down" but "where can it flow, who can reshare it, and can you see and reverse the flow?"*
|
||||
|
||||
---
|
||||
|
||||
## The governing question
|
||||
|
||||
Books II–IV protected the *containers*: identity, privilege, devices. This book is about the *contents*, and contents obey a different physics. You can perfectly secure a container and still lose the data, because data doesn't stay put — it's duplicated into an email, dropped in a Team, synced to a laptop, handed to a guest who reshares it to someone you've never heard of. Perimeter thinking dies here.
|
||||
|
||||
> **Every share is a copy of your blast radius handed to a party you don't control. Can you see where it went, and can you pull it back?**
|
||||
|
||||
For most estates the honest answers are "no" and "no": nobody can enumerate the external shares, nobody reviews the guests, and a file shared to "Anyone with the link" three years ago is still reachable by anyone who ever held that link.
|
||||
|
||||
---
|
||||
|
||||
## 1. Fragility inventory — how data leaks
|
||||
|
||||
### "Anyone" links: bearer tokens for your data
|
||||
|
||||
Anonymous "Anyone with the link" sharing in SharePoint/OneDrive is the single largest data-exposure fragility in M365. A link is a **bearer token** — whoever holds it has access, no identity, no MFA, no device check, often no expiry, and it's forwardable. Its blast radius is everyone the link ever reaches, forever, including the open web if it leaks into an email thread or a crawler. Conditional Access, compliant devices, all of Books II–IV — none of it applies to a bearer link. It's a hole punched clean through every wall you built.
|
||||
|
||||
### Reshare, and the chain you can't see
|
||||
|
||||
Once data is shared — especially externally — the recipient can usually reshare, download, and copy it. You've handed your blast radius to an org (or a personal account) whose security posture you don't control and can't observe. Guests reshare to other guests. The chain of custody becomes invisible after the first hop. And the controls that govern this in Teams collaboration are **split across several layers** — Teams policy, SharePoint org- and site-level sharing, OneDrive, tenant sharing settings, and B2B/cross-tenant access — that interact in non-obvious ways and don't always agree. (More in §honest uncertainty; this is a place where the policy matrix and the observed behaviour routinely diverge.)
|
||||
|
||||
### Guest sprawl: standing blast radius at the data layer
|
||||
|
||||
Guests accumulate and nobody prunes them. The guest invited for one project in 2022 still has a foothold. Each is an external identity governed by *their* security, not yours — the data-layer cousin of standing privilege (Book III) and shadow devices (Book IV). Unreviewed guest access is a slowly metastasising external attack surface, and most tenants cannot even produce the list of who has it and to what.
|
||||
|
||||
### Email: the oldest, most Lindy exfil channel
|
||||
|
||||
Auto-forwarding rules are the classic business-email-compromise move — a quiet hidden rule that copies all mail to an external address, persistent and invisible. Add attachment-save paths that escape policy, and mail remains the most reliable way data walks out the door. External auto-forward should be off by default, and its presence should scream.
|
||||
|
||||
### The hybrid Exchange anchor (Book II at the data layer)
|
||||
|
||||
An on-prem Exchange server is a Tier-0-adjacent liability — historically one of the most catastrophic on-prem attack surfaces, where mailbox/management permissions can escalate toward AD. Hybrid Exchange drags that liability into the estate, and subtle functionality dependencies keep the last server alive long past its welcome. The via-negativa prize is decommissioning on-prem Exchange entirely (§2) — verify the current management/recipient tooling first.
|
||||
|
||||
### Internal oversharing
|
||||
|
||||
External isn't the only blast radius. "Everyone," "All company," and "Everyone except external users" permissions on a site holding HR, finance, or M&A data mean one compromised *internal* account reaches it all. Default-open SharePoint sites and self-service site creation produce internal data sprawl that no one maps.
|
||||
|
||||
### Collaboration sprawl by design
|
||||
|
||||
Every Team spins up a SharePoint site, an M365 group, a mailbox, and more — each with its own sharing and guest settings, each a potential leak. Self-service creation means ungoverned proliferation of data containers, and collaboration tools carry subtle data-visibility behaviours (who sees what history, what a late joiner can read) that surprise even experts. Sprawl nobody inventories is fragility nobody can see.
|
||||
|
||||
### Illicit OAuth consent: data exfil through a "legitimate" app
|
||||
|
||||
A user clicks OK on an app requesting `Mail.Read` or `Files.Read.All`, and now a third party reads tenant data through a sanctioned-looking grant. This is the data-layer face of Book III's app-registration dark matter — exfil that needs no malware and trips no device control.
|
||||
|
||||
### Retention as hoarded blast radius
|
||||
|
||||
Keeping everything forever makes every breach maximal: the attacker gets fifteen years of data instead of one. Over-retention is hoarding fragility — every byte you keep is a byte that can be stolen. (Its opposite, no recoverable copy at all, is Book VI's problem. The art is disposing of what you don't need while protecting what you do.)
|
||||
|
||||
---
|
||||
|
||||
## 2. Via negativa — what to remove
|
||||
|
||||
1. **Kill anonymous "Anyone" links.** Default external sharing to authenticated, time-limited, least-permission (view, not edit). Remove the bearer token from your data entirely where you can.
|
||||
2. **Decommission on-prem Exchange.** Remove the Tier-0-adjacent liability; get off hybrid Exchange where the dependency can actually be cut (verify current tooling — §honest uncertainty).
|
||||
3. **Block external auto-forwarding by default.** Delete the quietest exfil channel there is.
|
||||
4. **Prune guests ruthlessly.** Access reviews, expiration, entitlement management. Stale external access gets removed, and new guest access expires by default. Treat guest sprawl like standing privilege: minimise and time-box it.
|
||||
5. **Minimise retention.** Dispose of stale data on a schedule. Shrink the prize so every breach is smaller. Data you no longer hold cannot be exfiltrated.
|
||||
6. **Remove broad internal shares** ("All company"/"Everyone") from anything sensitive. Sensitive data should live in *few, known* places with *narrow* access.
|
||||
7. **Govern self-service creation and clean up the dead.** Curb ungoverned Team/ site/app creation; archive and delete orphaned, inactive containers.
|
||||
8. **Restrict user consent and revoke illicit grants.** Users shouldn't be able to hand tenant data to arbitrary apps; admin-consent workflow for anything sensitive, and sweep out the over-permissioned grants already there.
|
||||
|
||||
---
|
||||
|
||||
## 3. The barbell — find the crown jewels, free the rest
|
||||
|
||||
**Name the crown jewels.** Which handful of data sets — the IP, the regulated data, the executive and M&A comms, the source of the company's value — would, if leaked, actually end the business? Most organisations cannot name them, and *that inability is finding #1.* You cannot protect asymmetrically until you know what the asymmetry is for.
|
||||
|
||||
**Paranoid protection for the crown jewels:**
|
||||
|
||||
- **Sensitivity labels with encryption that travels with the file.** This is the convex control of the data world (Book I, principle 7): one label protects the file *everywhere it goes*, forever — even after it leaves the tenant, lands on an unmanaged device, or is forwarded to a stranger. The protection is bound to the data, not the container. That's the only thing that survives data's liquidity.
|
||||
- **Restricted sites, no external sharing, tight access with recurring reviews.**
|
||||
- **Conditional Access app control / session controls** — browser-only, block-download for sensitive data on unmanaged devices (the Book IV boundary applied to content).
|
||||
- **Heightened monitoring** on crown-jewel access (feeds Book VI).
|
||||
|
||||
**Free everything else.** Most collaboration data is low value and should flow *fast* — velocity is a feature (Book I creed). Don't lock the lunch-menu SharePoint with M&A-vault rigour. Spreading DLP and restriction evenly across all data is the concave failure: enormous maintenance, false positives that train users to click through, and the real exfil lost in the noise. **DLP is a scalpel for known high-value patterns (card numbers, national IDs, the labelled crown jewels), not a dragnet over everything.**
|
||||
|
||||
---
|
||||
|
||||
## 4. Optionality & recovery — escape hatches, tested
|
||||
|
||||
- **The label *is* the escape hatch.** Because encryption travels with the file, a leaked crown-jewel document is still encrypted wherever it lands — you pre-paid for the data to survive being stolen. That is optionality bound into the byte.
|
||||
- **Fast share revocation.** Can you, in 30 minutes, enumerate and *kill* every external share and anonymous link? If you can't produce the list, you can't pull it back — build the report and the revocation muscle before you need them.
|
||||
- **Audit and content forensics — switched on and retained.** "Who accessed and downloaded what" is your post-incident truth, but only if audit logging is actually enabled and retained long enough to matter. Verify it's on; don't assume (§honest uncertainty).
|
||||
- **Guest access reviews as recurring pruning** — the recovery loop for sprawl.
|
||||
- **Immutable/held copies of crown-jewel data** — the bridge to Book VI backup.
|
||||
|
||||
---
|
||||
|
||||
## 5. Stressor — break it on purpose
|
||||
|
||||
- **Exfiltrate a labelled crown-jewel file yourself.** Email it externally, share it anonymously, download it through CAA session control, open it on an unmanaged device. Does the label encryption hold? Does DLP fire? Does anything alert? You are testing the *behaviour*, not the policy screen (Book I corollary).
|
||||
- **Plant a canary document** seeded with a detectable pattern and try to move it out every way you can. What catches it? What doesn't?
|
||||
- **Enumerate the external surface.** Produce the full list of "Anyone" links, external guests, and externally-shared files. The exercise of *trying* usually reveals you can't — which is the finding.
|
||||
- **Simulate the BEC forward rule.** Set a test external auto-forward. Is it blocked? Alerted? Silent? Silence is the BEC attacker's favourite answer.
|
||||
- **Test the reshare chain.** Share to a test guest, have them reshare onward. Can you see it? Stop it? Pull it back?
|
||||
- **Reconcile declared vs enforced sharing.** The tenant sharing setting says one thing; walk the actual per-site and per-link reality. They diverge — the ghost-policy cousin from Book IV, at the data layer.
|
||||
|
||||
Per Book I principle 6: every leak path found becomes a **structural** change — a killed link type, a pruned guest population, a label applied, a coupling removed — not a note in a spreadsheet.
|
||||
|
||||
---
|
||||
|
||||
## Honest uncertainty (the sharing matrix moves — test, don't trust it)
|
||||
|
||||
Stable and Lindy (teach with confidence): data is liquid; bearer links are exposure; protection must travel with the data; minimise the prize; DLP is a scalpel not a dragnet; guests are standing blast radius. None of that churns.
|
||||
|
||||
What moves, and what you must verify by testing rather than reading:
|
||||
|
||||
- **External sharing enforcement is split across many interacting layers** — Teams policy, SharePoint org/site sharing, OneDrive, tenant settings, B2B/cross-tenant access, and the Premium tiers — and they don't always agree. Enforcement can differ by client and platform, and the documented matrix and the observed behaviour diverge often enough that you should **confirm the real behaviour on a real client, not from the policy screen.** When you find an inconsistency that survives reconfiguration, that's a vendor escalation, not your error.
|
||||
- **On-prem Exchange decommissioning** and the "last server for management" story — the tooling has evolved; verify the current supported path before promising the coupling can be cut.
|
||||
- **Purview / sensitivity labels / auto-labelling / DLP** capabilities churn fast, including the branding. Verify current coverage and licensing.
|
||||
- **Cross-tenant access settings (B2B collaboration and direct connect)** are comparatively new and evolving — verify current behaviour.
|
||||
- **Audit log retention defaults and licensing have changed over time.** Confirm what's actually captured and for how long *before* you rely on it for forensics.
|
||||
|
||||
If a client's safety hinges on a specific sharing behaviour, test it on a live client and cite the current doc — and where the client behaviour contradicts the doc, believe the client.
|
||||
|
||||
---
|
||||
|
||||
## Consolidated judgement prompts
|
||||
|
||||
- Can we name the crown jewels? If not, that's finding #1 — everything else is guesswork until we can.
|
||||
- Can we enumerate every external share, anonymous link, and guest *right now*? Can we revoke them fast?
|
||||
- Does protection travel *with* the crown-jewel data (labels/encryption), or only with the container it currently sits in?
|
||||
- Where can this data flow — reshare, forward, sync, download, OAuth app — and is any of that flow visible or reversible?
|
||||
- Are guests treated as standing blast radius (minimised, time-boxed, reviewed) or left to accumulate?
|
||||
- Is DLP a scalpel on known high-value patterns, or a dragnet generating noise everyone clicks through?
|
||||
- Is on-prem Exchange still anchoring the estate? What would it take to cut it?
|
||||
- Is audit logging actually on and retained long enough to reconstruct an incident?
|
||||
- Does the tenant's *declared* sharing posture match what the sites and links *actually* enforce?
|
||||
|
||||
---
|
||||
|
||||
*Book V of the Antifragile Handbook. You cannot wall in a liquid. Name the few things that would end the company, bind protection to the data itself, shrink the prize, and make every flow visible and reversible. Move fast and fix things.*
|
||||
@@ -0,0 +1,154 @@
|
||||
# The Antifragile Handbook for M365 & Active Directory
|
||||
|
||||
## Book VI — Recovery & Detection-as-Feedback
|
||||
|
||||
> *Robust means you survive the shock unchanged. Antifragile means you come back stronger. The shock is coming either way — the only choice is what you do with it.*
|
||||
|
||||
---
|
||||
|
||||
## The governing question
|
||||
|
||||
This is the capstone, because it's the book that decides whether everything before it was merely *robust* or genuinely *anti*fragile. The first five books harden the estate; this one builds the machine that turns every shock into improvement. Ask:
|
||||
|
||||
> **When — not if — this fails, do you come back weaker, the same, or stronger?**
|
||||
|
||||
A fragile estate comes back weaker (if at all). A robust estate comes back the same and waits for the next identical hit. An antifragile estate comes back *different and harder to hit the same way twice* — because it ran the shock through a feedback loop and changed its own structure. That loop is the entire subject of this book.
|
||||
|
||||
The reframe that powers it: most organisations treat detection and recovery as the sad afterthought — the thing they hope never to need. Invert it. **Incidents, alerts, failed drills, and near-misses are the most valuable intelligence the system ever produces** — honest, real-world data about where the fragility actually is, bought in the cheapest currency available *if you harvest it.* The org that buries incidents stays fragile. The org that treats them as fuel becomes antifragile. Your job is to build the machine that converts disorder into structural strength.
|
||||
|
||||
---
|
||||
|
||||
## 1. Fragility inventory — where recovery and detection rot
|
||||
|
||||
### Backups that have never been restored
|
||||
|
||||
The biggest recovery lie in the industry: *"we have backups."* Having a backup is not the same as being able to recover, and an untested backup is Schrödinger's recovery — simultaneously fine and worthless until someone actually opens the box. Two M365-specific traps make this worse:
|
||||
|
||||
- **"Microsoft backs it up for us."** Microsoft provides geo-redundancy, recycle bins, and limited native retention — *not* point-in-time backup against your own ransomware, malicious deletion, or retention expiry. Under the shared- responsibility model, **your data is your responsibility.** Most tenants have no real, independent, point-in-time M365 backup, and discover this during the incident.
|
||||
- **Attackers target backups first.** Ransomware operators delete or encrypt the backups *before* they hit production, because they know it's your only way out. A backup reachable from the compromised estate is not a backup; it's another victim.
|
||||
|
||||
### AD forest recovery: the nightmare nobody rehearses
|
||||
|
||||
Recovering a compromised or destroyed AD forest is one of the hardest operations in all of IT — clean OS installs, authoritative restore of one DC per domain, metadata cleanup, double krbtgt reset, trust resets, the whole brutal sequence. Almost no one has practised it. So when ransomware takes AD, "restore from backup" is a multi-day, error-prone, improvised ordeal performed for the first time under maximum pressure. Entra recovery is less apocalyptic but has its own teeth: the hard-delete window for objects, and the fact that tenant *configuration* (CA policies, Intune, roles) has no native "undo" unless you captured it as code.
|
||||
|
||||
### Recovery that depends on what the incident destroyed
|
||||
|
||||
The fatal circular dependency: backups authenticated by the AD that's down. The recovery runbook stored in the SharePoint that's encrypted. The break-glass that needs the MFA service that's offline. The recovery admin whose credentials the attacker already has. **A recovery path that depends on the thing it's recovering is not a recovery path** — it's the clean-source principle (Book III) applied to survival.
|
||||
|
||||
### Detection that fires into a void
|
||||
|
||||
Logs not collected. Audit logging never enabled or silently aged out. A SIEM full of alerts nobody triages. And the specific blind spots the earlier books planted: the unmonitored DCSync (Book II), the unwatched break-glass use (Book III), the device-risk signal that dies on a dashboard (Book IV), the BEC forward rule nobody sees (Book V). Detection that nobody acts on is theatre with a subscription fee.
|
||||
|
||||
### Alert fatigue: the boy who cried wolf, automated
|
||||
|
||||
Too many low-fidelity alerts is itself a fragility — the real signal drowns in noise, and the analyst who's dismissed a thousand false positives dismisses the one that mattered. More alerts is not more security; past a point it's *less.*
|
||||
|
||||
### MTTR that exists only on paper
|
||||
|
||||
RTO/RPO numbers in a policy document, never once validated by an actual restore, are fiction. (Book I anti-benchmark: MTTR is measured by *doing it*, not by declaring it.)
|
||||
|
||||
### Incidents that close without changing anything
|
||||
|
||||
The post-incident review that concludes "remind users to be more careful" has wasted the disorder entirely and guaranteed the recurrence. And a blame culture destroys the feedback loop at the source — if surfacing an incident gets you punished, incidents get buried, and the system goes blind.
|
||||
|
||||
### No known-good to return to
|
||||
|
||||
If your tenant configuration lives only as click-ops in a portal, you have no golden image of "correct," so you can neither rebuild it fast nor detect drift *from* it — and you can't catch a ghost policy (Book I/IV) because you have nothing to diff against. No config-as-code means no known-good.
|
||||
|
||||
---
|
||||
|
||||
## 2. Via negativa — what to remove
|
||||
|
||||
1. **Delete the false comfort that Microsoft backs you up.** Removing the dangerous belief comes before adding the real backup.
|
||||
2. **Sever recovery's dependencies on the estate it recovers.** Recovery credentials, runbooks, and backups must not depend on prod AD/Entra/SharePoint. Decouple, so the lifeboat doesn't sink with the ship.
|
||||
3. **Cut alert noise.** Ruthlessly remove low-fidelity alerts so the high-fidelity ones become visible. Via negativa applied to detection: fewer, louder, truer.
|
||||
4. **Remove blame from the post-incident process.** Blameless on people so people surface incidents — then ruthless on structure so the incident actually changes something. Removing the incentive to hide *protects the feedback loop itself.*
|
||||
5. **Remove click-ops from critical configuration.** Move control-plane config (CA, Intune, roles) to code, so a known-good exists to rebuild from and diff against.
|
||||
|
||||
---
|
||||
|
||||
## 3. The barbell — paranoid recovery for the irreplaceable, best-effort for the rest
|
||||
|
||||
**The irreplaceable few** — the identity control plane (Books II/III) and the crown-jewel data (Book V) — get **real, tested, immutable, offline/isolated backup** and **rehearsed** recovery. AD forest recovery is practised, not theorised. Recovery objectives for these are measured in a drill, in minutes or hours, not asserted in a policy.
|
||||
|
||||
**The recovery capability is itself a crown jewel.** Backups are a top attacker target, so protect them like break-glass: immutable, offline or in a separate trust domain, unreachable even from full domain dominance. A backup the attacker can reach is not a control.
|
||||
|
||||
**Everything else is best-effort and tiered.** Don't gold-plate recovery for the lunch-menu SharePoint. Tier recovery objectives to value — crown jewels get immutable and fast; bulk collaboration gets good-enough. And concentrate **high-fidelity detection** on the control-plane and crown-jewel signals (the screaming break-glass, the anomalous DCSync, the impossible-travel admin, the crown-jewel mass-download) rather than spreading shallow alerting evenly across everything.
|
||||
|
||||
---
|
||||
|
||||
## 4. Optionality & recovery — the heart of the book
|
||||
|
||||
- **Tested restores on a schedule.** The only proof of recovery is a restore that happened. Make the restore drill routine, time it, and verify integrity — that time *is* your real MTTR.
|
||||
- **Immutable + offline/isolated backups** — the escape hatch that survives the attacker reaching production. Ransomware-resilient by design, not by hope.
|
||||
- **Rehearsed AD forest and Entra recovery runbooks, stored independently** — on paper or offline, reachable when the estate is dark, not in the SharePoint that's encrypted.
|
||||
- **Configuration-as-code (IaC) for the control plane** — instant rebuild *and* a known-good baseline to detect drift and ghost configuration against. This single practice serves recovery, drift detection, and the Book I corollary at once.
|
||||
- **A clean-room / isolated recovery environment** — somewhere to rebuild that the attacker isn't already inside.
|
||||
- **The fail-over-vs-clean-in-place decision pre-made.** When do we rebuild rather than try to clean a compromised estate? Decide the criteria *before* the incident; it's the Book II "sever the sync" decision generalised to the whole estate.
|
||||
|
||||
---
|
||||
|
||||
## 5. Stressor — the hormesis engine (the climax of the handbook)
|
||||
|
||||
This is where the entire handbook either runs or rusts. Everything else is preparation for the loop; this is the loop turning.
|
||||
|
||||
- **Live restore of a crown-jewel dataset and the control plane.** Not a tabletop — an actual restore, integrity-verified and timed. The number you get is the truth; the number in the policy was always fiction.
|
||||
- **Rehearse AD forest recovery.** The first time you perform the hardest recovery in IT must not be during the real disaster. Run it. Find what's missing. Fix the runbook.
|
||||
- **Inject attacks end-to-end and follow them all the way through.** DCSync, malicious consent, break-glass use, impossible-travel admin, crown-jewel mass- download. Confirm not just that the alert *exists*, but that it's **triaged, and someone acts.** Detection that fires into a void fails this test on purpose, so you can fix it.
|
||||
- **Run a ransomware game-day** that assumes Tier 0 is owned and backups are the first target. Watch your decoupling hold or fail.
|
||||
- **Purple-team as routine, not annually.** Standing, escalating, blast-radius- controlled stress — hormesis, not a once-a-year audit ritual.
|
||||
- **Measure the loop itself.** Track *time from incident to structural change.* If drills and incidents close without a removed right, a severed coupling, or a new firebreak, the loop is broken and you are merely robust.
|
||||
|
||||
---
|
||||
|
||||
## The feedback loop — what makes all six books antifragile
|
||||
|
||||
Name the loop explicitly, because it's the thread that ties the whole handbook together and the thing that converts robustness into antifragility:
|
||||
|
||||
**Detect** (see the stressor) → **Respond** (contain it) → **Recover** (come back) → **Learn structurally** (come back *stronger*) → which feeds back into **Removal and redesign** across every prior book — a fragilizer deleted (Book I via negativa), a coupling severed (Book II), a standing privilege collapsed (Book III), a device boundary tightened (Book IV), a data flow closed (Book V).
|
||||
|
||||
The first three steps are robustness; plenty of organisations reach them and call it security. **The fourth step is the whole game.** A shock that produces no structural change has been wasted, and the system will meet the same shock again, unchanged. A shock that *does* produce structural change has made the estate stronger — which is the literal definition of antifragile, and the only honest justification for everything in this handbook.
|
||||
|
||||
---
|
||||
|
||||
## Honest uncertainty (verify the moving parts)
|
||||
|
||||
Stable and Lindy (teach with confidence): untested backup is no backup; attackers hit backups first; recovery must not depend on what it recovers; detection without action is theatre; alert fatigue is fragility; every shock must change the structure. None of that churns — these are the oldest truths in operational security.
|
||||
|
||||
What moves, and what you must verify:
|
||||
|
||||
- **M365 native backup/retention specifics and the shared-responsibility boundary** — what Microsoft does and does not cover, recycle-bin and hard-delete windows — evolve. Verify current reality, and **test what you can actually recover** rather than trusting either "Microsoft has us covered" or a vendor pitch.
|
||||
- **Entra recovery and configuration-backup tooling** (deleted-object windows, Graph/IaC options for capturing CA, Intune, and roles as code) evolve — verify current capability.
|
||||
- **AD forest recovery** is Lindy in principle (it is brutal; rehearse it), but automation and tooling evolve — confirm the current supported procedure.
|
||||
- **Detection tooling** (the XDR/SIEM signal catalogue) churns continuously. Verify which detections exist *today* and test them end-to-end; the principle (high-fidelity over noise, tested through to action) is what's permanent.
|
||||
- **Audit log retention and licensing** have changed over time — confirm what's captured and for how long *before* relying on it for forensics.
|
||||
|
||||
If recovery hinges on a current specific, verify it and test it. "We confirmed the restore works and it takes four hours" beats any RTO ever written in a policy.
|
||||
|
||||
---
|
||||
|
||||
## Consolidated judgement prompts
|
||||
|
||||
- When this fails, do we come back weaker, the same, or stronger? What's the mechanism that makes it *stronger*?
|
||||
- When was a backup of the crown jewels and the control plane last *restored* — not taken, restored — and how long did it take?
|
||||
- Are the backups reachable from the estate they protect? (If yes, they're another victim.) Are they immutable and offline?
|
||||
- Has anyone ever rehearsed AD forest recovery? Is the runbook reachable when the estate is dark?
|
||||
- Does any part of the recovery path depend on the thing the incident destroyed — credentials, runbook location, MFA, the recovery admin?
|
||||
- Does detection fire into action, or into a void? Is there so much noise the real signal is lost?
|
||||
- Does control-plane config exist as code (a known-good to rebuild and diff against), or only as click-ops?
|
||||
- For the last three incidents and drills: what *structural* thing changed? If the answer is "a reminder," the loop is broken.
|
||||
- How long from incident to structural change — and is that time getting shorter?
|
||||
|
||||
---
|
||||
|
||||
## Coda — the whole arc
|
||||
|
||||
Six books, one idea. Book I is the **lens**: subtract before you add, protect the irreplaceable, measure blast radius, buy optionality, stress on purpose, and make every shock change the structure — verifying by observation, never by inspection. Books II–V apply that lens to the **containers and contents**: the identity bridge made a firebreak, privilege collapsed in reach and time, the device assumed hostile and the boundary moved to the data, and the data itself made to carry its own protection as it flows. Book VI is the **loop** that makes it all antifragile rather than merely robust — the machine that feeds every incident back into removal and redesign.
|
||||
|
||||
None of this is a checklist, and if a consultant trained on it ever reaches for "because the benchmark says so," they've missed the point. The point is judgement: draw the wall, find the fragility, fix what matters, and let every stress make the estate stronger than it was.
|
||||
|
||||
Move fast and fix things.
|
||||
|
||||
---
|
||||
|
||||
*Book VI of the Antifragile Handbook, and the close of the arc.*
|
||||
@@ -0,0 +1,203 @@
|
||||
# The Antifragile Handbook for M365 & Active Directory
|
||||
|
||||
## Book VII — Vulnerability Management
|
||||
|
||||
> *The patch cycle was built for a world where you had weeks. That world is gone. Exploitation now arrives in hours, the patch arrives in days, and no amount of "patch faster" closes a gap that runs the wrong way by two orders of magnitude. Stop racing the attacker to the patch. Change the race.*
|
||||
|
||||
---
|
||||
|
||||
## The governing question
|
||||
|
||||
The first six books were written for a world in which the dominant way into an estate was a person — phished, tricked, talked past the controls. That assumption is now wrong. As of the 2026 Verizon DBIR, **exploitation of vulnerabilities is the leading initial-access vector in confirmed breaches — roughly twice phishing, for the first time in the report's history.** The front door changed. This book changes the lens to match.
|
||||
|
||||
The governing question is the same as everywhere else in the handbook, pointed at the vulnerability surface:
|
||||
|
||||
> **When — not if — a vulnerability on your estate is exploited, does the estate come back weaker, the same, or stronger?**
|
||||
|
||||
A fragile estate treats every CVE as a race it has already lost and patches by score until the analyst burns out. A robust estate patches the important ones fast and survives. An antifragile estate **stops treating the vulnerability list as the unit of work at all** — it asks where the vulnerability sits on the kill chain, removes the false urgency that hides the real targets, contains the few that matter in hours, and feeds every exploited path back into architecture so the *next* vulnerability on that path is a non-event.
|
||||
|
||||
The reframe that powers the book: **you cannot win a speed race against machine-speed exploitation by moving your humans faster, and you do not have to.** The winning move is not to patch the long tail before the attacker reaches it — that is arithmetically impossible and getting worse. The winning move is to make most vulnerabilities not matter (blast-radius and reachability), contain the few that do in the time you actually have (hours, not weeks), and convert every near-miss into a permanently shorter kill chain.
|
||||
|
||||
---
|
||||
|
||||
## Why the old model is finished — the arithmetic
|
||||
|
||||
Four numbers end the debate, and they are worth saying out loud to a client in a room:
|
||||
|
||||
- **Time-to-exploit has collapsed** from a median of 771 days in 2018 to roughly **4 hours** by 2024. The window the entire patch-management model was built around — the weeks between disclosure and exploitation — has effectively closed.
|
||||
- **Patching still takes weeks.** The 2026 DBIR puts median remediation of edge-device vulnerabilities at **43 days**, with only **54% remediated within a year.** 43 days versus 4 hours is the whole story.
|
||||
- **Volume has gone vertical.** ~59,000 new CVEs were projected for 2025, a ~50% year-on-year increase, and 2026 is on pace to exceed it. The enrichment infrastructure has buckled under the load — NIST reclassified ~29,000 backlogged CVEs to "Not Scheduled," meaning the data you relied on to prioritise is arriving late or never.
|
||||
- **Exploitation is being automated.** Autonomous exploitation research has demonstrated AI systems exploiting 174 of 178 CISA Known-Exploited Vulnerabilities at an average of ~21 minutes each, with no human in the loop, and an ~87% success rate against one-day vulnerabilities in real software. The attacker side automates faster than the defender side because generating a working exploit for a known bug is a clean, verifiable, deterministic problem — exactly what machines are good at — while *defending* requires environmental context, which is exactly what they have historically been bad at.
|
||||
|
||||
The honest conclusion: **a human-paced, score-sorted patch programme is now structurally incapable of keeping pace.** This is not a maturity problem to be solved with more analysts. It is a model that has run out of road. Everything below is the replacement.
|
||||
|
||||
One piece of good news hides in the data, and the whole framework leans on it: **roughly 90% of "critical" vulnerabilities are not actually exploitable in a given environment once compensating controls, reachability, and segmentation are properly mapped.** The fragility is not that you have 40,000 criticals. It is that you cannot yet tell which ~10% are real, so you treat all 40,000 as equally urgent and drown. Antifragile vulnerability management is, before anything else, the discipline of removing the 90% of false urgency so the real targets become visible.
|
||||
|
||||
---
|
||||
|
||||
## 1. Fragility inventory — where vulnerability management rots
|
||||
|
||||
### CVSS as the prioritisation engine
|
||||
|
||||
The original sin. CVSS scores *severity in the abstract* — it knows nothing about whether the vulnerable asset is internet-reachable, whether it sits on the kill chain, whether an exploit exists, or whether an existing control already neutralises it. A 9.8 on a segmented, non-privileged, unreachable host is noise; a 7.5 on an internet-facing box one hop from a domain controller is a P0. Sorting 40,000 findings by CVSS produces a list that is precisely uncorrelated with where the attacker will actually go. It feels like prioritisation. It is sorting by the wrong key.
|
||||
|
||||
### The infinite, undifferentiated backlog
|
||||
|
||||
"We have 40,000 criticals" is not a vulnerability problem; it is a *triage* problem wearing a vulnerability costume. An undifferentiated backlog has no front — every item looks equally urgent and equally hopeless — so the team either patches by score (wrong key) or freezes. The backlog grows faster than any human process can drain it, which means a backlog-draining strategy is a strategy to fall behind forever.
|
||||
|
||||
### Patch velocity treated as the only lever
|
||||
|
||||
The reflex when the AI-exploitation story lands is "we need to patch faster." It is the wrong reflex, and it is the most expensive one. You cannot out-patch a 4-hour exploitation window with a 43-day cycle by trimming the cycle to 30 days. Velocity is a real lever for the long tail, but as the *primary* response to the speed problem it is a fragilizing illusion — it consumes the entire budget defending a race you mathematically cannot win, and leaves nothing for the moves that actually change the outcome (reachability, blast radius, containment, architecture).
|
||||
|
||||
### The half-done remediation — the ghost patch
|
||||
|
||||
Book I's ghost-policy corollary, applied to vulnerabilities. A patch deployed to 80% of the fleet, a compensating rule applied but never verified to actually block, a "remediated" ticket closed against a host that quietly rolled back — these are *worse* than an open finding, because the open finding is at least honest. A remediation that displays as done while enforcing nothing is a vulnerability with a clean bill of health. **A vulnerability that is partly fixed is not partly safe; it is fully exploitable and now invisible.**
|
||||
|
||||
### The unscanned and the unscannable
|
||||
|
||||
You cannot prioritise what you cannot see. The fleet you don't scan (Book IV's shadow and dark device populations), the appliance whose firmware no scanner reads, the SaaS you don't own, the dependency buried three layers into a container image — these are the dangerous quanta precisely because they carry no score at all. An estate that congratulates itself on draining the *known* backlog while the unknown surface grows is optimising the lit area under the streetlight.
|
||||
|
||||
### Reachability and compensating controls left unmapped
|
||||
|
||||
If you have not mapped which assets are internet-reachable, which sit behind a WAF or EDR, which are segmented away from the crown jewels, then you have no way to perform the one subtraction that matters — collapsing 40,000 criticals to the ~10% that are genuinely exploitable here. Without reachability and control context, every finding is theoretically critical and therefore practically un-prioritisable.
|
||||
|
||||
### Remediation as the silent bottleneck
|
||||
|
||||
Detection is largely solved — most teams are *drowning* in findings, not short of them. The bottleneck is everything after: triage, ownership, change windows, approvals, deployment, verification. Each human handoff in that chain costs hours or days, and there are usually five or six of them. In a world of 4-hour exploitation, a six-handoff remediation pipeline *is* the vulnerability.
|
||||
|
||||
### Detection without a feedback path to architecture
|
||||
|
||||
A vuln gets exploited (or nearly), it gets patched, the ticket closes, and the *path* the attacker used — the flat segment, the over-privileged service account, the reachable management interface — stays exactly as it was, waiting for the next CVE to land on it. The incident produced a patch but no structural change. The disorder was wasted. This is the Book VI failure mode pointed at the vulnerability layer, and it is the difference between a programme that gets stronger and one that runs in place forever.
|
||||
|
||||
---
|
||||
|
||||
## 2. Via negativa — what to remove
|
||||
|
||||
The defining act of antifragile vulnerability management is **subtraction before addition.** You remove false urgency, false comfort, and false work before you add a single new tool.
|
||||
|
||||
1. **Remove CVSS as the sort key.** It does not go away — it stays as one input — but it stops being the thing that orders the queue. The queue is ordered by kill-chain position and exploitability in *this* environment.
|
||||
2. **Remove the ~90% of criticals that aren't exploitable here.** Map reachability and compensating controls and *delete the false urgency* on everything segmented, unreachable, or already neutralised. This is the single highest-leverage move in the entire programme: it turns "40,000 criticals" into "400 that are real and 40 that are on fire," and it is pure subtraction.
|
||||
3. **Remove the undifferentiated backlog.** A backlog with no structure is itself a fragility. Replace it with quanta (Section 3) — time-budgeted, atomic, completable units. An item that cannot be placed in a quantum is either not real (delete it) or not yet understood (route it to discovery).
|
||||
4. **Remove "patch faster" as the headline strategy.** Demote velocity to what it is — a lever for the long tail — and stop letting it consume the budget that belongs to reachability, blast radius, and containment.
|
||||
5. **Remove the half-done remediation from the "done" column.** A fix is not done until it is *verified to enforce* against a real test, not until the ticket is closed. Every quantum closes with a signal or it does not close. (Book I: validate by observation, never by inspection.)
|
||||
6. **Remove human handoffs from the hours-lane.** The steps in the critical-quantum pipeline that require no judgement — detection, reachability assessment, work-item generation, routing — get automated within policy guardrails so the scarce human judgement is spent only where judgement is actually required. You are not removing the human; you are removing the human from the steps that were only ever latency.
|
||||
|
||||
---
|
||||
|
||||
## 3. Quantum vulnerability management — the core model
|
||||
|
||||
Here is the model the rest of the book turns on, and the direct answer to "how do we size remediation to a world that moves in hours."
|
||||
|
||||
A **quantum** is the smallest unit of remediation that (a) fully closes a specific exploitable path, (b) is sized to a time budget it can *actually be completed within*, and (c) ends in a verifiable signal. The word is deliberate. A quantum is *atomic* — you cannot ship half of it and claim half the protection (that is the ghost patch). And it is *discrete* — work is packetised into units that fit the time you have, not smeared across an infinite backlog.
|
||||
|
||||
The sort key is not severity. It is **time-to-existential-impact**, which is a function of three things the estate actually determines:
|
||||
|
||||
> **kill-chain position × reachability × exploit availability**
|
||||
|
||||
A vulnerability that sits on the path to existential compromise, is reachable by the adversary, and has a working exploit in the wild has a time-to-impact measured in hours. The same vulnerability, segmented away and unreachable, has a time-to-impact measured in months — or never. **The vulnerability is identical; its quantum is different, because its position is different.** This is the Book I principle (kill-chain position changes priority, not the CVE) made operational.
|
||||
|
||||
That sort produces three live quanta and one that is more dangerous than all of them:
|
||||
|
||||
### Critical quantum — the hours lane
|
||||
|
||||
On the kill chain, reachable, exploitable now. The time budget is **hours**, and that fact dictates the response: **you cannot wait for a patch cycle, so the critical quantum is closed by a compensating control, not necessarily the patch.** Block it at the edge, sever the reachability, disable the vulnerable feature, isolate the host, pull it behind the WAF. The patch follows later in the standard lane on the normal change calendar. The critical quantum's job is to **move the asset out of the hours-window** — to convert a 4-hour time-to-impact into a non-urgent one — by the cheapest fast control available. This is the lane that must be partly autonomous (Section 6), because human-paced execution cannot meet an hours budget.
|
||||
|
||||
### Severe quantum — the days lane
|
||||
|
||||
Material risk, reachable with friction, or where a compensating control already buys partial cover. The time budget is **days**. These are batched into a days-sized packet of work that can be fully completed and verified inside a single short change window — not started and left at 80%.
|
||||
|
||||
### Standard quantum — the sprint lane
|
||||
|
||||
The long, real, non-urgent tail. The time budget is a **sprint**. The discipline here is batching: the long tail is drained in sprint-sized quanta of work that *can actually be finished*, each one atomic and verified, rather than as an ever-growing list nobody ever reaches the bottom of. This is the only lane where "patch velocity" is the right tool, and it is fine for it to be slow, because by definition nothing in it is on fire.
|
||||
|
||||
### Dark quantum — the unsized unknown
|
||||
|
||||
The most dangerous quantum is the one you cannot size, because you cannot yet see the asset, cannot establish reachability, or cannot determine exploitability. An unsized quantum is not a low priority — it is an *uncharacterised* one, and uncharacterised risk on an unknown asset is exactly how estates die. The antifragile response is not to ignore it (it has no score, so the old model does) but to **route it to discovery and to the Kill Chain Assessment** — to spend effort turning a dark quantum into a sized one, because a known severe is safer than an unknown nothing. This lane is why discovery (Book IV, the zero-budget discovery playbooks, the Kill Chain Assessment app) is part of vulnerability management and not separate from it.
|
||||
|
||||
**The quantum discipline in one line:** size every remediation to the time you actually have, make each unit atomic and verifiable, and spend your scarce judgement converting dark quanta into sized ones — not re-sorting the known list by the wrong key.
|
||||
|
||||
---
|
||||
|
||||
## 4. The barbell — fast containment and deep architecture, nothing in the fragile middle
|
||||
|
||||
The vulnerability barbell has two ends and a lethal middle.
|
||||
|
||||
**One end: cheap, fast, reversible containment.** The hours-lane compensating controls — edge blocks, reachability cuts, feature disables, isolation. Low cost, high speed, applied within policy, reversible when the patch lands. This end exists to win the time race the patch can never win.
|
||||
|
||||
**The other end: slow, structural, blast-radius reduction.** Segmentation, least privilege, T0 protection, assume-breach architecture (the whole of Books II–V). This is the end that makes the ~90% of vulnerabilities *not matter*, because a vulnerability that cannot reach anything important and cannot pivot is a finding, not an incident. It is slow and expensive and it is the only durable bet — architecture beats velocity in the vulnerability race, and it is the only race you can actually win.
|
||||
|
||||
**The fragile middle to avoid: the aging critical-patch backlog.** A months-long queue of "critical" patches is neither fast containment nor structural fix. It is the worst of both — it carries the urgency of the hours-lane but moves at the speed of the sprint-lane, so it spends maximum anxiety for minimum protection while the attacker clears it for you, one exploited host at a time. The barbell says: contain it fast *or* architect it away. Do not let it sit in the middle, aging, pretending that "we're working through the criticals" is a posture.
|
||||
|
||||
The asymmetric-payoff reading (Pillar 5): a few hours of compensating-control work on a kill-chain node prevents a catastrophe, and a segmentation project that costs a quarter makes a thousand future CVEs irrelevant. Both ends of the barbell are convex. The fragile middle is concave — maximum cost, minimum return.
|
||||
|
||||
---
|
||||
|
||||
## 5. Optionality & recovery — designing so most vulnerabilities can't matter
|
||||
|
||||
- **Reachability as a control surface.** If you can cut a vulnerable asset off from the adversary faster than you can patch it — and you almost always can — then reachability *is* your fastest remediation. Build the capability to sever reachability quickly (edge policy as code, network isolation on demand) and you have an answer to every hours-lane finding that does not depend on a vendor patch existing yet.
|
||||
- **Compensating-control inventory, mapped in advance.** The ~90% reduction only works if you already know, per asset, what controls are in front of it. Map EDR coverage, WAF rules, segmentation, and internet reachability *before* the incident, so that when a zero-day drops you can answer "are we actually exposed?" in minutes instead of days. This map is the single most valuable artefact in the programme.
|
||||
- **Blast-radius limitation as vulnerability management.** Every segmentation boundary and every collapsed standing privilege is a vulnerability-management control, because it converts "exploit one thing, own everything" into "exploit one thing, contain it." The cheapest way to manage a vulnerability is to have already made it survivable.
|
||||
- **Known-good baselines and config-as-code (ASTRAL).** When a vulnerability is exploited, the ability to restore the affected control plane to a verified baseline collapses the cost of exploitation. A reachable, recoverable, version-controlled estate treats a successful exploit as an inconvenience, not a catastrophe.
|
||||
- **The pre-made "isolate vs patch vs rebuild" decision.** Decide the criteria before the incident: when do we contain-and-wait, when do we emergency-patch, when do we rebuild from known-good? Deciding under fire is how the half-done remediation gets created.
|
||||
|
||||
---
|
||||
|
||||
## 6. Stressor — the autonomy and the feedback loop
|
||||
|
||||
Two stressors run this book, and the second is the one that makes it antifragile rather than merely fast.
|
||||
|
||||
### Autonomy in the hours-lane — matching machine speed with machine speed
|
||||
|
||||
The article that prompted this book is right about the core asymmetry: **attackers are executing at machine speed and defenders are still running remediation through human-paced processes designed for a world with weeks of lead time.** The hours-lane cannot be served by a pipeline with five human handoffs. So the critical quantum's execution — detect the new exposure, cross-reference the asset inventory, assess reachability and compensating controls, generate the work item with context, route it, and in the clear cases *apply the compensating control* — runs autonomously **within human-defined guardrails.**
|
||||
|
||||
The repo's standing scepticism applies and sharpens the point rather than contradicting it: **AI on a broken foundation is expensive noise.** Autonomy without environmental context just generates tickets faster — "faster noise," the exact toil that makes developers dread security. The autonomy only works *because* the foundation is in place: the compensating-control map, the reachability model, the known-good baseline, the segmented architecture. Autonomy is the accelerator on the hours-lane; architecture is still the durable bet. The human role moves up a level — from doing the remediation to **governing the policy**: which classes of action the system may take, which severity thresholds trigger automated containment, which changes still require a human. That is a better use of scarce security talent and the only operating model that survives the volume. The concrete blueprint for this lane is in [AI-Assisted TVM](../playbooks/ai-assisted-tvm.md); this book is the principle, that playbook is the build.
|
||||
|
||||
The guardrail is the whole game. Autonomous does not mean uncontrolled. The most defensible implementations keep the human at the policy boundary and delegate only execution — and they apply compensating controls (reversible, contained) far more readily than irreversible changes. Start the autonomy on the safest, highest-value action: cutting reachability on a confirmed-exploitable, internet-facing, kill-chain asset.
|
||||
|
||||
### The feedback loop — every exploited path becomes a shorter kill chain
|
||||
|
||||
This is the climax, and it is the same machine as Book VI. A vulnerability that was exploited, or nearly exploited, is the cheapest penetration test you will ever get — honest, real-world data about exactly where a path to the crown jewels was open. Patching the CVE wastes that data. The antifragile move is to **sever the path**: the flat segment gets a boundary, the over-privileged service account gets collapsed, the reachable management interface gets pulled behind the bastion — so that the *next* vulnerability that lands on that path is a non-event before it is ever disclosed.
|
||||
|
||||
Measure the loop, not just the lane. MTTR tells you how fast you patch; it does not tell you whether you are getting stronger. The antifragile metric is: **after each exploited-or-near vulnerability, did the kill chain get shorter?** If the last ten vulnerability incidents produced ten patches and zero severed paths, the loop is broken and you are merely fast. If they produced ten patches and six structurally shortened kill chains, the estate is getting harder to compromise every time it is tested — which is the only honest definition of antifragile.
|
||||
|
||||
---
|
||||
|
||||
## Honest uncertainty (verify the moving parts)
|
||||
|
||||
Stable and Lindy (teach with confidence): CVSS is not a priority; kill-chain position is. Most criticals aren't reachable. A half-done remediation is a hidden full vulnerability. You cannot out-patch machine-speed exploitation; you can make most vulnerabilities not matter and contain the few that do. Every exploited path should shorten the kill chain. None of that churns — it is the architecture-beats-velocity thesis applied to vulnerabilities, and it will outlive every tool named here.
|
||||
|
||||
What moves, and what you must verify:
|
||||
|
||||
- **The headline statistics churn annually.** The "exploitation is #1, ~2× phishing" finding is the 2026 DBIR; the 4-hour and 43-day figures, the ~59,000-CVE projection, the autonomous-exploitation benchmarks — all of these are point-in-time and will move. The *direction* (exploitation rising, time-to-exploit collapsing, volume exploding) is the stable signal; the specific numbers need re-checking against the current year's DBIR, M-Trends, and FIRST/CVE data before you put them on a slide.
|
||||
- **The enrichment infrastructure is actively degrading.** NVD's backlog and the "Not Scheduled" reclassification mean the data you use to prioritise is itself unreliable and getting worse. Verify what enrichment you can actually trust *today*, and lean harder on your own reachability and exploitability signals precisely because the public ones are thinning.
|
||||
- **The autonomous-execution tooling is immature and fast-moving.** The Zero-Day-Agent-class pattern (autonomous detect → reachability assessment → compensating control) is real and operational but the products, their accuracy, and their guardrail models are evolving monthly. Verify current capability and, more importantly, current *failure modes* before you delegate any action — and start with reversible compensating controls, never irreversible change.
|
||||
- **The ~90%-not-exploitable figure is environment-specific.** It is a defensible industry estimate, not a law. The real number depends entirely on how well your compensating controls are actually mapped and enforced — and a mapped control that has rotted into a ghost is a false negative that will hurt you. Test the controls you are counting on, do not trust the map.
|
||||
- **Exploit-availability and threat-intelligence feeds** (CISA KEV, exploit databases, vendor advisories) are reliable in principle but vary in latency and coverage — verify which feeds are current and how fast they update before you wire them into the hours-lane.
|
||||
|
||||
If a prioritisation decision hinges on a current specific, verify it and test it. "We confirmed this asset is internet-reachable and the EDR rule actually blocks the exploit" beats any CVSS score ever published.
|
||||
|
||||
---
|
||||
|
||||
## Consolidated judgement prompts
|
||||
|
||||
- When a vulnerability on this estate is exploited, do we come back weaker, the same, or stronger? What's the mechanism that makes it stronger?
|
||||
- Are we sorting by CVSS, or by kill-chain position × reachability × exploit availability?
|
||||
- Of our "criticals," how many are actually reachable by an adversary right now? If we don't know, that is the first finding.
|
||||
- For our top exploitable findings: can we sever reachability faster than we can patch? If yes, why are we waiting for the patch?
|
||||
- Is anything in the "done" column a ghost patch — closed but never verified to enforce?
|
||||
- What is sitting in the fragile middle — the aging critical-patch backlog that is neither contained fast nor architected away?
|
||||
- How many human handoffs are in our hours-lane, and which of them require actual judgement versus just adding latency?
|
||||
- What's in the dark quantum — the unscanned, the unscannable, the unowned — and what are we doing to size it?
|
||||
- For the last ten vulnerability incidents: how many produced a severed path versus just a patch? Is the kill chain getting shorter?
|
||||
|
||||
---
|
||||
|
||||
## Where this book sits in the arc
|
||||
|
||||
Books II–V harden the containers and contents; Book VI builds the loop that makes shocks pay. Book VII is what happens when the dominant shock stops being a phished human and becomes an exploited vulnerability arriving at machine speed. The answer is not a seventh thing bolted on — it is the same antifragile lens (subtract the false, protect the irreplaceable, contain the few that matter, feed every shock back into structure) applied to the surface the attacker now prefers. The vulnerability list was never the unit of work. The kill chain always was.
|
||||
|
||||
Move fast and fix things.
|
||||
|
||||
---
|
||||
|
||||
*Book VII of the Antifragile Handbook. Pairs with the [Quantum Vulnerability Management](../core/quantum-vulnerability-management.md) framework and the [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md); the build-level companion is the [AI-Assisted TVM Blueprint](../playbooks/ai-assisted-tvm.md).*
|
||||
@@ -0,0 +1,101 @@
|
||||
# The Antifragile Handbook for M365 & Active Directory
|
||||
|
||||
Most M365 estates are fragile. Not because nobody has run the benchmarks — they have, and the scorecards look fine. They're fragile because a compliance certificate and a hardened estate are different things, and the industry has spent years teaching people to chase the first while missing the second.
|
||||
|
||||
This handbook is the attempt to close that gap. It is written for consultants who want to walk into a tenant they've never seen and find the thing that will actually kill the client — not the thing that fails the CIS audit. It is opinionated, sequenced, and deliberately uncomfortable. If you want a checklist, the CIS Benchmark is free. If you want to understand *why* the checklist exists, what breaks when the controls fail, and how to build an estate that gets stronger under attack rather than just surviving it, start here.
|
||||
|
||||
The governing question in every book is the same:
|
||||
|
||||
> **When — not if — this fails, does the estate come back weaker, the same, or stronger?**
|
||||
|
||||
---
|
||||
|
||||
## The books
|
||||
|
||||
### [Book I — Principles & Judgement](00-principles-and-judgement.md)
|
||||
|
||||
*The craft before the controls.*
|
||||
|
||||
Everything else in this series rests on the discrimination developed here: the ability to distinguish signal from noise, to know that disabling legacy auth outranks renaming forty GPOs, and to understand why compliance is a floor and a by-product rather than the target. This book also introduces the "move fast and fix things" operating principle — a deliberate inversion of the Silicon Valley creed, because the things are already broken and speed means refusing to let a thirty-page risk-acceptance process protect a fragility a teenager with a phishing kit will remove for free.
|
||||
|
||||
Read this first, even if you're experienced. Especially if you're experienced.
|
||||
|
||||
---
|
||||
|
||||
### [Book II — Hybrid Identity](01-hybrid-identity.md)
|
||||
|
||||
*Draw the wall between on-prem and cloud. In most estates there isn't one — there's a hallway with the door propped open.*
|
||||
|
||||
In a hybrid estate, on-prem AD and Entra ID are not two systems with a guarded border. They're one organism wearing two badges, joined by a bridge that most organisations cannot draw, do not monitor, and have never tested severing. This book maps the bridge — the sync engine, the connector accounts, the authentication method, the writeback paths — and explains why a single compromise of the sync server gives an attacker DCSync on-prem *and* cloud object manipulation at the same time. Then it shows how to build the actual wall.
|
||||
|
||||
If you only ever fix one domain, fix this one. Everything else assumes identity holds.
|
||||
|
||||
---
|
||||
|
||||
### [Book III — Privileged Access](02-privileged-access.md)
|
||||
|
||||
*Privilege is blast radius with a time axis. Standing privilege reaches everything, forever. The whole job is to collapse both: less reach, less time.*
|
||||
|
||||
The most dangerous accounts in any estate are the ones nobody is watching — the permanent Domain Admins that have always existed, the service accounts with Kerberoastable SPNs and passwords from 2016, the app registrations with `RoleManagement.ReadWrite.Directory` and admin consent that nobody remembers granting. This book names them, shows how they become privilege-escalation paths, and builds the case for Just-in-Time access, Entra PIM, and a rigorous service-principal audit as the core of any engagement.
|
||||
|
||||
The single most important number in this book: how many identities hold standing privilege right now?
|
||||
|
||||
---
|
||||
|
||||
### [Book IV — Devices & Endpoint (Intune)](03-devices-and-intune.md)
|
||||
|
||||
*The device will be compromised. Compliant is not the same as secure, and the portal toggle is not the same as the device's behaviour.*
|
||||
|
||||
Endpoint programmes are usually built on a wish: make the device trusted. That wish is unwinnable. This book flips the question — assume every device is already compromised, and ask what still holds — and uses that reframe to expose the gap between a "compliant" device in the portal and a device that is actually behaving as expected. It covers the hidden fleet (managed, unmanaged, shadow, dark), the Conditional Access misconfiguration patterns that most estates share, and how to build posture that survives an untrusted device rather than depending on the device being clean.
|
||||
|
||||
The spine of the book: compliance is a signal, not a checkbox.
|
||||
|
||||
---
|
||||
|
||||
### [Book V — Data & Collaboration](04-data-and-collaboration.md)
|
||||
|
||||
*Data is liquid. The question is never "is it locked down" but "where can it flow, who can reshare it, and can you see and reverse the flow?"*
|
||||
|
||||
Books II–IV protect the containers: identity, privilege, devices. This book is about the contents, and contents obey different physics. An "Anyone with the link" SharePoint share is a bearer token — no identity, no MFA, no device check, often no expiry, forwardable to anyone, reachable by the open web if it leaks. Guest sprawl hands your blast radius to external identities you don't govern. Email is the oldest exfil channel in the industry and almost never properly monitored. This book maps the exposure patterns across Exchange, SharePoint, Teams, and OneDrive, and builds the controls that let you see — and reverse — the data flow.
|
||||
|
||||
For most estates the honest answer to "can you see where it went?" is no. That's the starting point.
|
||||
|
||||
---
|
||||
|
||||
### [Book VI — Recovery & Detection-as-Feedback](05-recovery-and-detection.md)
|
||||
|
||||
*Robust means you survive the shock unchanged. Antifragile means you come back stronger. The shock is coming either way — the only choice is what you do with it.*
|
||||
|
||||
The capstone, because it decides whether everything before it was merely robust or genuinely antifragile. Detection and recovery are not the sad afterthought — they're the feedback loop that changes the structure of the estate after every shock. An org that buries incidents stays fragile. An org that treats them as fuel becomes antifragile. This book covers the recovery lies the industry tells itself (untested backups, undocumented break-glass, AD forest recovery nobody has practised), builds the detection architecture, and — most importantly — describes the machine that turns incidents, alerts, and near-misses into structural improvement.
|
||||
|
||||
Read this once you've built something worth protecting — it closes the original defensive arc (Books I–VI).
|
||||
|
||||
---
|
||||
|
||||
### [Book VII — Vulnerability Management](06-vulnerability-management.md)
|
||||
|
||||
*The patch cycle was built for a world where you had weeks. That world is gone. Stop racing the attacker to the patch — change the race.*
|
||||
|
||||
The first six books assume the dominant way into an estate is a phished human. As of the 2026 Verizon DBIR that assumption is wrong: **exploitation of vulnerabilities is now the leading initial-access vector, roughly twice phishing.** This book changes the lens to match. It refuses the two losing moves — sorting 40,000 findings by CVSS, and trying to "patch faster" against a 4-hour exploitation window — and replaces them with the antifragile alternative: subtract the ~90% of criticals that aren't actually reachable, size the rest into **quanta** by time-to-existential-impact (hours / days / sprint, plus the dangerous *dark* quantum you can't yet size), contain the few that matter with compensating controls rather than waiting for a patch, and feed every exploited path back into a shorter kill chain.
|
||||
|
||||
It pairs with the [Quantum Vulnerability Management](../core/quantum-vulnerability-management.md) framework and the [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md). Read it when the threat landscape — not the maturity model — forces the question.
|
||||
|
||||
---
|
||||
|
||||
## Field Guide (2026 Edition)
|
||||
|
||||
The books are principles; they are deliberately stable. Two field guides apply them in practice:
|
||||
|
||||
**[Field Guide — 2026 Edition](field-guide-2026.md):** Concrete actions and current tooling for foundational engagements. The "do this" companion to the handbook. Review January 2027.
|
||||
|
||||
**[Field Guide — Adversarial Validation](field-guide-adversarial-validation.md):** For clients who have done the foundational work. Tests declared controls against observed behaviour, domain by domain. Closes with a client leave-behind cadence so the admin can self-monitor between engagements. Review January 2027.
|
||||
|
||||
For inspection checklists, see the [assessment templates](../assessment-templates/): the [Engagement Checklist](../assessment-templates/engagement-checklist.md) (foundational), the [Adversarial Validation Checklist](../assessment-templates/adversarial-validation-checklist.md) (phase 2), and the [Self-Service Cadence](../assessment-templates/self-service-cadence.md) (client leave-behind).
|
||||
|
||||
---
|
||||
|
||||
## How to use this series
|
||||
|
||||
The books are sequenced deliberately — each one assumes the previous — but an experienced practitioner can use them as field references. The fragility inventories at the start of each book are designed to be usable on day one of an engagement, before you've had time to read everything. The "governing question" at the start of each section is designed to be asked out loud, to a client, in a room where someone will have to answer it.
|
||||
|
||||
The goal throughout is not compliance. Compliance is a by-product. The goal is an estate that gets harder to compromise every time it's tested — and is tested often enough to know.
|
||||
@@ -0,0 +1,568 @@
|
||||
# M365 + AD Field Guide — 2026 Edition
|
||||
|
||||
> *The books are principles. This is practice — concrete actions, current tooling, and 2026-specific decisions. It will need updating next year. That is the point.*
|
||||
|
||||
**Last updated:** June 2026
|
||||
**Companion to:** The Antifragile Handbook for M365 & AD (Books I–VI)
|
||||
**Next review:** January 2027
|
||||
|
||||
---
|
||||
|
||||
## What this is
|
||||
|
||||
The Antifragile Handbook teaches judgement. This document teaches actions — what to do, in 2026, with the tooling that exists now, in the estates you will actually walk into. Where the handbook says "eliminate AD FS," this document says how and what blockers to expect. Where the handbook says "test the CA policy," this document says what a ghost policy looks like when you find one.
|
||||
|
||||
Read the books first. Use this document on-site.
|
||||
|
||||
---
|
||||
|
||||
## Notation
|
||||
|
||||
**P0** — attacker already through; fix before leaving this session
|
||||
**P1** — closes in this engagement
|
||||
**P2** — roadmap item, documented
|
||||
**2026 note** — something that has changed or become clearer since the handbook was written
|
||||
|
||||
---
|
||||
|
||||
## 1. Hybrid Identity
|
||||
|
||||
### Remove AD FS — this is now a P0 conversation
|
||||
|
||||
In 2026, Microsoft's migration tooling has matured to the point where AD FS is a choice, not an inevitability. Every client still running it should have a migration plan or a written, named reason for not having one.
|
||||
|
||||
**Why it is a P0:** Golden SAML is still an active nation-state technique. The token-signing private key in most tenants has never been rotated, is stored on the AD FS servers, and is not monitored. One foothold on any on-prem system that can reach the AD FS servers ends cloud identity entirely — silently, with validly-signed tokens, no failed logins, nothing for a SIEM to catch.
|
||||
|
||||
**What to do:**
|
||||
- In the Entra portal, go to Identity > Applications > AD FS activity (if it appears). This gives you the relying party trust inventory and migration readiness per application. This is your conversation starter.
|
||||
- Enumerate relying party trusts: `Get-AdfsRelyingPartyTrust | Select-Object Name, Enabled, Identifier`. Each enabled one is a blocker that needs a cloud equivalent or decommission plan.
|
||||
- Check the token-signing cert: `Get-AdfsCertificate -CertificateType Token-Signing`. Note the NotAfter date and when it was last rotated. "Has not been rotated since installation" is the expected answer and is itself a finding.
|
||||
- Staged rollout in Entra lets you migrate users incrementally — you do not have to cut over all at once. Use it.
|
||||
|
||||
**Migration target:** Password Hash Sync (PHS) + Entra-managed MFA via Conditional Access. This removes the on-prem dependency for cloud authentication and kills Golden SAML as a class.
|
||||
|
||||
**2026 note:** The AD FS migration activity report and staged rollout tooling make this significantly more tractable than it was in 2023–2024. Remove the roadmap language and have the P0 conversation.
|
||||
|
||||
---
|
||||
|
||||
### Connect Sync vs Cloud Sync — new deployments
|
||||
|
||||
**2026 recommendation:** For new hybrid sync deployments and organizations without complex topologies (no device writeback, no large object filtering requirements, no multi-forest writeback scenarios), **Entra Cloud Sync** is the preferred deployment. Smaller attack surface than Connect Sync (no SQL Express, no full-blown sync engine, multiple lightweight agents for HA), easier to harden, no single machine that holds DCSync-capable credentials.
|
||||
|
||||
**Connect Sync stays correct for:** Large/complex topologies, specific writeback scenarios (check the current parity matrix at Microsoft Learn before promising Cloud Sync covers a client's requirements — this changes).
|
||||
|
||||
**For existing Connect Sync deployments:** The migration path to Cloud Sync exists. Check current documentation for topology compatibility. Do not promise the migration before confirming the client's scenario is supported.
|
||||
|
||||
**In either case, the sync server is Tier 0.** See the hardening actions below.
|
||||
|
||||
---
|
||||
|
||||
### Sync server hardening — concrete actions
|
||||
|
||||
The sync server (Connect or Cloud Sync agent host) is typically treated as a utility VM. It holds an identity capable of DCSync. Treat it accordingly.
|
||||
|
||||
**Immediate checks:**
|
||||
- Is the server domain-joined to the production domain? If yes, its blast radius is one hop from any Tier 1 or Tier 2 compromise. Ideal: join it to a dedicated Tier 0 or management forest, or isolate it behind jump-box access only.
|
||||
- What account runs the connector service, and what permissions does it have? For Connect Sync, the on-prem connector account needs `Replicate Directory Changes` and `Replicate Directory Changes All`. Confirm it is a dedicated service account (ideally gMSA), not a human admin account that doubled up.
|
||||
- Has the server ever been patched? Check `Get-HotFix | Sort-Object InstalledOn -Descending | Select-Object -First 5`. If nothing in the last 60 days, that is a finding.
|
||||
- Is the Entra connector account (Directory Synchronization Accounts role) monitored? Any sign-in from any host other than the sync server should alert immediately.
|
||||
- Are local administrators on the sync server documented and minimal?
|
||||
|
||||
---
|
||||
|
||||
### Cloud-only Global Admins — enforce it on day one
|
||||
|
||||
**P0 if not in place.** Synced accounts holding Global Admin are the most common single finding across all engagements and the most direct path from a ransomwared on-prem AD to cloud dominance.
|
||||
|
||||
**Find the synced GAs:**
|
||||
```powershell
|
||||
# Connect-MgGraph -Scopes "Directory.Read.All"
|
||||
$gaRoleId = (Get-MgDirectoryRole -Filter "displayName eq 'Global Administrator'").Id
|
||||
Get-MgDirectoryRoleMember -DirectoryRoleId $gaRoleId |
|
||||
Where-Object { $_.AdditionalProperties['userPrincipalName'] -notlike "*.onmicrosoft.com" }
|
||||
```
|
||||
|
||||
Every result is a synced account. Every synced account in GA is a P0.
|
||||
|
||||
**Remediation path:**
|
||||
1. Create a new cloud-only account (`user@tenant.onmicrosoft.com` format), assign GA, configure phishing-resistant MFA.
|
||||
2. Validate the new account works — sign in, confirm PIM activation if PIM is in place.
|
||||
3. Remove GA from the synced account.
|
||||
4. Add a Conditional Access policy blocking synced account UPNs from holding privileged roles (belt-and-suspenders; requires knowing the UPN pattern).
|
||||
|
||||
---
|
||||
|
||||
### Seamless SSO key — rotate it
|
||||
|
||||
`AZUREADSSOACC` was created when Seamless SSO was enabled and is almost certainly unrotated. The Kerberos key on this account is a silver-ticket / cloud token-forging exposure if the on-prem is compromised.
|
||||
|
||||
**Check last password set:**
|
||||
```powershell
|
||||
Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet | Select-Object PasswordLastSet
|
||||
```
|
||||
|
||||
If this matches the approximate go-live date of the Microsoft 365 tenant, it has never been rotated.
|
||||
|
||||
**Rotate it:** Use the `Update-AzureADSSOForest` PowerShell command (in the MSOnline / Entra Connect tooling). Run it twice per domain — same discipline as KRBTGT rotation. If Seamless SSO is not needed (Entra join and modern auth only), remove `AZUREADSSOACC` entirely.
|
||||
|
||||
---
|
||||
|
||||
### Writebacks — name and own each one
|
||||
|
||||
Enumerate which writebacks are enabled (password writeback, group writeback, device writeback) in Connect Sync or Cloud Sync configuration. For each:
|
||||
- Who owns the decision to have it enabled?
|
||||
- What does an attacker reach if the cloud side is compromised — can they write into on-prem AD?
|
||||
- Is the reverse blast radius documented?
|
||||
|
||||
Password writeback is usually justified (SSPR usability). Group writeback creates a two-way channel between cloud security groups and on-prem AD — the blast radius should be explicit. If there is no current owner or justification for a writeback, disable it.
|
||||
|
||||
---
|
||||
|
||||
## 2. Privileged Access
|
||||
|
||||
### PIM: table stakes in 2026
|
||||
|
||||
If the client has Entra ID P2 (included in Microsoft 365 E5, Business Premium, and available as an add-on) and is not using PIM for Entra administrative roles, that is a P0. There is no acceptable reason in 2026 for standing Global Admin, Privileged Role Administrator, Security Administrator, or Exchange Administrator assignments when PIM provides JIT elevation.
|
||||
|
||||
**What to confirm during engagement:**
|
||||
- Global Admin: eligible only, not active. Any active (permanent) GA assignment that is not a break-glass account is a finding.
|
||||
- Privileged Role Administrator: requires approval workflow on activation, not just MFA. This role controls who becomes admin — it should require a second human to approve.
|
||||
- Security Administrator and Exchange Administrator: eligible, MFA on activation, justified time box (8 hours maximum for a working day).
|
||||
- PIM activation requires phishing-resistant MFA. If it accepts push-approve, it is phishable.
|
||||
|
||||
**2026 note:** PIM now supports custom role definitions. If a client is assigning built-in broad roles (like Global Admin) to do a narrow task, check whether a custom role or a more scoped built-in (e.g., Intune Administrator instead of Global Admin) applies.
|
||||
|
||||
---
|
||||
|
||||
### Service principals: the 2026 audit
|
||||
|
||||
Service principals hold more standing privilege in most tenants than all human admins combined. They cannot do MFA. They are almost never reviewed. This is the dark matter of privileged access.
|
||||
|
||||
**Escalation-grade Graph permissions — find every app holding these in 2026:**
|
||||
- `RoleManagement.ReadWrite.Directory` — can grant any Entra role
|
||||
- `AppRoleAssignment.ReadWrite.All` — can assign any app role, including to itself
|
||||
- `Application.ReadWrite.All` — can modify any application and create new ones
|
||||
- `Directory.ReadWrite.All` — broad directory write
|
||||
- Any API permission scoped `Full` or ending in `.ReadWrite.All` for sensitive services
|
||||
|
||||
```powershell
|
||||
# Find service principals with dangerous Graph permissions (application permissions)
|
||||
Get-MgServicePrincipal -All | ForEach-Object {
|
||||
$sp = $_
|
||||
Get-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $sp.Id |
|
||||
Where-Object { $_.PrincipalId -eq $sp.Id }
|
||||
} # — pipe to filter on the dangerous role IDs listed above
|
||||
```
|
||||
|
||||
For every hit: who created this app registration, when, is the permission still needed, is there an expiring secret or certificate, and can it be replaced with a managed identity?
|
||||
|
||||
**Secrets never expire — find them:** In the Entra portal > App registrations > All applications > sort by "Certificate & secrets expiration." Filter for never-expiring secrets. Every one is a standing credential with no forced rotation.
|
||||
|
||||
---
|
||||
|
||||
### On-prem service accounts: gMSA yes, dMSA wait
|
||||
|
||||
**gMSA (Group Managed Service Accounts):** The right answer for on-prem service accounts in 2026. Automatic password rotation (no static secret), not Kerberoastable in the traditional sense, natively supported across Windows Server 2012+. If a client has regular service accounts with static passwords (especially if those passwords are 2+ years old), migrate to gMSA.
|
||||
|
||||
**Kerberoasting check (run this, not just ask about it):**
|
||||
```powershell
|
||||
# Find accounts with SPNs and static passwords
|
||||
Get-ADUser -Filter {ServicePrincipalName -ne "$null"} -Properties ServicePrincipalName, PasswordLastSet, Enabled |
|
||||
Where-Object {$_.Enabled -eq $true} |
|
||||
Select-Object Name, PasswordLastSet, ServicePrincipalName
|
||||
```
|
||||
|
||||
Any result with a `PasswordLastSet` older than 1 year is Kerberoastable and a P0.
|
||||
|
||||
**dMSA (Delegated Managed Service Accounts):** Introduced with Windows Server 2025-era tooling, targeting the migration path from standing service accounts. Do not recommend dMSA in 2026 — there is published privilege-escalation research against the migration path. Use gMSA until the specific vulnerabilities are patched and the client's environment is confirmed current. Check current Microsoft advisories at engagement time.
|
||||
|
||||
---
|
||||
|
||||
### LAPS: Windows LAPS deployment in 2026
|
||||
|
||||
**Legacy Microsoft LAPS** (the separately-downloaded agent) should be migrated to **Windows LAPS**, the built-in solution available in Windows 10 22H2 / Windows 11 22H2 and Windows Server 2019+ with April 2023 updates or later.
|
||||
|
||||
Windows LAPS can store passwords in AD, in Entra ID (for Entra-joined devices), or both. For hybrid estates, store in both. Manage via Intune (cloud-joined) or GPO (domain-joined).
|
||||
|
||||
**Coverage check:**
|
||||
```powershell
|
||||
# Computers without LAPS password set (null = not managed)
|
||||
Get-ADComputer -Filter * -Properties 'ms-Mcs-AdmPwd', 'msLAPS-Password' |
|
||||
Where-Object { $_.'ms-Mcs-AdmPwd' -eq $null -and $_.'msLAPS-Password' -eq $null } |
|
||||
Select-Object Name
|
||||
```
|
||||
|
||||
Every result is a computer with a shared or unknown local admin password — lateral movement risk.
|
||||
|
||||
---
|
||||
|
||||
### KRBTGT rotation
|
||||
|
||||
Check password age. 365+ days without rotation is a P1. No documented rotation since domain creation (common when the domain is 5–10 years old) is a P0 for any high-sensitivity engagement.
|
||||
|
||||
```powershell
|
||||
Get-ADUser krbtgt -Properties PasswordLastSet | Select-Object PasswordLastSet
|
||||
```
|
||||
|
||||
Rotation procedure: rotate once, wait at least the max ticket lifetime (default 10 hours), rotate again. Document both rotation timestamps. After rotation, monitor for authentication failures caused by cached golden tickets — if detections fire, that was a real golden ticket, not a drill finding.
|
||||
|
||||
---
|
||||
|
||||
### ADCS: treat it as Tier 0
|
||||
|
||||
If the client has Active Directory Certificate Services deployed (almost all do if they have a domain older than 7 years), run a basic ESC vulnerability check. The ESC1–ESC8 misconfigurations are well-documented, freely exploitable, and almost never remediated because most organizations do not know they have ADCS issues.
|
||||
|
||||
**Quick check:**
|
||||
- Is ADCS installed? `Get-WindowsFeature ADCS-Cert-Authority` on any server
|
||||
- Is any template published with "Supply subject in request" + broad enrollment rights? That is ESC1.
|
||||
- Certipy (open source) or Certify: run in read-only enumeration mode (`certipy find`) to identify vulnerable templates
|
||||
|
||||
ADCS is Tier 0. It sits on whatever server it runs on, and that server should have the same access controls as a domain controller. Verify it is not on a Tier 1 or Tier 2 server.
|
||||
|
||||
---
|
||||
|
||||
### Admin workstations — the cloud VM is the deployable PAW
|
||||
|
||||
Physical PAWs are right in principle and almost never get deployed. Hardware procurement, second device, behaviour change — the project does not survive contact with a real IT budget. Do not open the conversation with "you need a dedicated PAW laptop." Open it with the cloud admin VM.
|
||||
|
||||
**The cloud admin VM:** a Windows 365 or Azure Virtual Desktop instance provisioned from a hardened template. The admin connects from their normal device via browser or RDP. Privileged credentials — including WireGuard keys for the management overlay — live in the cloud VM, not on the admin's local device. Compromise response: wipe it, reprovision from template in under 20 minutes.
|
||||
|
||||
**Provisioning the cloud admin VM:**
|
||||
1. Create a Windows 365 or AVD instance from a hardened base image (CIS L2 baseline or equivalent)
|
||||
2. Enrol in Intune, apply a configuration profile: no internet browsing, no personal email, no Microsoft Store apps, screen lock on idle, BitLocker enforced
|
||||
3. Scope a CA policy restricting Global Admin and privileged role activation to this device (device compliance + named Intune group)
|
||||
4. Install the Nebula client (if deploying T0 overlay) and distribute the pre-signed node certificate
|
||||
5. Install the Tailscale client (if deploying T1 overlay) and enrol with the Entra OIDC identity
|
||||
|
||||
**Minimum viable without the overlay:** a dedicated Intune-enrolled, Entra-joined cloud VM with no email and no general browsing, and a CA policy restricting GA activation to it. Not perfect, but it will actually get deployed and maintained.
|
||||
|
||||
---
|
||||
|
||||
### Management overlay — Nebula for T0, Tailscale for T1
|
||||
|
||||
**When a client needs this:** SME and mid-market clients with multi-cloud resources, DevOps workloads, or remote admins — and no physical data centre with a proper management VLAN. The overlay builds the management plane that the physical network cannot provide.
|
||||
|
||||
**When a client does not need this:** organisations with their own data centres and physical network infrastructure already in place. Traditional management VLAN segmentation plus jump boxes is the right answer there. Adding an overlay creates a new Tier 0 component without proportional benefit.
|
||||
|
||||
**The T0 overlay — Nebula:**
|
||||
|
||||
Nebula has no coordinator in the runtime path. Once certificates are distributed, the overlay runs with zero external dependencies. This is the right property for T0: a compromised or unavailable external service cannot affect access to your domain controllers.
|
||||
|
||||
Deployment steps:
|
||||
1. Provision the Nebula CA on a dedicated air-gapped machine (a dedicated laptop that is never networked, or a cheap PC kept in a drawer)
|
||||
2. Generate and sign node certificates for each T0 node (DCs, sync server, ADCS, cloud admin VMs/PAWs)
|
||||
3. Distribute the signed certificates and the CA certificate to each node
|
||||
4. Configure the Nebula ACL policy: cloud admin VMs can reach DCs on port 3389 (RDP) and 5985/5986 (WinRM); nothing else. DCs do not reach each other through Nebula (they have their own replication channel)
|
||||
5. Start the Nebula service on each node. Test connectivity from the cloud admin VM to a DC
|
||||
6. Document the CA signing ceremony: who can sign new certs, what approval is needed, where the CA key is stored, how to revoke (distribute updated blocklist to all nodes)
|
||||
|
||||
**Realistic T0 node count:** 15–25 nodes for a 5,000-person organisation. Certificate management is a documented ceremony run a few times a year, not an ongoing operational burden.
|
||||
|
||||
**The T1 overlay — Tailscale:**
|
||||
|
||||
Tailscale with Entra OIDC + key expiry gives you device trust (WireGuard node key) plus per-session identity assertion (Entra MFA on re-authentication). Configure key expiry to force re-authentication on a schedule aligned with the session risk tolerance (8–24 hours for admin access).
|
||||
|
||||
Deployment steps:
|
||||
1. Create a Tailscale account or deploy Headscale (for sovereign requirements)
|
||||
2. Configure the OIDC integration with Entra ID. Set the MFA requirement to phishing-resistant (FIDO2) in the Entra Conditional Access policy that governs Tailscale authentication
|
||||
3. Set key expiry: 8–24 hours for admin nodes, 24–72 hours for standard nodes
|
||||
4. Define ACL policy: cloud admin VMs reach T1 servers on management ports only; standard user devices do not appear in the T1 ACL
|
||||
5. Enrol cloud admin VMs as nodes. Enrol T1 servers (member servers, cloud management hosts, K8s API server endpoints)
|
||||
6. Test: attempt to reach a T1 server from a non-enrolled device. Expected: no route. From an enrolled cloud admin VM: connected
|
||||
|
||||
**What Tailscale carries for multi-cloud:** kubectl access to K8s clusters, SSH/RDP to member servers and cloud VMs, cloud CLI access where the management API is behind a private endpoint. It does not carry M365 admin traffic — that goes direct to Microsoft over the internet, gated by Conditional Access.
|
||||
|
||||
**The Nebula CA — the one critical operation:**
|
||||
|
||||
The CA key is the trust anchor for the entire T0 overlay. Its compromise means an attacker can enrol their own node and grant it access to every DC. Treat it accordingly:
|
||||
- Air-gapped machine, never networked after initial setup
|
||||
- CA key encrypted at rest on the machine and backed up separately
|
||||
- Certificate lifetime: 180 days maximum, so non-renewal handles most revocation cases
|
||||
- Revocation: generate and distribute an updated `blocklist.pem` to all nodes if a PAW is lost or an admin departs before cert expiry
|
||||
- At least two named people who know the ceremony and can perform it
|
||||
|
||||
---
|
||||
|
||||
## 3. Devices & Endpoint
|
||||
|
||||
### Reconcile the real fleet — do this on day one
|
||||
|
||||
Do not trust Intune's enrolled device count or any CMDB. Pull from four sources and compare them:
|
||||
1. Intune managed devices (Intune portal)
|
||||
2. Entra registered/joined devices (Entra portal > Devices)
|
||||
3. Entra sign-in logs, device detail (what is actually authenticating)
|
||||
4. Network device discovery if in scope
|
||||
|
||||
The gap between sources 1+2 and source 3 is your shadow/dark device population. Source 3 will almost always be larger. Every device authenticating that is not in sources 1+2 is an unmanaged device reaching data.
|
||||
|
||||
**Concrete — pull sign-in logs by device compliance state:** In the Entra portal: Sign-in logs > Add filter > "Managed device" = No or "Compliant" = No > export. Count the distinct device IDs. That count, compared against your Intune enrolled count, is the gap metric.
|
||||
|
||||
---
|
||||
|
||||
### Cloud-native migration: Entra join + Intune as default
|
||||
|
||||
For any new device deployment or device refresh in 2026, **Entra join + Intune management** is the default. Hybrid Entra join (AD-joined + cloud-registered) is technical debt to retire, not a target state.
|
||||
|
||||
**Migration readiness check:** What on-prem resources does the client's fleet actually need? Line-of-business applications, file shares, printers? Each dependency is a reason to stay hybrid; each that can be moved or resolved with another mechanism is a reason to go cloud-native. Build the dependency map first.
|
||||
|
||||
**GPO to Settings Catalog:** Most GPO settings now have equivalents in the Intune Settings Catalog. The IntunePolicyParser tool can parse existing GPOs and identify Settings Catalog equivalents. Run this early in an endpoint engagement to scope the migration effort.
|
||||
|
||||
---
|
||||
|
||||
### Conditional Access — test every policy before signing off
|
||||
|
||||
This is not a recommendation. It is a requirement.
|
||||
|
||||
**Protocol:**
|
||||
1. Before changing or reviewing any CA policy, write down the expected behavior for the users and conditions in scope: *"User X, device Y, location Z → MUST be [blocked/granted/MFA-prompted]."*
|
||||
2. Use What If as a logic check only — it evaluates configuration, not enforcement.
|
||||
3. Drive real sign-ins for every important user/condition combination. Observe the actual result.
|
||||
4. If the observed result contradicts the displayed configuration, recreate the policy from scratch. Do not edit the existing object — a ghost policy carries corruption forward through edits.
|
||||
5. Re-test after any tenant-level change: adding a domain, changing federation, new app registration. You do not need to have touched the CA policy for it to ghost.
|
||||
|
||||
**Report-only mode:** Use report-only to pre-validate before enabling. But test in enabled mode before signing off. Report-only cannot find a ghost policy — only a live enforcement failure can.
|
||||
|
||||
---
|
||||
|
||||
### EPM: eliminate standing local admin
|
||||
|
||||
In 2026, **Endpoint Privilege Management (EPM)** in Intune is the right answer for "some users need admin rights for specific software." EPM provides JIT, audited, approved elevation without giving the user permanent local admin.
|
||||
|
||||
**Licensing:** Requires Intune Plan 2 or the Intune Suite (not included in standard Business Premium or E3 — verify licensing before scoping).
|
||||
|
||||
**Deployment:**
|
||||
1. Audit current local admin membership across the fleet (GPO reporting or Intune device reports)
|
||||
2. Identify the specific applications or tasks requiring elevation
|
||||
3. Create EPM rules for those specific executables
|
||||
4. Remove standing local admin from standard user accounts
|
||||
5. Monitor EPM elevation events for anomalies
|
||||
|
||||
If EPM licensing is not available, Windows LAPS for local admin credentials (randomized, no shared password) plus a JIT process for elevation requests is the intermediate posture.
|
||||
|
||||
---
|
||||
|
||||
### Update rings: the lesson from 2024
|
||||
|
||||
Configure update rings in Intune for all managed endpoints. Every client needs:
|
||||
- **Pilot ring** (5–10% of devices, IT staff / early adopters): 0 days deferral
|
||||
- **Broad ring** (remainder): 7-day deferral after pilot passes
|
||||
- A named person with the authority to **halt a broad ring push** — confirmed they know how and have tested it
|
||||
|
||||
**Windows Autopatch** (included in Business Premium, E3 with Intune add-on, E5) automates ring management and defers intelligently. If the client is licensed for it and not using it, that is a quick win.
|
||||
|
||||
The 2024 CrowdStrike event applies not just to AV/EDR updates — it applies to any software distributed at scale. Update ring discipline is now an endpoint governance requirement, not a preference.
|
||||
|
||||
---
|
||||
|
||||
### MAM boundaries: test them on a real device
|
||||
|
||||
If the client uses App Protection Policies for BYOD (MAM-WE), the policy screen does not prove enforcement. Test on real devices, on current OS builds, per platform:
|
||||
|
||||
**Test protocol (run separately on iOS and Android):**
|
||||
- Attempt to copy text from a managed app (Outlook, Teams) and paste into an unmanaged app
|
||||
- Attempt to "Open in" from a managed attachment to an unmanaged app
|
||||
- Attempt to save a file locally or to the camera roll
|
||||
- Attempt to screenshot (if blocked by policy)
|
||||
- Test from an unmanaged browser accessing SharePoint or OWA
|
||||
|
||||
Document where "Block" does not block. When you find a gap that survives reinstall on multiple devices, that is a vendor escalation, not a configuration fix.
|
||||
|
||||
---
|
||||
|
||||
## 4. Data & Collaboration
|
||||
|
||||
### Anonymous sharing: disable at the tenant level on day one
|
||||
|
||||
"Anyone with the link" sharing is a bearer token for your data — no identity required, forwardable, often with no expiry, reachable by anyone who ever held the link. This is the single largest data exposure fragility in M365.
|
||||
|
||||
**Immediate action:** SharePoint Admin Center > Policies > Sharing > External sharing: set to "New and existing guests" (requires authentication) or "Only people in your organization." If the client has a business case for anonymous links, scope specific sites where it is permitted and disable at the tenant level for everything else.
|
||||
|
||||
**Enumerate existing anonymous links:**
|
||||
```powershell
|
||||
# PnP PowerShell
|
||||
Get-PnPTenantSite -IncludeOneDriveSites | ForEach-Object {
|
||||
Get-PnPSiteCollectionSharingLinks -Site $_.Url
|
||||
} | Where-Object { $_.Link -like "*guestaccess*" }
|
||||
```
|
||||
|
||||
The list you get is almost always longer than anyone expected. The exercise of producing it is itself a finding.
|
||||
|
||||
---
|
||||
|
||||
### External auto-forwarding: block it and check for active rules
|
||||
|
||||
**Block at the global level:** Exchange Admin Center > Mail flow > Remote domains > Default domain > Automatic forwarding: Disabled.
|
||||
|
||||
**Check for existing rules (do this before blocking in case active BEC is in progress):**
|
||||
```powershell
|
||||
Get-TransportRule | Where-Object {$_.BlindCopyTo -ne $null -or $_.RedirectMessageTo -ne $null} |
|
||||
Select-Object Name, BlindCopyTo, RedirectMessageTo, Enabled
|
||||
```
|
||||
|
||||
Any rule forwarding to an external address with no documented business owner is a potential BEC persistence mechanism. Treat as P0 until confirmed otherwise.
|
||||
|
||||
Also check Outlook/OWA rules at the mailbox level for executive accounts:
|
||||
```powershell
|
||||
Get-Mailbox -ResultSize Unlimited | Get-InboxRule |
|
||||
Where-Object {$_.ForwardTo -ne $null -or $_.RedirectTo -ne $null} |
|
||||
Select-Object MailboxOWAUrl, Name, ForwardTo, RedirectTo
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Crown jewels: name them before scoping DLP or labels
|
||||
|
||||
The first question in every data engagement: *"Which five data sets, if exfiltrated, would end or materially damage this business?"*
|
||||
|
||||
If the client cannot name them, that is finding #1 and the prerequisite for everything else. DLP and sensitivity labels applied before the crown jewels are identified are DLP and sensitivity labels that protect the wrong things.
|
||||
|
||||
Common crown jewels in 2026: M&A communications, board and executive email, source code repositories, customer PII data subject to GDPR/NIS2, financial forecasts and models, intellectual property, credentials and secrets stored in SharePoint/Teams.
|
||||
|
||||
Once named: where do they live? Who has access? Are they labeled? Is access audited?
|
||||
|
||||
---
|
||||
|
||||
### Sensitivity labels and auto-labeling
|
||||
|
||||
**2026 recommendation:** If the client is on E5 Compliance or equivalent, deploy auto-labeling policies for the crown jewel data types. Manual labeling depends on user behavior; auto-labeling does not.
|
||||
|
||||
**Licensing check first:** Sensitivity labels: all M365 plans. Auto-labeling, advanced DLP, and Purview data governance: M365 E5 Compliance or the Microsoft Purview compliance add-on. Verify before scoping.
|
||||
|
||||
**Implementation sequence:**
|
||||
1. Define the crown jewels (see above)
|
||||
2. Create sensitivity labels in order from most to least restrictive (Highly Confidential, Confidential, Internal, Public)
|
||||
3. Apply encryption to Highly Confidential and Confidential labels — encryption travels with the file, including after exfiltration
|
||||
4. Configure auto-labeling for known high-value content types (credit card numbers, national IDs, custom regex for the client's IP)
|
||||
5. Monitor label application events before enforcing auto-labeling in production
|
||||
|
||||
---
|
||||
|
||||
### Guest access: treat as standing blast radius
|
||||
|
||||
Run a guest access review on every engagement. Most tenants cannot produce the list of current guests without effort. The exercise of trying to produce it is the finding.
|
||||
|
||||
**Enumerate guests:**
|
||||
```powershell
|
||||
Get-MgUser -Filter "userType eq 'Guest'" -All |
|
||||
Select-Object DisplayName, Mail, CreatedDateTime, SignInActivity
|
||||
```
|
||||
|
||||
Sort by `LastSignInDateTime`. Guests who have not signed in for 90+ days have no legitimate active need. The default should be expiration, not permanence.
|
||||
|
||||
**Configure guest access reviews** in Entra Identity Governance > Access reviews. Set recurring reviews for all guests at 90-day intervals. When a reviewer does not respond, the default action should be removal, not retention.
|
||||
|
||||
---
|
||||
|
||||
### Audit log: verify it is on and retained
|
||||
|
||||
Do not assume audit logging is enabled. Go to Microsoft Purview > Audit > Start recording user and admin activity (if the banner appears, it is not on). Then run a test search to confirm log entries are being captured.
|
||||
|
||||
**Retention check — critical:**
|
||||
- E3 licensing: 90-day default retention
|
||||
- E5 / Purview Audit Premium: 1 year (extendable to 10 years with add-on)
|
||||
- Unified audit log must be explicitly enabled; it has historically not been on by default in older tenants
|
||||
|
||||
For incident response purposes: if a breach is discovered 60 days in, and the client has 90-day retention, the evidence window is 30 days. For most meaningful incidents, 90 days is insufficient. Scope the retention discussion explicitly.
|
||||
|
||||
---
|
||||
|
||||
## 5. Recovery & Detection
|
||||
|
||||
### M365 backup: the mandatory conversation
|
||||
|
||||
Native Microsoft 365 provides recycle bins and version history. It does not provide point-in-time backup against ransomware, malicious admin deletion, or retention policy expiry.
|
||||
|
||||
**The question to ask the client:** "If someone with Global Admin access right now deleted every Exchange Online mailbox and every SharePoint site, what is your recovery path, and how long does it take?"
|
||||
|
||||
If the answer involves the Microsoft recycle bin and "we would call Microsoft support," that is not a recovery plan. The recycle bin window is 14–93 days depending on the workload and configuration, and it does not protect against retention policy deletion or hard-delete operations by a malicious admin.
|
||||
|
||||
**2026 recommendation:** A third-party M365 backup solution covering Exchange Online, SharePoint Online, OneDrive for Business, and Teams is a baseline requirement for any client treating M365 as business-critical. The market is mature. Veeam, AvePoint, Acronis, and Dropsuite are the common options. Assess per client need.
|
||||
|
||||
---
|
||||
|
||||
### Configuration-as-code: export the control plane
|
||||
|
||||
Export CA policies, Intune baseline configurations, and Entra role assignments to code or structured files at the start of every engagement. This serves three purposes:
|
||||
1. Known-good baseline to detect drift and ghost configuration against
|
||||
2. Rebuild artifact for a compromised or corrupted tenant
|
||||
3. Change management — you can diff the configuration before and after every change
|
||||
|
||||
**CA policies:** Use CAExporter (`vibecoding/CAExporter`) to export all CA policies to JSON. Store in client's repository. Run the export again at the close of the engagement and diff against the opening export — changes are documented, not assumed.
|
||||
|
||||
**Intune:** The Graph API can export most Intune configuration; IntunePolicyParser assists with policy comprehension. Store the export.
|
||||
|
||||
**Entra roles:** Capture the current role assignment list (who holds what role, eligibility vs activation) as a document. This is your before-state for any privileged access engagement.
|
||||
|
||||
---
|
||||
|
||||
### Detection: eight signals that matter more than eight hundred that don't
|
||||
|
||||
Configure these eight before anything else. Each one represents a category of attack where silence is catastrophic:
|
||||
|
||||
| Signal | Where to configure | Why it cannot be noise |
|
||||
|--------|-------------------|----------------------|
|
||||
| Break-glass account sign-in (any use at all) | Entra audit logs → alert rule or Sentinel | An account that should never sign in has signed in |
|
||||
| New Global Admin assigned | Entra audit logs, `Add member to role` for GA role | Shadow admin creation |
|
||||
| DCSync from non-DC host | Microsoft Defender for Identity or Sentinel | On-prem AD credential harvest in progress |
|
||||
| Impossible-travel sign-in for admin accounts | Entra ID Protection > User risk alerts | Account takeover in flight |
|
||||
| External auto-forward rule created | Exchange audit logs | BEC persistence being established |
|
||||
| Mass download from SharePoint/OneDrive | Defender for Cloud Apps or Purview | Exfiltration in progress |
|
||||
| New OAuth consent grant to high-privilege scope | Entra audit logs, `Consent to application` | Illicit app consent attack |
|
||||
| Privileged role activation outside business hours | PIM alerts | Credential use at suspicious time |
|
||||
|
||||
Each of these should route to a named human who will respond within a defined SLA. Detection that fires into an unmonitored queue is theatre with a subscription cost.
|
||||
|
||||
---
|
||||
|
||||
### AD forest recovery: have the conversation
|
||||
|
||||
Ask the client: "Has anyone on your team ever run an AD forest recovery — not in a training lab, on a real forest?" The answer is almost universally no.
|
||||
|
||||
This is not a project you complete in an engagement — it is a finding and a recommendation. The finding: if AD is destroyed or corrupted (ransomware taking the DCs), recovery is a multi-day, expert-dependent process that nobody on this team has ever performed. The recommendation: run a tabletop of the procedure, identify the gaps in the runbook, and ensure the runbook is stored somewhere that survives the estate being dark (not in SharePoint, not in an AD-authenticated file share).
|
||||
|
||||
The minimum viable runbook should cover: authoritative DC restore sequence, metadata cleanup, double KRBTGT reset, trust rebuilds, and how the Entra side reconnects when on-prem is back.
|
||||
|
||||
---
|
||||
|
||||
### Break-glass: test it, don't just create it
|
||||
|
||||
Break-glass accounts exist in most tenants. They are tested in almost none. On every engagement:
|
||||
|
||||
1. Does the break-glass account exist? (Cloud-only, `.onmicrosoft.com`, not synced)
|
||||
2. Is it phishing-resistant? (FIDO2 key or certificate — not push-approve)
|
||||
3. Is it excluded from the CA policy that would otherwise block it?
|
||||
4. Does its use trigger an immediate alert? (If yes, verify the alert fires during the test — not just that the alert rule exists)
|
||||
5. Where are the credentials? (Not in the client's normal password manager that requires the same identity to access)
|
||||
6. When was it last signed in to? (Credential should be proven functional — test it)
|
||||
|
||||
The test is non-negotiable. An untested break-glass account is a belief, not a recovery path.
|
||||
|
||||
---
|
||||
|
||||
## What changed: 2025 → 2026
|
||||
|
||||
| Area | Prior state | 2026 position |
|
||||
|------|------------|---------------|
|
||||
| AD FS | Roadmap item for most clients | P0 conversation — tooling mature, no excuse |
|
||||
| Entra Cloud Sync | "For simple topologies" | Recommended default for new deployments |
|
||||
| dMSA | Newly released, cautiously recommended | Hold — published escalation research; use gMSA |
|
||||
| EPM | Available, optional | Table stakes for zero-standing-admin on endpoints |
|
||||
| Windows Autopatch | Optional | Default recommendation for update ring discipline |
|
||||
| CA ghost policy | Edge case, occasionally found | Documented pattern — test every policy as standard |
|
||||
| M365 native backup | "Microsoft covers it" (wrong but common) | Third-party backup framed as baseline, not option |
|
||||
| PIM activation MFA | Often push-approve | Must be phishing-resistant to count |
|
||||
| Windows LAPS | New, replacing legacy LAPS | Deployed as standard; legacy LAPS is tech debt |
|
||||
|
||||
---
|
||||
|
||||
## The governing question — carry it into every session
|
||||
|
||||
Before every finding, every recommendation, every conversation:
|
||||
|
||||
> **If this is owned tonight, what is the largest thing an attacker reaches before hitting a wall — and can I draw that wall?**
|
||||
|
||||
If the wall is missing or undrawn, you have found the work. Everything else is sequencing.
|
||||
|
||||
---
|
||||
|
||||
*Field Guide for the Antifragile Handbook. Updated June 2026. Review and update January 2027 — the honest uncertainty sections of the books define what will change.*
|
||||
@@ -0,0 +1,509 @@
|
||||
# Field Guide — Adversarial Validation
|
||||
|
||||
> *"It's a nice compliance dashboard you have here."*
|
||||
|
||||
**Last updated:** June 2026
|
||||
**Companion to:** [Field Guide — 2026 Edition](field-guide-2026.md) · Books I–VI
|
||||
**Engagement type:** Phase 2 — for clients who have done the foundational work
|
||||
**Checklist:** [Adversarial Validation Checklist](../assessment-templates/adversarial-validation-checklist.md)
|
||||
**Next review:** January 2027
|
||||
|
||||
---
|
||||
|
||||
## The premise
|
||||
|
||||
The client has MFA. They have Conditional Access. They have Intune. They have a SIEM. Their CIS score is in the seventies or eighties. Their audit passed. The dashboard is green.
|
||||
|
||||
This is the most dangerous estate to walk into — not because it is badly configured, but because everyone in the room believes it works. That belief is the fragility. Book I calls it directly: *"Green dashboards, untested reality — the most dangerous estate of all, because it feels safe."*
|
||||
|
||||
The foundational field guide tells you how to build controls. This engagement is about finding out which of the client's existing controls are real and which are representations — configurations that *display* correctly but *enforce* nothing, backups that exist but have never been restored, detection that fires into a queue nobody reads, attack paths to Domain Admin that nobody has mapped because the BloodHound licence expired.
|
||||
|
||||
**What you are doing in this engagement:** Systematically converting claimed security into observed security, domain by domain, and producing a structural change for every gap found. Not a pentest. Not a red team. A constructive adversarial validation — you are working with the client, with full authorization, with the explicit goal of finding what breaks before an attacker does.
|
||||
|
||||
**What you are not doing:** Adding more controls. This engagement deliberately does not recommend new tooling or new policies. If a control exists and does not work, the finding is that the control does not work — not that a different control is needed. Via negativa applies here too: the fragility is almost always that the existing controls have too many exceptions, too little monitoring, and have never been tested.
|
||||
|
||||
---
|
||||
|
||||
## Before you start
|
||||
|
||||
### Authorization scope
|
||||
|
||||
Before any test in this engagement, confirm written authorization covering:
|
||||
|
||||
- Simulating attacks against identity (Kerberoasting, DCSync simulation, PIM bypass attempts)
|
||||
- Triggering security alerts deliberately (break-glass sign-in, impossible-travel simulation, fake consent grant)
|
||||
- Testing compliance controls on managed devices (rooting a test device, forcing a non-compliant state)
|
||||
- Attempting data exfiltration through DLP and labeling controls (on test data, to controlled test destinations)
|
||||
- Restoring from backup in a test environment
|
||||
|
||||
Authorization is not "we told them verbally." It is a document signed by the named executive sponsor covering the scope of tests. Scope the authorization to the test accounts, test devices, and test data used — do not test on production privileged accounts or production data unless explicitly scoped.
|
||||
|
||||
### Baseline capture before anything changes
|
||||
|
||||
On day one, before any test or change:
|
||||
|
||||
1. Export all CA policies to JSON (CAExporter or Graph API). This is the declared state you will test against and the known-good you will compare the close-of-engagement state to.
|
||||
2. Run BloodHound and capture the full attack graph. The number of paths to Domain Admin at T+0 is your opening metric.
|
||||
3. Pull the Entra role assignment list — who holds what role, eligible vs. active.
|
||||
4. Pull the service principal inventory with their Graph permissions.
|
||||
5. Export Intune compliance and configuration policy assignments.
|
||||
6. Run `Get-ADUser krbtgt -Properties PasswordLastSet`, `Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet`, and document both.
|
||||
7. Count sign-in log distinct device IDs for the last 30 days. Compare to Intune enrolled device count. Record the gap.
|
||||
|
||||
These numbers are your before-state. Every structural change produced by this engagement is measured against them.
|
||||
|
||||
### The opening conversation
|
||||
|
||||
This engagement starts with a single question asked out loud, to the most senior technical person in the room:
|
||||
|
||||
> *"Can you show me one control in this estate that you are certain works — not because the portal says so, but because you have watched it fire under real conditions?"*
|
||||
|
||||
The answer tells you everything. A person who can point to a specific tested control on a specific date has a security programme. A person who gestures at the dashboard has a compliance programme. Both deserve good consulting — but they need different things.
|
||||
|
||||
---
|
||||
|
||||
## 1. Identity — proving the wall is real
|
||||
|
||||
### The firebreak claim
|
||||
|
||||
The client almost certainly claims that cloud privilege is separated from on-prem compromise. Test the claim, don't accept it.
|
||||
|
||||
**Draw the full graph, out loud:**
|
||||
Starting from Domain Admin (or a simulated compromise of the sync server), trace every path that reaches a cloud privileged role:
|
||||
- Are any GAs synced from on-prem? (They claim no — verify.)
|
||||
- Can the sync server connector account be used to tamper with cloud objects?
|
||||
- Do any admins use the same device for Tier 0 and cloud admin work?
|
||||
- Is there a PTA agent that could be compromised to intercept credentials?
|
||||
- Does any MFA for cloud admin rely on an authenticator app on a device that is also used for email? (The MFA device is Tier 2. The admin role is cloud Tier 0. That is a tier violation across the MFA layer.)
|
||||
|
||||
**Verify cloud-only GAs are actually cloud-only:**
|
||||
```powershell
|
||||
$gaRoleId = (Get-MgDirectoryRole -Filter "displayName eq 'Global Administrator'").Id
|
||||
Get-MgDirectoryRoleMember -DirectoryRoleId $gaRoleId |
|
||||
Select-Object @{N='UPN';E={$_.AdditionalProperties['userPrincipalName']}},
|
||||
@{N='OnPremSyncEnabled';E={$_.AdditionalProperties['onPremisesSyncEnabled']}}
|
||||
```
|
||||
`onPremisesSyncEnabled: true` on any GA is a P0 finding. "We moved them to cloud-only" is the claim; this is the verification.
|
||||
|
||||
**Test the break-glass is actually independent:**
|
||||
With the client present: sign in to the break-glass account. Does it succeed? Does an alert fire? Does the person named as the responder to that alert actually receive it and acknowledge it within the agreed SLA? An alert rule that exists but routes to an unmonitored inbox is a ghost detection.
|
||||
|
||||
### AD FS: is the token-signing key actually monitored?
|
||||
|
||||
If AD FS is still running (and in a "mature" estate it often is, "migration is on the roadmap"):
|
||||
|
||||
```powershell
|
||||
Get-AdfsCertificate -CertificateType Token-Signing |
|
||||
Select-Object Thumbprint, NotAfter, @{N='DaysSinceRotation';E={(Get-Date) - $_.Certificate.NotBefore | Select-Object -ExpandProperty Days}}
|
||||
```
|
||||
|
||||
Then ask: if an attacker obtained the private key for this certificate right now, what would you see in your logs? Walk through the scenario. In almost every case the honest answer is "nothing — a Golden SAML token is indistinguishable from a legitimate one." That is the finding. The migration is no longer a roadmap item.
|
||||
|
||||
### PIM: test the activation path, not the configuration
|
||||
|
||||
The client has PIM. But:
|
||||
|
||||
- **What MFA method is required on activation?** Navigate to PIM > Settings for Global Administrator role > Require MFA on activation. Then confirm the MFA method registered for each eligible GA. Push-approve MFA + PIM activation = phishable PIM. The control is not what it appears.
|
||||
- **Test an activation:** Have a test user with an eligible GA role activate it. Time the process. Observe: does the approval notification reach the approver? Does the approver know what they are approving, or does it arrive as a blind "approve this"? An approval workflow where approvers routinely click approve without context is not an approval workflow.
|
||||
- **Check for standing GA assignments that are supposed to be eligible-only.** `Get-MgDirectoryRoleMember` for GA — any user with no corresponding PIM eligible assignment has a permanent standing assignment that exists outside PIM, whether intentionally or by configuration drift.
|
||||
- **Check the maximum activation time box.** 24-hour activation windows are common in "we have PIM" deployments. An activation window that covers an entire working day is functionally standing privilege during business hours.
|
||||
|
||||
### The connector account as a canary
|
||||
|
||||
Reconfigure: any sign-in by the Entra connector account (Directory Synchronization Accounts role) from any host other than the sync server should fire an alert. Then test it: simulate a sign-in from an unexpected host. Does the alert fire? Does someone respond?
|
||||
|
||||
If the answer is "we have an alert rule," test it. "We have an alert rule" is a declaration. A firing alert reaching a responding human is an observation. The handbook's hardest rule applies here: verify by observation, never by inspection.
|
||||
|
||||
---
|
||||
|
||||
## 2. Privilege — attack paths the client has not mapped
|
||||
|
||||
### BloodHound as a metric, not a one-time scan
|
||||
|
||||
The client's mature estate almost certainly has attack paths to Domain Admin that nobody has counted since the last pentest, if ever. Run BloodHound, capture the full graph, and count:
|
||||
|
||||
- **Total paths to Domain Admin** (all principals)
|
||||
- **Paths reachable from standard user compromise** (the realistic starting point for a phishing attack)
|
||||
- **Paths involving Kerberoastable service accounts** specifically
|
||||
- **Paths involving ADCS** (add `-CollectionMethod ACL,ObjectProps,Trusts` to catch certificate-based escalation)
|
||||
|
||||
Present the number. Do not present it as "you have X findings." Present it as: *"From a single compromised standard user account, there are N independent routes to Domain Admin. Each route is a path through controls the attacker does not need to break because they route around them."* Then pick the three shortest paths and show them concretely.
|
||||
|
||||
This number is now a tracked metric. The engagement is not complete until it is going down.
|
||||
|
||||
### Kerberoast it — don't ask if it's possible
|
||||
|
||||
Run the attack:
|
||||
```powershell
|
||||
# Using Rubeus or Invoke-Kerberoast in an authorized test context
|
||||
Invoke-Kerberoast -OutputFormat Hashcat | Out-File kerberoast_hashes.txt
|
||||
```
|
||||
|
||||
The question is not "are there Kerberoastable accounts" (there are) — the question is: **did anything detect it?** A Kerberoast produces distinctive TGS request patterns. If Defender for Identity, Microsoft Sentinel, or any SIEM is watching, it should alert. If it does not, you have found a detection gap more important than the accounts themselves.
|
||||
|
||||
Then attempt to crack the hashes offline (with explicit authorization, on a controlled device). Report which accounts crack and in what time. Most clients are surprised. The service account from 2019 with the password that was "rotated" to `ServiceAcc0unt!2019` cracks in minutes.
|
||||
|
||||
### ADCS: the forgotten Tier 0 target
|
||||
|
||||
Run a basic ESC vulnerability enumeration:
|
||||
```
|
||||
certipy find -u <test-account>@domain.com -p <password> -dc-ip <DC-IP> -stdout
|
||||
```
|
||||
Or Certify if a Windows test host is more convenient:
|
||||
```
|
||||
Certify.exe find /vulnerable
|
||||
```
|
||||
|
||||
In a mature estate, the ADCS server has been running for years, was configured for a specific purpose in 2018, and has never been audited against the ESC series. ESC1 (supply subject in request + broad enrollment rights) in particular is common and catastrophic — it allows any enrolled user to obtain a certificate for any principal, including Domain Admins. Find it, show the exploit path, and document that the ADCS server is being treated as Tier 1 when it is Tier 0.
|
||||
|
||||
### Service principal dark matter
|
||||
|
||||
The client's mature estate has app registrations. Some of them have permissions that were granted for a reason that nobody in the room can explain. Find the escalation-grade ones:
|
||||
|
||||
```powershell
|
||||
# Application permissions (not delegated — these run without a user)
|
||||
$dangerousPermissions = @(
|
||||
"9e3f62cf-ca93-4989-b6ce-bf83c28f9fe8", # RoleManagement.ReadWrite.Directory
|
||||
"06b708a9-e830-4db3-a914-8e69da51d44f", # AppRoleAssignment.ReadWrite.All
|
||||
"1bfefb4e-e0b5-418b-a88f-73c46d2cc8e9", # Application.ReadWrite.All
|
||||
"19dbc75e-c2e2-444c-a770-ec69d8559fc7" # Directory.ReadWrite.All
|
||||
)
|
||||
|
||||
Get-MgServicePrincipal -All | ForEach-Object {
|
||||
$sp = $_
|
||||
Get-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $sp.Id |
|
||||
Where-Object { $_.AppRoleId -in $dangerousPermissions } |
|
||||
ForEach-Object {
|
||||
[PSCustomObject]@{
|
||||
ServicePrincipal = $sp.DisplayName
|
||||
Permission = $_.AppRoleId
|
||||
GrantedDate = $_.CreatedDateTime
|
||||
}
|
||||
}
|
||||
} | Sort-Object GrantedDate
|
||||
```
|
||||
|
||||
For each result: ask the room who created this app registration, what it does, and whether the permission is still needed. The answer to all three is usually "I don't know." That is the finding.
|
||||
|
||||
Then go further: check which of these service principals have non-expiring client secrets and which have never been used (check the sign-in logs for the service principal's `lastSignInDateTime`). A service principal that has not authenticated in 180 days with a never-expiring secret holding escalation-grade Graph permissions is a standing credential an attacker can use indefinitely without triggering a human sign-in.
|
||||
|
||||
### Standing privilege check: the PIM compliance gap
|
||||
|
||||
Ask for the full current list of active (not eligible) privileged role assignments. For each one:
|
||||
- Is it a break-glass account? If not, it should not be standing.
|
||||
- Is it a service account that cannot use PIM? Document and scope the managed-identity migration.
|
||||
- Is it an account someone added "temporarily" and forgot?
|
||||
|
||||
In most mature tenants, the list of active non-break-glass assignments is longer than anyone expects, because PIM was deployed and the existing standing assignments were not cleaned up at the time.
|
||||
|
||||
---
|
||||
|
||||
## 3. Devices — the compliance signal gap
|
||||
|
||||
### The ghost CA policy protocol
|
||||
|
||||
Apply this to every CA policy the client considers important (not every policy — prioritize the ones that block legacy auth, enforce device compliance, and gate privileged sign-in):
|
||||
|
||||
**Before testing any policy:**
|
||||
Write down the expected outcome: *"User [X], device [Y], from location [Z], accessing [App] → MUST be [blocked / MFA-prompted / compliant-device-required]."* Write this before looking at the policy configuration. This prevents rationalizing whatever you observe.
|
||||
|
||||
**The tests to run:**
|
||||
|
||||
1. **Legacy auth block:** Use a mail client that supports Basic Auth (older Outlook, curl with basic auth headers to Exchange Online) from a test account. Expected: blocked. If it succeeds, the CA policy that blocks legacy auth either has an exclusion, is in report-only, or is a ghost.
|
||||
|
||||
2. **Compliant device gate:** Sign in from a device that is known to be non-compliant (a personal device, or a managed device you have taken out of compliance by disabling BitLocker or removing an agent). Expected: blocked from sensitive workloads. If access is granted, either the CA policy is not evaluating correctly or the compliance signal is stale.
|
||||
|
||||
3. **Admin sign-in from non-PAW:** Attempt to activate a PIM role from a standard workstation or a personal device. Expected: blocked if there is a CA policy restricting admin access to compliant or named devices. If it succeeds, the PAW policy is a claim.
|
||||
|
||||
4. **The ghost test:** If any policy above fails to enforce despite its configuration appearing correct — recreate the policy from scratch with identical parameters. Re-test. If the recreated policy enforces and the original did not, you have found a ghost policy. Document the specific policy name, the discrepancy, the recreation, and the re-test result.
|
||||
|
||||
**Important:** Do not re-edit a failing policy to fix it. Recreate it. A ghost policy carries its corruption forward through edits.
|
||||
|
||||
### Compliance signal spoofing: measure the lag
|
||||
|
||||
Take a test enrolled device (a managed device you have authorization to modify):
|
||||
|
||||
1. Root/jailbreak it, or manually induce a non-compliant state (disable encryption, disable the screen lock, install a prohibited app — whatever the compliance policy checks).
|
||||
2. Record the timestamp.
|
||||
3. Watch Intune and Entra ID: when does the compliance state flip to non-compliant?
|
||||
4. When does Conditional Access revoke the session token?
|
||||
5. Is Continuous Access Evaluation (CAE) in place for the workloads that matter? If yes, token revocation should be near-real-time for supported apps. If no, the window is bounded by the token lifetime.
|
||||
|
||||
The gap between step 2 and step 4 is the attacker's window after compromising a compliant device. Present it in minutes, not as "the token may be stale." Most clients have never measured it.
|
||||
|
||||
### Reconcile the real fleet
|
||||
|
||||
Pull four numbers and compare them:
|
||||
|
||||
| Source | Count |
|
||||
|--------|-------|
|
||||
| Intune managed devices | |
|
||||
| Entra registered/joined devices | |
|
||||
| Distinct device IDs in sign-in logs (last 30 days) | |
|
||||
| Distinct device IDs signing in with "Device compliant: No" or "Device managed: No" | |
|
||||
|
||||
The gap between row 1+2 and row 3 is the shadow population. The number in row 4 is the unmanaged population actively accessing data. Neither of these are hypothetical risks — they are current, observable facts about who is accessing the tenant right now.
|
||||
|
||||
For every device in row 4: what data can it reach, and what Conditional Access policy, if any, applies to it?
|
||||
|
||||
### Legacy auth: find the surviving flows
|
||||
|
||||
Even with a "block legacy auth" CA policy in place, find the exceptions:
|
||||
|
||||
```
|
||||
Sign-in logs → Add filter → Client App → select all non-modern entries:
|
||||
Exchange ActiveSync
|
||||
Exchange Online PowerShell
|
||||
Exchange Web Services
|
||||
IMAP4
|
||||
MAPI Over HTTP
|
||||
Other clients
|
||||
POP3
|
||||
Reporting Web Services
|
||||
SMTP
|
||||
```
|
||||
|
||||
Export the results. Every entry is a legacy auth flow that either bypasses the CA policy (via an exclusion you should examine) or is a service account using a protocol that will break when the exclusion is removed. Build the map. The goal is zero — but the path to zero requires knowing what is currently there.
|
||||
|
||||
---
|
||||
|
||||
## 4. Data — does protection actually travel
|
||||
|
||||
### Exfiltrate a labelled document
|
||||
|
||||
With authorization, take a test document labelled at the highest sensitivity tier available (Highly Confidential, or equivalent):
|
||||
|
||||
1. Forward it as an email attachment to a personal test email address outside the tenant. Does DLP intercept it? Does the label encryption hold on the received document?
|
||||
2. Download it to an unmanaged device (one that is not Intune-enrolled). Open it. Does encryption require authentication to the tenant?
|
||||
3. Share it via an anonymous "Anyone with the link" URL (if anonymous sharing is still permitted). Access the link from a browser with no tenant authentication. Does it open?
|
||||
4. Copy and paste the content from the document into an unmanaged app (on a device where the MAM boundary applies). Does the block work?
|
||||
5. Open it in a browser through Conditional Access App Control session policy. Attempt to download. Does the block work?
|
||||
|
||||
Document which paths hold and which do not. The ones that do not hold are the exfiltration routes an attacker (or a careless employee) will actually use. Every failed block is a finding; the label configuration that passed in the policy screen is the ghost, and the exfiltrated file is the fact.
|
||||
|
||||
### Enumerate the anonymous link population
|
||||
|
||||
The tenant sharing setting may say "restricted." That setting controls new links. It does not remove existing ones. Run:
|
||||
|
||||
```powershell
|
||||
# PnP PowerShell — requires SiteCollection Admin on each site
|
||||
Get-PnPTenantSite | ForEach-Object {
|
||||
Connect-PnPOnline -Url $_.Url -Interactive
|
||||
Get-PnPSharingLinks | Where-Object { $_.SharingLinkType -eq "Anonymous" }
|
||||
} | Export-Csv anonymous_links.csv
|
||||
```
|
||||
|
||||
Present the count. In mature tenants, the anonymous link population predates the current tenant sharing settings by years. The setting was changed; the links were not revoked. Every entry is an active bearer token for data that predates the restriction.
|
||||
|
||||
### The BEC forward rule: simulate it
|
||||
|
||||
With a test account (not an executive, not a privileged account):
|
||||
|
||||
1. Create an Inbox rule forwarding all email to an external test address you control.
|
||||
2. Wait to see whether anything detects it and when.
|
||||
3. Check whether the global block on external auto-forwarding (`Get-RemoteDomain Default | Select-Object AutoForwardEnabled`) actually blocks this test rule from executing.
|
||||
4. Confirm: does the transport rule block the forwarding, or does the block only apply to Outlook/OWA auto-forwarding (not to manually-created Inbox rules)?
|
||||
|
||||
There is a documented distinction: the transport-level `AutoForwardEnabled: false` on Remote Domains blocks transport-rule-level forwarding and OWA Auto-Reply forwarding, but Inbox rules created in Outlook/OWA by the user may still forward depending on the specific configuration. Test this on the client's environment. Do not assume.
|
||||
|
||||
### Crown jewel access review
|
||||
|
||||
For the data sets the client has identified as crown jewels (if they have not identified them, that is the first finding — go back to basic engagement):
|
||||
|
||||
1. Pull the access list for the crown-jewel SharePoint sites and OneDrive locations.
|
||||
2. Pull the audit log for access events on those locations over the last 30 days.
|
||||
3. Identify: who accessed them, how frequently, from what devices?
|
||||
4. Find: any access from unmanaged devices. Any access from accounts that should not have visibility. Any bulk download events.
|
||||
5. Specifically check for guest access to the crown-jewel locations — guests whose project has concluded but whose access persists.
|
||||
|
||||
The audit log review is also a test of the audit infrastructure: can you produce a coherent forensic reconstruction of who accessed what, when, from where, over the last 30 days? If the answer is "we would need to run several different reports and correlate them manually," that is an incident response readiness finding.
|
||||
|
||||
---
|
||||
|
||||
## 5. Detection — does it fire, does anyone act
|
||||
|
||||
This section is the difference between robustness and antifragility. Everything before this is about whether controls hold. This section is about whether the organization learns when they do not.
|
||||
|
||||
### The eight simulations
|
||||
|
||||
For each of these, run the simulation with authorization, observe the outcome, and measure the time from event to human acknowledgment. The SLA the client believes they have is the declared state. The measured time is the observed state.
|
||||
|
||||
**Simulation 1 — Break-glass sign-in:**
|
||||
Sign in to the break-glass Global Admin account. This should trigger an immediate, high-priority alert routed to a named responder. Measure: how long from sign-in to human acknowledgment? If the answer is longer than 15 minutes, the break-glass is not monitored at the level it needs to be.
|
||||
|
||||
**Simulation 2 — New Global Admin assigned:**
|
||||
Assign GA to a test account. Observe: does an alert fire in Microsoft Sentinel, Microsoft Defender, or the configured SIEM? Who receives it? When? Revoke the assignment after the test.
|
||||
|
||||
**Simulation 3 — DCSync simulation:**
|
||||
From a non-DC host with a test account that has the relevant permissions (or using Mimikatz in an authorized test context), simulate a DCSync operation. Defender for Identity should alert on `Directory Services Replication`. Does it? Does the alert reach a human? Most mature clients have DfI deployed; fewer have confirmed the specific alert fires and routes correctly.
|
||||
|
||||
**Simulation 4 — Kerberoasting (detection, not just the attack):**
|
||||
Run the Kerberoast from section 2 again, now with the explicit goal of measuring detection. Did the TGS request pattern generate an alert? The attack was run earlier to find the vulnerable accounts; run it again now to find the detection gap.
|
||||
|
||||
**Simulation 5 — Impossible travel for an admin account:**
|
||||
Using a VPN exit node or a cloud VM in a geographically distant region, sign in as a test user who recently signed in from the client's location. Entra ID Protection should flag this as a risky sign-in. Does the user risk policy elevate the risk? Does a CA policy enforce remediation (MFA challenge or block)? Does an alert fire to the SOC? For admin accounts specifically, this should be a high-priority signal.
|
||||
|
||||
**Simulation 6 — External auto-forward rule:**
|
||||
From the data section — did anything alert when the test Inbox rule was created? If no detection fired during that test, that is a finding: BEC persistence can be established without triggering a single alert.
|
||||
|
||||
**Simulation 7 — Mass download from SharePoint:**
|
||||
With a test account that has access to a document library, download 50+ files in rapid succession. Does Defender for Cloud Apps or Microsoft Purview generate an unusual-download alert? Does anything block or throttle it?
|
||||
|
||||
**Simulation 8 — OAuth consent grant:**
|
||||
Register a test app requesting `Mail.Read` and `Files.ReadWrite.All` permissions. Grant it on behalf of a test user (simulating a user who clicks "Accept" on a consent prompt). Does anything alert on the grant event? Is user consent for this class of permission blocked by policy, or can users grant it freely?
|
||||
|
||||
### Alert fatigue: measure it honestly
|
||||
|
||||
Pull the alert volume from the last 30 days (from Sentinel, Defender XDR, or wherever alerts are collected). Calculate:
|
||||
|
||||
- Total alerts generated
|
||||
- Alerts closed as "true positive" with a documented response
|
||||
- Alerts closed as "false positive"
|
||||
- Alerts that have sat open for more than 48 hours
|
||||
- Alerts that were suppressed or auto-closed without human review
|
||||
|
||||
The ratio of responded-to versus everything else is the real detection efficacy rate. Most mature clients discover that their effective detection rate is single-digit percentages of generated alerts. Present the number; it is a more honest metric than "we have Sentinel."
|
||||
|
||||
### The structural change test
|
||||
|
||||
Pull the last five security incidents or alerts that resulted in a closed ticket. For each:
|
||||
|
||||
- What was the incident?
|
||||
- What was the response?
|
||||
- What structural change resulted — what was removed, severed, restricted, or reconfigured because of this incident?
|
||||
|
||||
If the answer to the third question is "we sent a reminder," "we noted it in the risk register," or "we trained the affected user" — the feedback loop is broken. Pain that closes a ticket without changing the architecture is wasted pain. Present the count of structural changes from the last five incidents. If it is zero, that is the most important finding in the report.
|
||||
|
||||
---
|
||||
|
||||
## 6. Recovery — is the exit ramp real
|
||||
|
||||
### Restore something
|
||||
|
||||
Before the engagement closes, restore a real dataset from backup. Not a test restore of a test file — a production dataset (authorized, scoped, non-disruptive) or the clearest approximation the client can authorize.
|
||||
|
||||
Time it. Record the actual MTTR. Compare it to the RTO written in the policy document.
|
||||
|
||||
If the actual MTTR is longer than the policy MTTR, the policy is fiction. Present the observed time as the finding. The goal is not to shame the recovery team — it is to replace a comfortable fiction with a useful truth.
|
||||
|
||||
**For M365 specifically:** Restore a mailbox or a SharePoint document library item from the third-party backup (if one exists). If no third-party backup exists in a mature estate, that is a P0 — it means the client has delegated recovery to Microsoft's recycle bin, which is not a backup posture.
|
||||
|
||||
### AD forest recovery readiness
|
||||
|
||||
Ask the client to produce their AD forest recovery runbook. Three things to verify:
|
||||
|
||||
1. **Is the runbook stored where it can be accessed when AD is down?** Not in SharePoint. Not in an AD-authenticated file share. Not in a password manager that authenticates against the domain. Paper, or a system outside the recovery domain, or both.
|
||||
2. **Has anyone ever run the procedure?** Not a tabletop — an actual restore, even in a lab. The first time you perform AD forest recovery must not be during the real disaster.
|
||||
3. **Does the runbook account for the double-KRBTGT rotation, metadata cleanup, and trust resets?** If it says "restore the DC from backup and you're done," it is incomplete.
|
||||
|
||||
If the answer to question 2 is no, scope a recovery rehearsal. This is the finding: the organization is one ransomware incident away from performing the hardest IT operation in existence for the first time, under maximum pressure, with incomplete runbooks.
|
||||
|
||||
### Configuration drift from the known-good
|
||||
|
||||
Compare the CA policy export from the beginning of this engagement against the current state. In any mature estate where CA policies are managed by multiple people without change control, there will be differences. For each difference:
|
||||
|
||||
- Was it intentional? Is there a change record?
|
||||
- Does the difference make the policy more or less restrictive?
|
||||
- If a policy was modified by someone without change authorization, how long ago and how would it have been detected?
|
||||
|
||||
The absence of a known-good baseline means the client cannot answer these questions. The presence of a known-good baseline and a diff is the beginning of drift detection. If the diff reveals changes made outside the change window or without documentation, that is a control failure independent of whether the change was malicious.
|
||||
|
||||
---
|
||||
|
||||
## The close
|
||||
|
||||
### What changes structurally
|
||||
|
||||
At the end of this engagement, for every finding that was verified by observation (not just inspected), produce a specific structural change:
|
||||
|
||||
| Finding type | Structural change target |
|
||||
|---|---|
|
||||
| Ghost CA policy found | Policy recreated, re-tested, documented |
|
||||
| PIM activation MFA is push-approve | Migration to phishing-resistant MFA scoped |
|
||||
| Kerberoasting not detected | Detection rule created, tested end-to-end |
|
||||
| Standing GA outside PIM | Account removed from role; break-glass confirmed working |
|
||||
| Anonymous links not revoked | Links enumerated and revoked; expiration policy applied |
|
||||
| BEC rule creation not detected | Exchange alert configured, tested |
|
||||
| Alert queue not triaged | Alert owner named, SLA defined, volume reduced |
|
||||
| Backup MTTR exceeds policy | Policy updated to observed time; rehearsal scheduled |
|
||||
|
||||
The engagement deliverable is not the report. The deliverable is the list of structural changes, plus the metrics: BloodHound path count before and after, standing privilege account count before and after, confirmed-working detection count, and measured MTTR.
|
||||
|
||||
### Metrics to deliver at close
|
||||
|
||||
| Metric | Before | After |
|
||||
|--------|--------|-------|
|
||||
| BloodHound paths to Domain Admin (from standard user) | | |
|
||||
| Standing (non-break-glass) Global Admin count | | |
|
||||
| Standing (non-break-glass) Domain Admin count | | |
|
||||
| CA policies verified to enforce by observation | | |
|
||||
| Detection signals tested end-to-end and confirmed working | | |
|
||||
| Anonymous link count (existing) | | |
|
||||
| Unmanaged devices in sign-in logs (% of total) | | |
|
||||
| Actual MTTR from backup restore drill | | |
|
||||
| Structural changes from last 5 incidents (before) | | |
|
||||
|
||||
These numbers are the honest alternative to a compliance score. None of them can be faked by clicking a toggle. All of them represent something an attacker either can or cannot do.
|
||||
|
||||
---
|
||||
|
||||
## 7. The leave-behind
|
||||
|
||||
The engagement ends. The admin has to operate the estate alone until the next engagement. This section is what you set up during the engagement so they can do that.
|
||||
|
||||
### The self-service cadence document
|
||||
|
||||
Every adversarial validation engagement closes with a filled-in [Self-Service Cadence](../assessment-templates/self-service-cadence.md) document, customized for the client. The template becomes their recurring runbook — monthly portal checks, quarterly tool runs, and a clear list of "call us if you see this" triggers.
|
||||
|
||||
Spend the last session of the engagement walking through the document with the named admin. Run the first quarterly check together, with them driving. The goal is not to hand over a PDF — it is to verify they can execute it without you in the room.
|
||||
|
||||
### Tools to leave installed and working
|
||||
|
||||
Before you leave, confirm these are installed and the admin has run each at least once:
|
||||
|
||||
| Tool | Confirm working | Leave-behind |
|
||||
|------|----------------|--------------|
|
||||
| PingCastle | Run a healthcheck scan, admin can read the output | HTML report from today as the baseline |
|
||||
| Purple Knight | Run a full scan, admin can read the indicators | PDF report from today as the baseline |
|
||||
| CAExporter | Exported today's CA policies, stored in agreed location | JSON files from today as the known-good |
|
||||
| Graph PowerShell module | Admin can connect and run the scripts in the cadence document | Scripts saved to the agreed local path |
|
||||
| PnP PowerShell | Admin can connect to SharePoint admin and run the anonymous link export | Confirmed connected during the session |
|
||||
|
||||
Do not leave a tool installed that the admin has never run. An unfamiliar tool is not a capability — it is a task that will not get done.
|
||||
|
||||
### The baseline numbers
|
||||
|
||||
At close of engagement, record the opening and closing metrics in the tracking spreadsheet you set up with the admin. These are the numbers their quarterly PingCastle and Purple Knight runs will be compared against. Without a baseline, a quarterly scan is a point in time with no direction — with a baseline, it tells a story.
|
||||
|
||||
| Metric | Value at close of engagement |
|
||||
|--------|------------------------------|
|
||||
| PingCastle score | |
|
||||
| Purple Knight: Critical indicators | |
|
||||
| BloodHound paths to DA (standard user) | |
|
||||
| Standing GA count (non-break-glass) | |
|
||||
| Anonymous link count | |
|
||||
| Stale guest count (90+ days inactive) | |
|
||||
| CA policies verified to enforce | |
|
||||
| Detection signals confirmed working | |
|
||||
|
||||
### "Call us" triggers — agree them explicitly
|
||||
|
||||
From the [cadence document](../assessment-templates/self-service-cadence.md), go through the trigger list out loud with the admin and confirm they understand each one. The list exists so they do not have to judge whether something is important enough to contact you — the bar is already defined.
|
||||
|
||||
The most important part of this conversation: *"When in doubt, contact us. We would rather look at a false alarm than hear about a real incident that sat for two weeks because you were not sure if it was worth mentioning."*
|
||||
|
||||
---
|
||||
|
||||
## What this engagement is not
|
||||
|
||||
**Not a red team.** The client knows you are here. You are working with them, not against them. When a simulation fires an alert, you tell the responder it is a test. The goal is to calibrate the detection, not to prove that you can evade it.
|
||||
|
||||
**Not a vulnerability scan.** You are not looking for unpatched CVEs or misconfigured services in bulk. You are validating the specific controls the client believes are in place.
|
||||
|
||||
**Not a compliance audit.** You will not produce a CIS score or a NIST gap report at the end. You will produce a list of controls that work and a list of controls that do not, measured by observation, with structural changes attached to the ones that do not.
|
||||
|
||||
**Not additive.** You are not recommending new tools, new policies, or new products. If something does not work, the fix is almost always to remove the exception, test the existing control, or eliminate the coupling — not to add a compensating control on top of the broken one.
|
||||
|
||||
---
|
||||
|
||||
*Field Guide — Adversarial Validation. Updated June 2026. Review alongside the main field guide — January 2027.*
|
||||
@@ -0,0 +1,133 @@
|
||||
# Quantum Vulnerability Management
|
||||
|
||||
> *"You do not have 40,000 critical vulnerabilities. You have ~400 that are real, ~40 that are on fire, and a process that cannot tell them apart. Quantum vulnerability management is the discipline of sizing remediation to the time you actually have — and of admitting that the unit of work was never the vulnerability. It was the path."*
|
||||
|
||||
This is the operating framework behind [Book VII — Vulnerability Management](../books/06-vulnerability-management.md). Book VII is the philosophy; this is the model a consultant runs in an engagement. It pairs with the [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md) (which sizes the quanta) and the [AI-Assisted TVM Blueprint](../playbooks/ai-assisted-tvm.md) (which automates the hours-lane).
|
||||
|
||||
---
|
||||
|
||||
## The problem in one paragraph
|
||||
|
||||
Time-to-exploit has collapsed to roughly **4 hours** while median remediation sits at **43 days**; CVE volume has gone past **59,000/year** and the public enrichment data (NVD) is degrading; and as of the **2026 Verizon DBIR, vulnerability exploitation is the #1 initial-access vector, roughly twice phishing.** A human-paced, CVSS-sorted patch programme cannot close a gap that runs the wrong way by two orders of magnitude. The answer is not "patch faster." It is to **stop using the vulnerability list as the unit of work**, size remediation into time-budgeted quanta, contain the few that matter in hours, make the rest not matter through architecture, and feed every exploited path back into a shorter kill chain.
|
||||
|
||||
---
|
||||
|
||||
## What a quantum is
|
||||
|
||||
A **quantum** is the smallest unit of remediation that:
|
||||
|
||||
1. **Fully closes a specific exploitable path** — not a CVE in the abstract, a path an adversary could actually walk.
|
||||
2. **Is sized to a time budget it can actually be completed within** — hours, days, or a sprint.
|
||||
3. **Ends in a verifiable signal** — a test that proves the path is closed, not a ticket marked done.
|
||||
|
||||
The word is chosen deliberately:
|
||||
|
||||
- **Atomic.** You cannot ship half a quantum and claim half the protection. A patch on 80% of the fleet, or a rule applied but never verified to block, is a *ghost patch* — fully exploitable and now invisible. A quantum is all-or-nothing.
|
||||
- **Discrete.** Work is packetised into units that fit the time available, not smeared across an infinite backlog. An undifferentiated backlog has no front; quanta give it one.
|
||||
|
||||
---
|
||||
|
||||
## The sort key: time-to-existential-impact
|
||||
|
||||
Quanta are ordered not by severity but by **time-to-existential-impact**, a function of three things the *environment* determines — not the CVE:
|
||||
|
||||
> **time-to-existential-impact = f( kill-chain position, reachability, exploit availability )**
|
||||
|
||||
| Factor | Question | Where it comes from |
|
||||
|--------|----------|---------------------|
|
||||
| **Kill-chain position** | Does this sit on a path to existential compromise? | [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md), BloodHound, the diagnostic |
|
||||
| **Reachability** | Can the adversary actually get to it (internet-facing, one hop from T0, behind segmentation)? | Network topology, external scan, [Perimeter Scanning](../playbooks/perimeter-scanning-capability.md) |
|
||||
| **Exploit availability** | Is there a working exploit in the wild now? | CISA KEV, exploit databases, threat intel |
|
||||
|
||||
The same CVE has a different quantum on different assets, because position, not severity, sets the clock. **A 9.8 on a segmented, unreachable, non-privileged host is a sprint quantum. A 7.5 on an internet-facing box one hop from a domain controller is an hours quantum.** This is the Book I principle — kill-chain position changes the priority, not the score — made operational.
|
||||
|
||||
---
|
||||
|
||||
## The four quanta
|
||||
|
||||
| Quantum | Time budget | What's in it | The response | Lane character |
|
||||
|---------|-------------|--------------|--------------|----------------|
|
||||
| **Critical** | **Hours** | On the kill chain, reachable, exploit available now | **Compensating control, not the patch** — sever reachability, edge-block, isolate, disable feature. Patch follows later. | Must be partly **autonomous**; human at policy boundary |
|
||||
| **Severe** | **Days** | Material risk; reachable with friction, or partial compensating cover | Batched, completed and verified inside one short change window | Human-run, tightly scheduled |
|
||||
| **Standard** | **Sprint** | The long, real, non-urgent tail | Drained in sprint-sized batches that can actually be finished; this is where patch velocity is the right tool | Routine engineering rhythm |
|
||||
| **Dark** | **Unsized** | Can't see the asset, can't establish reachability, can't determine exploitability | **Route to discovery** — turn an uncharacterised risk into a sized quantum | Discovery, not remediation |
|
||||
|
||||
### Why "compensating control, not the patch" for the critical quantum
|
||||
|
||||
You cannot meet an hours budget with a vendor patch cycle, and often the patch does not exist yet. So the critical quantum's job is **not to fix the vulnerability — it is to move the asset out of the hours-window** by the cheapest fast control available: cut the reachability, block at the edge, isolate the host, disable the vulnerable feature, pull it behind the WAF. A 4-hour time-to-impact becomes a non-urgent one, and the actual patch drops into the standard lane on the normal change calendar. Reachability is almost always faster to change than a patch is to ship — which makes **reachability the fastest remediation you own.**
|
||||
|
||||
### Why the dark quantum is the most dangerous
|
||||
|
||||
The old model ignores the dark quantum because it has no score. That is exactly backwards: an uncharacterised risk on an unknown asset is how estates die. A *known* severe is safer than an *unknown* nothing, because you can plan around the known one. The antifragile move is to spend judgement converting dark quanta into sized ones — which is why discovery (the [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md), [zero-budget discovery](../playbooks/zero-budget-vulnerability-discovery.md), osquery) is part of vulnerability management, not separate from it.
|
||||
|
||||
---
|
||||
|
||||
## The barbell: contain fast or architect away — never the fragile middle
|
||||
|
||||
```
|
||||
CHEAP / FAST / REVERSIBLE SLOW / STRUCTURAL / DURABLE
|
||||
Hours-lane compensating controls Segmentation, least privilege,
|
||||
(edge block, isolate, cut reachability) T0 protection, assume-breach
|
||||
── wins the time race the patch can't ── ── makes ~90% of vulns not matter ──
|
||||
◄────────────── THE FRAGILE MIDDLE TO AVOID ──────────────►
|
||||
The aging "critical patch backlog": carries hours-lane urgency,
|
||||
moves at sprint-lane speed. Max anxiety, min protection,
|
||||
and the attacker clears it for you one exploited host at a time.
|
||||
```
|
||||
|
||||
Both ends of the barbell are convex (small cost, large payoff — Pillar 5). The fragile middle is concave (maximum cost, minimum return). The rule: **contain it fast, or architect it away. Never let it age in the middle.**
|
||||
|
||||
---
|
||||
|
||||
## The ~90% subtraction — via negativa applied to the list
|
||||
|
||||
The single highest-leverage move, and it is pure subtraction. Industry data suggests **roughly 90% of "critical" vulnerabilities are not exploitable in a given environment** once compensating controls, reachability, and segmentation are mapped. So before adding any work:
|
||||
|
||||
1. Map, per asset: internet reachability, EDR coverage, WAF rules, segmentation distance from T0.
|
||||
2. Delete the false urgency on everything segmented, unreachable, or already neutralised.
|
||||
3. What remains — the genuinely reachable, genuinely exploitable ~10% — is the only thing the hours- and days-lanes ever touch.
|
||||
|
||||
This turns "40,000 criticals" into a few hundred real findings and a few dozen on fire. The compensating-control map that makes it possible is **the single most valuable artefact in the programme** — build it before the incident, because during a zero-day it answers "are we actually exposed?" in minutes instead of days. The caveat (Book I): a mapped control that has rotted into a ghost is a false negative. **Test the controls you are counting on; do not trust the map.**
|
||||
|
||||
---
|
||||
|
||||
## The feedback loop — the antifragile difference
|
||||
|
||||
A vulnerability that was exploited or nearly exploited is the cheapest penetration test you will ever get. Patching the CVE wastes the data. The antifragile move is to **sever the path** the attacker used — boundary the flat segment, collapse the over-privileged service account, pull the reachable management interface behind the bastion — so the *next* vulnerability that lands there is a non-event before it is even disclosed.
|
||||
|
||||
**The metric is not MTTR. It is: did the kill chain get shorter?** Ten incidents that produce ten patches and zero severed paths mean you are merely fast. Ten incidents that produce six structurally shortened kill chains mean the estate is getting harder to compromise every time it is tested — the only honest definition of antifragile.
|
||||
|
||||
---
|
||||
|
||||
## Running it in an engagement — the sequence
|
||||
|
||||
1. **Discover** — run the [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md) to map assets, reachability, and the shortest existential path. Anything you cannot characterise is a dark quantum; route it to deeper discovery.
|
||||
2. **Subtract** — apply the ~90% reduction using the compensating-control and reachability map. Delete false urgency.
|
||||
3. **Size** — place every remaining real finding into a quantum (critical / severe / standard) by time-to-existential-impact.
|
||||
4. **Contain the hours-lane** — apply compensating controls to the critical quantum *today*, autonomously where guardrails allow ([AI-Assisted TVM](../playbooks/ai-assisted-tvm.md)). Verify each closes with a signal.
|
||||
5. **Batch the rest** — days-lane in the next change window, sprint-lane in the engineering rhythm.
|
||||
6. **Architect away the middle** — feed the recurring paths into segmentation and least-privilege work (Books II–V) so the same class of vulnerability stops mattering.
|
||||
7. **Close the loop** — after every exploited-or-near finding, ask what path got shorter, and track that number over time.
|
||||
|
||||
---
|
||||
|
||||
## What to measure
|
||||
|
||||
| Metric | Why it matters | Antifragile target |
|
||||
|--------|----------------|--------------------|
|
||||
| Critical-quantum containment time | The hours-lane is the race you must not lose | Hours, trending down |
|
||||
| % of "criticals" confirmed reachable | Proves the ~90% subtraction is real, not assumed | Known, not "unknown" |
|
||||
| Ghost-patch rate (closed-but-unverified) | Half-done remediation is hidden full exposure | Zero — every quantum closes with a signal |
|
||||
| Dark-quantum count | Uncharacterised risk is the dangerous kind | Shrinking; each one converted to sized |
|
||||
| **Kill-chain length after incidents** | The only measure of getting *stronger* | Shorter after each exploited-or-near event |
|
||||
| Items aging in the fragile middle | The concave zone the barbell forbids | Zero — contained or architected, never aging |
|
||||
|
||||
---
|
||||
|
||||
## Honest uncertainty
|
||||
|
||||
The headline statistics (the 4-hour, 43-day, ~59,000-CVE, ~90%-not-exploitable, and "#1, ~2× phishing" figures) are point-in-time and churn annually — re-check them against the current DBIR, M-Trends, and FIRST/CVE data before putting them on a slide. The *direction* is the stable signal; the numbers move. The autonomous-execution tooling for the hours-lane is real but immature and fast-moving — verify current capability and failure modes, and start with reversible compensating controls, never irreversible change. What does not churn: kill-chain position beats CVSS, most criticals aren't reachable, a half-done remediation is a hidden full vulnerability, and every exploited path should shorten the chain.
|
||||
|
||||
---
|
||||
|
||||
*See [Book VII — Vulnerability Management](../books/06-vulnerability-management.md) for the full philosophy, [Kill Chain Assessment app](../playbooks/kill-chain-assessment-app.md) for sizing the quanta in unknown territory, and [AI-Assisted TVM Blueprint](../playbooks/ai-assisted-tvm.md) for automating the hours-lane.*
|
||||
@@ -42,6 +42,7 @@ Operational and persuasion documents used in engagements. **Start every new clie
|
||||
| [Antifragile Manifest](core/antifragile-manifest.md) | Five pillars of antifragile enterprise | Executives, Architects, Consultants |
|
||||
| [AI Sovereignty Framework](core/ai-sovereignty-framework.md) | Strategic arguments and implementation for local AI | CISOs, CTOs, Security Architects |
|
||||
| [T0 Asset Framework](core/t0-asset-framework.md) | Tier 0 classification and protection for critical assets | Security Architects, Infrastructure Leads |
|
||||
| [Quantum Vulnerability Management](core/quantum-vulnerability-management.md) | Sizing remediation into time-budgeted quanta (hours/days/sprint/dark) for the exploitation-first era; companion to Book VII | CISOs, Vulnerability Management, Consultants |
|
||||
| [Spontaneous Order Principles](core/spontaneous-order-principles.md) | Philosophical foundation for the five pillars | Executives, Architects, Strategists |
|
||||
|
||||
## Playbooks
|
||||
@@ -51,6 +52,7 @@ Operational and persuasion documents used in engagements. **Start every new clie
|
||||
| [Rapid Modernisation Plan](playbooks/rapid-modernisation-plan.md) | 30-60-90-180 day transformation roadmap | Program Managers, Consultants, CISOs |
|
||||
| [Endpoint Management Entry Vector](playbooks/endpoint-management-entry-vector.md) | Intune/device management as the ideal engagement entry point | M365 Consultants, Account Managers |
|
||||
| [AI-Assisted TVM Blueprint](playbooks/ai-assisted-tvm.md) | AI-powered vulnerability management for AI-powered adversaries | CTOs, CISOs, Vulnerability Management |
|
||||
| [Kill Chain Assessment App](playbooks/kill-chain-assessment-app.md) | Spec for the offline tool that maps unknown estates into an attack graph, computes the shortest existential path, and sizes quanta. Tool: [`tools/kill-chain-assessment.html`](tools/kill-chain-assessment.html) | Consultants, Assessors, Security Architects |
|
||||
| [Zero-Budget Vulnerability Discovery](playbooks/zero-budget-vulnerability-discovery.md) | Script-based and osquery-based server/container vuln discovery without Tenable/Qualys | Security Engineers, Consultants |
|
||||
| [Perimeter Scanning Capability](playbooks/perimeter-scanning-capability.md) | External attack surface strategy: build, partner, or hybrid | Security Architects, Consultants |
|
||||
| [Osquery: The Sovereign Discovery Platform](playbooks/osquery-custom-platform.md) | Build a custom vulnerability and asset inventory platform on osquery | Security Engineers, Consultants, CTOs |
|
||||
|
||||
@@ -0,0 +1,292 @@
|
||||
# Assignment: Conditional Access Architecture
|
||||
|
||||
> *CA policies are enforcement points, not audit tools. A policy in report-only mode is a sensor. A policy in enabled mode is a wall. Know which you're building before you start.*
|
||||
|
||||
This is a **scoped assignment package** — a complete, principled delivery guide for one specific client brief. It can be delivered standalone or immediately after [Assignment: Identity Baseline](assignment-identity-baseline.md). If identity baseline has not been completed, the prerequisites section below applies first.
|
||||
|
||||
---
|
||||
|
||||
## The Brief
|
||||
|
||||
Client requests that fall within this scope:
|
||||
|
||||
- *"Review our Conditional Access policies — we're not sure they're right"*
|
||||
- *"We need to enforce MFA properly, not just per-user MFA"*
|
||||
- *"Our auditor wants evidence of access controls"*
|
||||
- *"We got a new employee and nobody knows how access actually works"*
|
||||
- *"We bought E5 and want to use the CA features"*
|
||||
- *"We need compliant devices to be required for access"* (if Intune baseline is already deployed)
|
||||
|
||||
This assignment does not require executive sponsorship. It requires one named IT lead with Global Administrator access, tolerance for a 72-hour report-only period per policy before enforcement, and awareness that policy changes affect all users.
|
||||
|
||||
---
|
||||
|
||||
## Scope Boundary
|
||||
|
||||
**In scope:**
|
||||
- Audit of all existing CA policies (coverage, gaps, naming, exclusions, mode)
|
||||
- Design and documentation of a complete CA policy set
|
||||
- Staged deployment of the baseline policy set (identity-level controls)
|
||||
- Device compliance integration if Intune compliance policies are already active
|
||||
- Named locations configuration
|
||||
- Authentication strengths configuration (phishing-resistant MFA for admins)
|
||||
|
||||
**Out of scope:**
|
||||
- Intune compliance policy configuration → [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md)
|
||||
- Microsoft Defender for Cloud Apps session controls (app-enforced restrictions are in scope; MDCA-dependent session policies are not)
|
||||
- Privileged Identity Management configuration → privileged access engagement
|
||||
- Identity Baseline (MFA registration, legacy auth, admin account hygiene) → [Assignment: Identity Baseline](assignment-identity-baseline.md)
|
||||
|
||||
**Dependency:** This assignment can configure device compliance as a CA signal, but only if Intune compliance policies are already active and returning compliance state for enrolled devices. If Intune is not deployed, the device-compliance policies in this assignment are designed in report-only mode and left for activation when Intune is ready. Do not activate device-compliance CA policies against an environment where device enrollment is incomplete — the result is a broad lockout.
|
||||
|
||||
---
|
||||
|
||||
## Before You Touch Anything
|
||||
|
||||
**1. Break-glass confirmation.**
|
||||
Before touching any CA policy, confirm that two cloud-only break-glass Global Admin accounts exist and are excluded from all CA policies. If they do not exist, create them and configure sign-in alerts before proceeding. See [Assignment: Identity Baseline](assignment-identity-baseline.md) for the break-glass standard. This step is non-negotiable — a misconfigured CA policy with no break-glass is a full tenant lockout.
|
||||
|
||||
**2. CAExporter baseline.**
|
||||
Export all existing CA policies using [CAExporter](https://github.com/merill/caexporter). Store the JSON export as the before-state. Every change is measurable against it. This is also the rollback reference.
|
||||
|
||||
**3. Per-user MFA audit.**
|
||||
Run the per-user MFA state report (Entra admin center → Users → Per-user MFA). If per-user MFA is enabled for any accounts, document it. Per-user MFA and CA-enforced MFA operate on separate control planes and interact unpredictably: a user with per-user MFA *enforced* may bypass some CA policies. Resolution is part of Step 3 below.
|
||||
|
||||
**4. Sign-in log baseline.**
|
||||
Export 30 days of sign-in logs. Note the distribution of authentication methods in use, client application types (modern vs. legacy), and any conditional access results (success, failure, report-only). This is the baseline against which policy impact is measured.
|
||||
|
||||
---
|
||||
|
||||
## Principles Applied
|
||||
|
||||
**Automation over procedure.**
|
||||
A CA policy enforces MFA whether or not anyone remembers to ask for it. A checklist does not. Every identity control in this assignment is implemented as a CA policy — self-enforcing, continuous, requiring no human decision to operate after deployment.
|
||||
|
||||
**Kill chain first.**
|
||||
The policy set in this assignment is sequenced by structural impact. Legacy auth block and universal MFA enforcement come first because they close the widest attack path. Device compliance, location controls, and session policies come after. If the engagement ends early, the first two policies are the ones that matter.
|
||||
|
||||
**Explicit design, documented intent.**
|
||||
Every policy deployed in this assignment has a documented name, purpose, conditions, grant controls, exclusions, and the date it was set to enabled. A CA policy with no documented intent is a liability: nobody can safely modify it, nobody knows if it can be removed, and future administrators work around it rather than through it. The leave-behind package for this assignment is the policy design document — not just the JSON export.
|
||||
|
||||
**Report-only before enforcement.**
|
||||
Every new policy goes to report-only mode for a minimum of 48–72 hours. Sign-in logs are reviewed during that window to confirm expected behavior before enforcement. This is not optional. The cost of a production lockout — even for 30 minutes — is higher than the cost of 72 hours' delay.
|
||||
|
||||
---
|
||||
|
||||
## Delivery Architecture
|
||||
|
||||
### Step 1 — Audit (no changes)
|
||||
|
||||
Document the current state honestly. The finding is not a criticism of the IT team — it is the starting point.
|
||||
|
||||
| Action | Output |
|
||||
|--------|--------|
|
||||
| CAExporter export | CA policy baseline JSON and human-readable summary |
|
||||
| Per-user MFA state export | Accounts with per-user MFA enforced vs. disabled vs. not configured |
|
||||
| Policy coverage matrix | Every policy: name, state (enabled/report-only/disabled), conditions, grant, exclusions, last modified, named owner |
|
||||
| Gap analysis | Conditions with no coverage; duplicate coverage; exclusion lists with individual accounts |
|
||||
| Sign-in log review | Authentication methods in use; legacy auth clients; CA policy results |
|
||||
| Named locations inventory | Trusted IPs and named locations configured, if any |
|
||||
|
||||
Deliver the audit findings to the named client lead before writing any policies. The coverage matrix should be readable without technical background — each row is one policy, each column answers one question. Include a plain-language summary: "You have 14 policies. Three are disabled and appear forgotten. Two overlap in ways that may create gaps. Five have no named owner and no documented purpose. Legacy authentication is not blocked at the CA level."
|
||||
|
||||
---
|
||||
|
||||
### Step 2 — Design
|
||||
|
||||
Before deploying anything, produce the complete policy set design on paper (or in a document). Every policy defined, every exclusion justified, every interaction between policies mapped. Review with the named client lead before deployment begins.
|
||||
|
||||
The policy set is designed in three layers. Deploy them in order.
|
||||
|
||||
**Layer 1 — Identity controls (no device dependency)**
|
||||
These work immediately, without Intune or any device management. Deploy first.
|
||||
|
||||
**Layer 2 — Admin controls (elevated requirements for privileged roles)**
|
||||
Stricter controls applied specifically to accounts holding privileged roles. Deploy after Layer 1 is stable.
|
||||
|
||||
**Layer 3 — Device and session controls (Intune dependency)**
|
||||
Require device compliance as a CA signal. Deploy only when Intune compliance policies are active and returning results. Design these policies now; activate them when the Intune assignment is complete.
|
||||
|
||||
---
|
||||
|
||||
### Step 3 — Deploy Layer 1 (staged)
|
||||
|
||||
Each policy follows the same deployment sequence:
|
||||
1. Create policy in **report-only** mode
|
||||
2. Wait 48–72 hours; review sign-in logs for the policy's report-only results
|
||||
3. Identify any legitimate traffic that would be blocked; create exclusion groups or refine conditions
|
||||
4. Switch to **enabled**
|
||||
5. Monitor sign-in logs for 24 hours
|
||||
6. Only then move to the next policy
|
||||
|
||||
Do not deploy multiple policies simultaneously. Each policy change has independent blast radius; sequential deployment makes causality clear when something breaks.
|
||||
|
||||
**Legacy authentication block first.** This is the one control that cannot afford to be partially deployed. If legacy auth is blocked via CA but not via Entra authentication policies, a policy gap in CA can allow legacy auth through. Confirm after deployment that the sign-in log shows zero legacy auth sign-ins. Zero is the only acceptable result.
|
||||
|
||||
**Per-user MFA resolution.** After CA-enforced MFA is active for all users, disable per-user MFA for all accounts except break-glass. Leaving both active creates a split control plane. The CA policy is the authoritative control; per-user MFA is the legacy mechanism. They should not coexist once CA is stable.
|
||||
|
||||
---
|
||||
|
||||
## The Baseline Policy Set
|
||||
|
||||
This is the policy set to deploy on every engagement. Adapt scope and exclusions to the client's environment; do not adapt the design principles.
|
||||
|
||||
**Naming convention:**
|
||||
`CA-[Audience]-[Condition or Trigger]-[Grant or Block]`
|
||||
|
||||
Examples: `CA-AllUsers-LegacyAuth-Block`, `CA-Admins-AllApps-RequirePhishingResistantMFA`
|
||||
|
||||
Consistent naming is not aesthetic preference — it is the difference between a policy set that can be maintained and one that accumulates technical debt.
|
||||
|
||||
**Exclusion groups:**
|
||||
All exclusions use Entra ID security groups, never individual accounts (except break-glass, which is excluded by account). Group membership is reviewed as part of the leave-behind. A group named `CA-Exclusion-BreakGlass` is named and owned; an individual account exclusion is invisible in aggregate policy review.
|
||||
|
||||
---
|
||||
|
||||
### Layer 1 — Identity Controls
|
||||
|
||||
| Policy | Conditions | Grant / Block | Notes |
|
||||
|--------|-----------|---------------|-------|
|
||||
| `CA-AllUsers-LegacyAuth-Block` | All users / All cloud apps / Legacy auth clients (Exchange ActiveSync + Other clients) | Block | Deploy first. Confirm zero legacy auth in sign-in logs post-enforce. |
|
||||
| `CA-AllUsers-AllApps-RequireMFA` | All users / All cloud apps / All platforms / Exclude break-glass group | Require MFA | Core enforcement. Deploy second. Resolve per-user MFA conflict after this is stable. |
|
||||
| `CA-GuestUsers-AllApps-RequireMFA` | Guest and external users / All cloud apps | Require MFA | Separate policy: guests often require different exclusion handling. |
|
||||
|
||||
**E3 stops here for identity-layer controls.** Risk-based policies (sign-in risk, user risk) require Entra ID P2. If the client has P2 licensing, add:
|
||||
|
||||
| Policy | Conditions | Grant / Block | Notes |
|
||||
|--------|-----------|---------------|-------|
|
||||
| `CA-AllUsers-HighUserRisk-RequirePasswordChange` | All users / High user risk | Require MFA + password change | P2 required. Requires Identity Protection enabled. |
|
||||
| `CA-AllUsers-MedHighSignInRisk-RequireMFA` | All users / Medium and High sign-in risk | Require MFA | P2 required. Step-up for risky sign-ins. |
|
||||
|
||||
---
|
||||
|
||||
### Layer 2 — Admin Controls
|
||||
|
||||
| Policy | Conditions | Grant / Block | Notes |
|
||||
|--------|-----------|---------------|-------|
|
||||
| `CA-Admins-AllApps-RequirePhishingResistantMFA` | Directory roles (Global Admin, Privileged Role Admin, Security Admin, Exchange Admin, SharePoint Admin, User Admin, Conditional Access Admin, Application Admin) / All cloud apps | Require authentication strength: Phishing-resistant MFA | Phishing-resistant = FIDO2 security key, Windows Hello for Business, or certificate-based auth. Requires auth strength configured in Entra. Standard Authenticator push is not phishing-resistant. |
|
||||
| `CA-Admins-AllApps-RequireCompliantOrHybridDevice` | Same role scope / All cloud apps | Require compliant device OR hybrid Azure AD joined | Layer 3 control applied early to admins specifically. Activate this even before broad device compliance enforcement if Intune covers admin workstations. |
|
||||
|
||||
**Why admins get a separate, stricter policy set:** Admin credentials are the highest-value target in the tenant. An attacker who can bypass MFA on an admin account owns the tenant. Standard Authenticator push MFA is bypassed by MFA fatigue attacks (request flooding until the user approves). Phishing-resistant MFA is not. The separation in the policy set makes it explicit that admin accounts have a different requirement — and makes it auditable.
|
||||
|
||||
---
|
||||
|
||||
### Layer 3 — Device Controls (activate when Intune is ready)
|
||||
|
||||
Design these policies now. Activate them after [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md) is complete and device compliance results are stable.
|
||||
|
||||
| Policy | Conditions | Grant / Block | Notes |
|
||||
|--------|-----------|---------------|-------|
|
||||
| `CA-AllUsers-AllApps-RequireCompliantDevice` | All users / All cloud apps / All platforms | Require compliant device OR require MFA | Start with OR (compliant device OR MFA) — gives unmanaged-device users a path via MFA. Once enrollment is high enough, switch to AND or compliant-only. |
|
||||
| `CA-AllUsers-SensitiveApps-RequireCompliantDevice` | All users / Exchange Online + SharePoint Online / All platforms | Require compliant device | Strict. Apply to sensitive apps first before all apps. |
|
||||
| `CA-AllUsers-UnmanagedDevice-AppEnforcedRestrictions` | All users / Exchange Online + SharePoint Online / Any platform / Filter: not compliant, not hybrid-joined | Session: app-enforced restrictions (use limited web access) | Limits download and sync on unmanaged devices accessing mail and documents. Requires Exchange Online and SharePoint to be configured for app-enforced restrictions. E3-compatible. |
|
||||
|
||||
The `CA-AllUsers-UnmanagedDevice-AppEnforcedRestrictions` policy is the most immediately valuable Layer 3 control for E3 clients without full Intune enrollment — it degrades access rather than blocks it, which is easier to deploy without user disruption.
|
||||
|
||||
---
|
||||
|
||||
### Named Locations (supporting the policy set)
|
||||
|
||||
Configure named locations before deploying any location-based policies.
|
||||
|
||||
| Location | Purpose |
|
||||
|----------|---------|
|
||||
| **Trusted corporate networks** | Office IP ranges. Used to relax MFA requirements on trusted networks if the client explicitly requests it. Default recommendation: do not relax MFA on any network — trusted location is less durable than device compliance. |
|
||||
| **High-risk countries** (optional) | Countries from which the client has no operations and no expected sign-ins. Can be used to block access or require MFA as a step-up. Use carefully: VPN exit nodes and mobile roaming will trigger this. Document the decision. |
|
||||
|
||||
Named locations are often requested but rarely worth the operational overhead unless the client has a specific use case (blocking sign-ins from a defined list of countries, or relaxing physical office controls). Include in the design document; deploy only if the client has a clear requirement.
|
||||
|
||||
---
|
||||
|
||||
## Structural Resilience Checklist
|
||||
|
||||
Controls that hold without ongoing human willingness after this engagement closes.
|
||||
|
||||
- [ ] `CA-AllUsers-LegacyAuth-Block` is **enabled** — not report-only — and sign-in logs confirm zero legacy auth clients
|
||||
- [ ] `CA-AllUsers-AllApps-RequireMFA` is **enabled** and covers all users including guests (separate guest policy)
|
||||
- [ ] `CA-Admins-AllApps-RequirePhishingResistantMFA` is **enabled** and authentication strength is configured
|
||||
- [ ] Per-user MFA has been disabled for all accounts after CA-enforced MFA is stable (except break-glass)
|
||||
- [ ] All exclusions use named Entra ID groups — no individual account exclusions except break-glass
|
||||
- [ ] Every policy has a documented name, intent, owner, and date of last review
|
||||
- [ ] CAExporter export (before and after) stored in client documentation
|
||||
- [ ] Layer 3 policies exist in **report-only** mode, ready for activation when Intune is complete
|
||||
|
||||
---
|
||||
|
||||
## Kill Chain Contribution
|
||||
|
||||
**What this assignment closes:**
|
||||
|
||||
| Attack vector | Control deployed |
|
||||
|---------------|-----------------|
|
||||
| Password spray with no MFA prompt | `CA-AllUsers-AllApps-RequireMFA` |
|
||||
| MFA fatigue attack against admin accounts (push flooding) | `CA-Admins-AllApps-RequirePhishingResistantMFA` |
|
||||
| Legacy protocol abuse (SMTP AUTH, IMAP, Basic Auth REST) | `CA-AllUsers-LegacyAuth-Block` |
|
||||
| Credential stuffing from breached credential lists | MFA enforcement |
|
||||
| Guest account lateral movement through weakly controlled external access | `CA-GuestUsers-AllApps-RequireMFA` |
|
||||
| Unmanaged device access to sensitive apps (if Layer 3 activated) | `CA-AllUsers-UnmanagedDevice-AppEnforcedRestrictions` |
|
||||
|
||||
**What this assignment does not close:**
|
||||
|
||||
| Remaining gap | Addressed by |
|
||||
|---------------|-------------|
|
||||
| Adversary-in-the-middle / session token theft post-MFA | Device compliance in CA + Entra token protection (P2) |
|
||||
| Unmanaged device as unrestricted access vector | [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md) + Layer 3 activation |
|
||||
| Standing admin privilege (long-lived sessions, no JIT) | Privileged access engagement (PIM) |
|
||||
| Sign-in risk and impossible travel detection | Entra ID P2 Layer 1 additions |
|
||||
| App permission abuse (OAuth consent phishing) | Service identity engagement |
|
||||
|
||||
The residual gap the client is most likely to feel: a stolen session token (from phishing with AiTM proxy) bypasses MFA because it captures the token after MFA completes. This is the next-generation phishing technique. Mitigating it requires token binding to device compliance — a Layer 3 control — plus Entra token protection (P2 feature). Document this in the residual risk statement.
|
||||
|
||||
---
|
||||
|
||||
## Leave-Behind Package
|
||||
|
||||
| Artifact | Description |
|
||||
|----------|-------------|
|
||||
| **CAExporter JSON (before)** | CA policy state at engagement start |
|
||||
| **CAExporter JSON (after)** | CA policy state at engagement close |
|
||||
| **Policy design document** | Every deployed policy: name, intent, conditions, grant/block, exclusion groups, owner, date enabled |
|
||||
| **Policy coverage matrix** | Human-readable: which users are covered by which policies, which apps, which platforms |
|
||||
| **Per-user MFA resolution record** | Confirmation that per-user MFA has been disabled post-CA deployment |
|
||||
| **Layer 3 design document** | Device compliance policies designed but not yet activated; activation prerequisites and checklist |
|
||||
| **Exclusion group inventory** | Every CA exclusion group: name, members, review cadence |
|
||||
| **Sign-in log confirmation** | Legacy auth: zero clients post-block. MFA: applied to >99% of sign-ins. |
|
||||
| **Named locations documentation** | Any configured named locations with business justification |
|
||||
| **Scope boundary log** | Every finding outside this scope, named and prioritized |
|
||||
| **Residual risk statement** | What this assignment did not close, specifically including AiTM/token theft risk |
|
||||
|
||||
The Layer 3 design document is the explicit handoff to the Intune assignment. A CISO reading the leave-behind package can see exactly what was built, why, what it prevents, and what comes next — without needing to ask.
|
||||
|
||||
---
|
||||
|
||||
## Scope Boundary Signals
|
||||
|
||||
| Signal | Points toward |
|
||||
|--------|--------------|
|
||||
| No Intune enrollment or compliance policies active | Intune Security Baseline assignment — activate Layer 3 after |
|
||||
| Global Admins have no phishing-resistant MFA method registered | Auth method enrollment drive; may need hardware key procurement |
|
||||
| Entra ID P2 not licensed; client has credential-stuffing exposure | Licensing recommendation: P2 for Identity Protection (cheaper than full E5) |
|
||||
| App registrations with broad Graph permissions visible in sign-in logs | Service identity engagement |
|
||||
| Service accounts authenticating with CA policies applied | Service account remediation — service accounts should use managed identities or workload identity federation, not user-like credential flows through CA |
|
||||
| Defender for Cloud Apps not licensed; session control requests needed | MDCA engagement for full session control |
|
||||
| Sign-in logs show access from unexpected geographies | Named location policy review; may warrant country block |
|
||||
| Audit log retention < 90 days | Detection baseline assignment |
|
||||
|
||||
---
|
||||
|
||||
## Buildable-On: What the Next Assignment Depends On
|
||||
|
||||
The Intune Security Baseline assignment builds directly on the CA architecture deployed here. Specifically, it depends on:
|
||||
|
||||
1. **`CA-AllUsers-AllApps-RequireCompliantDevice` exists in report-only mode.** The Intune assignment activates this policy as its final step — the point where device compliance becomes an access control, not just a reporting tool.
|
||||
2. **CA exclusion groups are using the right naming convention.** Device compliance policies deployed in Intune reference the same user groups used in CA. Consistent group naming prevents the Intune assignment from having to clean up CA policy exclusions mid-deployment.
|
||||
3. **Sign-in logs show MFA is enforced.** The Intune assignment cannot safely activate device-compliance CA policies if MFA enforcement is incomplete — an unmanaged device could otherwise use the compliance check as a bypass path.
|
||||
|
||||
If all three conditions are true at handover, the Intune assignment can activate Layer 3 without revisiting the CA work. If any condition is false, the scope boundary log documents what needs to be resolved first.
|
||||
|
||||
---
|
||||
|
||||
*For the identity foundation this builds on, see [Assignment: Identity Baseline](assignment-identity-baseline.md).*
|
||||
*For the device compliance integration that activates Layer 3, see [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md).*
|
||||
*For the technical depth on privileged access architecture that informs admin CA requirements, see [Book III — Privileged Access](../books/02-privileged-access.md).*
|
||||
@@ -0,0 +1,443 @@
|
||||
# Assignment: Collaboration and Data Security
|
||||
|
||||
> *Data is liquid. It leaves where you put it — copied, shared, forwarded, synced, linked. The question is never "is it locked down" but "where can it flow, who can reshare it, and can you see and reverse the flow?"*
|
||||
|
||||
This is a **scoped assignment package** and the fourth in the M365 security sequence. It addresses the data and collaboration layer: how corporate data moves, where it leaks, and what structural controls reduce the blast radius when it does. It can be delivered standalone, but the device and identity controls from the preceding assignments are assumed in the residual risk analysis.
|
||||
|
||||
This assignment completes the **"Secure M365"** engagement when delivered after Identity Baseline, CA Architecture, and Intune Security Baseline.
|
||||
|
||||
---
|
||||
|
||||
## The Brief
|
||||
|
||||
Client requests that fall within this scope:
|
||||
|
||||
- *"Secure our M365 / harden our Exchange and SharePoint"*
|
||||
- *"We're worried about data leaking through email or shared links"*
|
||||
- *"We got a phishing email and want to prevent it"*
|
||||
- *"Our auditor wants to see DLP controls"*
|
||||
- *"We need email authentication — DMARC / DKIM / SPF"*
|
||||
- *"We need to know what's being shared externally"*
|
||||
- *"Set up sensitivity labels"*
|
||||
|
||||
This assignment does not require executive sponsorship. It requires one named IT lead with Global Administrator and Exchange Administrator access, tolerance for discovering that external sharing is significantly wider than assumed, and willingness to remove sharing types that users may push back on.
|
||||
|
||||
---
|
||||
|
||||
## Scope Boundary
|
||||
|
||||
**In scope:**
|
||||
- External sharing exposure mapping ("Anyone" links, external guests, external shares)
|
||||
- Removal of anonymous sharing and external auto-forwarding
|
||||
- Exchange Online Protection (EOP) hardening: anti-phishing, anti-malware, anti-spam
|
||||
- Email authentication: SPF verification, DKIM enablement, DMARC deployment
|
||||
- SharePoint and OneDrive tenant-level sharing governance
|
||||
- Guest access governance: expiration, review cadence
|
||||
- Sensitivity label taxonomy and deployment (foundation: 3–4 labels)
|
||||
- DLP baseline: 3–5 known high-value patterns for Exchange, SharePoint, OneDrive
|
||||
- Audit logging verification and configuration
|
||||
- App consent governance: restrict user consent, enable admin consent workflow
|
||||
|
||||
**Out of scope:**
|
||||
- Comprehensive data classification programme → separate Purview engagement
|
||||
- Defender for Office 365 P1/P2 advanced configuration (Safe Links, Safe Attachments, Attack Simulation) → E5 or add-on engagement
|
||||
- Microsoft Defender for Cloud Apps session controls → MDCA engagement
|
||||
- Retention policies and data lifecycle governance → separate Purview engagement
|
||||
- On-premises Exchange decommissioning → separate hybrid engagement
|
||||
- Cross-tenant access configuration (B2B direct connect) → out of scope unless specifically requested
|
||||
- Entitlement management and full guest lifecycle (P2 feature) → out of scope for E3
|
||||
|
||||
When the client asks for comprehensive DLP — covering all data types across all services — scope it as a separate engagement. A DLP programme that attempts to cover everything produces alert fatigue that degrades the protection for the things that actually matter.
|
||||
|
||||
---
|
||||
|
||||
## Before You Touch Anything
|
||||
|
||||
**1. Crown jewels question.**
|
||||
Before configuring any control, ask the named client lead one question: *"Which three data sets, if leaked, would cause the most harm to the organisation — regulatory, competitive, or reputational?"*
|
||||
|
||||
If they cannot answer, that inability is finding #1. You cannot apply protection asymmetrically until you know what the asymmetry is for. Sensitivity labels, DLP policies, and restricted-site configurations all depend on this answer. If the organisation genuinely cannot identify its crown jewels, document it and apply the default framework (financial data, HR data, and strategic/M&A communications) as a starting point.
|
||||
|
||||
**2. Surface map.**
|
||||
Before making any changes, enumerate the actual external exposure. The findings are almost always worse than the client assumes — and the enumeration itself, shared with the client lead, is often the moment that creates willingness for the removal steps that follow.
|
||||
|
||||
Run these reports before touching configuration:
|
||||
|
||||
| Report | Tool / Location |
|
||||
|--------|----------------|
|
||||
| "Anyone" (anonymous) links | SharePoint admin center → Reports → Sharing → or Graph API |
|
||||
| External shares (authenticated guest links) | SharePoint admin center → Sharing report |
|
||||
| Guest users with last sign-in date | Entra ID → External Identities → All users (filter: Guest) |
|
||||
| External auto-forwarding rules | Exchange admin center → Mail flow → Rules; or PowerShell: `Get-TransportRule` filtered for external redirect |
|
||||
| User-consented OAuth app grants | Entra ID → Enterprise applications → filter: User consent |
|
||||
| SPF, DKIM, DMARC status | MXToolbox or PowerShell DNS lookup per domain |
|
||||
| Unified Audit Log status | Compliance portal → Audit → or `Get-AdminAuditLogConfig` |
|
||||
|
||||
Deliver the surface map to the named client lead before proceeding to any removal steps. State the findings plainly: "You have 847 anonymous sharing links. Fourteen mailboxes have active external forwarding rules. You have 312 guest accounts, 189 of whom have not signed in within 90 days. DMARC is not configured. Your Unified Audit Log has not been enabled."
|
||||
|
||||
These are facts, not accusations. The client lead needs to see the actual exposure before approving the removal steps.
|
||||
|
||||
---
|
||||
|
||||
## Principles Applied
|
||||
|
||||
**Remove first, then govern.**
|
||||
The highest-impact actions in this assignment are removals: anonymous links, external auto-forwarding, over-permissioned OAuth grants. These are not governance gaps — they are open doors. No amount of sensitivity labelling or DLP configuration compensates for an anonymous sharing link that routes around every identity control built in the preceding three assignments. Subtraction comes first.
|
||||
|
||||
**Name the crown jewels before you protect them.**
|
||||
Even-spreading protection across all data is the concave failure: enormous maintenance cost, false positive noise that trains users to click through warnings, and the real exfiltration lost in the background. Sensitivity labels and DLP policies are applied to the crown jewels and known high-value patterns — not to everything. Three well-targeted DLP policies that fire reliably are worth more than thirty policies that nobody trusts.
|
||||
|
||||
**Visibility before governance.**
|
||||
The surface map is the most valuable deliverable in this assignment. An organisation that has never seen its "Anyone" link count, its guest list with last sign-in dates, or its auto-forward rule inventory cannot govern what it has. The surface map creates visibility; governance follows from it.
|
||||
|
||||
**Protection must travel with the data.**
|
||||
A sensitivity label with encryption is the only control that survives data leaving the tenant. Container controls — SharePoint permissions, CA policies, device compliance — stop working the moment the file is downloaded and forwarded. For the crown jewels, the protection must be bound to the file itself. Everything else is a gate on the way out, not a lock on the data.
|
||||
|
||||
---
|
||||
|
||||
## Delivery Architecture
|
||||
|
||||
### Step 1 — Surface Map (no changes)
|
||||
|
||||
*Described above in "Before You Touch Anything." Complete and deliver before proceeding.*
|
||||
|
||||
The surface map has a second purpose beyond informing the work: it is the before-state that makes the leave-behind measurable. "You had 847 anonymous links; you now have 0" is a concrete, auditable risk-reduction statement.
|
||||
|
||||
---
|
||||
|
||||
### Step 2 — Remove the Dangerous Paths
|
||||
|
||||
These actions have the highest impact per unit of effort in the entire assignment. They should be completed before any additive control is deployed.
|
||||
|
||||
**Kill anonymous "Anyone" links.**
|
||||
|
||||
Set the tenant-level sharing policy to prohibit new "Anyone" links:
|
||||
- SharePoint admin center → Policies → Sharing
|
||||
- External sharing: set to **New and existing guests** (requires authentication) — not "Anyone"
|
||||
- This stops new anonymous links from being created. It does not revoke existing links.
|
||||
|
||||
Existing anonymous links must be revoked separately. Use the SharePoint Sharing Report or a Graph API query to enumerate them, then decide with the client lead: bulk revoke all, or review and selectively revoke. Bulk revoke is correct for any link created more than 90 days ago with no documented business justification. Document the decision and the revocation count.
|
||||
|
||||
**Block external auto-forwarding.**
|
||||
|
||||
External auto-forwarding rules are the most reliable mailbox-compromise exfiltration technique. They should not exist.
|
||||
|
||||
- Exchange admin center → Mail flow → Remote domains → Default domain → Uncheck "Allow automatic forwarding"
|
||||
- Or via the outbound anti-spam policy: set automatic forwarding to **Off**
|
||||
- After disabling, audit existing rules: `Get-TransportRule | Where-Object { $_.RedirectMessageTo -like "*@*" }` and `Get-Mailbox -ResultSize Unlimited | Get-InboxRule | Where-Object { $_.ForwardTo -or $_.RedirectTo -like "*@*" }`
|
||||
|
||||
Any active external forwarding rule found during the audit is a potential incident indicator. Treat each one as suspicious until confirmed legitimate by the mailbox owner and the named client lead. Document the outcome for each.
|
||||
|
||||
**Restrict user OAuth consent.**
|
||||
|
||||
Users should not be able to grant arbitrary third-party applications access to tenant data.
|
||||
|
||||
- Entra ID → Enterprise applications → Consent and permissions → User consent settings
|
||||
- Set to: **Allow user consent for apps from verified publishers, for selected permissions (classified as low impact)** — or **Do not allow user consent** (more restrictive; requires admin approval workflow to compensate)
|
||||
- Enable the **Admin consent workflow**: users can submit a request; named admins receive and review it
|
||||
|
||||
Review existing user-consented grants. Flag any app with permissions in these categories:
|
||||
- `Mail.Read`, `Mail.ReadWrite`, `Mail.Send` — reads or sends all mail
|
||||
- `Files.ReadWrite.All`, `Sites.Read.All` — accesses all files and sites
|
||||
- `User.Read.All`, `Directory.Read.All` — reads full directory
|
||||
|
||||
High-permission user-consented grants should be reviewed with the named client lead and revoked where the app is not recognised, not actively used, or not from a verified publisher. Revoke through Entra ID → Enterprise applications → [App] → Permissions → Revoke user consent.
|
||||
|
||||
---
|
||||
|
||||
### Step 3 — Exchange Online Protection Baseline
|
||||
|
||||
EOP is included in E3 and M365 Business Premium. It handles anti-phishing, anti-malware, and anti-spam for Exchange Online. Default EOP configuration is functional but not optimal.
|
||||
|
||||
**Email authentication (SPF, DKIM, DMARC):**
|
||||
|
||||
| Protocol | What it does | Configuration |
|
||||
|----------|-------------|---------------|
|
||||
| **SPF** | Declares which servers may send email as your domain | DNS TXT record — verify it exists and is not over-broad (`+all` invalidates it) |
|
||||
| **DKIM** | Cryptographically signs outbound email | Enable in Exchange admin center → Email authentication → DKIM → Enable for each domain. Key rotation is handled automatically. |
|
||||
| **DMARC** | Specifies how receiving servers handle SPF/DKIM failures | DNS TXT record. Deploy in stages: `p=none` (monitoring) → verify no legitimate mail fails → `p=quarantine` → eventually `p=reject`. Minimum target for this assignment: `p=quarantine` after 30-day monitoring period shows no legitimate mail failing. |
|
||||
|
||||
Without DMARC, your domain can be spoofed in inbound email to your users and in outbound email to others. SPF and DKIM without DMARC do not enforce — DMARC is the enforcement record.
|
||||
|
||||
**Anti-phishing policy (EOP):**
|
||||
|
||||
- Exchange admin center → Policies & rules → Threat policies → Anti-phishing
|
||||
- Enable impersonation protection for: the organisation's own domain(s), key users (CEO, CFO, board members, finance team)
|
||||
- Enable mailbox intelligence (learning sender patterns)
|
||||
- Set action for impersonation detections: **Quarantine** (not move to Junk — quarantine is reviewed; Junk is ignored)
|
||||
|
||||
If the client has Defender for Office 365 P1 (included in M365 Business Premium or as an add-on): enable Safe Links and Safe Attachments. These are materially more effective than EOP baseline anti-phishing. Note the gap if E3 without the add-on.
|
||||
|
||||
**Anti-malware policy:**
|
||||
|
||||
- Threat policies → Anti-malware
|
||||
- Enable common attachment filter: block executable file types (.exe, .vbs, .js, .ps1, .bat, .cmd and others)
|
||||
- Zero-hour auto purge (ZAP): ensure it is enabled — retroactively quarantines malware found after delivery
|
||||
- Admin notifications: notify security team on malware detection
|
||||
|
||||
**Anti-spam policy:**
|
||||
|
||||
- Threat policies → Anti-spam
|
||||
- Bulk complaint level threshold: set to 6 (aggressive; default is 7)
|
||||
- Enable outbound spam notifications: alert the security team when a mailbox is detected sending spam (indicator of compromise)
|
||||
- Verify SPF hard fail is evaluated
|
||||
|
||||
---
|
||||
|
||||
### Step 4 — Sharing Governance
|
||||
|
||||
Sharing governance operates at multiple levels in M365. The tenant setting is the ceiling — per-site can be more restrictive but never more permissive than the tenant setting.
|
||||
|
||||
**Tenant-level settings (SharePoint admin center → Policies → Sharing):**
|
||||
|
||||
| Setting | Target value | Notes |
|
||||
|---------|-------------|-------|
|
||||
| External sharing — SharePoint | New and existing guests | Requires guest authentication. "Anyone" was removed in Step 2. |
|
||||
| External sharing — OneDrive | New and existing guests | Match SharePoint setting or more restrictive. |
|
||||
| Require guests to sign in using the same account | Yes | Prevents link forwarding to a different account. |
|
||||
| Allow guests to share items they don't own | No | Prevents reshare chain from escaping first-hop control. |
|
||||
| Guest access expiration | 30 days (or per organisation policy) | Guests must be reviewed and re-invited; standing access expires. |
|
||||
| Link permissions default | View | Least privilege; users explicitly upgrade if edit is needed. |
|
||||
| Link expiry (new and existing guest links) | 30 days | Prevents permanent link accumulation. |
|
||||
|
||||
**Per-site controls — crown jewel sites:**
|
||||
For sites identified in the crown jewels question (Step 1 of "Before You Touch Anything"):
|
||||
- Set external sharing to **Only people in your organization**
|
||||
- Remove broad internal permissions ("Everyone except external users", "All company")
|
||||
- Document the named owners of the site and the access review schedule
|
||||
|
||||
Internal oversharing is often overlooked: a finance site accessible to "All company" means any compromised internal account reaches the financial data. Restrict sensitive sites to named groups with specific membership.
|
||||
|
||||
---
|
||||
|
||||
### Step 5 — Guest Governance
|
||||
|
||||
Guest accounts are standing external blast radius. Every guest that has not been reviewed is an unknown with access to unknown data.
|
||||
|
||||
**Immediate actions:**
|
||||
|
||||
1. **Export the guest list with last sign-in date.** In Entra ID → Users → filter by User type: Guest. Export to CSV. Sort by last sign-in date.
|
||||
2. **Flag for removal:** guests who have not signed in within 90 days and have no active project sponsorship. Present the list to the named client lead for approval before removing.
|
||||
3. **Remove approved stale guests.** Document the count.
|
||||
|
||||
**Ongoing governance (configure before handover):**
|
||||
|
||||
| Control | Configuration |
|
||||
|---------|--------------|
|
||||
| Guest invitation restrictions | Restrict to Entra ID admins only (not all users can invite guests) |
|
||||
| Guest access expiration | Configure in Entra ID → External Identities → External collaboration settings: Guest user access expires after 180 days unless reviewed |
|
||||
| Access reviews | Entra ID → Identity Governance → Access reviews — create a quarterly review for all guests. Reviewer: IT lead or line-of-business owner. Action on no response: remove access. |
|
||||
|
||||
Access reviews require Entra ID P2 for full automation. For E3, a manual quarterly review using the Entra guest export is the alternative — document the cadence in the leave-behind and assign an owner.
|
||||
|
||||
---
|
||||
|
||||
### Step 6 — Sensitivity Labels Foundation
|
||||
|
||||
Sensitivity labels are the mechanism that makes protection travel with the data. A labelled document carries its permissions wherever it goes — downloaded, emailed, shared externally.
|
||||
|
||||
**Label taxonomy — baseline (4 labels):**
|
||||
|
||||
| Label | Meaning | Default protection |
|
||||
|-------|---------|-------------------|
|
||||
| **Public** | Intended for external distribution | No restrictions |
|
||||
| **Internal** | Default for internal business content | No external sharing by default |
|
||||
| **Confidential** | Business-sensitive; restricted distribution | Encrypt; restrict to organisation members; no external forwarding |
|
||||
| **Highly Confidential** | Crown jewels: financial, legal, M&A, HR | Encrypt; restrict to named group; no download on unmanaged device; watermark |
|
||||
|
||||
Keep the taxonomy to four labels. More labels increase classification fatigue and reduce the percentage of content that gets labelled at all. A four-label taxonomy that users understand and apply is worth more than a twelve-label taxonomy that nobody uses.
|
||||
|
||||
**Deployment:**
|
||||
|
||||
1. Create labels in Microsoft Purview compliance portal → Information protection → Labels
|
||||
2. Publish labels to all users via a label policy
|
||||
3. Configure auto-labelling for the Highly Confidential label: define content patterns (e.g., project name, internal designation) that trigger auto-labelling in SharePoint and OneDrive
|
||||
4. Set the default label for SharePoint sites identified as crown jewel sites: Confidential
|
||||
|
||||
**For Highly Confidential — encryption configuration:**
|
||||
- Rights Management encryption: Only organisation members can open; no external forwarding; no printing
|
||||
- Apply to: the named crown-jewel sites and document libraries
|
||||
|
||||
The label is the escape hatch. A Highly Confidential document downloaded to an unmanaged device and forwarded externally is still encrypted — the attacker has ciphertext, not data. This is the only control in this assignment that holds after data leaves the tenant.
|
||||
|
||||
---
|
||||
|
||||
### Step 7 — DLP Baseline
|
||||
|
||||
DLP policies intercept known sensitive information patterns transiting Exchange, SharePoint, and OneDrive. Deploy DLP as a scalpel: 3–5 specific, high-confidence patterns. Do not attempt comprehensive coverage.
|
||||
|
||||
**Target patterns for most organisations:**
|
||||
|
||||
| Policy | Pattern | Initial action |
|
||||
|--------|---------|---------------|
|
||||
| Payment card data | Credit card numbers (PCI scope) | Policy tip to user + admin alert |
|
||||
| National identity numbers | National ID / tax number format for the client's jurisdiction | Policy tip to user |
|
||||
| Crown jewel content | Sensitivity label: Highly Confidential (label-based DLP) | Block external sharing + admin alert |
|
||||
| External forwarding with attachments | Email to external recipients with attachments > threshold | Notify user |
|
||||
|
||||
Start every DLP policy in **simulation mode** (test/audit) before enforcement. Review DLP activity reports after 48 hours of simulation. Identify false positives. Tune the policy. Then enable with **notify only** before moving to **block**.
|
||||
|
||||
The sequence: simulation → notify → block. Never skip the simulation and notify stages.
|
||||
|
||||
**What E3 DLP covers:** Exchange Online, SharePoint Online, OneDrive for Business. It does not cover Teams messages (requires Purview add-on) or endpoint DLP (requires Purview or E5 compliance).
|
||||
|
||||
Note the gaps in the residual risk statement: DLP at this scope does not cover Teams conversations or files shared through channels. If Teams is a primary working environment for crown-jewel content, document this as a gap pointing toward a Purview engagement.
|
||||
|
||||
---
|
||||
|
||||
### Step 8 — Audit Logging
|
||||
|
||||
Audit logging is the foundation of any post-incident forensics capability. If it is not enabled, every breach investigation starts with nothing.
|
||||
|
||||
**Unified Audit Log:**
|
||||
|
||||
```powershell
|
||||
# Verify status
|
||||
Get-AdminAuditLogConfig | Select-Object UnifiedAuditLogIngestionEnabled
|
||||
|
||||
# Enable if false
|
||||
Set-AdminAuditLogConfig -UnifiedAuditLogIngestionEnabled $true
|
||||
```
|
||||
|
||||
E3 default retention: 90 days. Verify actual retention in the Compliance portal → Audit. If the client has regulatory requirements for longer retention (NIS2, DORA, banking regulations typically require 1 year minimum), document the gap. The E3 upgrade path is the Audit (Premium) add-on or E5 compliance.
|
||||
|
||||
**Mailbox audit logging:**
|
||||
|
||||
```powershell
|
||||
Get-Mailbox -ResultSize Unlimited |
|
||||
Where-Object {$_.AuditEnabled -eq $false} |
|
||||
Set-Mailbox -AuditEnabled $true
|
||||
```
|
||||
|
||||
Verify that key mailbox audit operations are captured: MailboxLogin, SendAs, SendOnBehalf, HardDelete, FolderBind.
|
||||
|
||||
**Critical audit events to verify are captured:**
|
||||
|
||||
| Event category | Why it matters |
|
||||
|---------------|---------------|
|
||||
| File and page activities | Accessed, downloaded, shared — the data exfiltration footprint |
|
||||
| Sharing and access request activities | External shares created; guest invitations sent |
|
||||
| Synchronization activities | Files synced to devices (OneDrive sync client) |
|
||||
| Exchange admin activities | Transport rule creation/modification; external forwarding |
|
||||
| Azure AD sign-in events | Anomalous sign-ins, MFA failures, conditional access decisions |
|
||||
| DLP rule matches | Evidence that DLP policies are firing |
|
||||
|
||||
---
|
||||
|
||||
## Structural Resilience Checklist
|
||||
|
||||
Controls that hold without ongoing human willingness after this engagement closes.
|
||||
|
||||
- [ ] Anonymous sharing blocked at tenant level — confirmed by SharePoint sharing settings
|
||||
- [ ] Existing anonymous links revoked — count documented
|
||||
- [ ] External auto-forwarding blocked at tenant level — confirmed by transport rule and outbound spam policy
|
||||
- [ ] Active external forwarding rules reviewed and removed
|
||||
- [ ] DKIM enabled for all domains
|
||||
- [ ] DMARC deployed at minimum `p=quarantine` after monitoring period
|
||||
- [ ] User OAuth consent restricted — admin consent workflow active
|
||||
- [ ] High-permission user-consented OAuth grants reviewed
|
||||
- [ ] Guest expiration configured — new guests expire by default
|
||||
- [ ] Stale guests removed (90+ days inactive, no active sponsorship)
|
||||
- [ ] Guest access review cadence documented with named owner
|
||||
- [ ] Sensitivity labels published to all users — Highly Confidential label with encryption
|
||||
- [ ] DLP baseline policies active (post-simulation and notify stages) — not in simulation only
|
||||
- [ ] Unified Audit Log enabled
|
||||
- [ ] Mailbox audit logging enabled for all mailboxes
|
||||
|
||||
---
|
||||
|
||||
## Kill Chain Contribution
|
||||
|
||||
**What this assignment closes:**
|
||||
|
||||
| Attack vector | Control deployed |
|
||||
|---------------|-----------------|
|
||||
| Data exfiltration via anonymous link (bypasses all identity controls) | Anonymous link prohibition + existing link revocation |
|
||||
| Business email compromise via mailbox forwarding rule | External auto-forwarding block + rule audit |
|
||||
| OAuth consent phishing (malicious app requesting mail/file access) | User consent restriction + high-permission grant review |
|
||||
| Domain spoofing (impersonation of the client's domain in email) | DMARC `p=quarantine` |
|
||||
| Phishing email impersonating known users or domain | Anti-phishing impersonation protection |
|
||||
| Crown-jewel document leaking outside the tenant | Sensitivity label encryption (Highly Confidential) — protection travels with file |
|
||||
| Known sensitive data patterns transiting email or SharePoint | DLP baseline policies |
|
||||
| Stale guest accounts as standing external foothold | Guest expiration + stale guest removal |
|
||||
|
||||
**What this assignment does not close:**
|
||||
|
||||
| Remaining gap | Addressed by |
|
||||
|---------------|-------------|
|
||||
| Advanced phishing: Safe Links, Safe Attachments | Defender for Office 365 P1 (E5 or add-on) |
|
||||
| Teams message DLP | Purview compliance add-on |
|
||||
| Endpoint DLP (data leaving via USB, local app) | Purview E5 compliance or endpoint DLP engagement |
|
||||
| Full data lifecycle governance (retention, disposal) | Purview engagement |
|
||||
| MDCA session controls (block download from browser on unmanaged device) | MDCA engagement |
|
||||
| Full guest lifecycle management (access packages, entitlement) | Entra ID Governance (P2) engagement |
|
||||
| Residual data on unmanaged/BYOD devices | App Protection Policies (Intune assignment) |
|
||||
|
||||
---
|
||||
|
||||
## Leave-Behind Package
|
||||
|
||||
| Artifact | Description |
|
||||
|----------|-------------|
|
||||
| **Surface map report** | Before-state: "Anyone" link count, external shares, guest list with last sign-in, forwarding rules found, OAuth grant inventory, SPF/DKIM/DMARC status |
|
||||
| **Anonymous link revocation record** | Links revoked: count, method, date |
|
||||
| **External forwarding rule audit** | Rules found, disposition of each (removed / confirmed legitimate / flagged as suspicious) |
|
||||
| **OAuth grant review record** | Grants reviewed, grants revoked, grants retained with justification |
|
||||
| **EOP policy documentation** | Anti-phishing, anti-malware, anti-spam settings with rationale |
|
||||
| **DMARC monitoring report** | DMARC aggregate reports at `p=none` before moving to `p=quarantine`; confirmation of quarantine deployment |
|
||||
| **Sharing governance configuration** | Tenant sharing settings, crown-jewel site configurations |
|
||||
| **Guest governance documentation** | Expiration settings, access review configuration, stale guest removal count, review cadence with named owner |
|
||||
| **Sensitivity label documentation** | Label taxonomy, label policy, encryption configuration for Highly Confidential |
|
||||
| **DLP policy documentation** | Each policy: target pattern, scope, actions, simulation results before enforcement |
|
||||
| **Audit logging confirmation** | Unified Audit Log status, retention period, mailbox audit status |
|
||||
| **Scope boundary log** | Every finding outside this scope, named and prioritized |
|
||||
| **Residual risk statement** | What this assignment did not close: Teams DLP gap, endpoint exfil path, advanced phishing gap, guest lifecycle limitations |
|
||||
|
||||
---
|
||||
|
||||
## Scope Boundary Signals
|
||||
|
||||
| Signal | Points toward |
|
||||
|--------|--------------|
|
||||
| Significant Teams usage for crown-jewel content; Teams DLP not covered | Purview compliance engagement |
|
||||
| No independent M365 backup — Microsoft recycle bin only | Recovery and detection engagement (Book VI) |
|
||||
| Audit log retention < regulatory requirement | Audit (Premium) add-on; or compliance-driven M365 upgrade |
|
||||
| On-premises Exchange still in the estate | Hybrid Exchange engagement — decommissioning path |
|
||||
| Advanced phishing; no Defender for Office 365 P1 | E5 / MDO add-on evaluation |
|
||||
| High volume of user-consented high-permission OAuth apps | Entitlement management engagement |
|
||||
| Crown-jewel data accessible to broad internal groups | Information architecture engagement (governance, IA, Purview classification) |
|
||||
| No independent M365 backup | Recovery and detection engagement |
|
||||
| No incident response plan | IR planning engagement |
|
||||
|
||||
---
|
||||
|
||||
## Completing the "Secure M365" Engagement
|
||||
|
||||
When all four assignments are delivered, the client has:
|
||||
|
||||
**Identity Baseline** — MFA enforced for all users and phishing-resistant MFA for admins. Legacy authentication blocked at the tenant level. Break-glass accounts established and monitored. Admin accounts separated and audited.
|
||||
|
||||
**CA Architecture** — A named, documented, principled CA policy set. Layer 1 (identity) and Layer 2 (admin elevation) enforced. Layer 3 (device compliance) activated following the Intune assignment. Per-user MFA conflict resolved.
|
||||
|
||||
**Intune Security Baseline** — Device compliance policies returning results for the enrolled fleet. Compliant device required for M365 access (CA Layer 3 active). BitLocker, patch compliance, and LAPS deployed. Update rings with canary. App Protection Policies for BYOD. The real device population is mapped and documented.
|
||||
|
||||
**Collaboration and Data Security** — Anonymous links removed. External auto-forwarding blocked. Email authentication at DMARC quarantine. External sharing governed. Stale guests removed. Sensitivity labels deployed with crown-jewel encryption. DLP baseline active for known high-value patterns. Audit logging enabled.
|
||||
|
||||
**What this engagement does not close** — and what the CISO has in writing:
|
||||
- Session token theft (AiTM phishing) → Entra ID P2 + token protection
|
||||
- EDR and post-compromise detection → Defender for Endpoint P2 or Wazuh augmentation
|
||||
- Standing privilege → PIM / PAM engagement
|
||||
- Active Directory on-premises hardening → hybrid identity and AD hardening engagement
|
||||
- Full data governance → Purview engagement
|
||||
- Backup and recovery → recovery and detection engagement
|
||||
- Incident response capability → IR planning and detection baseline engagement
|
||||
|
||||
The residual risk statement across all four packages is the honest description of what has been built and what remains. It is not a sales document — it is the record that the client's security posture was improved deliberately, with full awareness of what was and was not in scope.
|
||||
|
||||
---
|
||||
|
||||
*For the identity foundation, see [Assignment: Identity Baseline](assignment-identity-baseline.md).*
|
||||
*For the CA architecture, see [Assignment: CA Architecture](assignment-ca-architecture.md).*
|
||||
*For the device security baseline, see [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md).*
|
||||
*For the data and collaboration philosophy, see [Book V — Data & Collaboration](../books/04-data-and-collaboration.md).*
|
||||
*For the recovery and detection layer this engagement exposes as the next priority, see [Book VI — Recovery & Detection](../books/05-recovery-and-detection.md).*
|
||||
@@ -0,0 +1,222 @@
|
||||
# Assignment: Identity Baseline
|
||||
|
||||
> *Enforce what you already have. Every other M365 security control is downstream of this one.*
|
||||
|
||||
This is a **scoped assignment package** — a complete, principled delivery guide for one specific client brief. It is designed to work with limited organizational engagement and to leave behind infrastructure that holds without anyone needing to want it.
|
||||
|
||||
---
|
||||
|
||||
## The Brief
|
||||
|
||||
Client requests that fall within this scope:
|
||||
|
||||
- *"Secure our M365 / our identities are a mess"*
|
||||
- *"We need MFA enforced — the auditor asked for it"*
|
||||
- *"We got phished and IT wants to prevent it happening again"*
|
||||
- *"Review our user accounts and admin accounts"*
|
||||
- *"Make sure only the right people have access"*
|
||||
|
||||
This assignment does not require executive sponsorship. It requires one named IT lead with Global Administrator access and a tolerance for findings.
|
||||
|
||||
---
|
||||
|
||||
## Scope Boundary
|
||||
|
||||
**In scope:**
|
||||
- Entra ID authentication configuration (MFA, legacy auth, auth methods)
|
||||
- Conditional Access policy review for existing policies (not full CA architecture)
|
||||
- Global Administrator and other privileged role audit
|
||||
- Break-glass account establishment
|
||||
- Entra ID Protection risk policy baseline
|
||||
- Authentication method registration and SSPR configuration
|
||||
- Service principal and app registration review (inventory and flag — not remediate)
|
||||
|
||||
**Out of scope:**
|
||||
- Conditional Access policy design and architecture → [Assignment: CA Architecture](assignment-ca-architecture.md)
|
||||
- Device compliance and Intune → [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md)
|
||||
- Privileged Access Management (PIM, PAM, PAW) → separate privileged access engagement
|
||||
- Active Directory on-premises → hybrid identity engagement
|
||||
- Application permissions remediation → separate service identity engagement
|
||||
|
||||
When the client asks for something adjacent, log it in the scope boundary signals section at the end of the engagement. Do not absorb it silently and do not pitch the next engagement. The log is the record.
|
||||
|
||||
---
|
||||
|
||||
## Before You Touch Anything
|
||||
|
||||
These three steps happen before any change, on day one.
|
||||
|
||||
**1. Break-glass accounts.**
|
||||
If the tenant has no cloud-only break-glass accounts excluded from all CA policies, create two before proceeding. Document their credentials out of band (not in the same tenant). Alert on their sign-in. This is the safety net. Without it, a misconfigured CA policy can lock the entire tenant — including you.
|
||||
|
||||
**2. CAExporter baseline.**
|
||||
Export the current CA policy state using [CAExporter](https://github.com/merill/caexporter). This JSON export is the before-state. Every change made during this engagement is measurable against it. It is also the rollback reference if something breaks.
|
||||
|
||||
**3. Authentication sign-in log baseline.**
|
||||
Export 30 days of Entra sign-in logs, filtered for legacy authentication clients. This is the baseline for measuring the impact of legacy auth block and the evidence that the block is complete. Without it, you cannot demonstrate that legacy auth is actually gone — only that a policy exists.
|
||||
|
||||
---
|
||||
|
||||
## Principles Applied
|
||||
|
||||
**Automation over procedure.**
|
||||
Every control in this assignment is a policy, not a document. MFA enforcement is a CA policy, not a user awareness campaign. Legacy auth block is an authentication policy or CA rule, not a helpdesk notification. A procedure only works when someone follows it. A policy works when no one is looking.
|
||||
|
||||
**Kill chain first.**
|
||||
There are two controls in this assignment that matter more than all others: MFA enforcement on all users, and legacy auth block. Everything else — admin hygiene, SSPR configuration, risk policies — is valuable but secondary. If the engagement ends early, these two must be complete.
|
||||
|
||||
**Visibility as accountability.**
|
||||
Every export, every report, every baseline produced during this engagement exists in the client's own tenant and documentation system permanently. A sign-in log showing zero legacy auth clients is evidence that outlasts the engagement. An admin account inventory with a date on it creates accountability that does not require anyone to actively manage it.
|
||||
|
||||
**Scope discipline.**
|
||||
Anything discovered outside scope goes into the scope boundary log — not into the work plan. A consultant who silently fixes adjacent problems during a scoped engagement creates unscoped liability and destroys the client's ability to understand what was done. Log it, name it, leave it.
|
||||
|
||||
---
|
||||
|
||||
## Delivery Architecture
|
||||
|
||||
Sequenced by impact, not by calendar. Each step depends on the one before it.
|
||||
|
||||
### Step 1 — Baseline (no changes)
|
||||
|
||||
| Action | Output |
|
||||
|--------|--------|
|
||||
| CAExporter export | CA policy baseline JSON |
|
||||
| Break-glass accounts created and monitored | Break-glass documentation (out of band) |
|
||||
| Sign-in log export: legacy auth clients | Legacy auth client list |
|
||||
| Global Administrator audit: who holds it, cloud-only vs synced, standing vs eligible | Admin account inventory |
|
||||
| Service principal inventory: client secrets expiry, Graph permissions, admin consent | Service principal risk log |
|
||||
| Authentication method registration report | Who has MFA registered, by method |
|
||||
| SSPR configuration review | Current state documented |
|
||||
|
||||
At the end of Step 1, share the admin account inventory and legacy auth client list with the named client lead. No recommendations yet. Just findings, plainly stated.
|
||||
|
||||
---
|
||||
|
||||
### Step 2 — Kill Chain (two controls)
|
||||
|
||||
**Legacy authentication block.**
|
||||
Deploy via Entra authentication policies (tenant-wide, preferred) or CA policy (targeted by legacy auth client type). Stage it: report mode for 48 hours, confirm zero legitimate legacy auth clients in sign-in logs, then enforce. The 48-hour window exists because there are always surprises — a printer, a shared mailbox script, an MFA-unregistered VIP. Find them before enforcement, not after.
|
||||
|
||||
**MFA enforcement.**
|
||||
If the client has no CA policies at all: deploy one CA policy requiring MFA for all users, all cloud apps, excluding break-glass accounts. If the client has existing CA policies: review coverage gaps and close them. Staged: exclude a pilot group of 10 users for 24 hours, confirm no breakage, then enforce broadly.
|
||||
|
||||
These two controls are the assignment's kill chain contribution. Legacy auth block plus MFA enforcement closes the most common attack path in the Microsoft ecosystem. Both should be complete before Step 3 begins.
|
||||
|
||||
---
|
||||
|
||||
### Step 3 — Admin Hygiene
|
||||
|
||||
**Global Administrator audit.**
|
||||
Every account with Global Administrator should be cloud-only (not synced from on-premises AD — a synced account can be compromised on-prem to take the cloud). Count standing Global Admins. The target is zero standing Global Admins beyond break-glass and emergency access. If PIM is not in scope, document the gap and log it. If the client has PIM licensing (P2), note it — it is the correct next step.
|
||||
|
||||
**Admin account separation.**
|
||||
Admins should have a dedicated admin account separate from their daily-use account. If they do not, log it as a scope boundary signal for a privileged access engagement. If the client will accept one quick win: rename or create dedicated admin accounts for any standing Global Admins. This is a short task with meaningful blast-radius reduction.
|
||||
|
||||
**Service principal review.**
|
||||
Flag any service principal with:
|
||||
- Client secrets expiring in under 30 days (operational risk, not security risk — but surfaces the gap)
|
||||
- Tenant-wide admin consent granted
|
||||
- Graph permissions: `RoleManagement.ReadWrite.Directory`, `AppRoleAssignment.ReadWrite.All`, `Application.ReadWrite.All`, `Directory.ReadWrite.All`
|
||||
|
||||
Log all flags in the scope boundary signals. Do not remediate service principals in this assignment — it requires application owner coordination and deserves its own scoped engagement.
|
||||
|
||||
---
|
||||
|
||||
### Step 4 — Risk Baseline
|
||||
|
||||
**Entra ID Protection.**
|
||||
If the tenant has P2 licensing (included in E5, available separately), deploy:
|
||||
- User risk policy: require password change at High risk (Conditional Access, not legacy user risk policy)
|
||||
- Sign-in risk policy: require MFA step-up at Medium or High risk
|
||||
|
||||
If no P2: document the gap. Log the licensing delta for the leave-behind.
|
||||
|
||||
**SSPR.**
|
||||
If SSPR is not enabled: enable it for all users with a minimum of two authentication methods required. Default to Microsoft Authenticator + email or phone. SSPR with strong auth methods removes helpdesk dependency for password resets and is a prerequisite for a healthy MFA rollout.
|
||||
|
||||
---
|
||||
|
||||
## Structural Resilience Checklist
|
||||
|
||||
Controls that hold without ongoing human willingness after this engagement closes.
|
||||
|
||||
- [ ] MFA enforcement CA policy active — not in report mode
|
||||
- [ ] Legacy authentication blocked at tenant level — not just reported
|
||||
- [ ] Break-glass accounts exist, are cloud-only, are excluded from CA, are monitored with alerts
|
||||
- [ ] Break-glass credentials documented out of band
|
||||
- [ ] Sign-in risk and user risk policies active (if P2 licensed)
|
||||
- [ ] CAExporter export stored in client documentation
|
||||
- [ ] SSPR active for all users
|
||||
|
||||
These are the controls that keep working after the engagement ends. If any item is not checked at handover, document why and log the residual risk.
|
||||
|
||||
---
|
||||
|
||||
## Kill Chain Contribution
|
||||
|
||||
**What this assignment closes:**
|
||||
|
||||
| Attack vector | Control deployed |
|
||||
|---------------|-----------------|
|
||||
| Password spray against cloud accounts | MFA enforcement |
|
||||
| Credential stuffing using breached passwords | MFA enforcement + Entra ID Protection |
|
||||
| Legacy authentication protocol abuse (SMTP, IMAP, MAPI) | Legacy auth block |
|
||||
| Basic phishing for MFA bypass via legacy clients | Legacy auth block |
|
||||
| Attacker using compromised admin account persistently | Break-glass monitoring, admin hygiene |
|
||||
|
||||
**What this assignment does not close:**
|
||||
|
||||
| Remaining gap | Addressed by |
|
||||
|---------------|-------------|
|
||||
| Device-based attacks (unmanaged device as access vector) | [Assignment: Intune Security Baseline](assignment-intune-security-baseline.md) |
|
||||
| Adversary-in-the-middle / session token theft | Device compliance in CA + token protection |
|
||||
| Standing Global Administrator accounts | Privileged access engagement (PIM) |
|
||||
| Service principal over-permission | Service identity engagement |
|
||||
| Data exfiltration through sanctioned apps | Collaboration and data security assignment |
|
||||
| Persistence via application consent abuse | Service identity engagement |
|
||||
|
||||
The kill chain contribution of this assignment is significant and real. The residual gaps are also real. Both belong in the leave-behind.
|
||||
|
||||
---
|
||||
|
||||
## Leave-Behind Package
|
||||
|
||||
Every item below must be delivered at handover. The engagement is not complete until all items exist in the client's own documentation system.
|
||||
|
||||
| Artifact | Description |
|
||||
|----------|-------------|
|
||||
| **CAExporter JSON (before)** | CA policy state at engagement start |
|
||||
| **CAExporter JSON (after)** | CA policy state at engagement close |
|
||||
| **Admin account inventory** | Every privileged role assignment: account name, role, cloud-only vs. synced, standing vs. eligible, last sign-in |
|
||||
| **Legacy auth sign-in confirmation** | Sign-in log export showing zero legacy auth clients post-block |
|
||||
| **MFA registration report** | Authentication method registration by user, at engagement close |
|
||||
| **Break-glass documentation** | Account names, monitoring alert confirmation, out-of-band credential storage reference |
|
||||
| **Service principal risk log** | Flagged principals with permissions and expiry dates |
|
||||
| **Scope boundary log** | Every finding outside this scope, named and prioritized |
|
||||
| **Residual risk statement** | Plain-language summary of what this assignment did not close and why |
|
||||
|
||||
The residual risk statement is not optional. A client who receives a clean handover without a residual risk statement has been misled about their posture.
|
||||
|
||||
---
|
||||
|
||||
## Scope Boundary Signals
|
||||
|
||||
Log these when you find them. Do not fix them. Do not pitch them. The log is the record.
|
||||
|
||||
| Signal | Points toward |
|
||||
|--------|--------------|
|
||||
| No device compliance policies exist | Intune Security Baseline assignment |
|
||||
| CA policies exist but are poorly designed (overlapping, unnamed, undocumented) | CA Architecture assignment |
|
||||
| Global Admins have standing privilege with no PIM | Privileged access engagement |
|
||||
| Entra Connect / Cloud Sync server is domain-joined to production domain | Hybrid identity engagement — T0 isolation |
|
||||
| AD FS present | Hybrid identity engagement — Golden SAML risk, migration to PHS |
|
||||
| Service principals with tenant-wide admin consent | Service identity engagement |
|
||||
| No Defender for Office 365 baseline | Collaboration security assignment |
|
||||
| Audit logging not configured or retention < 90 days | Detection baseline assignment |
|
||||
|
||||
---
|
||||
|
||||
*For the conditional access architecture built on top of this baseline, see [Assignment: CA Architecture](assignment-ca-architecture.md).*
|
||||
*For technical depth on hybrid identity and the sync server risk, see [Book II — Hybrid Identity](../books/01-hybrid-identity.md).*
|
||||
*For privileged access architecture, see [Book III — Privileged Access](../books/02-privileged-access.md).*
|
||||
@@ -0,0 +1,384 @@
|
||||
# Assignment: Intune Security Baseline
|
||||
|
||||
> *The device will be compromised. Compliant is not the same as secure, and the portal toggle is not the same as the device's behaviour. Build for the compromise, not against it.*
|
||||
|
||||
This is a **scoped assignment package** — a complete, principled delivery guide for one specific client brief. It closes the device-layer gap and activates the CA Layer 3 policies designed in [Assignment: CA Architecture](assignment-ca-architecture.md). It can be delivered standalone, but its full structural value is realised when CA Layer 3 is activated at the end.
|
||||
|
||||
---
|
||||
|
||||
## The Brief
|
||||
|
||||
Client requests that fall within this scope:
|
||||
|
||||
- *"Deliver a security baseline for our Intune-managed endpoints"*
|
||||
- *"Set up Intune / we need device management"*
|
||||
- *"We need compliant devices to be required for M365 access"*
|
||||
- *"Our auditor wants evidence that devices are encrypted and patched"*
|
||||
- *"We have Intune but nobody set up the security policies"*
|
||||
- *"We're retiring SCCM and going cloud-native"* (if co-management migration is explicitly scoped)
|
||||
|
||||
This assignment does not require executive sponsorship. It requires one named IT lead with Intune Administrator access, a tolerance for a grace-period before enforcement, and an understanding that the enrollment rate at the start is almost never what the CMDB says.
|
||||
|
||||
---
|
||||
|
||||
## Scope Boundary
|
||||
|
||||
**In scope:**
|
||||
- Device population mapping (what is actually authenticating, vs. what is enrolled, vs. what the CMDB says)
|
||||
- Compliance policies: Windows, macOS, iOS, Android — as applicable to the fleet
|
||||
- Device configuration profiles: Windows security baseline settings
|
||||
- Windows Update rings (quality and feature updates)
|
||||
- Windows LAPS (local admin password management)
|
||||
- App Protection Policies for BYOD iOS and Android (MAM without MDM)
|
||||
- Enrollment review and gaps (not a new enrollment deployment unless scoped separately)
|
||||
- CA Layer 3 activation: connecting compliance state to Conditional Access
|
||||
|
||||
**Out of scope:**
|
||||
- SCCM co-management migration → separate engagement (scope is complex and fleet-specific)
|
||||
- Autopilot setup and Autopilot-based provisioning → separate deployment engagement
|
||||
- EDR configuration: Defender for Endpoint advanced features, custom detection rules → separate or within E5 engagement
|
||||
- WDAC / Smart App Control / application allowlisting → advanced application control engagement
|
||||
- Driver and firmware update management → note as gap, recommend Windows Update for Business or third-party where Intune is insufficient
|
||||
- GPO conflict resolution for hybrid-joined estates → flag; recommend cloud-native migration path
|
||||
- Endpoint Privilege Management (JIT local admin elevation) → note as follow-on if standing local admin cannot be removed
|
||||
|
||||
When the client asks about SCCM migration or Autopilot, scope it separately. Co-management is a legitimate transitional architecture but it adds complexity that deserves its own scoped engagement with its own completion criteria.
|
||||
|
||||
---
|
||||
|
||||
## Before You Touch Anything
|
||||
|
||||
**1. Break-glass exclusion.**
|
||||
Confirm that break-glass accounts are excluded from all device-compliance CA policies. A flaky compliance signal must never lock out tenant recovery. If CA Layer 3 is not yet designed, this step ensures the door is open when it is deployed.
|
||||
|
||||
**2. Four-population mapping.**
|
||||
The CMDB is a claim. Authentication logs are facts. Before configuring compliance policies, build the real device picture from four sources:
|
||||
|
||||
| Population | Source |
|
||||
|-----------|--------|
|
||||
| **Enrolled (MDM)** | Intune device list |
|
||||
| **Registered (Entra)** | Entra ID → Devices → All devices |
|
||||
| **Authenticating** | Entra sign-in logs (30 days), filtered by device detail |
|
||||
| **CMDB** | Whatever the client has |
|
||||
|
||||
Map the differences. Devices in sign-in logs but not in Intune are known-unmanaged — they reach data and you cannot apply compliance policies to them. Devices in the CMDB but not in sign-in logs may be retired equipment or offline devices that have never actually authenticated. The gap between enrolled and authenticating is the real finding, and it belongs in the leave-behind regardless of whether it is addressed in this engagement.
|
||||
|
||||
**3. Existing Intune policy audit.**
|
||||
If Intune has been configured before — even partially — audit what exists before touching anything. Duplicate compliance policies, conflicting configuration profiles, and orphaned enrollment restrictions are common. A client who says "Intune is set up" often has one compliance policy created in 2021, three enrollment profiles nobody recognises, and a Windows security baseline applied to a group that no longer exists. Export the current state.
|
||||
|
||||
**4. CA Layer 3 status.**
|
||||
Check whether `CA-AllUsers-AllApps-RequireCompliantDevice` exists in report-only mode from the CA Architecture assignment. If it does, this assignment ends by activating it. If it does not exist, design and deploy it in report-only mode as part of this assignment — but do not activate it until compliance coverage is proven.
|
||||
|
||||
---
|
||||
|
||||
## Principles Applied
|
||||
|
||||
**Compliance is a signal, not a checkbox.**
|
||||
A device marked compliant in Intune carries a staleness window: compliance is evaluated on check-in cadence, not continuously. A device can fall out of compliance — lose encryption, miss patches, be rooted — and still hold a valid compliant token and access grant for hours. Design around this: the compliance requirement at CA is a meaningful control that raises the cost of attack, not a guarantee of device integrity. Document what it is and what it isn't.
|
||||
|
||||
**Test on real devices, not portal configurations.**
|
||||
A Conditional Access policy can show a perfectly correct configuration in the portal and enforce nothing. The same applies to compliance policies: a policy assigned to a group can appear active and produce no compliance results for enrolled devices whose group membership has drifted. And MAM/App Protection enforcement has documented gaps between the toggle and the actual device behaviour — gaps that vary by platform, OS build, and companion app version. For every control that matters, confirm it with a real device producing the expected result. Write the expected result down before you test, not after.
|
||||
|
||||
**Velocity with a brake.**
|
||||
Update rings exist not to slow patching but to make patching safe at speed. An unbraked push to the entire fleet is one bad update away from a mass outage — the kind that stops production, not the kind that stops attackers. A canary ring with a real halt-and-rollback capability is the mechanism that lets the rest of the fleet patch fast and safely. The canary must be tested — an untested canary is just the first domino with a friendly name.
|
||||
|
||||
**The device is disposable; the data boundary is the protection.**
|
||||
Every design decision in this assignment should ask: if this device is wiped and reprovisioned in an hour, does anything important break? A device that can be reprovisioned in an hour is antifragile. A device whose compromise is a crisis is fragile, regardless of how many compliance policies are applied to it. Build for reprovisionability: Autopilot, LAPS, application deployment from Intune, user profile from OneDrive. The compliance baseline hardens the device; the reprovision capability makes its loss survivable.
|
||||
|
||||
---
|
||||
|
||||
## Delivery Architecture
|
||||
|
||||
### Step 1 — Population Mapping and Audit (no changes)
|
||||
|
||||
| Action | Output |
|
||||
|--------|--------|
|
||||
| Four-population mapping (enrolled / registered / authenticating / CMDB) | Device population report: counts, deltas, known-unmanaged estimate |
|
||||
| Existing compliance policy audit | Policy inventory: assignments, settings, mode, last modified |
|
||||
| Existing configuration profile audit | Profile inventory: conflicts, orphaned assignments, platform coverage |
|
||||
| Update ring inventory | Current rings or absence of rings |
|
||||
| Sign-in log: device compliance state | What proportion of sign-ins carried a compliant device signal in the last 30 days |
|
||||
| LAPS status | Whether Windows LAPS is deployed or legacy LAPS or neither |
|
||||
|
||||
Share the device population report with the named client lead before writing any policies. The finding is almost always the same: the managed fleet is smaller than assumed, the dark population is larger than assumed, and several CMDB entries have not authenticated in months. State it plainly.
|
||||
|
||||
---
|
||||
|
||||
### Step 2 — Compliance Policies (report mode first)
|
||||
|
||||
Deploy all compliance policies in report mode. Review results for 72 hours before activating noncompliance actions. The goal at this step is to see the real compliance state of the fleet — not to block anyone.
|
||||
|
||||
**Noncompliance action sequence (apply to all compliance policies):**
|
||||
|
||||
| Day | Action |
|
||||
|-----|--------|
|
||||
| 0 | Mark noncompliant (reporting only — this is immediate and always on) |
|
||||
| 1 | Send email notification to user |
|
||||
| 7 | Block access (activates when `CA-AllUsers-AllApps-RequireCompliantDevice` is enabled) |
|
||||
| 30 | Retire device (for persistent noncompliance — confirm with client lead before activating) |
|
||||
|
||||
The 7-day grace window is not leniency — it is the window in which IT can identify and remediate legitimate noncompliance (device in repair, device offline, missed check-in) before a user is blocked. Without it, the first enforcement wave produces a support ticket flood. With it, enforcement is gradual and explainable.
|
||||
|
||||
**Windows compliance policy — baseline settings:**
|
||||
|
||||
| Setting | Value | Rationale |
|
||||
|---------|-------|-----------|
|
||||
| BitLocker required | Yes | Unencrypted devices lose data on physical theft |
|
||||
| OS minimum version | Windows 10 22H2 / Windows 11 22H2 | Below this: no Windows LAPS; OS in extended support only |
|
||||
| Defender AV enabled | Yes | Baseline detection |
|
||||
| Defender real-time protection | Yes | |
|
||||
| Firewall enabled | Yes | |
|
||||
| Secure boot enabled | Yes | Blocks bootkit-level compromise |
|
||||
| TPM required | Yes (for new enrollments; consider exclusion group for legacy hardware) | PRT TPM-binding requires TPM |
|
||||
| Password required | Yes | Minimum complexity, minimum length 8 |
|
||||
| Maximum inactivity before screen lock | 15 minutes | |
|
||||
|
||||
Do not configure the compliance policy to evaluate Microsoft Defender for Endpoint risk score unless Defender for Endpoint P2 (E5) is licensed. Misconfiguring this setting against an E3 tenant produces false noncompliance for all devices.
|
||||
|
||||
**macOS compliance policy (if fleet includes Macs):**
|
||||
|
||||
| Setting | Value |
|
||||
|---------|-------|
|
||||
| FileVault enabled | Yes |
|
||||
| OS minimum version | macOS 13 (Ventura) or later |
|
||||
| Password required | Yes |
|
||||
| Firewall enabled | Yes |
|
||||
| System Integrity Protection | Yes |
|
||||
|
||||
**iOS compliance policy:**
|
||||
|
||||
| Setting | Value |
|
||||
|---------|-------|
|
||||
| OS minimum version | iOS 16 or later |
|
||||
| Passcode required | Yes |
|
||||
| Jailbreak detection | Block jailbroken devices |
|
||||
| Device threat level | Secured (no threat level tolerance) |
|
||||
|
||||
**Android compliance policy:**
|
||||
|
||||
| Setting | Value |
|
||||
|---------|-------|
|
||||
| OS minimum version | Android 12 or later |
|
||||
| Device PIN required | Yes |
|
||||
| Rooted devices | Block |
|
||||
| Minimum security patch level | Within 90 days |
|
||||
|
||||
**The honest note on jailbreak/root detection:** detection is an arms race. A motivated attacker with a current tool bypasses it. Treat root detection as a tripwire that raises the cost of the attack, never as a barrier that stops it. Document this in the residual risk statement.
|
||||
|
||||
---
|
||||
|
||||
### Step 3 — Device Configuration Baseline
|
||||
|
||||
The Microsoft Windows Security Baseline (available in Intune → Endpoint security → Security baselines) is the starting point. It encodes Microsoft's recommended settings as an Intune profile that enforces continuously.
|
||||
|
||||
**Deployment approach:**
|
||||
1. Deploy the Windows Security Baseline in **report mode** to a pilot group (10–20 devices, IT team first)
|
||||
2. Review conflicts and configuration gaps for 48 hours
|
||||
3. Resolve any conflicts with existing policies (overlapping profiles produce unpredictable results — Intune applies the stricter setting per-setting by default, but conflicting values create undefined behaviour)
|
||||
4. Expand to production groups
|
||||
5. Monitor Intune reports for policy conflicts and noncompliance
|
||||
|
||||
**Additional configuration profiles (deploy after the security baseline is stable):**
|
||||
|
||||
| Profile | Purpose | Notes |
|
||||
|---------|---------|-------|
|
||||
| **BitLocker configuration** | Enable BitLocker silently, escrow recovery keys to Entra | Separate from compliance (compliance requires BitLocker; this profile configures how it's applied) |
|
||||
| **Microsoft Defender AV** | Configure exclusions, scheduled scans, PUA protection | Do not configure AV exclusions broadly — each exclusion reduces coverage |
|
||||
| **Firewall configuration** | Block inbound connections, logging | Complements compliance requirement |
|
||||
| **Edge browser baseline** | SmartScreen, extension management, safe browsing, disable password manager sync | Applies to corporate Edge profile; test carefully — extension management can break legitimate workflows |
|
||||
| **Windows Hello for Business** | Phishing-resistant authentication at device layer | If deploying phishing-resistant MFA (required by CA-Admins policy), WHfB is the most practical path |
|
||||
|
||||
---
|
||||
|
||||
### Step 4 — Update Rings
|
||||
|
||||
Update rings are the mechanism that makes patching fast and safe simultaneously. Deploy three rings minimum.
|
||||
|
||||
**Ring structure:**
|
||||
|
||||
| Ring | Assignment | Quality update deferral | Feature update deferral | Notes |
|
||||
|------|-----------|------------------------|------------------------|-------|
|
||||
| **Canary** | IT team (5–10 devices) | 0 days | 0 days | Takes every update immediately. Canary for production rings. Must include at least one machine that runs every critical business application. |
|
||||
| **Pilot** | 10–15% of fleet, varied roles | 7 days | 30 days | Broad business representation. If Canary is clear after 7 days, Pilot proceeds. |
|
||||
| **Production** | Remainder | 14 days | 90 days | Conservative deferral. If Pilot is clear after 7 days, Production proceeds. |
|
||||
|
||||
**Pause and rollback configuration:**
|
||||
Configure Intune update rings with the pause capability enabled. Define in the client's runbook:
|
||||
- Who has authority to pause an update ring (named person, not a committee)
|
||||
- What the trigger is for pausing (Canary devices showing a known issue, not a vague "something might be wrong")
|
||||
- Maximum pause duration before the pause is reviewed (7 days)
|
||||
|
||||
An untested pause capability is a fiction. Test it during the engagement: deploy an update to Canary, confirm it lands, pause the ring, confirm the pause holds, resume. This takes 30 minutes and is the only proof the mechanism works.
|
||||
|
||||
---
|
||||
|
||||
### Step 5 — Windows LAPS
|
||||
|
||||
Standing local administrator accounts are the device-layer version of standing privilege. If the same local admin password is shared across the fleet (common in legacy environments), one compromised device yields lateral movement credentials for the entire estate.
|
||||
|
||||
**Windows LAPS (cloud-native):**
|
||||
- Available on Windows 10 22H2+ and Windows 11 22H2+ with current patches
|
||||
- Configure backup target: Entra ID (cloud-native; no on-prem infrastructure required)
|
||||
- Rotation schedule: 30 days, plus rotate on device handoff
|
||||
- Requires Entra ID P1 (included in E3)
|
||||
|
||||
**Deployment:**
|
||||
1. Enable LAPS in Entra ID (Entra admin center → Devices → Device settings → Enable Microsoft Entra Local Administrator Password Solution)
|
||||
2. Create an Intune LAPS policy (Endpoint security → Account protection → LAPS)
|
||||
3. Assign to a pilot group; confirm password backup to Entra after check-in
|
||||
4. Expand to production
|
||||
|
||||
**For legacy LAPS (on-prem AD environments where Windows LAPS is not yet deployable):**
|
||||
Legacy LAPS (the original Microsoft LAPS MSI) remains deployable via Intune for hybrid-joined devices. Flag this as a transitional state — cloud-native Windows LAPS is the destination.
|
||||
|
||||
**What this does not solve:** if standing Domain Admin or local admin is provided to specific IT staff outside of LAPS, that standing privilege is out of scope for this assignment. Log it in scope boundary signals.
|
||||
|
||||
---
|
||||
|
||||
### Step 6 — App Protection Policies (BYOD)
|
||||
|
||||
App Protection Policies (MAM without MDM) manage the data layer on personal devices without enrolling the device. This is the correct model for BYOD: wall the corporate data, not the device.
|
||||
|
||||
**The honest caveat, stated plainly:** App Protection Policy enforcement has gaps. The policy controls what managed apps should do; the actual enforcement is dependent on the app version, OS version, companion app (Company Portal on Android), and specific API support. "Block copy/paste to unmanaged apps" blocks in documented paths — it does not block screenshots, OS-level share sheet on some platforms, or every third-party clipboard manager. Test on real devices. Document what you verified and where the limits are.
|
||||
|
||||
**Deploy separate policies per platform.** iOS and Android are not symmetric. A policy that works on iOS may not produce the same behaviour on Android. Test both independently.
|
||||
|
||||
**iOS App Protection Policy — baseline settings:**
|
||||
|
||||
| Setting | Value |
|
||||
|---------|-------|
|
||||
| Prevent "Save As" to personal storage | Block |
|
||||
| Restrict cut/copy/paste to managed apps only | Managed apps with paste in |
|
||||
| Require PIN for app access | Yes (after 5 minutes inactivity) |
|
||||
| Minimum OS version | iOS 16 |
|
||||
| Offline grace period before access blocked | 720 hours (30 days) |
|
||||
| Selective wipe after failed PIN attempts | Yes (after 10 attempts) |
|
||||
| Minimum app version | Latest − 1 (configure per app) |
|
||||
| Jailbroken/rooted devices | Block |
|
||||
|
||||
Apply to: Outlook, Teams, Edge, OneDrive, SharePoint mobile. These are the apps through which corporate data flows on BYOD devices.
|
||||
|
||||
**Android App Protection Policy — same baseline settings.** Test enforcement independently — behaviour on Android differs, particularly clipboard controls and "open in" restrictions.
|
||||
|
||||
**Selective wipe verification:**
|
||||
Test selective wipe on a real BYOD device before the engagement closes. Confirm that corporate data (email, files, Teams content) is removed and personal data (photos, personal apps) is not. This is the capability that makes MAM politically viable — if the user doesn't trust that it won't touch their personal data, enrollment fails. Document the test.
|
||||
|
||||
---
|
||||
|
||||
### Step 7 — CA Layer 3 Activation
|
||||
|
||||
This is the step that connects device compliance to access control. Everything before this point has been deploying and measuring; this step makes compliance matter for access.
|
||||
|
||||
**Prerequisites before activating:**
|
||||
|
||||
- [ ] Compliance policy deployed and returning results for ≥ 80% of the enrolled fleet
|
||||
- [ ] 72 hours of report-only compliance results reviewed — no widespread false noncompliance identified
|
||||
- [ ] Break-glass accounts confirmed excluded from device compliance CA policies
|
||||
- [ ] Named client lead has approved activation in writing
|
||||
- [ ] IT team briefed on noncompliance action timeline (users blocked after day 7 if noncompliant)
|
||||
- [ ] Helpdesk runbook written: what to do when a user is blocked due to noncompliance
|
||||
|
||||
**Activation sequence:**
|
||||
1. Switch `CA-AllUsers-AllApps-RequireCompliantDevice` from report-only to **enabled**
|
||||
2. Monitor Intune compliance dashboard and Entra sign-in logs for 24 hours
|
||||
3. Confirm: compliant devices are signing in successfully; noncompliant devices are being blocked at CA
|
||||
4. Confirm: break-glass accounts are not blocked
|
||||
|
||||
Do not activate device-compliance CA policies on a Monday or before a public holiday. An unexpected compliance failure during a period of low IT staffing is a bad outcome that a one-day wait entirely prevents.
|
||||
|
||||
**After activation, the compliance signal is live.** A device that loses compliance — drops encryption, falls behind on patches, is rooted — will be blocked from M365 access within the 7-day noncompliance action window. This is the control working as designed.
|
||||
|
||||
---
|
||||
|
||||
## Structural Resilience Checklist
|
||||
|
||||
Controls that hold without ongoing human willingness after this engagement closes.
|
||||
|
||||
- [ ] Compliance policies deployed and returning results for enrolled devices
|
||||
- [ ] Noncompliance action timer active (day 7 block — not just report)
|
||||
- [ ] Windows Security Baseline profile active on production fleet
|
||||
- [ ] Update rings deployed with Canary, Pilot, and Production separation
|
||||
- [ ] Update ring pause tested at least once
|
||||
- [ ] Windows LAPS deployed; local admin passwords backing up to Entra
|
||||
- [ ] App Protection Policies active for iOS and Android BYOD (tested on real devices)
|
||||
- [ ] Selective wipe tested on BYOD device
|
||||
- [ ] `CA-AllUsers-AllApps-RequireCompliantDevice` **enabled** (not report-only)
|
||||
- [ ] Break-glass accounts excluded from device compliance CA policies — confirmed with a real sign-in
|
||||
|
||||
---
|
||||
|
||||
## Kill Chain Contribution
|
||||
|
||||
**What this assignment closes (or significantly raises the cost of):**
|
||||
|
||||
| Attack vector | Control deployed |
|
||||
|---------------|-----------------|
|
||||
| Stolen credentials used from unmanaged/unknown device | CA Layer 3: compliant device required |
|
||||
| Physical theft of unencrypted device | BitLocker compliance requirement |
|
||||
| Lateral movement via shared local admin credentials | Windows LAPS: unique per-device passwords |
|
||||
| Unpatched OS exploited at known CVE | Update rings: enforced patch cadence |
|
||||
| BYOD personal device accessing corporate data without controls | App Protection Policies: data container on unmanaged device |
|
||||
| Attacker persistence on device after credential reset | Compliance noncompliance action: device retired after persistent noncompliance |
|
||||
|
||||
**What this assignment does not close:**
|
||||
|
||||
| Remaining gap | Addressed by |
|
||||
|---------------|-------------|
|
||||
| Session token theft post-compliance check (AiTM phishing) | Entra token protection (P2) + continuous access evaluation |
|
||||
| Compromised but still-compliant device (stale signal window) | Defender for Endpoint device risk integration (E5) |
|
||||
| App-layer data exfiltration through sanctioned apps | Collaboration and data security assignment |
|
||||
| Advanced malware, post-exploitation on managed device | EDR: Defender for Endpoint P2 (E5) or Wazuh/Sysmon augmentation |
|
||||
| Standing privilege on servers accessed from managed devices | Privileged access engagement |
|
||||
| Dark access (legacy auth, long-lived tokens bypassing CA) | Legacy auth block (identity baseline) + token lifetime policies |
|
||||
|
||||
The most important gap to document plainly: a managed, compliant device that carries a stolen session token (issued after legitimate MFA) still has access. The compliance signal does not re-evaluate session tokens retroactively. Continuous Access Evaluation (CAE) narrows this window for supported apps — verify which apps in the client's environment support CAE, and document the remainder as residual risk.
|
||||
|
||||
---
|
||||
|
||||
## Leave-Behind Package
|
||||
|
||||
| Artifact | Description |
|
||||
|----------|-------------|
|
||||
| **Device population report** | Four-population map: enrolled, registered, authenticating, CMDB; delta analysis; known-unmanaged estimate |
|
||||
| **Compliance policy documentation** | Every policy: settings, assignments, noncompliance action timeline, rationale |
|
||||
| **Compliance dashboard export** | Compliance rates by policy and platform at engagement close |
|
||||
| **Configuration profile documentation** | Security baseline and supplemental profiles: settings, assignments, conflict analysis |
|
||||
| **Update ring documentation** | Ring structure, deferral schedule, pause/rollback procedure, pause test result |
|
||||
| **LAPS deployment confirmation** | Devices with LAPS active; Entra backup confirmed; rotation schedule |
|
||||
| **App Protection Policy documentation** | iOS and Android policies: settings, tested behaviours, documented gaps per platform |
|
||||
| **Selective wipe test record** | Device tested, result, personal data confirmed intact |
|
||||
| **CA Layer 3 activation confirmation** | Sign-in log showing compliant devices accessing successfully, noncompliant devices blocked |
|
||||
| **Scope boundary log** | Every finding outside this scope, named and prioritized |
|
||||
| **Residual risk statement** | What this assignment did not close: stale compliance signal, AiTM token theft, EDR gap, dark access |
|
||||
|
||||
---
|
||||
|
||||
## Scope Boundary Signals
|
||||
|
||||
| Signal | Points toward |
|
||||
|--------|--------------|
|
||||
| Shadow IT apps visible in Intune application inventory | Collaboration and data security assignment; shadow AI discovery |
|
||||
| SCCM co-management active; GPO policies conflicting with Intune | Co-management migration engagement; AD hardening |
|
||||
| Hybrid-joined devices that depend on line-of-sight to DC | Cloud-native migration path; hybrid identity engagement |
|
||||
| No Defender for Endpoint P2; device risk signal not feeding CA | E5 licensing gap; E3 augmentation with Wazuh/Sysmon |
|
||||
| Standing local admin accounts for IT staff outside LAPS scope | Privileged access engagement (Endpoint Privilege Management) |
|
||||
| Autopilot not configured; device reprovision takes days not hours | Autopilot deployment engagement |
|
||||
| Legacy devices below Windows 10 22H2 in the compliance-excluded group | Accelerate OS refresh; document as known risk with timeline |
|
||||
| Audit log retention < 90 days | Detection baseline assignment |
|
||||
| MAM enforcement gaps found during BYOD testing | Document with vendor; consider MDM enrollment for corporate-issued mobile |
|
||||
|
||||
---
|
||||
|
||||
## Buildable-On: What the Next Assignment Depends On
|
||||
|
||||
The Collaboration and Data Security assignment builds on the device posture deployed here. Specifically:
|
||||
|
||||
1. **`CA-AllUsers-UnmanagedDevice-AppEnforcedRestrictions` behaviour** is now testable against the real unmanaged device population. With enrolled and unmanaged devices mapped, you know which users will be affected by app-enforced restrictions and can design the policy accurately.
|
||||
2. **The application inventory from Intune** surfaces the shadow IT picture that informs data security scope — what apps are running, what cloud storage is installed, whether consumer AI tools are present.
|
||||
3. **Managed device as a data exfiltration boundary** — with compliant devices required for access, the remaining data risk is through sanctioned apps on managed devices. That is the scope of the next assignment.
|
||||
|
||||
---
|
||||
|
||||
*For the identity foundation, see [Assignment: Identity Baseline](assignment-identity-baseline.md).*
|
||||
*For the CA Layer 3 policies this assignment activates, see [Assignment: CA Architecture](assignment-ca-architecture.md).*
|
||||
*For the governing philosophy on device posture, see [Book IV — Devices & Endpoint](../books/03-devices-and-intune.md).*
|
||||
@@ -0,0 +1,93 @@
|
||||
# Kill Chain Assessment App
|
||||
|
||||
> *"We say it in every engagement: find the kill chain first. But how do you find it in territory you've never seen? You don't start with the chain — you start with the questions that surface the edges, and you let the graph tell you where the shortest path to the end of the company actually runs."*
|
||||
|
||||
This document specifies the **Kill Chain Assessment app** — a single-file, offline browser tool a consultant runs during the diagnostic to turn an unknown estate into a mapped attack graph, compute the shortest existential path (the kill chain), and size every node on it into a remediation [quantum](../core/quantum-vulnerability-management.md).
|
||||
|
||||
**The tool:** [`tools/kill-chain-assessment.html`](../tools/kill-chain-assessment.html) — open it in any browser. No install, no network, no data leaves the machine. State persists locally and exports to `.json` (to resume) and `.md` (to drop straight into the report or the [Findings Backlog](../assessment-templates/findings-backlog.md)).
|
||||
|
||||
---
|
||||
|
||||
## Why this needed to be built
|
||||
|
||||
The handbook and the [Move Fast and Fix Things](../core/move-fast-and-fix-things.md) posture both rest on a single instruction: *fix the kill chain first.* The [assessment team guide](../assessment-templates/assessment-team-guide.md) tells you what to run (BloodHound, Purple Knight, Elysium, Entra checks); the [sample engagement](sample-engagement-mid-market.md) shows a finished kill chain drawn as an ASCII path. But between "run the tools" and "here is the finished chain" there is a synthesis step that has always lived only in the consultant's head: **taking a pile of findings about an unfamiliar estate and working out which sequence of them actually ends the company.**
|
||||
|
||||
In unknown territory that synthesis is hard, inconsistent between consultants, and easy to get wrong — the obvious 9.8 grabs attention while the cheap two-hop path to the backups goes unseen. The app makes the synthesis explicit and repeatable: capture what you find as nodes and attacker moves, and let a shortest-path computation surface the chain you'd otherwise have to spot by eye. It is the missing instrument for the first and most important act of every engagement.
|
||||
|
||||
---
|
||||
|
||||
## The model
|
||||
|
||||
### Nodes
|
||||
|
||||
A **node** is any asset, foothold, identity, or system. Each carries the attributes that determine its position in the chain:
|
||||
|
||||
| Attribute | Meaning | Drives |
|
||||
|-----------|---------|--------|
|
||||
| **Layer** | entry / identity / privilege / device / data / infra-OT / recovery | Orientation, report grouping |
|
||||
| **Tier** | T0 / T1 / T2 ([T0 Asset Framework](../core/t0-asset-framework.md)) | Blast-radius weighting |
|
||||
| **Entry point** | Internet-reachable or unauth foothold | Source of the chain |
|
||||
| **Crown jewel** | Existential — the org cannot operate without it | End of the chain |
|
||||
| **Reachable?** | Can the adversary actually get to it (yes/no/**unknown**) | Quantum sizing |
|
||||
| **Exploit available?** | Working path/exploit in the wild (yes/no/**unknown**) | Quantum sizing |
|
||||
| **Compensating control** | EDR / WAF / segmentation already in front | Quantum sizing (the ~90% subtraction) |
|
||||
|
||||
The "unknown" values are first-class, not placeholders: a node you cannot characterise is a **dark quantum**, and capturing it honestly is the point.
|
||||
|
||||
### Moves (edges)
|
||||
|
||||
A **move** is one directed attacker step — "from here, an attacker can reach there" — with a *mechanism* (how: DCSync, NTLM relay, password spray, reused credential, OAuth consent) and an *effort* weight from 1 (trivial) to 5 (very hard). Effort is the consultant's judgement of how hard that single hop is for the adversary.
|
||||
|
||||
### The computation
|
||||
|
||||
The app runs a **multi-source Dijkstra** from every entry point across the move graph, and finds the **lowest-total-effort path to any crown jewel.** That path *is* the kill chain — the cheapest route from foothold to existential impact. The tool then classifies every node:
|
||||
|
||||
- **P0** — on the shortest chain. Break any one link and the existential path is severed.
|
||||
- **P1** — on *some* path from an entry to a jewel (reachable-from-entry ∧ can-reach-a-jewel), but not on the cheapest one.
|
||||
- **P2 / off-chain** — not on any path to a crown jewel. Real, but not existential — housekeeping, not kill chain.
|
||||
|
||||
This is the [Move Fast](../core/move-fast-and-fix-things.md) doctrine made computable: *kill-chain position sets priority, not CVSS.*
|
||||
|
||||
### Quantum sizing
|
||||
|
||||
Each node on a chain is sized into a [quantum](../core/quantum-vulnerability-management.md) by the same logic the framework defines:
|
||||
|
||||
| Quantum | Condition | Budget / action |
|
||||
|---------|-----------|-----------------|
|
||||
| **Critical** | On shortest chain, reachable **yes**, exploit **yes**, not compensated | **Hours** — sever reachability / compensating control now |
|
||||
| **Severe** | On a chain, reachable **or** exploit = yes | **Days** — one change window, verify enforcement |
|
||||
| **Standard** | On a chain, neither reachable nor exploitable yet | **Sprint** — batch; patch velocity fits here |
|
||||
| **Dark** | On a chain but reachability **or** exploit = unknown | **Unsized** — route to discovery; characterise first |
|
||||
|
||||
---
|
||||
|
||||
## How to run it in an engagement
|
||||
|
||||
1. **Open the tool** and clear the sample (or keep it as a worked reference). Switch to the **Discovery** tab — it lists, per layer, the questions and commands that surface edges (external scan for entries, the Connect sync account for the cloud↔on-prem bridge, BloodHound `shortestPath` for privilege, "what stops the business operating?" for jewels, flat-network checks for blast radius). This is the unknown-territory protocol.
|
||||
2. **Capture as you go.** Every finding from the [assessment team guide](../assessment-templates/assessment-team-guide.md) becomes a node; every "an attacker could move from X to Y" becomes a move. Mark entries and jewels. Leave reachability/exploit as *unknown* when you genuinely don't know — that flags the dark quanta to chase.
|
||||
3. **Read the chain.** The centre panel draws the attack graph and highlights the shortest existential path in red. The right panel sizes the quanta. If no path is found, either the estate is genuinely segmented there (note it as a win) or you haven't mapped the connecting moves yet — in unknown territory, assume the latter until proven.
|
||||
4. **Export.** `Export report .md` produces a kill-chain section, quantum-bucketed remediation, and a priority table ready to paste into the diagnostic deliverable. `Save .json` lets you resume or hand off.
|
||||
5. **Close the loop.** After remediation, reload the `.json` and ask the antifragile question the framework demands: *did the chain get shorter?* A severed link or a collapsed privilege should visibly lengthen the shortest path or remove it entirely.
|
||||
|
||||
---
|
||||
|
||||
## What it is and is not
|
||||
|
||||
It is a **synthesis and prioritisation instrument** — it makes the consultant's kill-chain judgement explicit, repeatable, and exportable, and it removes the human error of eyeballing the cheapest path. It is deliberately **offline and dependency-free** (Pillar 4, Sovereign Intelligence: the attack graph of a client estate must never leave the consultant's machine for a vendor cloud).
|
||||
|
||||
It is **not** a scanner and not an autonomous agent. It does not discover assets for you — it structures what you discover. The discovery still comes from the tools in the [assessment team guide](../assessment-templates/assessment-team-guide.md) and the [zero-budget discovery](zero-budget-vulnerability-discovery.md) playbooks; the autonomous hours-lane execution lives in [AI-Assisted TVM](ai-assisted-tvm.md). This tool is the bridge between them: it turns raw discovery into a sized, prioritised chain that the rest of the programme acts on.
|
||||
|
||||
---
|
||||
|
||||
## Roadmap (build-later)
|
||||
|
||||
The current tool is a self-contained synthesis instrument. Natural extensions, in priority order:
|
||||
|
||||
1. **Import from BloodHound / Purple Knight** — ingest exported attack paths directly as nodes and moves, rather than hand-entry.
|
||||
2. **PULSAR / ASTRAL signal overlay** — pull live reachability and config-drift signal so "reachable?" is answered by observation, not assertion (Book I: validate by observation).
|
||||
3. **Chain-shortening tracker** — store successive `.json` snapshots and chart kill-chain length over time, making the antifragile feedback loop a number on a dashboard.
|
||||
4. **Multi-chain view** — surface the top-N existential paths, not just the cheapest, so secondary chains (the [sample engagement](sample-engagement-mid-market.md) on-prem path) aren't hidden behind the primary.
|
||||
|
||||
---
|
||||
|
||||
*Specified for [Book VII — Vulnerability Management](../books/06-vulnerability-management.md) and the [Quantum Vulnerability Management](../core/quantum-vulnerability-management.md) framework. The tool: [`tools/kill-chain-assessment.html`](../tools/kill-chain-assessment.html).*
|
||||
@@ -0,0 +1,251 @@
|
||||
# ORION — Technical Proposition
|
||||
|
||||
> *"The kill chain exists before you have access to a single system. It's already drawn — in the org chart, the procurement history, the sector's threat landscape, and the things people will tell you in a room if you ask the right questions. ORION is the instrument for reading that chain on day zero, before a single tool has touched the estate."*
|
||||
|
||||
**Codename:** ORION (the Hunter — it hunts the kill chain). Celestial, consistent with ASTRAL / PULSAR / AURORA. Rename freely.
|
||||
|
||||
**Status:** Technical proposition — pre-build. This document exists to be argued with before any code is written.
|
||||
|
||||
**One line:** ORION is the pre-engagement intake, interview, and threat-intelligence layer that produces the input the [Kill Chain Assessment app](kill-chain-assessment-app.md) (L1) consumes — turning structured human answers and public intelligence into a *hypothesised* attack graph, without ever touching client infrastructure.
|
||||
|
||||
---
|
||||
|
||||
## 1. Why this needs to exist
|
||||
|
||||
The L1 [Kill Chain Assessment app](kill-chain-assessment-app.md) is a synthesis instrument: you feed it nodes and attacker moves you've already discovered, and it computes the shortest existential path and sizes the [quanta](../core/quantum-vulnerability-management.md). It assumes you already have findings — BloodHound paths, Entra checks, the [assessment team guide](../assessment-templates/assessment-team-guide.md) output.
|
||||
|
||||
But on **day zero of a new engagement** you have none of that. You may not even have access yet — the contract may not permit infrastructure contact, the change-advisory board hasn't met, the client's legal team is still reviewing the scope. And yet this is exactly the moment the consultant most needs a hypothesis: *where is this company's kill chain likely to run, what should we ask, and what should we look at first when access arrives?*
|
||||
|
||||
Today that reasoning lives entirely in the experienced consultant's head. It is the single least reproducible, least scalable part of the practice — a senior consultant walks in, asks fifteen sharp questions, and forms a mental model of the likely kill chain; a junior consultant asks the obvious questions and misses it. ORION makes that reasoning **explicit, structured, intel-informed, and repeatable** — and it does so in the window before fieldwork is even possible.
|
||||
|
||||
ORION is, deliberately, the "What If" tool of the assessment world (Book I). It produces a *declared* picture — what the client says, what public intel suggests — which is precisely the picture the rest of the engagement exists to validate by observation. Naming that honestly is the whole design (see §7).
|
||||
|
||||
---
|
||||
|
||||
## 2. The hard boundary: ORION never touches client infrastructure
|
||||
|
||||
This is the defining constraint and the primary selling point, not a limitation to apologise for.
|
||||
|
||||
ORION works from exactly two input classes:
|
||||
|
||||
1. **What humans tell it** — structured intake and questionnaire responses from the client.
|
||||
2. **Passive public intelligence** — sector threat landscape, CISA KEV, vendor advisories, exploited-CVE feeds, public OSINT about the named technology stack. **Passive only**: ORION reads public and threat-intelligence sources. It does *not* perform active external scanning — that is a separate, consented capability (see [Perimeter Scanning Capability](perimeter-scanning-capability.md)) and explicitly out of ORION's scope.
|
||||
|
||||
What this buys:
|
||||
|
||||
- **Zero onboarding friction.** No credentials, no agent, no firewall change, no data-processing agreement for telemetry. ORION can run during the sales conversation, in the pre-contract phase, or in a sector where the client cannot yet grant access.
|
||||
- **No incident risk.** A tool that touches nothing breaks nothing and triggers no alerts. It can never be the cause of an outage or a "who ran that scan?" conversation.
|
||||
- **Clean legal posture.** The only client data ORION holds is what the client deliberately typed into a questionnaire. That is a categorically simpler privacy and liability position than any tool that ingests infrastructure data.
|
||||
|
||||
The boundary is also the honest limit: because ORION observes nothing, everything it produces is a hypothesis (§7).
|
||||
|
||||
---
|
||||
|
||||
## 3. The three-stage workflow
|
||||
|
||||
### Stage 1 — Intake (minutes)
|
||||
|
||||
A short structured form establishes the engagement's shape. The consultant fills this, usually from the first call:
|
||||
|
||||
- Sector and sub-sector (drives the threat-landscape lookup and the regulatory profile)
|
||||
- Size, geography, and regulatory exposure (NIS2 / DORA / GDPR / sector-specific)
|
||||
- Technology footprint at a coarse level: M365 (E3/E5/BP), hybrid AD vs cloud-only, major cloud, OT/ICS presence, internet-facing services they'll admit to
|
||||
- Business-level crown jewels: "what stops the company operating?" — ERP, payment rails, OT control, the customer database
|
||||
- Known history: prior incidents, prior pentest, known pain points
|
||||
|
||||
### Stage 2 — Generate the tailored questionnaire (the core trick)
|
||||
|
||||
ORION's LLM expands the intake into a **detailed, role-targeted, adaptive questionnaire**, and this is where it earns its keep. The questionnaire is:
|
||||
|
||||
- **Role-segmented** — separate tracks for the identity/AD admin, the M365 admin, the network/OT lead, and the business owner. Each person answers only what they'd know.
|
||||
- **Adaptive** — questions branch on prior answers. Hybrid AD declared → the Entra Connect sync-account and DCSync questions appear. OT declared → Purdue-model and remote-vendor-access questions appear. Cloud-only → the questionnaire skips on-prem forest-recovery questions entirely.
|
||||
- **Framed against the kill chain, not compliance** — every question maps to a candidate node or edge ("Do any standing Domain Admins log into normal workstations for email?" targets a known privilege-path edge), not to a control checkbox. This is the inversion the whole practice rests on.
|
||||
|
||||
The client fills it via a shared per-engagement link, partially and over time, with their own people answering their own sections.
|
||||
|
||||
### Stage 3 — Synthesis → hypothesised kill chain → L1 export
|
||||
|
||||
From the responses plus the threat intel, ORION proposes:
|
||||
|
||||
- **Candidate entry points** (internet-facing services, legacy auth, the contractor-access pattern), each with the intel that suggests it.
|
||||
- **Candidate crown jewels** (from the business answers).
|
||||
- **Hypothesised moves** between them, each with a *mechanism*, a *confidence*, and a *rationale citing its source* ("hybrid AD + unrotated KRBTGT declared → likely Entra-Connect→on-prem DCSync edge").
|
||||
- **A prioritised "look here first" list** for when fieldwork begins — what to point BloodHound, the Entra review, and the L1 app at on day one.
|
||||
|
||||
The synthesis exports directly to the **L1 Kill Chain Assessment app's `.json` schema**, so the consultant opens L1 with the hypothesised graph already drawn and spends fieldwork *validating and correcting* it rather than building from a blank canvas. ORION hypothesises; L1 plus fieldwork confirm or kill each hypothesis by observation.
|
||||
|
||||
---
|
||||
|
||||
## 4. Threat-intelligence layer
|
||||
|
||||
ORION continuously contextualises the client against the *current* threat environment — the dimension a static questionnaire can't capture and the one that feeds the [quantum](../core/quantum-vulnerability-management.md) sort key's "exploit availability" axis:
|
||||
|
||||
- **CISA KEV and exploited-CVE feeds** — for the client's named technologies, what is being exploited *now*.
|
||||
- **Vendor advisories** — current critical advisories for their declared stack (the VPN appliance, the mail gateway, the ERP).
|
||||
- **Sector threat landscape** — which actors and ransomware groups are currently targeting their vertical, drawn from public reporting.
|
||||
|
||||
Each intel item carries **provenance** (source, date, URL) because ORION's output is advisory and the consultant must be able to trace and re-verify every claim. Threat intel ages fast; ORION timestamps everything and treats stale intel as a prompt to re-check, never as fact.
|
||||
|
||||
---
|
||||
|
||||
## 5. Architecture
|
||||
|
||||
Deliberately mirrors CISO Assistant and the AURORA model so it's familiar to operate and fits the suite.
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ ORION (Docker Compose, consultant self-hosted) │
|
||||
│ │
|
||||
│ ┌────────────┐ ┌──────────────┐ ┌───────────────────┐ │
|
||||
│ │ Web UI │ │ API backend │ │ PostgreSQL │ │
|
||||
│ │ (SvelteKit │◄─►│ (FastAPI or │◄─►│ engagements, │ │
|
||||
│ │ or React) │ │ Django/DRF) │ │ responses, │ │
|
||||
│ └────────────┘ └──────┬───────┘ │ hypotheses │ │
|
||||
│ client fills │ └───────────────────┘ │
|
||||
│ questionnaire │ │
|
||||
│ via shared link ▼ │
|
||||
│ ┌──────────────────────┐ │
|
||||
│ │ LLM abstraction │ pluggable backend │
|
||||
│ │ layer │──► Ollama (default) │
|
||||
│ └──────────────────────┘──► Azure OpenAI (opt) │
|
||||
│ │ └──► llm.cqre.net (opt) │
|
||||
│ ▼ │
|
||||
│ ┌──────────────────────┐ │
|
||||
│ │ Threat-intel │ passive fetch only: │
|
||||
│ │ connector module │──► CISA KEV, advisories│
|
||||
│ └──────────────────────┘──► curated OSINT/search│
|
||||
│ │ │
|
||||
│ ┌──────────┴───────────┐ ┌─────────────────┐ │
|
||||
│ │ L1 export adapter │──►│ kill-chain .json│ │
|
||||
│ └──────────────────────┘ └─────────────────┘ │
|
||||
│ ┌──────────────────────┐ │
|
||||
│ │ MCP server │ AURORA / Claude can │
|
||||
│ │ (query ORION) │ query engagements │
|
||||
│ └──────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
NO connection to client infrastructure
|
||||
```
|
||||
|
||||
Components:
|
||||
|
||||
- **Backend** — FastAPI (Python) or Django REST, matching CISO Assistant's proven stack. Houses the questionnaire engine, synthesis orchestration, and export.
|
||||
- **Frontend** — SvelteKit or React. Two surfaces: the consultant console and the client-facing questionnaire (shareable per-engagement link, no client login burden beyond a token).
|
||||
- **LLM abstraction layer** — single internal interface, swappable backend. **Default: local Ollama** so sensitive intake data never leaves the box (§6). Optional: Azure OpenAI (EU) or managed `llm.cqre.net`, exactly as ASTRAL/AURORA offer.
|
||||
- **Questionnaire engine — questions-as-data** — adopting CISO Assistant's "frameworks as data, not code" principle: questionnaire templates, branching rules, and node/edge mappings live in the database as editable data, so new sector packs and question sets ship without code changes.
|
||||
- **Threat-intel connector** — passive fetchers for KEV, advisories, and curated search, each normalised into a provenance-tagged `ThreatIntelItem`.
|
||||
- **L1 export adapter** — emits the exact `.json` schema the L1 app imports.
|
||||
- **MCP server** — exposes ORION engagement state to AURORA and to AI assistants, consistent with the rest of the suite.
|
||||
|
||||
### Data model (sketch)
|
||||
|
||||
| Entity | Holds | Notes |
|
||||
|--------|-------|-------|
|
||||
| `Engagement` | Client, scope, status | Per-engagement isolation boundary |
|
||||
| `IntakeProfile` | Stage-1 answers | Drives questionnaire generation |
|
||||
| `QuestionnaireTemplate` | Questions, branching rules, node/edge mappings | Questions-as-data; sector packs |
|
||||
| `Response` | Client answers, respondent role, timestamp | Sensitive — encrypted at rest |
|
||||
| `ThreatIntelItem` | Intel + source + date + URL | Provenance mandatory |
|
||||
| `Hypothesis` | Candidate node/edge + confidence + rationale + sources | The advisory output; never a "finding" |
|
||||
| `Export` | Generated L1 `.json` snapshots | Versioned, so you can diff intake-time vs post-fieldwork |
|
||||
|
||||
---
|
||||
|
||||
## 6. Sovereignty and data handling
|
||||
|
||||
ORION holds something genuinely sensitive: a client's own description of where they are weak. That is a map of the kill chain drawn by the victim. The data posture must be uncompromising and is a direct expression of Pillar 4 (Sovereign Intelligence — never rent your ability to think) and Pillar 1.
|
||||
|
||||
- **Local LLM by default.** Ollama runs in the same Compose stack; intake and responses never leave the consultant's host unless a backend is *explicitly* switched. The default must be the safe one.
|
||||
- **Encryption at rest** for `Response` and `Hypothesis` data; per-engagement key isolation.
|
||||
- **Retention and deletion.** Each engagement has a retention clock and a hard "right to delete" — when the engagement closes, the client's answers can be destroyed and the destruction evidenced (GDPR-friendly, and the right thing).
|
||||
- **No telemetry, no phone-home.** Consistent with the offline ethos of the L1 tool.
|
||||
- **Untrusted-content handling.** Threat-intel fetched from the web is untrusted input — treated as data, never as instructions to the LLM (prompt-injection defence, §8).
|
||||
|
||||
---
|
||||
|
||||
## 7. The epistemic honesty layer (the most important section)
|
||||
|
||||
ORION's single greatest risk is that its confident, well-written output gets mistaken for fact. The repo's founding principle (Book I) is *validate by observation, never by inspection* — and ORION, by design, observes nothing. So the design must make its own uncertainty impossible to ignore:
|
||||
|
||||
- **Everything ORION emits is a `Hypothesis`, never a `Finding`.** The vocabulary is enforced in the data model and the UI. A finding comes from the [assessment team guide](../assessment-templates/assessment-team-guide.md) fieldwork and lands in the [Findings Backlog](../assessment-templates/findings-backlog.md); a hypothesis comes from ORION and lands in L1 as something *to test*.
|
||||
- **Confidence and provenance on every claim.** No hypothesis without a stated confidence and the source(s) — the client answer or the intel item — that produced it.
|
||||
- **The "ghost-assessment" trap, named.** Just as a ghost CA policy displays correct config while enforcing nothing (Book I corollary), a client questionnaire can describe a control that has rotted into a ghost. ORION's hypotheses inherit the client's blind spots. The output must say so, loudly, and route every load-bearing claim to observation.
|
||||
- **The handoff is explicit.** ORION's deliverable is not "here is your kill chain." It is "here is the kill chain we *expect*, ranked by where to look first — now go and prove or disprove each link." That handoff into L1 and fieldwork is the product, not the hypothesis itself.
|
||||
|
||||
Get this section right and ORION strengthens the practice. Get it wrong and it becomes the most dangerous thing in the toolkit: a confident map of a territory no one checked.
|
||||
|
||||
---
|
||||
|
||||
## 8. LLM guardrails
|
||||
|
||||
- **Human-in-the-loop, always.** ORION proposes; the consultant disposes. No hypothesis auto-promotes to a finding, and ORION takes no action on anything.
|
||||
- **Prompt-injection defence.** Web/threat-intel content is wrapped and labelled as untrusted data; the system prompt instructs the model to treat fetched content as evidence to summarise, never as commands.
|
||||
- **Hallucination control.** Provenance is mandatory; a claim with no traceable source is flagged, not shown as fact. The consultant can click any hypothesis through to its sources.
|
||||
- **Quality floor.** Local models are weaker; the proposition should set an expectation that the default Ollama model is adequate for questionnaire generation and basic synthesis, with Azure OpenAI recommended where deeper reasoning materially helps — and the UI should make the active model and its limits visible.
|
||||
|
||||
---
|
||||
|
||||
## 9. How it fits the engagement
|
||||
|
||||
| Phase | ORION's role |
|
||||
|-------|--------------|
|
||||
| Pre-contract / sales | Stage-1 intake during the first conversation; instant sector threat-landscape briefing as a credibility opener |
|
||||
| [Brownhat Diagnostic](../assessment-templates/nist-csf-baseline.md) intake | Generate and distribute the tailored questionnaire; collect responses before the on-site half-days |
|
||||
| Fieldwork ([assessment team guide](../assessment-templates/assessment-team-guide.md)) | Hand the consultant a hypothesised graph and a "look here first" list; fieldwork validates by observation |
|
||||
| L1 mapping | Import ORION's `.json`; correct and confirm; compute the real shortest existential path |
|
||||
| Reporting | Diff intake-time hypotheses against confirmed findings — a powerful "what you told us vs what we found" narrative for the client |
|
||||
|
||||
---
|
||||
|
||||
## 10. Regulatory alignment (EU)
|
||||
|
||||
| Regulation | Requirement | ORION relevance |
|
||||
|------------|-------------|-----------------|
|
||||
| **NIS2** Art. 21 | Risk analysis, supply-chain and access governance | Structured intake produces documented evidence of risk-analysis scoping at engagement start |
|
||||
| **DORA** | ICT risk identification | The hypothesised kill chain is an ICT-risk-identification artefact (clearly marked as preliminary) |
|
||||
| **GDPR** Art. 5/32 | Data minimisation, appropriate measures, accountability | Local-LLM default, encryption, retention/deletion — minimal, sovereign handling of the only PII it holds |
|
||||
|
||||
---
|
||||
|
||||
## 11. Phased build (proposed MVP → product)
|
||||
|
||||
1. **Phase 1 — MVP.** Stage-1 intake, LLM questionnaire generation (Ollama), manual-assisted synthesis, L1 `.json` export. No threat intel yet. Proves the core loop.
|
||||
2. **Phase 2 — Threat intel.** KEV / advisory / curated-search connectors with provenance; exploit-availability enrichment of hypotheses.
|
||||
3. **Phase 3 — Adaptive + integrated.** Full branching questionnaire engine (questions-as-data), MCP server, AURORA integration, sector question packs.
|
||||
4. **Phase 4 — Productisation.** Hosted tier, multi-engagement console, RBAC, retention automation.
|
||||
|
||||
---
|
||||
|
||||
## 12. Provisional commercial framing
|
||||
|
||||
Positioned like AURORA — self-hosted and hosted tiers — though pricing is a placeholder pending the build decision:
|
||||
|
||||
| Tier | Self-hosted | Hosted (managed) |
|
||||
|------|-------------|------------------|
|
||||
| Per-consultant / small practice | TBD | TBD |
|
||||
| Practice / multi-seat | TBD | TBD |
|
||||
|
||||
Self-hosters bring their own LLM (Ollama / Azure OpenAI); hosted tier includes a managed model. Note the natural bundling: ORION (pre-engagement) → L1 Kill Chain Assessment (synthesis) → ASTRAL/PULSAR/AURORA (the operational layer once access exists).
|
||||
|
||||
---
|
||||
|
||||
## 13. What ORION is NOT
|
||||
|
||||
- **Not a scanner and not an agent.** It touches no client system, active-scans nothing, and runs nothing in the client environment.
|
||||
- **Not autonomous.** It proposes hypotheses for a consultant; it never acts and never self-promotes a hypothesis to a finding.
|
||||
- **Not a replacement for fieldwork or for L1.** It is the layer *before* them — it tells you where to look, it does not tell you what is true.
|
||||
- **Not a compliance questionnaire tool.** The questions target the kill chain, not a control checklist; CISO Assistant covers the GRC/framework job and ORION should integrate with it, not duplicate it.
|
||||
|
||||
---
|
||||
|
||||
## 14. Open questions for the build decision
|
||||
|
||||
1. **Backend choice** — FastAPI (lighter, our synthesis is bespoke) vs Django/DRF (matches CISO Assistant, more batteries). Leaning FastAPI.
|
||||
2. **Client-facing surface** — shared tokenised link (low friction) vs lightweight client login (more control). Leaning tokenised link with per-engagement expiry.
|
||||
3. **Where is the OSINT/active line drawn exactly?** Confirm ORION stays strictly passive and that any external scanning is deferred to the consented [Perimeter Scanning Capability](perimeter-scanning-capability.md).
|
||||
4. **CISO Assistant integration depth** — loose (export/import) vs deep (shared data model). Loose first.
|
||||
5. **Default Ollama model and the quality floor** — which local model is "good enough" for questionnaire generation, and where do we tell consultants to switch to Azure OpenAI.
|
||||
6. **Hypothesis accuracy expectations** — how do we measure and communicate that ORION's day-zero map is a starting hypothesis, and track how often it was right once fieldwork closed the loop?
|
||||
|
||||
---
|
||||
|
||||
*Companion to the [Kill Chain Assessment app](kill-chain-assessment-app.md) (L1), [Book VII — Vulnerability Management](../books/06-vulnerability-management.md), and the [Quantum Vulnerability Management](../core/quantum-vulnerability-management.md) framework. Positioned in the suite alongside [ASTRAL, PULSAR, and AURORA](cqre-product-suite.md).*
|
||||
@@ -17,6 +17,51 @@ The antifragile answer is a two-layer architecture: **network access** (Tailscal
|
||||
|
||||
---
|
||||
|
||||
## When overlay management networks help — and when they don't
|
||||
|
||||
**Enterprises with their own data centres** already have the physical substrate for a proper management network: dedicated VLANs, hardware segmentation, jump boxes. Adding an overlay management network introduces a new Tier 0 component (the coordinator) on top of infrastructure that already solves the problem. The complexity cost outweighs the benefit. Traditional management VLAN segmentation, done properly, is the right answer.
|
||||
|
||||
**SME clients with multi-cloud resources, containers, and DevOps workloads** have a different problem: there is no physical network to segment. Resources are scattered across Azure, AWS, a colo, and maybe on-prem. The management plane does not exist yet — you are building it. An overlay is how you build it, and it is the right answer for this context.
|
||||
|
||||
**The T0/T1 split** — applying the tier model to the overlay itself:
|
||||
|
||||
- **T0 systems** (domain controllers, ADCS, Entra Connect sync server — the identity control plane): use **Nebula**. No coordinator in the runtime path — once certificates are distributed, the overlay functions with zero external dependencies. The Nebula CA is the only Tier 0 component, and it can be kept offline. This means no coordinator to compromise, no external API call, no cloud service availability dependency for reaching your most critical systems.
|
||||
- **T1 systems** (member servers, cloud workloads, Kubernetes clusters, multi-cloud management): use **Tailscale** (or Headscale for sovereign requirements). Per-node ACLs, Entra OIDC integration, per-session MFA via key expiry and IdP enforcement. The coordinator trust concern is more acceptable at T1 — a compromised coordinator affects T1 access, not T0.
|
||||
|
||||
**The T0 node count is not scary.** For a 5,000-person organisation, the realistic T0 Nebula population is:
|
||||
|
||||
| Component | Count |
|
||||
|-----------|-------|
|
||||
| Domain Controllers | 4–8 |
|
||||
| Entra Connect / Cloud Sync server | 1–2 |
|
||||
| ADCS issuing CA | 1–2 |
|
||||
| AD FS servers (if not yet removed) | 0–4 |
|
||||
| Cloud admin VMs / PAWs | 5–10 |
|
||||
| **Total** | **~15–25 nodes** |
|
||||
|
||||
Certificate management for 15–25 nodes is a documented procedure, not an operational burden. The CA signing ceremony happens a few times a year when a PAW is replaced or an admin leaves. This is tractable.
|
||||
|
||||
---
|
||||
|
||||
## The PAW problem and the cloud admin VM
|
||||
|
||||
Physical PAWs are the right principle. They almost never get deployed. Hardware procurement, second device on the desk, behaviour change — the project dies before it starts.
|
||||
|
||||
The **cloud-hosted admin workstation** preserves the essential security properties without the hardware problem:
|
||||
|
||||
- A Windows 365 or Azure Virtual Desktop VM provisioned from a hardened template
|
||||
- Used only for privileged tasks (no email, no general browsing)
|
||||
- Connected to the Nebula T0 overlay (for DC access) and Tailscale T1 overlay (for server/cloud access)
|
||||
- Accessed by the admin from their normal device via browser or RDP client
|
||||
- Privileged credentials live in the cloud VM, not on the admin's local device
|
||||
- Compromise response: wipe the VM, reprovision from template in 20 minutes
|
||||
|
||||
The security property that matters — privileged credentials do not touch the device used for email and browsing — is preserved. An attacker who compromises the admin's local device gets a browser session to a cloud VM that requires phishing-resistant MFA to reach. They do not get cached credentials, session tokens, or WireGuard keys for the management overlay.
|
||||
|
||||
**When to use a physical PAW instead:** clients with a strong security culture and genuine appetite for the operational overhead, OT/ICS environments where the management workstation may need to be air-gapped, or engagements where the threat model includes a sophisticated attacker who would attempt to compromise the RDP session interactively.
|
||||
|
||||
---
|
||||
|
||||
## The Two Layers
|
||||
|
||||
### Layer 1: Network Access — Tailscale / Headscale + WireGuard
|
||||
@@ -130,6 +175,30 @@ This catches more clients than it appears. A manufacturing company with 800 empl
|
||||
|
||||
---
|
||||
|
||||
### Nebula — T0 Management Overlay
|
||||
|
||||
| Attribute | Detail |
|
||||
|-----------|--------|
|
||||
| **What it does** | WireGuard-based overlay mesh with no coordinator in the runtime path. Nodes authenticate via pre-distributed certificates signed by a local CA. Lighthouse nodes handle NAT traversal only — they are not in the authentication path. |
|
||||
| **Why it is right for T0** | No external runtime dependency. A compromised or unavailable coordinator cannot affect T0 access. The CA (the actual trust anchor) can be kept offline and brought up only for certificate issuance. |
|
||||
| **Trade-off vs Tailscale** | No dynamic node management (adding/removing a node requires a CA operation and cert redistribution); no cloud-managed control plane; higher initial setup complexity; certificate revocation requires distributing an updated blocklist |
|
||||
| **Why the trade-off is acceptable for T0** | T0 node population is small (15–25 nodes) and stable. Revocation events (lost PAW, departing admin) are rare and known immediately. The operational overhead is a documented ceremony run a few times a year, not a recurring burden. |
|
||||
| **Antifragile pillar** | Structural Decoupling, Sovereign Intelligence |
|
||||
| **When to deploy** | T0 systems (DCs, sync server, ADCS) in any estate; air-gapped or restricted environments; clients where the management plane must have zero external runtime dependencies |
|
||||
|
||||
**Nebula CA management — the one non-trivial operation:**
|
||||
|
||||
The Nebula CA private key is the trust anchor for the entire T0 overlay. It must be treated accordingly:
|
||||
- Air-gapped machine (a dedicated laptop that is never networked, or a hardware security module)
|
||||
- Documented signing ceremony: who is authorised to sign a new certificate, what approval is required, what the procedure is
|
||||
- Named individuals (minimum two) who know the procedure and can perform it
|
||||
- CA key backup: encrypted, stored separately from the signing machine, tested
|
||||
- Short certificate lifetimes (90–180 days) so revocation is handled implicitly by non-renewal as much as by explicit blocklist distribution
|
||||
|
||||
This is the same discipline as an offline root CA — because that is functionally what it is.
|
||||
|
||||
---
|
||||
|
||||
### Smallstep — Certificate-Based SSH Access
|
||||
|
||||
| Attribute | Detail |
|
||||
@@ -145,20 +214,34 @@ This catches more clients than it appears. A manufacturing company with 800 empl
|
||||
## The Decision Framework
|
||||
|
||||
```
|
||||
Does the client have legacy VPN sprawl or flat-network vendor access?
|
||||
├── YES → Deploy Layer 1 (network access) first
|
||||
│ ├── Wants managed service + commercial support → Tailscale (partnership)
|
||||
Does the client have their own data centre with physical network infrastructure?
|
||||
├── YES → Traditional management VLAN segmentation + jump box
|
||||
│ Overlay adds complexity without proportional benefit here
|
||||
└── NO / Multi-cloud / Scattered resources → Overlay is the right management plane
|
||||
|
||||
Does the client need a T0 management overlay (DC, ADCS, sync server access)?
|
||||
├── YES → Nebula (no external runtime dependency, CA offline)
|
||||
│ └── Admin workstation: cloud admin VM (W365/AVD) or physical PAW, enrolled in Nebula
|
||||
│
|
||||
Does the client need a T1 overlay (servers, cloud workloads, K8s, DevOps)?
|
||||
├── YES → Layer 1 (network access)
|
||||
│ ├── Wants managed service + commercial support → Tailscale + Entra OIDC + key expiry MFA
|
||||
│ └── Wants full sovereignty / data residency → Headscale + WireGuard
|
||||
│
|
||||
Does the client need protocol-aware session recording / JIT / DB access?
|
||||
├── YES → Add Layer 2 (PAM)
|
||||
│ ├── < 100 employees AND < $10M revenue → Teleport CE (free, self-hosted)
|
||||
│ ├── Larger org / needs support → Teleport Enterprise (commercial)
|
||||
│ └── SSH-only, budget-constrained → Smallstep (certificates only)
|
||||
│ ├── Larger org / needs support → Teleport Enterprise (commercial, verify current pricing)
|
||||
│ └── SSH-only, budget-constrained → Smallstep (certificates only, no session recording)
|
||||
│
|
||||
Does the client need both layers?
|
||||
├── MOST CLIENTS → Tailscale (network) + Teleport CE/Enterprise (PAM)
|
||||
└── OT/CRITICAL INFRA → Headscale (sovereign network) + Teleport (recorded vendor access)
|
||||
Typical SME multi-cloud client:
|
||||
├── T0: Nebula + cloud admin VMs
|
||||
├── T1: Tailscale + Entra OIDC
|
||||
└── Session recording: Teleport CE if eligible, otherwise accept the gap and compensate with
|
||||
cloud VM audit logging and Tailscale connection logs
|
||||
|
||||
OT / Critical infrastructure:
|
||||
└── Headscale (sovereign T1) + Nebula (T0 where applicable) + Teleport (vendor session recording)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
@@ -0,0 +1,15 @@
|
||||
# Tools
|
||||
|
||||
Standalone, runnable instruments that support the engagement — as distinct from the markdown frameworks and playbooks elsewhere in the repository.
|
||||
|
||||
| Tool | What it does | How to run |
|
||||
|------|--------------|------------|
|
||||
| [`kill-chain-assessment.html`](kill-chain-assessment.html) | Maps an unknown estate into an attack graph, computes the shortest existential path (the kill chain), and sizes every node into a remediation quantum. The synthesis instrument for the first act of every engagement. | Open in any browser. Offline, no install, no network. State persists locally; exports to `.json` and `.md`. |
|
||||
|
||||
## Design constraints for tools in this directory
|
||||
|
||||
- **Offline and sovereign.** Client attack-surface data must never leave the consultant's machine for a vendor cloud (Antifragile Manifest, Pillar 4). Tools here are single-file and dependency-free wherever possible.
|
||||
- **Exportable.** Output drops into the engagement deliverables — the [diagnostic report](../assessment-templates/nist-csf-baseline.md) and the [Findings Backlog](../assessment-templates/findings-backlog.md) — not into a proprietary format.
|
||||
- **Explicit, not magic.** A tool makes the consultant's judgement repeatable; it does not replace it.
|
||||
|
||||
See the [Kill Chain Assessment App spec](../playbooks/kill-chain-assessment-app.md) for the model behind the first tool.
|
||||
@@ -0,0 +1,642 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||
<title>Kill Chain Assessment — Brownhat / CQRE</title>
|
||||
<style>
|
||||
:root{
|
||||
--bg:#0d1117; --panel:#161b22; --panel2:#1c2330; --line:#30363d; --line2:#3d4654;
|
||||
--ink:#e6edf3; --muted:#9aa6b2; --faint:#6e7781;
|
||||
--p0:#ff4d4f; --p1:#ff9f0a; --p2:#3fb950; --dark:#a371f7; --entry:#58a6ff; --jewel:#f7c948;
|
||||
--accent:#58a6ff; --accent2:#1f6feb;
|
||||
--crit:#ff4d4f; --sev:#ff9f0a; --std:#3fb950; --darkq:#a371f7; --house:#6e7781;
|
||||
}
|
||||
*{box-sizing:border-box}
|
||||
body{margin:0;background:var(--bg);color:var(--ink);font:14px/1.5 -apple-system,BlinkMacSystemFont,"Segoe UI",Roboto,Helvetica,Arial,sans-serif}
|
||||
header{padding:16px 22px;border-bottom:1px solid var(--line);display:flex;align-items:center;gap:16px;flex-wrap:wrap;background:linear-gradient(180deg,#11161d,#0d1117)}
|
||||
header h1{font-size:18px;margin:0;letter-spacing:.3px}
|
||||
header .tag{font-size:11px;color:var(--faint);border:1px solid var(--line);padding:2px 8px;border-radius:20px}
|
||||
header .sub{color:var(--muted);font-size:12.5px;margin-left:auto;max-width:520px;text-align:right}
|
||||
.wrap{display:grid;grid-template-columns:340px 1fr 360px;gap:0;height:calc(100vh - 59px)}
|
||||
.col{overflow-y:auto;padding:16px}
|
||||
.col.left{border-right:1px solid var(--line)}
|
||||
.col.right{border-left:1px solid var(--line);background:#0b0f14}
|
||||
h2{font-size:12px;text-transform:uppercase;letter-spacing:1px;color:var(--muted);margin:4px 0 10px;font-weight:600}
|
||||
h2 .hint{text-transform:none;letter-spacing:0;font-weight:400;color:var(--faint);display:block;font-size:11.5px;margin-top:3px}
|
||||
.panel{background:var(--panel);border:1px solid var(--line);border-radius:10px;padding:13px;margin-bottom:14px}
|
||||
label{display:block;font-size:11.5px;color:var(--muted);margin:9px 0 3px}
|
||||
input,select,textarea,button{font:inherit;color:var(--ink)}
|
||||
input[type=text],select,textarea{width:100%;background:var(--panel2);border:1px solid var(--line2);border-radius:7px;padding:7px 9px}
|
||||
input[type=text]:focus,select:focus,textarea:focus{outline:none;border-color:var(--accent)}
|
||||
textarea{resize:vertical;min-height:34px}
|
||||
.row{display:flex;gap:8px}
|
||||
.row>*{flex:1}
|
||||
.chk{display:flex;align-items:center;gap:7px;margin:8px 0;font-size:12.5px;color:var(--ink)}
|
||||
.chk input{width:auto}
|
||||
button{cursor:pointer;background:var(--panel2);border:1px solid var(--line2);border-radius:7px;padding:8px 12px;transition:.12s}
|
||||
button:hover{border-color:var(--accent);color:#fff}
|
||||
button.primary{background:var(--accent2);border-color:var(--accent2);color:#fff;font-weight:600}
|
||||
button.primary:hover{background:#388bfd}
|
||||
button.ghost{background:transparent}
|
||||
button.danger:hover{border-color:var(--p0);color:var(--p0)}
|
||||
.btnrow{display:flex;gap:8px;flex-wrap:wrap;margin-top:10px}
|
||||
.btnrow button{flex:1;min-width:0}
|
||||
.pill{display:inline-block;font-size:10px;font-weight:700;letter-spacing:.5px;padding:2px 7px;border-radius:20px;text-transform:uppercase}
|
||||
.pill.entry{background:rgba(88,166,255,.16);color:var(--entry);border:1px solid var(--entry)}
|
||||
.pill.jewel{background:rgba(247,201,72,.14);color:var(--jewel);border:1px solid var(--jewel)}
|
||||
.node-item{background:var(--panel2);border:1px solid var(--line);border-radius:8px;padding:9px 10px;margin-bottom:7px;cursor:pointer}
|
||||
.node-item:hover{border-color:var(--accent)}
|
||||
.node-item.sel{border-color:var(--accent);box-shadow:0 0 0 1px var(--accent) inset}
|
||||
.node-item .nm{font-weight:600;display:flex;justify-content:space-between;align-items:center;gap:6px}
|
||||
.node-item .meta{font-size:11px;color:var(--faint);margin-top:3px;display:flex;gap:6px;flex-wrap:wrap}
|
||||
.edge-item{font-size:12px;background:var(--panel2);border:1px solid var(--line);border-radius:7px;padding:7px 9px;margin-bottom:6px;display:flex;justify-content:space-between;gap:8px;align-items:flex-start}
|
||||
.edge-item .x{cursor:pointer;color:var(--faint);flex-shrink:0}
|
||||
.edge-item .x:hover{color:var(--p0)}
|
||||
.tabs{display:flex;gap:4px;margin-bottom:12px;border-bottom:1px solid var(--line)}
|
||||
.tabs button{border:none;border-bottom:2px solid transparent;border-radius:0;background:none;color:var(--muted);padding:8px 12px}
|
||||
.tabs button.on{color:#fff;border-bottom-color:var(--accent)}
|
||||
svg{width:100%;display:block}
|
||||
.empty{color:var(--faint);font-size:12.5px;text-align:center;padding:30px 10px;border:1px dashed var(--line2);border-radius:10px}
|
||||
.kc-box{background:var(--panel);border:1px solid var(--line);border-radius:10px;padding:14px;margin-bottom:14px}
|
||||
.kc-step{display:flex;align-items:center;gap:10px;padding:7px 0}
|
||||
.kc-arrow{color:var(--p0);font-size:18px;text-align:center;margin:-2px 0}
|
||||
.kc-node{flex:1;background:var(--panel2);border:1px solid var(--line2);border-left:3px solid var(--p0);border-radius:6px;padding:7px 10px}
|
||||
.kc-node .n{font-weight:600;font-size:13px}
|
||||
.kc-node .m{font-size:11px;color:var(--muted)}
|
||||
.kc-mech{font-size:11px;color:var(--faint);font-style:italic;padding-left:14px}
|
||||
.stat{display:flex;justify-content:space-between;padding:5px 0;border-bottom:1px solid var(--line);font-size:13px}
|
||||
.stat:last-child{border:none}
|
||||
.stat b{font-variant-numeric:tabular-nums}
|
||||
.q{border-radius:8px;border:1px solid var(--line);padding:10px 12px;margin-bottom:9px;background:var(--panel)}
|
||||
.q .qh{display:flex;justify-content:space-between;align-items:center;font-weight:700;font-size:12px;letter-spacing:.5px;text-transform:uppercase}
|
||||
.q.crit{border-left:4px solid var(--crit)} .q.crit .qh{color:var(--crit)}
|
||||
.q.sev{border-left:4px solid var(--sev)} .q.sev .qh{color:var(--sev)}
|
||||
.q.std{border-left:4px solid var(--std)} .q.std .qh{color:var(--std)}
|
||||
.q.darkq{border-left:4px solid var(--darkq)} .q.darkq .qh{color:var(--darkq)}
|
||||
.q .ql{font-size:12.5px;margin-top:7px}
|
||||
.q .qi{padding:4px 0;border-top:1px solid var(--line);margin-top:5px}
|
||||
.q .qi:first-of-type{border:none}
|
||||
.q .qi .qn{font-weight:600}
|
||||
.q .qi .qd{font-size:11px;color:var(--muted)}
|
||||
.q .budget{font-size:10.5px;color:var(--faint);font-weight:400;text-transform:none;letter-spacing:0}
|
||||
.discovery h3{font-size:12.5px;margin:12px 0 5px;color:var(--accent)}
|
||||
.discovery ul{margin:0 0 6px;padding-left:18px;color:var(--muted);font-size:12px}
|
||||
.discovery li{margin-bottom:3px}
|
||||
.discovery code{background:var(--panel2);border:1px solid var(--line);border-radius:4px;padding:1px 5px;color:#e6edf3;font-size:11px}
|
||||
.note{font-size:11.5px;color:var(--faint);margin-top:6px}
|
||||
.legend{display:flex;gap:12px;flex-wrap:wrap;font-size:11px;color:var(--muted);margin-bottom:8px}
|
||||
.legend span{display:flex;align-items:center;gap:5px}
|
||||
.dot{width:10px;height:10px;border-radius:50%}
|
||||
.topbtns{display:flex;gap:8px}
|
||||
.file-in{display:none}
|
||||
::-webkit-scrollbar{width:10px;height:10px}
|
||||
::-webkit-scrollbar-thumb{background:#222b36;border-radius:6px}
|
||||
::-webkit-scrollbar-track{background:transparent}
|
||||
.muted{color:var(--muted)} .small{font-size:11.5px}
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<header>
|
||||
<h1>⛓ Kill Chain Assessment</h1>
|
||||
<span class="tag">Brownhat · CQRE</span>
|
||||
<div class="topbtns">
|
||||
<button class="ghost" onclick="loadSample()">Load sample</button>
|
||||
<button class="ghost" onclick="exportJSON()">Save .json</button>
|
||||
<button class="ghost" onclick="document.getElementById('imp').click()">Open .json</button>
|
||||
<button class="primary" onclick="exportMD()">Export report .md</button>
|
||||
<input type="file" id="imp" class="file-in" accept=".json" onchange="importJSON(event)">
|
||||
</div>
|
||||
<div class="sub">Map unknown territory into nodes and attacker moves. The tool finds the shortest path from a foothold to an existential asset — that path <b>is</b> the kill chain — and sizes each node into a remediation quantum.</div>
|
||||
</header>
|
||||
|
||||
<div class="wrap">
|
||||
<!-- LEFT: capture -->
|
||||
<div class="col left">
|
||||
<div class="tabs">
|
||||
<button id="t-node" class="on" onclick="tab('node')">Nodes</button>
|
||||
<button id="t-edge" onclick="tab('edge')">Moves</button>
|
||||
<button id="t-disc" onclick="tab('disc')">Discovery</button>
|
||||
</div>
|
||||
|
||||
<!-- NODE form -->
|
||||
<div id="pane-node">
|
||||
<div class="panel">
|
||||
<h2>Add / edit node<span class="hint">An asset, foothold, identity, or system in the estate.</span></h2>
|
||||
<label>Name</label>
|
||||
<input type="text" id="n-name" placeholder="e.g. Entra ID Connect sync server">
|
||||
<div class="row">
|
||||
<div>
|
||||
<label>Layer</label>
|
||||
<select id="n-type">
|
||||
<option value="entry">Entry / exposure</option>
|
||||
<option value="identity">Identity</option>
|
||||
<option value="privilege">Privilege</option>
|
||||
<option value="device">Device / endpoint</option>
|
||||
<option value="data">Data / collaboration</option>
|
||||
<option value="infra">Infrastructure / OT</option>
|
||||
<option value="recovery">Recovery / backup</option>
|
||||
</select>
|
||||
</div>
|
||||
<div>
|
||||
<label>Tier</label>
|
||||
<select id="n-tier">
|
||||
<option value="">— unknown —</option>
|
||||
<option value="T0">T0 (control plane)</option>
|
||||
<option value="T1">T1 (servers/apps)</option>
|
||||
<option value="T2">T2 (workstations)</option>
|
||||
</select>
|
||||
</div>
|
||||
</div>
|
||||
<div class="chk"><input type="checkbox" id="n-entry"><label style="margin:0;color:var(--entry)">Adversary entry point (internet-reachable / unauth foothold)</label></div>
|
||||
<div class="chk"><input type="checkbox" id="n-jewel"><label style="margin:0;color:var(--jewel)">Crown jewel (existential — org cannot operate if lost)</label></div>
|
||||
<div class="row">
|
||||
<div>
|
||||
<label>Reachable by adversary?</label>
|
||||
<select id="n-reach"><option value="unknown">Unknown</option><option value="yes">Yes</option><option value="no">No</option></select>
|
||||
</div>
|
||||
<div>
|
||||
<label>Exploit / path available?</label>
|
||||
<select id="n-expl"><option value="unknown">Unknown</option><option value="yes">Yes</option><option value="no">No</option></select>
|
||||
</div>
|
||||
</div>
|
||||
<div class="chk"><input type="checkbox" id="n-comp"><label style="margin:0">Compensating control already in front of it (EDR, WAF, segmentation)</label></div>
|
||||
<label>Finding / note (optional)</label>
|
||||
<textarea id="n-note" placeholder="What's wrong here, evidence, CVE…"></textarea>
|
||||
<div class="btnrow">
|
||||
<button class="primary" onclick="saveNode()">Save node</button>
|
||||
<button class="ghost" onclick="clearNodeForm()">Clear</button>
|
||||
</div>
|
||||
</div>
|
||||
<h2>Nodes <span id="n-count" class="muted small"></span></h2>
|
||||
<div id="node-list"></div>
|
||||
</div>
|
||||
|
||||
<!-- EDGE form -->
|
||||
<div id="pane-edge" style="display:none">
|
||||
<div class="panel">
|
||||
<h2>Add attacker move<span class="hint">A directed step: "from here, an attacker can reach there."</span></h2>
|
||||
<label>From</label>
|
||||
<select id="e-from"></select>
|
||||
<label>To</label>
|
||||
<select id="e-to"></select>
|
||||
<label>Mechanism (how)</label>
|
||||
<input type="text" id="e-mech" placeholder="e.g. DCSync via sync-account rights">
|
||||
<label>Adversary effort: <span id="e-wlabel">3 — moderate</span></label>
|
||||
<input type="range" id="e-weight" min="1" max="5" value="3" style="width:100%" oninput="document.getElementById('e-wlabel').textContent=effortLabel(this.value)">
|
||||
<div class="note">Lower effort = easier for the attacker. The kill chain is the <i>lowest-effort</i> path to a crown jewel.</div>
|
||||
<div class="btnrow"><button class="primary" onclick="saveEdge()">Add move</button></div>
|
||||
</div>
|
||||
<h2>Moves <span id="e-count" class="muted small"></span></h2>
|
||||
<div id="edge-list"></div>
|
||||
</div>
|
||||
|
||||
<!-- DISCOVERY -->
|
||||
<div id="pane-disc" style="display:none">
|
||||
<div class="panel discovery">
|
||||
<h2>Discovering the chain in unknown territory<span class="hint">What to ask and run to surface the edges you can't see yet. Each answer becomes a node or a move.</span></h2>
|
||||
|
||||
<h3>1 · Find the entry points (reachability)</h3>
|
||||
<ul>
|
||||
<li>What does the internet see? External scan / Shodan / attack-surface mapping → every internet-facing service is a candidate entry node.</li>
|
||||
<li>Internet-facing VPN, RDP, mail, web apps, appliances — firmware current? MFA enforced?</li>
|
||||
<li>Legacy auth still enabled? (bypasses MFA — a silent entry edge)</li>
|
||||
</ul>
|
||||
|
||||
<h3>2 · Find the identity bridges (Book II)</h3>
|
||||
<ul>
|
||||
<li><code>Entra Connect sync account</code> — does it hold DCSync rights on-prem? That's a cloud→on-prem edge.</li>
|
||||
<li>Federation / PTA / PHS path, writeback, seamless SSO — map the bridge.</li>
|
||||
</ul>
|
||||
|
||||
<h3>3 · Find privilege paths (Book III)</h3>
|
||||
<ul>
|
||||
<li>BloodHound: <code>shortestPath</code> to Domain Admins from non-admins — every path is a chain of edges.</li>
|
||||
<li>Kerberoastable / AS-REP-roastable high-priv accounts; KRBTGT last-set date.</li>
|
||||
<li>App registrations with <code>RoleManagement.ReadWrite.Directory</code>, <code>Mail.ReadWrite</code> — OAuth consent edges.</li>
|
||||
</ul>
|
||||
|
||||
<h3>4 · Find the crown jewels (existential nodes)</h3>
|
||||
<ul>
|
||||
<li>Ask the business, not IT: "what stops the company operating?" ERP, payment rails, OT control, the customer DB.</li>
|
||||
<li>Backups & recovery — are they reachable from the estate they protect? If yes, that's an edge into your lifeboat.</li>
|
||||
</ul>
|
||||
|
||||
<h3>5 · Map blast radius (the edges between)</h3>
|
||||
<ul>
|
||||
<li>Flat network? NTLM relay, lateral movement → dense edges, short chains.</li>
|
||||
<li>Segmentation, least privilege, T0 isolation → sparse edges, long chains. Note where they're <i>missing</i>.</li>
|
||||
</ul>
|
||||
|
||||
<p class="note">Anything you can't characterise (reachable? unknown) becomes a <span style="color:var(--darkq)">dark quantum</span> — capture the node anyway and mark reachability/exploit "unknown". An uncharacterised asset is the dangerous kind.</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- CENTER: graph + chain -->
|
||||
<div class="col center">
|
||||
<h2>Attack graph & kill chain</h2>
|
||||
<div class="legend">
|
||||
<span><span class="dot" style="background:var(--entry)"></span>entry</span>
|
||||
<span><span class="dot" style="background:var(--jewel)"></span>crown jewel</span>
|
||||
<span><span class="dot" style="background:var(--p0)"></span>on shortest chain (P0)</span>
|
||||
<span><span class="dot" style="background:var(--p1)"></span>on a chain (P1)</span>
|
||||
<span><span class="dot" style="background:var(--p2)"></span>off-chain (P2)</span>
|
||||
</div>
|
||||
<div class="panel" style="padding:6px"><div id="graph"></div></div>
|
||||
<div id="chain-out"></div>
|
||||
</div>
|
||||
|
||||
<!-- RIGHT: results -->
|
||||
<div class="col right">
|
||||
<h2>Assessment</h2>
|
||||
<div class="panel" id="summary"></div>
|
||||
<h2>Remediation quanta<span class="hint">Sized by time-to-existential-impact, not CVSS.</span></h2>
|
||||
<div id="quanta"></div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script>
|
||||
/* ---------------- state ---------------- */
|
||||
let nodes = []; // {id,name,type,tier,entry,jewel,reach,expl,comp,note}
|
||||
let edges = []; // {id,from,to,mech,w}
|
||||
let editingId = null;
|
||||
let uid = () => 'n'+Math.random().toString(36).slice(2,8);
|
||||
|
||||
const STORE='brownhat-killchain-v1';
|
||||
function persist(){ try{localStorage.setItem(STORE,JSON.stringify({nodes,edges}));}catch(e){} }
|
||||
function restore(){ try{const s=JSON.parse(localStorage.getItem(STORE));if(s&&s.nodes){nodes=s.nodes;edges=s.edges||[];}}catch(e){} }
|
||||
|
||||
function effortLabel(v){return {1:'1 — trivial',2:'2 — easy',3:'3 — moderate',4:'4 — hard',5:'5 — very hard'}[v];}
|
||||
|
||||
/* ---------------- tabs ---------------- */
|
||||
function tab(t){
|
||||
['node','edge','disc'].forEach(x=>{
|
||||
document.getElementById('pane-'+x).style.display = x===t?'block':'none';
|
||||
document.getElementById('t-'+x).classList.toggle('on',x===t);
|
||||
});
|
||||
if(t==='edge') refreshEdgeSelects();
|
||||
}
|
||||
|
||||
/* ---------------- node CRUD ---------------- */
|
||||
function saveNode(){
|
||||
const name=document.getElementById('n-name').value.trim();
|
||||
if(!name){alert('Name the node first.');return;}
|
||||
const data={
|
||||
name,
|
||||
type:document.getElementById('n-type').value,
|
||||
tier:document.getElementById('n-tier').value,
|
||||
entry:document.getElementById('n-entry').checked,
|
||||
jewel:document.getElementById('n-jewel').checked,
|
||||
reach:document.getElementById('n-reach').value,
|
||||
expl:document.getElementById('n-expl').value,
|
||||
comp:document.getElementById('n-comp').checked,
|
||||
note:document.getElementById('n-note').value.trim()
|
||||
};
|
||||
if(editingId){ Object.assign(nodes.find(n=>n.id===editingId),data); }
|
||||
else { nodes.push(Object.assign({id:uid()},data)); }
|
||||
clearNodeForm(); render();
|
||||
}
|
||||
function editNode(id){
|
||||
const n=nodes.find(x=>x.id===id); if(!n)return;
|
||||
editingId=id;
|
||||
document.getElementById('n-name').value=n.name;
|
||||
document.getElementById('n-type').value=n.type;
|
||||
document.getElementById('n-tier').value=n.tier||'';
|
||||
document.getElementById('n-entry').checked=n.entry;
|
||||
document.getElementById('n-jewel').checked=n.jewel;
|
||||
document.getElementById('n-reach').value=n.reach;
|
||||
document.getElementById('n-expl').value=n.expl;
|
||||
document.getElementById('n-comp').checked=n.comp;
|
||||
document.getElementById('n-note').value=n.note||'';
|
||||
tab('node'); window.scrollTo(0,0);
|
||||
}
|
||||
function delNode(id){
|
||||
if(!confirm('Delete this node and its moves?'))return;
|
||||
nodes=nodes.filter(n=>n.id!==id);
|
||||
edges=edges.filter(e=>e.from!==id&&e.to!==id);
|
||||
if(editingId===id)clearNodeForm();
|
||||
render();
|
||||
}
|
||||
function clearNodeForm(){
|
||||
editingId=null;
|
||||
['n-name','n-note'].forEach(i=>document.getElementById(i).value='');
|
||||
document.getElementById('n-type').value='entry';
|
||||
document.getElementById('n-tier').value='';
|
||||
['n-entry','n-jewel','n-comp'].forEach(i=>document.getElementById(i).checked=false);
|
||||
document.getElementById('n-reach').value='unknown';
|
||||
document.getElementById('n-expl').value='unknown';
|
||||
}
|
||||
|
||||
/* ---------------- edge CRUD ---------------- */
|
||||
function refreshEdgeSelects(){
|
||||
const opts=nodes.map(n=>`<option value="${n.id}">${esc(n.name)}</option>`).join('');
|
||||
document.getElementById('e-from').innerHTML=opts;
|
||||
document.getElementById('e-to').innerHTML=opts;
|
||||
}
|
||||
function saveEdge(){
|
||||
const from=document.getElementById('e-from').value, to=document.getElementById('e-to').value;
|
||||
if(!from||!to){alert('Add at least two nodes first.');return;}
|
||||
if(from===to){alert('A move must go between two different nodes.');return;}
|
||||
edges.push({id:uid(),from,to,mech:document.getElementById('e-mech').value.trim(),w:+document.getElementById('e-weight').value});
|
||||
document.getElementById('e-mech').value='';
|
||||
render();
|
||||
}
|
||||
function delEdge(id){ edges=edges.filter(e=>e.id!==id); render(); }
|
||||
|
||||
/* ---------------- analysis: Dijkstra shortest existential path ---------------- */
|
||||
function analyse(){
|
||||
const entryIds=nodes.filter(n=>n.entry).map(n=>n.id);
|
||||
const jewelIds=new Set(nodes.filter(n=>n.jewel).map(n=>n.id));
|
||||
const adj={}; nodes.forEach(n=>adj[n.id]=[]);
|
||||
edges.forEach(e=>{ if(adj[e.from]) adj[e.from].push(e); });
|
||||
|
||||
// multi-source Dijkstra from all entry points
|
||||
const dist={}, prev={}, prevEdge={};
|
||||
nodes.forEach(n=>dist[n.id]=Infinity);
|
||||
const pq=[];
|
||||
entryIds.forEach(id=>{dist[id]=0; pq.push([0,id]);});
|
||||
while(pq.length){
|
||||
pq.sort((a,b)=>a[0]-b[0]);
|
||||
const [d,u]=pq.shift();
|
||||
if(d>dist[u])continue;
|
||||
(adj[u]||[]).forEach(e=>{
|
||||
const nd=d+e.w;
|
||||
if(nd<dist[e.to]){dist[e.to]=nd;prev[e.to]=u;prevEdge[e.to]=e;pq.push([nd,e.to]);}
|
||||
});
|
||||
}
|
||||
// best jewel = reachable jewel with min dist
|
||||
let best=null;
|
||||
jewelIds.forEach(j=>{ if(dist[j]<Infinity && (!best||dist[j]<dist[best])) best=j; });
|
||||
// reconstruct shortest chain
|
||||
let chain=[],chainEdges=[];
|
||||
if(best!=null){
|
||||
let cur=best;
|
||||
while(cur!=null){ chain.unshift(cur); if(prevEdge[cur]){chainEdges.unshift(prevEdge[cur]);cur=prev[cur];} else cur=null; }
|
||||
}
|
||||
const onShortest=new Set(chain);
|
||||
|
||||
// nodes on ANY existential path: reachable from entry AND can reach a jewel
|
||||
const reachFromEntry=new Set();
|
||||
(function(){const st=[...entryIds];entryIds.forEach(i=>reachFromEntry.add(i));
|
||||
while(st.length){const u=st.pop();(adj[u]||[]).forEach(e=>{if(!reachFromEntry.has(e.to)){reachFromEntry.add(e.to);st.push(e.to);}});}})();
|
||||
// reverse reachability to a jewel
|
||||
const radj={}; nodes.forEach(n=>radj[n.id]=[]); edges.forEach(e=>{if(radj[e.to])radj[e.to].push(e.from);});
|
||||
const canReachJewel=new Set();
|
||||
(function(){const st=[...jewelIds];jewelIds.forEach(i=>canReachJewel.add(i));
|
||||
while(st.length){const u=st.pop();(radj[u]||[]).forEach(f=>{if(!canReachJewel.has(f)){canReachJewel.add(f);st.push(f);}});}})();
|
||||
const onAnyChain=new Set(nodes.filter(n=>reachFromEntry.has(n.id)&&canReachJewel.has(n.id)).map(n=>n.id));
|
||||
|
||||
return {chain,chainEdges,onShortest,onAnyChain,dist,best,entryIds,jewelIds,reachable:reachFromEntry};
|
||||
}
|
||||
|
||||
/* priority + quantum per node */
|
||||
function priority(n,a){
|
||||
if(a.onShortest.has(n.id))return 'P0';
|
||||
if(a.onAnyChain.has(n.id))return 'P1';
|
||||
return 'P2';
|
||||
}
|
||||
function quantum(n,a){
|
||||
const onChain = a.onShortest.has(n.id)||a.onAnyChain.has(n.id);
|
||||
if(!onChain) return 'house';
|
||||
if(n.reach==='unknown'||n.expl==='unknown') return 'dark';
|
||||
if(a.onShortest.has(n.id) && n.reach==='yes' && n.expl==='yes' && !n.comp) return 'crit';
|
||||
if(n.reach==='yes' || n.expl==='yes') return 'sev';
|
||||
return 'std';
|
||||
}
|
||||
const QMETA={
|
||||
crit:{label:'Critical quantum',budget:'hours · compensating control, not the patch',cls:'crit'},
|
||||
sev:{label:'Severe quantum',budget:'days · batched into one change window',cls:'sev'},
|
||||
std:{label:'Standard quantum',budget:'sprint · drained in finishable batches',cls:'std'},
|
||||
dark:{label:'Dark quantum',budget:'unsized · route to discovery',cls:'darkq'},
|
||||
house:{label:'Housekeeping',budget:'off every kill chain — not urgent',cls:'std'}
|
||||
};
|
||||
|
||||
/* ---------------- render ---------------- */
|
||||
function esc(s){return (s||'').replace(/[&<>"]/g,c=>({'&':'&','<':'<','>':'>','"':'"'}[c]));}
|
||||
const TYPELBL={entry:'Entry',identity:'Identity',privilege:'Privilege',device:'Device',data:'Data',infra:'Infra/OT',recovery:'Recovery'};
|
||||
|
||||
function render(){
|
||||
persist();
|
||||
renderNodeList(); renderEdgeList(); refreshEdgeSelects();
|
||||
const a = analyse();
|
||||
renderGraph(a); renderChain(a); renderSummary(a); renderQuanta(a);
|
||||
}
|
||||
|
||||
function renderNodeList(){
|
||||
document.getElementById('n-count').textContent = nodes.length?`(${nodes.length})`:'';
|
||||
const el=document.getElementById('node-list');
|
||||
if(!nodes.length){el.innerHTML='<div class="empty">No nodes yet. Add the footholds and assets you find — or “Load sample”.</div>';return;}
|
||||
const a=analyse();
|
||||
el.innerHTML=nodes.map(n=>{
|
||||
const p=priority(n,a);
|
||||
const pc=p==='P0'?'var(--p0)':p==='P1'?'var(--p1)':'var(--p2)';
|
||||
return `<div class="node-item ${editingId===n.id?'sel':''}" onclick="editNode('${n.id}')">
|
||||
<div class="nm"><span>${esc(n.name)}</span>
|
||||
<span style="display:flex;gap:5px;align-items:center">
|
||||
${n.entry?'<span class="pill entry">entry</span>':''}
|
||||
${n.jewel?'<span class="pill jewel">jewel</span>':''}
|
||||
<span style="color:${pc};font-weight:700;font-size:11px">${(a.onShortest.has(n.id)||a.onAnyChain.has(n.id))?p:'—'}</span>
|
||||
<span class="x" onclick="event.stopPropagation();delNode('${n.id}')" style="cursor:pointer;color:var(--faint)">✕</span>
|
||||
</span>
|
||||
</div>
|
||||
<div class="meta"><span>${TYPELBL[n.type]||n.type}</span>${n.tier?`<span>· ${n.tier}</span>`:''}
|
||||
<span>· reach:${n.reach}</span><span>· exploit:${n.expl}</span>${n.comp?'<span>· compensated</span>':''}</div>
|
||||
</div>`;
|
||||
}).join('');
|
||||
}
|
||||
|
||||
function renderEdgeList(){
|
||||
document.getElementById('e-count').textContent = edges.length?`(${edges.length})`:'';
|
||||
const el=document.getElementById('edge-list');
|
||||
if(!edges.length){el.innerHTML='<div class="empty">No moves yet. A move is one attacker step from one node to another.</div>';return;}
|
||||
const nm=id=>{const n=nodes.find(x=>x.id===id);return n?esc(n.name):'?';};
|
||||
el.innerHTML=edges.map(e=>`<div class="edge-item">
|
||||
<div><b>${nm(e.from)}</b> → <b>${nm(e.to)}</b><br>
|
||||
<span class="muted small">${esc(e.mech)||'(mechanism unspecified)'} · effort ${e.w}</span></div>
|
||||
<span class="x" onclick="delEdge('${e.id}')">✕</span></div>`).join('');
|
||||
}
|
||||
|
||||
function renderGraph(a){
|
||||
const g=document.getElementById('graph');
|
||||
if(!nodes.length){g.innerHTML='<div class="empty" style="margin:10px">The attack graph renders here.</div>';return;}
|
||||
// simple layered layout by distance-from-entry (BFS depth), entries left → jewels right
|
||||
const depth={}; nodes.forEach(n=>depth[n.id]=n.entry?0:null);
|
||||
const adj={};nodes.forEach(n=>adj[n.id]=[]);edges.forEach(e=>{if(adj[e.from])adj[e.from].push(e.to);});
|
||||
let q=nodes.filter(n=>n.entry).map(n=>n.id),guard=0;
|
||||
while(q.length&&guard++<999){const u=q.shift();(adj[u]||[]).forEach(v=>{if(depth[v]==null||depth[v]>depth[u]+1){depth[v]=depth[u]+1;q.push(v);}});}
|
||||
let maxd=0;nodes.forEach(n=>{if(depth[n.id]==null)depth[n.id]=999;maxd=Math.max(maxd,depth[n.id]===999?0:depth[n.id]);});
|
||||
// orphans (no depth) put in a trailing column
|
||||
const cols={};nodes.forEach(n=>{const d=depth[n.id]===999?maxd+1:depth[n.id];(cols[d]=cols[d]||[]).push(n);});
|
||||
const colKeys=Object.keys(cols).map(Number).sort((x,y)=>x-y);
|
||||
const W=Math.max(640,colKeys.length*180), colW=W/colKeys.length;
|
||||
let maxRows=0;colKeys.forEach(k=>maxRows=Math.max(maxRows,cols[k].length));
|
||||
const H=Math.max(220,maxRows*72+40);
|
||||
const pos={};
|
||||
colKeys.forEach((k,ci)=>{cols[k].forEach((n,ri)=>{const rows=cols[k].length;
|
||||
pos[n.id]={x:colW*ci+colW/2,y:H/(rows+1)*(ri+1)};});});
|
||||
const col=n=>{if(a.onShortest.has(n.id))return'var(--p0)';if(a.onAnyChain.has(n.id))return'var(--p1)';if(n.jewel)return'var(--jewel)';if(n.entry)return'var(--entry)';return'#3fb95066';};
|
||||
const onChainEdge=new Set(a.chainEdges.map(e=>e.id));
|
||||
let svg=`<svg viewBox="0 0 ${W} ${H}" preserveAspectRatio="xMidYMid meet">
|
||||
<defs><marker id="arr" markerWidth="9" markerHeight="9" refX="8" refY="3" orient="auto"><path d="M0,0 L8,3 L0,6 Z" fill="#5b6675"/></marker>
|
||||
<marker id="arrR" markerWidth="10" markerHeight="10" refX="8" refY="3" orient="auto"><path d="M0,0 L8,3 L0,6 Z" fill="var(--p0)"/></marker></defs>`;
|
||||
edges.forEach(e=>{const a1=pos[e.from],b=pos[e.to];if(!a1||!b)return;
|
||||
const hot=onChainEdge.has(e.id);
|
||||
const mx=(a1.x+b.x)/2,my=(a1.y+b.y)/2-18;
|
||||
svg+=`<path d="M${a1.x},${a1.y} Q${mx},${my} ${b.x},${b.y}" fill="none" stroke="${hot?'var(--p0)':'#39414d'}" stroke-width="${hot?2.4:1.2}" marker-end="url(#${hot?'arrR':'arr'})" opacity="${hot?1:.7}"/>`;
|
||||
});
|
||||
nodes.forEach(n=>{const p=pos[n.id];if(!p)return;const c=col(n);
|
||||
const r=n.jewel||n.entry?20:16;
|
||||
svg+=`<g>
|
||||
<circle cx="${p.x}" cy="${p.y}" r="${r}" fill="${c}" fill-opacity="${a.onShortest.has(n.id)?0.95:0.18}" stroke="${c}" stroke-width="2"/>
|
||||
${n.jewel?`<text x="${p.x}" y="${p.y+4}" text-anchor="middle" font-size="14">★</text>`:''}
|
||||
${n.entry?`<text x="${p.x}" y="${p.y+4}" text-anchor="middle" font-size="12">▶</text>`:''}
|
||||
<text x="${p.x}" y="${p.y+r+13}" text-anchor="middle" font-size="11" fill="#c9d4df">${esc(n.name.length>22?n.name.slice(0,21)+'…':n.name)}</text>
|
||||
</g>`;});
|
||||
svg+='</svg>';
|
||||
g.innerHTML=svg;
|
||||
}
|
||||
|
||||
function renderChain(a){
|
||||
const el=document.getElementById('chain-out');
|
||||
if(!a.entryIds.length||!a.jewelIds.size){
|
||||
el.innerHTML=`<div class="kc-box"><b>No kill chain yet.</b><div class="note">Mark at least one node as an <span style="color:var(--entry)">entry point</span> and one as a <span style="color:var(--jewel)">crown jewel</span>, then connect them with moves.</div></div>`;return;}
|
||||
if(!a.chain.length){
|
||||
el.innerHTML=`<div class="kc-box"><b style="color:var(--p2)">No path found from any entry point to a crown jewel.</b><div class="note">Either the estate is genuinely segmented here (good — note it), or you haven't mapped the connecting moves yet. In unknown territory, assume the latter until proven.</div></div>`;return;}
|
||||
const nm=id=>nodes.find(n=>n.id===id);
|
||||
let html=`<div class="kc-box"><h2 style="color:var(--p0);margin-top:0">⛓ The kill chain<span class="hint">Lowest-effort path from foothold to existential impact. Total adversary effort: ${a.dist[a.best]}.</span></h2>`;
|
||||
a.chain.forEach((id,i)=>{
|
||||
const n=nm(id);
|
||||
html+=`<div class="kc-step"><div class="kc-node">
|
||||
<div class="n">${esc(n.name)} ${n.entry?'<span class="pill entry">entry</span>':''} ${n.jewel?'<span class="pill jewel">jewel</span>':''}</div>
|
||||
<div class="m">${TYPELBL[n.type]||n.type}${n.tier?' · '+n.tier:''}${n.note?' · '+esc(n.note):''}</div>
|
||||
</div></div>`;
|
||||
if(i<a.chainEdges.length){const e=a.chainEdges[i];
|
||||
html+=`<div class="kc-arrow">↓</div><div class="kc-mech">${esc(e.mech)||'move'} · effort ${e.w}</div>`;}
|
||||
});
|
||||
html+=`<div class="note" style="margin-top:10px">Every node on this path is a <b style="color:var(--p0)">P0</b>. Fix the chain first — break any single link and the existential path is severed. After the incident, ask: did this chain get <i>shorter</i>?</div></div>`;
|
||||
el.innerHTML=html;
|
||||
}
|
||||
|
||||
function renderSummary(a){
|
||||
const counts={P0:0,P1:0,P2:0};
|
||||
nodes.forEach(n=>{counts[priority(n,a)]++;});
|
||||
const qc={crit:0,sev:0,std:0,dark:0,house:0};
|
||||
nodes.forEach(n=>qc[quantum(n,a)]++);
|
||||
document.getElementById('summary').innerHTML=`
|
||||
<div class="stat"><span>Nodes mapped</span><b>${nodes.length}</b></div>
|
||||
<div class="stat"><span>Attacker moves</span><b>${edges.length}</b></div>
|
||||
<div class="stat"><span>Entry points</span><b>${a.entryIds.length}</b></div>
|
||||
<div class="stat"><span>Crown jewels</span><b>${a.jewelIds.size}</b></div>
|
||||
<div class="stat"><span style="color:var(--p0)">Kill-chain length</span><b style="color:var(--p0)">${a.chain.length||'—'}</b></div>
|
||||
<div class="stat"><span style="color:var(--p0)">P0 nodes (on shortest chain)</span><b style="color:var(--p0)">${counts.P0}</b></div>
|
||||
<div class="stat"><span style="color:var(--p1)">P1 nodes (on a chain)</span><b style="color:var(--p1)">${counts.P1}</b></div>
|
||||
<div class="stat"><span style="color:var(--darkq)">Dark quanta (unsized)</span><b style="color:var(--darkq)">${qc.dark}</b></div>`;
|
||||
}
|
||||
|
||||
function renderQuanta(a){
|
||||
const buckets={crit:[],sev:[],std:[],dark:[]};
|
||||
nodes.forEach(n=>{const q=quantum(n,a);if(buckets[q])buckets[q].push(n);});
|
||||
const order=['crit','sev','std','dark'];
|
||||
let html='';
|
||||
order.forEach(k=>{
|
||||
const list=buckets[k];if(!list.length)return;
|
||||
const m=QMETA[k];
|
||||
html+=`<div class="q ${m.cls}"><div class="qh"><span>${m.label}</span><span class="budget">${m.budget}</span></div>`;
|
||||
list.forEach(n=>{
|
||||
const action = k==='crit'?'Sever reachability / compensating control now'
|
||||
: k==='sev'?'Remediate in next change window, verify enforcement'
|
||||
: k==='std'?'Batch into sprint; this is where patch velocity fits'
|
||||
: 'Characterise: establish reachability & exploitability';
|
||||
html+=`<div class="qi"><div class="qn">${esc(n.name)}</div><div class="qd">${action}${n.note?' — '+esc(n.note):''}</div></div>`;
|
||||
});
|
||||
html+='</div>';
|
||||
});
|
||||
if(!html) html='<div class="empty">Quanta appear once nodes sit on a kill chain. Map entries, jewels, and the moves between.</div>';
|
||||
document.getElementById('quanta').innerHTML=html;
|
||||
}
|
||||
|
||||
/* ---------------- import / export ---------------- */
|
||||
function exportJSON(){
|
||||
dl('kill-chain-assessment.json', JSON.stringify({nodes,edges,exported:new Date().toISOString()},null,2));
|
||||
}
|
||||
function importJSON(ev){
|
||||
const f=ev.target.files[0];if(!f)return;
|
||||
const r=new FileReader();
|
||||
r.onload=()=>{try{const s=JSON.parse(r.result);nodes=s.nodes||[];edges=s.edges||[];clearNodeForm();render();}catch(e){alert('Could not read that file.');}};
|
||||
r.readAsText(f); ev.target.value='';
|
||||
}
|
||||
function exportMD(){
|
||||
const a=analyse();const nm=id=>{const n=nodes.find(x=>x.id===id);return n?n.name:'?';};
|
||||
let md=`# Kill Chain Assessment\n\n_Generated ${new Date().toLocaleString()} · Brownhat / CQRE_\n\n`;
|
||||
md+=`## Summary\n\n- Nodes mapped: ${nodes.length}\n- Attacker moves: ${edges.length}\n- Entry points: ${a.entryIds.length}\n- Crown jewels: ${a.jewelIds.size}\n- Kill-chain length: ${a.chain.length||'—'}\n\n`;
|
||||
if(a.chain.length){
|
||||
md+=`## The kill chain (shortest existential path)\n\nLowest-effort path from foothold to existential impact (total adversary effort ${a.dist[a.best]}):\n\n\`\`\`\n`;
|
||||
a.chain.forEach((id,i)=>{md+=`${nm(id)}`;if(i<a.chainEdges.length)md+=`\n → [${a.chainEdges[i].mech||'move'} · effort ${a.chainEdges[i].w}]\n`;});
|
||||
md+=`\n\`\`\`\n\nEvery node on this path is a **P0**. Break any single link to sever the existential path.\n\n`;
|
||||
} else {
|
||||
md+=`## The kill chain\n\nNo path from an entry point to a crown jewel was mapped. Either the estate is segmented here, or the connecting moves are not yet discovered.\n\n`;
|
||||
}
|
||||
// quanta
|
||||
const buckets={crit:[],sev:[],std:[],dark:[]};nodes.forEach(n=>{const q=quantum(n,a);if(buckets[q])buckets[q].push(n);});
|
||||
md+=`## Remediation quanta\n\n`;
|
||||
[['crit','Critical quantum — hours (compensating control, not the patch)'],
|
||||
['sev','Severe quantum — days (one change window)'],
|
||||
['std','Standard quantum — sprint (patch velocity fits here)'],
|
||||
['dark','Dark quantum — unsized (route to discovery)']].forEach(([k,t])=>{
|
||||
if(!buckets[k].length)return;
|
||||
md+=`### ${t}\n\n`;
|
||||
buckets[k].forEach(n=>{md+=`- **${n.name}**${n.tier?` (${n.tier})`:''}${n.note?` — ${n.note}`:''} _(reach:${n.reach}, exploit:${n.expl}${n.comp?', compensated':''})_\n`;});
|
||||
md+=`\n`;
|
||||
});
|
||||
// findings table
|
||||
md+=`## All nodes by priority\n\n| Node | Layer | Tier | Priority | Quantum | Reach | Exploit |\n|---|---|---|---|---|---|---|\n`;
|
||||
const pri=n=>priority(n,a);
|
||||
nodes.slice().sort((x,y)=>({P0:0,P1:1,P2:2}[pri(x)]-{P0:0,P1:1,P2:2}[pri(y)])).forEach(n=>{
|
||||
md+=`| ${n.name} | ${TYPELBL[n.type]||n.type} | ${n.tier||'—'} | ${(a.onShortest.has(n.id)||a.onAnyChain.has(n.id))?pri(n):'off-chain'} | ${QMETA[quantum(n,a)].label} | ${n.reach} | ${n.expl} |\n`;
|
||||
});
|
||||
md+=`\n---\n\n_See Book VII — Vulnerability Management and the Quantum Vulnerability Management framework for how to size and drain these quanta._\n`;
|
||||
dl('kill-chain-assessment.md', md);
|
||||
}
|
||||
function dl(name,content){
|
||||
const b=new Blob([content],{type:'text/plain'});const u=URL.createObjectURL(b);
|
||||
const a=document.createElement('a');a.href=u;a.download=name;a.click();URL.revokeObjectURL(u);
|
||||
}
|
||||
|
||||
/* ---------------- sample (repo: mid-market engagement) ---------------- */
|
||||
function loadSample(){
|
||||
if(nodes.length && !confirm('Replace current assessment with the sample engagement?'))return;
|
||||
nodes=[
|
||||
mk('Stale contractor credential','identity','',{entry:1,reach:'yes',expl:'yes',note:'Active 6 months after offboarding; no MFA'}),
|
||||
mk('Internet-facing VPN (legacy firmware)','entry','',{entry:1,reach:'yes',expl:'yes',note:'Cisco ASA, firmware 18mo stale, no MFA'}),
|
||||
mk('M365 / Entra ID','identity','T1',{reach:'yes',expl:'yes',note:'34% sign-ins without MFA; CA in report-only'}),
|
||||
mk('SharePoint / Teams / Exchange','data','T1',{reach:'yes',expl:'no',note:'All collaboration data + email'}),
|
||||
mk('Entra admin account','privilege','T0',{reach:'yes',expl:'yes',note:'Reachable via password spray'}),
|
||||
mk('Entra Connect sync account','privilege','T0',{reach:'yes',expl:'yes',note:'Has DCSync rights on-prem'}),
|
||||
mk('On-prem Active Directory','privilege','T0',{jewel:0,reach:'yes',expl:'yes',note:'KRBTGT never rotated (847d)'}),
|
||||
mk('SAP ERP','infra','T1',{jewel:1,reach:'unknown',expl:'unknown',note:'Financial + operational; default creds on secondary instance'}),
|
||||
mk('Backups (same segment as ERP)','recovery','T1',{jewel:1,reach:'yes',expl:'yes',comp:0,note:'Never restore-tested; reachable from estate'})
|
||||
];
|
||||
const id=n=>nodes.find(x=>x.name.startsWith(n)).id;
|
||||
edges=[
|
||||
ed('Stale contractor','M365','Credential valid, no MFA',1),
|
||||
ed('Internet-facing VPN','On-prem','VPN auth → internal network',1),
|
||||
ed('M365','SharePoint','Token grants data access',1),
|
||||
ed('M365','Entra admin','Password spray → privilege escalation',2),
|
||||
ed('Entra admin','Entra Connect','Admin controls sync identity',2),
|
||||
ed('Entra Connect','On-prem','DCSync via sync-account rights',2),
|
||||
ed('On-prem','SAP ERP','Domain creds reused on ERP',3),
|
||||
ed('On-prem','Backups','Backups reachable from domain',1),
|
||||
ed('SAP ERP','Backups','Same network segment',1)
|
||||
];
|
||||
function mk(name,type,tier,o){return Object.assign({id:uid(),name,type,tier,entry:!!o.entry,jewel:!!o.jewel,reach:o.reach||'unknown',expl:o.expl||'unknown',comp:!!o.comp,note:o.note||''},{});}
|
||||
function ed(a,b,mech,w){return {id:uid(),from:id(a),to:id(b),mech,w};}
|
||||
clearNodeForm();render();
|
||||
}
|
||||
|
||||
/* ---------------- boot ---------------- */
|
||||
restore();
|
||||
if(!nodes.length) loadSample(); else render();
|
||||
</script>
|
||||
</body>
|
||||
</html>
|
||||
Reference in New Issue
Block a user