Compare commits: 0d52474c30...5264f7b439 (2 commits)

Commits: 5264f7b439, 3226e53f95

@@ -8,6 +8,9 @@ This directory contains diagnostic tools, maturity models, and assessment resour

| Template | Purpose |
|----------|---------|
| [Engagement Checklist](engagement-checklist.md) | **Point-in-time, regularly updated.** Controls to inspect on every M365+AD engagement, organized by domain. Not scored — a structured inspection list. Review January 2027. |
| [Adversarial Validation Checklist](adversarial-validation-checklist.md) | **Phase 2 — mature estates.** Every item is a test, not an inspection. Opening/closing metrics, eight detection simulations, CA ghost policy tests, attack path verification. Review January 2027. |
| [Self-Service Cadence](self-service-cadence.md) | **Client leave-behind.** Monthly portal checks and quarterly tool runs (PingCastle, Purple Knight, CAExporter, PowerShell scripts) an admin can run between engagements. Includes "call us" triggers. Customise per client before handing over. |
| [Assessment Team Guide](assessment-team-guide.md) | Technical execution guide for the Brownhat Diagnostic: tool sequence (ASTRAL, PULSAR, BloodHound, Elysium, Purple Knight, CAExporter), what to look for, kill chain synthesis, report structure, common mistakes. |
| [Findings Backlog](findings-backlog.md) | Single source of truth for all findings across every module and diagnostic. The input queue for the housekeeping stream. Pragmatic alternative to a formal risk register for organisations that do not have one. |
| [NIST CSF 2.0 Baseline Assessment](nist-csf-baseline.md) | The Brownhat Diagnostic: structured 2-half-day workshop, gap analysis, kill chain identification |

@@ -0,0 +1,319 @@

# Adversarial Validation Checklist

> *For clients who have done the foundational work. Everything here is tested, not inspected.*

**Last updated:** June 2026
**Engagement type:** Phase 2 — mature estates
**Field guide:** [Adversarial Validation Field Guide](../books/field-guide-adversarial-validation.md)
**Next review:** January 2027

---

## How to use this

This checklist assumes the foundational controls are in place. The question is not "does this control exist" — it is "does this control work." Every item is a test. If an item cannot be tested in the current engagement window, mark it as untested and note it as a finding: **an untested control is a broken control; you simply do not know it yet.**

Before any test: confirm written authorization. Before the first test: capture baseline metrics (BloodHound path count, Entra role assignment export, CA policy JSON export). After the engagement: record the "after" metrics.

**Notation:**
`[VERIFY]` — confirm the claim against observed behavior
`[SIMULATE]` — run the attack or failure scenario, authorized and controlled
`[MEASURE]` — produce a number; the number is the finding, not pass/fail

---

## Opening metrics (capture before first test)

- `[MEASURE]` BloodHound paths to Domain Admin (all paths; then filtered to paths reachable from standard user compromise)
- `[MEASURE]` Count of active (non-eligible) Global Admin assignments excluding break-glass
- `[MEASURE]` Count of active (non-eligible) Domain Admin assignments
- `[MEASURE]` Service principals with escalation-grade Graph permissions (application permissions)
- `[MEASURE]` CA policies verified to enforce (by prior observation) vs. total CA policies in scope
- `[MEASURE]` Distinct device IDs in sign-in logs (last 30 days) vs. Intune enrolled device count
- `[MEASURE]` Alert volume per day (last 30 days) vs. alerts with documented human response
- `[MEASURE]` Structural changes produced by the last five closed security incidents or alerts
- `[MEASURE]` Anonymous link count across SharePoint/OneDrive (existing, regardless of current tenant setting)
- `[MEASURE]` Backup MTTR from last documented restore (if any; if none, record "never tested")

---

## Section 1 — Identity: the wall

### 1.1 Firebreak integrity

- `[VERIFY]` Pull all Global Admin members and check `onPremisesSyncEnabled` for each. Any `true` value is a P0. "We moved them to cloud-only" is the claim; this is the verification.
- `[VERIFY]` Trace every path from a simulated on-prem compromise (sync server connector account) to a cloud privileged role. Draw the graph. Each path is a hole in the wall.
- `[VERIFY]` For each cloud admin: what MFA device are they using, and is that device also used for email and browsing? A Tier 2 device authenticating a Tier 0 role is a tier violation through the MFA layer.
- `[VERIFY]` Does any admin's MFA authenticator app depend on a phone number or device that is outside the client's MDM? (MFA backup codes stored in iCloud are a personal device dependency for a privileged role.)
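
The first check in 1.1 can be scripted once the Global Admin membership has been exported. The following is a minimal sketch, assuming the members were exported to a list of user objects carrying `userPrincipalName` and `onPremisesSyncEnabled`; the function name and the sample records are hypothetical and stand in for the client's real export.

```python
def synced_global_admins(members):
    """Return the UPNs of role members whose account is synced from on-prem AD.

    Only a literal True counts as synced; None or False is treated as
    cloud-only for the purpose of this check.
    """
    return sorted(
        m["userPrincipalName"]
        for m in members
        if m.get("onPremisesSyncEnabled") is True
    )


# Hypothetical export; real data would come from the client's tenant.
sample_members = [
    {"userPrincipalName": "ga-cloud@contoso.example", "onPremisesSyncEnabled": None},
    {"userPrincipalName": "ga-legacy@contoso.example", "onPremisesSyncEnabled": True},
    {"userPrincipalName": "breakglass@contoso.example", "onPremisesSyncEnabled": False},
]

p0_findings = synced_global_admins(sample_members)
print(p0_findings)
```

An empty result supports the "cloud-only" claim for this one property only; the path tracing and MFA device checks in the remaining items still need to be done by hand.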

### 1.2 Break-glass: real test

- `[SIMULATE]` Sign in to the break-glass Global Admin account.
- `[MEASURE]` Time from sign-in to alert received by named responder.
- `[VERIFY]` Alert reaches the named responder (not just fires into a queue). Responder acknowledges.
- `[VERIFY]` Break-glass sign-in works with zero on-prem dependency (test while sync is stopped, or while on a network with no DC visibility).
- `[VERIFY]` Break-glass credentials can be retrieved from their storage location without the systems they are recovering (test retrieval physically or procedurally).

### 1.3 PIM enforcement

- `[VERIFY]` For Global Administrator role PIM settings: what is the MFA method required on activation? Confirm it is phishing-resistant (FIDO2 or certificate). Push-approve is a finding.
- `[SIMULATE]` Activate an eligible GA role from a personal device or a non-compliant device. Is it blocked by a CA policy scoped to role activation?
- `[SIMULATE]` Request activation requiring approval. Does the approval notification reach the approver with meaningful context (what role, for whom, what justification)? Does the approver act within SLA?
- `[MEASURE]` Maximum activation time box for GA and Privileged Role Admin. Record in hours. 24-hour window = functionally standing privilege during business hours.
- `[VERIFY]` Are there any GA assignments that are active (permanent) and are not break-glass accounts? Pull the list; any result is a PIM compliance gap from configuration drift.

### 1.4 AD FS (if still running)

- `[MEASURE]` Token-signing certificate age in days since last rotation.
- `[SIMULATE]` Golden SAML tabletop: if the private key were obtained, what alert (if any) would fire? Walk through the detection path. Document what is visible and what is not.
- `[VERIFY]` Is there a signed migration plan with a named date? If not, document as P0 finding — migration tooling is mature; absence of a plan is a decision, not a default.

### 1.5 Connector account monitoring

- `[SIMULATE]` Authenticate as the Entra connector account (Directory Synchronization Accounts) from a host other than the sync server. Does an alert fire?
- `[MEASURE]` Time from test authentication to alert receipt.
- `[VERIFY]` If no alert fires: the most DCSync-capable account in the estate is unmonitored. Document as P0.

### 1.6 Seamless SSO / AZUREADSSOACC

- `[VERIFY]` `Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet` — compare to approximate tenant go-live date. If matching: never rotated.
- `[VERIFY]` If Seamless SSO is not needed for the current device estate (Entra-joined devices on modern auth): document removal as a quick win.
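
The `PasswordLastSet` comparison in 1.6 is simple date arithmetic once the value has been read from AD. The sketch below assumes the two dates have already been collected by hand; `rotation_status`, the 30-day tolerance used for "approximately the go-live date", and the sample dates are all illustrative assumptions rather than a prescribed threshold.

```python
from datetime import date


def rotation_status(password_last_set, go_live, today, tolerance_days=30):
    """Summarise the AZUREADSSOACC key age.

    tolerance_days is an assumed margin for "approximately the go-live date".
    """
    age_days = (today - password_last_set).days
    likely_never_rotated = abs((password_last_set - go_live).days) <= tolerance_days
    return {"age_days": age_days, "likely_never_rotated": likely_never_rotated}


# Hypothetical dates for illustration only.
status = rotation_status(
    password_last_set=date(2021, 3, 10),
    go_live=date(2021, 3, 1),
    today=date(2026, 6, 1),
)
print(status)
```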

---

## Section 2 — Privilege: attack paths

### 2.1 BloodHound / attack path analysis

- `[MEASURE]` Total BloodHound paths to Domain Admin.
- `[MEASURE]` Shortest path (fewest hops) to Domain Admin from a standard user account. Enumerate the specific path.
- `[MEASURE]` Number of paths involving Kerberoastable service accounts.
- `[MEASURE]` Number of paths involving ADCS templates (add ACL collection to BloodHound run).
- `[VERIFY]` Has anyone on the client team reviewed BloodHound output in the last 90 days? If not, the path count from the last review is the stale baseline, not the current state.

### 2.2 Kerberoasting: attack and detection

- `[SIMULATE]` Run Invoke-Kerberoast or Rubeus kerberoast (authorized, test account as origin).
- `[VERIFY]` Did Defender for Identity, Sentinel, or any SIEM alert on the TGS request pattern?
- `[MEASURE]` Time from attack to alert receipt (if alert fires).
- `[SIMULATE]` Attempt to crack the harvested hashes offline. Record which accounts crack and approximate crack time.
- Finding: accounts that crack quickly + no detection = P0 on both the account and the detection gap.

### 2.3 ADCS

- `[VERIFY]` Run `certipy find` or `Certify.exe find /vulnerable` against the CA. Document any ESC findings.
- `[VERIFY]` Is the ADCS server on a dedicated Tier 0 or hardened host, or on a standard server? Check who has local admin access.
- `[VERIFY]` Are there published certificate templates with "Supply subject in request" and enrollment permissions broader than the intended service? (ESC1 pattern)
- `[SIMULATE]` If ESC1 is found: demonstrate the exploit path (in authorized test context — enroll a cert for a test admin account using the vulnerable template). Show the client the domain admin cert in hand.

### 2.4 Service principal dark matter

- `[VERIFY]` For each service principal with escalation-grade application permissions: ask the room to identify the current owner and current use case. Document every "I don't know."
- `[VERIFY]` For each: check `lastSignInDateTime` for the service principal. Unused principal + dangerous permissions + non-expiring secret = standing credential that can be activated any time.
- `[VERIFY]` Are there app registrations with admin consent granted for `Mail.Read`, `Files.ReadWrite.All`, or equivalent — where the granting user or admin is no longer at the organization?
- `[SIMULATE]` Attempt to use a service principal with dangerous Graph permissions to escalate: assign a role, add an app role assignment, or read all users. Confirm the permission is real and enforced (not just declared).
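
The "unused principal + dangerous permissions + non-expiring secret" rule in 2.4 can be applied mechanically to an inventory before the conversation in the room. This is a sketch under stated assumptions: the record fields (`name`, `app_permissions`, `last_sign_in`, `secret_expires`) describe a hypothetical flattened export, the permission set is an example rather than a complete definition of "escalation-grade", and a secret is treated as effectively non-expiring when it has no recorded expiry or expires more than `long_lived_days` from now.

```python
from datetime import datetime, timedelta, timezone

# Example set only; what counts as escalation-grade is an engagement decision.
ESCALATION_GRADE = {
    "RoleManagement.ReadWrite.Directory",
    "AppRoleAssignment.ReadWrite.All",
    "Application.ReadWrite.All",
}


def dark_matter(principals, now, idle_days=90, long_lived_days=730):
    """Flag principals that are idle, hold dangerous permissions, and carry a
    long-lived (or undated) secret."""
    idle_cutoff = now - timedelta(days=idle_days)
    long_lived_cutoff = now + timedelta(days=long_lived_days)
    flagged = []
    for sp in principals:
        dangerous = ESCALATION_GRADE & set(sp["app_permissions"])
        last = sp["last_sign_in"]            # None means no sign-in on record
        idle = last is None or last < idle_cutoff
        expiry = sp["secret_expires"]        # None means no expiry recorded
        standing = expiry is None or expiry > long_lived_cutoff
        if dangerous and idle and standing:
            flagged.append((sp["name"], sorted(dangerous)))
    return flagged


utc = timezone.utc
now = datetime(2026, 6, 1, tzinfo=utc)
# Hypothetical inventory for illustration only.
sample = [
    {"name": "legacy-sync-tool",
     "app_permissions": ["RoleManagement.ReadWrite.Directory", "User.Read.All"],
     "last_sign_in": None,
     "secret_expires": datetime(2099, 12, 31, tzinfo=utc)},
    {"name": "hr-connector",
     "app_permissions": ["User.Read.All"],
     "last_sign_in": datetime(2026, 5, 30, tzinfo=utc),
     "secret_expires": datetime(2026, 12, 1, tzinfo=utc)},
    {"name": "deploy-pipeline",
     "app_permissions": ["Application.ReadWrite.All"],
     "last_sign_in": datetime(2026, 5, 28, tzinfo=utc),
     "secret_expires": datetime(2026, 9, 1, tzinfo=utc)},
]

flagged = dark_matter(sample, now)
print(flagged)
```

The output is a shortlist for the owner and use-case questions, not a finding on its own; the `[SIMULATE]` item still decides whether the permission is real.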

### 2.5 Standing privilege beyond PIM

- `[VERIFY]` Pull active (not eligible) role assignments for GA, PRA, Security Admin, Exchange Admin. Any active assignment not in the break-glass inventory is a drift finding.
- `[VERIFY]` Pull Domain Admins and Enterprise Admins. Count them. Ask the client how many they believe exist. Present the actual count. In most estates, the actual count exceeds the belief.
- `[VERIFY]` Are there administrator accounts with no associated human — service accounts running with Domain Admin because "it was easier at the time"?

### 2.6 Local privilege on endpoints

- `[VERIFY]` Pull local Administrators group membership across a sample of endpoints (10+ devices). Are there accounts beyond the expected (LAPS-managed local admin, Entra-joined device admin, EPM)?
- `[VERIFY]` Is Windows LAPS deployed and confirmed working? Retrieve a LAPS password for a test device through Intune or the AD attribute. Confirm rotation has occurred (password age < 30 days or per policy).
- `[VERIFY]` If EPM is deployed: test an elevation request for a controlled binary. Is it logged? Is the log reviewed by anyone?

---

## Section 3 — Devices: compliance signal gap

### 3.1 CA policy enforcement (test each separately)

For each CA policy in scope, write the expected outcome before looking at the configuration. Then test:

- `[SIMULATE]` **Legacy auth block:** Authenticate using Basic Auth from a test account (Exchange ActiveSync, SMTP auth, or equivalent). Expected: blocked. Result: ___
- `[SIMULATE]` **Compliant device gate:** Sign in from a known non-compliant device (personal device, or a managed device taken out of compliance). Expected: blocked from sensitive workloads. Result: ___
- `[SIMULATE]` **Admin sign-in location gate:** Attempt a PIM role activation from a device outside the named compliant/PAW scope. Expected: blocked. Result: ___
- `[SIMULATE]` **MFA enforcement:** Sign in as a test user from a new device with no registered session. Expected: MFA challenged. Confirm the MFA method that fires (push-approve vs. FIDO2). Result: ___
- `[VERIFY]` For any policy that fails to enforce despite correct displayed configuration: recreate from scratch, re-test. Document if ghost policy confirmed.
- `[VERIFY]` Are there CA policies in report-only mode that should be enabled? Report-only is a test state, not a permanent posture.
- `[VERIFY]` Break-glass accounts excluded from blocking policies — test the break-glass sign-in path specifically under the conditions a blocking policy would normally fire.

### 3.2 Compliance signal quality

- `[SIMULATE]` Induce a non-compliant state on a test managed device. Record the timestamp.
- `[MEASURE]` Time from non-compliance induction to Intune state update.
- `[MEASURE]` Time from non-compliance induction to CA token revocation / session block.
- `[VERIFY]` Is CAE (Continuous Access Evaluation) active for critical workloads? If yes, measure revocation time for a CAE-supported app vs. a non-CAE app. Present the gap.
- `[SIMULATE]` Root / jailbreak a test device. Does the jailbreak detection in the compliance policy trigger? How long does it take?
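
The two `[MEASURE]` items in 3.2 reduce to timestamp arithmetic once each moment has been written down. A minimal sketch, assuming the tester records ISO 8601 timestamps by hand; the helper name and the sample times are hypothetical.

```python
from datetime import datetime


def latency_minutes(start, end):
    """Minutes between two ISO 8601 timestamps; None if the event never occurred."""
    if end is None:
        return None
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


# Hypothetical observations from a single test device.
induced = "2026-06-02T09:00:00+00:00"
intune_updated = "2026-06-02T09:42:00+00:00"
session_blocked = "2026-06-02T17:05:00+00:00"

to_intune = latency_minutes(induced, intune_updated)
to_block = latency_minutes(induced, session_blocked)
never_blocked = latency_minutes(induced, None)
print(to_intune, to_block, never_blocked)
```

Returning `None` for an event that never happened keeps "the session was never blocked" distinct from a long delay, which matters when the number itself is the finding.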

### 3.3 Fleet reality check

- `[MEASURE]` Distinct device IDs in sign-in logs (last 30 days).
- `[MEASURE]` Intune enrolled device count.
- `[MEASURE]` Devices in sign-in logs with device compliance state "non-compliant" or "unknown."
- `[VERIFY]` Are there legacy-auth sign-ins in the logs that bypass device compliance evaluation entirely? Filter by Client App = non-modern entries. Each entry is a device control bypass.
- `[VERIFY]` Pick 5 devices from the sign-in log that are not in Intune. What data do they have access to? What CA policy, if any, applies to them?
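
The comparison between sign-in log devices and the Intune inventory is a set difference. The sketch below assumes both sources have already been reduced to plain lists of device IDs (the export step is not shown), and it does not account for sign-ins that carry no device ID at all; `fleet_gap` and the sample IDs are hypothetical.

```python
def fleet_gap(signin_device_ids, enrolled_device_ids):
    """Compare devices seen in sign-in logs against enrolled devices."""
    seen = set(signin_device_ids)        # a device signs in many times; count it once
    enrolled = set(enrolled_device_ids)
    unmanaged = seen - enrolled
    pct = round(100 * len(unmanaged) / len(seen), 1) if seen else 0.0
    return {
        "distinct_in_logs": len(seen),
        "enrolled": len(enrolled),
        "unmanaged_ids": sorted(unmanaged),
        "unmanaged_pct": pct,
    }


# Hypothetical IDs for illustration only.
gap = fleet_gap(
    signin_device_ids=["d1", "d2", "d2", "d3", "d4"],
    enrolled_device_ids=["d1", "d2", "d5"],
)
print(gap)
```

The `unmanaged_ids` list is a convenient source for the "pick 5 devices" item, and `unmanaged_pct` can feed the closing metric for unmanaged device sign-ins.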

### 3.4 Update rings and rollback

- `[VERIFY]` Are update rings configured with a named pilot group and a broad group with deferral?
- `[VERIFY]` Is there a named person with the process to halt a broad ring update push? Do they know the procedure? Have they tested it?
- `[SIMULATE]` (If authorized and non-disruptive) Push a test configuration change to the pilot ring only. Confirm it stays in the pilot ring and does not propagate to broad without explicit promotion.

### 3.5 MAM boundary (per platform)

- `[SIMULATE]` On iOS: copy text from managed Outlook to an unmanaged app. Blocked or not?
- `[SIMULATE]` On Android: same test. (Do separately — behavior is not symmetric.)
- `[SIMULATE]` On iOS: "Open in" from a managed email attachment to Files app or an unmanaged viewer.
- `[SIMULATE]` On either platform: save to local storage or backup to iCloud/Google Drive.
- `[VERIFY]` For any gap found: confirm it reproduces after device reset. If it does, escalate to vendor. If it does not, investigate configuration.

---

## Section 4 — Data: does protection travel

### 4.1 Label encryption in the wild

- `[SIMULATE]` Forward a Highly Confidential test document to an external test email address. Open it from a mail client with no tenant authentication. Does encryption prevent access?
- `[SIMULATE]` Download the same document to an unmanaged device. Does encryption require re-authentication to the tenant?
- `[SIMULATE]` Share the document via an anonymous link. Access from an unauthenticated browser. Does it open?
- `[SIMULATE]` Copy/paste content from the document on a managed device under a MAM policy. Is it blocked?
- `[VERIFY]` For any path where the document opens without authentication: this is an exfiltration route. Document the specific path, the expected control that should have blocked it, and the observed result.

### 4.2 DLP enforcement

- `[SIMULATE]` Send an email from a test account containing content matching a high-value DLP rule (credit card number pattern, national ID format, or the client's custom regex for crown-jewel content). Does DLP intercept it? What action fires (block, override, audit-only)?
- `[SIMULATE]` Upload the same content to a personal OneDrive or cloud storage from a managed device. Does DLP fire?
- `[VERIFY]` For DLP rules that fire in audit-only mode: what happens to the audit events? Are they reviewed? By whom? How often?
- `[VERIFY]` What is the false positive rate for high-sensitivity DLP rules? High false positive rates mean users have learned to override; the rule is not a control.

### 4.3 Anonymous links (existing population)

- `[MEASURE]` Full count of anonymous links across the tenant. (Not the current sharing setting — the existing links that predate any restriction.)
- `[VERIFY]` Confirm at least one existing anonymous link resolves from an unauthenticated browser. It does — almost certainly. This proves the declared sharing restriction is forward-looking, not retroactive.
- `[VERIFY]` Can the client produce the anonymous link list and revoke all entries in under 30 minutes? Test the revocation capability, not just the list.

### 4.4 Email exfiltration paths

- `[SIMULATE]` Create a test Inbox rule on a test account forwarding to an external test address. Does anything alert? When?
- `[VERIFY]` `Get-RemoteDomain Default | Select-Object AutoForwardEnabled` — if False, test whether the Inbox rule still forwards. Document the result (transport-level and client-rule forwarding behave differently).
- `[VERIFY]` `Get-TransportRule` for any rules with external redirect or blind copy. For each: who created it, when, and is there a documented owner?
- `[MEASURE]` Time from Inbox rule creation to detection alert (if any).

### 4.5 Guest access and reshare chain

- `[MEASURE]` Total guest count. Guests not signed in for 90+ days. Ratio of stale to active.
- `[VERIFY]` Do guests have access beyond their original project scope? Pick 5 random active guests and enumerate their group and site memberships.
- `[SIMULATE]` Share a test document to a test external guest. Have the guest reshare to a second external test account. Can the client observe the second hop? Can they revoke it?
- `[VERIFY]` Are access reviews running for guests? What is the default action on reviewer non-response?
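
The guest `[MEASURE]` item in 4.5 produces three numbers from one export. A minimal sketch, assuming each guest record carries a `upn` and a `last_sign_in` date (`None` when no sign-in is on record); the field names, the function name, and the sample data are hypothetical.

```python
from datetime import date, timedelta


def guest_staleness(guests, today, stale_after_days=90):
    """Count guests with no sign-in in the last stale_after_days days."""
    cutoff = today - timedelta(days=stale_after_days)
    stale = [
        g["upn"] for g in guests
        if g["last_sign_in"] is None or g["last_sign_in"] <= cutoff
    ]
    active = len(guests) - len(stale)
    return {
        "total": len(guests),
        "stale": len(stale),
        "active": active,
        # The ratio is undefined when there are no active guests.
        "stale_to_active": (len(stale) / active) if active else None,
    }


# Hypothetical guest export for illustration only.
guests = [
    {"upn": "a_partner.example#EXT#", "last_sign_in": date(2026, 5, 20)},
    {"upn": "b_vendor.example#EXT#", "last_sign_in": date(2025, 11, 2)},
    {"upn": "c_agency.example#EXT#", "last_sign_in": None},
    {"upn": "d_auditor.example#EXT#", "last_sign_in": date(2026, 4, 15)},
]

summary = guest_staleness(guests, today=date(2026, 6, 1))
print(summary)
```

Guests with no recorded sign-in are counted as stale here; whether that is right for a given client (for example, recently invited guests) is a judgment to make before presenting the ratio.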

### 4.6 Audit log forensics readiness

- `[VERIFY]` Confirm audit logging is enabled (Purview > Audit — look for the "Start recording" banner; if it appears, logging is off).
- `[SIMULATE]` Run a forensic reconstruction: given a specific test user account, reconstruct everywhere they accessed data in the last 7 days. Can you produce a coherent picture from the audit log alone?
- `[MEASURE]` How far back does the audit log extend for the current licensing tier? Test by querying for a known event at the boundary date.
- `[VERIFY]` Are admin operations (CA policy changes, role assignments, app consent grants) present in the audit log? Run a query for admin events from the last 30 days and spot-check for completeness.

---

## Section 5 — Detection: the eight simulations

For each simulation: run it, record whether the alert fired, record the time from event to human acknowledgment, and record whether the responder acted. The SLA comparison is the finding.

| Simulation | Alert fires? | Time to human | Action taken | Finding |
|---|---|---|---|---|
| Break-glass sign-in | | | | |
| New Global Admin assigned | | | | |
| DCSync from non-DC host | | | | |
| Kerberoasting (TGS pattern) | | | | |
| Impossible travel (admin account) | | | | |
| External auto-forward rule created | | | | |
| Mass download from SharePoint | | | | |
| OAuth consent grant (sensitive scope) | | | | |

### 5.1 Alert queue health

- `[MEASURE]` Alert volume per day (last 30 days).
- `[MEASURE]` Alerts with documented human response.
- `[MEASURE]` Alerts suppressed or auto-closed without human review.
- `[MEASURE]` Alerts open for more than 48 hours.
- `[VERIFY]` For every alert category: is there a named owner? An alert category with no named owner is an unread alert category.
- `[VERIFY]` Pick 5 alerts from the last 30 days that were closed. For each: what action was taken, and what structural change resulted?
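
The four `[MEASURE]` items in 5.1 can be computed from a single alert export. This is a sketch under assumptions: each alert is a record with a `created` time, a `closed` time (`None` while open), and a `responder` (`None` when no human response is documented). Those field names, the function name, and the sample alerts are hypothetical and will differ per SIEM.

```python
from datetime import datetime, timedelta


def queue_health(alerts, now, window_days=30, open_limit_hours=48):
    """Summarise alert handling over the trailing window."""
    window_start = now - timedelta(days=window_days)
    recent = [a for a in alerts if a["created"] >= window_start]
    return {
        "per_day": round(len(recent) / window_days, 2),
        "human_response": sum(1 for a in recent if a["responder"] is not None),
        "closed_without_review": sum(
            1 for a in recent
            if a["closed"] is not None and a["responder"] is None
        ),
        "open_over_limit": sum(
            1 for a in recent
            if a["closed"] is None
            and now - a["created"] > timedelta(hours=open_limit_hours)
        ),
    }


now = datetime(2026, 6, 30, 12, 0)
# Hypothetical alert export for illustration only.
alerts = [
    {"created": datetime(2026, 6, 29, 9, 0), "closed": datetime(2026, 6, 29, 10, 0), "responder": "j.doe"},
    {"created": datetime(2026, 6, 27, 8, 0), "closed": datetime(2026, 6, 27, 8, 1), "responder": None},
    {"created": datetime(2026, 6, 25, 14, 0), "closed": None, "responder": None},
    {"created": datetime(2026, 6, 30, 11, 0), "closed": None, "responder": None},
    {"created": datetime(2026, 5, 1, 9, 0), "closed": datetime(2026, 5, 1, 9, 30), "responder": "j.doe"},
]

health = queue_health(alerts, now)
print(health)
```

The numbers describe volume and handling only; the named-owner and structural-change questions in the `[VERIFY]` items still require a conversation.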

### 5.2 The feedback loop test

- `[MEASURE]` Last 5 closed security incidents: structural changes produced (count removals, access reductions, severed couplings — not reminders, training, or "noted in risk register").
- `[VERIFY]` Is there a post-incident process that explicitly asks: "what structural thing changes as a result of this?"
- `[VERIFY]` Is the post-incident process blameless on people (encouraging surfacing) and ruthless on structure (demanding a removal or change)?

---

## Section 6 — Recovery

### 6.1 Backup: restore something

- `[SIMULATE]` Restore a mailbox (or a mailbox item set) from the third-party backup. Time the operation.
- `[MEASURE]` Actual MTTR from test restore vs. policy-declared RTO.
- `[VERIFY]` If the actual MTTR exceeds the policy RTO: the policy is a fiction. Document the observed time as the operative figure.
- `[VERIFY]` Are backups isolated from the estate they protect? Can a Global Admin delete the backup copies?
- `[VERIFY]` Is there a third-party M365 backup at all? If not: M365 native recycle bin + version history is the only recovery mechanism, and this is a P0 for any organization with business-critical M365 data.

### 6.2 AD forest recovery

- `[VERIFY]` Does a written AD forest recovery runbook exist?
- `[VERIFY]` Is it stored where it can be retrieved when AD is down? (Not SharePoint. Not AD-authenticated storage.)
- `[VERIFY]` Has anyone on the team run the procedure — not a tabletop, an actual restore, even in a lab?
- `[VERIFY]` Does the runbook include: DC restore sequence, metadata cleanup, double KRBTGT rotation, trust resets?
- Finding if all of the above are no: the first time AD forest recovery is performed will be during the real disaster. Document as a rehearsal scope item.

### 6.3 Configuration known-good

- `[VERIFY]` Export current CA policies to JSON. Diff against the opening-of-engagement export. For every difference: is there a change record?
- `[VERIFY]` Are there CA policies that changed since the last documented review without a corresponding change order?
- `[VERIFY]` If a CA policy was silently modified (intentionally or not), what mechanism would have detected it and when?
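
The export-and-diff step in 6.3 can be sketched as a comparison of two policy lists keyed by policy `id`. The assumption here is that each export has already been loaded from JSON into a list of policy objects with an `id` field; the function name and the shortened sample policies are hypothetical, and real exports carry many more fields.

```python
def diff_policies(before, after):
    """Compare two CA policy exports keyed by policy id."""
    old = {p["id"]: p for p in before}
    new = {p["id"]: p for p in after}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Dict comparison ignores key order, so only real value changes count.
        "changed": sorted(
            pid for pid in old.keys() & new.keys() if old[pid] != new[pid]
        ),
    }


# Hypothetical, heavily shortened exports for illustration only.
opening_export = [
    {"id": "p1", "displayName": "Block legacy auth", "state": "enabled"},
    {"id": "p2", "displayName": "Require MFA for admins", "state": "enabled"},
    {"id": "p3", "displayName": "Old pilot policy", "state": "disabled"},
]
current_export = [
    {"id": "p1", "displayName": "Block legacy auth", "state": "enabledForReportingButNotEnforced"},
    {"state": "enabled", "id": "p2", "displayName": "Require MFA for admins"},
    {"id": "p4", "displayName": "Require compliant device", "state": "enabled"},
]

drift = diff_policies(opening_export, current_export)
print(drift)
```

Every ID in the result is a prompt for the change-record question; in this sample, `p1` dropping from enforced to report-only is the kind of silent modification the last item asks about.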

### 6.4 Break-glass independence

- `[VERIFY]` Cloud admin recovery path works with no on-prem dependency — confirm by testing while sync is stopped or from a network with no DC visibility.
- `[VERIFY]` If the primary MFA infrastructure (Microsoft Authenticator, FIDO2 key) is unavailable, is there a recovery path for privileged access that does not itself require privileged access?

---

## Closing metrics (capture after engagement)

| Metric | Before | After | Delta |
|--------|--------|-------|-------|
| BloodHound paths to DA (from standard user) | | | |
| Active (non-break-glass) Global Admin assignments | | | |
| Active (non-break-glass) Domain Admin assignments | | | |
| CA policies verified by observation (working) | | | |
| Detection signals tested end-to-end (working) | | | |
| Anonymous link count | | | |
| Unmanaged device sign-in % of total | | | |
| Actual backup MTTR (minutes) | | | |
| Structural changes from last 5 incidents (before) | | | |
| Structural changes produced this engagement | | | |
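
Filling the Delta column is a subtraction per metric, with one wrinkle: a metric recorded as "never tested" at opening has no numeric baseline. The sketch below assumes opening and closing values are kept in two dictionaries keyed by metric name, with `None` standing for a missing baseline; the function name and the sample values are hypothetical.

```python
def metric_deltas(before, after):
    """Return (metric, before, after, delta) rows; delta is None when a value is missing."""
    rows = []
    for name, b in before.items():
        a = after.get(name)
        numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
        rows.append((name, b, a, a - b if numeric else None))
    return rows


# Hypothetical values for illustration only.
opening = {
    "BloodHound paths to DA (from standard user)": 41,
    "Anonymous link count": 1200,
    "Actual backup MTTR (minutes)": None,  # recorded as "never tested"
}
closing = {
    "BloodHound paths to DA (from standard user)": 7,
    "Anonymous link count": 950,
    "Actual backup MTTR (minutes)": 185,
}

rows = metric_deltas(opening, closing)
for name, b, a, delta in rows:
    print(f"| {name} | {b} | {a} | {delta} |")
```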

---

## Engagement close verification

Before marking the engagement complete:

- Every finding that was verified by observation has a structural change attached (not a risk register entry — a change).
- The closing metrics have been calculated and compared to the opening metrics.
- The break-glass has been tested and works.
- At least one backup restore has been timed and the MTTR recorded.
- At least one CA policy has been verified to enforce by a real sign-in with pre-written expected outcomes.
- At least one detection signal has been tested end-to-end to a human responder.
- The configuration-as-code export (CA policies, role assignments) has been stored and the client has it.
- A named date exists for the next adversarial validation cycle.

The engagement is not complete when the list is walked. It is complete when every finding from observation has become a structural change or a named, dated, owned commitment.

---

*Adversarial Validation Checklist. Updated June 2026. Review alongside the field guide — January 2027.*

@@ -0,0 +1,378 @@

# M365 + AD Engagement Checklist

> *Not a benchmark. Not scored. A structured inspection list for consultants on active engagements.*

**Last updated:** June 2026
**Companion to:** [Field Guide 2026](../books/field-guide-2026.md) · [Books I–VI](../books/)
**Next review:** January 2027

---

## How to use this

Work through the relevant sections during the Brownhat Diagnostic or at the start of a module engagement. Each item is a control area — something to inspect and a question to answer honestly. Mark items that surface findings. Mark items that are verified clean. If an item is not applicable, note why.

This is not a scoring tool. "Found" and "clean" are the only states that matter. A clean item with no evidence of testing is the same as not checked.

**Notation used below:**
- `[LOOK AT]` — inspect and document current state
- `[TEST]` — verify by observation, not by reading the config
- `[ASK]` — a question that requires a conversation, not just a portal check

Nothing here replaces the governing question from Book I:
> **If this is owned tonight, what is the largest thing an attacker reaches before hitting a wall — and can I draw that wall?**

---

## Section A — Hybrid Identity
|
||||||
|
|
||||||
|
### A1. Authentication Method
|
||||||
|
|
||||||
|
- `[LOOK AT]` Which authentication method is actually in use: PHS, PTA, or Federation (AD FS)?
|
||||||
|
- `[LOOK AT]` Does the method shown in the Entra portal match what is documented and what IT staff believe to be true?
|
||||||
|
- `[TEST]` If on-prem AD is simulated as unavailable (pull the sync server), does cloud authentication survive? Which auth method does this actually prove?
|
||||||
|
- `[LOOK AT]` Is PHS running alongside PTA as a failover? (Optionality — cheap insurance)
|
||||||
|
- `[LOOK AT]` If on PTA: how many PTA agents are deployed, and what host/network tier are they on?
|
||||||
|
|
||||||
|
### A2. Sync Engine (Entra Connect / Cloud Sync)
|
||||||
|
|
||||||
|
- `[LOOK AT]` Which sync engine is running: Entra Connect Sync or Entra Cloud Sync?
|
||||||
|
- `[LOOK AT]` What server hosts the sync engine, and what domain/tier is it joined to?
|
||||||
|
- `[LOOK AT]` What account runs the on-prem connector service, and does it have `Replicate Directory Changes All` (DCSync capability)?
|
||||||
|
- `[LOOK AT]` What is the patch / update level of the sync server (OS and sync software)?
|
||||||
|
- `[LOOK AT]` Who has local administrator rights on the sync server?
|
||||||
|
- `[LOOK AT]` What does the Entra connector account (Directory Synchronization Accounts role) have permission to do in the cloud?
|
||||||
|
- `[TEST]` If the connector account is monitored: does an alert fire when it authenticates from an unexpected host?
|
||||||
|
- `[LOOK AT]` Are there active alerts or errors in the sync engine health dashboard?
|
||||||
|
|
||||||
|
### A3. AD FS
|
||||||
|
|
||||||
|
- `[LOOK AT]` Is AD FS deployed and active?
|
||||||
|
- `[ASK]` If yes: why is it still running? What relying party trusts require it, and is there a migration plan?
|
||||||
|
- `[LOOK AT]` When was the token-signing certificate last rotated? Where is the private key stored?
|
||||||
|
- `[LOOK AT]` Is the rollover certificate about to expire?
|
||||||
|
- `[LOOK AT]` Which servers host AD FS, and what network tier and patching cadence do they have?
|
||||||
|
- `[TEST]` Golden SAML tabletop: if the token-signing key were obtained, what would detection see, and how fast could the cert be rotated? Is the procedure written and tested?
|
||||||
|
- `[ASK]` Is there an Entra staged rollout in progress or planned to migrate away from federation?
|
||||||
|
|
||||||
|
### A4. Privileged Account Sync
|
||||||
|
|
||||||
|
- `[LOOK AT]` Are any Domain Admins, Enterprise Admins, or other Tier 0 accounts synced to Entra ID (i.e., present as cloud objects)?
|
||||||
|
- `[LOOK AT]` Are Global Admins or other Entra privileged role holders cloud-only accounts, or synced from on-prem?
|
||||||
|
- `[LOOK AT]` Are admin accounts (on-prem or cloud) using the same device for privileged work as for daily tasks (email, browsing)?
|
||||||
|
|
||||||
|
### A5. Writebacks
|
||||||
|
|
||||||
|
- `[LOOK AT]` Which writebacks are enabled: password writeback, group writeback, device writeback?
|
||||||
|
- `[ASK]` For each: who owns the decision, and is the reverse blast radius (cloud compromise → on-prem impact) documented?
|
||||||
|
- `[LOOK AT]` Is group writeback (v2) enabled? If so, which cloud groups write into AD, and what on-prem resources do they gate?
|
||||||
|
|
||||||
|
### A6. Seamless SSO
|
||||||
|
|
||||||
|
- `[LOOK AT]` Is Seamless SSO enabled?
|
||||||
|
- `[LOOK AT]` When was the `AZUREADSSOACC` Kerberos key last rotated? (`Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet`)
|
||||||
|
- `[ASK]` Is Seamless SSO actually needed, or can it be removed (Entra-joined devices + modern auth typically do not require it)?
|
||||||
|
|
||||||
|
### A7. Sync Scope
|
||||||
|
|
||||||
|
- `[LOOK AT]` Is sync scoped to specific OUs, or is "sync everything" the default?
|
||||||
|
- `[LOOK AT]` Are there synced objects that serve no cloud purpose (decommissioned systems, service accounts, administrative accounts)?
|
||||||
|
|
||||||
|
### A8. Breach Optionality
|
||||||
|
|
||||||
|
- `[ASK]` Is there a written, accessible runbook for severing the AD↔Entra bridge under breach conditions?
|
||||||
|
- `[TEST]` Is the runbook stored somewhere accessible when both AD and SharePoint are unavailable?
|
||||||
|
- `[ASK]` Has anyone walked through the "kill the sync" procedure, and does the team know what breaks per auth method?
|
||||||
|
- `[LOOK AT]` Does the cloud admin path (break-glass Global Admin) work with zero on-prem dependency?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Section B — Privileged Access
|
||||||
|
|
||||||
|
### B1. Standing Privilege Inventory
|
||||||
|
|
||||||
|
- `[LOOK AT]` How many identities hold standing (permanent, active) privilege: Global Admin, Privileged Role Admin, Domain Admin, Enterprise Admin?
|
||||||
|
- `[LOOK AT]` Are there any standing Global Admin assignments that are not break-glass accounts? (Should be zero)
|
||||||
|
- `[LOOK AT]` How many Domain Admins and Enterprise Admins exist, and are they all justified with named owners?
|
||||||
|
- `[ASK]` When was the privileged account list last reviewed, and by whom?
|
||||||
|
|
||||||
|
### B2. PIM / JIT
|
||||||
|
|
||||||
|
- `[LOOK AT]` Is Entra PIM deployed and enforced for Entra administrative roles?
|
||||||
|
- `[LOOK AT]` Are Entra roles set to eligible (not active) by default?
|
||||||
|
- `[LOOK AT]` Does PIM activation require phishing-resistant MFA (FIDO2 / certificate), or just push-approve?
|
||||||
|
- `[LOOK AT]` Do crown roles (Privileged Role Administrator, Global Administrator) require approval workflow on PIM activation?
|
||||||
|
- `[LOOK AT]` What is the maximum activation time-box configured? (Should be justified and bounded — 8 hours maximum for a working day)
|
||||||
|
- `[LOOK AT]` Are PIM alerts enabled and reviewed (roles that do not require MFA for activation, roles assigned outside of PIM, too many Global Administrators, etc.)?
|
||||||
|
- `[ASK]` For on-prem DA/EA: is there any JIT or time-limited elevation mechanism in place?
|
||||||
|
|
||||||
|
### B3. Service Accounts (On-Prem)
|
||||||
|
|
||||||
|
- `[LOOK AT]` Are there service accounts with SPNs and static passwords older than 12 months? (Kerberoastable)
|
||||||
|
- `[LOOK AT]` Which service accounts are over-permissioned (e.g., Domain Admin, local admin on all servers)?
|
||||||
|
- `[LOOK AT]` Which service accounts have been migrated to gMSA?
|
||||||
|
- `[LOOK AT]` Are there service accounts nobody can identify a current owner for?
|
||||||
|
- `[TEST]` Run a Kerberoast simulation: do ticket requests for service account SPNs generate any detection?
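
The first check in this list can be run offline against an exported account list. The sketch below is a minimal illustration, not a supplied tool: it assumes the accounts have already been exported to a list of records, and the field names (`name`, `spns`, `password_last_set`) and sample accounts are hypothetical. Any account with an SPN can be targeted; password age is only used here to rank which ones matter most.

```python
from datetime import datetime, timedelta, timezone


def kerberoastable(accounts, now, max_age_days=365):
    """Return names of accounts that have an SPN and a password older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    flagged = []
    for acct in accounts:
        has_spn = bool(acct.get("spns"))
        last_set = datetime.fromisoformat(acct["password_last_set"])
        if has_spn and last_set < cutoff:
            flagged.append(acct["name"])
    return sorted(flagged)


# Hypothetical sample data; replace with the real export.
now = datetime(2026, 6, 1, tzinfo=timezone.utc)
accounts = [
    {"name": "svc-sql", "spns": ["MSSQLSvc/db01.corp.example:1433"],
     "password_last_set": "2021-03-14T00:00:00+00:00"},
    {"name": "svc-backup", "spns": [],
     "password_last_set": "2019-01-01T00:00:00+00:00"},
    {"name": "svc-web", "spns": ["HTTP/web01.corp.example"],
     "password_last_set": "2026-02-01T00:00:00+00:00"},
]

print(kerberoastable(accounts, now))
```

Accounts without an SPN are skipped even when the password is old, because they are not exposed to this particular attack.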
|
||||||
|
|
||||||
|
### B4. Service Principals & App Registrations (Cloud)
|
||||||
|
|
||||||
|
- `[LOOK AT]` Which app registrations hold escalation-grade Graph permissions (application permissions): `RoleManagement.ReadWrite.Directory`, `AppRoleAssignment.ReadWrite.All`, `Application.ReadWrite.All`, `Directory.ReadWrite.All`?
|
||||||
|
- `[LOOK AT]` Which app registrations have non-expiring client secrets?
|
||||||
|
- `[LOOK AT]` Are there orphaned app registrations with no current owner?
|
||||||
|
- `[LOOK AT]` Which apps have tenant-wide admin consent, and is each justified and reviewed?
|
||||||
|
- `[LOOK AT]` Which Azure workloads use client secrets instead of managed identities where managed identities are available?
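
The permission check at the top of this list reduces to a set intersection once the app registrations are exported. A minimal sketch, assuming each app is a record with hypothetical `display_name` and `application_permissions` fields; the permission names are the four listed above.

```python
ESCALATION_PERMISSIONS = {
    "RoleManagement.ReadWrite.Directory",
    "AppRoleAssignment.ReadWrite.All",
    "Application.ReadWrite.All",
    "Directory.ReadWrite.All",
}


def escalation_grade_apps(apps):
    """Map app display name to the escalation-grade application permissions it holds."""
    findings = {}
    for app in apps:
        held = ESCALATION_PERMISSIONS & set(app.get("application_permissions", []))
        if held:
            findings[app["display_name"]] = sorted(held)
    return findings


# Hypothetical sample data; replace with the real export.
apps = [
    {"display_name": "HR Sync", "application_permissions": ["User.Read.All"]},
    {"display_name": "Legacy Provisioner",
     "application_permissions": ["Directory.ReadWrite.All", "Application.ReadWrite.All"]},
]

print(escalation_grade_apps(apps))
```

Only application permissions are considered here; delegated grants would need a separate pass.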
|
||||||
|
|
||||||
|
### B5. Tier Model / Clean Source
|
||||||
|
|
||||||
|
- `[LOOK AT]` Do Domain Admins / Enterprise Admins authenticate from standard workstations used for email and browsing?
|
||||||
|
- `[LOOK AT]` Is ADCS (Active Directory Certificate Services) deployed? If so, is it on a Tier 0 or hardened host, or on a standard server?
|
||||||
|
- `[LOOK AT]` Are there shared administrative jump boxes that cross tier boundaries (used for both Tier 0 and Tier 1 work)?
|
||||||
|
- `[LOOK AT]` Do cloud admins use the same device for privileged Entra work as for daily activity?
|
||||||
|
|
||||||
|
### B6. Escalation Paths
|
||||||
|
|
||||||
|
- `[LOOK AT]` Are there accounts with `GenericAll`, `WriteDACL`, or `WriteOwner` on high-value AD objects (domain root, DCs, admin groups) that are not themselves Tier 0?
|
||||||
|
- `[LOOK AT]` Are there computers with unconstrained delegation enabled (excluding DCs)?
|
||||||
|
- `[LOOK AT]` When was KRBTGT last rotated? (`Get-ADUser krbtgt -Properties PasswordLastSet`)
|
||||||
|
- `[LOOK AT]` Is LAPS (Windows LAPS preferred) deployed across all workstations and servers? What is the coverage percentage?
|
||||||
|
- `[TEST]` Run BloodHound (or equivalent) and count attack paths to Domain Admin. Note the number as a baseline. Is it going up or down over time?
|
||||||
|
|
||||||
|
### B7. Break-Glass
|
||||||
|
|
||||||
|
- `[LOOK AT]` Do cloud-only break-glass Global Admin accounts exist?
|
||||||
|
- `[LOOK AT]` Is phishing-resistant authentication (FIDO2 or certificate) configured on break-glass accounts?
|
||||||
|
- `[LOOK AT]` Are break-glass accounts excluded from the CA policies that would otherwise enforce device compliance or block sign-in?
|
||||||
|
- `[LOOK AT]` Does any use of the break-glass account trigger an immediate, monitored alert?
|
||||||
|
- `[TEST]` Sign in to the break-glass account in a controlled drill. Does it work? Does the alert fire? Does someone respond?
|
||||||
|
- `[ASK]` Where are the break-glass credentials stored, and can they be retrieved without the systems they recover?
|
||||||
|
|
||||||
|
### B8. Phishing-Resistant MFA for Admins
|
||||||
|
|
||||||
|
- `[LOOK AT]` What MFA method is enforced for Global Admins: FIDO2, certificate-based auth, or push/SMS?
|
||||||
|
- `[LOOK AT]` Push-approve and SMS are not acceptable for administrative accounts. If they are in use, that is a P0.
|
||||||
|
- `[LOOK AT]` Is there a CA policy restricting privileged role activation to compliant/managed devices or named PAWs?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Section C — Devices & Endpoint
|
||||||
|
|
||||||
|
### C1. Fleet Reality
|
||||||
|
|
||||||
|
- `[LOOK AT]` Reconcile: Intune enrolled devices vs. Entra registered devices vs. sign-in log device population. What is the gap?
|
||||||
|
- `[LOOK AT]` How many sign-in events in the last 30 days came from non-compliant or unmanaged devices (device compliance state = unknown or non-compliant in sign-in logs)?
|
||||||
|
- `[LOOK AT]` Are there legacy-protocol sign-ins (Basic Auth) that bypass Conditional Access entirely? (Sign-in logs, filter Client App = "Exchange ActiveSync," "Other clients")
|
||||||
|
- `[LOOK AT]` How many BYOD / personal devices are accessing corporate data through the web client or OWA (known-unmanaged population)?
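
The reconciliation in the first check is set arithmetic over device identifiers. The sketch below assumes the three populations have been exported as lists of device IDs that are comparable across sources; the bucket names and sample IDs are illustrative only.

```python
def reconcile_fleet(intune_ids, entra_ids, signin_ids):
    """Compare three device populations and return the gaps between them."""
    intune, entra, signin = set(intune_ids), set(entra_ids), set(signin_ids)
    return {
        "registered_not_managed": sorted(entra - intune),
        "signing_in_not_managed": sorted(signin - intune),
        "signing_in_not_registered": sorted(signin - entra),
        "managed_but_silent": sorted(intune - signin),
    }


# Hypothetical sample data; replace with the real exports.
gaps = reconcile_fleet(
    intune_ids=["d1", "d2", "d3"],
    entra_ids=["d1", "d2", "d3", "d4"],
    signin_ids=["d1", "d4", "d9"],
)
for bucket, devices in gaps.items():
    print(bucket, devices)
```

The `signing_in_not_managed` bucket is the population Conditional Access has to handle without a compliance signal.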
|
||||||
|
|
||||||
|
### C2. Join State and Management Mode
|
||||||
|
|
||||||
|
- `[LOOK AT]` Are devices Entra-joined, hybrid Entra-joined, or Entra-registered (BYOD)?
|
||||||
|
- `[LOOK AT]` Is hybrid Entra join still in use? If so, which on-prem dependencies actually require it?
|
||||||
|
- `[LOOK AT]` Is there a roadmap to go cloud-native (Entra join + Intune) for devices currently on hybrid join?
|
||||||
|
- `[LOOK AT]` Are there GPO and Intune co-management conflicts producing inconsistent configuration?
|
||||||
|
|
||||||
|
### C3. Conditional Access Enforcement
|
||||||
|
|
||||||
|
- `[TEST]` For every CA policy that enforces device compliance or blocks legacy auth: run real sign-ins with expected outcomes written down beforehand. Does the observed result match?
|
||||||
|
- `[TEST]` If a policy looks correct but does not enforce: recreate from scratch, re-test. Document ghost policy findings.
|
||||||
|
- `[LOOK AT]` Is there a CA policy blocking legacy authentication protocols across all apps? (This is the single highest-leverage CA policy — if not in place, that is P0)
|
||||||
|
- `[LOOK AT]` Is there a CA policy requiring MFA for all admin role activations?
|
||||||
|
- `[LOOK AT]` Is there a CA policy requiring compliant or managed device for access to sensitive workloads?
|
||||||
|
- `[LOOK AT]` Are break-glass accounts and emergency service accounts correctly excluded from blocking CA policies?
|
||||||
|
- `[TEST]` Simulate an admin lockout with the policy in report-only mode (force a compliance failure on an admin test account). Confirm break-glass is excluded from the policy. Confirm a legitimate admin would get the expected failure and knows the escalation path.
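
Writing the expected outcome down before each sign-in is easier to enforce when the test plan is a small data structure. This is a sketch of one possible shape; the class, field names, and sample rows are hypothetical, and the observed value is filled in from the sign-in log after the test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignInTest:
    policy: str
    scenario: str
    expected: str  # written down before the test is run
    observed: str  # taken from the sign-in log afterwards


def mismatches(tests):
    """Return the tests whose observed outcome differs from the expected one."""
    return [t for t in tests if t.expected != t.observed]


# Hypothetical sample plan.
tests = [
    SignInTest("Block legacy auth", "IMAP from test mailbox", "blocked", "blocked"),
    SignInTest("Require compliant device", "Unmanaged laptop to SharePoint", "blocked", "allowed"),
]

for t in mismatches(tests):
    print(f"{t.policy}: expected {t.expected}, observed {t.observed} ({t.scenario})")
```

Any row returned by `mismatches` is a candidate ghost policy and should be recreated and re-tested.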
|
||||||
|
|
||||||
|
### C4. Compliance Signal Quality
|
||||||
|
|
||||||
|
- `[LOOK AT]` What is the compliance check-in cadence? (The window where a fallen-out device still holds a "compliant" token)
|
||||||
|
- `[LOOK AT]` Is Continuous Access Evaluation (CAE) enabled for workloads that support it? (Narrows the stale-token window)
|
||||||
|
- `[ASK]` Is root/jailbreak detection in compliance policy, and how is it treated — as a hard block or a risk signal? Is it believed to be a wall or a tripwire?
|
||||||
|
- `[TEST]` Root or jailbreak a test device to force it out of compliance. How long until the compliance signal flips? Does CA revoke access?
|
||||||
|
|
||||||
|
### C5. Endpoint Privilege
|
||||||
|
|
||||||
|
- `[LOOK AT]` Do standard users have standing local admin on their endpoints?
|
||||||
|
- `[LOOK AT]` Is Endpoint Privilege Management (EPM) deployed, or is there a JIT elevation mechanism for tasks requiring admin rights?
|
||||||
|
- `[LOOK AT]` Is Windows LAPS deployed across the fleet? Is legacy LAPS still in use (to be migrated)?
|
||||||
|
- `[LOOK AT]` Are there shared local admin accounts with common passwords across multiple machines?
|
||||||
|
|
||||||
|
### C6. Update and Patch Velocity
|
||||||
|
|
||||||
|
- `[LOOK AT]` Is Windows Autopatch in use (for update ring management)?
|
||||||
|
- `[LOOK AT]` Are Intune update rings configured with pilot, broad, and deferral stages?
|
||||||
|
- `[ASK]` Is there a named person with the authority and procedure to halt a broad update ring push? Has this been tested?
|
||||||
|
- `[LOOK AT]` What is the current patch lag for the fleet (how many devices are 30+ days behind on OS updates)?
|
||||||
|
|
||||||
|
### C7. MAM / App Protection (BYOD)
|
||||||
|
|
||||||
|
- `[TEST]` On iOS: attempt copy/paste from managed Outlook/Teams to an unmanaged app. Does it block?
|
||||||
|
- `[TEST]` On Android: same test, separately — behavior is not symmetric with iOS.
|
||||||
|
- `[TEST]` Attempt to "Open in" from a managed attachment to an unmanaged app on each platform.
|
||||||
|
- `[TEST]` Attempt to save to local storage or sync to a personal cloud (iCloud, Google Drive).
|
||||||
|
- `[LOOK AT]` Are managed browsers enforced for SharePoint/OWA access on BYOD, or can users access via any browser?
|
||||||
|
|
||||||
|
### C8. Autopilot and Enrollment Trust
|
||||||
|
|
||||||
|
- `[LOOK AT]` Is the Autopilot device list audited? Are there stale or unknown device registrations?
|
||||||
|
- `[LOOK AT]` Are enrollment restrictions in place to prevent unauthorized device enrollment?
|
||||||
|
- `[TEST]` Time a wipe-and-reprovision on a corporate device via Autopilot. Is the "replaceable in an hour" claim accurate?
|
||||||
|
- `[LOOK AT]` Is the PRT (Primary Refresh Token) TPM-bound on Windows devices?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Section D — Data & Collaboration
|
||||||
|
|
||||||
|
### D1. Sharing Posture
|
||||||
|
|
||||||
|
- `[LOOK AT]` What is the tenant-level external sharing setting in SharePoint Admin Center?
|
||||||
|
- `[LOOK AT]` Are "Anyone with the link" anonymous shares enabled at the tenant level?
|
||||||
|
- `[TEST]` Enumerate existing anonymous links across the tenant. Can you produce the list? How large is it?
|
||||||
|
- `[LOOK AT]` Are per-site sharing settings more permissive than the tenant default? (Sites can override upward)
|
||||||
|
- `[LOOK AT]` Are sharing expiration policies configured for anonymous and external links?
|
||||||
|
- `[TEST]` Share a document to a test external guest and attempt to reshare onward. Can you track the second-hop share?
|
||||||
|
|
||||||
|
### D2. Guest Access
|
||||||
|
|
||||||
|
- `[LOOK AT]` How many active guests exist in the tenant?
|
||||||
|
- `[LOOK AT]` How many guests have not signed in for 90+ days?
|
||||||
|
- `[LOOK AT]` Are access reviews configured for guest accounts? What is the review cadence and the default action on non-response?
|
||||||
|
- `[LOOK AT]` Do guests have broader access than the project they were invited for (i.e., access to Teams/channels beyond their original scope)?
|
||||||
|
- `[LOOK AT]` Are external identities governed by specific B2B collaboration settings, or is the default (all external domains) allowed?
|
||||||
|
|
||||||
|
### D3. Email Security
|
||||||
|
|
||||||
|
- `[TEST]` Enumerate external auto-forwarding rules at the transport level (`Get-TransportRule`). Are there any active rules forwarding externally without a documented business owner?
|
||||||
|
- `[TEST]` Enumerate Inbox rules on executive / privileged user mailboxes forwarding externally. (`Get-InboxRule`)
|
||||||
|
- `[LOOK AT]` Is the global "allow automatic forwarding" setting disabled in Remote Domains for the Default domain?
|
||||||
|
- `[LOOK AT]` Are anti-phishing policies configured? Is impersonation protection enabled for executives and key domains?
|
||||||
|
- `[LOOK AT]` Is DKIM signing enabled for all sending domains?
|
||||||
|
- `[LOOK AT]` Is DMARC configured (policy `reject` or `quarantine`), and is the SPF record current?
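
The DMARC check can be scripted once the TXT record for `_dmarc.<domain>` has been retrieved. The sketch below only parses a record string that is already in hand; it performs no DNS lookup, and the sample records are made up.

```python
def parse_dmarc(record):
    """Split a DMARC TXT record into a tag-to-value dict."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_enforced(record):
    """True when the record is DMARC1 and the policy is quarantine or reject."""
    tags = parse_dmarc(record)
    return tags.get("v") == "DMARC1" and tags.get("p", "").lower() in {"quarantine", "reject"}


# Hypothetical sample records.
print(dmarc_enforced("v=DMARC1; p=reject; rua=mailto:dmarc@example.com"))
print(dmarc_enforced("v=DMARC1; p=none; rua=mailto:dmarc@example.com"))
```

A policy of `none` is monitoring only, so it is reported as not enforced.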
|
||||||
|
|
||||||
|
### D4. Crown Jewels
|
||||||
|
|
||||||
|
- `[ASK]` Can the client name the five data sets that, if exfiltrated, would cause the most damage?
|
||||||
|
- `[LOOK AT]` Where do the crown jewels live (SharePoint sites, mailboxes, OneDrive, Teams channels)?
|
||||||
|
- `[LOOK AT]` Who has access to the crown-jewel locations? Is access reviewed periodically?
|
||||||
|
- `[LOOK AT]` Are the crown-jewel locations labeled with sensitivity labels that carry encryption?
|
||||||
|
- `[LOOK AT]` Are audit logs turned on and retained long enough to reconstruct access to crown-jewel locations?
|
||||||
|
|
||||||
|
### D5. Sensitivity Labels and DLP
|
||||||
|
|
||||||
|
- `[LOOK AT]` Are sensitivity labels deployed in the tenant? What is the coverage across the most-used content types (email, files)?
|
||||||
|
- `[LOOK AT]` Are labels configured with encryption for the highest sensitivity tiers?
|
||||||
|
- `[LOOK AT]` Is auto-labeling deployed for known crown-jewel content types (if licensed for M365 E5 Compliance)?
|
||||||
|
- `[LOOK AT]` Is DLP deployed? Is it scoped to specific known-value patterns (regulated data, PII, crown-jewel keywords) or applied as a broad dragnet generating noise?
|
||||||
|
- `[TEST]` Exfiltrate a labeled test document via email to an external address. Does DLP fire? Does the label encryption hold on the received document?
|
||||||
|
|
||||||
|
### D6. Collaboration Sprawl
|
||||||
|
|
||||||
|
- `[LOOK AT]` Is there ungoverned self-service creation of Teams and SharePoint sites?
|
||||||
|
- `[LOOK AT]` Are there orphaned or inactive Teams/sites that still hold data and have no active owner?
|
||||||
|
- `[LOOK AT]` Are there Teams channels or SharePoint sites with "Everyone" or broad internal membership grants on sensitive data?
|
||||||
|
- `[LOOK AT]` Is late-joiners' access to Team history governed (a user joining a Team today can read all prior messages by default)?
|
||||||
|
|
||||||
|
### D7. OAuth App Consent
|
||||||
|
|
||||||
|
- `[LOOK AT]` Is user consent for OAuth apps restricted (users cannot consent to app permission requests without admin approval)?
|
||||||
|
- `[LOOK AT]` Are there existing grants for apps holding `Mail.Read`, `Files.ReadWrite.All`, or equivalent sensitive scopes by non-first-party apps?
|
||||||
|
- `[LOOK AT]` Is Microsoft's app governance capability (part of Defender for Cloud Apps) enabled? Are risky app alerts configured?
|
||||||
|
|
||||||
|
### D8. Audit Logging
|
||||||
|
|
||||||
|
- `[LOOK AT]` Is Unified Audit Logging enabled (confirm in Purview Compliance Center > Audit)?
|
||||||
|
- `[LOOK AT]` What is the audit retention period, given the client's licensing?
|
||||||
|
- `[TEST]` Run a sample audit query on a known recent activity and verify log entries are present. Do not assume the log is on without testing it.
|
||||||
|
- `[LOOK AT]` Are admin operations (role assignment changes, app consent, CA policy changes) captured in the audit log?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Section E — Recovery & Detection
|
||||||
|
|
||||||
|
### E1. Backup and Recovery
|
||||||
|
|
||||||
|
- `[ASK]` What is the recovery path if a Global Admin deletes all Exchange Online mailboxes and SharePoint sites? Be specific about process, tool, and time estimate.
|
||||||
|
- `[LOOK AT]` Is there a third-party M365 backup solution covering Exchange, SharePoint, OneDrive, and Teams?
|
||||||
|
- `[LOOK AT]` Are M365 backups isolated from the estate they protect (immutable, separate authentication domain)?
|
||||||
|
- `[TEST]` When was the last successful restore from backup, and how long did it take? Restore a test mailbox or a file share and time it. This is the MTTR.
|
||||||
|
- `[LOOK AT]` Are on-prem AD backups (System State) taken regularly, stored offline, and verified?
|
||||||
|
- `[TEST]` Can the current backup restore an AD domain if all DCs are destroyed? Has anyone run the forest recovery procedure, even in a lab?
|
||||||
|
|
||||||
|
### E2. Configuration-as-Code (Known-Good Baseline)
|
||||||
|
|
||||||
|
- `[LOOK AT]` Have CA policies been exported to code/JSON (e.g., using CAExporter)?
|
||||||
|
- `[LOOK AT]` Has the Entra role assignment state been captured as a document?
|
||||||
|
- `[LOOK AT]` Has the Intune baseline configuration been exported?
|
||||||
|
- `[LOOK AT]` Is there a diff between the opening state and current state for any changes made during the engagement?
|
||||||
|
- `[ASK]` If the tenant CA policies were silently modified by an attacker, would anyone know? Is there drift detection against the known-good?
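
Drift detection against the known-good baseline can start as a comparison of two exported snapshots. The sketch assumes each export has been loaded into a dict keyed by policy display name; that layout is an assumption for illustration and not a description of CAExporter's actual output format.

```python
def policy_drift(baseline, current):
    """Compare two {policy_name: policy_dict} snapshots and report the differences."""
    base_names, cur_names = set(baseline), set(current)
    return {
        "added": sorted(cur_names - base_names),
        "removed": sorted(base_names - cur_names),
        "changed": sorted(n for n in base_names & cur_names if baseline[n] != current[n]),
    }


# Hypothetical snapshots; replace with the loaded JSON exports.
baseline = {
    "Block legacy auth": {"state": "enabled", "grantControls": {"builtInControls": ["block"]}},
    "Require MFA for admins": {"state": "enabled"},
}
current = {
    "Block legacy auth": {"state": "enabledForReportingButNotEnforced",
                          "grantControls": {"builtInControls": ["block"]}},
    "Require MFA for admins": {"state": "enabled"},
    "Temp exclusion": {"state": "enabled"},
}

print(policy_drift(baseline, current))
```

A policy silently switched from enforced to report-only shows up under `changed`, which is the case the question above is asking about.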
|
||||||
|
|
||||||
|
### E3. Recovery Path Independence
|
||||||
|
|
||||||
|
- `[LOOK AT]` Does any part of the recovery runbook depend on the system it recovers (e.g., runbook stored in SharePoint, backup auth via the compromised AD)?
|
||||||
|
- `[LOOK AT]` Are recovery credentials (break-glass, backup admin accounts) accessible independently of the estate?
|
||||||
|
- `[LOOK AT]` Is the AD forest recovery runbook stored offline or in a location that survives domain destruction?
|
||||||
|
- `[ASK]` If both AD and M365 were simultaneously unavailable, what is the recovery sequencing? Is that decision documented?
|
||||||
|
|
||||||
|
### E4. Detection: Signal Quality
|
||||||
|
|
||||||
|
- `[LOOK AT]` Break-glass account use: is there an alert? Is it monitored by a named person?
|
||||||
|
- `[LOOK AT]` New Global Admin assignment: does an alert fire?
|
||||||
|
- `[LOOK AT]` DCSync from a non-DC host: is this detected (Defender for Identity or SIEM rule)?
|
||||||
|
- `[LOOK AT]` Impossible-travel sign-in for admin accounts: is an Entra ID Protection sign-in risk policy configured and alerting?
|
||||||
|
- `[LOOK AT]` External auto-forward rule creation: is this generating an alert?
|
||||||
|
- `[LOOK AT]` Mass download from SharePoint/OneDrive: is there a Defender for Cloud Apps or Purview policy detecting it?
|
||||||
|
- `[LOOK AT]` New OAuth consent grant to sensitive scopes: is this alerting?
|
||||||
|
- `[LOOK AT]` PIM activation outside business hours: is this logged and reviewed?
|
||||||
|
- `[TEST]` For each configured detection: simulate the event (in a controlled, authorized test context) and confirm the alert fires, is received by a named person, and generates a response within the expected SLA.
|
||||||
|
|
||||||
|
### E5. Detection: Noise and Action
|
||||||
|
|
||||||
|
- `[ASK]` How many alerts does the monitoring system generate per day? How many are triaged vs. suppressed vs. missed?
|
||||||
|
- `[ASK]` For the last three security incidents or notable alerts: what structural change resulted? If the answer is "we sent an awareness email" or "we noted it," the feedback loop is broken.
|
||||||
|
- `[LOOK AT]` Is there a named owner for each alert category? An alert without a named owner is an unread alert.
|
||||||
|
- `[ASK]` Is there a blameless post-incident process? Do people surface incidents, or do they bury them to avoid blame?
|
||||||
|
|
||||||
|
### E6. Game-Days and Drills
|
||||||
|
|
||||||
|
- `[ASK]` When was the last deliberate test of recovery or detection (a drill, tabletop, or game-day)?
|
||||||
|
- `[TEST]` Break-glass drill: sign in, confirm it works, confirm the alert fires. Document the test and the result.
|
||||||
|
- `[TEST]` CA policy enforcement drill: force a non-compliant state on a test user. Confirm the expected outcome and that break-glass bypasses the gate.
|
||||||
|
- `[ASK]` Has the client ever run a ransomware tabletop that assumes Tier 0 is owned? What did they find?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Section F — Quick-Win Inventory
|
||||||
|
|
||||||
|
Use this section to capture findings that can be addressed in the same session or within the engagement without additional scoping.
|
||||||
|
|
||||||
|
Each condition below, if found, is typically fixed in under an hour and reduces blast radius immediately. Do not leave these open for the next engagement.
|
||||||
|
|
||||||
|
| Control | Condition that makes it a quick win |
|
||||||
|
|---------|-------------------------------------|
|
||||||
|
| Tenant-level anonymous sharing | "Anyone" links enabled at tenant level — one toggle |
|
||||||
|
| External auto-forwarding | Global block not set — one Exchange setting |
|
||||||
|
| Legacy auth CA policy | No policy blocking legacy auth — deploy baseline CA policy |
|
||||||
|
| Break-glass alert | Break-glass use not alerting — configure alert rule |
|
||||||
|
| Global admins audit | Standing synced GAs — identify and initiate migration |
|
||||||
|
| KRBTGT age | Password not set in 365+ days — document and schedule rotation |
|
||||||
|
| Stale admin accounts | Inactive or unreviewed admin accounts — disable and document |
|
||||||
|
| Audit log | Not enabled — turn on (one click in Purview) |
|
||||||
|
| PIM not deployed | P2 licensed but PIM off — scope activation as P1 |
|
||||||
|
| No CA blocking admin sign-in from personal devices | Missing policy — create report-only immediately, test and enable |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Engagement Close — Structural Change Verification
|
||||||
|
|
||||||
|
At the close of each engagement or module, confirm:
|
||||||
|
|
||||||
|
1. Which items above were found to be fragile?
|
||||||
|
2. For each: what **structural change** was made (not documented, not accepted, but changed)?
|
||||||
|
3. Which items were tested by observation (not just inspected)?
|
||||||
|
4. Which items are open and in the risk register with a named owner and a timeline?
|
||||||
|
5. Has the configuration-as-code baseline been exported and stored?
|
||||||
|
6. Has the break-glass been tested?
|
||||||
|
7. Is there a named date for the next review of this checklist?
|
||||||
|
|
||||||
|
The work is not complete when the list is walked. It is complete when fragility found has become structure changed.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Engagement Checklist. Updated June 2026. Review and update alongside the Field Guide — January 2027.*
|
||||||
@@ -0,0 +1,380 @@
|
|||||||
|
# Self-Service Security Cadence
|
||||||
|
|
||||||
|
> *What you run between our engagements. When something in here surprises you, that's when you call us.*
|
||||||
|
|
||||||
|
**Last updated:** June 2026
|
||||||
|
**Produced by:** [engagement name / consultant name]
|
||||||
|
**For:** [client name] — [named admin / IT lead]
|
||||||
|
**Next full engagement:** [date or "TBD"]
|
||||||
|
**Next review of this document:** January 2027
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What this is
|
||||||
|
|
||||||
|
We ran the adversarial validation. We fixed the structural issues we found. The work does not stop when we leave.
|
||||||
|
|
||||||
|
This document is your recurring checklist — things you can run yourself, with the tools we set up, on a regular cadence. None of it requires a security background. Most of it takes under an hour per month. The point is to catch drift before it becomes a problem, and to know when to call us before it becomes a crisis.
|
||||||
|
|
||||||
|
**The most important thing:** when something in here produces a result that surprises you, do not sit on it. Log it, screenshot it, and send it to us. The earlier we see a problem, the cheaper it is to fix.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Tools you need (all installed during the engagement)
|
||||||
|
|
||||||
|
| Tool | What it does | Where to get it |
|
||||||
|
|------|-------------|-----------------|
|
||||||
|
| **PingCastle** | Scans Active Directory and produces a security report with a score and specific findings | [pingcastle.com](https://www.pingcastle.com) — free Community edition |
|
||||||
|
| **Purple Knight** | Scans Active Directory for indicators of exposure — simpler output than PingCastle, good complement | [purple-knight.com](https://www.purple-knight.com) — free |
|
||||||
|
| **CAExporter** | Exports all Conditional Access policies to JSON files you can compare over time | [github.com/vibecoding/CAExporter](https://github.com/vibecoding/CAExporter) |
|
||||||
|
| **Microsoft Graph PowerShell** | The PowerShell module for the scripts in this document | `Install-Module Microsoft.Graph` |
|
||||||
|
| **Microsoft 365 Defender portal** | Your alert queue and Secure Score | [security.microsoft.com](https://security.microsoft.com) |
|
||||||
|
| **Microsoft Entra portal** | Your identity dashboard | [entra.microsoft.com](https://entra.microsoft.com) |
|
||||||
|
|
||||||
|
The scripts in this document are saved in `[location agreed during engagement — e.g., C:\SecurityRunbook\Scripts\]`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Monthly checks — 30 to 45 minutes, portal-based
|
||||||
|
|
||||||
|
Do these on the first working day of each month. They require no special tools — just a browser logged in as a Global Admin or Security Reader.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### M1. Microsoft Secure Score
|
||||||
|
|
||||||
|
**Where:** [Microsoft 365 Defender portal](https://security.microsoft.com) > Secure Score
|
||||||
|
|
||||||
|
**What to do:**
|
||||||
|
1. Note the current score.
|
||||||
|
2. Compare to last month's score (the history graph shows it).
|
||||||
|
3. Look at the "Recommended actions" tab — filter to "Not addressed."
|
||||||
|
4. Any new items that appeared since last month? Note them.
|
||||||
|
|
||||||
|
**What you are looking for:** Score going down month-over-month without a known reason. New recommended actions you did not create. Completed actions that have reverted to "not addressed" (this means configuration drifted back).
|
||||||
|
|
||||||
|
**Call us if:** Score drops more than 5 points in a month without a documented reason, or if a completed action you remember implementing shows as "not addressed."
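
If you keep the monthly score in a simple log, the "more than 5 points" trigger can be checked automatically. This is an optional sketch, assuming you record one `(month, score)` pair per check; the sample numbers are made up.

```python
def score_drops(history, threshold=5.0):
    """Return (month, drop) pairs where the score fell by more than threshold since the previous entry."""
    drops = []
    for (_, prev), (month, score) in zip(history, history[1:]):
        if prev - score > threshold:
            drops.append((month, round(prev - score, 1)))
    return drops


# Hypothetical log, oldest entry first.
history = [("2026-04", 71.2), ("2026-05", 70.0), ("2026-06", 63.5)]
print(score_drops(history))
```

Any month this returns needs either a documented reason or a call to us.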
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### M2. Entra ID Recommendations
|
||||||
|
|
||||||
|
**Where:** [Entra portal](https://entra.microsoft.com) > Overview > Recommendations
|
||||||
|
|
||||||
|
**What to do:**
|
||||||
|
1. Look at all open recommendations.
|
||||||
|
2. Note any that are new since last month.
|
||||||
|
3. Note the impact rating (High / Medium / Low) on new ones.
|
||||||
|
|
||||||
|
**What you are looking for:** New high-impact recommendations that appeared since last month. Specifically watch for anything related to admin accounts, Conditional Access, legacy authentication, or risky sign-ins.
|
||||||
|
|
||||||
|
**Call us if:** Any new High-impact recommendation appears. We will help you assess whether to act immediately or schedule it.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### M3. Sign-in risk review
|
||||||
|
|
||||||
|
**Where:** Entra portal > Identity Protection > Risky sign-ins
|
||||||
|
|
||||||
|
**What to do:**
|
||||||
|
1. Filter to the last 30 days.
|
||||||
|
2. Look at sign-ins with risk level "High" that were not dismissed or remediated.
|
||||||
|
3. For any admin account (Global Admin, Exchange Admin, Security Admin) with any risky sign-in event — investigate before dismissing.
|
||||||
|
|
||||||
|
**What you are looking for:** Admin accounts appearing in the risky sign-in list. Any high-risk sign-in that auto-remediated (meaning the user passed an MFA challenge) where the geography or device does not make sense.
|
||||||
|
|
||||||
|
**Call us if:** Any admin account has a risky sign-in event. Any high-risk event that was remediated from an unexpected location.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### M4. Alert queue health
|
||||||
|
|
||||||
|
**Where:** Microsoft 365 Defender portal > Incidents & alerts > Alerts
|
||||||
|
|
||||||
|
**What to do:**
|
||||||
|
1. Filter to "New" and "In progress" alerts.
|
||||||
|
2. How many are sitting open for more than 48 hours?
|
||||||
|
3. Are there categories of alert that appear repeatedly? (Recurring alerts on the same user or asset are a pattern, not noise.)
|
||||||
|
|
||||||
|
**What you are looking for:** Alert queue growing over time without being worked. The same alert firing repeatedly on the same account or resource. Any alert tagged as "High severity" that is more than 24 hours old without assignment.
|
||||||
|
|
||||||
|
**Call us if:** A High-severity alert is more than 24 hours old and you do not know what to do with it. Or if the same alert keeps firing on the same account.
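
The two age thresholds in this check (48 hours open, 24 hours for unassigned High severity) can be applied to an exported alert list. A minimal sketch; the field names (`id`, `status`, `severity`, `created`, `assigned_to`) and the sample alerts are hypothetical and will need mapping to whatever your export contains.

```python
from datetime import datetime, timedelta, timezone

OPEN_STATES = {"New", "In progress"}


def overdue_alerts(alerts, now):
    """Return IDs of open alerts past the 48-hour and 24-hour (High, unassigned) thresholds."""
    stale, urgent = [], []
    for alert in alerts:
        if alert["status"] not in OPEN_STATES:
            continue
        age = now - datetime.fromisoformat(alert["created"])
        if age > timedelta(hours=48):
            stale.append(alert["id"])
        if alert["severity"] == "High" and not alert.get("assigned_to") and age > timedelta(hours=24):
            urgent.append(alert["id"])
    return {"open_over_48h": stale, "high_unassigned_over_24h": urgent}


# Hypothetical sample data; replace with the real export.
now = datetime(2026, 6, 1, 9, 0, tzinfo=timezone.utc)
alerts = [
    {"id": "A1", "status": "New", "severity": "High", "created": "2026-05-30T08:00:00+00:00"},
    {"id": "A2", "status": "In progress", "severity": "Medium", "created": "2026-05-31T20:00:00+00:00"},
    {"id": "A3", "status": "Resolved", "severity": "High", "created": "2026-05-01T08:00:00+00:00"},
    {"id": "A4", "status": "New", "severity": "High", "created": "2026-05-31T07:00:00+00:00"},
]
print(overdue_alerts(alerts, now))
```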
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### M5. New admin assignments
|
||||||
|
|
||||||
|
**Where:** Entra portal > Identity > Roles & admins > All roles > Global Administrator > Assignments
|
||||||
|
|
||||||
|
**What to do:**
|
||||||
|
1. Check the current member list against last month's.
|
||||||
|
2. Any new members? Were they expected?
|
||||||
|
3. Check at minimum: Global Administrator, Exchange Administrator, Security Administrator, SharePoint Administrator.
|
||||||
|
|
||||||
|
**What you are looking for:** Anyone in a privileged role who should not be, or who appeared without a formal request.
|
||||||
|
|
||||||
|
**Call us if:** Any new privileged role assignment you did not authorize or do not recognize.
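
One way to make the month-over-month comparison in step 1 repeatable is to save each month's member list and diff it. This is a minimal sketch, assuming you keep a snapshot with one row per role member; the function name `Compare-RoleSnapshot` and the `Role` and `UPN` column names are hypothetical and not part of any tool installed during the engagement.

```powershell
function Compare-RoleSnapshot {
    param(
        # Rows with Role and UPN properties, e.g. from Import-Csv (assumed layout)
        [Parameter(Mandatory)] [AllowEmptyCollection()] [object[]] $Previous,
        [Parameter(Mandatory)] [AllowEmptyCollection()] [object[]] $Current
    )

    # Build "Role|UPN" keys so the same account in two roles is tracked per role
    $prevKeys = @($Previous | ForEach-Object { "$($_.Role)|$($_.UPN)" })
    $currKeys = @($Current  | ForEach-Object { "$($_.Role)|$($_.UPN)" })

    # Present now but not last month: a new assignment
    foreach ($key in $currKeys) {
        if ($key -notin $prevKeys) {
            $role, $upn = $key -split '\|', 2
            [PSCustomObject]@{ Role = $role; UPN = $upn; Change = 'Added' }
        }
    }

    # Present last month but not now: a removed assignment
    foreach ($key in $prevKeys) {
        if ($key -notin $currKeys) {
            $role, $upn = $key -split '\|', 2
            [PSCustomObject]@{ Role = $role; UPN = $upn; Change = 'Removed' }
        }
    }
}
```

Feed it two snapshots (for example two `Import-Csv` results) and pipe the output to `Format-Table`. Every `Added` row is an assignment to check against a formal request.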
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### M6. Break-glass confirmation (30 seconds)
|
||||||
|
|
||||||
|
**What to do:**
|
||||||
|
1. Confirm the break-glass account credentials are still in the agreed storage location.
|
||||||
|
2. Confirm the contact for "break-glass alert fired" is still the right person.
|
||||||
|
|
||||||
|
Do not log in to the break-glass account during this check — any sign-in triggers an alert. Just confirm the credentials are accessible.
|
||||||
|
|
||||||
|
**Call us if:** Credentials cannot be found. Or if the break-glass alert fires without a drill scheduled.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Quarterly checks — 2 to 3 hours, tools required
|
||||||
|
|
||||||
|
Do these in the first week of each quarter (January, April, July, October). These require running the installed tools and saving the output.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Q1. PingCastle AD scan
|
||||||
|
|
||||||
|
**How to run:**
|
||||||
|
1. Log in to the domain controller (or any domain-joined machine) as a Domain Admin.
|
||||||
|
2. Run `PingCastle.exe --healthcheck --server <your-domain-FQDN>`.
|
||||||
|
3. It produces an HTML report. Save it to `[agreed location]` with the date in the filename: `PingCastle-2026-Q3.html`.
|
||||||
|
4. Open the report and note the score and any findings marked "Critical" or "High."
|
||||||
|
5. Compare to the previous quarter's report: is the score going up or down? PingCastle scores risk from 0 to 100 (lower is better), so a rising score is the bad direction.
|
||||||
|
|
||||||
|
**What you are looking for:** Score trending up quarter-over-quarter (PingCastle's score measures risk, so lower is better). New Critical or High findings that were not present last quarter. Specifically watch the "Stale Objects" section (accounts nobody uses) and the "Privileged Accounts" section.
|
||||||
|
|
||||||
|
**Call us if:** The score rises by more than 10 points since last quarter (a higher PingCastle score means more risk). Any new Critical finding. Any finding in the "Privileged Accounts" category that was clean last quarter.
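
The quarter-to-quarter comparison can be scripted rather than read off two HTML pages. The sketch below assumes PingCastle also wrote an XML report next to the HTML one, and that the XML has a `HealthcheckData` root with a `GlobalScore` element; verify both against your own output before relying on it. The function name `Get-PingCastleScoreChange` is hypothetical. It reports the signed change and flags any movement larger than the threshold in either direction, leaving the interpretation to whoever reads the report.

```powershell
function Get-PingCastleScoreChange {
    param(
        [Parameter(Mandatory)] [string] $PreviousXml,  # last quarter's XML report
        [Parameter(Mandatory)] [string] $CurrentXml,   # this quarter's XML report
        [int] $Threshold = 10
    )

    # Assumption: <HealthcheckData><GlobalScore>NN</GlobalScore>...</HealthcheckData>
    $prev = [int]([xml](Get-Content -Path $PreviousXml -Raw)).HealthcheckData.GlobalScore
    $curr = [int]([xml](Get-Content -Path $CurrentXml  -Raw)).HealthcheckData.GlobalScore
    $delta = $curr - $prev

    [PSCustomObject]@{
        Previous = $prev
        Current  = $curr
        Delta    = $delta                               # positive means the score went up
        Escalate = [math]::Abs($delta) -gt $Threshold   # moved more than the threshold
    }
}
```

Record `Current` and `Delta` in the tracking spreadsheet alongside the saved report.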
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Q2. Purple Knight AD scan
|
||||||
|
|
||||||
|
**How to run:**
|
||||||
|
1. Download and run Purple Knight on a domain-joined machine with Domain Admin credentials.
|
||||||
|
2. It is a GUI tool — click through the scan, wait for it to finish.
|
||||||
|
3. Save the PDF report with the date: `PurpleKnight-2026-Q3.pdf`.
|
||||||
|
4. Look at the "Identity Security Indicators" with status "Exposed" or "Critical."
|
||||||
|
5. Compare to the previous quarter.
|
||||||
|
|
||||||
|
**What you are looking for:** New exposed indicators that did not appear last quarter. Any indicator flagged as Critical. The tool is organized by MITRE ATT&CK category — pay particular attention to "Credential Access" and "Privilege Escalation."
|
||||||
|
|
||||||
|
**Call us if:** Any new Critical indicator. Or if the same Medium indicators keep appearing quarter after quarter without being resolved (this means the fix did not stick).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Q3. KRBTGT and AZUREADSSOACC age check
|
||||||
|
|
||||||
|
**How to run:** Open PowerShell as Domain Admin and run the following:
|
||||||
|
|
||||||
|
```powershell
Write-Host "=== KRBTGT ===" -ForegroundColor Cyan
Get-ADUser krbtgt -Properties PasswordLastSet |
    Select-Object @{N="Account";E={"krbtgt"}},
                  PasswordLastSet,
                  @{N="AgeDays";E={((Get-Date) - $_.PasswordLastSet).Days}}

Write-Host "=== AZUREADSSOACC ===" -ForegroundColor Cyan
Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet -ErrorAction SilentlyContinue |
    Select-Object @{N="Account";E={"AZUREADSSOACC"}},
                  PasswordLastSet,
                  @{N="AgeDays";E={((Get-Date) - $_.PasswordLastSet).Days}}
```
|
||||||
|
|
||||||
|
Record the age in days in your tracking spreadsheet.
|
||||||
|
|
||||||
|
**What you are looking for:** KRBTGT older than 365 days = P1 (schedule rotation with us). KRBTGT older than 180 days = note and plan. AZUREADSSOACC never rotated since initial sync setup = note.
|
||||||
|
|
||||||
|
**Call us if:** KRBTGT is over 365 days old and there is no scheduled rotation. Or if either account shows a password age younger than expected (meaning someone rotated it without telling you — that is a finding too).
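
To keep the rating consistent from quarter to quarter, the KRBTGT thresholds above can be written down as a small helper. This is a sketch only; the function name `Get-KrbtgtAgeRating` and the exact label strings are hypothetical. It takes the `AgeDays` value printed by the script.

```powershell
function Get-KrbtgtAgeRating {
    param(
        [Parameter(Mandatory)] [int] $AgeDays   # AgeDays from the KRBTGT output
    )

    # Thresholds taken from this check: over 365 days is P1, over 180 is note and plan
    if     ($AgeDays -gt 365) { 'P1: schedule rotation' }
    elseif ($AgeDays -gt 180) { 'Note and plan' }
    else                      { 'OK' }
}
```

The helper does not cover the "younger than expected" case; that still needs a comparison against the last recorded rotation date in your tracking spreadsheet.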
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Q4. Cloud-only Global Admins check
|
||||||
|
|
||||||
|
**How to run:**
|
||||||
|
|
||||||
|
```powershell
Connect-MgGraph -Scopes "Directory.Read.All"

$gaRoleId = (Get-MgDirectoryRole -Filter "displayName eq 'Global Administrator'").Id
# -All pages through the full member list
$gaMembers = Get-MgDirectoryRoleMember -DirectoryRoleId $gaRoleId -All

Write-Host "=== Global Admins ===" -ForegroundColor Cyan
$gaMembers | ForEach-Object {
    $type = $_.AdditionalProperties.'@odata.type'
    if ($type -eq '#microsoft.graph.user') {
        $user = Get-MgUser -UserId $_.Id -Property DisplayName,UserPrincipalName,OnPremisesSyncEnabled
        [PSCustomObject]@{
            Name         = $user.DisplayName
            UPN          = $user.UserPrincipalName
            SyncedFromAD = $user.OnPremisesSyncEnabled
        }
    } else {
        # Groups and service principals can hold the role too; list them so they are not missed
        [PSCustomObject]@{
            Name         = $_.AdditionalProperties.displayName
            UPN          = "(not a user: $type)"
            SyncedFromAD = $null
        }
    }
} | Format-Table -AutoSize
```
|
||||||
|
|
||||||
|
Any row where `SyncedFromAD` is `True` is a P0 — call us immediately.
|
||||||
|
|
||||||
|
**What you are looking for:** Any Global Admin that is synced from on-prem AD. Any new GA you did not create.
|
||||||
|
|
||||||
|
**Call us if:** Any synced GA appears. Any GA you do not recognize.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Q5. Service principal secrets check — expiring and never-expiring
|
||||||
|
|
||||||
|
**How to run:**
|
||||||
|
|
||||||
|
```powershell
Connect-MgGraph -Scopes "Application.Read.All"

$today = Get-Date
$warningDays = 60

# Fetch the app registrations once and reuse the list for both reports
$apps = Get-MgApplication -All

Write-Host "=== Non-expiring secrets ===" -ForegroundColor Red
$apps | ForEach-Object {
    $app = $_
    $app.PasswordCredentials | Where-Object { $null -eq $_.EndDateTime } | ForEach-Object {
        [PSCustomObject]@{ App = $app.DisplayName; Secret = $_.DisplayName; Expires = "NEVER" }
    }
} | Format-Table

Write-Host "=== Secrets expiring within $warningDays days ===" -ForegroundColor Yellow
# Note: this also lists secrets whose expiry date has already passed
$apps | ForEach-Object {
    $app = $_
    $app.PasswordCredentials | Where-Object {
        $null -ne $_.EndDateTime -and $_.EndDateTime -lt $today.AddDays($warningDays)
    } | ForEach-Object {
        [PSCustomObject]@{ App = $app.DisplayName; Secret = $_.DisplayName; Expires = $_.EndDateTime }
    }
} | Sort-Object Expires | Format-Table
```
|
||||||
|
|
||||||
|
**What you are looking for:** Non-expiring secrets on any app registration. Secrets about to expire (these will break an application if not rotated — but they also need reviewing: is the app still needed?).
|
||||||
|
|
||||||
|
**Call us if:** You find a non-expiring secret on an app you do not recognize. Or if you find an expiring secret and do not know which application or service it belongs to.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Q6. Stale guest review
|
||||||
|
|
||||||
|
**How to run:**
|
||||||
|
|
||||||
|
```powershell
Connect-MgGraph -Scopes "User.Read.All", "AuditLog.Read.All"

$cutoff = (Get-Date).AddDays(-90)

$guests = Get-MgUser -Filter "userType eq 'Guest'" -All -Property DisplayName,Mail,CreatedDateTime,SignInActivity |
    ForEach-Object {
        $lastSignIn = $_.SignInActivity.LastSignInDateTime
        [PSCustomObject]@{
            Name            = $_.DisplayName
            Email           = $_.Mail
            Created         = $_.CreatedDateTime
            LastSignIn      = $lastSignIn
            DaysSinceSignIn = if ($lastSignIn) { ((Get-Date) - $lastSignIn).Days } else { "Never" }
            # Stale = no recorded sign-in, or last sign-in before the 90-day cutoff
            Stale           = (-not $lastSignIn) -or ($lastSignIn -lt $cutoff)
        }
    }

# Sort on the date, not on DaysSinceSignIn, which mixes numbers with the text "Never".
# Guests with no recorded sign-in come first, then the oldest sign-ins.
$guests | Sort-Object LastSignIn | Format-Table -AutoSize
```
|
||||||
|
|
||||||
|
**What you are looking for:** Guests who have not signed in for 90+ days. Guests you do not recognize (external parties from concluded projects or former vendors).
|
||||||
|
|
||||||
|
**Call us if:** The count of stale guests is growing quarter-over-quarter and nobody is pruning them. Or if a guest account appears that belongs to an external party from a concluded engagement and still has active access.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Q7. Anonymous link count
|
||||||
|
|
||||||
|
**How to run:** Connect using PnP PowerShell (installed during engagement):
|
||||||
|
|
||||||
|
```powershell
Connect-PnPOnline -Url "https://[tenant]-admin.sharepoint.com" -Interactive

$sites = Get-PnPTenantSite -IncludeOneDriveSites

# Confirm the cmdlet exists in the installed PnP version first: Get-Command Get-PnPSharingLinks
# @() keeps .Count reliable when zero or one link is returned
$anonLinks = @(foreach ($site in $sites) {
    Connect-PnPOnline -Url $site.Url -Interactive
    Get-PnPSharingLinks | Where-Object { $_.SharingLinkType -eq "Anonymous" } |
        ForEach-Object { [PSCustomObject]@{ Site = $site.Url; Link = $_.ShareLink; Expires = $_.ExpirationDateTime } }
})

Write-Host "Total anonymous links: $($anonLinks.Count)" -ForegroundColor Yellow
$anonLinks | Sort-Object Site | Format-Table
```
|
||||||
|
|
||||||
|
Record the count. Save the export.
|
||||||
|
|
||||||
|
**What you are looking for:** Count increasing quarter-over-quarter (means new anonymous links are being created despite the policy). Links with no expiration date.
|
||||||
|
|
||||||
|
**Call us if:** Count is increasing despite the restriction we put in place. Or if you find anonymous links on sites that hold sensitive data (HR, Finance, M&A).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Q8. CA policy diff — detect drift
|
||||||
|
|
||||||
|
**How to run:**
|
||||||
|
|
||||||
|
```powershell
# CAExporter is set up from the engagement; run it from its directory
.\CAExporter.ps1 -ExportPath "C:\SecurityRunbook\CA-Exports\CA-$(Get-Date -Format 'yyyy-MM-dd')"
```
|
||||||
|
|
||||||
|
Then compare this quarter's export folder to last quarter's using any file diff tool (WinMerge, VS Code with the "compare folders" extension, or simply `Compare-Object` in PowerShell):
|
||||||
|
|
||||||
|
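```powershell
$oldDir = "C:\SecurityRunbook\CA-Exports\CA-2026-04-01"
$newDir = "C:\SecurityRunbook\CA-Exports\CA-2026-07-01"

# Compare name and content hash, so edited policies show up as well as added or deleted ones.
# If the export embeds a timestamp in each file, every file will appear changed; fall back to a diff tool.
$old = Get-ChildItem $oldDir -File | Get-FileHash | Select-Object @{N="Name";E={Split-Path $_.Path -Leaf}}, Hash
$new = Get-ChildItem $newDir -File | Get-FileHash | Select-Object @{N="Name";E={Split-Path $_.Path -Leaf}}, Hash

Compare-Object $old $new -Property Name, Hash
```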
|
||||||
|
|
||||||
|
Then for any policy that changed, open the JSON files and compare manually. The changed lines are the configuration drift.
|
||||||
|
|
||||||
|
**What you are looking for:** Policies deleted since last quarter. Policies whose parameters changed (exclusions added, scope narrowed, MFA grant changed to "grant without controls"). New policies in report-only mode that should have been enabled.
|
||||||
|
|
||||||
|
**Call us if:** Any CA policy has changed without a corresponding change record. A policy that was enforcing is now in report-only mode. A new exclusion was added to a critical policy (legacy auth block, admin MFA, device compliance).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## "Call us" trigger list
|
||||||
|
|
||||||
|
These are the situations where you stop, take a screenshot, and contact us — even outside a scheduled check:
|
||||||
|
|
||||||
|
| What you see | How urgent | What to do first |
|
||||||
|
|---|---|---|
|
||||||
|
| Break-glass alert fires unexpectedly | Immediate | Disable any active sessions for the break-glass account, then call us |
|
||||||
|
| New Global Admin you did not create | Immediate | Do not remove it yet — screenshot first, then call us |
|
||||||
|
| Synced account in Global Admin role | Immediate | Do not change anything; screenshot and call us |
|
||||||
|
| DCSync alert from Defender for Identity | Immediate | Isolate the source host from the network if possible, then call us |
|
||||||
|
| External auto-forward rule found on any executive mailbox | Same day | Disable the rule, check for mail forwarded, call us |
|
||||||
|
| PingCastle score rises by more than 10 points (higher means more risk) | Within 48 hours | Send us the report alongside the previous quarter's |
|
||||||
|
| Any alert sitting at High severity for more than 24 hours you do not know how to triage | Within 24 hours | Screenshot, note what the alert says, call us |
|
||||||
|
| Backup restore fails or produces corrupt data | Same day | Do not delete anything — call us |
|
||||||
|
| Something that feels wrong but is not on this list | Use your judgement | A wrong feeling is data. Document what you noticed and send it. We will tell you if it is nothing. |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Tracking spreadsheet columns
|
||||||
|
|
||||||
|
Keep a simple spreadsheet (Excel or SharePoint list) with one row per check per quarter:
|
||||||
|
|
||||||
|
| Date | Check | Result / Count | vs. Last Quarter | Action taken | Escalated to consultant? |
|
||||||
|
|------|-------|---------------|-----------------|--------------|--------------------------|
|
||||||
|
|
||||||
|
The trend matters more than any individual value. A metric that is consistently getting worse is a finding even if no single value crosses a threshold.
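
A "consistently getting worse" test can be applied mechanically to any column of the spreadsheet. The sketch below is one possible reading of that rule, flagging a metric that has worsened in each of the last few periods; the function name `Test-WorseningTrend`, the default of three periods, and the `-LowerIsWorse` switch are assumptions for illustration.

```powershell
function Test-WorseningTrend {
    param(
        [Parameter(Mandatory)] [double[]] $Values,  # one value per period, oldest first
        [int] $Periods = 3,                         # consecutive worsening steps required
        [switch] $LowerIsWorse                      # set for metrics where a fall is bad
    )

    # Need one more value than the number of steps being checked
    if ($Values.Count -lt ($Periods + 1)) { return $false }

    $start = $Values.Count - $Periods - 1
    for ($i = $start + 1; $i -lt $Values.Count; $i++) {
        $prev = $Values[$i - 1]
        $curr = $Values[$i]
        $worse = if ($LowerIsWorse) { $curr -lt $prev } else { $curr -gt $prev }
        if (-not $worse) { return $false }
    }
    return $true
}
```

For example, stale guest counts of 3, 5, 8, 12 over four quarters return `$true` even if no single count looked alarming on its own.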
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## When to schedule the next full engagement
|
||||||
|
|
||||||
|
Use this as a rule of thumb:
|
||||||
|
|
||||||
|
- **Annual:** Full adversarial validation (the engagement that produced this document). Recommended even if the monthly and quarterly checks are clean — they catch drift, not adversarial paths.
|
||||||
|
- **Triggered:** Any time a "call us immediately" event fires, or PingCastle / Purple Knight produces a new Critical finding.
|
||||||
|
- **Project-triggered:** Before any major change to the estate — AD migration, new cloud service onboarding, M365 license change, acquisition or merger, significant IT staff change.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Self-service cadence for [client name]. Produced June 2026. Review and update January 2027 alongside the field guide update.*
|
||||||
@@ -0,0 +1,194 @@
|
|||||||
|
# The Antifragile Handbook for M365 & Active Directory
|
||||||
|
|
||||||
|
## Book I — Principles & Judgement
|
||||||
|
|
||||||
|
> *Move fast and fix things.*
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Why this book exists
|
||||||
|
|
||||||
|
This is not a benchmark. It will not give you a number to report to a steering committee. It will not tell you that your tenant is 87% compliant, because that number is a lie that makes everyone feel safe while the building burns.

Compliance frameworks (CIS, NIST, ISO, the lot) answer one question: *did you do the things on the list?* That is a useful question. It is not the important one. The important question is: **when this gets attacked, does it get weaker, stay the same, or get stronger?**

A system that gets stronger from being stressed is antifragile. Almost no M365 + AD estate is antifragile by default. Most are the opposite: a flat domain synced to a cloud tenant, where one phished helpdesk account quietly becomes domain dominance and then Global Admin. That is fragility wearing a compliance certificate.

A consultant trained on benchmarks knows *what* the settings should be. A consultant trained on this book knows *which settings matter, why, and what breaks if they're wrong*, and can walk into a tenant they've never seen and find the thing that will actually kill the client. That is the difference between a technician and an independent professional. We are trying to raise the second kind.
|
||||||
|
|
||||||
|
### What "move fast and fix things" actually means
|
||||||
|
|
||||||
|
It is a deliberate edit of the old Silicon Valley creed. The original assumed things were whole and that breaking them was the cost of speed. Our world is the reverse: **the things are already broken.** Legacy auth is still on. Service accounts from 2014 still have domain admin. Nobody has tested the break-glass account since it was created. Speed, here, is not recklessness — it is refusing to let a thirty-page risk-acceptance process protect a fragility that a teenager with a phishing kit will remove for free. So:
|
||||||
|
|
||||||
|
- **Fast** — bias to action. A fix shipped this week beats a perfect fix discussed for a quarter. Fragility compounds while you deliberate.
|
||||||
|
- **Fix** — actually change the structure, not the documentation. A risk you *accepted* is a risk you still have.
|
||||||
|
- **Things that matter** — and this is the whole craft — the discrimination to know that disabling legacy auth outranks renaming forty GPOs to match a naming standard. Most of the checklist is noise. Find the signal.
|
||||||
|
|
||||||
|
### How compliance still fits (read this before you get smug)
|
||||||
|
|
||||||
|
We are not anti-compliance. We are anti-*thoughtless* compliance. Your clients have auditors, contracts, and regulators, and you will still help them pass. The relationship is this:
|
||||||
|
|
||||||
|
> **Compliance is a floor and a by-product. It is never the target.**
|
||||||
|
|
||||||
|
If you build an antifragile estate, you will pass CIS almost by accident, and you will be able to explain *why* every control exists — which is more than most auditors can. But you will also do things no benchmark asks for (game-days, kill-switch drills, deliberate removal of features) and you will *skip* things benchmarks demand when they add fragility or cost without reducing blast radius. When you skip, you skip **on the record, with a written reason**. That is the difference between independent judgement and laziness.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The governing question
|
||||||
|
|
||||||
|
Before the principles, the one question that sits above all of them. Ask it of every account, every trust, every sync, every app registration:
|
||||||
|
|
||||||
|
> **If this is owned tonight, what is the largest thing an attacker reaches before hitting a wall — and can I draw that wall?**
|
||||||
|
|
||||||
|
If you cannot draw the wall, there is no wall. In M365 + AD the wall is almost always missing in the same place: the **identity bridge** between on-prem AD and Entra ID. Internalise this and half the job is done.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The Principles
|
||||||
|
|
||||||
|
Nine of them. They overlap on purpose — antifragility is a way of seeing, not a checklist (the irony would be unbearable). Each comes with **judgement prompts**: the questions an independent consultant asks instead of looking up the "correct" value. Learn the questions, not the answers. The answers change with every tenant; the questions don't.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 1. Via Negativa — subtract before you add
|
||||||
|
|
||||||
|
The strongest control is the thing that no longer exists. It cannot be misconfigured, cannot be exploited, cannot drift, and costs nothing to maintain. Benchmarks are addition machines: every control is something *more* to deploy and watch. Start the other way: what can we **delete**?

In M365 + AD, the highest-leverage deletions are usually: legacy/basic auth, NTLM and unconstrained delegation, standing privileged role assignments, dormant service accounts and their static secrets, unused federation, public folders, orphaned app registrations with tenant-wide consent, and "temporary" firewall or CA exclusions that became permanent.

**Judgement prompts**
|
||||||
|
|
||||||
|
- If I removed this control/feature/account, would *anyone* notice within 90 days? If not, why does it exist?
|
||||||
|
- What is the oldest thing here still running, and who decided it should keep running — or did nobody decide?
|
||||||
|
- Every exclusion is a tiny hole punched in a wall. List the exclusions. Who asked for each, and is that person still here?
|
||||||
|
- Am I about to *add* a control to compensate for something I could *remove* instead?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 2. The Barbell — protect the irreplaceable, let the rest stay cheap
|
||||||
|
|
||||||
|
Compliance scoring spreads effort evenly: every control worth the same point. Reality is not evenly distributed. A handful of things are irreplaceable: tenant root, Tier 0 / domain controllers, break-glass accounts, backups, the sync engine. Everything else is, in principle, rebuildable.

Put **paranoid, expensive, redundant** protection on the irreplaceable few. Let everything else be **cheap, fast, and replaceable**, even disposable. Do not spend your political capital hardening a kiosk laptop while a Global Admin has no phishing-resistant MFA. The middle (moderate protection spread thinly over everything) is where budgets and attention go to die.

**Judgement prompts**
|
||||||
|
|
||||||
|
- Name the five things in this estate that, if lost, cannot be rebuilt. Are they protected differently from everything else, or the same?
|
||||||
|
- Where is effort being spent evenly that should be spent asymmetrically?
|
||||||
|
- Is anything in the "cheap and replaceable" bucket actually load-bearing in disguise? (The "temporary" script on someone's laptop that runs payroll.)
|
||||||
|
- Could I afford to let this thing be *destroyed* and just rebuild it? If yes, stop gold-plating it.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 3. Blast Radius is the metric — not the control count
|
||||||
|
|
||||||
|
This is the governing question turned into a habit. Compliance counts inputs (controls present). Antifragility measures **propagation** (how far a compromise travels). A tenant with 200 controls and a flat AD→Entra trust is more fragile than a tenant with 50 controls and a real tier boundary.

The defining fragility of hybrid M365 is **coupling**: Password Hash Sync or PTA, Entra Connect running as a quasi-Tier-0 service, AD admins who are also cloud admins, devices that are both domain-joined and the user's MFA device. Each coupling means one compromise becomes two. Antifragile design **decouples**: it turns the identity bridge from a conduit into a firebreak.

**Judgement prompts**
|
||||||
|
|
||||||
|
- Draw the attack path from a single phished standard user to Global Admin. How many *independent* barriers are there? Independent, not "two MFA prompts from the same provider."
|
||||||
|
- Which single account, if compromised, ends the engagement? How many are there? (If the answer is more than zero, that's the project.)
|
||||||
|
- If on-prem AD fell completely, would the cloud survive — and vice versa? Or are they one organism wearing two badges?
|
||||||
|
- What runs the sync, and what could that identity reach? Trace it.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 4. Optionality — buy cheap escape hatches
|
||||||
|
|
||||||
|
Pay a small, certain cost now for the *option* to survive an uncertain disaster later. Break-glass accounts, a tested "kill the sync" runbook, a way to revoke all tokens at once, an offline copy of recovery keys, a documented path to a clean tenant. These look like waste to an auditor and like wisdom on the worst day of the client's year.

Optionality is the opposite of optimisation. An optimised system has no slack and shatters at the first surprise. Deliberately keep some slack.

**Judgement prompts**
|
||||||
|
|
||||||
|
- When the primary path fails, what's the second path — and has anyone walked it?
|
||||||
|
- If we had to sever AD from Entra in the next 30 minutes to contain a breach, *how*? Is that written down where someone panicking can find it?
|
||||||
|
- Break-glass: does it exist, is it phishing-resistant, is it excluded from the CA policy that would otherwise lock it out, and when was it last *used* in a drill (not just created)?
|
||||||
|
- What are we optimising so hard that we've removed all room to manoeuvre?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 5. Stress it on purpose — hormesis, not hope
|
||||||
|
|
||||||
|
Muscle, bone, and immune systems get stronger from controlled stress and weaker from protection. Systems are the same. **An untested control is a broken control**; you simply don't know it yet. The benchmark says "the setting is configured." The antifragile consultant says "we revoked the token at 14:00 on a Tuesday and watched what actually happened."

Run game-days. Disable a CA policy and observe the fallout in a controlled window. Simulate Entra Connect failure. Pull a Global Admin's session. Kill a DC. You *want* to discover brittleness on a quiet afternoon, cheaply, with the right people watching, not at 3 a.m. during a real intrusion.

**The corollary: declared state is not enforced state.** Underneath "untested = broken" sits a harder truth about *why* you must test: every representation the platform hands you (a config blade, an inventory record, a compliance dashboard, a green tick) is a **claim about reality, not reality itself**, and the two diverge silently and routinely. Two examples that should haunt you:
|
||||||
|
|
||||||
|
- A Conditional Access policy can display a flawless configuration and **enforce nothing** — the evaluated object has desynced from the one you're looking at. Every config review, export-diff, and benchmark audit passes. Only a real sign-in reveals it fails open. (Worked example in Book IV.)
|
||||||
|
- A CMDB or device inventory shows a clean, managed fleet while the sign-in logs show a different, larger, partly-unknown population actually touching the data. The inventory is a wish; the authentication record is the fact. (Worked example in Book IV.)
|
||||||
|
|
||||||
|
So the rule that governs the whole craft: **verify by observation, never by inspection.** Trust what the system *does* under test over what any artefact *says* it does. Reading the config is not knowing the behaviour; counting the inventory is not knowing the fleet. Where the representation and the observed behaviour disagree, the behaviour is the truth and the representation is the bug.

**Judgement prompts**
|
||||||
|
|
||||||
|
- What here has never once been tested by actually breaking it?
|
||||||
|
- What do we *believe* is true about this estate that we've never verified by observation? (Belief is not evidence. The portal showing a green tick is not the same as the control firing under attack.)
|
||||||
|
- Which "facts" about this estate come from a *representation* (config screen, CMDB, dashboard) rather than from *observed behaviour*? Which have we confirmed the system actually does, versus merely says?
|
||||||
|
- Where would a silent divergence between declared and enforced state hurt most — and how would we even notice it?
|
||||||
|
- When did this client last deliberately break something to learn from it? If "never," that's the most important finding in your report.
|
||||||
|
- What's the smallest, safest experiment that would tell us whether X is real?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 6. Every incident must change the structure
|
||||||
|
|
||||||
|
This is the actual definition of antifragile: *gaining from disorder.* A robust system survives a shock unchanged. An antifragile system comes out **structurally different and harder to hit the same way twice.** Pain that closes a ticket without changing the architecture is wasted pain, and it guarantees the same incident again.

After every incident, near-miss, failed game-day, or even a noisy false positive: what *structural* thing changes? Not "we reminded users to be careful." A removed permission, a severed coupling, a new firebreak, a deleted feature.

**Judgement prompts**
|
||||||
|
|
||||||
|
- For the last three incidents (or alerts) here — what changed in the *structure* afterwards? If the answer is "a training reminder," nothing changed.
|
||||||
|
- Does this organisation treat incidents as embarrassments to bury or as fuel? (Blameless on people, ruthless on structure.)
|
||||||
|
- Are we fixing the instance or the class? Patching this account, or removing the pattern that made it possible?
|
||||||
|
- What did the last false positive *teach* us that we threw away?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 7. Convexity — prefer bounded cost, unbounded upside
|
||||||
|
|
||||||
|
Choose controls whose downside is small and known, and whose upside is large and broad. Conditional Access is convex: cheap to run, fails gently, and one good policy blocks whole classes of attack. A sprawling, hand-tuned DLP ruleset is concave: expensive to maintain, brittle, and it fails in surprising, expensive ways at the worst moment. Favour the convex. Be deeply suspicious of any control that needs constant tending to keep working.

**Judgement prompts**
|
||||||
|
|
||||||
|
- When this control fails, does it fail *safe and quietly*, or *open and catastrophically*? (Fail-open is concave and usually a trap.)
|
||||||
|
- How much ongoing care does this need to keep working? High-maintenance controls rot the moment attention moves on.
|
||||||
|
- Does this control block a *class* of attacks or just one specific instance? Prefer the class.
|
||||||
|
- Are we buying a complex product to solve a problem that one CA policy and a deletion would solve?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 8. Lindy — trust what has survived
|
||||||
|
|
||||||
|
The longer a mechanism has survived, the longer it's likely to keep working. Boring, time-tested controls (least privilege, network segmentation done right, hardware-backed keys, tiered admin) beat the newest preview blade in the portal. New features arrive with unknown failure modes and unknown attack surface; they have not yet been stress-tested by the world. Use them when they earn it, not because they're new.

Equally: an attack technique that has worked for fifteen years (NTLM relay, Kerberoasting, consent phishing) will probably work next year. Prioritise accordingly.

**Judgement prompts**
|
||||||
|
|
||||||
|
- Is this control time-tested, or are we the QA team for a feature that shipped last month?
|
||||||
|
- What are the oldest, most reliable attacks against this estate — and have we actually closed them, or chased novel ones while the classics stay open?
|
||||||
|
- If this shiny feature vanished tomorrow, would we be exposed? If yes, we built on sand.
|
||||||
|
- Are we solving a 2015 problem with a 2026 product because the product is new?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 9. Skin in the game — whoever designs it, lives with it
|
||||||
|
|
||||||
|
Security theatre is what happens when the people imposing controls never carry the pager. A consultant who recommends a control they'd never have to operate is selling fragility dressed as diligence. The person who designs the break-glass process should be woken up by the drill. The architect who couples AD to Entra should be the one who has to uncouple it under fire.

This applies to you. Don't recommend what you wouldn't run. Don't hand a client a 40-page hardening guide you've never operated. Your reputation is your skin in the game: stake it on advice that survives contact with reality.

**Judgement prompts**
|
||||||
|
|
||||||
|
- Does the person who designed this control have to live with its consequences? If not, expect theatre.
|
||||||
|
- Am I recommending this because it's right, or because it's defensible if something goes wrong? (Defensive medicine is fragility you can bill for.)
|
||||||
|
- Would I bet my own reputation that this works under real attack? If I hesitate, why am I asking the client to bet theirs?
|
||||||
|
- Who gets the 3 a.m. call when this fails — and were they in the room when it was designed?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## How to spot fragility (the field skill)
|
||||||
|
|
||||||
|
You will walk into estates with no documentation and no time. Fragility has a smell. Train your nose on these tells:
|
||||||
|
|
||||||
|
- **Folklore.** Configurations only one person understands, justified by "we've always done it that way." If they leave, it becomes un-auditable. Folklore is fragility with tenure.
|
||||||
|
- **Single points of failure wearing a uniform.** One service account that runs everything. One admin who holds all the keys. One unreplicated DC. One sync server treated as cattle but actually a pet.
|
||||||
|
- **Tight coupling.** Compromise one thing → automatically own a second. AD↔Entra, identity-device-MFA all on one phone, prod and admin in one forest.
|
||||||
|
- **Things never tested.** Backups never restored. Break-glass never used. DR plans never run. "It should work" is the sound of a fragile system.
|
||||||
|
- **Permanent "temporary."** Exclusions, exceptions, pilot configs, and risk acceptances older than 18 months.
|
||||||
|
- **Even spreading.** Effort distributed uniformly is a sign nobody asked what matters. The barbell is missing.
|
||||||
|
- **Green dashboards, untested reality.** Everything compliant, nothing ever stress-tested. The most dangerous estate of all, because it feels safe.
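
The "permanent temporary" tell is one of the few on this list that can be checked mechanically. A minimal sketch, assuming you have already gathered exclusions and risk acceptances into a list by hand; the record fields (`name`, `created`) and the sample entries are hypothetical, and only the 18-month threshold comes from the text:

```python
from datetime import date

# Threshold from the text: a "temporary" item older than 18 months is a tell.
MAX_AGE_MONTHS = 18


def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month has not fully elapsed yet
    return months


def permanent_temporaries(items: list[dict], today: date) -> list[str]:
    """Return the names of items older than the threshold."""
    return [
        item["name"]
        for item in items
        if months_between(item["created"], today) > MAX_AGE_MONTHS
    ]


# Hypothetical inventory; field names are illustrative, not a tool's export format.
inventory = [
    {"name": "CA exclusion: legacy-scanner", "created": date(2023, 3, 1)},
    {"name": "Pilot policy: ring-0 devices", "created": date(2025, 11, 10)},
]
print(permanent_temporaries(inventory, today=date(2026, 1, 15)))
```

The output is only a starting list: every flagged item still needs a named owner and a decision to renew or remove.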
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The anti-benchmark: what we measure instead of compliance %
|
||||||
|
|
||||||
|
We don't score controls passed. If the client needs a number, give them these — and explain why each beats a compliance percentage:
|
||||||
|
|
||||||
|
- **Blast radius** — from a single phished standard user, how many independent barriers to tenant/domain dominance? (Higher is better. Most estates: zero or one.)
|
||||||
|
- **Mean time to recover** — measured by *actually doing it* in a drill, not by the RTO written in a policy.
|
||||||
|
- **Single points of failure** — counted, named, and owned. The goal is a shrinking list, not a green tick.
|
||||||
|
- **Untested assumptions** — the number of load-bearing beliefs never verified by observation. The goal is to drive this toward zero.
|
||||||
|
- **Time-to-remove** — how fast can we delete a fragilizer (legacy auth, a standing admin) once found? Velocity *is* a security metric.
|
||||||
|
|
||||||
|
None of these are easy to fake, which is exactly why they're worth measuring.
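
As an illustration of "measured by actually doing it", mean time to recover can be computed from drill timestamps rather than read from a policy. This is a sketch under stated assumptions: each drill is recorded as a hypothetical pair of timestamps (failure injected, service restored), and an empty drill log is treated as "unknown" rather than zero:

```python
from datetime import datetime, timedelta


def mean_time_to_recover(drills: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored - started) across drills that were actually run."""
    if not drills:
        # No drills means the number is unknown, which is itself the finding.
        raise ValueError("no drills run: MTTR is unknown, not zero")
    total = sum((restored - started for started, restored in drills), timedelta())
    return total / len(drills)


# Hypothetical drill log: (failure injected, service restored).
drills = [
    (datetime(2026, 1, 10, 9, 0), datetime(2026, 1, 10, 11, 0)),
    (datetime(2026, 2, 14, 9, 0), datetime(2026, 2, 14, 13, 0)),
]
print(mean_time_to_recover(drills))
```

Raising on an empty log is a deliberate choice: a client with no drills should see a missing number, not a flattering one.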
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## How to use this handbook
|
||||||
|
|
||||||
|
Book I is the lens. The domain books that follow — Hybrid Identity, Privileged Access, Devices, Data & Collaboration, Recovery, Detection-as-feedback — each apply this same lens in the same shape:
|
||||||
|
1. **Fragility inventory** — where does this domain break, and what's the blast radius?
|
||||||
|
2. **Via negativa** — what do we remove first?
|
||||||
|
3. **The barbell** — what gets paranoid protection, what stays cheap?
|
||||||
|
4. **Optionality & recovery** — what are the escape hatches, and are they tested?
|
||||||
|
5. **Stressor** — how do we deliberately break this to learn?
|
||||||
|
|
||||||
|
If you ever find yourself reaching for "because the benchmark says so," stop. Go back to the governing question. Draw the wall. If you can't draw it, you've found your work.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Book I of the Antifragile Handbook. Principles over checklists. Judgement over obedience. Move fast and fix things.*
|
||||||
@@ -0,0 +1,167 @@
|
|||||||
|
# The Antifragile Handbook for M365 & Active Directory
|
||||||
|
|
||||||
|
## Book II — Hybrid Identity
|
||||||
|
|
||||||
|
> *Draw the wall between on-prem and cloud. In most estates there isn't one — there's a hallway with the door propped open.*
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Why this is the keystone
|
||||||
|
|
||||||
|
If you only ever fix one domain, fix this one. Every other book — privileged access, devices, data — assumes identity holds. In a hybrid M365 + AD estate, identity usually doesn't hold, and the reason is always the same: on-prem AD and Entra ID are not two systems with a guarded border. They are **one organism wearing two badges**, joined by a bridge that most organisations cannot draw, do not monitor, and have never tested severing.
|
||||||
|
|
||||||
|
The governing question, applied here:
|
||||||
|
|
||||||
|
> **If on-prem AD is ransomwared or domain-dominated tonight, does the cloud survive — or is it already poisoned by inheritance?**
|
||||||
|
|
||||||
|
For the overwhelming majority of estates the honest answer is "poisoned," and nobody has ever said it out loud. Your job is to say it out loud, then build the wall.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Fragility inventory — anatomy of the bridge
|
||||||
|
|
||||||
|
You cannot harden what you can't draw. Here is the bridge, piece by piece, with the blast radius of each. Learn to find all of these on day one of an engagement.
|
||||||
|
|
||||||
|
### The sync engine (the single most dangerous server you'll forget about)
|
||||||
|
|
||||||
|
Entra Connect Sync (the old Azure AD Connect) or Entra Cloud Sync runs the synchronisation. Whatever the diagram says, **this server is Tier 0** — because of the accounts it holds:
|
||||||
|
|
||||||
|
- **The on-prem connector account.** Under the old "express" install, this account was granted *Replicate Directory Changes* and *Replicate Directory Changes All* — which is **DCSync**. That means the sync server holds an identity that can pull every password hash in the domain. Read that again. The box your infra team treats as a middling utility VM can dump the entire domain.
|
||||||
|
- **The Entra connector account** (Directory Synchronization Accounts role) — can manipulate synced objects in the cloud.
|
||||||
|
|
||||||
|
So: compromise the sync server → DCSync on-prem **and** tamper with cloud objects. One box, both kingdoms. If this server is domain-joined to the production domain (it usually is), then anything that reaches prod-tier reaches your DCSync machine. That is the central coupling of the entire estate.
|
||||||
|
|
||||||
|
**Where it's worse than you think:** the sync server is often internet-facing for updates, runs a local SQL Express nobody patches, sits on an OS build from the project that installed it, and has not had its connector account rights reviewed since go-live.
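
To make "can it DCSync?" concrete: the two rights named above are control access rights with well-known GUIDs, so an exported ACL on the domain root can be checked for principals holding both. The sketch below assumes the ACEs have already been exported by some other tool into plain records; the field names (`principal`, `type`, `object_type`) and the sample accounts are hypothetical. It deliberately ignores group nesting and broader rights such as full control, so treat a clean result as a lower bound, not proof.

```python
# Well-known control access right GUIDs on the domain naming context.
REPL_GET_CHANGES = "1131f6aa-9c07-11d1-f79f-00c04fc2dcd2"      # Replicate Directory Changes
REPL_GET_CHANGES_ALL = "1131f6ad-9c07-11d1-f79f-00c04fc2dcd2"  # Replicate Directory Changes All


def dcsync_capable(aces: list[dict]) -> set[str]:
    """Principals that hold both replication rights through allow ACEs."""
    rights: dict[str, set[str]] = {}
    for ace in aces:
        if ace["type"] != "allow":
            continue
        rights.setdefault(ace["principal"], set()).add(ace["object_type"].lower())
    needed = {REPL_GET_CHANGES, REPL_GET_CHANGES_ALL}
    return {principal for principal, held in rights.items() if needed <= held}


# Hypothetical export of the domain root ACL.
sample_aces = [
    {"principal": "CONTOSO\\MSOL_0123456789ab", "type": "allow", "object_type": REPL_GET_CHANGES},
    {"principal": "CONTOSO\\MSOL_0123456789ab", "type": "allow", "object_type": REPL_GET_CHANGES_ALL.upper()},
    {"principal": "CONTOSO\\svc-backup", "type": "allow", "object_type": REPL_GET_CHANGES},
]
print(dcsync_capable(sample_aces))
```

Anything in the result that is not a domain controller or a known, justified connector account belongs in the findings backlog.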
|
||||||
|
|
||||||
|
### The authentication method (decides whether the cloud lives or dies with AD)
|
||||||
|
|
||||||
|
Three options, three completely different fragility profiles. Know which one you're actually on before you say anything — the diagram and the reality often disagree.
|
||||||
|
|
||||||
|
- **Password Hash Sync (PHS).** A hash-of-a-hash is synced to Entra; the cloud can authenticate on its own. *This is the most resilient option for availability* — if on-prem dies, cloud auth keeps working. The synced value is not trivially reversible to the plaintext password; the risk is **not** "PHS leaks passwords," it's that the connector account doing the sync can DCSync. Don't let anyone fragilise availability to "fix" a risk that lives in the connector account, not the hash.
|
||||||
|
- **Pass-through Authentication (PTA).** Credentials are validated against on-prem AD in real time by PTA agents. **Coupling: on-prem outage = cloud auth outage.** Worse, the agent must handle the credential to validate it, so a compromised PTA agent is a plaintext-credential harvesting position. PTA agents are Tier 0 and a juicy target, and PTA is a conduit, not a firebreak. (You can enable PHS *alongside* PTA as failover — cheap optionality, see §4.)
|
||||||
|
- **Federation / AD FS.** The catastrophe. See below — it gets its own treatment because it's usually the single largest fragility in the estate.
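
The availability side of the three profiles above can be written down as a small decision function, which is a useful thing to fill in before the first client conversation. This is a sketch of the reasoning only, not a query against any tenant; the enum and function names are hypothetical, and it assumes that moving PTA or federated sign-in over to PHS is a manual step someone has to perform.

```python
from enum import Enum


class AuthMethod(Enum):
    PHS = "password hash sync"
    PTA = "pass-through authentication"
    FEDERATED = "federation / AD FS"


def cloud_auth_if_onprem_dies(method: AuthMethod, phs_also_enabled: bool = False) -> str:
    """What happens to cloud sign-in if on-prem AD becomes unavailable."""
    if method is AuthMethod.PHS:
        # Entra validates against the synced hash; no on-prem dependency at sign-in.
        return "survives"
    if phs_also_enabled:
        # Hashes are already in the cloud, but someone must switch the sign-in method.
        return "recoverable after a manual switch to PHS"
    # PTA agents or the federation service are unreachable and there is no fallback.
    return "down with on-prem"


for method in AuthMethod:
    print(method.name, "->", cloud_auth_if_onprem_dies(method))
```

The function says nothing about the security side (a compromised PTA agent, a stolen token-signing key); it only answers the availability half of the governing question.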
|
||||||
|
|
||||||
|
### AD FS and Golden SAML (the thing that ends careers)
|
||||||
|
|
||||||
|
If AD FS issues tokens, then the **token-signing key** can forge a SAML assertion for *any* user — including bypassing MFA when MFA is enforced at the federation layer — and the cloud will trust it because it's validly signed. This is **Golden SAML**. It is how nation-state actors turned a single on-prem foothold into silent, total, persistent cloud impersonation (the SolarWinds intrusions). It is nearly invisible: the IdP is forging legitimate tokens, so there's no failed login, no anomalous password, nothing for a benchmark to catch.
|
||||||
|
|
||||||
|
The token-signing certificate is a single catastrophic point of failure that most orgs never rotate, store poorly, and don't monitor. If you take one thing from this book: **AD FS is fragility incarnate, and the correct long-term answer is to remove it** (§2), not to harden it.
|
||||||
|
|
||||||
|
### Seamless SSO (the forgotten Kerberos key)
|
||||||
|
|
||||||
|
Seamless SSO creates the `AZUREADSSOACC` computer account in AD. Its Kerberos decryption key, if never rotated (it usually never is), is a silver-ticket / token-forging exposure. Classic Lindy fragility: old, unrotated, forgotten, exploitable.
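
One way to put a number on "never rotated" is to read the `pwdLastSet` attribute of the `AZUREADSSOACC` account and convert it to an age. The sketch below assumes that `pwdLastSet` on that account reflects the last key rollover, and that you have obtained the raw attribute value some other way; the function names are hypothetical. The only hard fact it relies on is the Windows FILETIME format: 100-nanosecond intervals since 1601-01-01 UTC.

```python
from datetime import datetime, timedelta, timezone

# Windows FILETIME counts 100-nanosecond intervals from this instant.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a raw FILETIME integer (as stored in pwdLastSet) to a UTC datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


def key_age_days(pwd_last_set: int, now: datetime) -> int:
    """Whole days since the account's password (and so its key) was last set."""
    return (now - filetime_to_datetime(pwd_last_set)).days


# 116444736000000000 is the FILETIME value for 1970-01-01 00:00:00 UTC.
print(filetime_to_datetime(116444736000000000))
```

An age measured in years rather than weeks is the finding; the rotation schedule itself should come from current Microsoft guidance, not from this sketch.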
|
||||||
|
|
||||||
|
### The writebacks (reverse conduits nobody counts)
|
||||||
|
|
||||||
|
Every writeback turns the bridge two-way and creates *reverse* blast radius:
|
||||||
|
|
||||||
|
- **Password writeback** — cloud SSPR can change on-prem passwords. Useful; also a path from cloud to on-prem.
|
||||||
|
- **Device writeback / group writeback** — cloud objects written into AD. Group writeback (v2), where cloud security groups become AD objects that gate on-prem resource access, means a **cloud group compromise now affects on-prem access** — a coupling people rarely diagram.
|
||||||
|
|
||||||
|
Each writeback may be justified. None should be silent. Count them, name the blast radius of each.
|
||||||
|
|
||||||
|
### The admin coupling (one organism, two badges)
|
||||||
|
|
||||||
|
The deepest fragility isn't a setting, it's the people and accounts:
|
||||||
|
|
||||||
|
- The same humans are Domain Admins **and** Global Admins.
|
||||||
|
- Cloud admin accounts are **synced from on-prem**, so on-prem compromise → harvest → cloud admin.
|
||||||
|
- Admins use the same workstation for AD and Entra, and that workstation is also their email/MFA device.
|
||||||
|
|
||||||
|
If on-prem privilege flows into cloud privilege through any of these, there is no wall. There's a hallway.
|
||||||
|
|
||||||
|
### Source of authority (why you can't fix it in the cloud)
|
||||||
|
|
||||||
|
For synced objects, **on-prem is authoritative**. You cannot durably fix a synced object purely cloud-side; the next sync cycle overwrites you. This matters enormously in incident response: if AD is owned, your cloud objects are downstream of poison and "just fix it in Entra" doesn't hold.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Via negativa — what to remove (in priority order)
|
||||||
|
|
||||||
|
Hybrid identity is where subtraction pays the highest dividend in the whole estate. In rough order of leverage:
|
||||||
|
1. **Remove AD FS. Migrate to cloud authentication** (PHS, or PTA if you have a hard real-time-validation requirement), and move MFA and access decisions to Conditional Access in Entra where they belong. This deletes Golden SAML as a class, shrinks attack surface massively, and removes a SPOF you were never rotating anyway. This is the single highest-leverage deletion in this book.
|
||||||
|
2. **Stop syncing privileged on-prem accounts to the cloud.** Domain Admins, Enterprise Admins, Tier 0 — filter them *out* of sync scope. They have no business being cloud objects. A synced privileged account is a free bridge for the attacker.
|
||||||
|
3. **Make cloud admins cloud-only.** Global Admins and other Entra privileged roles should be cloud-only accounts (`.onmicrosoft.com`), phishing-resistant, never derived from or synced with on-prem identity. This is the firebreak in one move (see §3).
|
||||||
|
4. **Trim the writebacks.** Keep only the ones with a named owner and a justified reverse blast radius. Delete the rest.
|
||||||
|
5. **Rotate or remove Seamless SSO.** If you don't need it, remove the `AZUREADSSOACC` account. If you keep it, rotate the key on a schedule — and the fact that nobody has is itself a finding.
|
||||||
|
6. **Reduce sync scope.** OU-filter aggressively. Don't sync what the cloud doesn't need. Every synced object is attack surface and a potential bridge. The default "sync everything" is laziness, not architecture.
|
||||||
|
|
||||||
|
For each deletion the test from Book I applies: *if I removed this, would anyone notice in 90 days?* For AD FS the honest answer, after migration, is usually "no — and the attackers will notice it's gone."
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. The barbell — what gets paranoia, what stays cheap
|
||||||
|
|
||||||
|
**The irreplaceable few (paranoid protection, redundancy, monitoring):**
|
||||||
|
|
||||||
|
- **The sync server.** Treat it as Tier 0 *in practice*, not just on the diagram: dedicated admin tier, no internet browsing, hardened OS, least-privileged connector account (use a gMSA; strip DCSync rights if your topology allows the scoped permission model), restricted logon, alerting on the connector account's behaviour.
|
||||||
|
- **The connector accounts.** Least privilege, gMSA where supported, monitored. An account that can DCSync should scream in your SIEM if it ever behaves like a domain controller from the wrong host.
|
||||||
|
- **The AD FS token-signing key** — if AD FS still exists, the key belongs in an HSM, monitored, rotated on a real schedule (remember the rollover cert). But the better barbell move is §2.1: don't own this liability at all.
|
||||||
|
- **Cloud-only break-glass Global Admins** (from Book I) — phishing-resistant, excluded from the CA policy that would lock them out, tested.
|
||||||
|
|
||||||
|
**The firebreak — the one design decision that builds the wall:**
|
||||||
|
|
||||||
|
> **Cloud privilege must not be reachable from on-prem compromise.**
|
||||||
|
|
||||||
|
Cloud-only admin accounts + not syncing privileged on-prem accounts + separate privileged workstations = on-prem can fall completely and the attacker still hits a wall at the cloud admin boundary. *That wall is the entire point of this book.* Draw it, then verify an attacker can't walk around it through the sync server (which is why the sync server is in the paranoid bucket).
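
The first leg of the firebreak, cloud-only admins, is easy to verify from an export of privileged role members. A minimal sketch, assuming the members were exported as records carrying the Microsoft Graph user properties `userPrincipalName` and `onPremisesSyncEnabled`; the function name and sample users are hypothetical:

```python
def synced_privileged_accounts(role_members: list[dict]) -> list[str]:
    """UPNs of privileged role members that are synced from on-prem AD.

    onPremisesSyncEnabled is True for synced users and None (or False)
    for accounts that are not currently synced.
    """
    return [
        member["userPrincipalName"]
        for member in role_members
        if member.get("onPremisesSyncEnabled") is True
    ]


# Hypothetical export of Global Administrator members.
members = [
    {"userPrincipalName": "admin@contoso.onmicrosoft.com", "onPremisesSyncEnabled": None},
    {"userPrincipalName": "j.doe@contoso.com", "onPremisesSyncEnabled": True},
]
print(synced_privileged_accounts(members))
```

Every name this returns is a path from on-prem compromise to cloud privilege. It does not cover the other two legs (privileged on-prem accounts in sync scope, shared workstations), which need their own checks.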
|
||||||
|
|
||||||
|
**Everything else stays cheap.** Standard user sync, normal device registration, the bulk of the directory — these are replaceable and don't deserve the attention that the sync server and the admin boundary demand. Don't gold-plate the directory while the connector account can dump it.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Optionality & recovery — escape hatches, tested
|
||||||
|
|
||||||
|
- **The "kill the sync" runbook.** A written, rehearsed procedure to stop sync fast when on-prem is compromised, so poison stops flowing cloud-ward. Know the nuance per auth method, because severing behaves differently:
|
||||||
|
  - *PHS:* disabling sync stops new changes flowing, but already-synced hashes remain — containment of *propagation*, not instant revocation. Pair with token revocation and credential resets.
|
||||||
|
  - *PTA / Federation:* severing the bridge can take cloud auth down with it unless you've pre-staged a fallback. Which is why —
|
||||||
|
- **Pre-stage the federated-to-managed conversion.** Know, in advance, how to convert the domain from federated (or PTA) to managed/cloud auth (PHS) *fast*, so that during an on-prem incident you can cut the dependency and keep the cloud alive on its own. Rehearse it. "We think we could" is not a plan.
|
||||||
|
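- **PHS as failover under PTA.** Cheap optionality: run PHS alongside PTA so a PTA-agent or on-prem outage doesn't lock everyone out of the cloud. The switch from PTA to PHS is a manual change, not an automatic failover, so write it into the runbook and rehearse it. Small certain cost now, large uncertain payoff later. Classic Book I optionality.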
|
||||||
|
- **Cloud-only admin path that survives AD death.** Because cloud admins are cloud-only (§3), you retain full control of the tenant even if AD is gone. This *is* the recovery path — verify it actually works without any on-prem dependency (including MFA that doesn't secretly route through on-prem).
|
||||||
|
- **Accept the source-of-authority reality.** Your IR plan must account for the fact that synced objects are downstream of on-prem. Decide *in advance* whether, during a domain-dominance incident, you sever first and rebuild authority cloud-side. Discovering this mid-incident is how recoveries fail.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Stressor — break it on purpose
|
||||||
|
|
||||||
|
Untested = broken. Game-days for hybrid identity, smallest/safest first:
|
||||||
|
|
||||||
|
- **Pull the sync server** (planned window). Does cloud auth survive? The answer *proves* which auth method you're really on and whether your availability assumptions are true. Most teams are surprised. That surprise is the point.
|
||||||
|
- **Revoke / disable the connector account and watch your SIEM.** Did anything alert? An account that can DCSync going dark, or behaving oddly, should be the loudest alarm you own. If nothing fired, you've found a detection gap worth more than any control you could add.
|
||||||
|
- **Golden SAML tabletop** (if AD FS exists). Walk through: attacker has the token-signing key — what do you detect, how do you contain, how fast can you rotate, and could you tell at all? If the honest answer is "we couldn't tell," escalate the §2.1 removal from "roadmap" to "now."
|
||||||
|
- **Break-glass under sync-down.** Test the cloud-only break-glass account *while the bridge is severed*. It must work with zero on-prem dependency. If it silently relied on something on-prem, you just found it on a Tuesday instead of during the breach.
|
||||||
|
- **DCSync detection drill.** Have someone simulate DCSync from an unexpected host and confirm detection fires. The connector account is the one place DCSync is "normal," which is exactly why attackers love to look like it.
|
||||||
|
|
||||||
|
Every one of these, per Book I principle 6: whatever breaks must produce a **structural** change, not a calendar reminder.
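
For the DCSync detection drill, it helps to know roughly what the detection looks like before judging whether the client's SIEM has one. The sketch below assumes directory service access auditing is enabled so that Security event 4662 is logged on domain controllers, and that events have been parsed into records with hypothetical fields `event_id`, `subject`, and `properties`. It keys on the requesting account only; tying a request to a source host needs correlation with logon events, which is not shown.

```python
# Replication control access right GUIDs that appear in the Properties field
# of event 4662 when directory replication is requested.
REPLICATION_GUIDS = (
    "1131f6aa-9c07-11d1-f79f-00c04fc2dcd2",  # DS-Replication-Get-Changes
    "1131f6ad-9c07-11d1-f79f-00c04fc2dcd2",  # DS-Replication-Get-Changes-All
)


def unexpected_replication(events: list[dict], expected_accounts: set[str]) -> list[dict]:
    """4662 events that request replication from an account not on the allowlist."""
    expected = {name.lower() for name in expected_accounts}
    hits = []
    for event in events:
        if event["event_id"] != 4662:
            continue
        properties = event["properties"].lower()
        if not any(guid in properties for guid in REPLICATION_GUIDS):
            continue
        if event["subject"].lower() in expected:
            continue
        hits.append(event)
    return hits


# Hypothetical parsed events.
events = [
    {"event_id": 4662, "subject": "CONTOSO\\DC01$", "properties": "{1131f6ad-9c07-11d1-f79f-00c04fc2dcd2}"},
    {"event_id": 4662, "subject": "CONTOSO\\j.doe", "properties": "{1131F6AD-9C07-11D1-F79F-00C04FC2DCD2}"},
    {"event_id": 4624, "subject": "CONTOSO\\j.doe", "properties": ""},
]
print(unexpected_replication(events, expected_accounts={"CONTOSO\\DC01$", "CONTOSO\\DC02$"}))
```

Note the design choice in the example: the connector account is not on the allowlist. Allowlisting it by name alone recreates the blind spot the drill is meant to expose, so a real rule would allow it only from the sync server.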
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Honest uncertainty (read this, don't trust a handbook on moving parts)
|
||||||
|
|
||||||
|
This book teaches stable mechanisms — the coupling between AD and Entra, Golden SAML, the DCSync-via-connector path, the PHS/PTA/federation trade-offs. Those don't change much; they're Lindy.
|
||||||
|
|
||||||
|
What **does** move, and what you must verify against current Microsoft documentation rather than trusting any 2026-vintage handbook:
|
||||||
|
|
||||||
|
- **Connect Sync vs Cloud Sync feature parity.** Microsoft has been steering new deployments toward the lighter Cloud Sync agent (no SQL, multiple agents for HA — better optionality), but parity for specific scenarios (certain writebacks, device sync, large/complex topologies, passthrough nuances) has been evolving. **Check the current parity matrix before you recommend a migration.** Don't let me, or any document, freeze this for you.
|
||||||
|
- **AD FS deprecation / migration tooling.** Direction of travel is clearly away from AD FS toward Entra-native auth, with staged-rollout and migration tooling to ease it. Exact timelines, tool capabilities, and supported paths shift — verify current state when you scope the work.
|
||||||
|
- **Connector account hardening guidance** (gMSA support, least-privilege permission models, the scoped alternative to full DCSync rights) continues to improve — confirm what's available for your topology and version.
|
||||||
|
|
||||||
|
If a client's safety depends on a current-version specific, **look it up and cite it**, don't quote your memory or this book. Honest "I need to verify the current parity" beats confident and wrong every time. That's not weakness; that's the job.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consolidated judgement prompts
|
||||||
|
|
||||||
|
The questions to carry into any hybrid estate:
|
||||||
|
|
||||||
|
- Which auth method are we *actually* on — and does the cloud survive on-prem death? (Verify by testing, not by asking.)
|
||||||
|
- Is the sync server Tier 0 in practice or only on the diagram? What can its connector account reach? Can it DCSync?
|
||||||
|
- Are any privileged on-prem accounts synced to the cloud? Are Global Admins cloud-only or synced?
|
||||||
|
- Can on-prem privilege reach cloud privilege by *any* path — accounts, workstations, the sync server, writebacks? Draw every path. Each one is a hole in the wall.
|
||||||
|
- Do we have AD FS? *Why?* What exactly would removing it take, and what's the honest reason it hasn't happened?
|
||||||
|
- When was the Seamless SSO key / AD FS token-signing cert last rotated? ("Never" is a finding, not an answer.)
|
||||||
|
- Which writebacks are on, and what reverse blast radius does each create?
|
||||||
|
- If we severed the bridge in the next 30 minutes, what breaks, and is the procedure written where someone panicking can run it?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Book II of the Antifragile Handbook. The wall between on-prem and cloud is the most important structure you will ever draw — because in most estates, it isn't there. Move fast and fix things.*
|
||||||
@@ -0,0 +1,142 @@
|
|||||||
|
# The Antifragile Handbook for M365 & Active Directory
|
||||||
|
|
||||||
|
## Book III — Privileged Access
|
||||||
|
|
||||||
|
> *Privilege is blast radius with a time axis. Standing privilege reaches everything, forever. The whole job is to collapse both: less reach, less time.*
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The governing question
|
||||||
|
|
||||||
|
Book I asked you to draw the wall. Book II built it between on-prem and cloud. This book is about the credentials that can knock any wall down. Ask of every privileged identity — human, service account, or app:
|
||||||
|
|
||||||
|
> **If this credential leaks tonight, how long does it stay useful, and how far does it reach?**
|
||||||
|
|
||||||
|
A permanent Domain Admin answers *"forever, everything."* A permanent Global Admin answers *"forever, the whole tenant."* A JIT, scoped, time-boxed role answers *"for one hour, for one task."* Every technique in this book exists to turn the first kind of answer into the second. That's it. That's the whole craft of privileged access: **shrink the reach, shrink the time.**
|
||||||
|
|
||||||
|
Compliance counts whether you "have a PAM solution." Wrong question. The question is whether privilege *evaporates when not in use* and whether a leaked credential hits a wall in minutes instead of owning the estate forever.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Fragility inventory — where privilege rots
|
||||||
|
|
||||||
|
### Standing privilege (the original sin)
|
||||||
|
|
||||||
|
An account that is *always* an admin is a loaded gun left on the table, every hour of every day, whether anyone's using it or not. Its blast radius is constant and maximal. Permanent Domain Admins, permanent Enterprise Admins, permanent Global Admins — every one of them is a credential whose value to an attacker never drops to zero. **The single most important number in this book is: how many identities hold standing privilege?** In most estates it's an order of magnitude too high, and nobody has ever counted.
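
Counting standing privilege is mostly a matter of deciding what counts and then filtering an export. A minimal sketch, assuming role and group assignments from both sides have been flattened into records with hypothetical fields `principal`, `role`, `state` ("active" or "eligible"), and `expires`; the role list is an illustrative starting set, not a complete definition of privilege:

```python
# Illustrative starting set; extend per estate.
PRIVILEGED_ROLES = {
    "Global Administrator",
    "Privileged Role Administrator",
    "Domain Admins",
    "Enterprise Admins",
}


def standing_privileged(assignments: list[dict]) -> set[str]:
    """Identities holding a privileged role that is active and never expires."""
    return {
        a["principal"]
        for a in assignments
        if a["role"] in PRIVILEGED_ROLES
        and a["state"] == "active"
        and a["expires"] is None
    }


# Hypothetical flattened export.
assignments = [
    {"principal": "alice", "role": "Global Administrator", "state": "active", "expires": None},
    {"principal": "bob", "role": "Global Administrator", "state": "eligible", "expires": None},
    {"principal": "svc-sql", "role": "Domain Admins", "state": "active", "expires": None},
    {"principal": "carol", "role": "Helpdesk Administrator", "state": "active", "expires": None},
]
print(len(standing_privileged(assignments)))
```

The length of that set is the number to report and to drive down; break-glass accounts will legitimately remain in it and should be named as such.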
|
||||||
|
|
||||||
|
### Service accounts and service principals (the dark matter)
|
||||||
|
|
||||||
|
This is where the bodies are buried, on both sides of the wall:
|
||||||
|
|
||||||
|
- **On-prem service accounts** — over-permissioned ("we made it Domain Admin to make it work"), static passwords that haven't changed since 2016, an SPN attached so they're **Kerberoastable** (any domain user can request the service ticket, then crack the weak password offline at leisure), owned by nobody, documented nowhere, and impossible to turn off because something unknown will break.
|
||||||
|
- **Cloud service principals / app registrations** — the same disease in a new body. Client secrets that never expire, **tenant-wide admin consent**, and Microsoft Graph permissions that are quietly catastrophic: `RoleManagement.ReadWrite.Directory`, `AppRoleAssignment.ReadWrite.All`, `Application.ReadWrite.All` — any of which is a privilege-escalation path to Global Admin. Service principals **cannot do MFA**, usually hold **standing** privilege, and live in a blind spot no benchmark looks at hard enough.
|
||||||
|
|
||||||
|
Service identities are dark matter: most of the privileged mass of the estate, invisible in the usual diagrams, and gravitationally dominant when something goes wrong.
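
The three Graph permissions named above make a workable first-pass filter over an export of application permissions. The sketch assumes the grants have already been resolved to permission names and collected per application; the dictionary shape and the sample app names are hypothetical, and the permission set is only the three from the text, not an exhaustive list of escalation-capable permissions.

```python
# The escalation-capable application permissions named in the text.
DANGEROUS_PERMISSIONS = {
    "RoleManagement.ReadWrite.Directory",
    "AppRoleAssignment.ReadWrite.All",
    "Application.ReadWrite.All",
}


def risky_apps(grants: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each app to the dangerous permissions it holds, omitting clean apps."""
    return {
        app: permissions & DANGEROUS_PERMISSIONS
        for app, permissions in grants.items()
        if permissions & DANGEROUS_PERMISSIONS
    }


# Hypothetical export: app display name -> granted application permissions.
grants = {
    "hr-sync-app": {"User.Read.All", "RoleManagement.ReadWrite.Directory"},
    "reporting": {"Reports.Read.All"},
}
print(risky_apps(grants))
```

Each flagged app then needs the same questions as any other privileged identity: who owns it, what credential does it use, and when does that credential expire.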
|
||||||
|
|
||||||
|
### Tier violations (the wall with a hole kicked in it)
|
||||||
|
|
||||||
|
The Lindy core of on-prem security is the tier model (Tier 0 = identity control plane: DCs, AD, ADCS, the sync server from Book II; Tier 1 = servers; Tier 2 = workstations). Microsoft has since reframed it as the Enterprise Access Model reaching into the cloud, but the rule never changed:
|
||||||
|
|
||||||
|
> **A higher-tier credential must never be exposed on a lower-tier system.**
|
||||||
|
|
||||||
|
Every Domain Admin who RDPs into a workstation, every admin whose daily-driver laptop also touches a DC, every shared jump box used for both Tier 0 and Tier 1 — that's a tier violation, and it's how `pass-the-hash` / `pass-the-ticket` turns one phished workstation into domain dominance. The clean-source principle is absolute: **you cannot securely manage a system from a less-secure one.**
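
The rule reduces to a single comparison once every account and host has a tier. The sketch below assumes you have built two tier maps by hand and have logon records as (account, host) pairs from whatever log source is available; all names are hypothetical. It raises `KeyError` on an untiered account or host, on the view that an unclassified system is itself a finding.

```python
def tier_violations(
    logons: list[tuple[str, str]],
    account_tier: dict[str, int],
    host_tier: dict[str, int],
) -> list[tuple[str, str]]:
    """Logons where a higher-tier credential touched a lower-tier host.

    Tier numbers follow the text: 0 is the control plane, 2 is workstations,
    so a smaller number means more privilege.
    """
    return [
        (account, host)
        for account, host in logons
        if account_tier[account] < host_tier[host]
    ]


# Hypothetical tier maps and logon records.
account_tier = {"adm-t0-alice": 0, "adm-t1-bob": 1, "carol": 2}
host_tier = {"DC01": 0, "SQL03": 1, "WS-114": 2}
logons = [
    ("adm-t0-alice", "DC01"),
    ("adm-t0-alice", "WS-114"),  # Tier 0 credential exposed on a workstation
    ("adm-t1-bob", "SQL03"),
    ("carol", "WS-114"),
]
print(tier_violations(logons, account_tier, host_tier))
```

The hard part in practice is the tier maps, not the comparison: building them is the mapping exercise this section argues most defenders never do.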
|
||||||
|
|
||||||
|
### The escalation plumbing nobody maps
|
||||||
|
|
||||||
|
- **AD ACL backdoors** — who can reset whose password, who has `WriteDACL` / `GenericAll` on what. Privilege hides in object permissions, not just group membership. Attackers map this in minutes; defenders rarely map it at all.
|
||||||
|
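- **Delegation** — unconstrained delegation is a standing ticket-theft risk: a host trusted for it caches the TGT of every account that authenticates to it, so compromising that host can hand over a privileged TGT; constrained/RBCD misconfigurations are escalation paths.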
|
||||||
|
- **ADCS** — the certificate services escalation paths (the ESC-series misconfigurations) turn a forgotten CA template into domain compromise. ADCS is **Tier 0** and is almost always treated as Tier 1 or forgotten entirely.
|
||||||
|
- **KRBTGT** — the master key behind golden tickets. Rarely rotated; if an attacker ever had it, they may still have it.
|
||||||
|
- **LAPS absent** — without per-machine local admin password randomisation, one stolen local admin hash (pass-the-hash needs no cracking) unlocks lateral movement across every machine sharing it.
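
Of the items above, unconstrained delegation is the simplest to find, because it is a single bit in `userAccountControl`. A minimal sketch, assuming you have exported account names and their raw `userAccountControl` values; the function name and sample accounts are hypothetical, while the flag values are the documented ones. Domain controllers carry the delegation flag by design, so they are filtered out.

```python
# Documented userAccountControl flag values.
TRUSTED_FOR_DELEGATION = 0x80000  # unconstrained delegation
SERVER_TRUST_ACCOUNT = 0x2000     # the account is a domain controller


def unconstrained_delegation(accounts: dict[str, int]) -> list[str]:
    """Non-DC accounts trusted for unconstrained delegation, sorted by name."""
    return sorted(
        name
        for name, uac in accounts.items()
        if (uac & TRUSTED_FOR_DELEGATION) and not (uac & SERVER_TRUST_ACCOUNT)
    )


# Hypothetical export: account name -> userAccountControl.
accounts = {
    "DC01$": 532480,   # 0x82000: domain controller, expected
    "APP07$": 528384,  # 0x81000: member server with unconstrained delegation
    "WS123$": 4096,    # 0x1000: ordinary workstation
}
print(unconstrained_delegation(accounts))
```

The other items in the list (ACLs, ADCS templates, KRBTGT age) need graph or certificate-template analysis and are better left to dedicated tooling.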
|
||||||
|
|
||||||
|
### The recovery paradox
|
||||||
|
|
||||||
|
The accounts that can rebuild the estate after a disaster are, by definition, the most powerful — and therefore the most valuable to an attacker. Break-glass done carelessly is just standing privilege with a heroic name. (Handled in §4.)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Via negativa — what to remove (in priority order)
|
||||||
|
|
||||||
|
Privilege is the domain where deletion is the entire strategy. Adding "privileged access controls" on top of unmanaged standing privilege is rearranging furniture in a burning room.
|
||||||
|
1. **Eliminate standing privilege.** Roles become *eligible*, not *active*. Cloud-side this is PIM (§3). On-prem it's harder and the tooling is weaker — be honest about that (§ honest uncertainty) — but time-bound group membership and JIT elevation tooling exist; use them. The target state: at rest, almost nobody is an admin.
|
||||||
|
2. **Empty the top groups toward the irreducible minimum.** Drive Domain Admins, Enterprise Admins, and standing Global Admins down to the smallest number that reality permits (plus break-glass). Delegate specific rights instead of handing out god-mode. "Empty Domain Admins" is an achievable goal, not a fantasy.
|
||||||
|
3. **Kill, convert, or constrain service identities.** Remove the ones nobody can justify (apply the 90-day-scream test). Convert the rest to managed identities — **gMSA** on-prem (the established, Lindy fix: automatic password rotation, no static secret, not Kerberoastable in the same way), **managed identities** in Azure where possible. Strip every excess right. For app registrations: remove the dangerous Graph permissions, expire and rotate secrets, prefer certificate credentials or managed identities over secrets, and delete unused registrations and stale consent grants.
|
||||||
|
4. **Remove tier violations.** No high-tier credential on a low-tier box, ever. This is mostly subtraction — taking admin rights *off* daily-driver machines and shared boxes.
|
||||||
|
5. **Fix the escalation plumbing by removal.** Decommission unused ADCS templates, remove unconstrained delegation, prune dangerous ACLs, deploy LAPS so standing shared local admin passwords cease to exist.
|
||||||
|
6. **Remove standing local admin from users.** Most don't need it. The ones who think they do usually need it for ten minutes a month — which is a JIT problem, not a standing-rights problem.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. The barbell — paranoia for the control plane, cheap for the rest
|
||||||
|
|
||||||
|
**The irreplaceable few (paranoid, redundant, monitored):**
|
||||||
|
|
||||||
|
- **Tier 0** — DCs, AD, ADCS, KRBTGT, and the sync server from Book II. This is the control plane; if it falls, everything falls.
|
||||||
|
- **The handful of break-glass Global Admins** (§4).
|
||||||
|
- **The PIM / role-management configuration itself** — because whoever controls *who can become admin* is effectively admin. Privileged Role Administrator and Privileged Authentication Administrator are crown roles; treat them as such.
|
||||||
|
|
||||||
|
**Paranoid protection for privileged work means, non-negotiably:**
|
||||||
|
|
||||||
|
- **PAWs** — privileged access workstations. All Tier 0 / Global Admin work happens from a clean, hardened, single-purpose device that never reads email or browses the web. The admin's normal laptop is Tier 2 and stays there.
|
||||||
|
- **Phishing-resistant MFA only** for admins — FIDO2 / passkeys / certificate-based. SMS and push-approve are not admin-grade; they're phishable, and admins are the phishing prize.
|
||||||
|
- **Separate, cloud-only privileged identities** for cloud admin (the Book II firebreak, enforced here). On-prem admin identity must not be the cloud admin identity.
|
||||||
|
- **JIT for everything** via PIM: eligible-not-active, time-boxed, MFA on activation, justification logged, and **approval workflow on the crown roles**.
|
||||||
|
- **Conditional Access scoped to admins** — privileged roles usable only from PAWs / compliant devices / named locations.
|
||||||
|
|
||||||
|
**Everything else stays cheap.** Standard RBAC, normal user access, ordinary app permissions — don't pour the privileged-access budget evenly across the whole directory. Concentrate it ferociously on the tiny set of identities that own the control plane. A thousand hardened standard users won't save you if one permanent Domain Admin uses `Password1!` on a Kerberoastable SPN.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Optionality & recovery — escape hatches, tested
|
||||||
|
|
||||||
|
- **Break-glass done right.** This is the deliberate exception to "no standing privilege" — you *need* an account that works when PIM, MFA infrastructure, or the IdP is down. So it's standing by necessity, which means it is protected differently: cloud-only, phishing-resistant credential stored offline/split, excluded from the CA policy that would otherwise lock it out, and **wired so that any use at all triggers a screaming alert.** Standing privilege you can't remove, you watch like a hawk. And you **test it** — an untested break-glass account is Schrödinger's recovery.
|
||||||
|
- **KRBTGT rotation on demand.** Can you rotate KRBTGT (twice, with the required interval) the moment you suspect golden tickets — without taking the forest down? Is it rehearsed? If not, you have a theoretical control, not a real one.
|
||||||
|
- **Fast session revocation / admin disable.** A one-move way to kill a compromised admin's sessions and tokens and disable the account, on both sides of the wall. Rehearse it; the breach is not the time to discover the command.
|
||||||
|
- **No single human as the only recovery path** — balanced against blast radius. You want enough redundancy that one person under a bus (or under coercion) doesn't end recovery, without so many standing admins that you've recreated the problem. The barbell, again.
|
||||||
|
- **Tier 0 / forest rebuild path.** This links forward to the handbook's Recovery material. Know it exists, know it's been tested, know it doesn't secretly depend on a credential that the incident just compromised.
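The "twice, with the required interval" condition on KRBTGT rotation in the list above can be written down as a small guard, so the rehearsal has something concrete to check. This is a minimal sketch, not a rotation tool: the function names are hypothetical, and the 10-hour figure assumes the default maximum ticket lifetime, which should be read from the domain's actual Kerberos policy.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the safe gap between the two resets is the domain's maximum
# ticket lifetime. 10 hours is the usual default; confirm the real value in
# the domain's Kerberos policy rather than trusting this constant.
MAX_TICKET_LIFETIME = timedelta(hours=10)


def second_reset_allowed(first_reset, now, max_ticket_lifetime=MAX_TICKET_LIFETIME):
    """Return True once tickets issued before the first reset should have expired."""
    return now - first_reset >= max_ticket_lifetime


def time_remaining(first_reset, now, max_ticket_lifetime=MAX_TICKET_LIFETIME):
    """How long to wait before the second reset (zero if it is already safe)."""
    remaining = max_ticket_lifetime - (now - first_reset)
    return max(remaining, timedelta(0))


# Hypothetical rehearsal timestamps.
first = datetime(2026, 1, 1, 8, 0, tzinfo=timezone.utc)
print(second_reset_allowed(first, first + timedelta(hours=4)))
print(time_remaining(first, first + timedelta(hours=4)))
```

In a live incident you may decide to break the interval deliberately and accept the disruption; the point of the guard is that the decision is explicit rather than accidental.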
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Stressor — break it on purpose
|
||||||
|
|
||||||
|
- **Pull an admin's standing access and route them through PIM for a week.** Does real work still flow? If JIT activation is too slow or broken, people will route around it — and you'll have found that in a drill instead of discovering the shadow standing-admin account they created in revenge.
|
||||||
|
- **Kerberoast yourself.** Run the attack against your own directory. Which service accounts crack? Did anything *detect* the ticket requests? Two findings in one cheap test.
|
||||||
|
- **Attempt a tier violation in a test window.** Try to use a Tier 0 credential on a Tier 2 box. Is it blocked? Detected? Silent? Silence is the worst answer and the most common.
|
||||||
|
- **Run attack-path analysis as routine, not as a once-a-year pentest.** Tools that map "who can reach Domain Admin / Global Admin in N hops" turn privilege escalation into a number you can track over time. **The count of paths to domain/tenant dominance is a better security metric than any compliance percentage.** Drive it down; watch it not creep back up.
|
||||||
|
- **Simulate a malicious consent grant / over-permissioned app.** Register an app requesting a dangerous Graph scope. Does anything flag it? Can you find every existing app holding those scopes today? (You should be able to. Most can't.)
|
||||||
|
- **Break-glass drill** — yes, again, and on a schedule. The recurring test in this whole handbook.
|
||||||
|
|
||||||
|
Per Book I principle 6: each of these must yield a **structural** change — a removed right, a severed path, a new alert — not a note that says "be careful."
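The "who can reach Domain Admin in N hops" metric from the attack-path item can be illustrated on a toy graph. This is a sketch under stated assumptions: the edge list and principal names are invented, an edge means "source can take control of destination", and the function counts principals with at least one path, which is a simplification of counting every distinct path. Real attack-path tooling models many more edge types.

```python
from collections import deque


def principals_reaching(edges, target, max_hops):
    """Return {principal: shortest hop count} for everything within max_hops of target."""
    # Reverse the graph so a single BFS from the target finds everything upstream.
    reverse = {}
    for src, dst in edges:
        reverse.setdefault(dst, set()).add(src)

    seen = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for upstream in reverse.get(node, ()):
            if upstream not in seen:
                seen[upstream] = seen[node] + 1
                queue.append(upstream)

    seen.pop(target)
    return seen


# Hypothetical control relationships.
edges = [
    ("helpdesk", "svc-backup"),       # e.g. can reset its password
    ("svc-backup", "Domain Admins"),  # e.g. is a member of the group
    ("alice", "helpdesk"),
    ("bob", "workstation-7"),
]
reachable = principals_reaching(edges, "Domain Admins", max_hops=3)
print(len(reachable), sorted(reachable))
```

Tracking `len(reachable)` after every change window is one way to turn "drive it down; watch it not creep back up" into a number on a chart.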
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Honest uncertainty (the moving parts — verify, don't trust this book)
|
||||||
|
|
||||||
|
Stable and Lindy (teach with confidence): standing privilege is the core risk; the tier / clean-source model; Kerberoasting, pass-the-hash, golden/silver tickets, DCSync; the gMSA pattern; JIT/eligibility as the goal. These don't churn.
|
||||||
|
|
||||||
|
What moves, and what you must verify against current Microsoft documentation:
|
||||||
|
|
||||||
|
- **PIM capabilities, role definitions, and the risk classification of specific Graph permissions** evolve continually. Confirm which scopes are escalation-grade *today* rather than trusting a 2026 list.
|
||||||
|
- **On-prem JIT/PAM tooling is genuinely weaker and more fragmented than the cloud story.** Native time-bound group membership, MIM PAM, and third-party PAM all have trade-offs that shift. Don't promise a client a clean AD-native JIT experience without checking current reality — and be honest that on-prem eligibility is harder than PIM makes cloud look.
|
||||||
|
- **gMSA vs dMSA.** gMSA is the established, Lindy answer for managed service accounts. **dMSA** (delegated managed service accounts, introduced with the Windows Server 2025 generation) targets the real gap — migrating a standing service account and disabling the original — but newer mechanisms carry newer attack surface, and there has been published privilege-escalation research against the dMSA migration path. **Verify current patch and hardening guidance before you recommend dMSA**; this is exactly the kind of new-and-shiny that Book I principle 8 warns about. gMSA until you've checked dMSA's current state.
|
||||||
|
- **Enterprise Access Model vs the classic three-tier model** — same logic, evolving names and cloud extensions. Use whichever vocabulary the client knows; don't get religious about the label.
|
||||||
|
|
||||||
|
If a client's safety hinges on a current specific, look it up and cite it. "I need to verify the current Graph permission classification" beats confidently quoting a stale one. That posture *is* the independence this handbook is trying to build.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consolidated judgement prompts
|
||||||
|
|
||||||
|
- How many identities hold **standing** privilege — human, service account, and service principal — counted, named, and owned? (If you can't produce the number, that's finding #1.)
|
||||||
|
- For each privileged credential: leaked tonight, how long is it useful and how far does it reach? Where's the wall?
|
||||||
|
- Where are the tier violations? Which high-tier credentials touch low-tier systems? Does any admin's daily laptop reach Tier 0?
|
||||||
|
- Which service accounts are Kerberoastable? Which app registrations hold escalation-grade Graph permissions or non-expiring secrets?
|
||||||
|
- Are cloud admins cloud-only and phishing-resistant, or synced and push-MFA'd? (Book II firebreak — verify it's actually enforced here.)
|
||||||
|
- Does privilege **evaporate when idle** (PIM/JIT) or sit loaded on the table?
|
||||||
|
- Is ADCS treated as Tier 0? When was KRBTGT last rotated? Is LAPS deployed?
|
||||||
|
- Break-glass: does it exist, is it monitored to scream on use, and when was it last *tested* — not created, tested?
|
||||||
|
- How many paths to Domain Admin / Global Admin exist right now, and is that number going up or down?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Book III of the Antifragile Handbook. Privilege is blast radius with a clock on it. Shrink the reach, shrink the time, and watch the credentials that can rebuild the world. Move fast and fix things.*
|
||||||
@@ -0,0 +1,172 @@
|
|||||||
|
# The Antifragile Handbook for M365 & Active Directory
|
||||||
|
|
||||||
|
## Book IV — Devices & Endpoint (Intune)
|
||||||
|
|
||||||
|
> *The device will be compromised. Compliant is not the same as secure, and the portal toggle is not the same as the device's behaviour. Build for the compromise, not against it.*
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The governing question
|
||||||
|
|
||||||
|
Most endpoint programmes are built on a wish: *make the device trusted.* That wish is unwinnable — a device in a user's hand, on a network you don't control, running an OS you didn't write, will eventually be compromised, and no amount of hardening changes that. So flip the question:
|
||||||
|
|
||||||
|
> **Assume every device is already compromised. What still holds?**
|
||||||
|
|
||||||
|
If the answer is "nothing, because a compromised-but-compliant device gets full access," you've built fragility with a green tick on it. The antifragile endpoint posture stops trying to own the device and instead builds a boundary that **survives an untrusted device**: the data lives behind a wall, the device is cheap and disposable, and "compliant" is treated as what it actually is — a *signal that can be wrong*, not a guarantee.
|
||||||
|
|
||||||
|
That reframe — **compliance is a signal, not a checkbox** — is the spine of this whole book.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Fragility inventory — where the endpoint betrays you
|
||||||
|
|
||||||
|
### The fleet is a fiction: managed, unmanaged, shadow, dark
|
||||||
|
|
||||||
|
Before any of the controls below mean anything, confront the foundational lie of endpoint security: **you do not know your fleet.** The whole book so far has said "the managed devices" as if that set is the fleet. It isn't. The managed devices are the part you *chose to count* — and in most estates they're the bigger part only *if you're lucky.* The blast radius lives in everything else.
|
||||||
|
|
||||||
|
The honest spectrum of what touches your data:
|
||||||
|
|
||||||
|
- **Managed** — enrolled (MDM) or app-managed (MAM). The devices you can see and control. The part the programme is about, and the part everyone fixates on.
|
||||||
|
- **Known-but-unmanaged** — devices that authenticate and reach data but aren't managed. Entra-registered-but-not-compliant, BYOD that hit OWA or a SharePoint link in a browser. They're in the sign-in logs; they're not under your control.
|
||||||
|
- **Shadow** — devices the org never sanctioned but users brought anyway: a personal phone, a contractor's laptop, a home PC pulling files through the web client. Shadow IT at the device layer.
|
||||||
|
- **Dark**: access you have *no device-level visibility into at all.* Legacy-protocol sign-ins that bypass Conditional Access and never produce a clean device signal. Long-lived tokens issued once and never re-evaluated. App passwords. Service principals and automation that aren't devices but reach data like one (the "dark matter" of Book III, wearing a different hat). This is the end of the spectrum that should frighten you, because it never trips a sensor.
|
||||||
|
|
||||||
|
And the inventory of record — the CMDB — is almost always **more wish than reality.** It's populated by *process* (someone files a ticket), and process decays the moment attention moves on. The real device population is populated by *behaviour* — what is actually authenticating right now. The gap between those two is precisely your shadow and dark population, and it's invisible exactly where it matters most.
|
||||||
|
|
||||||
|
This is the Book I corollary made flesh: **the inventory is a claim; the sign-in log is the fact.** Stop deriving your fleet from the CMDB (declarative, decaying, wishful) and start deriving it from observed authentication (behavioural, current, honest). You can't manage what you can't see, and you can't see what you decided not to look at.
|
||||||
|
|
||||||
|
The reframe that saves you is the same barbell from §3: the goal is **not** to manage every device, which is impossible, and chasing it is fragile. The goal is (a) to *know the real population* by observation, and (b) to *gate the data* so that an unmanaged or unknown device gets limited, app-contained, or no access. The question was never "is this device managed." It's **"can a device I don't control reach the data, and what happens when it does?"** An unmanaged device forced through an app-protection boundary or a browser session control is contained. An unmanaged device holding a fat client and a never-re-evaluated token is a hole in the wall you didn't know was open.
|
||||||
|
|
||||||
|
### The compliance signal lies (in both directions)
|
||||||
|
|
||||||
|
"Require compliant device" in Conditional Access is the real control. But the compliance signal underneath it is softer than the toggle suggests:
|
||||||
|
|
||||||
|
- **It's stale.** Compliance is evaluated on a check-in cadence, not continuously. There's a window where a device falls out of compliance — gets rooted, drops encryption, falls behind on patches — and still carries a "compliant" state and a valid token. The signal lags reality.
|
||||||
|
- **It's spoofable.** Root/jailbreak detection is an arms race, not a wall. A motivated attacker (or a determined user with a YouTube tutorial) steps over the tripwire. Treat detection as a tripwire, never as a barrier.
|
||||||
|
- **It's shallow.** "Compliant" usually means a handful of boxes — PIN set, encrypted, OS version, not-jailbroken. None of those stop malware running with the user's own token on a device that passed every check.
|
||||||
|
- **It fails both ways.** A false *compliant* over-trusts a hostile device. A false *non-compliant* locks a legitimate user out at the worst possible moment — and anyone who's run endpoint at scale has watched a flaky signal brick access for someone important mid-flight. Both failure modes are real; design for both.
|
||||||
|
|
||||||
|
### The ghost policy: displayed config ≠ enforced config
|
||||||
|
|
||||||
|
This one is field-earned and genuinely frightening, because it defeats every form of inspection there is. A Conditional Access policy can show a **perfectly correct configuration in the portal** — every condition, assignment, and grant exactly as intended — and yet **never enforce anything.** The backend state has desynced or corrupted; the object you're *looking at* is not the object being *evaluated*. Recreating the policy from scratch with byte-identical parameters restores enforcement. Nothing in the displayed config ever told you it was broken.
|
||||||
|
|
||||||
|
Sit with what that means. A config review passes. An export-and-diff passes. A CIS audit ticks it green. Every parameter is "correct." And the control is doing nothing — a CA policy that **fails open, silently.** This is the worst failure on the convexity axis: the control you trusted to be convex (fails safe, blocks a class) is quietly behaving concave (fails open, protects nothing), and *no artefact you can read reveals it.* A benchmark cannot catch this. It is invisible to inspection by construction.
|
||||||
|
|
||||||
|
There is exactly one thing on earth that detects it: **observed enforcement under test.** This is not an edge case to file away — it is the single hardest piece of evidence for why the entire stressor discipline in this handbook exists. The iron rule that follows (and it is non-negotiable):
|
||||||
|
|
||||||
|
> **A CA policy's displayed configuration is a claim, never proof. The only proof is a real sign-in producing the expected outcome. Define the expected results *before* you build or change the policy, and test against them every time.**
|
||||||
|
|
||||||
|
Concretely: for the users and conditions that matter, write down the required outcome first — *user X, condition Y → MUST be blocked / granted / MFA-prompted* — so you're testing against a pre-committed expectation, not rationalising whatever you observe. Use the What If tool as a first pass, but understand its limit: What If evaluates the *configuration logic*, so it will happily tell you a ghost policy "applies" while the live evaluator ignores it. **Only a real authentication attempt is proof.** And when behaviour and config disagree, **recreate the policy from scratch — do not re-edit it**, because editing a corrupt object can carry the corruption forward. Re-test after tenant-level changes too, not just after policy edits; the desync can appear without you having touched the policy at all.
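The "write the expected outcome first, then compare" discipline can be captured in a few lines. This is a minimal sketch: the `Expectation` record, the outcome labels, and the sample users are hypothetical, and the `observed` mapping is assumed to be filled in by hand or from sign-in logs after real authentication attempts, not from What If.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expectation:
    user: str
    condition: str
    expected: str  # "block", "grant", or "mfa"


def find_mismatches(expectations, observed):
    """Compare pre-written expectations with outcomes observed from real sign-ins.

    `observed` maps (user, condition) -> outcome. A missing entry is reported
    as "untested", because an expectation nobody exercised proves nothing.
    """
    mismatches = []
    for exp in expectations:
        actual = observed.get((exp.user, exp.condition), "untested")
        if actual != exp.expected:
            mismatches.append((exp, actual))
    return mismatches


# Hypothetical expectations, written BEFORE the policy is built or changed.
expectations = [
    Expectation("admin@contoso.example", "unmanaged device", "block"),
    Expectation("user@contoso.example", "compliant device", "grant"),
]
# Outcomes recorded from real sign-in attempts.
observed = {("admin@contoso.example", "unmanaged device"): "grant"}

for exp, actual in find_mismatches(expectations, observed):
    print(f"{exp.user} / {exp.condition}: expected {exp.expected}, observed {actual}")
```

A mismatch where the displayed policy looks correct is the ghost-policy signature described above; the "untested" rows are the ones still resting on a claim.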
|
||||||
|
|
||||||
|
### The join-state coupling (Book II reaches the desktop)
|
||||||
|
|
||||||
|
Entra hybrid join drags the Book II fragility down to the device: the device identity now depends on on-prem AD, the SCP, the sync, and line-of-sight to a DC for some flows. It's the device-layer version of "one organism, two badges," and it exists almost entirely to service legacy app/auth dependencies. Pure Entra join + Intune is the cloud-native path that severs that coupling.
|
||||||
|
|
||||||
|
### The PRT is the device's golden ticket
|
||||||
|
|
||||||
|
The Primary Refresh Token on a managed device is its key to seamless cloud SSO. A compromised endpoint with a live PRT is a serious blast-radius problem. TPM binding (the session key sealed in hardware) is what raises the cost of stealing it — so "is the PRT TPM-bound?" is a real question, not a checkbox.
|
||||||
|
|
||||||
|
### MAM / App Protection is a *porous* boundary
|
||||||
|
|
||||||
|
Managing the data layer without owning the device (MAM-WE / App Protection Policies) is the right idea — wall the data, don't try to own a personal phone. But the wall has seams, and the data leaks through them: the OS share sheet, copy/paste where it isn't blocked, screenshots, "open in unmanaged app," local save paths, backups and cloud sync, and unmanaged browsers. A **"Block" in the policy is a claim, not a guarantee** — there are documented cases where the data goes out a path the policy was supposed to close. And enforcement is **not symmetric across iOS and Android**: different OS capabilities, different companion app requirements, different gaps that shift release to release. Never assume parity, and never trust the toggle without watching the device.
|
||||||
|
|
||||||
|
### Enrollment is a trust-establishment moment
|
||||||
|
|
||||||
|
Autopilot and enrollment are when a device becomes "trusted." That makes the enrollment path — tokens, the Autopilot device list, enrollment restrictions — a target: hijack it and you enrol a hostile device as a friend. Most programmes harden the device after enrollment and never look hard at the enrollment trust itself.
|
||||||
|
|
||||||
|
### The legacy and standing-privilege drag
|
||||||
|
|
||||||
|
- **GPO + co-management overlap** — on-prem-coupled config (Book II again), conflicts with Intune, and a migration most estates have half-finished for years.
|
||||||
|
- **Standing local admin** on endpoints — the device-layer version of Book III's original sin; one cracked local admin path = lateral movement.
|
||||||
|
- **Legacy auth that bypasses CA entirely** — the device controls are irrelevant on a protocol that never consults Conditional Access.
|
||||||
|
|
||||||
|
### Patch velocity, and its evil twin
|
||||||
|
|
||||||
|
A fleet you can patch in 24 hours is antifragile; one that takes six weeks of change control is fragile, and the attackers know your patch latency better than you do. But the *opposite* failure is just as real: a fast push to **everything at once** with no staging is how a single bad update bricks an entire fleet — the 2024 CrowdStrike mass-BSOD event was exactly this, a security vendor's own update shipped fast to everyone with no canary. Velocity without an escape hatch is concave (see §4).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Via negativa — what to remove
|
||||||
|
|
||||||
|
1. **Go cloud-native.** Move to Entra join + Intune + Autopilot and retire hybrid join, domain join, and GPO wherever the legacy dependency can actually be killed. This severs the Book II coupling at the device layer and deletes a whole class of "the desktop broke because the DC/sync/SCP did" failures.
|
||||||
|
2. **Stop trying to trust the device.** This is a *deletion* — stop pouring effort into making BYOD a trusted device. Wall the data instead (MAM/App Protection) and treat the device as untrusted by default. Subtracting the impossible goal is the move.
|
||||||
|
3. **Remove data from the endpoint.** If the data lives in managed apps and the cloud, there's less on the device to leak or lose. Shrink the local footprint and the compromise gets cheaper to absorb.
|
||||||
|
4. **Remove standing local admin.** JIT elevation (Endpoint Privilege Management) instead — Book III's "shrink the time" at the desktop.
|
||||||
|
5. **Kill legacy auth and the protocols that bypass CA.** A device control you can route around isn't a control.
|
||||||
|
6. **Prune the cruft** — conflicting/duplicate config profiles, dead enrollment profiles, stale Autopilot registrations, orphaned compliance policies nobody can explain. Each one is drift waiting to surprise you.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. The barbell — cheap devices, protected boundary
|
||||||
|
|
||||||
|
**The device is cattle, not a pet.** This is the central barbell of the book. A lost, stolen, or compromised endpoint should be a **shrug**: selective-wipe the corporate data (BYOD) or full-wipe and re-provision via Autopilot in about an hour (corporate). If losing a laptop is a crisis, you've made the device irreplaceable — which means you protected the wrong thing.
|
||||||
|
|
||||||
|
**Protect the irreplaceable boundary instead:**
|
||||||
|
|
||||||
|
- **The access decision** — Conditional Access. This is the convex control of the endpoint world (Book I): one well-built policy blocks whole classes of attack, cheaply. It is also one of the few things that can brick an entire tenant if misconfigured, so it gets paranoid change discipline (§4).
|
||||||
|
- **The data boundary** — the managed-app container / App Protection policy set, tested at the seams (§5), not trusted at the toggle.
|
||||||
|
- **The PRT and enrollment trust** — TPM-bound credentials, hardened enrollment restrictions, device-bound phishing-resistant auth (links Book III).
|
||||||
|
|
||||||
|
**Don't gold-plate the disposable.** Spending weeks locking down a kiosk's wallpaper policy while the CA policy set has a legacy-auth hole is the endpoint version of even-spreading. Concentrate on the decision and the data wall.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Optionality & recovery — escape hatches, tested
|
||||||
|
|
||||||
|
- **Wipe-and-reprovision as the recovery primitive.** Autopilot makes the device replaceable; *that* is your endpoint recovery plan. But "replaceable in an hour" is a slide claim until you've timed it on a real device. Drill it.
|
||||||
|
- **Selective wipe for BYOD** — the clean escape hatch that pulls corporate data without touching the user's photos. The thing that makes MAM politically survivable.
|
||||||
|
- **Update rings and canaries — velocity *with* a brake.** The answer to the CrowdStrike failure mode isn't "patch slowly," it's "patch fast through rings with a real canary, and keep the ability to **halt or roll back** a bad push before it reaches everyone." Fast *and* reversible. This is the barbell and optionality fused: speed on the upside, a bounded blast radius on the downside.
|
||||||
|
- **Break-glass exclusion from device requirements.** A flaky compliance signal must never lock out recovery. The break-glass accounts (Book I/III) sit outside the "require compliant device" gate — and that exclusion is monitored, not forgotten.
|
||||||
|
- **Fast device-trust revocation.** A one-move way to disable a device, revoke its tokens, and drop it from CA trust. Rehearse it.
|
||||||
|
- **Continuous Access Evaluation** is the mechanism shrinking the stale-token window — near-real-time response to critical events instead of waiting for token expiry. It narrows §1's "the signal is stale" gap. Coverage is not universal across every app and flow (verify current state, §honest uncertainty).
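The "fast through rings, with a brake" idea from the update-rings item above reduces to a loop that refuses to advance when a ring looks unhealthy. This is an illustrative sketch only: `staged_rollout`, the `push` callback, and the 2% failure threshold are assumptions for the example, not the behaviour of any particular update service.

```python
def staged_rollout(rings, push, max_failure_rate=0.02):
    """Push ring by ring; halt before the next ring if failures exceed the threshold.

    `rings` is an ordered list of (name, devices). `push(device)` returns True
    when the device is healthy after the update. Returns a tuple of
    (completed ring names, name of the ring that triggered the halt or None).
    """
    completed = []
    for name, devices in rings:
        failures = sum(1 for device in devices if not push(device))
        if devices and failures / len(devices) > max_failure_rate:
            return completed, name  # later rings are never touched
        completed.append(name)
    return completed, None


# Hypothetical rings and a deliberately bad update.
rings = [
    ("canary", ["c1", "c2"]),
    ("pilot", ["p1", "p2", "p3"]),
    ("broad", ["b1", "b2", "b3", "b4"]),
]
completed, halted_at = staged_rollout(rings, push=lambda device: False)
print(completed, halted_at)
```

The design choice worth noticing is that the blast radius of a bad push is bounded by the size of the ring that caught it, which is why the canary ring should be small and real.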
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Stressor — break it on purpose
|
||||||
|
|
||||||
|
This domain rewards hands-on stress more than any other, because the gap between *policy* and *behaviour* only shows up on a real device.
|
||||||
|
|
||||||
|
- **Reconcile the four lists and hunt the deltas.** Pull Intune-enrolled devices, Entra-registered devices, devices appearing in sign-in logs, and the CMDB. None of them will agree. The **disagreements are the findings**: devices authenticating that nobody manages, CMDB entries that never sign in, registered devices that fell out of management. Then go further — count legacy-auth sign-ins and long-lived sessions (the dark end), and run network device discovery for the unmanaged things on the wire. The size of the gap between "the fleet we think we have" and "the population actually touching data" is one of the most honest metrics you can put in a report.
|
||||||
|
- **Attack your own MAM boundary, per platform.** Try to get corporate data out through every seam: share sheet, copy/paste, screenshot, save-as-local, open-in-unmanaged-app, backup/sync, an unmanaged browser. Find where "Block" doesn't actually block. Do it **separately on iOS and Android**: they will not behave the same, and the difference is the finding. (When you find a gap that survives reinstall and reset, that's an escalation to the vendor, not a config you missed.)
|
||||||
|
- **Spoof the compliance signal.** Root/jailbreak a test device. Is it caught? How long until the signal flips and CA reacts? That latency is your real exposure window.
|
||||||
|
- **Prove every CA policy actually enforces.** Never sign off a policy on its displayed config. With expected results written down beforehand, drive real sign-ins for each user/condition that matters and confirm the *observed* outcome matches. Treat What If as a hint, not proof. If a policy that looks correct doesn't enforce, recreate it from scratch rather than editing — the displayed object and the evaluated object can diverge silently, and a ghost policy fails open without ever telling you.
|
||||||
|
- **Lock yourself out on purpose.** In report-only mode, simulate a false non-compliant on a privileged user. Watch the CA decision. Confirm break-glass sails through. Better to find the lockout in a drill than during an outage.
|
||||||
|
- **Push a deliberately bad config/update to the canary ring.** Confirm the ring *contains* it and that halt/rollback works. An untested canary is just the first domino with a friendly name.
|
||||||
|
- **Time a wipe-and-reprovision.** Is the device truly replaceable in an hour, or is that a fiction the recovery plan rests on?
|
||||||
|
- **Compromise a test endpoint.** What does its PRT reach? Does EDR detect it? Does the device-risk signal actually flow into CA and revoke access — or does it stop at a dashboard nobody watches?
|
||||||
|
|
||||||
|
Per Book I principle 6: every gap found becomes a **structural** change — a closed seam, a tightened ring, a severed coupling, an escalation raised — not a line in a test log that dies there.
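The four-list reconciliation from the first stressor above is, at its core, set arithmetic. The sketch below assumes the hard part is already done: that device identifiers from Intune, Entra, the sign-in logs, and the CMDB have been normalised to a common key. The function name, the bucket labels, and the sample IDs are hypothetical.

```python
def reconcile(intune, entra, signins, cmdb):
    """Return the deltas between four device inventories as named sets."""
    intune, entra, signins, cmdb = map(set, (intune, entra, signins, cmdb))
    return {
        # Reaching data without being managed: the known-unmanaged and shadow population.
        "authenticating_but_unmanaged": signins - intune,
        # Recorded by process, never seen in behaviour.
        "cmdb_never_signs_in": cmdb - signins,
        # Registered in the directory but fell out of (or never entered) management.
        "registered_not_managed": entra - intune,
        # Observed authenticating but absent from the inventory of record.
        "unknown_to_cmdb": signins - cmdb,
    }


# Hypothetical normalised device IDs.
deltas = reconcile(
    intune={"d1", "d2"},
    entra={"d1", "d2", "d3"},
    signins={"d1", "d3", "d4"},
    cmdb={"d1", "d2", "d5"},
)
for bucket, devices in deltas.items():
    print(bucket, len(devices), sorted(devices))
```

The counts per bucket are the report metric; the members of each bucket are the work queue.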
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Honest uncertainty (endpoints are the worst offender — verify on a real device)
|
||||||
|
|
||||||
|
Stable and Lindy (teach with confidence): the device will be compromised; trust the boundary, not the device; cheap-and-reprovisionable beats hardened-and-precious; compliance is a signal; velocity needs a brake. None of that churns.
|
||||||
|
|
||||||
|
What moves — and on the endpoint, it moves *faster and more quietly* than anywhere else in this handbook:
|
||||||
|
|
||||||
|
- **MAM / App Protection enforcement is version-, platform-, and OS-build-dependent, and it has gaps that shift release to release.** iOS and Android are not symmetric and never have been; companion app requirements and managed-browser support change. The portal will tell you a policy is enforced while the device quietly does something else. **The only reliable test is on a real device, on the current OS build, every release**: the documentation and the hardware disagree more than Microsoft likes to admit. If you live anywhere in this handbook, live here.
|
||||||
|
- **Continuous Access Evaluation coverage** is expanding but not universal — which apps and flows honour near-real-time revocation changes; verify current coverage before you promise it closes the stale-token window.
|
||||||
|
- **Windows LAPS, Endpoint Privilege Management, Autopatch, Smart App Control / WDAC** capabilities and management surfaces all evolve; confirm current state and licensing before recommending.
|
||||||
|
- **Cloud-native vs hybrid-join guidance and the GPO→Settings-Catalog migration tooling** keep shifting toward cloud-native; check what's actually supported for the client's app estate before promising the coupling can be cut.
|
||||||
|
|
||||||
|
If a client's safety hinges on a specific enforcement behaviour, **test it on the device and, if needed, cite the current Microsoft doc** — and when the device behaviour contradicts the doc, believe the device. Confident-but-wrong about an endpoint control is how data walks out a seam everyone swore was closed.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consolidated judgement prompts
|
||||||
|
|
||||||
|
- If this device is compromised right now, what does the attacker get, how fast do we know, and how fast is it gone? Is the device a shrug or a crisis?
|
||||||
|
- Do we know our *real* device population — derived from what's authenticating — or are we trusting a CMDB that's more wish than reality? How big is the gap between managed, known-unmanaged, shadow, and dark? What dark access bypasses CA entirely?
|
||||||
|
- Is "compliant" being treated as a guarantee or as a signal that can be stale, spoofed, or shallow? What happens when it's wrong — in *both* directions?
|
||||||
|
- Is the boundary the data (MAM/CA) or the device? Have we tested the data wall at every seam, on every platform, on the current OS build — or just toggled it?
|
||||||
|
- Are devices hybrid-joined out of genuine need, or out of habit? What would it take to go cloud-native and cut the Book II coupling?
|
||||||
|
- Can we patch the fleet fast — and can we *halt* a bad push before it reaches everyone? Do we have rings and a real canary, or hope?
|
||||||
|
- Is the PRT TPM-bound? Is enrollment trust hardened, or can a hostile device enrol as a friend?
|
||||||
|
- Does standing local admin still exist? Does legacy auth still bypass CA?
|
||||||
|
- For every CA policy that matters: has it been proven to enforce by a *real sign-in* against pre-written expected results — or are we trusting the displayed config of a policy that might be a ghost?
|
||||||
|
- Has anyone timed a wipe-and-reprovision, tested break-glass against the device gate, or watched the device-risk signal actually reach a CA decision?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Book IV of the Antifragile Handbook. Stop defending the device; assume it's already lost and build the boundary that survives it. Trust the device behaviour over the portal toggle, every time. Move fast and fix things.*
|
||||||
@@ -0,0 +1,140 @@
|
|||||||
|
# The Antifragile Handbook for M365 & Active Directory
|
||||||
|
|
||||||
|
## Book V — Data & Collaboration (Exchange, SharePoint, Teams, OneDrive)
|
||||||
|
|
||||||
|
> *Data is liquid. It leaves where you put it — copied, shared, forwarded, synced, linked. The question is never "is it locked down" but "where can it flow, who can reshare it, and can you see and reverse the flow?"*
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The governing question
|
||||||
|
|
||||||
|
Books II–IV protected the *containers*: identity, privilege, devices. This book is about the *contents*, and contents obey a different physics. You can perfectly secure a container and still lose the data, because data doesn't stay put — it's duplicated into an email, dropped in a Team, synced to a laptop, handed to a guest who reshares it to someone you've never heard of. Perimeter thinking dies here.
|
||||||
|
|
||||||
|
> **Every share is a copy of your blast radius handed to a party you don't control. Can you see where it went, and can you pull it back?**
|
||||||
|
|
||||||
|
For most estates the honest answers are "no" and "no": nobody can enumerate the external shares, nobody reviews the guests, and a file shared to "Anyone with the link" three years ago is still reachable by anyone who ever held that link.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Fragility inventory — how data leaks
|
||||||
|
|
||||||
|
### "Anyone" links: bearer tokens for your data
|
||||||
|
|
||||||
|
Anonymous "Anyone with the link" sharing in SharePoint/OneDrive is the single largest data-exposure fragility in M365. A link is a **bearer token** — whoever holds it has access, no identity, no MFA, no device check, often no expiry, and it's forwardable. Its blast radius is everyone the link ever reaches, forever, including the open web if it leaks into an email thread or a crawler. Conditional Access, compliant devices, all of Books II–IV — none of it applies to a bearer link. It's a hole punched clean through every wall you built.
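Enumerating the live bearer links is the first step toward reversing them. The sketch below assumes a sharing report has already been exported and parsed into dictionaries; the field names (`scope`, `created`, `expires`, `url`) are hypothetical and will differ from whatever the real export uses.

```python
from datetime import date


def live_anyone_links(links, today):
    """Return anonymous links that are still usable today, oldest first."""
    live = [
        link for link in links
        if link["scope"] == "anonymous"
        and (link["expires"] is None or link["expires"] >= today)
    ]
    return sorted(live, key=lambda link: link["created"])


# Hypothetical rows from a parsed sharing report.
links = [
    {"url": "https://example.invalid/a", "scope": "anonymous",
     "created": date(2023, 3, 1), "expires": None},
    {"url": "https://example.invalid/b", "scope": "organization",
     "created": date(2024, 5, 1), "expires": None},
    {"url": "https://example.invalid/c", "scope": "anonymous",
     "created": date(2025, 1, 10), "expires": date(2025, 2, 10)},
]
for link in live_anyone_links(links, today=date(2026, 1, 1)):
    print(link["url"], link["created"], "never expires" if link["expires"] is None else link["expires"])
```

Sorting oldest first puts the links that have had the longest time to spread at the top of the review queue.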
|
||||||
|
|
||||||
|
### Reshare, and the chain you can't see
|
||||||
|
|
||||||
|
Once data is shared — especially externally — the recipient can usually reshare, download, and copy it. You've handed your blast radius to an org (or a personal account) whose security posture you don't control and can't observe. Guests reshare to other guests. The chain of custody becomes invisible after the first hop. And the controls that govern this in Teams collaboration are **split across several layers** — Teams policy, SharePoint org- and site-level sharing, OneDrive, tenant sharing settings, and B2B/cross-tenant access — that interact in non-obvious ways and don't always agree. (More in §honest uncertainty; this is a place where the policy matrix and the observed behaviour routinely diverge.)
|
||||||
|
|
||||||
|
### Guest sprawl: standing blast radius at the data layer
|
||||||
|
|
||||||
|
Guests accumulate and nobody prunes them. The guest invited for one project in 2022 still has a foothold. Each is an external identity governed by *their* security, not yours — the data-layer cousin of standing privilege (Book III) and shadow devices (Book IV). Unreviewed guest access is a slowly metastasising external attack surface, and most tenants cannot even produce the list of who has it and to what.
|
||||||
|
|
||||||
|
### Email: the oldest, most Lindy exfil channel
|
||||||
|
|
||||||
|
Auto-forwarding rules are the classic business-email-compromise move — a quiet hidden rule that copies all mail to an external address, persistent and invisible. Add attachment-save paths that escape policy, and mail remains the most reliable way data walks out the door. External auto-forward should be off by default, and its presence should scream.
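A rough sketch of what "its presence should scream" could look like in practice. It assumes inbox rules have been exported as dicts shaped like Microsoft Graph `messageRule` resources, where `actions` may carry `forwardTo`, `forwardAsAttachmentTo`, and `redirectTo` recipient lists. The function name and the `internal_domains` parameter are hypothetical.

```python
def external_forward_rules(rules, internal_domains):
    """Return (rule name, address) pairs for rules that send mail outside.

    `rules` is assumed to be a list of dicts shaped like Graph `messageRule`
    resources; `internal_domains` is the set of mail domains you own.
    Disabled rules are reported too, since their existence is itself a signal.
    """
    internal = {d.lower() for d in internal_domains}
    hits = []
    for rule in rules:
        actions = rule.get("actions") or {}
        for key in ("forwardTo", "forwardAsAttachmentTo", "redirectTo"):
            for recipient in actions.get(key) or []:
                address = (recipient.get("emailAddress") or {}).get("address", "")
                domain = address.rpartition("@")[2].lower()
                if domain and domain not in internal:
                    hits.append((rule.get("displayName"), address))
    return hits
```

This only covers inbox rules. Mailbox-level forwarding settings and transport rules are separate checks that this sketch does not attempt.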
|
||||||
|
|
||||||
|
### The hybrid Exchange anchor (Book II at the data layer)
|
||||||
|
|
||||||
|
An on-prem Exchange server is a Tier-0-adjacent liability — historically one of the most catastrophic on-prem attack surfaces, where mailbox/management permissions can escalate toward AD. Hybrid Exchange drags that liability into the estate, and subtle functionality dependencies keep the last server alive long past its welcome. The via-negativa prize is decommissioning on-prem Exchange entirely (§2) — verify the current management/recipient tooling first.
|
||||||
|
|
||||||
|
### Internal oversharing
|
||||||
|
|
||||||
|
External isn't the only blast radius. "Everyone," "All company," and "Everyone except external users" permissions on a site holding HR, finance, or M&A data mean one compromised *internal* account reaches it all. Default-open SharePoint sites and self-service site creation produce internal data sprawl that no one maps.
|
||||||
|
|
||||||
|
### Collaboration sprawl by design
|
||||||
|
|
||||||
|
Every Team spins up a SharePoint site, an M365 group, a mailbox, and more — each with its own sharing and guest settings, each a potential leak. Self-service creation means ungoverned proliferation of data containers, and collaboration tools carry subtle data-visibility behaviours (who sees what history, what a late joiner can read) that surprise even experts. Sprawl nobody inventories is fragility nobody can see.
|
||||||
|
|
||||||
|
### Illicit OAuth consent: data exfil through a "legitimate" app
|
||||||
|
|
||||||
|
A user clicks OK on an app requesting `Mail.Read` or `Files.Read.All`, and now a third party reads tenant data through a sanctioned-looking grant. This is the data-layer face of Book III's app-registration dark matter — exfil that needs no malware and trips no device control.
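As an illustration, delegated grants can be screened for the scopes named above. The sketch assumes records shaped like Microsoft Graph `oAuth2PermissionGrant` resources (`clientId`, `consentType`, and a space-separated `scope` string). The `HIGH_RISK_SCOPES` set is an illustrative assumption, not an authoritative list; tune it per engagement.

```python
# Illustrative only: extend with whatever the engagement treats as sensitive.
HIGH_RISK_SCOPES = {"Mail.Read", "Mail.ReadWrite", "Files.Read.All", "Files.ReadWrite.All"}

def risky_grants(grants, high_risk=HIGH_RISK_SCOPES):
    """Return delegated grants that include at least one high-risk scope."""
    findings = []
    for grant in grants:
        scopes = set((grant.get("scope") or "").split())
        hit = sorted(scopes & set(high_risk))
        if hit:
            findings.append({
                "clientId": grant.get("clientId"),
                # "AllPrincipals" means consent was granted for every user.
                "tenant_wide": grant.get("consentType") == "AllPrincipals",
                "scopes": hit,
            })
    return findings
```

Application permissions (app role assignments) are granted through a different object and would need their own sweep.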
|
||||||
|
|
||||||
|
### Retention as hoarded blast radius
|
||||||
|
|
||||||
|
Keeping everything forever makes every breach maximal: the attacker gets fifteen years of data instead of one. Over-retention is hoarding fragility — every byte you keep is a byte that can be stolen. (Its opposite, no recoverable copy at all, is Book VI's problem. The art is disposing of what you don't need while protecting what you do.)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Via negativa — what to remove
|
||||||
|
|
||||||
|
1. **Kill anonymous "Anyone" links.** Default external sharing to authenticated, time-limited, least-permission (view, not edit). Remove the bearer token from your data entirely where you can.
|
||||||
|
2. **Decommission on-prem Exchange.** Remove the Tier-0-adjacent liability; get off hybrid Exchange where the dependency can actually be cut (verify current tooling — §honest uncertainty).
|
||||||
|
3. **Block external auto-forwarding by default.** Delete the quietest exfil channel there is.
|
||||||
|
4. **Prune guests ruthlessly.** Access reviews, expiration, entitlement management. Stale external access gets removed, and new guest access expires by default. Treat guest sprawl like standing privilege: minimise and time-box it.
|
||||||
|
5. **Minimise retention.** Dispose of stale data on a schedule. Shrink the prize so every breach is smaller. Data you no longer hold cannot be exfiltrated.
|
||||||
|
6. **Remove broad internal shares** ("All company"/"Everyone") from anything sensitive. Sensitive data should live in *few, known* places with *narrow* access.
|
||||||
|
7. **Govern self-service creation and clean up the dead.** Curb ungoverned Team/site/app creation; archive and delete orphaned, inactive containers.
|
||||||
|
8. **Restrict user consent and revoke illicit grants.** Users shouldn't be able to hand tenant data to arbitrary apps; admin-consent workflow for anything sensitive, and sweep out the over-permissioned grants already there.
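Item 4 in the list above lends itself to a small sketch: find guests with no recent sign-in. It assumes user records shaped like Microsoft Graph `user` resources with `userType`, `createdDateTime`, and (where the tenant exposes it) `signInActivity.lastSignInDateTime`. The 90-day threshold and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def _parse(timestamp):
    """Parse an ISO 8601 timestamp with a trailing Z, or return None."""
    if not timestamp:
        return None
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))

def stale_guests(users, now, max_idle_days=90):
    """Return UPNs of guests idle for longer than `max_idle_days`.

    Falls back to the creation date when no sign-in has been recorded,
    so a never-used invitation ages out the same way.
    """
    cutoff = now - timedelta(days=max_idle_days)
    stale = []
    for user in users:
        if user.get("userType") != "Guest":
            continue
        activity = user.get("signInActivity") or {}
        reference = _parse(activity.get("lastSignInDateTime")) or _parse(user.get("createdDateTime"))
        if reference is None or reference < cutoff:
            stale.append(user.get("userPrincipalName"))
    return stale
```

The output is a candidate list for review, not a deletion list: sign-in data may be missing or incomplete in some tenants, so confirm before removing access.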
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. The barbell — find the crown jewels, free the rest
|
||||||
|
|
||||||
|
**Name the crown jewels.** Which handful of data sets — the IP, the regulated data, the executive and M&A comms, the source of the company's value — would, if leaked, actually end the business? Most organisations cannot name them, and *that inability is finding #1.* You cannot protect asymmetrically until you know what the asymmetry is for.
|
||||||
|
|
||||||
|
**Paranoid protection for the crown jewels:**
|
||||||
|
|
||||||
|
- **Sensitivity labels with encryption that travels with the file.** This is the convex control of the data world (Book I, principle 7): one label protects the file *everywhere it goes*, forever — even after it leaves the tenant, lands on an unmanaged device, or is forwarded to a stranger. The protection is bound to the data, not the container. That's the only thing that survives data's liquidity.
|
||||||
|
- **Restricted sites, no external sharing, tight access with recurring reviews.**
|
||||||
|
- **Conditional Access app control / session controls** — browser-only, block-download for sensitive data on unmanaged devices (the Book IV boundary applied to content).
|
||||||
|
- **Heightened monitoring** on crown-jewel access (feeds Book VI).
|
||||||
|
|
||||||
|
**Free everything else.** Most collaboration data is low value and should flow *fast* — velocity is a feature (Book I creed). Don't lock the lunch-menu SharePoint with M&A-vault rigour. Spreading DLP and restriction evenly across all data is the concave failure: enormous maintenance, false positives that train users to click through, and the real exfil lost in the noise. **DLP is a scalpel for known high-value patterns (card numbers, national IDs, the labelled crown jewels), not a dragnet over everything.**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Optionality & recovery — escape hatches, tested
|
||||||
|
|
||||||
|
- **The label *is* the escape hatch.** Because encryption travels with the file, a leaked crown-jewel document is still encrypted wherever it lands — you pre-paid for the data to survive being stolen. That is optionality bound into the byte.
|
||||||
|
- **Fast share revocation.** Can you, in 30 minutes, enumerate and *kill* every external share and anonymous link? If you can't produce the list, you can't pull it back — build the report and the revocation muscle before you need them.
|
||||||
|
- **Audit and content forensics — switched on and retained.** "Who accessed and downloaded what" is your post-incident truth, but only if audit logging is actually enabled and retained long enough to matter. Verify it's on; don't assume (§honest uncertainty).
|
||||||
|
- **Guest access reviews as recurring pruning** — the recovery loop for sprawl.
|
||||||
|
- **Immutable/held copies of crown-jewel data** — the bridge to Book VI backup.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Stressor — break it on purpose
|
||||||
|
|
||||||
|
- **Exfiltrate a labelled crown-jewel file yourself.** Email it externally, share it anonymously, download it through a Conditional Access app control session, open it on an unmanaged device. Does the label encryption hold? Does DLP fire? Does anything alert? You are testing the *behaviour*, not the policy screen (Book I corollary).
|
||||||
|
- **Plant a canary document** seeded with a detectable pattern and try to move it out every way you can. What catches it? What doesn't?
|
||||||
|
- **Enumerate the external surface.** Produce the full list of "Anyone" links, external guests, and externally-shared files. The exercise of *trying* usually reveals you can't — which is the finding.
|
||||||
|
- **Simulate the BEC forward rule.** Set a test external auto-forward. Is it blocked? Alerted? Silent? Silence is the BEC attacker's favourite answer.
|
||||||
|
- **Test the reshare chain.** Share to a test guest, have them reshare onward. Can you see it? Stop it? Pull it back?
|
||||||
|
- **Reconcile declared vs enforced sharing.** The tenant sharing setting says one thing; walk the actual per-site and per-link reality. They diverge — the ghost-policy cousin from Book IV, at the data layer.
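The reconciliation in the last bullet can be sketched as a comparison between the posture the client declares and what each site shows. The level names follow the SharePoint Online `SharingCapability` values; the record keys (`url`, `sharing_capability`, `anonymous_links`) and the function name are hypothetical and stand in for whatever your per-site walk produces.

```python
# SharingCapability values, ordered from least to most permissive.
LEVELS = [
    "Disabled",
    "ExistingExternalUserSharingOnly",
    "ExternalUserSharingOnly",
    "ExternalUserAndGuestSharing",
]
RANK = {name: i for i, name in enumerate(LEVELS)}

def sharing_divergences(declared, sites):
    """Return (site url, reason) pairs where reality exceeds the declared posture.

    `declared` is the sharing level the client says applies; `sites` is a list
    of dicts with the observed setting and a count of live anonymous links.
    """
    findings = []
    for site in sites:
        if RANK[site["sharing_capability"]] > RANK[declared]:
            findings.append((site["url"], "site setting exceeds declared posture"))
        anonymous_allowed = RANK[declared] == RANK["ExternalUserAndGuestSharing"]
        if site.get("anonymous_links", 0) > 0 and not anonymous_allowed:
            findings.append((site["url"], "live anonymous links despite declared posture"))
    return findings
```

An empty result is not proof of alignment; it only means the observations you collected did not contradict the declaration.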
|
||||||
|
|
||||||
|
Per Book I principle 6: every leak path found becomes a **structural** change — a killed link type, a pruned guest population, a label applied, a coupling removed — not a note in a spreadsheet.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Honest uncertainty (the sharing matrix moves — test, don't trust it)
|
||||||
|
|
||||||
|
Stable and Lindy (teach with confidence): data is liquid; bearer links are exposure; protection must travel with the data; minimise the prize; DLP is a scalpel not a dragnet; guests are standing blast radius. None of that churns.
|
||||||
|
|
||||||
|
What moves, and what you must verify by testing rather than reading:
|
||||||
|
|
||||||
|
- **External sharing enforcement is split across many interacting layers** — Teams policy, SharePoint org/site sharing, OneDrive, tenant settings, B2B/cross-tenant access, and the Premium tiers — and they don't always agree. Enforcement can differ by client and platform, and the documented matrix and the observed behaviour diverge often enough that you should **confirm the real behaviour on a real client, not from the policy screen.** When you find an inconsistency that survives reconfiguration, that's a vendor escalation, not your error.
|
||||||
|
- **On-prem Exchange decommissioning** and the "last server for management" story — the tooling has evolved; verify the current supported path before promising the coupling can be cut.
|
||||||
|
- **Purview / sensitivity labels / auto-labelling / DLP** capabilities churn fast, including the branding. Verify current coverage and licensing.
|
||||||
|
- **Cross-tenant access settings (B2B collaboration and direct connect)** are comparatively new and evolving — verify current behaviour.
|
||||||
|
- **Audit log retention defaults and licensing have changed over time.** Confirm what's actually captured and for how long *before* you rely on it for forensics.
|
||||||
|
|
||||||
|
If a client's safety hinges on a specific sharing behaviour, test it on a live client and cite the current doc — and where the client behaviour contradicts the doc, believe the client.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consolidated judgement prompts
|
||||||
|
|
||||||
|
- Can we name the crown jewels? If not, that's finding #1 — everything else is guesswork until we can.
|
||||||
|
- Can we enumerate every external share, anonymous link, and guest *right now*? Can we revoke them fast?
|
||||||
|
- Does protection travel *with* the crown-jewel data (labels/encryption), or only with the container it currently sits in?
|
||||||
|
- Where can this data flow — reshare, forward, sync, download, OAuth app — and is any of that flow visible or reversible?
|
||||||
|
- Are guests treated as standing blast radius (minimised, time-boxed, reviewed) or left to accumulate?
|
||||||
|
- Is DLP a scalpel on known high-value patterns, or a dragnet generating noise everyone clicks through?
|
||||||
|
- Is on-prem Exchange still anchoring the estate? What would it take to cut it?
|
||||||
|
- Is audit logging actually on and retained long enough to reconstruct an incident?
|
||||||
|
- Does the tenant's *declared* sharing posture match what the sites and links *actually* enforce?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Book V of the Antifragile Handbook. You cannot wall in a liquid. Name the few things that would end the company, bind protection to the data itself, shrink the prize, and make every flow visible and reversible. Move fast and fix things.*
|
||||||
@@ -0,0 +1,154 @@
|
|||||||
|
# The Antifragile Handbook for M365 & Active Directory
|
||||||
|
|
||||||
|
## Book VI — Recovery & Detection-as-Feedback
|
||||||
|
|
||||||
|
> *Robust means you survive the shock unchanged. Antifragile means you come back stronger. The shock is coming either way — the only choice is what you do with it.*
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The governing question
|
||||||
|
|
||||||
|
This is the capstone, because it's the book that decides whether everything before it was merely *robust* or genuinely *anti*fragile. The first five books harden the estate; this one builds the machine that turns every shock into improvement. Ask:
|
||||||
|
|
||||||
|
> **When — not if — this fails, do you come back weaker, the same, or stronger?**
|
||||||
|
|
||||||
|
A fragile estate comes back weaker (if at all). A robust estate comes back the same and waits for the next identical hit. An antifragile estate comes back *different and harder to hit the same way twice* — because it ran the shock through a feedback loop and changed its own structure. That loop is the entire subject of this book.
|
||||||
|
|
||||||
|
The reframe that powers it: most organisations treat detection and recovery as the sad afterthought — the thing they hope never to need. Invert it. **Incidents, alerts, failed drills, and near-misses are the most valuable intelligence the system ever produces** — honest, real-world data about where the fragility actually is, bought in the cheapest currency available *if you harvest it.* The org that buries incidents stays fragile. The org that treats them as fuel becomes antifragile. Your job is to build the machine that converts disorder into structural strength.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Fragility inventory — where recovery and detection rot
|
||||||
|
|
||||||
|
### Backups that have never been restored
|
||||||
|
|
||||||
|
The biggest recovery lie in the industry: *"we have backups."* Having a backup is not the same as being able to recover, and an untested backup is Schrödinger's recovery — simultaneously fine and worthless until someone actually opens the box. Two M365-specific traps make this worse:
|
||||||
|
|
||||||
|
- **"Microsoft backs it up for us."** Microsoft provides geo-redundancy, recycle bins, and limited native retention, *not* point-in-time backup against your own ransomware, malicious deletion, or retention expiry. Under the shared-responsibility model, **your data is your responsibility.** Most tenants have no real, independent, point-in-time M365 backup, and discover this during the incident.
|
||||||
|
- **Attackers target backups first.** Ransomware operators delete or encrypt the backups *before* they hit production, because they know it's your only way out. A backup reachable from the compromised estate is not a backup; it's another victim.
|
||||||
|
|
||||||
|
### AD forest recovery: the nightmare nobody rehearses
|
||||||
|
|
||||||
|
Recovering a compromised or destroyed AD forest is one of the hardest operations in all of IT — clean OS installs, authoritative restore of one DC per domain, metadata cleanup, double krbtgt reset, trust resets, the whole brutal sequence. Almost no one has practised it. So when ransomware takes AD, "restore from backup" is a multi-day, error-prone, improvised ordeal performed for the first time under maximum pressure. Entra recovery is less apocalyptic but has its own teeth: the hard-delete window for objects, and the fact that tenant *configuration* (CA policies, Intune, roles) has no native "undo" unless you captured it as code.
|
||||||
|
|
||||||
|
### Recovery that depends on what the incident destroyed
|
||||||
|
|
||||||
|
The fatal circular dependency: backups authenticated by the AD that's down. The recovery runbook stored in the SharePoint that's encrypted. The break-glass that needs the MFA service that's offline. The recovery admin whose credentials the attacker already has. **A recovery path that depends on the thing it's recovering is not a recovery path** — it's the clean-source principle (Book III) applied to survival.
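One way to surface these circular dependencies is to write the recovery path down as a graph and walk it. The sketch below assumes you have hand-built a mapping of "asset depends on" edges during the engagement; all names in it are hypothetical.

```python
def circular_recovery_dependencies(depends_on, recovery_assets, at_risk):
    """Map each recovery asset to the at-risk systems it transitively needs.

    `depends_on` maps an asset to the things it directly requires;
    `at_risk` is the set of systems the incident is assumed to take down.
    """
    at_risk = set(at_risk)
    findings = {}
    for asset in recovery_assets:
        seen, stack = set(), [asset]
        while stack:  # iterative depth-first walk of the dependency graph
            node = stack.pop()
            for dep in depends_on.get(node, ()):
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        hit = sorted(seen & at_risk)
        if hit:
            findings[asset] = hit
    return findings
```

For example, a runbook stored in SharePoint that authenticates through Entra would be reported against both, while a break-glass account backed by an offline hardware key would not appear at all.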
|
||||||
|
|
||||||
|
### Detection that fires into a void
|
||||||
|
|
||||||
|
Logs not collected. Audit logging never enabled or silently aged out. A SIEM full of alerts nobody triages. And the specific blind spots the earlier books planted: the unmonitored DCSync (Book II), the unwatched break-glass use (Book III), the device-risk signal that dies on a dashboard (Book IV), the BEC forward rule nobody sees (Book V). Detection that nobody acts on is theatre with a subscription fee.
|
||||||
|
|
||||||
|
### Alert fatigue: the boy who cried wolf, automated
|
||||||
|
|
||||||
|
Too many low-fidelity alerts is itself a fragility — the real signal drowns in noise, and the analyst who's dismissed a thousand false positives dismisses the one that mattered. More alerts is not more security; past a point it's *less.*
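A sketch of how the noise could be measured before it is cut. It assumes triage outcomes are recorded per alert with a `rule` name and a `verdict`; both keys, the verdict value, and the default thresholds are assumptions for illustration.

```python
from collections import Counter

def noisy_rules(triaged_alerts, min_alerts=20, max_true_positive_rate=0.05):
    """Return rule names that fire often but are almost never real.

    A rule qualifies when it produced at least `min_alerts` alerts and its
    share of true positives is at or below `max_true_positive_rate`.
    """
    totals, true_positives = Counter(), Counter()
    for alert in triaged_alerts:
        totals[alert["rule"]] += 1
        if alert["verdict"] == "true_positive":
            true_positives[alert["rule"]] += 1
    return sorted(
        rule for rule, count in totals.items()
        if count >= min_alerts and true_positives[rule] / count <= max_true_positive_rate
    )
```

Treat the result as a review queue rather than a delete list: a rare but critical control-plane alert should be tuned, not removed, even if its hit rate is low.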
|
||||||
|
|
||||||
|
### MTTR that exists only on paper
|
||||||
|
|
||||||
|
RTO/RPO numbers in a policy document, never once validated by an actual restore, are fiction. (Book I anti-benchmark: MTTR is measured by *doing it*, not by declaring it.)
|
||||||
|
|
||||||
|
### Incidents that close without changing anything
|
||||||
|
|
||||||
|
The post-incident review that concludes "remind users to be more careful" has wasted the disorder entirely and guaranteed the recurrence. And a blame culture destroys the feedback loop at the source — if surfacing an incident gets you punished, incidents get buried, and the system goes blind.
|
||||||
|
|
||||||
|
### No known-good to return to
|
||||||
|
|
||||||
|
If your tenant configuration lives only as click-ops in a portal, you have no golden image of "correct," so you can neither rebuild it fast nor detect drift *from* it — and you can't catch a ghost policy (Book I/IV) because you have nothing to diff against. No config-as-code means no known-good.
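What "something to diff against" means can be shown with a few lines. The sketch assumes both the known-good baseline and the live export are available as nested JSON-like dicts (for example, a Conditional Access policy exported as JSON); the function name and path format are hypothetical.

```python
def diff_config(baseline, live, path=""):
    """Return (path, description) pairs where `live` differs from `baseline`."""
    drift = []
    if isinstance(baseline, dict) and isinstance(live, dict):
        for key in sorted(set(baseline) | set(live)):
            child = f"{path}/{key}"
            if key not in live:
                drift.append((child, "missing in live"))
            elif key not in baseline:
                drift.append((child, "not in baseline"))
            else:
                drift.extend(diff_config(baseline[key], live[key], child))
    elif baseline != live:
        # Leaves (and lists) are compared as whole values.
        drift.append((path or "/", f"{baseline!r} -> {live!r}"))
    return drift
```

A policy whose `state` has quietly moved from `enabled` to a report-only value, or whose exclusion list has grown, shows up as a single line of drift instead of staying invisible in the portal.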
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Via negativa — what to remove
|
||||||
|
|
||||||
|
1. **Delete the false comfort that Microsoft backs you up.** Removing the dangerous belief comes before adding the real backup.
|
||||||
|
2. **Sever recovery's dependencies on the estate it recovers.** Recovery credentials, runbooks, and backups must not depend on prod AD/Entra/SharePoint. Decouple, so the lifeboat doesn't sink with the ship.
|
||||||
|
3. **Cut alert noise.** Ruthlessly remove low-fidelity alerts so the high-fidelity ones become visible. Via negativa applied to detection: fewer, louder, truer.
|
||||||
|
4. **Remove blame from the post-incident process.** Blameless on people so people surface incidents — then ruthless on structure so the incident actually changes something. Removing the incentive to hide *protects the feedback loop itself.*
|
||||||
|
5. **Remove click-ops from critical configuration.** Move control-plane config (CA, Intune, roles) to code, so a known-good exists to rebuild from and diff against.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. The barbell — paranoid recovery for the irreplaceable, best-effort for the rest
|
||||||
|
|
||||||
|
**The irreplaceable few** — the identity control plane (Books II/III) and the crown-jewel data (Book V) — get **real, tested, immutable, offline/isolated backup** and **rehearsed** recovery. AD forest recovery is practised, not theorised. Recovery objectives for these are measured in a drill, in minutes or hours, not asserted in a policy.
|
||||||
|
|
||||||
|
**The recovery capability is itself a crown jewel.** Backups are a top attacker target, so protect them like break-glass: immutable, offline or in a separate trust domain, unreachable even from full domain dominance. A backup the attacker can reach is not a control.
|
||||||
|
|
||||||
|
**Everything else is best-effort and tiered.** Don't gold-plate recovery for the lunch-menu SharePoint. Tier recovery objectives to value — crown jewels get immutable and fast; bulk collaboration gets good-enough. And concentrate **high-fidelity detection** on the control-plane and crown-jewel signals (the screaming break-glass, the anomalous DCSync, the impossible-travel admin, the crown-jewel mass-download) rather than spreading shallow alerting evenly across everything.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Optionality & recovery — the heart of the book
|
||||||
|
|
||||||
|
- **Tested restores on a schedule.** The only proof of recovery is a restore that happened. Make the restore drill routine, time it, and verify integrity — that time *is* your real MTTR.
|
||||||
|
- **Immutable + offline/isolated backups** — the escape hatch that survives the attacker reaching production. Ransomware-resilient by design, not by hope.
|
||||||
|
- **Rehearsed AD forest and Entra recovery runbooks, stored independently** — on paper or offline, reachable when the estate is dark, not in the SharePoint that's encrypted.
|
||||||
|
- **Configuration-as-code (IaC) for the control plane** — instant rebuild *and* a known-good baseline to detect drift and ghost configuration against. This single practice serves recovery, drift detection, and the Book I corollary at once.
|
||||||
|
- **A clean-room / isolated recovery environment** — somewhere to rebuild that the attacker isn't already inside.
|
||||||
|
- **The fail-over-vs-clean-in-place decision pre-made.** When do we rebuild rather than try to clean a compromised estate? Decide the criteria *before* the incident; it's the Book II "sever the sync" decision generalised to the whole estate.
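The first bullet above (timed, integrity-verified restores) can be sketched for file-based data. This assumes a hash manifest was captured when the backup was taken and that `restore` is whatever callable drives your backup product; both are placeholders, not a real vendor API.

```python
import hashlib
import time
from pathlib import Path

def manifest(root):
    """Return {relative path: SHA-256 hex digest} for every file under `root`."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def run_restore_drill(restore, expected_manifest, target):
    """Time `restore(target)` and verify the result against the manifest."""
    started = time.monotonic()
    restore(target)  # placeholder: invoke the real restore here
    elapsed = time.monotonic() - started
    actual = manifest(target)
    mismatched = sorted(
        name
        for name in set(expected_manifest) | set(actual)
        if expected_manifest.get(name) != actual.get(name)
    )
    return {"seconds": elapsed, "intact": not mismatched, "mismatched": mismatched}
```

The `seconds` value is the measured recovery time for that dataset; record it per drill so the trend is visible.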
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Stressor — the hormesis engine (the climax of the handbook)
|
||||||
|
|
||||||
|
This is where the entire handbook either runs or rusts. Everything else is preparation for the loop; this is the loop turning.
|
||||||
|
|
||||||
|
- **Live restore of a crown-jewel dataset and the control plane.** Not a tabletop — an actual restore, integrity-verified and timed. The number you get is the truth; the number in the policy was always fiction.
|
||||||
|
- **Rehearse AD forest recovery.** The first time you perform the hardest recovery in IT must not be during the real disaster. Run it. Find what's missing. Fix the runbook.
|
||||||
|
- **Inject attacks end-to-end and follow them all the way through.** DCSync, malicious consent, break-glass use, impossible-travel admin, crown-jewel mass-download. Confirm not just that the alert *exists*, but that it's **triaged, and someone acts.** Detection that fires into a void fails this test on purpose, so you can fix it.
|
||||||
|
- **Run a ransomware game-day** that assumes Tier 0 is owned and backups are the first target. Watch your decoupling hold or fail.
|
||||||
|
- **Purple-team as routine, not annually.** Standing, escalating, blast-radius-controlled stress: hormesis, not a once-a-year audit ritual.
|
||||||
|
- **Measure the loop itself.** Track *time from incident to structural change.* If drills and incidents close without a removed right, a severed coupling, or a new firebreak, the loop is broken and you are merely robust.
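The last bullet's metric can be computed from a simple incident log. The sketch assumes each incident or drill is recorded with a `detected_on` date and, once something structural has changed, a `structural_change_on` date; the field names are hypothetical.

```python
from datetime import date
from statistics import median

def loop_health(incidents):
    """Summarise how often, and how fast, incidents lead to structural change."""
    changed = [i for i in incidents if i.get("structural_change_on")]
    days = [(i["structural_change_on"] - i["detected_on"]).days for i in changed]
    return {
        "total": len(incidents),
        "without_structural_change": len(incidents) - len(changed),
        "median_days_to_change": median(days) if days else None,
    }
```

A non-zero `without_structural_change` count is the broken-loop signal described above; the median shows whether the loop is getting faster over time.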
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The feedback loop — what makes all six books antifragile
|
||||||
|
|
||||||
|
Name the loop explicitly, because it's the thread that ties the whole handbook together and the thing that converts robustness into antifragility:
|
||||||
|
|
||||||
|
**Detect** (see the stressor) → **Respond** (contain it) → **Recover** (come back) → **Learn structurally** (come back *stronger*) → which feeds back into **Removal and redesign** across every prior book — a fragilizer deleted (Book I via negativa), a coupling severed (Book II), a standing privilege collapsed (Book III), a device boundary tightened (Book IV), a data flow closed (Book V).
|
||||||
|
|
||||||
|
The first three steps are robustness; plenty of organisations reach them and call it security. **The fourth step is the whole game.** A shock that produces no structural change has been wasted, and the system will meet the same shock again, unchanged. A shock that *does* produce structural change has made the estate stronger — which is the literal definition of antifragile, and the only honest justification for everything in this handbook.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Honest uncertainty (verify the moving parts)
|
||||||
|
|
||||||
|
Stable and Lindy (teach with confidence): untested backup is no backup; attackers hit backups first; recovery must not depend on what it recovers; detection without action is theatre; alert fatigue is fragility; every shock must change the structure. None of that churns — these are the oldest truths in operational security.
|
||||||
|
|
||||||
|
What moves, and what you must verify:
|
||||||
|
|
||||||
|
- **M365 native backup/retention specifics and the shared-responsibility boundary** — what Microsoft does and does not cover, recycle-bin and hard-delete windows — evolve. Verify current reality, and **test what you can actually recover** rather than trusting either "Microsoft has us covered" or a vendor pitch.
|
||||||
|
- **Entra recovery and configuration-backup tooling** (deleted-object windows, Graph/IaC options for capturing CA, Intune, and roles as code) evolve — verify current capability.
|
||||||
|
- **AD forest recovery** is Lindy in principle (it is brutal; rehearse it), but automation and tooling evolve — confirm the current supported procedure.
|
||||||
|
- **Detection tooling** (the XDR/SIEM signal catalogue) churns continuously. Verify which detections exist *today* and test them end-to-end; the principle (high-fidelity over noise, tested through to action) is what's permanent.
|
||||||
|
- **Audit log retention and licensing** have changed over time — confirm what's captured and for how long *before* relying on it for forensics.
|
||||||
|
|
||||||
|
If recovery hinges on a current specific, verify it and test it. "We confirmed the restore works and it takes four hours" beats any RTO ever written in a policy.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consolidated judgement prompts
|
||||||
|
|
||||||
|
- When this fails, do we come back weaker, the same, or stronger? What's the mechanism that makes it *stronger*?
|
||||||
|
- When was a backup of the crown jewels and the control plane last *restored* — not taken, restored — and how long did it take?
|
||||||
|
- Are the backups reachable from the estate they protect? (If yes, they're another victim.) Are they immutable and offline?
|
||||||
|
- Has anyone ever rehearsed AD forest recovery? Is the runbook reachable when the estate is dark?
|
||||||
|
- Does any part of the recovery path depend on the thing the incident destroyed — credentials, runbook location, MFA, the recovery admin?
|
||||||
|
- Does detection fire into action, or into a void? Is there so much noise the real signal is lost?
|
||||||
|
- Does control-plane config exist as code (a known-good to rebuild and diff against), or only as click-ops?
|
||||||
|
- For the last three incidents and drills: what *structural* thing changed? If the answer is "a reminder," the loop is broken.
|
||||||
|
- How long from incident to structural change — and is that time getting shorter?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Coda — the whole arc
|
||||||
|
|
||||||
|
Six books, one idea. Book I is the **lens**: subtract before you add, protect the irreplaceable, measure blast radius, buy optionality, stress on purpose, and make every shock change the structure — verifying by observation, never by inspection. Books II–V apply that lens to the **containers and contents**: the identity bridge made a firebreak, privilege collapsed in reach and time, the device assumed hostile and the boundary moved to the data, and the data itself made to carry its own protection as it flows. Book VI is the **loop** that makes it all antifragile rather than merely robust — the machine that feeds every incident back into removal and redesign.
|
||||||
|
|
||||||
|
None of this is a checklist, and if a consultant trained on it ever reaches for "because the benchmark says so," they've missed the point. The point is judgement: draw the wall, find the fragility, fix what matters, and let every stress make the estate stronger than it was.
|
||||||
|
|
||||||
|
Move fast and fix things.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Book VI of the Antifragile Handbook, and the close of the arc.*
|
||||||
@@ -0,0 +1,91 @@
|
|||||||
|
# The Antifragile Handbook for M365 & Active Directory
|
||||||
|
|
||||||
|
Most M365 estates are fragile. Not because nobody has run the benchmarks — they have, and the scorecards look fine. They're fragile because a compliance certificate and a hardened estate are different things, and the industry has spent years teaching people to chase the first while missing the second.
|
||||||
|
|
||||||
|
This handbook is the attempt to close that gap. It is written for consultants who want to walk into a tenant they've never seen and find the thing that will actually kill the client — not the thing that fails the CIS audit. It is opinionated, sequenced, and deliberately uncomfortable. If you want a checklist, the CIS Benchmark is free. If you want to understand *why* the checklist exists, what breaks when the controls fail, and how to build an estate that gets stronger under attack rather than just surviving it, start here.
|
||||||
|
|
||||||
|
The governing question in every book is the same:
|
||||||
|
|
||||||
|
> **When — not if — this fails, does the estate come back weaker, the same, or stronger?**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The books
|
||||||
|
|
||||||
|
### [Book I — Principles & Judgement](00-principles-and-judgement.md)
|
||||||
|
|
||||||
|
*The craft before the controls.*
|
||||||
|
|
||||||
|
Everything else in this series rests on the discrimination developed here: the ability to distinguish signal from noise, to know that disabling legacy auth outranks renaming forty GPOs, and to understand why compliance is a floor and a by-product rather than the target. This book also introduces the "move fast and fix things" operating principle — a deliberate inversion of the Silicon Valley creed, because the things are already broken and speed means refusing to let a thirty-page risk-acceptance process protect a fragility a teenager with a phishing kit will remove for free.
|
||||||
|
|
||||||
|
Read this first, even if you're experienced. Especially if you're experienced.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### [Book II — Hybrid Identity](01-hybrid-identity.md)
|
||||||
|
|
||||||
|
*Draw the wall between on-prem and cloud. In most estates there isn't one — there's a hallway with the door propped open.*
|
||||||
|
|
||||||
|
In a hybrid estate, on-prem AD and Entra ID are not two systems with a guarded border. They're one organism wearing two badges, joined by a bridge that most organisations cannot draw, do not monitor, and have never tested severing. This book maps the bridge — the sync engine, the connector accounts, the authentication method, the writeback paths — and explains why a single compromise of the sync server gives an attacker DCSync on-prem *and* cloud object manipulation at the same time. Then it shows how to build the actual wall.
|
||||||
|
|
||||||
|
If you only ever fix one domain, fix this one. Everything else assumes identity holds.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### [Book III — Privileged Access](02-privileged-access.md)
|
||||||
|
|
||||||
|
*Privilege is blast radius with a time axis. Standing privilege reaches everything, forever. The whole job is to collapse both: less reach, less time.*
|
||||||
|
|
||||||
|
The most dangerous accounts in any estate are the ones nobody is watching — the permanent Domain Admins that have always existed, the service accounts with Kerberoastable SPNs and passwords from 2016, the app registrations with `RoleManagement.ReadWrite.Directory` and admin consent that nobody remembers granting. This book names them, shows how they become privilege-escalation paths, and builds the case for Just-in-Time access, Entra PIM, and a rigorous service-principal audit as the core of any engagement.
|
||||||
|
|
||||||
|
The single most important number in this book: how many identities hold standing privilege right now?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### [Book IV — Devices & Endpoint (Intune)](03-devices-and-intune.md)
|
||||||
|
|
||||||
|
*The device will be compromised. Compliant is not the same as secure, and the portal toggle is not the same as the device's behaviour.*
|
||||||
|
|
||||||
|
Endpoint programmes are usually built on a wish: make the device trusted. That wish is unwinnable. This book flips the question — assume every device is already compromised, and ask what still holds — and uses that reframe to expose the gap between a "compliant" device in the portal and a device that is actually behaving as expected. It covers the hidden fleet (managed, unmanaged, shadow, dark), the Conditional Access misconfiguration patterns that most estates share, and how to build posture that survives an untrusted device rather than depending on the device being clean.
|
||||||
|
|
||||||
|
The spine of the book: compliance is a signal, not a checkbox.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### [Book V — Data & Collaboration](04-data-and-collaboration.md)
|
||||||
|
|
||||||
|
*Data is liquid. The question is never "is it locked down" but "where can it flow, who can reshare it, and can you see and reverse the flow?"*
|
||||||
|
|
||||||
|
Books II–IV protect the containers: identity, privilege, devices. This book is about the contents, and contents obey different physics. An "Anyone with the link" SharePoint share is a bearer token — no identity, no MFA, no device check, often no expiry, forwardable to anyone, reachable by the open web if it leaks. Guest sprawl hands your blast radius to external identities you don't govern. Email is the oldest exfil channel in the industry and almost never properly monitored. This book maps the exposure patterns across Exchange, SharePoint, Teams, and OneDrive, and builds the controls that let you see — and reverse — the data flow.
|
||||||
|
|
||||||
|
For most estates the honest answer to "can you see where it went?" is no. That's the starting point.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### [Book VI — Recovery & Detection-as-Feedback](05-recovery-and-detection.md)
|
||||||
|
|
||||||
|
*Robust means you survive the shock unchanged. Antifragile means you come back stronger. The shock is coming either way — the only choice is what you do with it.*
|
||||||
|
|
||||||
|
The capstone, because it decides whether everything before it was merely robust or genuinely antifragile. Detection and recovery are not the sad afterthought — they're the feedback loop that changes the structure of the estate after every shock. An org that buries incidents stays fragile. An org that treats them as fuel becomes antifragile. This book covers the recovery lies the industry tells itself (untested backups, undocumented break-glass, AD forest recovery nobody has practised), builds the detection architecture, and — most importantly — describes the machine that turns incidents, alerts, and near-misses into structural improvement.
|
||||||
|
|
||||||
|
Read this last. It only makes sense once you've built something worth protecting.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Field Guide (2026 Edition)
|
||||||
|
|
||||||
|
The books are principles; they are deliberately stable. Two field guides apply them in practice:
|
||||||
|
|
||||||
|
**[Field Guide — 2026 Edition](field-guide-2026.md):** Concrete actions and current tooling for foundational engagements. The "do this" companion to the handbook. Review January 2027.
|
||||||
|
|
||||||
|
**[Field Guide — Adversarial Validation](field-guide-adversarial-validation.md):** For clients who have done the foundational work. Tests declared controls against observed behaviour, domain by domain. Closes with a client leave-behind cadence so the admin can self-monitor between engagements. Review January 2027.
|
||||||
|
|
||||||
|
For inspection checklists, see the [assessment templates](../assessment-templates/): the [Engagement Checklist](../assessment-templates/engagement-checklist.md) (foundational), the [Adversarial Validation Checklist](../assessment-templates/adversarial-validation-checklist.md) (phase 2), and the [Self-Service Cadence](../assessment-templates/self-service-cadence.md) (client leave-behind).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## How to use this series
|
||||||
|
|
||||||
|
The books are sequenced deliberately — each one assumes the previous — but an experienced practitioner can use them as field references. The fragility inventories at the start of each book are designed to be usable on day one of an engagement, before you've had time to read everything. The "governing question" at the start of each section is designed to be asked out loud, to a client, in a room where someone will have to answer it.
|
||||||
|
|
||||||
|
The goal throughout is not compliance. Compliance is a by-product. The goal is an estate that gets harder to compromise every time it's tested — and is tested often enough to know.
|
||||||
@@ -0,0 +1,514 @@
|
|||||||
|
# M365 + AD Field Guide — 2026 Edition
|
||||||
|
|
||||||
|
> *The books are principles. This is practice — concrete actions, current tooling, and 2026-specific decisions. It will need updating next year. That is the point.*
|
||||||
|
|
||||||
|
**Last updated:** June 2026
|
||||||
|
**Companion to:** The Antifragile Handbook for M365 & AD (Books I–VI)
|
||||||
|
**Next review:** January 2027
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What this is
|
||||||
|
|
||||||
|
The Antifragile Handbook teaches judgement. This document teaches actions — what to do, in 2026, with the tooling that exists now, in the estates you will actually walk into. Where the handbook says "eliminate AD FS," this document says how and what blockers to expect. Where the handbook says "test the CA policy," this document says what a ghost policy looks like when you find one.
|
||||||
|
|
||||||
|
Read the books first. Use this document on-site.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Notation
|
||||||
|
|
||||||
|
**P0** — attacker already through; fix before leaving this session
|
||||||
|
**P1** — closes in this engagement
|
||||||
|
**P2** — roadmap item, documented
|
||||||
|
**2026 note** — something that has changed or become clearer since the handbook was written
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Hybrid Identity
|
||||||
|
|
||||||
|
### Remove AD FS — this is now a P0 conversation
|
||||||
|
|
||||||
|
In 2026, Microsoft's migration tooling has matured to the point where AD FS is a choice, not an inevitability. Every client still running it should have a migration plan or a written, named reason for not having one.
|
||||||
|
|
||||||
|
**Why it is a P0:** Golden SAML is still an active nation-state technique. The token-signing private key in most tenants has never been rotated, is stored on the AD FS servers, and is not monitored. One foothold on any on-prem system that can reach the AD FS servers ends cloud identity entirely — silently, with validly-signed tokens, no failed logins, nothing for a SIEM to catch.
|
||||||
|
|
||||||
|
**What to do:**
|
||||||
|
- In the Entra admin center, go to Identity > Applications > Enterprise applications > Usage & insights > AD FS application activity (if it appears). This gives you the relying party trust inventory and migration readiness per application. This is your conversation starter.
- Enumerate relying party trusts: `Get-AdfsRelyingPartyTrust | Select-Object Name, Enabled, Identifier`. Each enabled one is a blocker that needs a cloud equivalent or decommission plan.
- Check the token-signing cert: `Get-AdfsCertificate -CertificateType Token-Signing`. Note the `NotBefore` (issue) and `NotAfter` (expiry) dates on the returned certificate; the issue date tells you when it was last rotated. "Has not been rotated since installation" is the expected answer and is itself a finding.
- Staged rollout in Entra lets you migrate users incrementally; you do not have to cut over all at once. Use it.
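
The token-signing check above can be turned into a number you can put in the report. The following is a minimal sketch, not a definitive implementation: `Get-TokenSigningCertAge` is a hypothetical helper name, and it assumes its input looks like `Get-AdfsCertificate` output (objects exposing `Thumbprint`, `IsPrimary`, and a `Certificate` with `NotBefore` / `NotAfter`).

```powershell
# Hypothetical helper: summarise token-signing certificates by age.
# Assumes input shaped like Get-AdfsCertificate output.
function Get-TokenSigningCertAge {
    param(
        [Parameter(Mandatory, ValueFromPipeline)] $AdfsCertificate,
        [datetime] $Now = (Get-Date)
    )
    process {
        $cert = $AdfsCertificate.Certificate
        [pscustomobject]@{
            Thumbprint = $AdfsCertificate.Thumbprint
            IsPrimary  = $AdfsCertificate.IsPrimary
            NotBefore  = $cert.NotBefore
            NotAfter   = $cert.NotAfter
            # Whole days since the certificate was issued
            AgeDays    = [int][math]::Floor(($Now - $cert.NotBefore).TotalDays)
        }
    }
}

# On an AD FS server:
# Get-AdfsCertificate -CertificateType Token-Signing | Get-TokenSigningCertAge
```

An `AgeDays` value close to the age of the AD FS farm itself is the "never rotated" finding described above.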
|
||||||
|
|
||||||
|
**Migration target:** Password Hash Sync (PHS) + Entra-managed MFA via Conditional Access. This removes the on-prem dependency for cloud authentication and kills Golden SAML as a class.
|
||||||
|
|
||||||
|
**2026 note:** The AD FS migration activity report and staged rollout tooling make this significantly more tractable than it was in 2023–2024. Remove the roadmap language and have the P0 conversation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Connect Sync vs Cloud Sync — new deployments
|
||||||
|
|
||||||
|
**2026 recommendation:** For new hybrid sync deployments and organizations without complex topologies (no device writeback, no large object filtering requirements, no multi-forest writeback scenarios), **Entra Cloud Sync** is the preferred deployment. Smaller attack surface than Connect Sync (no SQL Express, no full-blown sync engine, multiple lightweight agents for HA), easier to harden, no single machine that holds DCSync-capable credentials.
|
||||||
|
|
||||||
|
**Connect Sync stays correct for:** Large/complex topologies, specific writeback scenarios (check the current parity matrix at Microsoft Learn before promising Cloud Sync covers a client's requirements — this changes).
|
||||||
|
|
||||||
|
**For existing Connect Sync deployments:** The migration path to Cloud Sync exists. Check current documentation for topology compatibility. Do not promise the migration before confirming the client's scenario is supported.
|
||||||
|
|
||||||
|
**In either case, the sync server is Tier 0.** See the hardening actions below.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Sync server hardening — concrete actions
|
||||||
|
|
||||||
|
The sync server (Connect or Cloud Sync agent host) is typically treated as a utility VM. It holds an identity capable of DCSync. Treat it accordingly.
|
||||||
|
|
||||||
|
**Immediate checks:**
|
||||||
|
- Is the server domain-joined to the production domain? If yes, its blast radius is one hop from any Tier 1 or Tier 2 compromise. Ideal: join it to a dedicated Tier 0 or management forest, or isolate it behind jump-box access only.
|
||||||
|
- What account runs the connector service, and what permissions does it have? For Connect Sync, the on-prem connector account needs `Replicate Directory Changes` and `Replicate Directory Changes All`. Confirm it is a dedicated service account (ideally gMSA), not a human admin account that doubled up.
|
||||||
|
- When was the server last patched? Check `Get-HotFix | Sort-Object InstalledOn -Descending | Select-Object -First 5`. If nothing has been installed in the last 60 days, that is a finding.
|
||||||
|
- Is the Entra connector account (Directory Synchronization Accounts role) monitored? Any sign-in from any host other than the sync server should alert immediately.
|
||||||
|
- Are local administrators on the sync server documented and minimal?
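
The patch-recency check in the list above can be sketched as a small function so the 60-day threshold is applied the same way on every engagement. `Get-PatchRecency` is a hypothetical name; the function only assumes its input objects carry `HotFixID` and `InstalledOn`, as `Get-HotFix` output does.

```powershell
# Hypothetical helper: report the age of the most recent hotfix.
# The 60-day default mirrors the threshold in the checklist above.
function Get-PatchRecency {
    param(
        [object[]] $HotFix = @(),
        [int] $MaxAgeDays = 60,
        [datetime] $Now = (Get-Date)
    )
    # Some hotfix entries have no InstalledOn date; ignore those
    $latest = $HotFix | Where-Object { $_.InstalledOn } |
        Sort-Object InstalledOn -Descending | Select-Object -First 1
    $age = $null
    $id  = $null
    if ($latest) {
        $age = [int][math]::Floor(($Now - $latest.InstalledOn).TotalDays)
        $id  = $latest.HotFixID
    }
    [pscustomobject]@{
        LatestHotFix = $id
        AgeDays      = $age
        # No dated hotfix at all is also treated as a finding
        Finding      = ($null -eq $age) -or ($age -gt $MaxAgeDays)
    }
}

# On the sync server:
# Get-PatchRecency -HotFix (Get-HotFix)
```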
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Cloud-only Global Admins — enforce it on day one
|
||||||
|
|
||||||
|
**P0 if not in place.** Synced accounts holding Global Admin are the most common single finding across all engagements and the most direct path from a ransomwared on-prem AD to cloud dominance.
|
||||||
|
|
||||||
|
**Find the synced GAs:**
|
||||||
|
```powershell
# Connect-MgGraph -Scopes "Directory.Read.All"
# Global Admins whose UPN is not on the tenant's *.onmicrosoft.com domain.
# The UPN suffix is a proxy for "synced", not proof of it.
$gaRoleId = (Get-MgDirectoryRole -Filter "displayName eq 'Global Administrator'").Id
Get-MgDirectoryRoleMember -DirectoryRoleId $gaRoleId |
    Where-Object { $_.AdditionalProperties['userPrincipalName'] -notlike "*.onmicrosoft.com" }
```
|
||||||
|
|
||||||
|
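Every result is a Global Admin on a custom-domain UPN, which in a hybrid tenant almost always means a synced account; confirm with the account's `OnPremisesSyncEnabled` property. Every synced account in GA is a P0.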
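
A UPN suffix is only a proxy: a cloud-only account can sit on a custom domain, and a role member can be a group or service principal with no UPN at all. A more direct check, sketched here under assumptions, reads `OnPremisesSyncEnabled` from each user. `Select-SyncedAccount` is a hypothetical helper; the commented pipeline assumes the Microsoft Graph PowerShell module and the `$gaRoleId` variable from the query above.

```powershell
# Hypothetical helper: pass through only accounts synced from on-prem AD.
# OnPremisesSyncEnabled is $true for synced users and null for cloud-only ones.
function Select-SyncedAccount {
    param([Parameter(Mandatory, ValueFromPipeline)] $User)
    process {
        if ($User.OnPremisesSyncEnabled -eq $true) { $User }
    }
}

# Against a tenant (assumes Directory.Read.All):
# Get-MgDirectoryRoleMember -DirectoryRoleId $gaRoleId -All |
#     Where-Object { $_.AdditionalProperties['@odata.type'] -eq '#microsoft.graph.user' } |
#     ForEach-Object { Get-MgUser -UserId $_.Id -Property Id, UserPrincipalName, OnPremisesSyncEnabled } |
#     Select-SyncedAccount
```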
|
||||||
|
|
||||||
|
**Remediation path:**
|
||||||
|
1. Create a new cloud-only account (`user@tenant.onmicrosoft.com` format), assign GA, configure phishing-resistant MFA.
|
||||||
|
2. Validate the new account works — sign in, confirm PIM activation if PIM is in place.
|
||||||
|
3. Remove GA from the synced account.
|
||||||
|
4. Add a Conditional Access policy blocking synced account UPNs from holding privileged roles (belt-and-suspenders; requires knowing the UPN pattern).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Seamless SSO key — rotate it
|
||||||
|
|
||||||
|
`AZUREADSSOACC` was created when Seamless SSO was enabled and is almost certainly unrotated. The Kerberos key on this account is a silver-ticket / cloud token-forging exposure if the on-prem is compromised.
|
||||||
|
|
||||||
|
**Check last password set:**
|
||||||
|
```powershell
Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet | Select-Object PasswordLastSet
```
|
||||||
|
|
||||||
|
If this matches the approximate date Seamless SSO was enabled (often the go-live date of the Microsoft 365 tenant), it has never been rotated.
|
||||||
|
|
||||||
|
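**Rotate it:** Use the `Update-AzureADSSOForest` PowerShell command from the `AzureADSSO` module that ships with Entra Connect. Run it once per forest that has Seamless SSO enabled; Microsoft's guidance warns that running it more than once in quick succession breaks Seamless SSO until users' existing Kerberos tickets expire, so this is not the rotate-twice discipline used for KRBTGT. If Seamless SSO is not needed (Entra join and modern auth only), disable the feature and remove `AZUREADSSOACC` entirely.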
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Writebacks — name and own each one
|
||||||
|
|
||||||
|
Enumerate which writebacks are enabled (password writeback, group writeback, device writeback) in Connect Sync or Cloud Sync configuration. For each:
|
||||||
|
- Who owns the decision to have it enabled?
|
||||||
|
- What does an attacker reach if the cloud side is compromised — can they write into on-prem AD?
|
||||||
|
- Is the reverse blast radius documented?
|
||||||
|
|
||||||
|
Password writeback is usually justified (SSPR usability). Group writeback creates a two-way channel between cloud security groups and on-prem AD — the blast radius should be explicit. If there is no current owner or justification for a writeback, disable it.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Privileged Access
|
||||||
|
|
||||||
|
### PIM: table stakes in 2026
|
||||||
|
|
||||||
|
If the client has Entra ID P2 (included in Microsoft 365 E5 and EMS E5, and available as an add-on; Business Premium includes only P1) and is not using PIM for Entra administrative roles, that is a P0. There is no acceptable reason in 2026 for standing Global Admin, Privileged Role Administrator, Security Administrator, or Exchange Administrator assignments when PIM provides JIT elevation.
|
||||||
|
|
||||||
|
**What to confirm during engagement:**
|
||||||
|
- Global Admin: eligible only, not active. Any active (permanent) GA assignment that is not a break-glass account is a finding.
|
||||||
|
- Privileged Role Administrator: requires approval workflow on activation, not just MFA. This role controls who becomes admin — it should require a second human to approve.
|
||||||
|
- Security Administrator and Exchange Administrator: eligible, MFA on activation, justified time box (8 hours maximum for a working day).
|
||||||
|
- PIM activation requires phishing-resistant MFA. If it accepts push-approve, it is phishable.
|
||||||
|
|
||||||
|
**2026 note:** PIM now supports custom role definitions. If a client is assigning built-in broad roles (like Global Admin) to do a narrow task, check whether a custom role or a more scoped built-in (e.g., Intune Administrator instead of Global Admin) applies.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Service principals: the 2026 audit
|
||||||
|
|
||||||
|
Service principals hold more standing privilege in most tenants than all human admins combined. They cannot do MFA. They are almost never reviewed. This is the dark matter of privileged access.
|
||||||
|
|
||||||
|
**Escalation-grade Graph permissions — find every app holding these in 2026:**
|
||||||
|
- `RoleManagement.ReadWrite.Directory` — can grant any Entra role
|
||||||
|
- `AppRoleAssignment.ReadWrite.All` — can assign any app role, including to itself
|
||||||
|
- `Application.ReadWrite.All` — can modify any application and create new ones
|
||||||
|
- `Directory.ReadWrite.All` — broad directory write
|
||||||
|
- Any API permission scoped `Full` or ending in `.ReadWrite.All` for sensitive services
|
||||||
|
|
||||||
|
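```powershell
# Connect-MgGraph -Scopes "Application.Read.All"
# Find service principals holding escalation-grade Graph application permissions
$dangerous = 'RoleManagement.ReadWrite.Directory', 'AppRoleAssignment.ReadWrite.All',
             'Application.ReadWrite.All', 'Directory.ReadWrite.All'
# Microsoft Graph's own service principal (well-known appId) defines the app roles
$graph   = Get-MgServicePrincipal -Filter "appId eq '00000003-0000-0000-c000-000000000000'"
$roleIds = ($graph.AppRoles | Where-Object { $_.Value -in $dangerous }).Id
Get-MgServicePrincipal -All | ForEach-Object {
    Get-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $_.Id -All |
        Where-Object { $_.ResourceId -eq $graph.Id -and $_.AppRoleId -in $roleIds }
} | Select-Object PrincipalDisplayName, PrincipalId, AppRoleId
```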
|
||||||
|
|
||||||
|
For every hit: who created this app registration, when, is the permission still needed, is there an expiring secret or certificate, and can it be replaced with a managed identity?
|
||||||
|
|
||||||
|
**Secrets never expire — find them:** In the Entra portal > App registrations > All applications > sort by "Certificate & secrets expiration." Filter for never-expiring secrets. Every one is a standing credential with no forced rotation.
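
The portal view does not export cleanly, so a scripted pass is often easier to put in a report. This is a sketch under assumptions: `Get-LongLivedSecret` is a hypothetical helper, the two-year cut-off is an illustrative default rather than a figure from the handbook, and the input is assumed to look like `Get-MgApplication` output (`DisplayName`, `AppId`, and `PasswordCredentials` entries with `KeyId` and `EndDateTime`).

```powershell
# Hypothetical helper: list client secrets with no end date or a far-future one.
function Get-LongLivedSecret {
    param(
        [Parameter(Mandatory, ValueFromPipeline)] $Application,
        [int] $MaxYears = 2,              # illustrative threshold, adjust per client
        [datetime] $Now = (Get-Date)
    )
    process {
        foreach ($cred in $Application.PasswordCredentials) {
            $tooLong = ($null -eq $cred.EndDateTime) -or
                       ($cred.EndDateTime -gt $Now.AddYears($MaxYears))
            if ($tooLong) {
                [pscustomobject]@{
                    AppDisplayName = $Application.DisplayName
                    AppId          = $Application.AppId
                    KeyId          = $cred.KeyId
                    EndDateTime    = $cred.EndDateTime
                }
            }
        }
    }
}

# Against a tenant (assumes Application.Read.All):
# Get-MgApplication -All | Get-LongLivedSecret
```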
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### On-prem service accounts: gMSA yes, dMSA wait
|
||||||
|
|
||||||
|
**gMSA (Group Managed Service Accounts):** The right answer for on-prem service accounts in 2026. Automatic password rotation (no static secret), not Kerberoastable in the traditional sense, natively supported across Windows Server 2012+. If a client has regular service accounts with static passwords (especially if those passwords are 2+ years old), migrate to gMSA.
|
||||||
|
|
||||||
|
**Kerberoasting check (run this, not just ask about it):**
|
||||||
|
```powershell
# Enabled user accounts that carry an SPN (Kerberoasting candidates), oldest password first
Get-ADUser -LDAPFilter '(servicePrincipalName=*)' -Properties ServicePrincipalName, PasswordLastSet |
    Where-Object { $_.Enabled } |
    Sort-Object PasswordLastSet |
    Select-Object Name, PasswordLastSet, ServicePrincipalName
```
|
||||||
|
|
||||||
|
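Every account in the result is Kerberoastable, because any user account with an SPN is. A `PasswordLastSet` older than 1 year means the ticket has been crackable offline for long enough to assume it has been; treat those as P0.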
|
||||||
|
|
||||||
|
**dMSA (Delegated Managed Service Accounts):** Introduced with Windows Server 2025-era tooling, targeting the migration path from standing service accounts. Do not recommend dMSA in 2026 — there is published privilege-escalation research against the migration path. Use gMSA until the specific vulnerabilities are patched and the client's environment is confirmed current. Check current Microsoft advisories at engagement time.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### LAPS: Windows LAPS deployment in 2026
|
||||||
|
|
||||||
|
**Legacy Microsoft LAPS** (the separately-downloaded agent) should be migrated to **Windows LAPS**, the built-in solution available in Windows 10 22H2 / Windows 11 22H2 and Windows Server 2019+ with April 2023 updates or later.
|
||||||
|
|
||||||
|
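Windows LAPS backs each device's password up to one directory, either AD or Entra ID, as set by the `BackupDirectory` policy; a single device cannot write to both. In a hybrid estate, expect both stores to be in use across the fleet: Entra ID for Entra-joined devices, AD for domain-joined ones. Manage via Intune (cloud-joined) or GPO (domain-joined).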
|
||||||
|
|
||||||
|
**Coverage check:**
|
||||||
|
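```powershell
# Computers with no LAPS expiry timestamp under either schema (null = not managed).
# The expiry attributes are readable without password-read rights and are populated
# even when Windows LAPS stores the password encrypted (msLAPS-EncryptedPassword).
# Drop an attribute from the list if that schema extension is not present in the forest.
Get-ADComputer -Filter * -Properties 'ms-Mcs-AdmPwdExpirationTime', 'msLAPS-PasswordExpirationTime' |
    Where-Object { $null -eq $_.'ms-Mcs-AdmPwdExpirationTime' -and $null -eq $_.'msLAPS-PasswordExpirationTime' } |
    Select-Object Name
```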
|
||||||
|
|
||||||
|
Every result is a computer with a shared or unknown local admin password — lateral movement risk.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### KRBTGT rotation
|
||||||
|
|
||||||
|
Check password age. 365+ days without rotation is a P1. No documented rotation since domain creation (common when the domain is 5–10 years old) is a P0 for any high-sensitivity engagement.
|
||||||
|
|
||||||
|
```powershell
Get-ADUser krbtgt -Properties PasswordLastSet | Select-Object PasswordLastSet
```
|
||||||
|
|
||||||
|
Rotation procedure: rotate once, wait at least the max ticket lifetime (default 10 hours), rotate again. Document both rotation timestamps. After rotation, monitor for authentication failures caused by cached golden tickets — if detections fire, that was a real golden ticket, not a drill finding.
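
The age thresholds above can be applied mechanically. The sketch below is illustrative only: `Get-KrbtgtRotationFinding` is a hypothetical helper, and treating "password set within a day of account creation" as "never rotated" is an assumption made for this example, not a rule from the handbook.

```powershell
# Hypothetical helper: classify KRBTGT password age using the thresholds above.
function Get-KrbtgtRotationFinding {
    param(
        [Parameter(Mandatory)] [datetime] $PasswordLastSet,
        [Parameter(Mandatory)] [datetime] $Created,
        [datetime] $Now = (Get-Date)
    )
    $ageDays = [int][math]::Floor(($Now - $PasswordLastSet).TotalDays)
    # Assumption: a password set within a day of creation has never been rotated
    $neverRotated = [math]::Abs(($PasswordLastSet - $Created).TotalDays) -lt 1
    if ($neverRotated -and $ageDays -ge 365) {
        $priority = 'P0 (high-sensitivity engagements)'
    } elseif ($ageDays -ge 365) {
        $priority = 'P1'
    } else {
        $priority = 'OK'
    }
    [pscustomobject]@{
        AgeDays      = $ageDays
        NeverRotated = $neverRotated
        Priority     = $priority
    }
}

# On a domain-joined admin host with the ActiveDirectory module:
# $k = Get-ADUser krbtgt -Properties PasswordLastSet, whenCreated
# Get-KrbtgtRotationFinding -PasswordLastSet $k.PasswordLastSet -Created $k.whenCreated
```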
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### ADCS: treat it as Tier 0
|
||||||
|
|
||||||
|
If the client has Active Directory Certificate Services deployed (almost all do if they have a domain older than 7 years), run a basic ESC vulnerability check. The ESC1–ESC8 misconfigurations are well-documented, freely exploitable, and almost never remediated because most organizations do not know they have ADCS issues.
|
||||||
|
|
||||||
|
**Quick check:**
|
||||||
|
- Is ADCS installed? `Get-WindowsFeature ADCS-Cert-Authority` on any server
|
||||||
|
- Is any template published with "Supply subject in request" + broad enrollment rights? That is ESC1.
|
||||||
|
- Certipy (open source) or Certify: run in read-only enumeration mode (`certipy find`) to identify vulnerable templates
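
If Certipy or Certify cannot be run in the client's environment, the template half of the ESC1 condition can be read straight from AD. The sketch below is a partial check under stated assumptions: `Test-Esc1Candidate` is a hypothetical helper, it inspects only the template attributes (enrollee supplies subject, no manager approval, no authorised signature, an authentication-capable EKU), and it does not evaluate the template ACL or whether the template is published on a CA, both of which still need checking by hand.

```powershell
# Flag values from the certificate template schema
$CT_FLAG_ENROLLEE_SUPPLIES_SUBJECT = 0x1   # bit in msPKI-Certificate-Name-Flag
$CT_FLAG_PEND_ALL_REQUESTS         = 0x2   # bit in msPKI-Enrollment-Flag (manager approval)
$AuthEkus = '1.3.6.1.5.5.7.3.2',           # Client Authentication
            '1.3.6.1.4.1.311.20.2.2',      # Smart Card Logon
            '1.3.6.1.5.2.3.4',             # PKINIT Client Authentication
            '2.5.29.37.0'                  # Any Purpose

# Hypothetical helper: $true when a template has the ESC1 attribute pattern.
function Test-Esc1Candidate {
    param([Parameter(Mandatory, ValueFromPipeline)] $Template)
    process {
        $nameFlag   = [int64]$Template.'msPKI-Certificate-Name-Flag'
        $enrollFlag = [int64]$Template.'msPKI-Enrollment-Flag'
        $ekus       = @($Template.'pKIExtendedKeyUsage' | Where-Object { $_ })

        $suppliesSubject = ($nameFlag -band $CT_FLAG_ENROLLEE_SUPPLIES_SUBJECT) -ne 0
        $needsApproval   = ($enrollFlag -band $CT_FLAG_PEND_ALL_REQUESTS) -ne 0
        $needsSignature  = [int]$Template.'msPKI-RA-Signature' -gt 0
        # A template with no EKU at all can be used for any purpose
        $allowsAuth      = ($ekus.Count -eq 0) -or
                           (@($ekus | Where-Object { $_ -in $AuthEkus }).Count -gt 0)

        $suppliesSubject -and (-not $needsApproval) -and (-not $needsSignature) -and $allowsAuth
    }
}

# With the ActiveDirectory module:
# $cfg = (Get-ADRootDSE).configurationNamingContext
# Get-ADObject -SearchBase "CN=Certificate Templates,CN=Public Key Services,CN=Services,$cfg" `
#     -Filter * -Properties * | Where-Object { $_ | Test-Esc1Candidate } | Select-Object Name
```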
|
||||||
|
|
||||||
|
ADCS is Tier 0. It sits on whatever server it runs on, and that server should have the same access controls as a domain controller. Verify it is not on a Tier 1 or Tier 2 server.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Privileged Access Workstations — scope the conversation honestly
|
||||||
|
|
||||||
|
PAWs are right in principle. In 2026, the practical conversation with most mid-market clients is: **dedicated devices for Tier 0 administration** (Global Admins and Domain Admins use a separate machine for those tasks, even if that machine is just a hardened Windows device or a VM they launch for admin work).
|
||||||
|
|
||||||
|
The minimum viable version: a dedicated Intune-enrolled, Entra-joined device with no email, no browser for general use, and a Conditional Access policy that restricts Global Admin and Domain Admin-equivalent activity to that device only. Not perfect PAW architecture but a massive improvement over "I use my laptop for everything."
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Devices & Endpoint
|
||||||
|
|
||||||
|
### Reconcile the real fleet — do this on day one
|
||||||
|
|
||||||
|
Do not trust Intune's enrolled device count or any CMDB. Pull from four sources and compare them:
|
||||||
|
1. Intune managed devices (Intune portal)
|
||||||
|
2. Entra registered/joined devices (Entra portal > Devices)
|
||||||
|
3. Entra sign-in logs, device detail (what is actually authenticating)
|
||||||
|
4. Network device discovery if in scope
|
||||||
|
|
||||||
|
The gap between sources 1+2 and source 3 is your shadow/dark device population. Source 3 will almost always be larger. Every device authenticating that is not in sources 1+2 is an unmanaged device reaching data.
|
||||||
|
|
||||||
|
**Concrete — pull sign-in logs by device compliance state:** In the Entra portal: Sign-in logs > Add filter > "Managed device" = No or "Compliant" = No > export. Count the distinct device IDs. That count, compared against your Intune enrolled count, is the gap metric.
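
Once the exports exist, the gap metric is a set difference. The following sketch assumes all three lists have been normalised to Entra device IDs (for Intune exports that is the Entra device ID column, not the Intune device ID); `Get-UnmanagedAuthenticatingDevice` is a hypothetical helper name. Sign-ins from unregistered devices carry no device ID at all, so count those separately rather than relying on this list alone.

```powershell
# Hypothetical helper: device IDs seen in sign-in logs but absent from
# both the Intune and the Entra device inventories.
function Get-UnmanagedAuthenticatingDevice {
    param(
        [string[]] $IntuneDeviceIds = @(),
        [string[]] $EntraDeviceIds  = @(),
        [string[]] $SignInDeviceIds = @()
    )
    # Case-insensitive set of every device ID known to sources 1 and 2
    $known = [System.Collections.Generic.HashSet[string]]::new(
        [System.StringComparer]::OrdinalIgnoreCase)
    foreach ($id in @($IntuneDeviceIds) + @($EntraDeviceIds)) {
        if ($id) { [void]$known.Add($id) }
    }
    # Source 3 minus (source 1 + source 2), de-duplicated, empty IDs skipped
    $SignInDeviceIds |
        Where-Object { $_ -and -not $known.Contains($_) } |
        Sort-Object -Unique
}

# Example with exported CSV columns (column names are assumptions):
# $gap = Get-UnmanagedAuthenticatingDevice `
#     -IntuneDeviceIds (Import-Csv intune.csv).EntraDeviceId `
#     -EntraDeviceIds  (Import-Csv entra.csv).DeviceId `
#     -SignInDeviceIds (Import-Csv signins.csv).DeviceId
# @($gap).Count
```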
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Cloud-native migration: Entra join + Intune as default
|
||||||
|
|
||||||
|
For any new device deployment or device refresh in 2026, **Entra join + Intune management** is the default. Hybrid Entra join (AD-joined + cloud-registered) is technical debt to retire, not a target state.
|
||||||
|
|
||||||
|
**Migration readiness check:** What on-prem resources does the client's fleet actually need? Line-of-business applications, file shares, printers? Each dependency is a reason to stay hybrid; each that can be moved or resolved with another mechanism is a reason to go cloud-native. Build the dependency map first.
|
||||||
|
|
||||||
|
**GPO to Settings Catalog:** Most GPO settings now have equivalents in the Intune Settings Catalog. The IntunePolicyParser tool can parse existing GPOs and identify Settings Catalog equivalents. Run this early in an endpoint engagement to scope the migration effort.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Conditional Access — test every policy before signing off
|
||||||
|
|
||||||
|
This is not a recommendation. It is a requirement.
|
||||||
|
|
||||||
|
**Protocol:**
|
||||||
|
1. Before changing or reviewing any CA policy, write down the expected behavior for the users and conditions in scope: *"User X, device Y, location Z → MUST be [blocked/granted/MFA-prompted]."*
|
||||||
|
2. Use What If as a logic check only — it evaluates configuration, not enforcement.
|
||||||
|
3. Drive real sign-ins for every important user/condition combination. Observe the actual result.
|
||||||
|
4. If the observed result contradicts the displayed configuration, recreate the policy from scratch. Do not edit the existing object — a ghost policy carries corruption forward through edits.
|
||||||
|
5. Re-test after any tenant-level change: adding a domain, changing federation, new app registration. You do not need to have touched the CA policy for it to ghost.
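
Steps 1 and 3 of the protocol produce two tables: what must happen and what did happen. A minimal sketch of comparing them follows; `Compare-CaExpectation` and the `User` / `Condition` / `Result` property names are hypothetical, chosen for this example. Any row where `Pass` is false, including combinations that were never tested, is a candidate ghost policy or a gap in the test run.

```powershell
# Hypothetical helper: compare expected CA outcomes with observed sign-in results.
function Compare-CaExpectation {
    param(
        [Parameter(Mandatory)] [object[]] $Expected,
        [Parameter(Mandatory)] [object[]] $Observed
    )
    # Index observations by user + condition
    $seen = @{}
    foreach ($o in $Observed) {
        $seen["$($o.User)|$($o.Condition)"] = $o.Result
    }
    foreach ($e in $Expected) {
        $key = "$($e.User)|$($e.Condition)"
        $actual = 'NOT TESTED'
        if ($seen.ContainsKey($key)) { $actual = $seen[$key] }
        [pscustomobject]@{
            User      = $e.User
            Condition = $e.Condition
            Expected  = $e.Result
            Observed  = $actual
            Pass      = ($actual -eq $e.Result)
        }
    }
}

# Usage:
# $expected = @(
#     [pscustomobject]@{ User = 'alice'; Condition = 'unmanaged device'; Result = 'blocked' }
# )
# $observed = @(
#     [pscustomobject]@{ User = 'alice'; Condition = 'unmanaged device'; Result = 'granted' }
# )
# Compare-CaExpectation -Expected $expected -Observed $observed | Where-Object { -not $_.Pass }
```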
|
||||||
|
|
||||||
|
**Report-only mode:** Use report-only to pre-validate before enabling. But test in enabled mode before signing off. Report-only cannot find a ghost policy — only a live enforcement failure can.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### EPM: eliminate standing local admin
|
||||||
|
|
||||||
|
In 2026, **Endpoint Privilege Management (EPM)** in Intune is the right answer for "some users need admin rights for specific software." EPM provides JIT, audited, approved elevation without giving the user permanent local admin.
|
||||||
|
|
||||||
|
**Licensing:** Requires Intune Plan 2 or the Intune Suite (not included in standard Business Premium or E3 — verify licensing before scoping).
|
||||||
|
|
||||||
|
**Deployment:**
|
||||||
|
1. Audit current local admin membership across the fleet (GPO reporting or Intune device reports)
|
||||||
|
2. Identify the specific applications or tasks requiring elevation
|
||||||
|
3. Create EPM rules for those specific executables
|
||||||
|
4. Remove standing local admin from standard user accounts
|
||||||
|
5. Monitor EPM elevation events for anomalies
|
||||||
|
|
||||||
|
If EPM licensing is not available, Windows LAPS for local admin credentials (randomized, no shared password) plus a JIT process for elevation requests is the intermediate posture.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Update rings: the lesson from 2024
|
||||||
|
|
||||||
|
Configure update rings in Intune for all managed endpoints. Every client needs:
|
||||||
|
- **Pilot ring** (5–10% of devices, IT staff / early adopters): 0 days deferral
|
||||||
|
- **Broad ring** (remainder): 7-day deferral after pilot passes
|
||||||
|
- A named person with the authority to **halt a broad ring push** — confirmed they know how and have tested it
|
||||||
|
|
||||||
|
**Windows Autopatch** (included in Business Premium, E3 with Intune add-on, E5) automates ring management and defers intelligently. If the client is licensed for it and not using it, that is a quick win.
|
||||||
|
|
||||||
|
The 2024 CrowdStrike event applies not just to AV/EDR updates — it applies to any software distributed at scale. Update ring discipline is now an endpoint governance requirement, not a preference.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### MAM boundaries: test them on a real device
|
||||||
|
|
||||||
|
If the client uses App Protection Policies for BYOD (MAM-WE), the policy screen does not prove enforcement. Test on real devices, on current OS builds, per platform:
|
||||||
|
|
||||||
|
**Test protocol (run separately on iOS and Android):**
|
||||||
|
- Attempt to copy text from a managed app (Outlook, Teams) and paste into an unmanaged app
|
||||||
|
- Attempt to "Open in" from a managed attachment to an unmanaged app
|
||||||
|
- Attempt to save a file locally or to the camera roll
|
||||||
|
- Attempt to screenshot (if blocked by policy)
|
||||||
|
- Test from an unmanaged browser accessing SharePoint or OWA
|
||||||
|
|
||||||
|
Document where "Block" does not block. When you find a gap that survives reinstall on multiple devices, that is a vendor escalation, not a configuration fix.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Data & Collaboration
|
||||||
|
|
||||||
|
### Anonymous sharing: disable at the tenant level on day one
|
||||||
|
|
||||||
|
"Anyone with the link" sharing is a bearer token for your data — no identity required, forwardable, often with no expiry, reachable by anyone who ever held the link. This is the single largest data exposure fragility in M365.
|
||||||
|
|
||||||
|
**Immediate action:** SharePoint Admin Center > Policies > Sharing > External sharing: set to "New and existing guests" (requires authentication) or "Only people in your organization." If the client has a business case for anonymous links, scope specific sites where it is permitted and disable at the tenant level for everything else.
|
||||||
|
|
||||||
|
**Enumerate existing anonymous links:**
|
||||||
|
```powershell
# PnP PowerShell
Get-PnPTenantSite -IncludeOneDriveSites | ForEach-Object {
    Get-PnPSiteCollectionSharingLinks -Site $_.Url
} | Where-Object { $_.Link -like "*guestaccess*" }
```
|
||||||
|
|
||||||
|
The list you get is almost always longer than anyone expected. The exercise of producing it is itself a finding.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### External auto-forwarding: block it and check for active rules
|
||||||
|
|
||||||
|
**Block at the global level:** Exchange Admin Center > Mail flow > Remote domains > Default domain > Automatic forwarding: Disabled.
|
||||||
|
|
||||||
|
**Check for existing rules (do this before blocking in case active BEC is in progress):**
|
||||||
|
```powershell
Get-TransportRule | Where-Object {$_.BlindCopyTo -ne $null -or $_.RedirectMessageTo -ne $null} |
    Select-Object Name, BlindCopyTo, RedirectMessageTo, Enabled
```
|
||||||
|
|
||||||
|
Any rule forwarding to an external address with no documented business owner is a potential BEC persistence mechanism. Treat as P0 until confirmed otherwise.
|
||||||
|
|
||||||
|
Also check Outlook/OWA rules at the mailbox level for executive accounts:
|
||||||
|
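```powershell
# Mailbox-level rules that forward or redirect mail (slow on large tenants)
Get-Mailbox -ResultSize Unlimited | ForEach-Object {
    Get-InboxRule -Mailbox $_.UserPrincipalName
} | Where-Object { $_.ForwardTo -or $_.ForwardAsAttachmentTo -or $_.RedirectTo } |
    Select-Object MailboxOwnerId, Name, Enabled, ForwardTo, ForwardAsAttachmentTo, RedirectTo
```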
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Crown jewels: name them before scoping DLP or labels
|
||||||
|
|
||||||
|
The first question in every data engagement: *"Which five data sets, if exfiltrated, would end or materially damage this business?"*
|
||||||
|
|
||||||
|
If the client cannot name them, that is finding #1 and the prerequisite for everything else. DLP and sensitivity labels applied before the crown jewels are identified are DLP and sensitivity labels that protect the wrong things.
|
||||||
|
|
||||||
|
Common crown jewels in 2026: M&A communications, board and executive email, source code repositories, customer PII subject to GDPR/NIS2, financial forecasts and models, intellectual property, credentials and secrets stored in SharePoint/Teams.
|
||||||
|
|
||||||
|
Once named: where do they live? Who has access? Are they labeled? Is access audited?
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Sensitivity labels and auto-labeling
|
||||||
|
|
||||||
|
**2026 recommendation:** If the client is on E5 Compliance or equivalent, deploy auto-labeling policies for the crown jewel data types. Manual labeling depends on user behavior; auto-labeling does not.
|
||||||
|
|
||||||
|
**Licensing check first:** Manual sensitivity labels need at least M365 E3 or Business Premium (Business Basic and Standard do not include them). Auto-labeling, advanced DLP, and Purview data governance need M365 E5, E5 Compliance, or the relevant Microsoft Purview add-on. Verify before scoping.
|
||||||
|
|
||||||
|
**Implementation sequence:**
|
||||||
|
1. Define the crown jewels (see above)
|
||||||
|
2. Create the sensitivity labels (Public, Internal, Confidential, Highly Confidential) and order them so the most restrictive label carries the highest priority
|
||||||
|
3. Apply encryption to Highly Confidential and Confidential labels — encryption travels with the file, including after exfiltration
|
||||||
|
4. Configure auto-labeling for known high-value content types (credit card numbers, national IDs, custom regex for the client's IP)
|
||||||
|
5. Run each auto-labeling policy in simulation mode and review the matched items before turning it on in production
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Guest access: treat as standing blast radius
|
||||||
|
|
||||||
|
Run a guest access review on every engagement. Most tenants cannot produce the list of current guests without effort. The exercise of trying to produce it is the finding.
|
||||||
|
|
||||||
|
**Enumerate guests:**
|
||||||
|
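```powershell
# SignInActivity is not returned by default: request it explicitly (needs AuditLog.Read.All)
Get-MgUser -Filter "userType eq 'Guest'" -All -Property DisplayName,Mail,CreatedDateTime,SignInActivity |
  Select-Object DisplayName, Mail, CreatedDateTime,
    @{N='LastSignInDateTime';E={$_.SignInActivity.LastSignInDateTime}}
```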
|
||||||
|
|
||||||
|
Sort by `SignInActivity.LastSignInDateTime`. Guests who have not signed in for 90+ days rarely have a legitimate active need; remove them or make the sponsor justify them. The default should be expiration, not permanence.
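
A minimal sketch of the 90-day cut, assuming the guest list has already been exported with each guest's last sign-in flattened into a `LastSignIn` column. The function name `Get-StaleGuest`, the column names, and the sample rows are illustrative assumptions, not part of any Microsoft module.

```powershell
# Hypothetical helper: returns guests with no sign-in inside the idle window.
# A guest that has never signed in (empty LastSignIn) is treated as stale.
function Get-StaleGuest {
    param(
        [Parameter(Mandatory)] [object[]] $Guest,
        [int] $MaxIdleDays = 90,
        [datetime] $AsOf = (Get-Date)
    )
    $cutoff = $AsOf.AddDays(-$MaxIdleDays)
    foreach ($g in $Guest) {
        $last = if ($g.LastSignIn) { [datetime]$g.LastSignIn } else { $null }
        if ($null -eq $last -or $last -lt $cutoff) {
            [PSCustomObject]@{
                Mail       = $g.Mail
                LastSignIn = $last
                IdleDays   = if ($null -eq $last) { $null } else { [int]($AsOf - $last).TotalDays }
            }
        }
    }
}

# Illustrative data only
$sample = @(
    [PSCustomObject]@{ Mail = 'a@partner.example'; LastSignIn = '2026-05-20' }
    [PSCustomObject]@{ Mail = 'b@partner.example'; LastSignIn = '2025-11-02' }
    [PSCustomObject]@{ Mail = 'c@partner.example'; LastSignIn = $null }
)
$stale = Get-StaleGuest -Guest $sample -AsOf ([datetime]'2026-06-01')
$stale | Sort-Object LastSignIn | Format-Table
```

Treating never-signed-in guests as stale is a design choice: an invitation that was never redeemed is still an open door.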
|
||||||
|
|
||||||
|
**Configure guest access reviews** in Entra Identity Governance > Access reviews. Set recurring reviews for all guests at 90-day intervals. When a reviewer does not respond, the default action should be removal, not retention.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Audit log: verify it is on and retained
|
||||||
|
|
||||||
|
Do not assume audit logging is enabled. Go to Microsoft Purview > Audit > Start recording user and admin activity (if the banner appears, it is not on). Then run a test search to confirm log entries are being captured.
|
||||||
|
|
||||||
|
**Retention check — critical:**
|
||||||
|
- E3 licensing (Audit Standard): 180-day default retention (90 days before the October 2023 change)
|
||||||
|
- E5 / Purview Audit Premium: 1 year (extendable to 10 years with add-on)
|
||||||
|
- Unified audit log must be explicitly enabled; it has historically not been on by default in older tenants
|
||||||
|
|
||||||
|
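For incident response purposes: attackers often hold access for months before discovery. If initial access happened 150 days before detection and the client has 180-day retention, the earliest evidence starts ageing out in 30 days; if access predates the retention window, the start of the intrusion is already gone. For most meaningful incidents, default retention is insufficient. Scope the retention discussion explicitly.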
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Recovery & Detection
|
||||||
|
|
||||||
|
### M365 backup: the mandatory conversation
|
||||||
|
|
||||||
|
Native Microsoft 365 provides recycle bins and version history. It does not provide point-in-time backup against ransomware, malicious admin deletion, or retention policy expiry.
|
||||||
|
|
||||||
|
**The question to ask the client:** "If someone with Global Admin access right now deleted every Exchange Online mailbox and every SharePoint site, what is your recovery path, and how long does it take?"
|
||||||
|
|
||||||
|
If the answer involves the Microsoft recycle bin and "we would call Microsoft support," that is not a recovery plan. The recycle bin window is 14–93 days depending on the workload and configuration, and it does not protect against retention policy deletion or hard-delete operations by a malicious admin.
|
||||||
|
|
||||||
|
**2026 recommendation:** A third-party M365 backup solution covering Exchange Online, SharePoint Online, OneDrive for Business, and Teams is a baseline requirement for any client treating M365 as business-critical. The market is mature. Veeam, AvePoint, Acronis, and Dropsuite are the common options. Assess per client need.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Configuration-as-code: export the control plane
|
||||||
|
|
||||||
|
Export CA policies, Intune baseline configurations, and Entra role assignments to code or structured files at the start of every engagement. This serves three purposes:
|
||||||
|
1. Known-good baseline to detect drift and ghost configuration against
|
||||||
|
2. Rebuild artifact for a compromised or corrupted tenant
|
||||||
|
3. Change management — you can diff the configuration before and after every change
|
||||||
|
|
||||||
|
**CA policies:** Use CAExporter (`vibecoding/CAExporter`) to export all CA policies to JSON. Store in client's repository. Run the export again at the close of the engagement and diff against the opening export — changes are documented, not assumed.
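
The opening-versus-closing diff can be sketched as below. This assumes each export is a JSON array of policy objects carrying `id` and `displayName`, as the Graph `conditionalAccessPolicy` resource does; the actual file layout CAExporter produces is not confirmed here, and `Compare-CaExport` is a hypothetical helper name.

```powershell
# Hypothetical helper: compares two CA policy exports keyed on policy id.
# Policies are compared by re-serialising each one, so a pure key-order
# difference between the two exports would show up as 'Changed'.
function Compare-CaExport {
    param(
        [Parameter(Mandatory)] [string] $OpeningJson,
        [Parameter(Mandatory)] [string] $ClosingJson
    )
    $before = @{}
    $after  = @{}
    foreach ($p in ($OpeningJson | ConvertFrom-Json)) { $before[$p.id] = $p }
    foreach ($p in ($ClosingJson | ConvertFrom-Json)) { $after[$p.id]  = $p }

    $ids = @($before.Keys) + @($after.Keys) | Sort-Object -Unique
    foreach ($id in $ids) {
        $old = $before[$id]
        $new = $after[$id]
        if ($null -eq $new) {
            $status = 'Removed'
        } elseif ($null -eq $old) {
            $status = 'Added'
        } elseif (($old | ConvertTo-Json -Depth 20 -Compress) -ne
                  ($new | ConvertTo-Json -Depth 20 -Compress)) {
            $status = 'Changed'
        } else {
            continue   # identical at open and close: nothing to report
        }
        $name = if ($null -ne $new) { $new.displayName } else { $old.displayName }
        [PSCustomObject]@{ Status = $status; Id = $id; DisplayName = $name }
    }
}

# Illustrative exports only
$opening = '[{"id":"p1","displayName":"Block legacy auth","state":"enabled"},
             {"id":"p2","displayName":"Require compliant device","state":"enabled"}]'
$closing = '[{"id":"p1","displayName":"Block legacy auth","state":"enabledForReportingButNotEnforced"},
             {"id":"p3","displayName":"Admins from PAW only","state":"enabled"}]'
$diff = Compare-CaExport -OpeningJson $opening -ClosingJson $closing
$diff | Format-Table
```

In practice the two strings would come from `Get-Content -Raw` on the stored export files. Any row in the output that does not map to a documented change is itself a finding.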
|
||||||
|
|
||||||
|
**Intune:** The Graph API can export most Intune configuration; IntunePolicyParser assists with policy comprehension. Store the export.
|
||||||
|
|
||||||
|
**Entra roles:** Capture the current role assignment list (who holds what role, eligibility vs activation) as a document. This is your before-state for any privileged access engagement.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Detection: eight signals that matter more than eight hundred that don't
|
||||||
|
|
||||||
|
Configure these eight before anything else. Each one represents a category of attack where silence is catastrophic:
|
||||||
|
|
||||||
|
| Signal | Where to configure | Why it cannot be noise |
|--------|-------------------|----------------------|
| Break-glass account sign-in (any use at all) | Entra audit logs → alert rule or Sentinel | An account that should never sign in has signed in |
| New Global Admin assigned | Entra audit logs, `Add member to role` for GA role | Shadow admin creation |
| DCSync from non-DC host | Microsoft Defender for Identity or Sentinel | On-prem AD credential harvest in progress |
| Impossible-travel sign-in for admin accounts | Entra ID Protection > Sign-in risk detections | Account takeover in flight |
| External auto-forward rule created | Exchange audit logs | BEC persistence being established |
| Mass download from SharePoint/OneDrive | Defender for Cloud Apps or Purview | Exfiltration in progress |
| New OAuth consent grant to high-privilege scope | Entra audit logs, `Consent to application` | Illicit app consent attack |
| Privileged role activation outside business hours | PIM alerts | Credential use at suspicious time |
|
||||||
|
|
||||||
|
Each of these should route to a named human who will respond within a defined SLA. Detection that fires into an unmonitored queue is theatre with a subscription cost.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### AD forest recovery: have the conversation
|
||||||
|
|
||||||
|
Ask the client: "Has anyone on your team ever run an AD forest recovery — not in a training lab, on a real forest?" The answer is almost universally no.
|
||||||
|
|
||||||
|
This is not a project you complete in an engagement — it is a finding and a recommendation. The finding: if AD is destroyed or corrupted (ransomware taking the DCs), recovery is a multi-day, expert-dependent process that nobody on this team has ever performed. The recommendation: run a tabletop of the procedure, identify the gaps in the runbook, and ensure the runbook is stored somewhere that survives the estate being dark (not in SharePoint, not in an AD-authenticated file share).
|
||||||
|
|
||||||
|
The minimum viable runbook should cover: authoritative DC restore sequence, metadata cleanup, double KRBTGT reset, trust rebuilds, and how the Entra side reconnects when on-prem is back.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Break-glass: test it, don't just create it
|
||||||
|
|
||||||
|
Break-glass accounts exist in most tenants. They are tested in almost none. On every engagement:
|
||||||
|
|
||||||
|
1. Does the break-glass account exist? (Cloud-only, `.onmicrosoft.com`, not synced)
|
||||||
|
2. Is it phishing-resistant? (FIDO2 key or certificate — not push-approve)
|
||||||
|
3. Is it excluded from the CA policy that would otherwise block it?
|
||||||
|
4. Does its use trigger an immediate alert? (If yes, verify the alert fires during the test — not just that the alert rule exists)
|
||||||
|
5. Where are the credentials? (Not in the client's normal password manager that requires the same identity to access)
|
||||||
|
6. When did it last sign in successfully? (The credential should be proven functional: test it)
|
||||||
|
|
||||||
|
The test is non-negotiable. An untested break-glass account is a belief, not a recovery path.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What changed: 2025 → 2026
|
||||||
|
|
||||||
|
| Area | Prior state | 2026 position |
|------|------------|---------------|
| AD FS | Roadmap item for most clients | P0 conversation: tooling mature, no excuse |
| Entra Cloud Sync | "For simple topologies" | Recommended default for new deployments |
| dMSA | Newly released, cautiously recommended | Hold: published escalation research; use gMSA |
| EPM | Available, optional | Table stakes for zero-standing-admin on endpoints |
| Windows Autopatch | Optional | Default recommendation for update ring discipline |
| CA ghost policy | Edge case, occasionally found | Documented pattern: test every policy as standard |
| M365 native backup | "Microsoft covers it" (wrong but common) | Third-party backup framed as baseline, not option |
| PIM activation MFA | Often push-approve | Must be phishing-resistant to count |
| Windows LAPS | New, replacing legacy LAPS | Deployed as standard; legacy LAPS is tech debt |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The governing question — carry it into every session
|
||||||
|
|
||||||
|
Before every finding, every recommendation, every conversation:
|
||||||
|
|
||||||
|
> **If this is owned tonight, what is the largest thing an attacker reaches before hitting a wall — and can I draw that wall?**
|
||||||
|
|
||||||
|
If the wall is missing or undrawn, you have found the work. Everything else is sequencing.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Field Guide for the Antifragile Handbook. Updated June 2026. Review and update January 2027 — the honest uncertainty sections of the books define what will change.*
|
||||||
@@ -0,0 +1,509 @@
|
|||||||
|
# Field Guide — Adversarial Validation
|
||||||
|
|
||||||
|
> *"It's a nice compliance dashboard you have here."*
|
||||||
|
|
||||||
|
**Last updated:** June 2026
|
||||||
|
**Companion to:** [Field Guide — 2026 Edition](field-guide-2026.md) · Books I–VI
|
||||||
|
**Engagement type:** Phase 2 — for clients who have done the foundational work
|
||||||
|
**Checklist:** [Adversarial Validation Checklist](../assessment-templates/adversarial-validation-checklist.md)
|
||||||
|
**Next review:** January 2027
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The premise
|
||||||
|
|
||||||
|
The client has MFA. They have Conditional Access. They have Intune. They have a SIEM. Their CIS score is in the seventies or eighties. Their audit passed. The dashboard is green.
|
||||||
|
|
||||||
|
This is the most dangerous estate to walk into — not because it is badly configured, but because everyone in the room believes it works. That belief is the fragility. Book I calls it directly: *"Green dashboards, untested reality — the most dangerous estate of all, because it feels safe."*
|
||||||
|
|
||||||
|
The foundational field guide tells you how to build controls. This engagement is about finding out which of the client's existing controls are real and which are representations — configurations that *display* correctly but *enforce* nothing, backups that exist but have never been restored, detection that fires into a queue nobody reads, attack paths to Domain Admin that nobody has mapped because the BloodHound licence expired.
|
||||||
|
|
||||||
|
**What you are doing in this engagement:** Systematically converting claimed security into observed security, domain by domain, and producing a structural change for every gap found. Not a pentest. Not a red team. A constructive adversarial validation — you are working with the client, with full authorization, with the explicit goal of finding what breaks before an attacker does.
|
||||||
|
|
||||||
|
**What you are not doing:** Adding more controls. This engagement deliberately does not recommend new tooling or new policies. If a control exists and does not work, the finding is that the control does not work — not that a different control is needed. Via negativa applies here too: the fragility is almost always that the existing controls have too many exceptions, too little monitoring, and have never been tested.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Before you start
|
||||||
|
|
||||||
|
### Authorization scope
|
||||||
|
|
||||||
|
Before any test in this engagement, confirm written authorization covering:
|
||||||
|
|
||||||
|
- Simulating attacks against identity (Kerberoasting, DCSync simulation, PIM bypass attempts)
|
||||||
|
- Triggering security alerts deliberately (break-glass sign-in, impossible-travel simulation, fake consent grant)
|
||||||
|
- Testing compliance controls on managed devices (rooting a test device, forcing a non-compliant state)
|
||||||
|
- Attempting data exfiltration through DLP and labeling controls (on test data, to controlled test destinations)
|
||||||
|
- Restoring from backup in a test environment
|
||||||
|
|
||||||
|
Authorization is not "we told them verbally." It is a document signed by the named executive sponsor covering the scope of tests. Scope the authorization to the test accounts, test devices, and test data used — do not test on production privileged accounts or production data unless explicitly scoped.
|
||||||
|
|
||||||
|
### Baseline capture before anything changes
|
||||||
|
|
||||||
|
On day one, before any test or change:
|
||||||
|
|
||||||
|
1. Export all CA policies to JSON (CAExporter or Graph API). This is the declared state you will test against and the known-good you will compare the close-of-engagement state to.
|
||||||
|
2. Run BloodHound and capture the full attack graph. The number of paths to Domain Admin at T+0 is your opening metric.
|
||||||
|
3. Pull the Entra role assignment list — who holds what role, eligible vs. active.
|
||||||
|
4. Pull the service principal inventory with their Graph permissions.
|
||||||
|
5. Export Intune compliance and configuration policy assignments.
|
||||||
|
6. Run `Get-ADUser krbtgt -Properties PasswordLastSet`, `Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet`, and document both.
|
||||||
|
7. Count sign-in log distinct device IDs for the last 30 days. Compare to Intune enrolled device count. Record the gap.
|
||||||
|
|
||||||
|
These numbers are your before-state. Every structural change produced by this engagement is measured against them.
|
||||||
|
|
||||||
|
### The opening conversation
|
||||||
|
|
||||||
|
This engagement starts with a single question asked out loud, to the most senior technical person in the room:
|
||||||
|
|
||||||
|
> *"Can you show me one control in this estate that you are certain works — not because the portal says so, but because you have watched it fire under real conditions?"*
|
||||||
|
|
||||||
|
The answer tells you everything. A person who can point to a specific tested control on a specific date has a security programme. A person who gestures at the dashboard has a compliance programme. Both deserve good consulting — but they need different things.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Identity — proving the wall is real
|
||||||
|
|
||||||
|
### The firebreak claim
|
||||||
|
|
||||||
|
The client almost certainly claims that cloud privilege is separated from on-prem compromise. Test the claim, don't accept it.
|
||||||
|
|
||||||
|
**Draw the full graph, out loud:**
|
||||||
|
Starting from Domain Admin (or a simulated compromise of the sync server), trace every path that reaches a cloud privileged role:
|
||||||
|
- Are any GAs synced from on-prem? (They claim no — verify.)
|
||||||
|
- Can the sync server connector account be used to tamper with cloud objects?
|
||||||
|
- Do any admins use the same device for Tier 0 and cloud admin work?
|
||||||
|
- Is there a PTA agent that could be compromised to intercept credentials?
|
||||||
|
- Does any MFA for cloud admin rely on an authenticator app on a device that is also used for email? (The MFA device is Tier 2. The admin role is cloud Tier 0. That is a tier violation across the MFA layer.)
|
||||||
|
|
||||||
|
**Verify cloud-only GAs are actually cloud-only:**
|
||||||
|
```powershell
$gaRoleId = (Get-MgDirectoryRole -Filter "displayName eq 'Global Administrator'").Id
Get-MgDirectoryRoleMember -DirectoryRoleId $gaRoleId -All |
  Where-Object { $_.AdditionalProperties['@odata.type'] -eq '#microsoft.graph.user' } |
  # onPremisesSyncEnabled is not in the default member payload, so fetch it per user
  ForEach-Object { Get-MgUser -UserId $_.Id -Property UserPrincipalName, OnPremisesSyncEnabled } |
  Select-Object UserPrincipalName, OnPremisesSyncEnabled
```
|
||||||
|
`onPremisesSyncEnabled: true` on any GA is a P0 finding. "We moved them to cloud-only" is the claim; this is the verification.
|
||||||
|
|
||||||
|
**Test the break-glass is actually independent:**
|
||||||
|
With the client present: sign in to the break-glass account. Does it succeed? Does an alert fire? Does the person named as the responder to that alert actually receive it and acknowledge it within the agreed SLA? An alert rule that exists but routes to an unmonitored inbox is a ghost detection.
|
||||||
|
|
||||||
|
### AD FS: is the token-signing key actually monitored?
|
||||||
|
|
||||||
|
If AD FS is still running (in a "mature" estate it often is, because "migration is on the roadmap"):
|
||||||
|
|
||||||
|
```powershell
# NotAfter and NotBefore live on the nested Certificate object, not on the cmdlet output itself
Get-AdfsCertificate -CertificateType Token-Signing |
  Select-Object Thumbprint, @{N='NotAfter';E={$_.Certificate.NotAfter}},
    @{N='DaysSinceRotation';E={((Get-Date) - $_.Certificate.NotBefore).Days}}
```
|
||||||
|
|
||||||
|
Then ask: if an attacker obtained the private key for this certificate right now, what would you see in your logs? Walk through the scenario. In almost every case the honest answer is "nothing — a Golden SAML token is indistinguishable from a legitimate one." That is the finding. The migration is no longer a roadmap item.
|
||||||
|
|
||||||
|
### PIM: test the activation path, not the configuration
|
||||||
|
|
||||||
|
The client has PIM. But:
|
||||||
|
|
||||||
|
- **What MFA method is required on activation?** Navigate to PIM > Settings for Global Administrator role > Require MFA on activation. Then confirm the MFA method registered for each eligible GA. Push-approve MFA + PIM activation = phishable PIM. The control is not what it appears.
|
||||||
|
- **Test an activation:** Have a test user with an eligible GA role activate it. Time the process. Observe: does the approval notification reach the approver? Does the approver know what they are approving, or does it arrive as a blind "approve this"? An approval workflow where approvers routinely click approve without context is not an approval workflow.
|
||||||
|
- **Check for standing GA assignments that are supposed to be eligible-only.** `Get-MgDirectoryRoleMember` for GA — any user with no corresponding PIM eligible assignment has a permanent standing assignment that exists outside PIM, whether intentionally or by configuration drift.
|
||||||
|
- **Check the maximum activation time box.** 24-hour activation windows are common in "we have PIM" deployments. An activation window that covers an entire working day is functionally standing privilege during business hours.
|
||||||
|
|
||||||
|
### The connector account as a canary
|
||||||
|
|
||||||
|
Configure the canary: any sign-in by the Entra connector account (Directory Synchronization Accounts role) from any host other than the sync server should fire an alert. Then test it: simulate a sign-in from an unexpected host. Does the alert fire? Does someone respond?
|
||||||
|
|
||||||
|
If the answer is "we have an alert rule," test it. "We have an alert rule" is a declaration. A firing alert reaching a responding human is an observation. The handbook's hardest rule applies here: verify by observation, never by inspection.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Privilege — attack paths the client has not mapped
|
||||||
|
|
||||||
|
### BloodHound as a metric, not a one-time scan
|
||||||
|
|
||||||
|
The client's mature estate almost certainly has attack paths to Domain Admin that nobody has counted since the last pentest, if ever. Run BloodHound, capture the full graph, and count:
|
||||||
|
|
||||||
|
- **Total paths to Domain Admin** (all principals)
|
||||||
|
- **Paths reachable from standard user compromise** (the realistic starting point for a phishing attack)
|
||||||
|
- **Paths involving Kerberoastable service accounts** specifically
|
||||||
|
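- **Paths involving ADCS** (certificate-based escalation only appears in the graph if AD CS data is collected: with SharpHound CE, include `CertServices` alongside `ACL,ObjectProps,Trusts` in the collection methods)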
|
||||||
|
|
||||||
|
Present the number. Do not present it as "you have X findings." Present it as: *"From a single compromised standard user account, there are N independent routes to Domain Admin. Each route is a path through controls the attacker does not need to break because they route around them."* Then pick the three shortest paths and show them concretely.
|
||||||
|
|
||||||
|
This number is now a tracked metric. The engagement is not complete until it is going down.
|
||||||
|
|
||||||
|
### Kerberoast it — don't ask if it's possible
|
||||||
|
|
||||||
|
Run the attack:
|
||||||
|
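```powershell
# Using Rubeus or Invoke-Kerberoast in an authorized test context.
# Invoke-Kerberoast emits objects; the crackable string is in the Hash property.
Invoke-Kerberoast -OutputFormat Hashcat | Select-Object -ExpandProperty Hash |
  Out-File -Encoding ascii kerberoast_hashes.txt
```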
|
||||||
|
|
||||||
|
The question is not "are there Kerberoastable accounts" (there are) — the question is: **did anything detect it?** A Kerberoast produces distinctive TGS request patterns. If Defender for Identity, Microsoft Sentinel, or any SIEM is watching, it should alert. If it does not, you have found a detection gap more important than the accounts themselves.
|
||||||
|
|
||||||
|
Then attempt to crack the hashes offline (with explicit authorization, on a controlled device). Report which accounts crack and in what time. Most clients are surprised. The service account from 2019 with the password that was "rotated" to `ServiceAcc0unt!2019` cracks in minutes.
|
||||||
|
|
||||||
|
### ADCS: the forgotten Tier 0 target
|
||||||
|
|
||||||
|
Run a basic ESC vulnerability enumeration:
|
||||||
|
```
|
||||||
|
certipy find -u <test-account>@domain.com -p <password> -dc-ip <DC-IP> -stdout
|
||||||
|
```
|
||||||
|
Or Certify if a Windows test host is more convenient:
|
||||||
|
```
|
||||||
|
Certify.exe find /vulnerable
|
||||||
|
```
|
||||||
|
|
||||||
|
In a mature estate, the ADCS server has been running for years, was configured for a specific purpose in 2018, and has never been audited against the ESC series. ESC1 (requester-supplied subject, a client-authentication EKU, and broad enrollment rights) in particular is common and catastrophic: it allows any user with enrollment rights on the template to obtain a certificate for any principal, including Domain Admins. Find it, show the exploit path, and document that the ADCS server is being treated as Tier 1 when it is Tier 0.
|
||||||
|
|
||||||
|
### Service principal dark matter
|
||||||
|
|
||||||
|
The client's mature estate has app registrations. Some of them have permissions that were granted for a reason that nobody in the room can explain. Find the escalation-grade ones:
|
||||||
|
|
||||||
|
```powershell
# Application permissions (not delegated — these run without a user)
$dangerousPermissions = @(
    "9e3f62cf-ca93-4989-b6ce-bf83c28f9fe8", # RoleManagement.ReadWrite.Directory
    "06b708a9-e830-4db3-a914-8e69da51d44f", # AppRoleAssignment.ReadWrite.All
    "1bfefb4e-e0b5-418b-a88f-73c46d2cc8e9", # Application.ReadWrite.All
    "19dbc75e-c2e2-444c-a770-ec69d8559fc7"  # Directory.ReadWrite.All
)

Get-MgServicePrincipal -All | ForEach-Object {
    $sp = $_
    Get-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $sp.Id |
        Where-Object { $_.AppRoleId -in $dangerousPermissions } |
        ForEach-Object {
            [PSCustomObject]@{
                ServicePrincipal = $sp.DisplayName
                Permission       = $_.AppRoleId
                GrantedDate      = $_.CreatedDateTime
            }
        }
} | Sort-Object GrantedDate
```
|
||||||
|
|
||||||
|
For each result: ask the room who created this app registration, what it does, and whether the permission is still needed. The answer to all three is usually "I don't know." That is the finding.
|
||||||
|
|
||||||
|
Then go further: check which of these service principals have non-expiring client secrets and which have never been used (check the sign-in logs for the service principal's `lastSignInDateTime`). A service principal that has not authenticated in 180 days with a never-expiring secret holding escalation-grade Graph permissions is a standing credential an attacker can use indefinitely without triggering a human sign-in.
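
The dormant-plus-long-lived triage can be sketched as follows, assuming the inventory has been flattened to one row per service principal with `Name`, `LastSignIn`, and `SecretEndDate`. Entra secrets always carry an end date, so "never-expiring" is approximated here as an end date more than two years out. The function name, column names, and thresholds are illustrative assumptions.

```powershell
# Hypothetical helper: flags service principals that are both dormant and
# holding a secret that will stay valid for a long time.
function Get-DormantServicePrincipal {
    param(
        [Parameter(Mandatory)] [object[]] $ServicePrincipal,
        [int] $DormantDays = 180,
        [int] $LongLivedSecretDays = 730,
        [datetime] $AsOf = (Get-Date)
    )
    foreach ($sp in $ServicePrincipal) {
        $last      = if ($sp.LastSignIn) { [datetime]$sp.LastSignIn } else { $null }
        $secretEnd = [datetime]$sp.SecretEndDate
        # Never-signed-in counts as dormant
        $dormant   = ($null -eq $last) -or (($AsOf - $last).TotalDays -gt $DormantDays)
        $longLived = ($secretEnd - $AsOf).TotalDays -gt $LongLivedSecretDays
        if ($dormant -and $longLived) {
            [PSCustomObject]@{
                Name          = $sp.Name
                LastSignIn    = $last
                SecretEndDate = $secretEnd
            }
        }
    }
}

# Illustrative data only
$inventory = @(
    [PSCustomObject]@{ Name = 'legacy-sync-app'; LastSignIn = '2025-09-01'; SecretEndDate = '2099-12-31' }
    [PSCustomObject]@{ Name = 'hr-connector';    LastSignIn = '2026-05-28'; SecretEndDate = '2099-12-31' }
    [PSCustomObject]@{ Name = 'old-poc';         LastSignIn = $null;        SecretEndDate = '2026-09-01' }
)
$flagged = Get-DormantServicePrincipal -ServicePrincipal $inventory -AsOf ([datetime]'2026-06-01')
$flagged | Format-Table
```

Feed it only the service principals that surfaced in the escalation-grade permission query, so every row in the output is a standing credential with directory-level reach.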
|
||||||
|
|
||||||
|
### Standing privilege check: the PIM compliance gap
|
||||||
|
|
||||||
|
Ask for the full current list of active (not eligible) privileged role assignments. For each one:
|
||||||
|
- Is it a break-glass account? If not, it should not be standing.
|
||||||
|
- Is it a service account that cannot use PIM? Document and scope the managed-identity migration.
|
||||||
|
- Is it an account someone added "temporarily" and forgot?
|
||||||
|
|
||||||
|
In most mature tenants, the list of active non-break-glass assignments is longer than anyone expects, because PIM was deployed and the existing standing assignments were not cleaned up at the time.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Devices — the compliance signal gap
|
||||||
|
|
||||||
|
### The ghost CA policy protocol
|
||||||
|
|
||||||
|
Apply this to every CA policy the client considers important (not every policy — prioritize the ones that block legacy auth, enforce device compliance, and gate privileged sign-in):
|
||||||
|
|
||||||
|
**Before testing any policy:**
|
||||||
|
Write down the expected outcome: *"User [X], device [Y], from location [Z], accessing [App] → MUST be [blocked / MFA-prompted / compliant-device-required]."* Write this before looking at the policy configuration. This prevents rationalizing whatever you observe.
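
One way to enforce the write-it-down-first rule is to record the expectation as data before the test runs. A minimal sketch follows; the function names, the outcome vocabulary, and the policy name are all hypothetical.

```powershell
# Hypothetical helpers: the expectation is captured with a timestamp first,
# and the observed outcome is attached afterwards.
function New-CaTestCase {
    param(
        [Parameter(Mandatory)] [string] $Policy,
        [Parameter(Mandatory)] [string] $Scenario,
        [Parameter(Mandatory)]
        [ValidateSet('Blocked', 'MfaPrompted', 'CompliantDeviceRequired', 'Allowed')]
        [string] $Expected
    )
    [PSCustomObject]@{
        Policy     = $Policy
        Scenario   = $Scenario
        Expected   = $Expected
        RecordedAt = (Get-Date)
        Observed   = $null
        Verdict    = 'NotRun'
    }
}

function Set-CaTestResult {
    param(
        [Parameter(Mandatory)] $TestCase,
        [Parameter(Mandatory)]
        [ValidateSet('Blocked', 'MfaPrompted', 'CompliantDeviceRequired', 'Allowed')]
        [string] $Observed
    )
    $TestCase.Observed = $Observed
    $TestCase.Verdict  = if ($Observed -eq $TestCase.Expected) { 'Enforced' } else { 'NotEnforced' }
    $TestCase
}

# Illustrative use: expectation first, observation second
$case = New-CaTestCase -Policy 'CA-Block-Legacy-Auth' `
    -Scenario 'Test user, unmanaged laptop, IMAP to Exchange Online' -Expected Blocked
$result = Set-CaTestResult -TestCase $case -Observed Allowed
$result | Format-List
```

A `NotEnforced` verdict is not yet a ghost: it sends you to check exclusions and report-only mode first, and only then to the recreate-and-retest step.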
|
||||||
|
|
||||||
|
**The tests to run:**
|
||||||
|
|
||||||
|
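1. **Legacy auth block:** Use a client that attempts Basic Auth (older Outlook, curl with basic auth headers to Exchange Online) from a test account. Expected: blocked. Exchange Online itself has rejected Basic Auth for most protocols since late 2022, so a failed sign-in alone proves nothing: check the Entra sign-in log for the attempt and confirm that the CA policy, not the service-side deprecation, produced the block. If the attempt succeeds, the CA policy that blocks legacy auth either has an exclusion, is in report-only, or is a ghost.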
|
||||||
|
|
||||||
|
2. **Compliant device gate:** Sign in from a device that is known to be non-compliant (a personal device, or a managed device you have taken out of compliance by disabling BitLocker or removing an agent). Expected: blocked from sensitive workloads. If access is granted, either the CA policy is not evaluating correctly or the compliance signal is stale.
|
||||||
|
|
||||||
|
3. **Admin sign-in from non-PAW:** Attempt to activate a PIM role from a standard workstation or a personal device. Expected: blocked if there is a CA policy restricting admin access to compliant or named devices. If it succeeds, the PAW policy is a claim.
|
||||||
|
|
||||||
|
4. **The ghost test:** If any policy above fails to enforce despite its configuration appearing correct — recreate the policy from scratch with identical parameters. Re-test. If the recreated policy enforces and the original did not, you have found a ghost policy. Document the specific policy name, the discrepancy, the recreation, and the re-test result.
|
||||||
|
|
||||||
|
**Important:** Do not re-edit a failing policy to fix it. Recreate it. A ghost policy carries its corruption forward through edits.
|
||||||
|
|
||||||
|
### Compliance signal spoofing: measure the lag
|
||||||
|
|
||||||
|
Take a test enrolled device (a managed device you have authorization to modify):
|
||||||
|
|
||||||
|
1. Root/jailbreak it, or manually induce a non-compliant state (disable encryption, disable the screen lock, install a prohibited app — whatever the compliance policy checks).
|
||||||
|
2. Record the timestamp.
|
||||||
|
3. Watch Intune and Entra ID: when does the compliance state flip to non-compliant?
|
||||||
|
4. When does Conditional Access actually start refusing the device (at the next token refresh, or on a CAE event)?
|
||||||
|
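5. Is Continuous Access Evaluation (CAE) in place for the workloads that matter? If yes, revocation is near-real-time for supported apps on CAE critical events (account disabled, password reset, explicit token revocation); verify by measurement whether a compliance flip is enforced as quickly. If no, the window is bounded by the token lifetime.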
|
||||||
|
|
||||||
|
The gap between step 2 and step 4 is the attacker's window after compromising a compliant device. Present it in minutes, not as "the token may be stale." Most clients have never measured it.
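
A small sketch for turning the recorded timestamps into the figures to present. The function name and the sample times are illustrative assumptions; the three inputs correspond to steps 2, 3, and 4 above.

```powershell
# Hypothetical helper: converts the three observed timestamps into lag figures.
function Get-ComplianceLag {
    param(
        [Parameter(Mandatory)] [datetime] $Tampered,            # step 2
        [Parameter(Mandatory)] [datetime] $MarkedNonCompliant,  # step 3
        [Parameter(Mandatory)] [datetime] $AccessBlocked        # step 4
    )
    [PSCustomObject]@{
        DetectionLagMinutes   = [math]::Round(($MarkedNonCompliant - $Tampered).TotalMinutes, 1)
        EnforcementLagMinutes = [math]::Round(($AccessBlocked - $MarkedNonCompliant).TotalMinutes, 1)
        AttackerWindowMinutes = [math]::Round(($AccessBlocked - $Tampered).TotalMinutes, 1)
    }
}

# Illustrative timestamps only
$lag = Get-ComplianceLag -Tampered '2026-06-10T09:00:00' `
    -MarkedNonCompliant '2026-06-10T13:42:00' `
    -AccessBlocked '2026-06-10T14:35:00'
$lag | Format-List
```

Splitting the window into detection lag and enforcement lag shows the client which side to fix: the Intune check-in cadence or the token and session behaviour.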
|
||||||
|
|
||||||
|
### Reconcile the real fleet
|
||||||
|
|
||||||
|
Pull four numbers and compare them:
|
||||||
|
|
||||||
|
| Source | Count |
|--------|-------|
| Intune managed devices | |
| Entra registered/joined devices | |
| Distinct device IDs in sign-in logs (last 30 days) | |
| Distinct device IDs signing in with "Device compliant: No" or "Device managed: No" | |
|
||||||
|
|
||||||
|
The gap between rows 1 and 2 (devices the tenant knows about) and row 3 (devices actually signing in) is the shadow population. Row 4 is the unmanaged population actively accessing data. Neither is a hypothetical risk: both are current, observable facts about who is accessing the tenant right now.
|
||||||
|
|
||||||
|
For every device in row 4: what data can it reach, and what Conditional Access policy, if any, applies to it?
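
The reconciliation itself is a set difference. A minimal sketch, assuming all three lists have been reduced to the Entra device ID (for Intune exports that is the `azureADDeviceId` field, not the Intune device ID); `Get-FleetGap` and the sample IDs are hypothetical.

```powershell
# Hypothetical helper: finds device IDs seen in sign-in logs that neither
# Intune nor Entra knows about.
function Get-FleetGap {
    param(
        [string[]] $IntuneDeviceId = @(),
        [string[]] $EntraDeviceId  = @(),
        [string[]] $SignInDeviceId = @()
    )
    $known = [System.Collections.Generic.HashSet[string]]::new(
        [System.StringComparer]::OrdinalIgnoreCase)
    foreach ($id in ($IntuneDeviceId + $EntraDeviceId)) { [void]$known.Add($id) }

    $seen   = @($SignInDeviceId | Sort-Object -Unique)
    $shadow = @($seen | Where-Object { -not $known.Contains($_) })

    [PSCustomObject]@{
        IntuneManaged   = @($IntuneDeviceId | Sort-Object -Unique).Count
        EntraRegistered = @($EntraDeviceId | Sort-Object -Unique).Count
        SeenSigningIn   = $seen.Count
        ShadowCount     = $shadow.Count
        ShadowDeviceId  = $shadow
    }
}

# Illustrative IDs only
$gap = Get-FleetGap -IntuneDeviceId 'd1', 'd2' `
    -EntraDeviceId 'd1', 'd2', 'd3' `
    -SignInDeviceId 'd1', 'd3', 'd4', 'd5', 'd4'
$gap | Format-List
```

The `ShadowDeviceId` list is the starting point for the per-device question: pull the sign-in entries for each ID and read which Conditional Access policies were applied.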
|
||||||
|
|
||||||
|
### Legacy auth: find the surviving flows
|
||||||
|
|
||||||
|
Even with a "block legacy auth" CA policy in place, find the exceptions:
|
||||||
|
|
||||||
|
```
Sign-in logs → Add filter → Client App → select all legacy authentication clients:
Authenticated SMTP
Autodiscover
Exchange ActiveSync
Exchange Online PowerShell
Exchange Web Services
IMAP4
MAPI Over HTTP
Offline Address Book
Other clients
Outlook Anywhere (RPC over HTTP)
POP3
Reporting Web Services
```
|
||||||
|
|
||||||
|
Export the results. Every entry is a legacy auth flow that either bypasses the CA policy (via an exclusion you should examine) or is a service account using a protocol that will break when the exclusion is removed. Build the map. The goal is zero — but the path to zero requires knowing what is currently there.
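One way to build the map from the exported sign-ins is sketched below. The column names (`User`, `Client app`, `Status`) and the sample rows are assumptions; adjust them to the headers in the actual export. The legacy set here is deliberately partial and should be extended to match the filter list used.

```python
import csv
import io
from collections import Counter

# Partial set for illustration; extend it to match the client-app filter list.
LEGACY_CLIENT_APPS = {
    "Exchange ActiveSync", "Exchange Web Services", "IMAP4",
    "MAPI Over HTTP", "Other clients", "POP3",
}


def legacy_flow_map(csv_text, legacy_apps=LEGACY_CLIENT_APPS):
    """Count successful legacy-auth sign-ins per (user, client app)."""
    flows = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Client app"] in legacy_apps and row["Status"] == "Success":
            flows[(row["User"], row["Client app"])] += 1
    return flows


# Hypothetical export rows.
SAMPLE = """User,Client app,Status
svc-scanner@example.com,IMAP4,Success
svc-scanner@example.com,IMAP4,Success
alice@example.com,Browser,Success
bob@example.com,POP3,Failure
"""

for (user, app), count in legacy_flow_map(SAMPLE).most_common():
    print(f"{user}\t{app}\t{count}")
```

Counting only successful sign-ins separates the flows that survive the policy from the attempts it already blocks.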
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Data — does protection actually travel
|
||||||
|
|
||||||
|
### Exfiltrate a labelled document
|
||||||
|
|
||||||
|
With authorization, take a test document labelled at the highest sensitivity tier available (Highly Confidential, or equivalent):
|
||||||
|
|
||||||
|
1. Forward it as an email attachment to a personal test email address outside the tenant. Does DLP intercept it? Does the label encryption hold on the received document?
|
||||||
|
2. Download it to an unmanaged device (one that is not Intune-enrolled). Open it. Does encryption require authentication to the tenant?
|
||||||
|
3. Share it via an anonymous "Anyone with the link" URL (if anonymous sharing is still permitted). Access the link from a browser with no tenant authentication. Does it open?
|
||||||
|
4. Copy and paste the content from the document into an unmanaged app (on a device where the MAM boundary applies). Does the block work?
|
||||||
|
5. Open it in a browser through Conditional Access App Control session policy. Attempt to download. Does the block work?
|
||||||
|
|
||||||
|
Document which paths hold and which do not. The ones that do not hold are the exfiltration routes an attacker (or a careless employee) will actually use. Every failed block is a finding; the label configuration that passed in the policy screen is the ghost, and the exfiltrated file is the fact.
|
||||||
|
|
||||||
|
### Enumerate the anonymous link population
|
||||||
|
|
||||||
|
The tenant sharing setting may say "restricted." That setting controls new links. It does not remove existing ones. Run:
|
||||||
|
|
||||||
|
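```powershell
# PnP PowerShell: requires Site Collection Admin on each site.
# Connect to the tenant admin site first so that Get-PnPTenantSite can enumerate sites.
# NOTE: verify the sharing-link cmdlet against the installed PnP.PowerShell version;
# recent releases expose per-item Get-PnPFileSharingLink / Get-PnPFolderSharingLink.
$sites = Get-PnPTenantSite
$sites | ForEach-Object {
    Connect-PnPOnline -Url $_.Url -Interactive
    Get-PnPSharingLinks | Where-Object { $_.SharingLinkType -eq "Anonymous" }
} | Export-Csv anonymous_links.csv -NoTypeInformation
```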
|
||||||
|
|
||||||
|
Present the count. In mature tenants, the anonymous link population predates the current tenant sharing settings by years. The setting was changed; the links were not revoked. Every entry is an active bearer token for data that predates the restriction.
|
||||||
|
|
||||||
|
### The BEC forward rule: simulate it
|
||||||
|
|
||||||
|
With a test account (not an executive, not a privileged account):
|
||||||
|
|
||||||
|
1. Create an Inbox rule forwarding all email to an external test address you control.
|
||||||
|
2. Wait to see whether anything detects it and when.
|
||||||
|
3. Check whether the global block on external auto-forwarding (`Get-RemoteDomain Default | Select-Object AutoForwardEnabled`) actually blocks this test rule from executing.
|
||||||
|
4. Confirm: does the transport rule block the forwarding, or does the block only apply to Outlook/OWA auto-forwarding (not to manually-created Inbox rules)?
|
||||||
|
|
||||||
|
There is a documented distinction: the transport-level `AutoForwardEnabled: false` on Remote Domains blocks transport-rule-level forwarding and OWA Auto-Reply forwarding, but Inbox rules created in Outlook/OWA by the user may still forward depending on the specific configuration. Test this on the client's environment. Do not assume.
|
||||||
|
|
||||||
|
### Crown jewel access review
|
||||||
|
|
||||||
|
For the data sets the client has identified as crown jewels (if they have not identified them, that is the first finding — go back to basic engagement):
|
||||||
|
|
||||||
|
1. Pull the access list for the crown-jewel SharePoint sites and OneDrive locations.
|
||||||
|
2. Pull the audit log for access events on those locations over the last 30 days.
|
||||||
|
3. Identify: who accessed them, how frequently, from what devices?
|
||||||
|
4. Find: any access from unmanaged devices. Any access from accounts that should not have visibility. Any bulk download events.
|
||||||
|
5. Specifically check for guest access to the crown-jewel locations — guests whose project has concluded but whose access persists.
|
||||||
|
|
||||||
|
The audit log review is also a test of the audit infrastructure: can you produce a coherent forensic reconstruction of who accessed what, when, from where, over the last 30 days? If the answer is "we would need to run several different reports and correlate them manually," that is an incident response readiness finding.
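The bulk-download check in step 4 can be approximated with a sliding window over exported audit events. This is a sketch under assumptions: the event dictionaries use hypothetical keys (`user`, `operation`, `time`), `FileDownloaded` is taken as the operation name for downloads, and the threshold and window are arbitrary starting points to tune per client.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bulk_downloaders(events, threshold=50, window=timedelta(minutes=10)):
    """Return {user: peak download count} for users at or above threshold in any window."""
    by_user = defaultdict(list)
    for event in events:
        if event["operation"] == "FileDownloaded":
            by_user[event["user"]].append(datetime.fromisoformat(event["time"]))

    flagged = {}
    for user, times in by_user.items():
        times.sort()
        start = 0
        peak = 0
        for end in range(len(times)):
            # Shrink the window until it spans no more than `window`.
            while times[end] - times[start] > window:
                start += 1
            peak = max(peak, end - start + 1)
        if peak >= threshold:
            flagged[user] = peak
    return flagged


# Hypothetical events: three downloads within two minutes by one test account.
demo = [
    {"user": "test.user@example.com", "operation": "FileDownloaded", "time": "2026-06-01T09:00:00"},
    {"user": "test.user@example.com", "operation": "FileDownloaded", "time": "2026-06-01T09:01:00"},
    {"user": "test.user@example.com", "operation": "FileDownloaded", "time": "2026-06-01T09:02:00"},
    {"user": "other@example.com", "operation": "FileAccessed", "time": "2026-06-01T09:02:30"},
]
print(bulk_downloaders(demo, threshold=3))
```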
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Detection — does it fire, does anyone act
|
||||||
|
|
||||||
|
This section is the difference between robustness and antifragility. Everything before this is about whether controls hold. This section is about whether the organization learns when they do not.
|
||||||
|
|
||||||
|
### The eight simulations
|
||||||
|
|
||||||
|
For each of these, run the simulation with authorization, observe the outcome, and measure the time from event to human acknowledgment. The SLA the client believes they have is the declared state. The measured time is the observed state.
|
||||||
|
|
||||||
|
**Simulation 1 — Break-glass sign-in:**
|
||||||
|
Sign in to the break-glass Global Admin account. This should trigger an immediate, high-priority alert routed to a named responder. Measure: how long from sign-in to human acknowledgment? If the answer is longer than 15 minutes, the break-glass is not monitored at the level it needs to be.
|
||||||
|
|
||||||
|
**Simulation 2 — New Global Admin assigned:**
|
||||||
|
Assign GA to a test account. Observe: does an alert fire in Microsoft Sentinel, Microsoft Defender, or the configured SIEM? Who receives it? When? Revoke the assignment after the test.
|
||||||
|
|
||||||
|
**Simulation 3 — DCSync simulation:**
|
||||||
|
From a non-DC host with a test account that has the relevant replication permissions (or using Mimikatz in an authorized test context), simulate a DCSync operation. Microsoft Defender for Identity (MDI) should raise its "Suspected DCSync attack (replication of directory services)" alert. Does it? Does the alert reach a human? Most mature clients have MDI deployed; fewer have confirmed that this specific alert fires and routes correctly.
|
||||||
|
|
||||||
|
**Simulation 4 — Kerberoasting (detection, not just the attack):**
|
||||||
|
Run the Kerberoast from section 2 again, now with the explicit goal of measuring detection. Did the TGS request pattern generate an alert? The attack was run earlier to find the vulnerable accounts; run it again now to find the detection gap.
|
||||||
|
|
||||||
|
**Simulation 5 — Impossible travel for an admin account:**
|
||||||
|
Using a VPN exit node or a cloud VM in a geographically distant region, sign in as a test user who recently signed in from the client's location. Entra ID Protection should flag this as a risky sign-in. Does the sign-in risk level rise? Does a risk-based CA policy enforce remediation (MFA challenge or block)? Does an alert fire to the SOC? For admin accounts specifically, this should be a high-priority signal.
|
||||||
|
|
||||||
|
**Simulation 6 — External auto-forward rule:**
|
||||||
|
From the data section — did anything alert when the test Inbox rule was created? If no detection fired during that test, that is a finding: BEC persistence can be established without triggering a single alert.
|
||||||
|
|
||||||
|
**Simulation 7 — Mass download from SharePoint:**
|
||||||
|
With a test account that has access to a document library, download 50+ files in rapid succession. Does Defender for Cloud Apps or Microsoft Purview generate an unusual-download alert? Does anything block or throttle it?
|
||||||
|
|
||||||
|
**Simulation 8 — OAuth consent grant:**
|
||||||
|
Register a test app requesting `Mail.Read` and `Files.ReadWrite.All` permissions. Grant it on behalf of a test user (simulating a user who clicks "Accept" on a consent prompt). Does anything alert on the grant event? Is user consent for this class of permission blocked by policy, or can users grant it freely?
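To keep the eight results comparable, each simulation can be recorded in the same shape: when the event happened, when a human acknowledged it, and the SLA the client declared. The sketch below is illustrative only; the class, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SimulationResult:
    name: str
    event_at: datetime
    acknowledged_at: Optional[datetime]   # None if no human ever acknowledged
    declared_sla_minutes: float           # the SLA the client believes they have

    def observed_minutes(self) -> Optional[float]:
        if self.acknowledged_at is None:
            return None
        return (self.acknowledged_at - self.event_at).total_seconds() / 60

    def verdict(self) -> str:
        observed = self.observed_minutes()
        if observed is None:
            return "never acknowledged"
        if observed <= self.declared_sla_minutes:
            return "within declared SLA"
        return "exceeds declared SLA"


# Hypothetical measurements.
results = [
    SimulationResult("Break-glass sign-in", datetime(2026, 6, 3, 14, 0),
                     datetime(2026, 6, 3, 14, 42), 15),
    SimulationResult("OAuth consent grant", datetime(2026, 6, 3, 15, 0), None, 60),
]
for r in results:
    print(f"{r.name}: observed={r.observed_minutes()} min, {r.verdict()}")
```

Recording "never acknowledged" as its own outcome keeps a silent detection from being mistaken for a slow one.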
|
||||||
|
|
||||||
|
### Alert fatigue: measure it honestly
|
||||||
|
|
||||||
|
Pull the alert volume from the last 30 days (from Sentinel, Defender XDR, or wherever alerts are collected). Calculate:
|
||||||
|
|
||||||
|
- Total alerts generated
|
||||||
|
- Alerts closed as "true positive" with a documented response
|
||||||
|
- Alerts closed as "false positive"
|
||||||
|
- Alerts that have sat open for more than 48 hours
|
||||||
|
- Alerts that were suppressed or auto-closed without human review
|
||||||
|
|
||||||
|
The ratio of responded-to versus everything else is the real detection efficacy rate. Most mature clients discover that their effective detection rate is a single-digit percentage of generated alerts. Present the number; it is a more honest metric than "we have Sentinel."
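The arithmetic is simple enough to sketch. The function and key names are hypothetical, the sample counts are invented, and the five counts are assumed to be disjoint categories pulled from the alert platform.

```python
def detection_efficacy(total, responded_true_positive, false_positive,
                       open_over_48h, auto_closed):
    """Express each alert outcome as a percentage of all generated alerts."""
    if total <= 0:
        raise ValueError("total alerts must be positive")

    def pct(count):
        return round(100 * count / total, 1)

    return {
        "responded_pct": pct(responded_true_positive),
        "false_positive_pct": pct(false_positive),
        # Alerts no human looked at in time: stale plus auto-closed.
        "unreviewed_pct": pct(open_over_48h + auto_closed),
    }


# Hypothetical 30-day volumes.
print(detection_efficacy(total=1200, responded_true_positive=60,
                         false_positive=700, open_over_48h=140, auto_closed=300))
```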
|
||||||
|
|
||||||
|
### The structural change test
|
||||||
|
|
||||||
|
Pull the last five security incidents or alerts that resulted in a closed ticket. For each:
|
||||||
|
|
||||||
|
- What was the incident?
|
||||||
|
- What was the response?
|
||||||
|
- What structural change resulted — what was removed, severed, restricted, or reconfigured because of this incident?
|
||||||
|
|
||||||
|
If the answer to the third question is "we sent a reminder," "we noted it in the risk register," or "we trained the affected user" — the feedback loop is broken. Pain that closes a ticket without changing the architecture is wasted pain. Present the count of structural changes from the last five incidents. If it is zero, that is the most important finding in the report.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Recovery — is the exit ramp real
|
||||||
|
|
||||||
|
### Restore something
|
||||||
|
|
||||||
|
Before the engagement closes, restore a real dataset from backup. Not a test restore of a test file — a production dataset (authorized, scoped, non-disruptive) or the clearest approximation the client can authorize.
|
||||||
|
|
||||||
|
Time it. Record the actual MTTR. Compare it to the RTO written in the policy document.
|
||||||
|
|
||||||
|
If the actual MTTR is longer than the RTO in the policy, the policy is fiction. Present the observed time as the finding. The goal is not to shame the recovery team; it is to replace a comfortable fiction with a useful truth.
|
||||||
|
|
||||||
|
**For M365 specifically:** Restore a mailbox or a SharePoint document library item from the third-party backup (if one exists). If no third-party backup exists in a mature estate, that is a P0 — it means the client has delegated recovery to Microsoft's recycle bin, which is not a backup posture.
|
||||||
|
|
||||||
|
### AD forest recovery readiness
|
||||||
|
|
||||||
|
Ask the client to produce their AD forest recovery runbook. Three things to verify:
|
||||||
|
|
||||||
|
1. **Is the runbook stored where it can be accessed when AD is down?** Not in SharePoint. Not in an AD-authenticated file share. Not in a password manager that authenticates against the domain. Paper, or a system outside the recovery domain, or both.
|
||||||
|
2. **Has anyone ever run the procedure?** Not a tabletop — an actual restore, even in a lab. The first time you perform AD forest recovery must not be during the real disaster.
|
||||||
|
3. **Does the runbook account for the double-KRBTGT rotation, metadata cleanup, and trust resets?** If it says "restore the DC from backup and you're done," it is incomplete.
|
||||||
|
|
||||||
|
If the answer to question 2 is no, scope a recovery rehearsal. This is the finding: the organization is one ransomware incident away from performing the hardest IT operation in existence for the first time, under maximum pressure, with incomplete runbooks.
|
||||||
|
|
||||||
|
### Configuration drift from the known-good
|
||||||
|
|
||||||
|
Compare the CA policy export from the beginning of this engagement against the current state. In any mature estate where CA policies are managed by multiple people without change control, there will be differences. For each difference:
|
||||||
|
|
||||||
|
- Was it intentional? Is there a change record?
|
||||||
|
- Does the difference make the policy more or less restrictive?
|
||||||
|
- If a policy was modified by someone without change authorization, how long ago and how would it have been detected?
|
||||||
|
|
||||||
|
The absence of a known-good baseline means the client cannot answer these questions. The presence of a known-good baseline and a diff is the beginning of drift detection. If the diff reveals changes made outside the change window or without documentation, that is a control failure independent of whether the change was malicious.
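A minimal drift check over two exports might look like the sketch below. It assumes each export is a list of policy objects carrying an `id` field and top-level properties in the Microsoft Graph style (`displayName`, `state`, `conditions`, and so on). The actual CAExporter output may be shaped differently, so treat the field names as assumptions.

```python
IGNORED_KEYS = {"modifiedDateTime"}  # changes on every edit, so it is not drift by itself


def diff_policies(baseline, current):
    """Report policies added, removed, or changed between two exports."""
    base = {p["id"]: p for p in baseline}
    cur = {p["id"]: p for p in current}

    changed = {}
    for policy_id in base.keys() & cur.keys():
        keys = sorted(
            k for k in (base[policy_id].keys() | cur[policy_id].keys()) - IGNORED_KEYS
            if base[policy_id].get(k) != cur[policy_id].get(k)
        )
        if keys:
            changed[policy_id] = keys

    return {
        "added": sorted(cur.keys() - base.keys()),
        "removed": sorted(base.keys() - cur.keys()),
        "changed": changed,
    }


# Hypothetical exports: one policy silently switched to report-only.
baseline = [{"id": "1", "displayName": "Block legacy auth", "state": "enabled"}]
current = [{"id": "1", "displayName": "Block legacy auth",
            "state": "enabledForReportingButNotEnforced"}]
print(diff_policies(baseline, current))
```

Each entry under `changed` is a prompt for the three questions above: was it intentional, did it loosen the policy, and who would have noticed.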
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## The close
|
||||||
|
|
||||||
|
### What changes structurally
|
||||||
|
|
||||||
|
At the end of this engagement, for every finding that was verified by observation (not just inspected), produce a specific structural change:
|
||||||
|
|
||||||
|
| Finding type | Structural change target |
|---|---|
| Ghost CA policy found | Policy recreated, re-tested, documented |
| PIM activation MFA is push-approve | Migration to phishing-resistant MFA scoped |
| Kerberoasting not detected | Detection rule created, tested end-to-end |
| Standing GA outside PIM | Account removed from role; break-glass confirmed working |
| Anonymous links not revoked | Links enumerated and revoked; expiration policy applied |
| BEC rule creation not detected | Exchange alert configured, tested |
| Alert queue not triaged | Alert owner named, SLA defined, volume reduced |
| Backup MTTR exceeds policy | Policy updated to observed time; rehearsal scheduled |
|
||||||
|
|
||||||
|
The engagement deliverable is not the report. The deliverable is the list of structural changes, plus the metrics: BloodHound path count before and after, standing privilege account count before and after, confirmed-working detection count, and measured MTTR.
|
||||||
|
|
||||||
|
### Metrics to deliver at close
|
||||||
|
|
||||||
|
| Metric | Before | After |
|--------|--------|-------|
| BloodHound paths to Domain Admin (from standard user) | | |
| Standing (non-break-glass) Global Admin count | | |
| Standing (non-break-glass) Domain Admin count | | |
| CA policies verified to enforce by observation | | |
| Detection signals tested end-to-end and confirmed working | | |
| Anonymous link count (existing) | | |
| Unmanaged devices in sign-in logs (% of total) | | |
| Actual MTTR from backup restore drill | | |
| Structural changes from last 5 incidents (before) | | |
|
||||||
|
|
||||||
|
These numbers are the honest alternative to a compliance score. None of them can be faked by clicking a toggle. All of them represent something an attacker either can or cannot do.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. The leave-behind
|
||||||
|
|
||||||
|
The engagement ends. The admin has to operate the estate alone until the next engagement. This section is what you set up during the engagement so they can do that.
|
||||||
|
|
||||||
|
### The self-service cadence document
|
||||||
|
|
||||||
|
Every adversarial validation engagement closes with a filled-in [Self-Service Cadence](../assessment-templates/self-service-cadence.md) document, customized for the client. The template becomes their recurring runbook — monthly portal checks, quarterly tool runs, and a clear list of "call us if you see this" triggers.
|
||||||
|
|
||||||
|
Spend the last session of the engagement walking through the document with the named admin. Run the first quarterly check together, with them driving. The goal is not to hand over a PDF — it is to verify they can execute it without you in the room.
|
||||||
|
|
||||||
|
### Tools to leave installed and working
|
||||||
|
|
||||||
|
Before you leave, confirm these are installed and the admin has run each at least once:
|
||||||
|
|
||||||
|
| Tool | Confirm working | Leave-behind |
|------|----------------|--------------|
| PingCastle | Run a healthcheck scan, admin can read the output | HTML report from today as the baseline |
| Purple Knight | Run a full scan, admin can read the indicators | PDF report from today as the baseline |
| CAExporter | Exported today's CA policies, stored in agreed location | JSON files from today as the known-good |
| Graph PowerShell module | Admin can connect and run the scripts in the cadence document | Scripts saved to the agreed local path |
| PnP PowerShell | Admin can connect to SharePoint admin and run the anonymous link export | Confirmed connected during the session |
|
||||||
|
|
||||||
|
Do not leave a tool installed that the admin has never run. An unfamiliar tool is not a capability — it is a task that will not get done.
|
||||||
|
|
||||||
|
### The baseline numbers
|
||||||
|
|
||||||
|
At close of engagement, record the opening and closing metrics in the tracking spreadsheet you set up with the admin. These are the numbers their quarterly PingCastle and Purple Knight runs will be compared against. Without a baseline, a quarterly scan is a point in time with no direction — with a baseline, it tells a story.
|
||||||
|
|
||||||
|
| Metric | Value at close of engagement |
|--------|------------------------------|
| PingCastle score | |
| Purple Knight: Critical indicators | |
| BloodHound paths to DA (standard user) | |
| Standing GA count (non-break-glass) | |
| Anonymous link count | |
| Stale guest count (90+ days inactive) | |
| CA policies verified to enforce | |
| Detection signals confirmed working | |
|
||||||
|
|
||||||
|
### "Call us" triggers — agree them explicitly
|
||||||
|
|
||||||
|
From the [cadence document](../assessment-templates/self-service-cadence.md), go through the trigger list out loud with the admin and confirm they understand each one. The list exists so they do not have to judge whether something is important enough to contact you — the bar is already defined.
|
||||||
|
|
||||||
|
The most important part of this conversation: *"When in doubt, contact us. We would rather look at a false alarm than hear about a real incident that sat for two weeks because you were not sure if it was worth mentioning."*
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What this engagement is not
|
||||||
|
|
||||||
|
**Not a red team.** The client knows you are here. You are working with them, not against them. When a simulation fires an alert, you tell the responder it is a test. The goal is to calibrate the detection, not to prove that you can evade it.
|
||||||
|
|
||||||
|
**Not a vulnerability scan.** You are not looking for unpatched CVEs or misconfigured services in bulk. You are validating the specific controls the client believes are in place.
|
||||||
|
|
||||||
|
**Not a compliance audit.** You will not produce a CIS score or a NIST gap report at the end. You will produce a list of controls that work and a list of controls that do not, measured by observation, with structural changes attached to the ones that do not.
|
||||||
|
|
||||||
|
**Not additive.** You are not recommending new tools, new policies, or new products. If something does not work, the fix is almost always to remove the exception, test the existing control, or eliminate the coupling — not to add a compensating control on top of the broken one.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Field Guide — Adversarial Validation. Updated June 2026. Review alongside the main field guide — January 2027.*
|
||||||