# M365 + AD Field Guide — 2026 Edition
> *The books are principles. This is practice — concrete actions, current tooling, and 2026-specific decisions. It will need updating next year. That is the point.*
**Last updated:** June 2026
**Companion to:** The Antifragile Handbook for M365 & AD (Books I–VI)
**Next review:** January 2027
---
## What this is
The Antifragile Handbook teaches judgement. This document teaches actions — what to do, in 2026, with the tooling that exists now, in the estates you will actually walk into. Where the handbook says "eliminate AD FS," this document says how and what blockers to expect. Where the handbook says "test the CA policy," this document says what a ghost policy looks like when you find one.
Read the books first. Use this document on-site.
---
## Notation
**P0** — attacker already through; fix before leaving this session
**P1** — closes in this engagement
**P2** — roadmap item, documented
**2026 note** — something that has changed or become clearer since the handbook was written
---
## 1. Hybrid Identity
### Remove AD FS — this is now a P0 conversation
In 2026, Microsoft's migration tooling has matured to the point where AD FS is a choice, not an inevitability. Every client still running it should have a migration plan or a written, named reason for not having one.
**Why it is a P0:** Golden SAML is still an active nation-state technique. In most AD FS deployments the token-signing private key has never been rotated, lives in the AD FS configuration (protected by a key stored in AD), and is not monitored. An attacker who gains administrative access to an AD FS server, or who can read that protection key from AD, can extract the signing key and forge tokens for any federated user. That ends cloud identity silently: validly signed tokens, no failed logins, nothing for a SIEM to catch.
**What to do:**
- In the Entra admin center, open Usage & insights > AD FS application activity (it only appears when the Connect Health agent for AD FS is installed). This gives you the relying party trust inventory and migration readiness per application. This is your conversation starter.
- Enumerate relying party trusts: `Get-AdfsRelyingPartyTrust | Select-Object Name, Enabled, Identifier`. Each enabled one is a blocker that needs a cloud equivalent or decommission plan.
- Check the token-signing cert: `Get-AdfsCertificate -CertificateType Token-Signing`. Note the NotAfter date and when it was last rotated. "Has not been rotated since installation" is the expected answer and is itself a finding.
- Staged rollout in Entra lets you migrate users incrementally — you do not have to cut over all at once. Use it.
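A minimal sketch of the token-signing check as a repeatable finding, assuming each certificate's `NotBefore` date is a fair proxy for when the key was last rotated. It reads the `Certificate`, `IsPrimary`, and `Thumbprint` properties that `Get-AdfsCertificate` returns; the function name and the 365-day threshold are illustrative choices, not AD FS settings.

```powershell
function Get-TokenSigningCertReport {
    # Summarises AD FS token-signing certificates from objects shaped like
    # Get-AdfsCertificate output (Certificate, IsPrimary, Thumbprint).
    # Assumption: Certificate.NotBefore approximates the last key rotation.
    param(
        [Parameter(Mandatory)] [object[]] $Certificates,
        [int] $MaxAgeDays = 365,
        [datetime] $Now = (Get-Date)
    )
    foreach ($c in $Certificates) {
        $ageDays = [int][math]::Floor(($Now - $c.Certificate.NotBefore).TotalDays)
        [PSCustomObject]@{
            Thumbprint = $c.Thumbprint
            IsPrimary  = $c.IsPrimary
            NotBefore  = $c.Certificate.NotBefore
            NotAfter   = $c.Certificate.NotAfter
            AgeDays    = $ageDays
            Finding    = $ageDays -gt $MaxAgeDays   # older than the threshold = treat as unrotated
        }
    }
}
```

On an AD FS server: `Get-TokenSigningCertReport -Certificates (Get-AdfsCertificate -CertificateType Token-Signing)`. A primary certificate with `Finding` set to `True` is the "not rotated since installation" finding, with the age in days to quote.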
**Migration target:** Password Hash Sync (PHS) + Entra-managed MFA via Conditional Access. This removes the on-prem dependency for cloud authentication and kills Golden SAML as a class.
**2026 note:** The AD FS migration activity report and staged rollout tooling make this significantly more tractable than it was in 2023–2024. Remove the roadmap language and have the P0 conversation.
---
### Connect Sync vs Cloud Sync — new deployments
**2026 recommendation:** For new hybrid sync deployments and organizations without complex topologies (no device writeback, no large object filtering requirements, no multi-forest writeback scenarios), **Entra Cloud Sync** is the preferred deployment. It has a smaller attack surface than Connect Sync (no SQL Express, no full sync engine, multiple lightweight agents for HA) and is easier to harden. Its agents run under a gMSA rather than a service account with a static password, but with password hash sync enabled that gMSA still holds replication rights, so the agent hosts remain Tier 0.
**Connect Sync stays correct for:** Large/complex topologies, specific writeback scenarios (check the current parity matrix at Microsoft Learn before promising Cloud Sync covers a client's requirements — this changes).
**For existing Connect Sync deployments:** The migration path to Cloud Sync exists. Check current documentation for topology compatibility. Do not promise the migration before confirming the client's scenario is supported.
**In either case, the sync server is Tier 0.** See the hardening actions below.
---
### Sync server hardening — concrete actions
The sync server (Connect or Cloud Sync agent host) is typically treated as a utility VM. It holds an identity capable of DCSync. Treat it accordingly.
**Immediate checks:**
- Is the server domain-joined to the production domain? If yes, its blast radius is one hop from any Tier 1 or Tier 2 compromise. Ideal: join it to a dedicated Tier 0 or management forest, or isolate it behind jump-box access only.
- Which account holds the replication permissions, and is it dedicated? For Connect Sync, the AD DS Connector account (created as `MSOL_<id>` by express settings) needs `Replicate Directory Changes` and `Replicate Directory Changes All`; for Cloud Sync, the agent's gMSA holds them. Confirm it is a dedicated service identity, not a human admin account that doubled up.
- Is the server patched? Check `Get-HotFix | Sort-Object InstalledOn -Descending | Select-Object -First 5`. If nothing was installed in the last 60 days, that is a finding.
- Is the Entra connector account (Directory Synchronization Accounts role) monitored? Any sign-in from any host other than the sync server should alert immediately.
- Are local administrators on the sync server documented and minimal?
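The 60-day patch rule can be applied to `Get-HotFix` output with a small helper. A sketch: `Test-RecentPatch` is a hypothetical name, and `Get-HotFix` only reports what Win32_QuickFixEngineering records, so treat `False` as "investigate", not proof the server is unpatched.

```powershell
function Test-RecentPatch {
    # Returns $true when at least one entry in Get-HotFix-style output (HotFixID, InstalledOn)
    # was installed within the last $Days days. Entries with no InstalledOn date are ignored.
    param(
        [Parameter(Mandatory)] [AllowEmptyCollection()] [object[]] $HotFix,
        [int] $Days = 60,
        [datetime] $Now = (Get-Date)
    )
    $cutoff = $Now.AddDays(-$Days)
    $recent = @($HotFix | Where-Object { $_.InstalledOn -and $_.InstalledOn -ge $cutoff })
    return $recent.Count -gt 0
}
```

Run `Test-RecentPatch -HotFix (Get-HotFix)` on the sync server; a `False` goes in the findings list.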
---
### Cloud-only Global Admins — enforce it on day one
**P0 if not in place.** Synced accounts holding Global Admin are the most common single finding across all engagements and the most direct path from a ransomwared on-prem AD to cloud dominance.
**Find the synced GAs:**
```powershell
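# Connect-MgGraph -Scopes "Directory.Read.All"
# Active GA members only; PIM-eligible GAs also need Get-MgRoleManagementDirectoryRoleEligibilitySchedule
$gaRole = Get-MgDirectoryRole -Filter "displayName eq 'Global Administrator'"
Get-MgDirectoryRoleMember -DirectoryRoleId $gaRole.Id -All |
    Where-Object { $_.AdditionalProperties['@odata.type'] -eq '#microsoft.graph.user' } |
    ForEach-Object { Get-MgUser -UserId $_.Id -Property UserPrincipalName, OnPremisesSyncEnabled } |
    Where-Object { $_.OnPremisesSyncEnabled -eq $true } |
    Select-Object UserPrincipalName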
```
Every result is a synced account. Every synced account in GA is a P0.
**Remediation path:**
1. Create a new cloud-only account (`user@tenant.onmicrosoft.com` format), assign GA, configure phishing-resistant MFA.
2. Validate the new account works — sign in, confirm PIM activation if PIM is in place.
3. Remove GA from the synced account.
4. Add a Conditional Access policy that targets the Global Administrator role and blocks sign-in for everyone except the named cloud-only admins and break-glass accounts (belt-and-suspenders: a synced account that is later granted GA cannot use it).
---
### Seamless SSO key — rotate it
`AZUREADSSOACC` was created when Seamless SSO was enabled and is almost certainly unrotated. Its Kerberos key is a silver-ticket / cloud token-forging exposure if on-prem AD is compromised.
**Check last password set:**
```powershell
Get-ADComputer AZUREADSSOACC -Properties PasswordLastSet | Select-Object PasswordLastSet
```
If this matches the approximate date Seamless SSO was enabled (often close to the Microsoft 365 tenant's go-live), the key has never been rotated.
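**Rotate it:** Use `Update-AzureADSSOForest` from the `AzureADSSO.psd1` module in the Entra Connect installation folder. Run it once per AD forest: unlike KRBTGT, running it twice in quick succession breaks Seamless SSO until users' existing Kerberos tickets expire. Microsoft recommends rolling this key at least every 30 days. If Seamless SSO is not needed (Entra join and modern auth only), disable the feature in Entra Connect and remove `AZUREADSSOACC` entirely.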
---
### Writebacks — name and own each one
Enumerate which writebacks are enabled (password writeback, group writeback, device writeback) in Connect Sync or Cloud Sync configuration. For each:
- Who owns the decision to have it enabled?
- What does an attacker reach if the cloud side is compromised — can they write into on-prem AD?
- Is the reverse blast radius documented?
Password writeback is usually justified (SSPR usability). Group writeback creates a two-way channel between cloud security groups and on-prem AD — the blast radius should be explicit. If there is no current owner or justification for a writeback, disable it.
---
## 2. Privileged Access
### PIM: table stakes in 2026
If the client has Entra ID P2 (included in Microsoft 365 E5 and available as an add-on; Business Premium includes P1, not P2) and is not using PIM for Entra administrative roles, that is a P0. There is no acceptable reason in 2026 for standing Global Admin, Privileged Role Administrator, Security Administrator, or Exchange Administrator assignments when PIM provides JIT elevation.
**What to confirm during engagement:**
- Global Admin: eligible only, not active. Any active (permanent) GA assignment that is not a break-glass account is a finding.
- Privileged Role Administrator: requires approval workflow on activation, not just MFA. This role controls who becomes admin — it should require a second human to approve.
- Security Administrator and Exchange Administrator: eligible, MFA on activation, justified time box (8 hours maximum for a working day).
- PIM activation requires phishing-resistant MFA. If it accepts push-approve, it is phishable.
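A sketch for checking the first bullet against a PIM export: it flags permanent active assignments that are not break-glass accounts. The input shape (`RoleName`, `PrincipalName`, `AssignmentType`, `EndDateTime`) and the `Find-StandingAdminAssignment` name are assumptions; map whatever your export produces onto those properties first.

```powershell
function Find-StandingAdminAssignment {
    # Returns permanent active assignments (no end date) to the listed roles,
    # excluding named break-glass accounts. Input shape is an assumption:
    # RoleName, PrincipalName, AssignmentType ('Active' or 'Eligible'), EndDateTime.
    param(
        [Parameter(Mandatory)] [object[]] $Assignments,
        [string[]] $Roles = @('Global Administrator'),
        [string[]] $BreakGlass = @()
    )
    $Assignments | Where-Object {
        $_.RoleName -in $Roles -and
        $_.AssignmentType -eq 'Active' -and
        $null -eq $_.EndDateTime -and
        $_.PrincipalName -notin $BreakGlass
    }
}
```

Pass `-Roles` with the other privileged roles from this section to extend the check beyond Global Admin.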
**2026 note:** PIM now supports custom role definitions. If a client is assigning built-in broad roles (like Global Admin) to do a narrow task, check whether a custom role or a more scoped built-in (e.g., Intune Administrator instead of Global Admin) applies.
---
### Service principals: the 2026 audit
Service principals hold more standing privilege in most tenants than all human admins combined. They cannot do MFA. They are almost never reviewed. This is the dark matter of privileged access.
**Escalation-grade Graph permissions — find every app holding these in 2026:**
- `RoleManagement.ReadWrite.Directory` — can grant any Entra role
- `AppRoleAssignment.ReadWrite.All` — can assign any app role, including to itself
- `Application.ReadWrite.All` — can modify any application and create new ones
- `Directory.ReadWrite.All` — broad directory write
- Any API permission scoped `Full` or ending in `.ReadWrite.All` for sensitive services
```powershell
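# Find service principals holding dangerous Microsoft Graph application permissions
$graphSp = Get-MgServicePrincipal -Filter "appId eq '00000003-0000-0000-c000-000000000000'"
$dangerous = 'RoleManagement.ReadWrite.Directory', 'AppRoleAssignment.ReadWrite.All',
             'Application.ReadWrite.All', 'Directory.ReadWrite.All'
$roles = $graphSp.AppRoles | Where-Object { $_.Value -in $dangerous }
Get-MgServicePrincipal -All | ForEach-Object {
    $sp = $_
    Get-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $sp.Id -All |
        Where-Object { $_.ResourceId -eq $graphSp.Id -and $_.AppRoleId -in $roles.Id } |
        ForEach-Object { [PSCustomObject]@{ App = $sp.DisplayName; Permission = ($roles | Where-Object Id -eq $_.AppRoleId).Value } }
}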
```
For every hit: who created this app registration, when, is the permission still needed, is there an expiring secret or certificate, and can it be replaced with a managed identity?
**Long-lived secrets: find them.** In the Entra portal > App registrations > All applications, sort by certificate and secret expiry and look for credentials with multi-year or far-future end dates. Every one is a standing credential with no forced rotation.
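Sorting in the portal works for a handful of apps. For a whole tenant, a sketch like this flags secrets and certificates whose total lifetime exceeds a threshold. It assumes objects shaped like `Get-MgApplication` output (`PasswordCredentials` and `KeyCredentials`, each with `StartDateTime` and `EndDateTime`); `Find-LongLivedAppCredential` and the 365-day threshold are illustrative.

```powershell
function Find-LongLivedAppCredential {
    # Flags client secrets and certificates whose total lifetime exceeds $MaxDays.
    # Expects objects shaped like Get-MgApplication output: DisplayName, AppId,
    # PasswordCredentials and KeyCredentials (each with StartDateTime and EndDateTime).
    param(
        [Parameter(Mandatory)] [object[]] $Applications,
        [int] $MaxDays = 365
    )
    foreach ($app in $Applications) {
        $creds = @()
        foreach ($p in $app.PasswordCredentials) { $creds += [PSCustomObject]@{ Type = 'Secret'; Cred = $p } }
        foreach ($k in $app.KeyCredentials)      { $creds += [PSCustomObject]@{ Type = 'Certificate'; Cred = $k } }
        foreach ($c in $creds) {
            # Skip credentials with missing dates rather than guessing a lifetime.
            if (-not $c.Cred.StartDateTime -or -not $c.Cred.EndDateTime) { continue }
            $lifetimeDays = ($c.Cred.EndDateTime - $c.Cred.StartDateTime).TotalDays
            if ($lifetimeDays -gt $MaxDays) {
                [PSCustomObject]@{
                    App          = $app.DisplayName
                    AppId        = $app.AppId
                    Type         = $c.Type
                    EndDateTime  = $c.Cred.EndDateTime
                    LifetimeDays = [int][math]::Round($lifetimeDays)
                }
            }
        }
    }
}
```

After connecting with `Application.Read.All`, run `Find-LongLivedAppCredential -Applications (Get-MgApplication -All)`.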
---
### On-prem service accounts: gMSA yes, dMSA wait
**gMSA (Group Managed Service Accounts):** The right answer for on-prem service accounts in 2026. Automatic password rotation (no static secret), not Kerberoastable in the traditional sense, natively supported across Windows Server 2012+. If a client has regular service accounts with static passwords (especially if those passwords are 2+ years old), migrate to gMSA.
**Kerberoasting check (run this, not just ask about it):**
```powershell
# Find accounts with SPNs and static passwords
Get-ADUser -Filter 'ServicePrincipalName -like "*"' -Properties ServicePrincipalName, PasswordLastSet, Enabled |
Where-Object {$_.Enabled -eq $true} |
Select-Object Name, PasswordLastSet, ServicePrincipalName
```
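Every result is Kerberoastable: any domain user can request a service ticket for it and attack the password offline. Any result with a `PasswordLastSet` older than 1 year is a P0: a static password that old has had unlimited exposure to offline cracking.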
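A sketch that applies the one-year rule to the query results above. `Get-KerberoastPriority` is a hypothetical helper; accounts inside the threshold are marked `Review` rather than given a priority, since the rule above only defines P0, and an empty `PasswordLastSet` is treated as never changed.

```powershell
function Get-KerberoastPriority {
    # Every enabled account with an SPN is Kerberoastable. A password older than
    # $MaxAgeDays (or never set) is P0; newer ones are marked 'Review'.
    param(
        [Parameter(Mandatory)] [object[]] $Accounts,
        [int] $MaxAgeDays = 365,
        [datetime] $Now = (Get-Date)
    )
    foreach ($a in $Accounts) {
        $priority = 'P0'
        if ($a.PasswordLastSet -and ($Now - $a.PasswordLastSet).TotalDays -le $MaxAgeDays) {
            $priority = 'Review'
        }
        [PSCustomObject]@{
            Name                 = $a.Name
            PasswordLastSet      = $a.PasswordLastSet
            ServicePrincipalName = $a.ServicePrincipalName -join '; '
            Priority             = $priority
        }
    }
}
```

Capture the query above in a variable, for example `$spnAccounts`, then run `Get-KerberoastPriority -Accounts $spnAccounts | Where-Object Priority -eq 'P0'`.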
**dMSA (Delegated Managed Service Accounts):** Introduced in Windows Server 2025, targeting the migration path from standing service accounts. Do not recommend dMSA in 2026: there is published privilege-escalation research against the migration path. Use gMSA until the specific vulnerabilities are patched and the client's environment is confirmed current. Check current Microsoft advisories at engagement time.
---
### LAPS: Windows LAPS deployment in 2026
**Legacy Microsoft LAPS** (the separately-downloaded agent) should be migrated to **Windows LAPS**, the built-in solution available in Windows 10 22H2 / Windows 11 22H2 and Windows Server 2019+ with April 2023 updates or later.
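Windows LAPS can back passwords up to AD or to Entra ID (for Entra-joined and hybrid-joined devices), but each device backs up to one directory, not both. For hybrid estates, choose per device population and document which directory holds which devices. Manage via Intune (cloud-joined) or GPO (domain-joined).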
**Coverage check:**
```powershell
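# Enabled computers with no LAPS expiration time in AD (legacy or Windows LAPS) = not managed via AD.
# Expiration attributes are checked because the password attributes are confidential, and
# Windows LAPS writes msLAPS-EncryptedPassword instead of msLAPS-Password when encryption is on.
# Devices that back up to Entra ID instead will also appear here; check them in Intune.
Get-ADComputer -Filter * -Properties Enabled, 'ms-Mcs-AdmPwdExpirationTime', 'msLAPS-PasswordExpirationTime' |
    Where-Object { $_.Enabled -and -not $_.'ms-Mcs-AdmPwdExpirationTime' -and -not $_.'msLAPS-PasswordExpirationTime' } |
    Select-Object Name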
```
Every result is a computer whose local admin password is not managed through AD (unless it backs up to Entra ID instead): a shared or unknown local admin password, and a lateral movement risk.
---
### KRBTGT rotation
Check password age. 365+ days without rotation is a P1. No documented rotation since domain creation (common when the domain is 5–10 years old) is a P0 for any high-sensitivity engagement.
```powershell
Get-ADUser krbtgt -Properties PasswordLastSet | Select-Object PasswordLastSet
```
Rotation procedure: rotate once, wait at least the maximum ticket lifetime (default 10 hours) and confirm AD replication has converged, then rotate again. Document both rotation timestamps. After rotation, monitor for authentication failures caused by cached golden tickets: if detections fire, that was a real golden ticket, not a drill finding.
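Writing the timing rule down lets the second rotation be scheduled rather than guessed. A sketch: `Get-KrbtgtSecondRotationTime` is hypothetical, the 10-hour default mirrors the default maximum ticket lifetime above, and the 2-hour replication buffer is an assumption to replace with what the client's replication topology actually needs.

```powershell
function Get-KrbtgtSecondRotationTime {
    # Earliest time for the second KRBTGT rotation: first rotation plus the maximum
    # ticket lifetime plus a replication buffer (the buffer default is an assumption).
    param(
        [Parameter(Mandatory)] [datetime] $FirstRotation,
        [double] $MaxTicketLifetimeHours = 10,
        [double] $ReplicationBufferHours = 2
    )
    return $FirstRotation.AddHours($MaxTicketLifetimeHours + $ReplicationBufferHours)
}
```

Read the real maximum user ticket lifetime from the domain's Kerberos policy before relying on the default; a client that has raised it needs a longer wait.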
---
### ADCS: treat it as Tier 0
If the client has Active Directory Certificate Services deployed (almost all do if they have a domain older than 7 years), run a basic ESC vulnerability check. The ESC1–ESC8 misconfigurations are well documented, exploitable with freely available tools, and almost never remediated, because most organizations do not know they have ADCS issues.
**Quick check:**
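- Is ADCS installed? List enterprise CAs from AD instead of checking servers one by one: `Get-ADObject -SearchBase "CN=Enrollment Services,CN=Public Key Services,CN=Services,$((Get-ADRootDSE).configurationNamingContext)" -Filter * -Properties dNSHostName`. On a candidate server, `Get-WindowsFeature ADCS-Cert-Authority` confirms the role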
- Is any published template set to "Supply in the request" for the subject name, with a client-authentication EKU, no manager approval, and broad enrollment rights? That is ESC1.
- Certipy (open source) or Certify: run in read-only enumeration mode (`certipy find`) to identify vulnerable templates
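Before or alongside Certipy, the template-side ESC1 conditions can be checked straight from the template attributes in AD. A sketch under stated limits: `Test-Esc1Template` is hypothetical; it reads `msPKI-Certificate-Name-Flag` (enrollee supplies subject), `msPKI-Enrollment-Flag` (manager approval), `msPKI-RA-Signature` (authorized signatures), and `pKIExtendedKeyUsage`; and it does not evaluate enrollment rights or whether a CA publishes the template. Those two checks still decide whether a match is exploitable.

```powershell
function Test-Esc1Template {
    # Template-side ESC1 conditions only; enrollment rights and publication are NOT checked.
    # Expects an object carrying the template's AD attributes (as Get-ADObject returns them).
    param([Parameter(Mandatory)] [object] $Template)
    $authEkus = @(
        '1.3.6.1.5.5.7.3.2',       # Client Authentication
        '1.3.6.1.5.2.3.4',         # PKINIT Client Authentication
        '1.3.6.1.4.1.311.20.2.2',  # Smart Card Logon
        '2.5.29.37.0'              # Any Purpose
    )
    $ekus = @($Template.pKIExtendedKeyUsage | Where-Object { $_ })
    $suppliesSubject = ([int]$Template.'msPKI-Certificate-Name-Flag' -band 0x1) -ne 0   # CT_FLAG_ENROLLEE_SUPPLIES_SUBJECT
    $managerApproval = ([int]$Template.'msPKI-Enrollment-Flag' -band 0x2) -ne 0         # CT_FLAG_PEND_ALL_REQUESTS
    $needsSignature  = [int]$Template.'msPKI-RA-Signature' -gt 0                         # authorized signatures required
    # No EKU at all means the certificate is usable for any purpose, including authentication.
    $authCapable = ($ekus.Count -eq 0) -or (@($ekus | Where-Object { $_ -in $authEkus }).Count -gt 0)
    return $suppliesSubject -and $authCapable -and -not $managerApproval -and -not $needsSignature
}
```

Templates live under `CN=Certificate Templates,CN=Public Key Services,CN=Services` in the configuration partition. Fetch them with `Get-ADObject`, requesting the four attributes with `-Properties`, then pass each object in.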
ADCS is Tier 0, and so is whatever server hosts it: that server needs the same access controls as a domain controller. Verify it is not on a Tier 1 or Tier 2 server.
---
### Admin workstations — the cloud VM is the deployable PAW
Physical PAWs are right in principle and almost never get deployed. Hardware procurement, second device, behaviour change — the project does not survive contact with a real IT budget. Do not open the conversation with "you need a dedicated PAW laptop." Open it with the cloud admin VM.
**The cloud admin VM:** a Windows 365 or Azure Virtual Desktop instance provisioned from a hardened template. The admin connects from their normal device via browser or RDP. Privileged credentials, including the management overlay credentials (the Tailscale node's WireGuard key, the Nebula node certificate and key), live in the cloud VM, not on the admin's local device. Compromise response: wipe it, reprovision from template in under 20 minutes.
**Provisioning the cloud admin VM:**
1. Create a Windows 365 or AVD instance from a hardened base image (CIS L2 baseline or equivalent)
2. Enrol in Intune, apply a configuration profile: no internet browsing, no personal email, no Microsoft Store apps, screen lock on idle, BitLocker enforced
3. Scope a CA policy restricting Global Admin and privileged role activation to this device (require a compliant device, plus a filter for devices that matches only the admin VMs, for example on an extension attribute)
4. Install the Nebula client (if deploying T0 overlay) and distribute the pre-signed node certificate
5. Install the Tailscale client (if deploying T1 overlay) and enrol with the Entra OIDC identity
**Minimum viable without the overlay:** a dedicated Intune-enrolled, Entra-joined cloud VM with no email and no general browsing, and a CA policy restricting GA activation to it. Not perfect, but it will actually get deployed and maintained.
---
### Management overlay — Nebula for T0, Tailscale for T1
**When a client needs this:** SME and mid-market clients with multi-cloud resources, DevOps workloads, or remote admins — and no physical data centre with a proper management VLAN. The overlay builds the management plane that the physical network cannot provide.
**When a client does not need this:** organisations with their own data centres and physical network infrastructure already in place. Traditional management VLAN segmentation plus jump boxes is the right answer there. Adding an overlay creates a new Tier 0 component without proportional benefit.
**The T0 overlay — Nebula:**
Nebula has no coordinator in the runtime path. Once certificates are distributed, the overlay runs with zero external dependencies. This is the right property for T0: a compromised or unavailable external service cannot affect access to your domain controllers.
Deployment steps:
1. Provision the Nebula CA on a dedicated air-gapped machine (a dedicated laptop that is never networked, or a cheap PC kept in a drawer)
2. Generate and sign node certificates for each T0 node (DCs, sync server, ADCS, cloud admin VMs/PAWs)
3. Distribute the signed certificates and the CA certificate to each node
4. Configure the Nebula ACL policy: cloud admin VMs can reach DCs on port 3389 (RDP) and 5985/5986 (WinRM); nothing else. DCs do not reach each other through Nebula (they have their own replication channel)
5. Start the Nebula service on each node. Test connectivity from the cloud admin VM to a DC
6. Document the CA signing ceremony: who can sign new certs, what approval is needed, where the CA key is stored, how to revoke (distribute updated blocklist to all nodes)
**Realistic T0 node count:** 15–25 nodes for a 5,000-person organisation. Certificate management is a documented ceremony run a few times a year, not an ongoing operational burden.
**The T1 overlay — Tailscale:**
Tailscale with Entra OIDC + key expiry gives you device trust (WireGuard node key) plus per-session identity assertion (Entra MFA on re-authentication). Configure key expiry to force re-authentication on a schedule aligned with the session risk tolerance (8–24 hours for admin access), and confirm the client's Tailscale plan supports expiry at that granularity before promising it.
Deployment steps:
1. Create a Tailscale account or deploy Headscale (for sovereign requirements)
2. Configure the OIDC integration with Entra ID. Set the MFA requirement to phishing-resistant (FIDO2) in the Entra Conditional Access policy that governs Tailscale authentication
3. Set key expiry: 8–24 hours for admin nodes, 24–72 hours for standard nodes
4. Define ACL policy: cloud admin VMs reach T1 servers on management ports only; standard user devices do not appear in the T1 ACL
5. Enrol cloud admin VMs as nodes. Enrol T1 servers (member servers, cloud management hosts, K8s API server endpoints)
6. Test: attempt to reach a T1 server from a non-enrolled device. Expected: no route. From an enrolled cloud admin VM: connected
**What Tailscale carries for multi-cloud:** kubectl access to K8s clusters, SSH/RDP to member servers and cloud VMs, cloud CLI access where the management API is behind a private endpoint. It does not carry M365 admin traffic — that goes direct to Microsoft over the internet, gated by Conditional Access.
**The Nebula CA — the one critical operation:**
The CA key is the trust anchor for the entire T0 overlay. Its compromise means an attacker can enrol their own node and grant it access to every DC. Treat it accordingly:
- Air-gapped machine, never networked after initial setup
- CA key encrypted at rest on the machine and backed up separately
- Certificate lifetime: 180 days maximum, so non-renewal handles most revocation cases
- Revocation: if a PAW is lost or an admin departs before cert expiry, add that certificate's fingerprint to `pki.blocklist` in every node's configuration and redistribute the configuration
- At least two named people who know the ceremony and can perform it
---
## 3. Devices & Endpoint
### Reconcile the real fleet — do this on day one
Do not trust Intune's enrolled device count or any CMDB. Pull from four sources and compare them:
1. Intune managed devices (Intune portal)
2. Entra registered/joined devices (Entra portal > Devices)
3. Entra sign-in logs, device detail (what is actually authenticating)
4. Network device discovery if in scope
The gap between sources 1+2 and source 3 is your shadow/dark device population. Source 3 will almost always be larger. Every device authenticating that is not in sources 1+2 is an unmanaged device reaching data.
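**Concrete: pull sign-in logs by device compliance state.** In the Entra portal: Sign-in logs > Add filter > "Managed device" = No or "Compliant" = No > export. Count the distinct device IDs. Unregistered devices often sign in with no device ID at all, so count those sign-ins separately (for example, distinct user, OS, and browser combinations). Those counts, compared against your Intune enrolled count, are the gap metric.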
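The comparison can be scripted once the export is in hand. A sketch, assuming sign-in records shaped like the Graph `signIn` resource (`userPrincipalName` plus `deviceDetail.deviceId`, `operatingSystem`, and `browser`) and a list of Entra device IDs for Intune-managed devices; `Get-DeviceGap` is hypothetical. Sign-ins without a device ID are counted as distinct user/OS/browser combinations, a rough proxy rather than a device count.

```powershell
function Get-DeviceGap {
    # Compares devices seen in sign-in records (Graph signIn shape) with the Entra device IDs
    # of Intune-managed devices. Sign-ins without a device ID are counted as distinct
    # user|OS|browser combinations: a rough proxy, not a device count.
    param(
        [Parameter(Mandatory)] [object[]] $SignIns,
        [string[]] $ManagedDeviceIds = @()
    )
    $unmanagedIds = @($SignIns |
        Where-Object { $_.deviceDetail.deviceId } |
        ForEach-Object { $_.deviceDetail.deviceId } |
        Sort-Object -Unique |
        Where-Object { $_ -notin $ManagedDeviceIds })
    $unidentified = @($SignIns |
        Where-Object { -not $_.deviceDetail.deviceId } |
        ForEach-Object { '{0}|{1}|{2}' -f $_.userPrincipalName, $_.deviceDetail.operatingSystem, $_.deviceDetail.browser } |
        Sort-Object -Unique)
    [PSCustomObject]@{
        UnmanagedDeviceIds     = $unmanagedIds
        UnmanagedDeviceCount   = $unmanagedIds.Count
        UnidentifiedComboCount = $unidentified.Count
    }
}
```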
---
### Cloud-native migration: Entra join + Intune as default
For any new device deployment or device refresh in 2026, **Entra join + Intune management** is the default. Hybrid Entra join (AD-joined + cloud-registered) is technical debt to retire, not a target state.
**Migration readiness check:** What on-prem resources does the client's fleet actually need? Line-of-business applications, file shares, printers? Each dependency is a reason to stay hybrid; each that can be moved or resolved with another mechanism is a reason to go cloud-native. Build the dependency map first.
**GPO to Settings Catalog:** Most GPO settings now have equivalents in the Intune Settings Catalog. The IntunePolicyParser tool can parse existing GPOs and identify Settings Catalog equivalents. Run this early in an endpoint engagement to scope the migration effort.
---
### Conditional Access — test every policy before signing off
This is not a recommendation. It is a requirement.
**Protocol:**
1. Before changing or reviewing any CA policy, write down the expected behavior for the users and conditions in scope: *"User X, device Y, location Z → MUST be [blocked/granted/MFA-prompted]."*
2. Use What If as a logic check only — it evaluates configuration, not enforcement.
3. Drive real sign-ins for every important user/condition combination. Observe the actual result.
4. If the observed result contradicts the displayed configuration, recreate the policy from scratch. Do not edit the existing object — a ghost policy carries corruption forward through edits.
5. Re-test after any tenant-level change: adding a domain, changing federation, new app registration. You do not need to have touched the CA policy for it to ghost.
**Report-only mode:** Use report-only to pre-validate before enabling. But test in enabled mode before signing off. Report-only cannot find a ghost policy — only a live enforcement failure can.
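The expected-behaviour list from step 1 doubles as the sign-off record. A sketch of the comparison, assuming each case is captured as an object with `Case`, `Expected`, and `Observed` (free-text outcomes such as `Blocked`, `Granted`, `MFA`); `Compare-CaTestResult` is hypothetical. A `Mismatch` is the trigger for step 4, and a `NotTested` case means the policy is not signed off.

```powershell
function Compare-CaTestResult {
    # Compares each written-down expectation with the result of a real sign-in.
    # Case objects are an assumption: Case, Expected, Observed (free-text outcomes).
    param([Parameter(Mandatory)] [object[]] $Cases)
    foreach ($c in $Cases) {
        $status = 'Mismatch'
        if (-not $c.Observed) {
            $status = 'NotTested'
        }
        elseif ($c.Observed -eq $c.Expected) {
            $status = 'Pass'
        }
        [PSCustomObject]@{ Case = $c.Case; Expected = $c.Expected; Observed = $c.Observed; Status = $status }
    }
}
```

Sign-off means `Compare-CaTestResult -Cases $cases | Where-Object Status -ne 'Pass'` returns nothing.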
---
### EPM: eliminate standing local admin
In 2026, **Endpoint Privilege Management (EPM)** in Intune is the right answer for "some users need admin rights for specific software." EPM provides JIT, audited, approved elevation without giving the user permanent local admin.
**Licensing:** Requires the Intune Suite or the standalone Endpoint Privilege Management add-on (not included in standard Business Premium or E3; verify licensing before scoping).
**Deployment:**
1. Audit current local admin membership across the fleet (GPO reporting or Intune device reports)
2. Identify the specific applications or tasks requiring elevation
3. Create EPM rules for those specific executables
4. Remove standing local admin from standard user accounts
5. Monitor EPM elevation events for anomalies
If EPM licensing is not available, Windows LAPS for local admin credentials (randomized, no shared password) plus a JIT process for elevation requests is the intermediate posture.
---
### Update rings: the lesson from 2024
Configure update rings in Intune for all managed endpoints. Every client needs:
- **Pilot ring** (5–10% of devices, IT staff / early adopters): 0 days deferral
- **Broad ring** (remainder): 7-day deferral after pilot passes
- A named person with the authority to **halt a broad ring push** — confirmed they know how and have tested it
**Windows Autopatch** (included in Business Premium, E3 with Intune add-on, E5) automates ring management and defers intelligently. If the client is licensed for it and not using it, that is a quick win.
The lesson of the July 2024 CrowdStrike outage is not limited to AV/EDR updates: it applies to any software distributed at scale. Update ring discipline is now an endpoint governance requirement, not a preference.
---
### MAM boundaries: test them on a real device
If the client uses App Protection Policies for BYOD (MAM-WE), the policy screen does not prove enforcement. Test on real devices, on current OS builds, per platform:
**Test protocol (run separately on iOS and Android):**
- Attempt to copy text from a managed app (Outlook, Teams) and paste into an unmanaged app
- Attempt to "Open in" from a managed attachment to an unmanaged app
- Attempt to save a file locally or to the camera roll
- Attempt to screenshot (if blocked by policy)
- Test from an unmanaged browser accessing SharePoint or OWA
Document where "Block" does not block. When you find a gap that survives reinstall on multiple devices, that is a vendor escalation, not a configuration fix.
---
## 4. Data & Collaboration
### Anonymous sharing: disable at the tenant level on day one
"Anyone with the link" sharing is a bearer token for your data — no identity required, forwardable, often with no expiry, reachable by anyone who ever held the link. This is the single largest data exposure fragility in M365.
**Immediate action:** SharePoint Admin Center > Policies > Sharing > External sharing: set to "New and existing guests" (requires authentication) or "Only people in your organization." If the client has a business case for anonymous links, scope specific sites where it is permitted and disable at the tenant level for everything else.
**Enumerate existing anonymous links:**
```powershell
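# PnP PowerShell: connect to the tenant admin site first (Connect-PnPOnline).
# Confirm the cmdlet below exists in your installed PnP.PowerShell version before relying on the output.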
Get-PnPTenantSite -IncludeOneDriveSites | ForEach-Object {
Get-PnPSiteCollectionSharingLinks -Site $_.Url
} | Where-Object { $_.Link -like "*guestaccess*" }
```
The list you get is almost always longer than anyone expected. The exercise of producing it is itself a finding.
---
### External auto-forwarding: block it and check for active rules
**Block at the global level:** Exchange Admin Center > Mail flow > Remote domains > Default domain > Automatic forwarding: Disabled.
**Check for existing rules (do this before blocking in case active BEC is in progress):**
```powershell
Get-TransportRule | Where-Object { $_.BlindCopyTo -or $_.CopyTo -or $_.AddToRecipients -or $_.RedirectMessageTo } |
Select-Object Name, State, BlindCopyTo, CopyTo, AddToRecipients, RedirectMessageTo
```
Any rule forwarding to an external address with no documented business owner is a potential BEC persistence mechanism. Treat as P0 until confirmed otherwise.
Also check Outlook/OWA rules at the mailbox level, executive accounts first (the command below covers every mailbox):
```powershell
Get-Mailbox -ResultSize Unlimited | ForEach-Object {
    Get-InboxRule -Mailbox $_.UserPrincipalName |
        Where-Object { $_.ForwardTo -or $_.ForwardAsAttachmentTo -or $_.RedirectTo } |
        Select-Object MailboxOwnerId, Name, Enabled, ForwardTo, ForwardAsAttachmentTo, RedirectTo
}
```
---
### Crown jewels: name them before scoping DLP or labels
The first question in every data engagement: *"Which five data sets, if exfiltrated, would end or materially damage this business?"*
If the client cannot name them, that is finding #1 and the prerequisite for everything else. DLP and sensitivity labels applied before the crown jewels are identified are DLP and sensitivity labels that protect the wrong things.
Common crown jewels in 2026: M&A communications, board and executive email, source code repositories, customer PII data subject to GDPR/NIS2, financial forecasts and models, intellectual property, credentials and secrets stored in SharePoint/Teams.
Once named: where do they live? Who has access? Are they labeled? Is access audited?
---
### Sensitivity labels and auto-labeling
**2026 recommendation:** If the client is on E5 Compliance or equivalent, deploy auto-labeling policies for the crown jewel data types. Manual labeling depends on user behavior; auto-labeling does not.
**Licensing check first:** Sensitivity labels: all M365 plans. Auto-labeling, advanced DLP, and Purview data governance: M365 E5 Compliance or the Microsoft Purview compliance add-on. Verify before scoping.
**Implementation sequence:**
1. Define the crown jewels (see above)
2. Create the sensitivity labels (Highly Confidential, Confidential, Internal, Public) and order them in Purview so the most restrictive label has the highest priority
3. Apply encryption to Highly Confidential and Confidential labels — encryption travels with the file, including after exfiltration
4. Configure auto-labeling for known high-value content types (credit card numbers, national IDs, custom regex for the client's IP)
5. Monitor label application events before enforcing auto-labeling in production
---
### Guest access: treat as standing blast radius
Run a guest access review on every engagement. Most tenants cannot produce the list of current guests without effort. The exercise of trying to produce it is the finding.
**Enumerate guests:**
```powershell
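# SignInActivity must be requested explicitly and needs AuditLog.Read.All (Entra ID P1 or P2)
Get-MgUser -Filter "userType eq 'Guest'" -All -Property DisplayName, Mail, CreatedDateTime, SignInActivity |
Select-Object DisplayName, Mail, CreatedDateTime, @{ n = 'LastSignInDateTime'; e = { $_.SignInActivity.LastSignInDateTime } } |
Sort-Object LastSignInDateTime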
```
Sort by `LastSignInDateTime`. Guests who have not signed in for 90+ days have no legitimate active need. The default should be expiration, not permanence.
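A sketch of the 90-day rule, assuming guest objects that carry `LastSignInDateTime` and `CreatedDateTime` (for example, `SignInActivity.LastSignInDateTime` from `Get-MgUser`, requested with `-Property`); `Find-StaleGuest` is hypothetical. Guests created inside the window with no sign-in yet are skipped as pending invitations; older guests with no recorded sign-in are flagged.

```powershell
function Find-StaleGuest {
    # Flags guests with no sign-in in the last $Days days. Guests with no recorded sign-in are
    # flagged only if they were created before the window (newer ones are pending invitations).
    param(
        [Parameter(Mandatory)] [object[]] $Guests,
        [int] $Days = 90,
        [datetime] $Now = (Get-Date)
    )
    $cutoff = $Now.AddDays(-$Days)
    foreach ($g in $Guests) {
        $reason = $null
        if (-not $g.LastSignInDateTime) {
            if ($g.CreatedDateTime -lt $cutoff) { $reason = 'NoRecordedSignIn' }
        }
        elseif ($g.LastSignInDateTime -lt $cutoff) {
            $reason = "NoSignInFor${Days}Days"
        }
        if ($reason) {
            [PSCustomObject]@{ DisplayName = $g.DisplayName; Mail = $g.Mail; Reason = $reason }
        }
    }
}
```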
**Configure guest access reviews** in Entra Identity Governance > Access reviews. Set recurring reviews for all guests at 90-day intervals. When a reviewer does not respond, the default action should be removal, not retention.
---
### Audit log: verify it is on and retained
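Do not assume audit logging is enabled. Go to Microsoft Purview > Audit: if a "Start recording user and admin activity" banner appears, auditing is off. In Exchange Online PowerShell, `Get-AdminAuditLogConfig | Select-Object UnifiedAuditLogIngestionEnabled` gives the same answer. Then run a test search to confirm log entries are being captured.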
**Retention check — critical:**
- Audit (Standard), included with E3: 180 days for records generated since October 2023 (previously 90 days)
- E5 / Purview Audit Premium: 1 year (extendable to 10 years with add-on)
- Unified audit log must be explicitly enabled; it has historically not been on by default in older tenants
For incident response purposes, retention minus dwell time is how long responders have before the earliest evidence ages out: a breach discovered 60 days after initial access, in a tenant with 90-day retention, leaves 30 days to export it. For many meaningful incidents, default retention is insufficient. Scope the retention discussion explicitly.
---
## 5. Recovery & Detection
### M365 backup: the mandatory conversation
Native Microsoft 365 provides recycle bins and version history. It does not provide point-in-time backup against ransomware, malicious admin deletion, or retention policy expiry.
**The question to ask the client:** "If someone with Global Admin access right now deleted every Exchange Online mailbox and every SharePoint site, what is your recovery path, and how long does it take?"
If the answer involves the Microsoft recycle bin and "we would call Microsoft support," that is not a recovery plan. The recycle bin window is 14–93 days depending on the workload and configuration, and it does not protect against retention policy deletion or hard-delete operations by a malicious admin.
**2026 recommendation:** A third-party M365 backup solution covering Exchange Online, SharePoint Online, OneDrive for Business, and Teams is a baseline requirement for any client treating M365 as business-critical. The market is mature. Veeam, AvePoint, Acronis, and Dropsuite are the common options. Assess per client need.
---
### Configuration-as-code: export the control plane
Export CA policies, Intune baseline configurations, and Entra role assignments to code or structured files at the start of every engagement. This serves three purposes:
1. Known-good baseline to detect drift and ghost configuration against
2. Rebuild artifact for a compromised or corrupted tenant
3. Change management — you can diff the configuration before and after every change
**CA policies:** Use CAExporter (`vibecoding/CAExporter`) to export all CA policies to JSON. Store in client's repository. Run the export again at the close of the engagement and diff against the opening export — changes are documented, not assumed.
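The opening-versus-closing diff needs no extra tooling. A sketch, assuming both exports are JSON arrays of policy objects with `id` and `displayName` as Graph returns them (if the exporter writes one file per policy, combine them into an array first), produced by the same tool so property order is stable. `modifiedDateTime` is ignored so a re-save alone is not reported; `Compare-CaExport` and its helper are hypothetical.

```powershell
function ConvertTo-ComparablePolicyJson {
    # Serialises one policy without modifiedDateTime, so a re-save alone is not a change.
    param([Parameter(Mandatory)] [object] $Policy)
    $Policy | Select-Object -Property * -ExcludeProperty modifiedDateTime | ConvertTo-Json -Depth 20 -Compress
}

function Compare-CaExport {
    # Diffs two CA policy exports (JSON arrays of policy objects carrying id and displayName).
    # Policies are matched by id and reported as Added, Removed, or Modified.
    param(
        [Parameter(Mandatory)] [string] $BeforeJson,
        [Parameter(Mandatory)] [string] $AfterJson
    )
    $before = @{}; foreach ($p in ($BeforeJson | ConvertFrom-Json)) { $before[$p.id] = $p }
    $after  = @{}; foreach ($p in ($AfterJson | ConvertFrom-Json)) { $after[$p.id] = $p }
    $ids = @($before.Keys) + @($after.Keys) | Sort-Object -Unique
    foreach ($id in $ids) {
        $b = $before[$id]; $a = $after[$id]
        $change = $null
        if (-not $b) {
            $change = 'Added'
        }
        elseif (-not $a) {
            $change = 'Removed'
        }
        elseif ((ConvertTo-ComparablePolicyJson $b) -ne (ConvertTo-ComparablePolicyJson $a)) {
            $change = 'Modified'
        }
        if ($change) {
            $name = if ($a) { $a.displayName } else { $b.displayName }
            [PSCustomObject]@{ Id = $id; DisplayName = $name; Change = $change }
        }
    }
}
```

Usage: `Compare-CaExport -BeforeJson (Get-Content .\ca-open.json -Raw) -AfterJson (Get-Content .\ca-close.json -Raw)`; the file names are placeholders.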
**Intune:** The Graph API can export most Intune configuration; IntunePolicyParser assists with policy comprehension. Store the export.
**Entra roles:** Capture the current role assignment list (who holds what role, eligibility vs activation) as a document. This is your before-state for any privileged access engagement.
---
### Detection: eight signals that matter more than eight hundred that don't
Configure these eight before anything else. Each one represents a category of attack where silence is catastrophic:
| Signal | Where to configure | Why it cannot be noise |
|--------|-------------------|----------------------|
| Break-glass account sign-in (any use at all) | Entra sign-in logs → alert rule or Sentinel | An account that should never sign in has signed in |
| New Global Admin assigned | Entra audit logs, `Add member to role` for GA role | Shadow admin creation |
| DCSync from non-DC host | Microsoft Defender for Identity or Sentinel | On-prem AD credential harvest in progress |
| Impossible-travel sign-in for admin accounts | Entra ID Protection > User risk alerts | Account takeover in flight |
| External auto-forward rule created | Exchange audit logs | BEC persistence being established |
| Mass download from SharePoint/OneDrive | Defender for Cloud Apps or Purview | Exfiltration in progress |
| New OAuth consent grant to high-privilege scope | Entra audit logs, `Consent to application` | Illicit app consent attack |
| Privileged role activation outside business hours | PIM alerts | Credential use at suspicious time |
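Three of these rows reduce to simple predicates over log records. Having them as code helps when you want to test an alert rule against a known event before trusting it. Below is a minimal sketch. It assumes Graph-shaped records: directory audit entries carry `activityDisplayName` and `targetResources[].modifiedProperties[]`, and sign-in entries carry `userPrincipalName`. The role name is assumed to sit in a `Role.DisplayName` property, which is commonly seen but worth verifying in the tenant. The business hours (Monday to Friday, 08:00 to 18:00 local, fixed UTC offset) are illustrative, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

GLOBAL_ADMIN = "Global Administrator"


def is_break_glass_sign_in(sign_in: dict, break_glass_upns: set[str]) -> bool:
    """Any sign-in by a break-glass account is an alert, successful or not."""
    upn = (sign_in.get("userPrincipalName") or "").lower()
    return upn in {u.lower() for u in break_glass_upns}


def is_new_global_admin(audit: dict) -> bool:
    """'Add member to role' where the role being granted is Global Administrator."""
    if audit.get("activityDisplayName") != "Add member to role":
        return False
    for target in audit.get("targetResources") or []:
        for prop in target.get("modifiedProperties") or []:
            # Assumed: newValue is a JSON-encoded string such as '"Global Administrator"'.
            value = (prop.get("newValue") or "").strip('"')
            if prop.get("displayName") == "Role.DisplayName" and value == GLOBAL_ADMIN:
                return True
    return False


def outside_business_hours(timestamp: str, utc_offset_hours: int = 0,
                           start_hour: int = 8, end_hour: int = 18) -> bool:
    """True on weekends, or outside [start_hour, end_hour) in local time."""
    utc = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    local = utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.weekday() >= 5 or not (start_hour <= local.hour < end_hour)
```

These predicates do not replace the alert rules themselves. They give you a quick way to check that a synthetic event which should fire actually matches.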
Each of these should route to a named human who will respond within a defined SLA. Detection that fires into an unmonitored queue is theatre with a subscription cost.
---
### AD forest recovery: have the conversation
Ask the client: "Has anyone on your team ever run an AD forest recovery — not in a training lab, on a real forest?" The answer is almost universally no.
This is not a project you complete in an engagement — it is a finding and a recommendation. The finding: if AD is destroyed or corrupted (ransomware taking the DCs), recovery is a multi-day, expert-dependent process that nobody on this team has ever performed. The recommendation: run a tabletop of the procedure, identify the gaps in the runbook, and ensure the runbook is stored somewhere that survives the estate being dark (not in SharePoint, not in an AD-authenticated file share).
The minimum viable runbook should cover: the restore sequence for the first DC in each domain (including the authoritative SYSVOL restore), metadata cleanup, double KRBTGT reset, trust rebuilds, and how the Entra side reconnects when on-prem is back.
---
### Break-glass: test it, don't just create it
Break-glass accounts exist in most tenants. They are tested in almost none. On every engagement:
1. Does the break-glass account exist? (Cloud-only, `.onmicrosoft.com`, not synced)
2. Is it phishing-resistant? (FIDO2 key or certificate — not push-approve)
3. Is it excluded from every CA policy that would otherwise block it?
4. Does its use trigger an immediate alert? (If yes, verify the alert fires during the test — not just that the alert rule exists)
5. Where are the credentials? (Not in the client's normal password manager that requires the same identity to access)
6. When did it last sign in? (The credential should be proven functional: test it.)
The test is non-negotiable. An untested break-glass account is a belief, not a recovery path.
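Checks 1 and 6 can be pre-screened from the Graph user object. The rest need a human, and the live sign-in test still has to happen. Below is a minimal sketch. It assumes a user record fetched with `userPrincipalName`, `onPremisesSyncEnabled`, and `signInActivity` selected; reading `signInActivity` requires additional audit-log permissions and licensing. The function name and the maximum age are hypothetical, because the acceptable interval is the client's decision.

```python
from datetime import datetime, timezone
from typing import Optional


def break_glass_findings(user: dict, max_days_since_sign_in: int,
                         now: Optional[datetime] = None) -> list[str]:
    """Findings for the checks the Graph user object can answer (1 and 6)."""
    now = now or datetime.now(timezone.utc)
    findings = []
    if not (user.get("userPrincipalName") or "").lower().endswith(".onmicrosoft.com"):
        findings.append("UPN is not on the .onmicrosoft.com domain")
    # onPremisesSyncEnabled is true while the account is synced from on-prem AD.
    if user.get("onPremisesSyncEnabled"):
        findings.append("account is synced from on-prem AD")
    # lastSignInDateTime can reflect failed interactive attempts, so a recent
    # value shows activity, not a working credential.
    last = (user.get("signInActivity") or {}).get("lastSignInDateTime")
    if last is None:
        findings.append("no recorded sign-in: credential never proven")
    else:
        age_days = (now - datetime.fromisoformat(last.replace("Z", "+00:00"))).days
        if age_days > max_days_since_sign_in:
            findings.append(f"last sign-in {age_days} days ago: re-test the credential")
    return findings
```

An empty list means only that nothing obvious is wrong. It does not replace signing in with the account and watching the alert fire.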
---
## What changed: 2025 → 2026
| Area | Prior state | 2026 position |
|------|------------|---------------|
| AD FS | Roadmap item for most clients | P0 conversation — tooling mature, no excuse |
| Entra Cloud Sync | "For simple topologies" | Recommended default for new deployments |
| dMSA | Newly released, cautiously recommended | Hold — published escalation research; use gMSA |
| EPM | Available, optional | Table stakes for zero-standing-admin on endpoints |
| Windows Autopatch | Optional | Default recommendation for update ring discipline |
| CA ghost policy | Edge case, occasionally found | Documented pattern — test every policy as standard |
| M365 native backup | "Microsoft covers it" (wrong but common) | Third-party backup framed as baseline, not option |
| PIM activation MFA | Often push-approve | Must be phishing-resistant to count |
| Windows LAPS | New, replacing legacy LAPS | Deployed as standard; legacy LAPS is tech debt |
---
## The governing question — carry it into every session
Before every finding, every recommendation, every conversation:
> **If this is owned tonight, what is the largest thing an attacker reaches before hitting a wall — and can I draw that wall?**
If the wall is missing or undrawn, you have found the work. Everything else is sequencing.
---
*Field Guide for the Antifragile Handbook. Updated June 2026. Review and update January 2027 — the honest uncertainty sections of the books define what will change.*