# The Antifragile Handbook for M365 & Active Directory
## Book III — Privileged Access
> *Privilege is blast radius with a time axis. Standing privilege reaches everything, forever. The whole job is to collapse both: less reach, less time.*
---
## The governing question
Book I asked you to draw the wall. Book II built it between on-prem and cloud. This book is about the credentials that can knock any wall down. Ask of every privileged identity — human, service account, or app:
> **If this credential leaks tonight, how long does it stay useful, and how far does it reach?**
A permanent Domain Admin answers *"forever, everything."* A permanent Global Admin answers *"forever, the whole tenant."* A JIT, scoped, time-boxed role answers *"for one hour, for one task."* Every technique in this book exists to turn the first kind of answer into the second. That's it. That's the whole craft of privileged access: **shrink the reach, shrink the time.**
Compliance counts whether you "have a PAM solution." Wrong question. The question is whether privilege *evaporates when not in use* and whether a leaked credential hits a wall in minutes instead of owning the estate forever.
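One way to make the governing question concrete is to treat exposure as reach multiplied by time. The sketch below is illustrative only: the `Credential` fields, the reach figures, and the one-year horizon are assumptions invented for the example, not measurements or a standard metric.

```python
from dataclasses import dataclass
from typing import Optional

HORIZON_HOURS = 24 * 365  # assumption: cap "forever" at one year so the numbers stay finite


@dataclass
class Credential:
    name: str
    reachable_systems: int         # reach: systems the credential can touch (made-up figures)
    useful_hours: Optional[float]  # time: None means standing, useful for the whole horizon


def exposure(cred: Credential) -> float:
    """Return reach multiplied by time, in system-hours."""
    hours = HORIZON_HOURS if cred.useful_hours is None else cred.useful_hours
    return cred.reachable_systems * hours


standing_da = Credential("permanent Domain Admin", reachable_systems=5000, useful_hours=None)
jit_role = Credential("JIT scoped role", reachable_systems=12, useful_hours=1.0)

for cred in (standing_da, jit_role):
    print(f"{cred.name}: {exposure(cred):,.0f} system-hours")
```

The absolute numbers mean nothing; the gap between the two answers is the point. Shrinking either factor shrinks the product, and shrinking both collapses it.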
---
## 1. Fragility inventory — where privilege rots
### Standing privilege (the original sin)
An account that is *always* an admin is a loaded gun left on the table, every hour of every day, whether anyone's using it or not. Its blast radius is constant and maximal. Permanent Domain Admins, permanent Enterprise Admins, permanent Global Admins — every one of them is a credential whose value to an attacker never drops to zero. **The single most important number in this book is: how many identities hold standing privilege?** In most estates it's an order of magnitude too high, and nobody has ever counted.
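Producing that number is mostly a counting exercise once role assignments are exported. A minimal sketch, assuming the assignments have already been pulled into a flat list; the `Assignment` record, its field names, and the sample rows are hypothetical and do not correspond to any particular export format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List

# Assumption: which roles count as privileged is a local decision; extend as needed.
PRIVILEGED_ROLES = {"Domain Admins", "Enterprise Admins", "Global Administrator"}


@dataclass(frozen=True)
class Assignment:
    principal: str
    kind: str        # "human", "service_account", or "service_principal"
    role: str
    permanent: bool  # True = always active; False = eligible or time-bound


def standing_privileged(assignments: Iterable[Assignment]) -> List[Assignment]:
    """Keep only permanent assignments to privileged roles."""
    return [a for a in assignments if a.permanent and a.role in PRIVILEGED_ROLES]


def count_by_kind(assignments: Iterable[Assignment]) -> Counter:
    """Count distinct principals with standing privilege, split by identity kind."""
    distinct = {(a.principal, a.kind) for a in standing_privileged(assignments)}
    return Counter(kind for _, kind in distinct)


sample = [
    Assignment("alice", "human", "Domain Admins", permanent=True),
    Assignment("alice", "human", "Enterprise Admins", permanent=True),
    Assignment("bob", "human", "Global Administrator", permanent=False),
    Assignment("svc-backup", "service_account", "Domain Admins", permanent=True),
    Assignment("sync-app", "service_principal", "Global Administrator", permanent=True),
    Assignment("carol", "human", "Helpdesk", permanent=True),
]
counts = count_by_kind(sample)
print(dict(counts), "total:", sum(counts.values()))
```

Splitting the count by identity kind matters because the non-human rows are the ones that usually go uncounted.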
### Service accounts and service principals (the dark matter)
This is where the bodies are buried, on both sides of the wall:
- **On-prem service accounts**: over-permissioned ("we made it Domain Admin to make it work"), with static passwords that haven't changed since 2016 and an SPN attached, which makes them **Kerberoastable** (any domain user can request the service ticket, then crack the weak password offline at leisure). They are owned by nobody, documented nowhere, and impossible to turn off because something unknown will break.
- **Cloud service principals / app registrations**: the same disease in a new body. Client secrets that never expire, **tenant-wide admin consent**, and Microsoft Graph permissions that are quietly catastrophic: `RoleManagement.ReadWrite.Directory`, `AppRoleAssignment.ReadWrite.All`, `Application.ReadWrite.All`. Any one of these is a privilege-escalation path to Global Admin. Service principals **cannot do MFA**, usually hold **standing** privilege, and live in a blind spot no benchmark looks at hard enough.
Service identities are dark matter: most of the privileged mass of the estate, invisible in the usual diagrams, and gravitationally dominant when something goes wrong.
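Both halves of that inventory can be triaged from exported data. The sketch below assumes the app registrations and service accounts have already been dumped into simple records; the record shapes and sample names are hypothetical, and the permission set is only the three scopes named above, not a complete or current list. Excluding gMSAs from the Kerberoast candidates is likewise an assumption that follows this book's framing.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Optional

# Only the scopes named in the text; verify the current escalation-grade list.
ESCALATION_GRADE = {
    "RoleManagement.ReadWrite.Directory",
    "AppRoleAssignment.ReadWrite.All",
    "Application.ReadWrite.All",
}


@dataclass
class AppRegistration:
    name: str
    graph_permissions: List[str]
    secret_expiry: Optional[date]  # None = the client secret never expires


@dataclass
class ServiceAccount:
    name: str
    has_spn: bool
    is_gmsa: bool
    password_last_set: date


def risky_apps(apps: Iterable[AppRegistration]) -> Dict[str, List[str]]:
    """Map each risky app to the reasons it was flagged."""
    findings: Dict[str, List[str]] = {}
    for app in apps:
        reasons = sorted(ESCALATION_GRADE.intersection(app.graph_permissions))
        if app.secret_expiry is None:
            reasons.append("non-expiring secret")
        if reasons:
            findings[app.name] = reasons
    return findings


def kerberoast_candidates(accounts: Iterable[ServiceAccount]) -> List[str]:
    """Non-gMSA accounts with an SPN, oldest password first."""
    hits = [a for a in accounts if a.has_spn and not a.is_gmsa]
    return [a.name for a in sorted(hits, key=lambda a: a.password_last_set)]


apps = [
    AppRegistration("legacy-sync", ["User.Read.All", "Application.ReadWrite.All"], None),
    AppRegistration("reporting", ["User.Read.All"], date(2030, 1, 1)),
]
accounts = [
    ServiceAccount("svc-web", True, False, date(2024, 6, 1)),
    ServiceAccount("svc-sql", True, False, date(2016, 3, 1)),
    ServiceAccount("gmsa-app$", True, True, date(2025, 1, 1)),
    ServiceAccount("svc-task", False, False, date(2015, 1, 1)),
]
print(risky_apps(apps))
print(kerberoast_candidates(accounts))
```

Password age is used only to order the list: every account with an SPN and a human-set password is a candidate, and the oldest passwords are simply the likeliest to crack.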
### Tier violations (the wall with a hole kicked in it)
The Lindy core of on-prem security is the tier model (Tier 0 = identity control plane: DCs, AD, ADCS, the sync server from Book II; Tier 1 = servers; Tier 2 = workstations). Microsoft has since reframed it as the Enterprise Access Model reaching into the cloud, but the rule never changed:
> **A higher-tier credential must never be exposed on a lower-tier system.**
Every Domain Admin who RDPs into a workstation, every admin whose daily-driver laptop also touches a DC, every shared jump box used for both Tier 0 and Tier 1 — that's a tier violation, and it's how `pass-the-hash` / `pass-the-ticket` turns one phished workstation into domain dominance. The clean-source principle is absolute: **you cannot securely manage a system from a less-secure one.**
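The rule reduces to a single comparison once every account and every host has a tier label. A minimal sketch, assuming logon records are available as (account, host) pairs and that the tier maps are maintained by hand; all names below are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Logon = Tuple[str, str]  # (account, host)


def tier_violations(
    logons: Iterable[Logon],
    account_tier: Dict[str, int],
    host_tier: Dict[str, int],
) -> List[Logon]:
    """Return logons where a higher-tier credential lands on a lower-tier host.

    Tier 0 is the highest tier, so a violation is an account whose tier
    number is smaller than the tier number of the host it logged on to.
    """
    return [
        (account, host)
        for account, host in logons
        if account_tier[account] < host_tier[host]
    ]


account_tier = {"adm-t0-alice": 0, "adm-t1-bob": 1, "jdoe": 2}
host_tier = {"dc01": 0, "app-srv-07": 1, "laptop-jdoe": 2}
logons = [
    ("adm-t0-alice", "dc01"),
    ("adm-t0-alice", "laptop-jdoe"),
    ("adm-t1-bob", "app-srv-07"),
    ("adm-t1-bob", "laptop-jdoe"),
    ("jdoe", "laptop-jdoe"),
]
print(tier_violations(logons, account_tier, host_tier))
```

The comparison is trivial; the real work is labelling every account and host, which is itself a useful forcing function.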
### The escalation plumbing nobody maps
- **AD ACL backdoors**: who can reset whose password, who has `WriteDACL` / `GenericAll` on what. Privilege hides in object permissions, not just group membership. Attackers map this in minutes; defenders rarely map it at all.
- **Delegation**: a host with unconstrained delegation caches the TGT of every principal that authenticates to it, so compromising that one host can hand over an admin's or a domain controller's ticket, a short step from DCSync and the KRBTGT key. Constrained/RBCD misconfigurations are escalation paths too.
- **ADCS**: the certificate services escalation paths (the ESC-series misconfigurations) turn a forgotten CA template into domain compromise. ADCS is **Tier 0** and is almost always treated as Tier 1 or forgotten entirely.
- **KRBTGT**: the master key behind golden tickets. Rarely rotated; if an attacker ever had it, they may still have it.
- **LAPS absent**: without per-machine local admin password randomisation, one stolen or cracked local admin hash unlocks lateral movement across every machine sharing it.
### The recovery paradox
The accounts that can rebuild the estate after a disaster are, by definition, the most powerful — and therefore the most valuable to an attacker. Break-glass done carelessly is just standing privilege with a heroic name. (Handled in §4.)
---
## 2. Via negativa — what to remove (in priority order)
Privilege is the domain where deletion is the entire strategy. Adding "privileged access controls" on top of unmanaged standing privilege is rearranging furniture in a burning room.
1. **Eliminate standing privilege.** Roles become *eligible*, not *active*. Cloud-side this is PIM (§3). On-prem it's harder and the tooling is weaker; be honest about that (see "Honest uncertainty" below). Time-bound group membership and JIT elevation tooling do exist, so use them. The target state: at rest, almost nobody is an admin.
2. **Empty the top groups toward the irreducible minimum.** Drive Domain Admins, Enterprise Admins, and standing Global Admins down to the smallest number that reality permits (plus break-glass). Delegate specific rights instead of handing out god-mode. "Empty Domain Admins" is an achievable goal, not a fantasy.
3. **Kill, convert, or constrain service identities.** Remove the ones nobody can justify (apply the 90-day-scream test). Convert the rest to managed identities — **gMSA** on-prem (the established, Lindy fix: automatic password rotation, no static secret, not Kerberoastable in the same way), **managed identities** in Azure where possible. Strip every excess right. For app registrations: remove the dangerous Graph permissions, expire and rotate secrets, prefer certificate credentials or managed identities over secrets, and delete unused registrations and stale consent grants.
4. **Remove tier violations.** No high-tier credential on a low-tier box, ever. This is mostly subtraction — taking admin rights *off* daily-driver machines and shared boxes.
5. **Fix the escalation plumbing by removal.** Decommission unused ADCS templates, remove unconstrained delegation, prune dangerous ACLs, deploy LAPS so standing shared local admin passwords cease to exist.
6. **Remove standing local admin from users.** Most don't need it. The ones who think they do usually need it for ten minutes a month — which is a JIT problem, not a standing-rights problem.
---
## 3. The barbell — paranoia for the control plane, cheap for the rest
**The irreplaceable few (paranoid, redundant, monitored):**
- **Tier 0** — DCs, AD, ADCS, KRBTGT, and the sync server from Book II. This is the control plane; if it falls, everything falls.
- **The handful of break-glass Global Admins** (§4).
- **The PIM / role-management configuration itself** — because whoever controls *who can become admin* is effectively admin. Privileged Role Administrator and Privileged Authentication Administrator are crown roles; treat them as such.
**Paranoid protection for privileged work means, non-negotiably:**
- **PAWs — the principle and the practical reality.** The principle: all Tier 0 / Global Admin work from a clean, hardened, single-purpose device that never reads email or browses the web. The admin's normal laptop is Tier 2. This is right. The practical reality: physical PAWs almost never get deployed. The hardware procurement, the second device on the desk, the behaviour change — all of it defeats the project before it starts. The deployable alternative that preserves the essential properties is a **cloud-hosted admin workstation** — a Windows 365 or Azure Virtual Desktop VM provisioned from a hardened template, enrolled in the management overlay, used only for privileged tasks. The admin connects from their normal device via browser or RDP. Privileged credentials live in the cloud VM, not on the admin's local device. If the VM is compromised: wipe it, reprovision from template in 20 minutes. The security property is the same — credentials isolated from the daily-use device — without the hardware problem. This is the practical PAW. Recommend it before recommending a dedicated physical device; it will actually get deployed.
- **The management overlay** connects the admin workstation (cloud VM or physical PAW) to the systems it manages without exposing those systems to the general network. The T0/T1 split matters here and maps directly to the tier model: T0 systems (DCs, ADCS, sync server) get an overlay with no external runtime dependency (Nebula with pre-distributed certificates); T1 systems (member servers, cloud workloads, multi-cloud resources) get an overlay with identity-aware access and per-session MFA (Tailscale with Entra OIDC). The realistic T0 node count for a 5,000-person organisation is 15 to 25 nodes, small enough to manage with a documented certificate ceremony and a spreadsheet, not a full PKI team. The management overlay is what makes remote and hybrid admin work possible without either a traditional VPN's flat-network problem or physical-presence-only access.
- **Phishing-resistant MFA only** for admins — FIDO2 / passkeys / certificate-based. SMS and push-approve are not admin-grade; they're phishable, and admins are the phishing prize. For the management overlay, this means Tailscale configured with key expiry and an Entra OIDC IdP enforcing FIDO2 — so the WireGuard device trust and a per-session identity assertion are both present, not just the device key.
- **Separate, cloud-only privileged identities** for cloud admin (the Book II firebreak, enforced here). On-prem admin identity must not be the cloud admin identity.
- **JIT for everything** via PIM: eligible-not-active, time-boxed, MFA on activation, justification logged, and **approval workflow on the crown roles**.
- **Conditional Access scoped to admins** — privileged roles usable only from PAWs / compliant devices / named locations.
**Everything else stays cheap.** Standard RBAC, normal user access, ordinary app permissions — don't pour the privileged-access budget evenly across the whole directory. Concentrate it ferociously on the tiny set of identities that own the control plane. A thousand hardened standard users won't save you if one permanent Domain Admin uses `Password1!` on a Kerberoastable SPN.
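The JIT properties listed above (eligible-not-active, time-boxed, MFA on activation, justification logged) can be modelled in a few lines. This is a toy model of those properties, not PIM's API: the `activate` function, the `Activation` record, and the one-hour ceiling are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_DURATION = timedelta(hours=1)  # assumption: one-hour ceiling per activation


@dataclass(frozen=True)
class Activation:
    principal: str
    role: str
    justification: str
    starts: datetime
    expires: datetime

    def is_active(self, now: datetime) -> bool:
        """Privilege exists only inside the time box."""
        return self.starts <= now < self.expires


def activate(
    principal: str,
    role: str,
    justification: str,
    mfa_passed: bool,
    now: datetime,
    duration: timedelta = MAX_DURATION,
) -> Activation:
    """Turn an eligible role into an active one for a bounded period."""
    if not mfa_passed:
        raise PermissionError("MFA is required on activation")
    if not justification.strip():
        raise ValueError("a justification must be logged")
    if duration > MAX_DURATION:
        raise ValueError("requested duration exceeds the time box")
    return Activation(principal, role, justification, now, now + duration)


t0 = datetime(2025, 1, 1, 9, 0)
grant = activate("alice", "Exchange Administrator", "mailbox migration", True, t0)
print(grant.is_active(t0 + timedelta(minutes=30)))
print(grant.is_active(t0 + timedelta(minutes=61)))
```

The design point is that nothing has to remember to revoke the role: expiry is a property of the grant itself, so a leaked session stops being useful when the clock runs out.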
---
## 4. Optionality & recovery — escape hatches, tested
- **Break-glass done right.** This is the deliberate exception to "no standing privilege" — you *need* an account that works when PIM, MFA infrastructure, or the IdP is down. So it's standing by necessity, which means it is protected differently: cloud-only, phishing-resistant credential stored offline/split, excluded from the CA policy that would otherwise lock it out, and **wired so that any use at all triggers a screaming alert.** Standing privilege you can't remove, you watch like a hawk. And you **test it** — an untested break-glass account is Schrödinger's recovery.
- **KRBTGT rotation on demand.** Can you rotate KRBTGT (twice, with the required interval) the moment you suspect golden tickets — without taking the forest down? Is it rehearsed? If not, you have a theoretical control, not a real one.
- **Fast session revocation / admin disable.** A one-move way to kill a compromised admin's sessions and tokens and disable the account, on both sides of the wall. Rehearse it; the breach is not the time to discover the command.
- **No single human as the only recovery path** — balanced against blast radius. You want enough redundancy that one person under a bus (or under coercion) doesn't end recovery, without so many standing admins that you've recreated the problem. The barbell, again.
- **Tier 0 / forest rebuild path** — links forward to Book V (Recovery). Know it exists, know it's been tested, know it doesn't secretly depend on a credential that the incident just compromised.
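The "twice, with the required interval" requirement for KRBTGT rotation above follows from the account keeping its current and previous keys: a ticket signed with either is still accepted, so a single rotation leaves a stolen key valid. The sketch below is a simplified model of that reasoning. The 10-hour figure is the commonly cited default maximum ticket lifetime and is an assumption here; check the Kerberos policy actually in force.

```python
from datetime import datetime, timedelta

# Assumption: default maximum user-ticket lifetime; verify against your domain policy.
MAX_TICKET_LIFETIME = timedelta(hours=10)
KEYS_ACCEPTED = 2  # the current KRBTGT key plus the previous one


def stolen_key_still_works(rotations_since_theft: int) -> bool:
    """A stolen key is honoured until it has been pushed out of the two-key history."""
    return rotations_since_theft < KEYS_ACCEPTED


def earliest_second_rotation(
    first_rotation: datetime,
    max_ticket_lifetime: timedelta = MAX_TICKET_LIFETIME,
) -> datetime:
    """Wait out legitimate tickets issued under the old key before rotating again."""
    return first_rotation + max_ticket_lifetime


first = datetime(2025, 1, 1, 8, 0)
print(stolen_key_still_works(1))
print(stolen_key_still_works(2))
print(earliest_second_rotation(first))
```

The model ignores replication latency between domain controllers, which also has to complete before the second rotation. In a confirmed compromise you may choose to rotate sooner and accept the breakage; that trade-off is exactly what the rehearsal is for.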
---
## 5. Stressor — break it on purpose
- **Pull an admin's standing access and route them through PIM for a week.** Does real work still flow? If JIT activation is too slow or broken, people will route around it — and you'll have found that in a drill instead of discovering the shadow standing-admin account they created in revenge.
- **Kerberoast yourself.** Run the attack against your own directory. Which service accounts crack? Did anything *detect* the ticket requests? Two findings in one cheap test.
- **Attempt a tier violation in a test window.** Try to use a Tier 0 credential on a Tier 2 box. Is it blocked? Detected? Silent? Silence is the worst answer and the most common.
- **Run attack-path analysis as routine, not as a once-a-year pentest.** Tools that map "who can reach Domain Admin / Global Admin in N hops" turn privilege escalation into a number you can track over time. **The count of paths to domain/tenant dominance is a better security metric than any compliance percentage.** Drive it down; watch it not creep back up.
- **Simulate a malicious consent grant / over-permissioned app.** Register an app requesting a dangerous Graph scope. Does anything flag it? Can you find every existing app holding those scopes today? (You should be able to. Most can't.)
- **Break-glass drill** — yes, again, and on a schedule. The recurring test in this whole handbook.
Per Book I principle 6: each of these must yield a **structural** change — a removed right, a severed path, a new alert — not a note that says "be careful."
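The attack-path metric above can be approximated with a breadth-first search over a "can take control of" graph. This sketch counts the principals that can reach a target within N hops and does not enumerate every distinct path. The edge list is a hypothetical stand-in for whatever a real collection tool exports (group membership, password-reset rights, `GenericAll`, and so on).

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source, target): source can take control of target


def principals_reaching(edges: Iterable[Edge], target: str, max_hops: int) -> Dict[str, int]:
    """Return every node that can reach `target` within `max_hops`, with its hop count."""
    reverse: Dict[str, Set[str]] = {}
    for src, dst in edges:
        reverse.setdefault(dst, set()).add(src)

    hops = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for pred in reverse.get(node, ()):
            if pred not in hops:
                hops[pred] = hops[node] + 1
                queue.append(pred)

    del hops[target]
    return hops


edges = [
    ("alice", "Helpdesk"),             # group membership
    ("Helpdesk", "svc-backup"),        # can reset its password
    ("svc-backup", "Domain Admins"),   # group membership
    ("bob", "Domain Admins"),          # group membership
    ("carol", "alice"),                # GenericAll on the user object
]
reachable = principals_reaching(edges, "Domain Admins", max_hops=3)
print(len(reachable), reachable)
```

Run on a schedule, the length of that result is the number to track: removing the Helpdesk password-reset right in this toy graph would cut off `Helpdesk`, `alice`, and `carol` in one structural change.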
---
## Honest uncertainty (the moving parts — verify, don't trust this book)
Stable and Lindy (teach with confidence): standing privilege is the core risk; the tier / clean-source model; Kerberoasting, pass-the-hash, golden/silver tickets, DCSync; the gMSA pattern; JIT/eligibility as the goal. These don't churn.
What moves, and what you must verify against current Microsoft documentation:
- **The management overlay pattern** (covered in §3 above) is stable in principle — the T0/T1 split, the clean-source reasoning for isolating the management plane, the cloud admin VM as the deployable PAW substitute. What moves: the specific tooling. Nebula's CA and ACL model, Tailscale's per-session MFA configuration and OIDC integration, and the Windows 365 / AVD provisioning model all evolve. Verify current implementation guidance before deploying, and confirm Tailscale's key-expiry and IdP enforcement behaviour is still available as described.
- **PIM capabilities, role definitions, and the risk classification of specific Graph permissions** evolve continually. Confirm which scopes are escalation-grade *today* rather than trusting a 2026 list.
- **On-prem JIT/PAM tooling is genuinely weaker and more fragmented than the cloud story.** Native time-bound group membership, MIM PAM, and third-party PAM all have trade-offs that shift. Don't promise a client a clean AD-native JIT experience without checking current reality — and be honest that on-prem eligibility is harder than PIM makes cloud look.
- **gMSA vs dMSA.** gMSA is the established, Lindy answer for managed service accounts. **dMSA** (delegated managed service accounts, introduced with the Windows Server 2025 generation) targets the real gap — migrating a standing service account and disabling the original — but newer mechanisms carry newer attack surface, and there has been published privilege-escalation research against the dMSA migration path. **Verify current patch and hardening guidance before you recommend dMSA**; this is exactly the kind of new-and-shiny that Book I principle 8 warns about. gMSA until you've checked dMSA's current state.
- **Enterprise Access Model vs the classic three-tier model** — same logic, evolving names and cloud extensions. Use whichever vocabulary the client knows; don't get religious about the label.
If a client's safety hinges on a current specific, look it up and cite it. "I need to verify the current Graph permission classification" beats confidently quoting a stale one. That posture *is* the independence this handbook is trying to build.
---
## Consolidated judgement prompts
- How many identities hold **standing** privilege — human, service account, and service principal — counted, named, and owned? (If you can't produce the number, that's finding #1.)
- For each privileged credential: leaked tonight, how long is it useful and how far does it reach? Where's the wall?
- Where are the tier violations? Which high-tier credentials touch low-tier systems? Does any admin's daily laptop reach Tier 0?
- Which service accounts are Kerberoastable? Which app registrations hold escalation-grade Graph permissions or non-expiring secrets?
- Are cloud admins cloud-only and phishing-resistant, or synced and push-MFA'd? (Book II firebreak — verify it's actually enforced here.)
- Does privilege **evaporate when idle** (PIM/JIT) or sit loaded on the table?
- Is ADCS treated as Tier 0? When was KRBTGT last rotated? Is LAPS deployed?
- Break-glass: does it exist, is it monitored to scream on use, and when was it last *tested* — not created, tested?
- How many paths to Domain Admin / Global Admin exist right now, and is that number going up or down?
- What does an admin use to reach a domain controller remotely — and if that path is compromised, what does the attacker get? Is the management access path independent of the estate it manages?
- Are privileged credentials ever typed into or stored on a device that is also used for email and browsing? If yes, the session isolation that PAWs are meant to provide does not exist, regardless of what the policy says.
---
*Book III of the Antifragile Handbook. Privilege is blast radius with a clock on it. Shrink the reach, shrink the time, and watch the credentials that can rebuild the world. Move fast and fix things.*