# The Antifragile Handbook for M365 & Active Directory
## Book II — Hybrid Identity
> *Draw the wall between on-prem and cloud. In most estates there isn't one — there's a hallway with the door propped open.*
---
## Why this is the keystone
If you only ever fix one domain, fix this one. Every other book — privileged access, devices, data — assumes identity holds. In a hybrid M365 + AD estate, identity usually doesn't hold, and the reason is always the same: on-prem AD and Entra ID are not two systems with a guarded border. They are **one organism wearing two badges**, joined by a bridge that most organisations cannot draw, do not monitor, and have never tested severing.
The governing question, applied here:
> **If on-prem AD is ransomwared or domain-dominated tonight, does the cloud survive — or is it already poisoned by inheritance?**
For the overwhelming majority of estates the honest answer is "poisoned," and nobody has ever said it out loud. Your job is to say it out loud, then build the wall.

---
## 1. Fragility inventory — anatomy of the bridge
You cannot harden what you can't draw. Here is the bridge, piece by piece, with the blast radius of each. Learn to find all of these on day one of an engagement.
### The sync engine (the single most dangerous server you'll forget about)
Entra Connect Sync (the old Azure AD Connect) or Entra Cloud Sync runs the synchronisation. Whatever the diagram says, **this server is Tier 0** — because of the accounts it holds:
- **The on-prem connector account.** Under the old "express" install, this account was granted *Replicate Directory Changes* and *Replicate Directory Changes All* — which is **DCSync**. That means the sync server holds an identity that can pull every password hash in the domain. Read that again. The box your infra team treats as a middling utility VM can dump the entire domain.
- **The Entra connector account** (Directory Synchronization Accounts role) — can manipulate synced objects in the cloud.
So: compromise the sync server → DCSync on-prem **and** tamper with cloud objects. One box, both kingdoms. If this server is domain-joined to the production domain (it usually is), then anything that reaches prod-tier reaches your DCSync machine. That is the central coupling of the entire estate.
**Where it's worse than you think:** the sync server often has open outbound internet access for updates, runs a local SQL Express instance nobody patches, sits on the OS build from the project that installed it, and has not had its connector account rights reviewed since go-live.
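
To make the connector-account check concrete, here is a minimal sketch that looks for principals holding both replication rights on the domain root. The two GUIDs are the standard extended-right identifiers for *Replicate Directory Changes* and *Replicate Directory Changes All*. Everything else is an assumption for illustration: the `aces` record format, the function name, and the sample principals are hypothetical, and a real ACL export will need its own parsing.

```python
# Extended-right GUIDs for the two replication permissions named above.
GET_CHANGES = "1131f6aa-9c07-11d1-f79f-00c04fc2dcd2"      # Replicate Directory Changes
GET_CHANGES_ALL = "1131f6ad-9c07-11d1-f79f-00c04fc2dcd2"  # Replicate Directory Changes All


def principals_with_dcsync(aces):
    """Return the sorted principals that hold BOTH replication rights.

    `aces` is a hypothetical export format: an iterable of dicts with
    'principal', 'object_type' (extended-right GUID) and 'access' keys.
    """
    rights_by_principal = {}
    for ace in aces:
        if ace["access"].lower() != "allow":
            continue  # deny entries never grant the right
        guid = ace["object_type"].lower()
        if guid in (GET_CHANGES, GET_CHANGES_ALL):
            rights_by_principal.setdefault(ace["principal"], set()).add(guid)
    return sorted(
        principal
        for principal, rights in rights_by_principal.items()
        if {GET_CHANGES, GET_CHANGES_ALL} <= rights
    )


# Made-up sample data: one connector-style account with both rights,
# one service account with only half of the pair.
sample_aces = [
    {"principal": "CORP\\MSOL_0123abcd", "object_type": GET_CHANGES, "access": "Allow"},
    {"principal": "CORP\\MSOL_0123abcd", "object_type": GET_CHANGES_ALL.upper(), "access": "Allow"},
    {"principal": "CORP\\svc-backup", "object_type": GET_CHANGES, "access": "Allow"},
]
print(principals_with_dcsync(sample_aces))
```

This sketch only catches the two explicit rights. Broader grants such as full control over the domain object imply the same capability and need a separate check.
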
### The authentication method (decides whether the cloud lives or dies with AD)
Three options, three completely different fragility profiles. Know which one you're actually on before you say anything — the diagram and the reality often disagree.
- **Password Hash Sync (PHS).** A hash-of-a-hash is synced to Entra, so the cloud can authenticate on its own. *This is the most resilient option for availability*: if on-prem dies, cloud auth keeps working. The synced value is not trivially reversible to the plaintext password; the risk is **not** "PHS leaks passwords," it's that the connector account doing the sync can DCSync. Don't let anyone fragilise availability to "fix" a risk that lives in the connector account, not the hash.
- **Pass-through Authentication (PTA).** Credentials are validated against on-prem AD in real time by PTA agents. **Coupling: on-prem outage = cloud auth outage.** Worse, the agent must handle the credential to validate it, so a compromised PTA agent is a plaintext-credential harvesting position. PTA agents are Tier 0 and a juicy target, and PTA is a conduit, not a firebreak. (You can enable PHS *alongside* PTA as failover — cheap optionality, see §4.)
- **Federation / AD FS.** The catastrophe. See below — it gets its own treatment because it's usually the single largest fragility in the estate.
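
The three profiles above reduce to a small decision table. The sketch below encodes it so the answer to "does the cloud survive on-prem death?" can be stated per estate. The enum, the function name, and the three result strings are hypothetical. The "after switch" result assumes, as I understand current Microsoft guidance, that falling back to PHS is a manual change rather than an automatic failover; verify that before relying on it.

```python
from enum import Enum


class AuthMethod(Enum):
    PHS = "password hash sync"
    PTA = "pass-through authentication"
    FEDERATED = "federation / AD FS"


def cloud_auth_on_onprem_outage(primary, phs_also_enabled=False):
    """Classify what happens to cloud sign-in if on-prem AD is unavailable.

    Returns one of three illustrative strings:
      'survives'              cloud validates credentials on its own
      'survives after switch' a pre-staged PHS fallback exists but must be activated
      'goes down'             sign-in depends entirely on on-prem
    """
    if primary is AuthMethod.PHS:
        return "survives"
    if phs_also_enabled:
        # PTA or federation with hashes already synced: the escape hatch
        # exists, but someone has to pull the lever.
        return "survives after switch"
    return "goes down"


for method in AuthMethod:
    for backup in (False, True):
        print(method.name, backup, "->", cloud_auth_on_onprem_outage(method, backup))
```

Treat the output as a hypothesis about the estate, not a finding. The game-day in §5 is what proves it.
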
### AD FS and Golden SAML (the thing that ends careers)
If AD FS issues tokens, then anyone who steals the **token-signing key** can forge a SAML assertion for *any* user, bypassing MFA when MFA is enforced at the federation layer, and the cloud will trust it because it's validly signed. This is **Golden SAML**. It is how nation-state actors turned a single on-prem foothold into silent, total, persistent cloud impersonation (the SolarWinds intrusions). It is nearly invisible: the forged tokens carry a legitimate signature, so there's no failed login, no anomalous password, nothing for a benchmark to catch.
The token-signing certificate is a single catastrophic point of failure that most orgs never rotate, store poorly, and don't monitor. If you take one thing from this book: **AD FS is fragility incarnate, and the correct long-term answer is to remove it** (§2), not to harden it.
### Seamless SSO (the forgotten Kerberos key)
Seamless SSO creates the `AZUREADSSOACC` computer account in AD. Its Kerberos decryption key, if never rotated (it usually never is), is a silver-ticket / token-forging exposure. Classic Lindy fragility: old, unrotated, forgotten, exploitable.
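
One way to turn "never rotated" into a number: read the `pwdLastSet` attribute of the `AZUREADSSOACC` account and compute its age. The sketch below assumes that attribute reflects the last key rollover and that you already have its raw value, which AD stores as a FILETIME (100-nanosecond ticks since 1601-01-01 UTC). The function names and the 30-day default are assumptions; set the threshold from current Microsoft guidance.

```python
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)
UNIX_EPOCH_AS_FILETIME = 116_444_736_000_000_000  # 1970-01-01 expressed as FILETIME
TICKS_PER_DAY = 864_000_000_000                   # 86,400 s * 10,000,000 ticks/s


def filetime_to_datetime(filetime):
    """Convert an AD FILETIME integer to a timezone-aware UTC datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


def key_age_days(pwd_last_set, now):
    """Whole days between the last key change and `now`."""
    return (now - filetime_to_datetime(pwd_last_set)).days


def rotation_finding(pwd_last_set, now, max_age_days=30):
    """Return a one-line verdict; the threshold is an assumed policy value."""
    age = key_age_days(pwd_last_set, now)
    if age > max_age_days:
        return f"FINDING: key is {age} days old (limit {max_age_days})"
    return f"ok: key is {age} days old"


# Illustrative values only: a key last set on 2020-01-01, checked on 2020-03-01.
sample_pwd_last_set = UNIX_EPOCH_AS_FILETIME + 18_262 * TICKS_PER_DAY
sample_now = datetime(2020, 3, 1, tzinfo=timezone.utc)
print(rotation_finding(sample_pwd_last_set, sample_now))
```

In a real engagement the age is usually measured in years, which is the finding.
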
### The writebacks (reverse conduits nobody counts)
Every writeback turns the bridge two-way and creates *reverse* blast radius:
- **Password writeback** — cloud SSPR can change on-prem passwords. Useful; also a path from cloud to on-prem.
- **Device writeback / group writeback** — cloud objects written into AD. Group writeback (v2), where cloud security groups become AD objects that gate on-prem resource access, means a **cloud group compromise now affects on-prem access** — a coupling people rarely diagram.
Each writeback may be justified. None should be silent. Count them, name the blast radius of each.
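
"Count them, name the blast radius" can be kept as a tiny inventory that refuses to let a writeback stay silent. This is a sketch of the bookkeeping only; the `Writeback` fields, the function name, and the sample entries are hypothetical, and the enabled flags have to come from your own review of the sync configuration.

```python
from dataclasses import dataclass


@dataclass
class Writeback:
    name: str
    enabled: bool
    owner: str         # empty string means nobody has claimed it
    blast_radius: str  # what a cloud-side compromise can now touch on-prem


def silent_writebacks(inventory):
    """Names of enabled writebacks lacking a named owner or a stated blast radius."""
    return [
        w.name
        for w in inventory
        if w.enabled and not (w.owner.strip() and w.blast_radius.strip())
    ]


# Made-up example inventory.
inventory = [
    Writeback("password writeback", True, "identity team", "cloud SSPR can reset on-prem passwords"),
    Writeback("group writeback", True, "", ""),
    Writeback("device writeback", False, "", ""),
]
print(silent_writebacks(inventory))
```
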
### The admin coupling (one organism, two badges)
The deepest fragility isn't a setting, it's the people and accounts:
- The same humans are Domain Admins **and** Global Admins.
- Cloud admin accounts are **synced from on-prem**, so on-prem compromise → harvest → cloud admin.
- Admins use the same workstation for AD and Entra, and that workstation is also their email/MFA device.
If on-prem privilege flows into cloud privilege through any of these, there is no wall. There's a hallway.
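
The first two couplings in the list above can be checked mechanically once you have an account inventory. The sketch below flags privileged Entra accounts that are synced from on-prem, and humans who hold privilege on both sides. The `Account` shape, the `person` mapping, and the sample data are assumptions; building that mapping from real exports is the actual work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    upn: str
    person: str           # the human behind the account
    directory: str        # "ad" or "entra"
    role: str             # privileged role name, or "" for none
    synced: bool = False  # True if the Entra object is sourced from on-prem AD


def coupling_findings(accounts):
    """Return human-readable findings for privileged Entra accounts."""
    onprem_admins = {a.person for a in accounts if a.directory == "ad" and a.role}
    findings = []
    for a in accounts:
        if a.directory != "entra" or not a.role:
            continue
        if a.synced:
            findings.append(f"{a.upn}: privileged cloud role on a synced account")
        if a.person in onprem_admins:
            findings.append(f"{a.upn}: {a.person} also holds on-prem privilege")
    return findings


# Made-up sample data.
accounts = [
    Account("da-alice@corp.example", "alice", "ad", "Domain Admin"),
    Account("alice@corp.example", "alice", "entra", "Global Administrator", synced=True),
    Account("ga-bob@corp.onmicrosoft.com", "bob", "entra", "Global Administrator"),
]
for finding in coupling_findings(accounts):
    print(finding)
```

The shared-workstation coupling does not show up in an account export. That one you find by watching how admins actually work.
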
### Source of authority (why you can't fix it in the cloud)
For synced objects, **on-prem is authoritative**. You cannot durably fix a synced object purely cloud-side; the next sync cycle overwrites you. This matters enormously in incident response: if AD is owned, your cloud objects are downstream of poison and "just fix it in Entra" doesn't hold.

---
## 2. Via negativa — what to remove (in priority order)
Hybrid identity is where subtraction pays the highest dividend in the whole estate. In rough order of leverage:
1. **Remove AD FS. Migrate to cloud authentication** (PHS, or PTA if you have a hard real-time-validation requirement), and move MFA and access decisions to Conditional Access in Entra where they belong. This deletes Golden SAML as a class, shrinks attack surface massively, and removes a SPOF you were never rotating anyway. This is the single highest-leverage deletion in this book.
2. **Stop syncing privileged on-prem accounts to the cloud.** Domain Admins, Enterprise Admins, Tier 0 — filter them *out* of sync scope. They have no business being cloud objects. A synced privileged account is a free bridge for the attacker.
3. **Make cloud admins cloud-only.** Global Admins and other Entra privileged roles should be held by cloud-only accounts (`.onmicrosoft.com`), protected by phishing-resistant MFA, never derived from or synced with on-prem identity. This is the firebreak in one move (see §3).
4. **Trim the writebacks.** Keep only the ones with a named owner and a justified reverse blast radius. Delete the rest.
5. **Rotate or remove Seamless SSO.** If you don't need it, remove the `AZUREADSSOACC` account. If you keep it, rotate the key on a schedule — and the fact that nobody has is itself a finding.
6. **Reduce sync scope.** OU-filter aggressively. Don't sync what the cloud doesn't need. Every synced object is attack surface and a potential bridge. The default "sync everything" is laziness, not architecture.
For each deletion the test from Book I applies: *if I removed this, would anyone notice in 90 days?* For AD FS the honest answer, after migration, is usually "no — and the attackers will notice it's gone."
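
Items 2 and 6 in the list above come down to one question: does any privileged on-prem account sit inside sync scope? A rough way to test that against an OU filter, assuming the filter is a set of included and excluded OUs where the nearest listed ancestor decides. The function names, the matching rule, and the sample DNs are all hypothetical; check how your sync tool actually evaluates its filter before trusting this.

```python
def is_in_sync_scope(dn, included_ous, excluded_ous):
    """True if the nearest listed ancestor of `dn` is an included OU.

    Matching is a case-insensitive suffix test on the distinguished name,
    which is a simplification of real DN comparison.
    """
    dn_lower = dn.lower()
    best_length, best_included = 0, False
    rules = [(ou, True) for ou in included_ous] + [(ou, False) for ou in excluded_ous]
    for ou, included in rules:
        if dn_lower.endswith("," + ou.lower()) and len(ou) > best_length:
            best_length, best_included = len(ou), included
    return best_included


def privileged_in_sync_scope(privileged_dns, included_ous, excluded_ous=()):
    """Privileged account DNs that would be synced to the cloud."""
    return [dn for dn in privileged_dns if is_in_sync_scope(dn, included_ous, excluded_ous)]


# Made-up example: the whole domain is in scope except a Tier 0 OU.
included = ["DC=corp,DC=example"]
excluded = ["OU=Tier0,DC=corp,DC=example"]
privileged = [
    "CN=da-alice,OU=Tier0,DC=corp,DC=example",
    "CN=ea-bob,OU=IT,DC=corp,DC=example",
]
print(privileged_in_sync_scope(privileged, included, excluded))
```

Here the second account is the finding: privileged, outside the excluded OU, and therefore synced.
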

---
## 3. The barbell — what gets paranoia, what stays cheap
**The irreplaceable few (paranoid protection, redundancy, monitoring):**
- **The sync server.** Treat it as Tier 0 *in practice*, not just on the diagram: dedicated admin tier, no internet browsing, hardened OS, least-privileged connector account (use a gMSA; strip DCSync rights if your topology allows the scoped permission model, bearing in mind that PHS itself needs the replication rights), restricted logon, alerting on the connector account's behaviour.
- **The connector accounts.** Least privilege, gMSA where supported, monitored. An account that can DCSync should scream in your SIEM if it ever behaves like a domain controller from the wrong host.
- **The AD FS token-signing key** — if AD FS still exists, the key belongs in an HSM, monitored, rotated on a real schedule (remember the rollover cert). But the better barbell move is §2.1: don't own this liability at all.
- **Cloud-only break-glass Global Admins** (from Book I) — phishing-resistant, excluded from the CA policy that would lock them out, tested.
**The firebreak — the one design decision that builds the wall:**
> **Cloud privilege must not be reachable from on-prem compromise.**
Cloud-only admin accounts + not syncing privileged on-prem accounts + separate privileged workstations = on-prem can fall completely and the attacker still hits a wall at the cloud admin boundary. *That wall is the entire point of this book.* Draw it, then verify an attacker can't walk around it through the sync server (which is why the sync server is in the paranoid bucket).
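
"Verify an attacker can't walk around it" is a reachability question, and a drawn bridge is a directed graph. The sketch below runs a breadth-first search from an on-prem starting point to a cloud privilege node and returns the first path it finds, or `None` if the wall holds. The node names and edges are invented for illustration; the value is in the edges you discover on the engagement, not in the search.

```python
from collections import deque


def find_path(edges, start, goal):
    """Breadth-first search over directed (source, target) edges.

    Returns the shortest list of nodes from start to goal, or None.
    """
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Hypothetical estate before the firebreak: a synced admin account bridges the two sides.
edges_before = [
    ("on-prem Domain Admin", "sync server"),
    ("sync server", "Entra connector account"),
    ("Entra connector account", "synced cloud objects"),
    ("on-prem Domain Admin", "synced admin account"),
    ("synced admin account", "Global Administrator"),
]
# After the firebreak: cloud admins are cloud-only, so that bridge edge is gone.
edges_after = [e for e in edges_before if e[0] != "synced admin account"]

print(find_path(edges_before, "on-prem Domain Admin", "Global Administrator"))
print(find_path(edges_after, "on-prem Domain Admin", "Global Administrator"))
```

A `None` result only means no path exists in the graph you drew. Every edge you failed to draw is still a hole.
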
**Everything else stays cheap.** Standard user sync, normal device registration, the bulk of the directory — these are replaceable and don't deserve the attention that the sync server and the admin boundary demand. Don't gold-plate the directory while the connector account can dump it.

---
## 4. Optionality & recovery — escape hatches, tested
- **The "kill the sync" runbook.** A written, rehearsed procedure to stop sync fast when on-prem is compromised, so poison stops flowing cloud-ward. Know the nuance per auth method, because severing behaves differently:
  - *PHS:* disabling sync stops new changes flowing, but already-synced hashes remain. That is containment of *propagation*, not instant revocation. Pair it with token revocation and credential resets.
  - *PTA / Federation:* severing the bridge can take cloud auth down with it unless you've pre-staged a fallback, which is why the next item exists.
- **Pre-stage the federated-to-managed conversion.** Know, in advance, how to convert the domain from federated to managed authentication, or switch the sign-in method from PTA to PHS, *fast*, so that during an on-prem incident you can cut the dependency and keep the cloud alive on its own. Rehearse it. "We think we could" is not a plan.
- **PHS as failover under PTA.** Cheap optionality: run PHS alongside PTA so a PTA-agent or on-prem outage doesn't lock everyone out of the cloud. Small certain cost now, large uncertain payoff later. Classic Book I optionality.
- **Cloud-only admin path that survives AD death.** Because cloud admins are cloud-only (§3), you retain full control of the tenant even if AD is gone. This *is* the recovery path — verify it actually works without any on-prem dependency (including MFA that doesn't secretly route through on-prem).
- **Accept the source-of-authority reality.** Your IR plan must account for the fact that synced objects are downstream of on-prem. Decide *in advance* whether, during a domain-dominance incident, you sever first and rebuild authority cloud-side. Discovering this mid-incident is how recoveries fail.
---
## 5. Stressor — break it on purpose
Untested = broken. Game-days for hybrid identity, smallest/safest first:
- **Pull the sync server** (planned window). Does cloud auth survive? The answer *proves* which auth method you're really on and whether your availability assumptions are true. Most teams are surprised. That surprise is the point.
- **Revoke / disable the connector account and watch your SIEM.** Did anything alert? An account that can DCSync going dark, or behaving oddly, should be the loudest alarm you own. If nothing fired, you've found a detection gap worth more than any control you could add.
- **Golden SAML tabletop** (if AD FS exists). Walk through: attacker has the token-signing key — what do you detect, how do you contain, how fast can you rotate, and could you tell at all? If the honest answer is "we couldn't tell," escalate the §2.1 removal from "roadmap" to "now."
- **Break-glass under sync-down.** Test the cloud-only break-glass account *while the bridge is severed*. It must work with zero on-prem dependency. If it silently relied on something on-prem, you just found it on a Tuesday instead of during the breach.
- **DCSync detection drill.** Have someone simulate DCSync from an unexpected host and confirm detection fires. The connector account is the one place DCSync is "normal," which is exactly why attackers love to look like it.
Every one of these, per Book I principle 6: whatever breaks must produce a **structural** change, not a calendar reminder.
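
For the DCSync detection drill, it helps to write down what "fires" should mean before the game-day. The sketch below filters directory-access audit events (Windows event ID 4662) whose properties include a replication right, and alerts unless the account and source host pair is on an explicit allow-list. The record format is hypothetical, and `source_host` in particular is an assumption: it presumes your SIEM has already enriched the event with the origin host, which the raw event does not carry by itself.

```python
REPLICATION_GUIDS = {
    "1131f6aa-9c07-11d1-f79f-00c04fc2dcd2",  # Replicate Directory Changes
    "1131f6ad-9c07-11d1-f79f-00c04fc2dcd2",  # Replicate Directory Changes All
}


def dcsync_alerts(events, expected_replicators):
    """Return events that request replication rights from an unexpected place.

    `events`: iterable of dicts with 'event_id', 'account', 'source_host'
              and 'properties' (list of GUID strings). Hypothetical format.
    `expected_replicators`: set of (account, source_host) pairs, lower-case.
    """
    alerts = []
    for event in events:
        if event.get("event_id") != 4662:
            continue
        requested = {p.lower() for p in event.get("properties", [])}
        if not requested & REPLICATION_GUIDS:
            continue
        origin = (event["account"].lower(), event["source_host"].lower())
        if origin not in expected_replicators:
            alerts.append(event)
    return alerts


# Made-up drill data: the connector account is normal from the sync server only.
expected = {("dc01$", "dc01"), ("msol_0123abcd", "sync01")}
guid = "1131f6ad-9c07-11d1-f79f-00c04fc2dcd2"
events = [
    {"event_id": 4662, "account": "DC01$", "source_host": "DC01", "properties": [guid]},
    {"event_id": 4662, "account": "MSOL_0123abcd", "source_host": "SYNC01", "properties": [guid]},
    {"event_id": 4662, "account": "MSOL_0123abcd", "source_host": "WS-042", "properties": [guid.upper()]},
    {"event_id": 4624, "account": "MSOL_0123abcd", "source_host": "WS-042", "properties": []},
]
for alert in dcsync_alerts(events, expected):
    print("ALERT:", alert["account"], "from", alert["source_host"])
```

The allow-list is the point: the connector account is permitted to replicate, but only from the one host where that is normal.
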

---
## Honest uncertainty (read this, don't trust a handbook on moving parts)
This book teaches stable mechanisms — the coupling between AD and Entra, Golden SAML, the DCSync-via-connector path, the PHS/PTA/federation trade-offs. Those don't change much; they're Lindy.
What **does** move, and what you must verify against current Microsoft documentation rather than trusting any 2026-vintage handbook:
- **Connect Sync vs Cloud Sync feature parity.** Microsoft has been steering new deployments toward the lighter Cloud Sync agent (no SQL, multiple agents for HA — better optionality), but parity for specific scenarios (certain writebacks, device sync, large/complex topologies, passthrough nuances) has been evolving. **Check the current parity matrix before you recommend a migration.** Don't let me, or any document, freeze this for you.
- **AD FS deprecation / migration tooling.** Direction of travel is clearly away from AD FS toward Entra-native auth, with staged-rollout and migration tooling to ease it. Exact timelines, tool capabilities, and supported paths shift — verify current state when you scope the work.
- **Connector account hardening guidance** (gMSA support, least-privilege permission models, the scoped alternative to full DCSync rights) continues to improve — confirm what's available for your topology and version.
If a client's safety depends on a current-version specific, **look it up and cite it**, don't quote your memory or this book. Honest "I need to verify the current parity" beats confident and wrong every time. That's not weakness; that's the job.

---
## Consolidated judgement prompts
The questions to carry into any hybrid estate:
- Which auth method are we *actually* on — and does the cloud survive on-prem death? (Verify by testing, not by asking.)
- Is the sync server Tier 0 in practice or only on the diagram? What can its connector account reach? Can it DCSync?
- Are any privileged on-prem accounts synced to the cloud? Are Global Admins cloud-only or synced?
- Can on-prem privilege reach cloud privilege by *any* path — accounts, workstations, the sync server, writebacks? Draw every path. Each one is a hole in the wall.
- Do we have AD FS? *Why?* What exactly would removing it take, and what's the honest reason it hasn't happened?
- When was the Seamless SSO key / AD FS token-signing cert last rotated? ("Never" is a finding, not an answer.)
- Which writebacks are on, and what reverse blast radius does each create?
- If we severed the bridge in the next 30 minutes, what breaks, and is the procedure written where someone panicking can run it?
---
*Book II of the Antifragile Handbook. The wall between on-prem and cloud is the most important structure you will ever draw — because in most estates, it isn't there. Move fast and fix things.*