The Antifragile Handbook for M365 & Active Directory

Book VI — Recovery & Detection-as-Feedback

Robust means you survive the shock unchanged. Antifragile means you come back stronger. The shock is coming either way — the only choice is what you do with it.


The governing question

This is the capstone, because it's the book that decides whether everything before it was merely robust or genuinely antifragile. The first five books harden the estate; this one builds the machine that turns every shock into improvement. Ask:

When — not if — this fails, do you come back weaker, the same, or stronger?

A fragile estate comes back weaker (if at all). A robust estate comes back the same and waits for the next identical hit. An antifragile estate comes back different and harder to hit the same way twice — because it ran the shock through a feedback loop and changed its own structure. That loop is the entire subject of this book.

The reframe that powers it: most organisations treat detection and recovery as the sad afterthought — the thing they hope never to need. Invert it. Incidents, alerts, failed drills, and near-misses are the most valuable intelligence the system ever produces — honest, real-world data about where the fragility actually is, bought in the cheapest currency available if you harvest it. The org that buries incidents stays fragile. The org that treats them as fuel becomes antifragile. Your job is to build the machine that converts disorder into structural strength.


1. Fragility inventory — where recovery and detection rot

Backups that have never been restored

The biggest recovery lie in the industry: "we have backups." Having a backup is not the same as being able to recover, and an untested backup is Schrödinger's recovery — simultaneously fine and worthless until someone actually opens the box. Two M365-specific traps make this worse:

  • "Microsoft backs it up for us." Microsoft provides geo-redundancy, recycle bins, and limited native retention, not point-in-time backup against your own ransomware, malicious deletion, or retention expiry. Under the shared-responsibility model, your data is your responsibility. Most tenants have no real, independent, point-in-time M365 backup, and discover this during the incident.
  • Attackers target backups first. Ransomware operators delete or encrypt the backups before they hit production, because they know it's your only way out. A backup reachable from the compromised estate is not a backup; it's another victim.

AD forest recovery: the nightmare nobody rehearses

Recovering a compromised or destroyed AD forest is one of the hardest operations in all of IT — clean OS installs, authoritative restore of one DC per domain, metadata cleanup, double krbtgt reset, trust resets, the whole brutal sequence. Almost no one has practised it. So when ransomware takes AD, "restore from backup" is a multi-day, error-prone, improvised ordeal performed for the first time under maximum pressure. Entra recovery is less apocalyptic but has its own teeth: the hard-delete window for objects, and the fact that tenant configuration (CA policies, Intune, roles) has no native "undo" unless you captured it as code.

Recovery that depends on what the incident destroyed

The fatal circular dependency: backups authenticated by the AD that's down. The recovery runbook stored in the SharePoint that's encrypted. The break-glass that needs the MFA service that's offline. The recovery admin whose credentials the attacker already has. A recovery path that depends on the thing it's recovering is not a recovery path. This is the clean-source principle (Book III) applied to survival: nothing you recover with may depend on anything you are recovering.
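
One way to make this check concrete is to write the recovery path down as a dependency map and ask whether any recovery asset transitively reaches something the incident is assumed to destroy. The following is a minimal sketch under stated assumptions: the map is maintained by hand, and every component name in it is a hypothetical illustration, not a real inventory.

```python
from collections import deque

# Hypothetical dependency map: each component -> the things it needs to work.
DEPENDS_ON = {
    "backup_console": ["prod_ad"],        # backup logins authenticated by prod AD
    "recovery_runbook": ["sharepoint"],   # runbook stored in SharePoint
    "break_glass_account": ["mfa_service"],
    "offline_backup_vault": [],           # separate trust domain, no prod dependency
    "sharepoint": ["entra_id"],
    "mfa_service": ["entra_id"],
}

# Components the scenario assumes are destroyed or attacker-controlled.
ASSUMED_LOST = {"prod_ad", "entra_id", "sharepoint"}


def circular_dependencies(asset, depends_on, assumed_lost):
    """Return the assumed-lost components that `asset` transitively depends on."""
    seen, hits = set(), set()
    queue = deque(depends_on.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        if node in assumed_lost:
            hits.add(node)
        queue.extend(depends_on.get(node, []))
    return hits


for asset in ["backup_console", "recovery_runbook",
              "break_glass_account", "offline_backup_vault"]:
    hits = circular_dependencies(asset, DEPENDS_ON, ASSUMED_LOST)
    status = "fails with the estate" if hits else "independent"
    print(f"{asset}: {status} {sorted(hits)}")
```

The value is in the transitive walk: the break-glass account looks independent at first glance, and only fails because the MFA service behind it depends on the identity provider that is down.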

Detection that fires into a void

Logs not collected. Audit logging never enabled or silently aged out. A SIEM full of alerts nobody triages. And the specific blind spots the earlier books planted: the unmonitored DCSync (Book II), the unwatched break-glass use (Book III), the device-risk signal that dies on a dashboard (Book IV), the BEC forward rule nobody sees (Book V). Detection that nobody acts on is theatre with a subscription fee.

Alert fatigue: the boy who cried wolf, automated

Too many low-fidelity alerts is itself a fragility — the real signal drowns in noise, and the analyst who's dismissed a thousand false positives dismisses the one that mattered. More alerts is not more security; past a point it's less.
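
Fidelity can be measured rather than argued about. The sketch below assumes you can export closed alerts with a triage verdict attached; the rule names, the verdict labels, and both thresholds are hypothetical, and a real SIEM export will look different.

```python
from collections import Counter

# Hypothetical triage export: one (rule_name, verdict) pair per closed alert.
TRIAGED = (
    [("break_glass_signin", "true_positive")] * 2
    + [("dcsync_from_non_dc", "true_positive"),
       ("dcsync_from_non_dc", "false_positive")]
    + [("generic_failed_login", "false_positive")] * 98
    + [("generic_failed_login", "true_positive")] * 2
)


def rule_precision(triaged):
    """Share of each rule's closed alerts that were true positives."""
    total, true_pos = Counter(), Counter()
    for rule, verdict in triaged:
        total[rule] += 1
        if verdict == "true_positive":
            true_pos[rule] += 1
    return {rule: true_pos[rule] / total[rule] for rule in total}


def removal_candidates(triaged, min_precision=0.2, min_volume=20):
    """Rules with enough volume to judge and precision below the threshold."""
    total = Counter(rule for rule, _ in triaged)
    precision = rule_precision(triaged)
    return sorted(
        rule for rule in total
        if total[rule] >= min_volume and precision[rule] < min_precision
    )


print(rule_precision(TRIAGED))
print(removal_candidates(TRIAGED))
```

The volume floor is a deliberate design choice: a rare control-plane rule with a handful of firings is never flagged on precision alone, because two alerts are not enough evidence to call it noise. The output is a list of candidates for tuning or removal, not an automatic delete.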

MTTR that exists only on paper

RTO/RPO numbers in a policy document, never once validated by an actual restore, are fiction. (Book I anti-benchmark: MTTR is measured by doing it, not by declaring it.)

Incidents that close without changing anything

The post-incident review that concludes "remind users to be more careful" has wasted the disorder entirely and guaranteed the recurrence. And a blame culture destroys the feedback loop at the source — if surfacing an incident gets you punished, incidents get buried, and the system goes blind.

No known-good to return to

If your tenant configuration lives only as click-ops in a portal, you have no golden image of "correct," so you can neither rebuild it fast nor detect drift from it — and you can't catch a ghost policy (Book I/IV) because you have nothing to diff against. No config-as-code means no known-good.
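
What "something to diff against" buys you can be shown in a few lines. This sketch assumes the control-plane configuration has already been exported to plain records by whatever tooling you use; it calls no Microsoft API, and the policy names and the two fields shown are illustrative assumptions. "Ghost" is used here to mean an object that exists in the live tenant but not in the baseline.

```python
# Hypothetical known-good baseline, as it would sit in version control.
BASELINE = [
    {"displayName": "Require MFA for admins", "state": "enabled"},
    {"displayName": "Block legacy auth", "state": "enabled"},
]

# Hypothetical export of the live tenant taken today.
LIVE = [
    {"displayName": "Require MFA for admins", "state": "disabled"},
    {"displayName": "Temp exclusion for migration", "state": "enabled"},
]


def diff_config(baseline, live, key="displayName"):
    """Compare a live export to the baseline, matching objects on `key`."""
    base = {item[key]: item for item in baseline}
    current = {item[key]: item for item in live}
    return {
        # In the baseline but gone from the tenant.
        "missing": sorted(base.keys() - current.keys()),
        # In the tenant but never declared in the baseline.
        "ghost": sorted(current.keys() - base.keys()),
        # Present in both, but the settings differ.
        "changed": sorted(
            name for name in base.keys() & current.keys()
            if base[name] != current[name]
        ),
    }


for kind, names in diff_config(BASELINE, LIVE).items():
    print(f"{kind}: {names}")
```

Without the baseline, none of the three findings is detectable: the disabled policy, the deleted policy, and the undeclared exclusion all look like ordinary portal state.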


2. Via negativa — what to remove

  1. Delete the false comfort that Microsoft backs you up. Removing the dangerous belief comes before adding the real backup.
  2. Sever recovery's dependencies on the estate it recovers. Recovery credentials, runbooks, and backups must not depend on prod AD/Entra/SharePoint. Decouple, so the lifeboat doesn't sink with the ship.
  3. Cut alert noise. Ruthlessly remove low-fidelity alerts so the high-fidelity ones become visible. Via negativa applied to detection: fewer, louder, truer.
  4. Remove blame from the post-incident process. Blameless on people so people surface incidents — then ruthless on structure so the incident actually changes something. Removing the incentive to hide protects the feedback loop itself.
  5. Remove click-ops from critical configuration. Move control-plane config (CA, Intune, roles) to code, so a known-good exists to rebuild from and diff against.

3. The barbell — paranoid recovery for the irreplaceable, best-effort for the rest

The irreplaceable few — the identity control plane (Books II/III) and the crown-jewel data (Book V) — get real, tested, immutable, offline/isolated backup and rehearsed recovery. AD forest recovery is practised, not theorised. Recovery objectives for these are measured in a drill, in minutes or hours, not asserted in a policy.

The recovery capability is itself a crown jewel. Backups are a top attacker target, so protect them like break-glass: immutable, offline or in a separate trust domain, unreachable even from full domain dominance. A backup the attacker can reach is not a control.

Everything else is best-effort and tiered. Don't gold-plate recovery for the lunch-menu SharePoint. Tier recovery objectives to value — crown jewels get immutable and fast; bulk collaboration gets good-enough. And concentrate high-fidelity detection on the control-plane and crown-jewel signals (the screaming break-glass, the anomalous DCSync, the impossible-travel admin, the crown-jewel mass-download) rather than spreading shallow alerting evenly across everything.
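
The barbell can be written down as data, which makes the gap between declared and measured objectives visible. This is a sketch only: the tier names, asset names, and every duration are hypothetical placeholders, and `MEASURED` stands in for the timings recorded at your own last drill.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    rto: timedelta          # recovery time objective for this tier
    drill_required: bool    # must the objective be proven by a real restore?


TIERS = {
    "crown_jewel": Tier("crown_jewel", timedelta(hours=4), True),
    "best_effort": Tier("best_effort", timedelta(days=3), False),
}

# Hypothetical asset-to-tier assignment.
ASSETS = {
    "ad_forest": "crown_jewel",
    "entra_config": "crown_jewel",
    "finance_sharepoint": "crown_jewel",
    "lunch_menu_site": "best_effort",
}

# Hypothetical drill timings; None means the asset has never been restored.
MEASURED = {
    "ad_forest": None,
    "entra_config": timedelta(hours=2),
    "finance_sharepoint": timedelta(hours=6),
}


def recovery_gaps(assets, tiers, measured):
    """List crown-jewel assets whose objective is unproven or was missed."""
    findings = []
    for asset, tier_name in sorted(assets.items()):
        tier = tiers[tier_name]
        if not tier.drill_required:
            continue  # best-effort tier: no drill evidence demanded
        actual = measured.get(asset)
        if actual is None:
            findings.append((asset, "never restored: objective is unproven"))
        elif actual > tier.rto:
            findings.append((asset, f"drill took {actual}, objective is {tier.rto}"))
    return findings


for asset, finding in recovery_gaps(ASSETS, TIERS, MEASURED):
    print(f"{asset}: {finding}")
```

Note what the check does not do: it spends no effort on the best-effort tier, which is the point of the barbell.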


4. Optionality & recovery — the heart of the book

  • Tested restores on a schedule. The only proof of recovery is a restore that happened. Make the restore drill routine, time it, and verify integrity — that time is your real MTTR.
  • Immutable + offline/isolated backups — the escape hatch that survives the attacker reaching production. Ransomware-resilient by design, not by hope.
  • Rehearsed AD forest and Entra recovery runbooks, stored independently — on paper or offline, reachable when the estate is dark, not in the SharePoint that's encrypted.
  • Configuration-as-code (IaC) for the control plane — instant rebuild and a known-good baseline to detect drift and ghost configuration against. This single practice serves recovery, drift detection, and the Book I corollary at once.
  • A clean-room / isolated recovery environment — somewhere to rebuild that the attacker isn't already inside.
  • The fail-over-vs-clean-in-place decision pre-made. When do we rebuild rather than try to clean a compromised estate? Decide the criteria before the incident; it's the Book II "sever the sync" decision generalised to the whole estate.
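
The first item, a restore that is timed and integrity-verified, can be sketched as a small harness. It assumes a manifest of SHA-256 hashes was recorded when the backup was taken and is stored outside the estate; `restore` is a placeholder for whatever your backup product actually does, and all function names here are hypothetical.

```python
import hashlib
import time
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each file's path (relative to `root`) to its SHA-256 hash."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def run_restore_drill(restore, target_dir, expected_manifest):
    """Time `restore(target_dir)`, then compare the result to the manifest."""
    started = time.monotonic()
    restore(target_dir)  # placeholder: invoke the real restore here
    elapsed = time.monotonic() - started

    restored = manifest(target_dir)
    missing = sorted(expected_manifest.keys() - restored.keys())
    corrupt = sorted(
        name for name in expected_manifest.keys() & restored.keys()
        if expected_manifest[name] != restored[name]
    )
    return {
        "seconds": elapsed,
        "missing": missing,
        "corrupt": corrupt,
        "ok": not missing and not corrupt,
    }
```

The `seconds` value covers only the restore call itself. A real drill clock should start when the decision to restore is made, since finding the runbook and the credentials is usually part of the delay.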

5. Stressor — the hormesis engine (the climax of the handbook)

This is where the entire handbook either runs or rusts. Everything else is preparation for the loop; this is the loop turning.

  • Live restore of a crown-jewel dataset and the control plane. Not a tabletop — an actual restore, integrity-verified and timed. The number you get is the truth; the number in the policy was always fiction.
  • Rehearse AD forest recovery. The first time you perform the hardest recovery in IT must not be during the real disaster. Run it. Find what's missing. Fix the runbook.
  • Inject attacks end-to-end and follow them all the way through. DCSync, malicious consent, break-glass use, impossible-travel admin, crown-jewel mass-download. Confirm not just that the alert exists, but that it's triaged and that someone acts. Detection that fires into a void fails this test on purpose, so you can fix it.
  • Run a ransomware game-day that assumes Tier 0 is owned and backups are the first target. Watch your decoupling hold or fail.
  • Purple-team as routine, not annually. Standing, escalating, blast-radius-controlled stress: hormesis, not a once-a-year audit ritual.
  • Measure the loop itself. Track time from incident to structural change. If drills and incidents close without a removed right, a severed coupling, or a new firebreak, the loop is broken and you are merely robust.
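
The last item, measuring the loop itself, reduces to two numbers: how often a drill or incident ends in a structural change, and how long that takes. A minimal sketch, assuming you keep one record per drill or incident; the records, dates, and category labels below are hypothetical, with the three structural categories taken from the list above.

```python
from datetime import date
from statistics import median

# Outcomes that count as structural change.
STRUCTURAL = {"right_removed", "coupling_severed", "firebreak_added"}

# Hypothetical records, one per drill or incident.
EVENTS = [
    {"id": "drill-01", "opened": date(2024, 1, 10),
     "change": "right_removed", "changed_on": date(2024, 1, 24)},
    {"id": "inc-02", "opened": date(2024, 3, 2),
     "change": "user_reminder", "changed_on": date(2024, 3, 3)},
    {"id": "drill-03", "opened": date(2024, 5, 6),
     "change": "coupling_severed", "changed_on": date(2024, 5, 13)},
    {"id": "inc-04", "opened": date(2024, 6, 1),
     "change": None, "changed_on": None},
]


def loop_report(events):
    """Summarise how often, and how fast, events produce structural change."""
    structural = [e for e in events if e["change"] in STRUCTURAL]
    wasted = [e["id"] for e in events if e["change"] not in STRUCTURAL]
    days = [(e["changed_on"] - e["opened"]).days for e in structural]
    return {
        "structural_rate": len(structural) / len(events) if events else None,
        "median_days_to_change": median(days) if days else None,
        "wasted": wasted,  # closed with a reminder, or with nothing at all
    }


print(loop_report(EVENTS))
```

A reminder to users is deliberately counted as wasted, alongside events that closed with no change at all. Tracking the median over successive quarters shows whether the loop is getting faster.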

The feedback loop — what makes all six books antifragile

Name the loop explicitly, because it's the thread that ties the whole handbook together and the thing that converts robustness into antifragility:

Detect (see the stressor) → Respond (contain it) → Recover (come back) → Learn structurally (come back stronger) → which feeds back into Removal and redesign across every prior book — a fragilizer deleted (Book I via negativa), a coupling severed (Book II), a standing privilege collapsed (Book III), a device boundary tightened (Book IV), a data flow closed (Book V).

The first three steps are robustness; plenty of organisations reach them and call it security. The fourth step is the whole game. A shock that produces no structural change has been wasted, and the system will meet the same shock again, unchanged. A shock that does produce structural change has made the estate stronger — which is the literal definition of antifragile, and the only honest justification for everything in this handbook.


Honest uncertainty (verify the moving parts)

Stable and Lindy (teach with confidence): untested backup is no backup; attackers hit backups first; recovery must not depend on what it recovers; detection without action is theatre; alert fatigue is fragility; every shock must change the structure. None of that churns — these are the oldest truths in operational security.

What moves, and what you must verify:

  • M365 native backup/retention specifics and the shared-responsibility boundary — what Microsoft does and does not cover, recycle-bin and hard-delete windows — evolve. Verify current reality, and test what you can actually recover rather than trusting either "Microsoft has us covered" or a vendor pitch.
  • Entra recovery and configuration-backup tooling (deleted-object windows, Graph/IaC options for capturing CA, Intune, and roles as code) evolve — verify current capability.
  • AD forest recovery is Lindy in principle (it is brutal; rehearse it), but automation and tooling evolve — confirm the current supported procedure.
  • Detection tooling (the XDR/SIEM signal catalogue) churns continuously. Verify which detections exist today and test them end-to-end; the principle (high-fidelity over noise, tested through to action) is what's permanent.
  • Audit log retention and licensing have changed over time — confirm what's captured and for how long before relying on it for forensics.

If recovery hinges on a current specific, verify it and test it. "We confirmed the restore works and it takes four hours" beats any RTO ever written in a policy.


Consolidated judgement prompts

  • When this fails, do we come back weaker, the same, or stronger? What's the mechanism that makes it stronger?
  • When was a backup of the crown jewels and the control plane last restored — not taken, restored — and how long did it take?
  • Are the backups reachable from the estate they protect? (If yes, they're another victim.) Are they immutable and offline?
  • Has anyone ever rehearsed AD forest recovery? Is the runbook reachable when the estate is dark?
  • Does any part of the recovery path depend on the thing the incident destroyed — credentials, runbook location, MFA, the recovery admin?
  • Does detection fire into action, or into a void? Is there so much noise the real signal is lost?
  • Does control-plane config exist as code (a known-good to rebuild and diff against), or only as click-ops?
  • For the last three incidents and drills: what structural thing changed? If the answer is "a reminder," the loop is broken.
  • How long from incident to structural change — and is that time getting shorter?

Coda — the whole arc

Six books, one idea. Book I is the lens: subtract before you add, protect the irreplaceable, measure blast radius, buy optionality, stress on purpose, and make every shock change the structure, verifying by observation, never by inspection. Books II–V apply that lens to the containers and contents: the identity bridge made a firebreak, privilege collapsed in reach and time, the device assumed hostile and the boundary moved to the data, and the data itself made to carry its own protection as it flows. Book VI is the loop that makes it all antifragile rather than merely robust: the machine that feeds every incident back into removal and redesign.

None of this is a checklist, and if a consultant trained on it ever reaches for "because the benchmark says so," they've missed the point. The point is judgement: draw the wall, find the fragility, fix what matters, and let every stress make the estate stronger than it was.

Move fast and fix things.


Book VI of the Antifragile Handbook, and the close of the arc.