# The Antifragile Handbook for M365 & Active Directory
## Book I — Principles & Judgement
> *Move fast and fix things.*
---
## Why this book exists
This is not a benchmark. It will not give you a number to report to a steering committee. It will not tell you that your tenant is 87% compliant, because that number is a lie that makes everyone feel safe while the building burns.

Compliance frameworks (CIS, NIST, ISO, the lot) answer one question: *did you do the things on the list?* That is a useful question. It is not the important one. The important question is: **when this gets attacked, does it get weaker, stay the same, or get stronger?**

A system that gets stronger from being stressed is antifragile. Almost no M365 + AD estate is antifragile by default. Most are the opposite: a flat domain synced to a cloud tenant, where one phished helpdesk account quietly becomes domain dominance becomes Global Admin. That is fragility wearing a compliance certificate.

A consultant trained on benchmarks knows *what* the settings should be. A consultant trained on this book knows *which settings matter, why, and what breaks if they're wrong*, and can walk into a tenant they've never seen and find the thing that will actually kill the client. That is the difference between a technician and an independent professional. We are trying to raise the second kind.
### What "move fast and fix things" actually means
It is a deliberate edit of the old Silicon Valley creed. The original assumed things were whole and that breaking them was the cost of speed. Our world is the reverse: **the things are already broken.** Legacy auth is still on. Service accounts from 2014 still have domain admin. Nobody has tested the break-glass account since it was created. Speed, here, is not recklessness — it is refusing to let a thirty-page risk-acceptance process protect a fragility that a teenager with a phishing kit will remove for free. So:
- **Fast** — bias to action. A fix shipped this week beats a perfect fix discussed for a quarter. Fragility compounds while you deliberate.
- **Fix** — actually change the structure, not the documentation. A risk you *accepted* is a risk you still have.
- **Things that matter** — and this is the whole craft — the discrimination to know that disabling legacy auth outranks renaming forty GPOs to match a naming standard. Most of the checklist is noise. Find the signal.
### How compliance still fits (read this before you get smug)
We are not anti-compliance. We are anti-*thoughtless* compliance. Your clients have auditors, contracts, and regulators, and you will still help them pass. The relationship is this:
> **Compliance is a floor and a by-product. It is never the target.**

If you build an antifragile estate, you will pass CIS almost by accident, and you will be able to explain *why* every control exists, which is more than most auditors can. But you will also do things no benchmark asks for (game-days, kill-switch drills, deliberate removal of features) and you will *skip* things benchmarks demand when they add fragility or cost without reducing blast radius. When you skip, you skip **on the record, with a written reason**. That is the difference between independent judgement and laziness.

---
## The governing question
Before the principles, the one question that sits above all of them. Ask it of every account, every trust, every sync, every app registration:
> **If this is owned tonight, what is the largest thing an attacker reaches before hitting a wall, and can I draw that wall?**

If you cannot draw the wall, there is no wall. In M365 + AD the wall is almost always missing in the same place: the **identity bridge** between on-prem AD and Entra ID. Internalise this and half the job is done.

---
## The Principles
Nine of them. They overlap on purpose: antifragility is a way of seeing, not a checklist (the irony would be unbearable). Each comes with **judgement prompts**: the questions an independent consultant asks instead of looking up the "correct" value. Learn the questions, not the answers. The answers change with every tenant; the questions don't.

---
### 1. Via Negativa — subtract before you add
The strongest control is the thing that no longer exists. It cannot be misconfigured, cannot be exploited, cannot drift, and costs nothing to maintain. Benchmarks are addition machines: every control is something *more* to deploy and watch. Start the other way: what can we **delete**?

In M365 + AD, the highest-leverage deletions are usually: legacy/basic auth, NTLM and unconstrained delegation, standing privileged role assignments, dormant service accounts and their static secrets, unused federation, public folders, orphaned app registrations with tenant-wide consent, and "temporary" firewall or CA exclusions that became permanent.

**Judgement prompts**
- If I removed this control/feature/account, would *anyone* notice within 90 days? If not, why does it exist?
- What is the oldest thing here still running, and who decided it should keep running — or did nobody decide?
- Every exclusion is a tiny hole punched in a wall. List the exclusions. Who asked for each, and is that person still here?
- Am I about to *add* a control to compensate for something I could *remove* instead?
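
One way to make the first prompt concrete is to rank deletion candidates from an exported account list. This is a minimal sketch, not a tool: the `Account` shape, its field names, and the sample accounts are all hypothetical, and the 90-day window simply mirrors the prompt above. A dormant account is a candidate for a conversation with its owner, not proof that it is safe to delete.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Account:
    # Hypothetical export shape; map your own inventory columns onto it.
    name: str
    last_sign_in: Optional[date]  # None means no sign-in on record
    privileged: bool


def deletion_candidates(accounts, today, dormant_days=90):
    """Return accounts with no sign-in inside the window, privileged ones first."""
    cutoff = today - timedelta(days=dormant_days)
    dormant = [a for a in accounts if a.last_sign_in is None or a.last_sign_in < cutoff]
    # Privileged first, then oldest sign-in first (never-seen sorts as oldest).
    return sorted(dormant, key=lambda a: (not a.privileged, a.last_sign_in or date.min))


# Illustrative data only.
ACCOUNTS = [
    Account("svc-backup-2014", date(2019, 3, 1), privileged=True),
    Account("svc-print", None, privileged=False),
    Account("alice", date(2025, 5, 30), privileged=False),
]

candidates = deletion_candidates(ACCOUNTS, today=date(2025, 6, 1))
print([a.name for a in candidates])
```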
---
### 2. The Barbell — protect the irreplaceable, let the rest stay cheap
Compliance scoring spreads effort evenly: every control worth the same point. Reality is not evenly distributed. A handful of things are irreplaceable: tenant root, Tier 0 / domain controllers, break-glass accounts, backups, the sync engine. Everything else is, in principle, rebuildable.

Put **paranoid, expensive, redundant** protection on the irreplaceable few. Let everything else be **cheap, fast, and replaceable**, even disposable. Do not spend your political capital hardening a kiosk laptop while a Global Admin has no phishing-resistant MFA. The middle (moderate protection spread thinly over everything) is where budgets and attention go to die.

**Judgement prompts**
- Name the five things in this estate that, if lost, cannot be rebuilt. Are they protected differently from everything else, or the same?
- Where is effort being spent evenly that should be spent asymmetrically?
- Is anything in the "cheap and replaceable" bucket actually load-bearing in disguise? (The "temporary" script on someone's laptop that runs payroll.)
- Could I afford to let this thing be *destroyed* and just rebuild it? If yes, stop gold-plating it.
---
### 3. Blast Radius is the metric — not the control count
This is the governing question turned into a habit. Compliance counts inputs (controls present). Antifragility measures **propagation** (how far a compromise travels). A tenant with 200 controls and a flat AD→Entra trust is more fragile than a tenant with 50 controls and a real tier boundary.

The defining fragility of hybrid M365 is **coupling**: Password Hash Sync or PTA, Entra Connect running as a quasi-Tier-0 service, AD admins who are also cloud admins, devices that are both domain-joined and the user's MFA device. Each coupling means one compromise becomes two. Antifragile design **decouples**: it turns the identity bridge from a conduit into a firebreak.

**Judgement prompts**
- Draw the attack path from a single phished standard user to Global Admin. How many *independent* barriers are there? Independent, not "two MFA prompts from the same provider."
- Which single account, if compromised, ends the engagement? How many are there? (If the answer is more than zero, that's the project.)
- If on-prem AD fell completely, would the cloud survive — and vice versa? Or are they one organism wearing two badges?
- What runs the sync, and what could that identity reach? Trace it.
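
The attack-path prompt can be sketched as a reachability search. The example below assumes you have already reduced the estate to a map of "owning X lets an attacker take over Y" edges. The `CONTROLS` map, its node names, and both helper functions are hypothetical illustrations, and real attack-path tooling models far more edge types. It shows the habit rather than a product: measure how far one compromise travels, then measure again after severing a coupling.

```python
from collections import deque

# Hypothetical edges: owning the key lets an attacker take over each value.
CONTROLS = {
    "helpdesk-user": ["helpdesk-workstation"],
    "helpdesk-workstation": ["svc-sync"],
    "svc-sync": ["domain-admin", "entra-connect"],
    "entra-connect": ["global-admin"],
    "domain-admin": [],
    "global-admin": [],
    "kiosk-user": [],
}


def blast_radius(graph, start):
    """Return everything reachable from `start`, excluding `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


def without(graph, removed):
    """Copy of the graph with one node deleted, i.e. a severed coupling."""
    return {n: [m for m in out if m != removed] for n, out in graph.items() if n != removed}


before = blast_radius(CONTROLS, "helpdesk-user")
after = blast_radius(without(CONTROLS, "svc-sync"), "helpdesk-user")
print(len(before), "reachable before;", len(after), "after removing svc-sync")
```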
---
### 4. Optionality — buy cheap escape hatches
Pay a small, certain cost now for the *option* to survive an uncertain disaster later. Break-glass accounts, a tested "kill the sync" runbook, a way to revoke all tokens at once, an offline copy of recovery keys, a documented path to a clean tenant. These look like waste to an auditor and like wisdom on the worst day of the client's year.

Optionality is the opposite of optimisation. An optimised system has no slack and shatters at the first surprise. Deliberately keep some slack.

**Judgement prompts**
- When the primary path fails, what's the second path — and has anyone walked it?
- If we had to sever AD from Entra in the next 30 minutes to contain a breach, *how*? Is that written down where someone panicking can find it?
- Break-glass: does it exist, is it phishing-resistant, is it excluded from the CA policy that would otherwise lock it out, and when was it last *used* in a drill (not just created)?
- What are we optimising so hard that we've removed all room to manoeuvre?
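
The break-glass prompt lends itself to a quick pre-check over exported Conditional Access policies. The sketch below assumes the export follows the Microsoft Graph `conditionalAccessPolicy` shape (`displayName`, `state`, `conditions.users.excludeUsers`); the policy names and object IDs are placeholders, and `policies_missing_exclusion` is a hypothetical helper. It looks only at direct user exclusions, so an account excluded through a group would need its membership resolved first. Note that this is inspection, not observation: it narrows where to look, and a drill sign-in with the break-glass account is still the only real evidence.

```python
def policies_missing_exclusion(policies, break_glass_ids):
    """Return names of enabled policies that do not exclude every break-glass ID."""
    missing = []
    for policy in policies:
        if policy.get("state") != "enabled":
            continue  # report-only and disabled policies do not block sign-ins
        users = policy.get("conditions", {}).get("users", {})
        excluded = set(users.get("excludeUsers", []))
        if not set(break_glass_ids) <= excluded:
            missing.append(policy.get("displayName", "<unnamed>"))
    return missing


# Placeholder IDs and policies, for illustration only.
BREAK_GLASS = {"bg-1", "bg-2"}
POLICIES = [
    {"displayName": "Require MFA for all users", "state": "enabled",
     "conditions": {"users": {"includeUsers": ["All"], "excludeUsers": ["bg-1", "bg-2"]}}},
    {"displayName": "Block legacy auth", "state": "enabled",
     "conditions": {"users": {"includeUsers": ["All"], "excludeUsers": ["bg-1"]}}},
    {"displayName": "Pilot: compliant device", "state": "enabledForReportingButNotEnforced",
     "conditions": {"users": {"includeUsers": ["All"], "excludeUsers": []}}},
]

at_risk = policies_missing_exclusion(POLICIES, BREAK_GLASS)
print(at_risk)
```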
---
### 5. Stress it on purpose — hormesis, not hope
Muscle, bone, and immune systems get stronger from controlled stress and weaker from protection. Systems are the same. **An untested control is a broken control**; you simply don't know it yet. The benchmark says "the setting is configured." The antifragile consultant says "we revoked the token at 14:00 on a Tuesday and watched what actually happened."

Run game-days. Disable a CA policy and observe the fallout in a controlled window. Simulate Entra Connect failure. Pull a Global Admin's session. Kill a DC. You *want* to discover brittleness on a quiet afternoon, cheaply, with the right people watching, not at 3 a.m. during a real intrusion.

**The corollary: declared state is not enforced state.** Underneath "untested = broken" sits a harder truth about *why* you must test: every representation the platform hands you (a config blade, an inventory record, a compliance dashboard, a green tick) is a **claim about reality, not reality itself**, and the two diverge silently and routinely. Two examples that should haunt you:
- A Conditional Access policy can display a flawless configuration and **enforce nothing**: the evaluated object has desynced from the one you're looking at. Every config review, export-diff, and benchmark audit passes. Only a real sign-in reveals it fails open. (Worked example in Book IV.)
- A CMDB or device inventory shows a clean, managed fleet while the sign-in logs show a different, larger, partly-unknown population actually touching the data. The inventory is a wish; the authentication record is the fact. (Worked example in Book IV.)

So the rule that governs the whole craft: **verify by observation, never by inspection.** Trust what the system *does* under test over what any artefact *says* it does. Reading the config is not knowing the behaviour; counting the inventory is not knowing the fleet. Where the representation and the observed behaviour disagree, the behaviour is the truth and the representation is the bug.

**Judgement prompts**
- What here has never once been tested by actually breaking it?
- What do we *believe* is true about this estate that we've never verified by observation? (Belief is not evidence. The portal showing a green tick is not the same as the control firing under attack.)
- Which "facts" about this estate come from a *representation* (config screen, CMDB, dashboard) rather than from *observed behaviour*? Which have we confirmed the system actually does, versus merely says?
- Where would a silent divergence between declared and enforced state hurt most — and how would we even notice it?
- When did this client last deliberately break something to learn from it? If "never," that's the most important finding in your report.
- What's the smallest, safest experiment that would tell us whether X is real?
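
The inventory-versus-sign-in divergence described above can be checked with nothing more than a set comparison. This is a sketch under stated assumptions: `INVENTORY` stands in for a CMDB export and `SIGN_INS` for device identifiers pulled from sign-in logs, both invented here, and `reconcile` is a hypothetical helper. In a real estate the hard part is getting a device identifier that means the same thing in both sources.

```python
def reconcile(declared, observed):
    """Compare what the inventory claims with what the sign-in record shows."""
    declared, observed = set(declared), set(observed)
    return {
        "unknown": sorted(observed - declared),    # touching data, absent from inventory
        "silent": sorted(declared - observed),     # in inventory, never seen signing in
        "confirmed": sorted(declared & observed),  # claim and behaviour agree
    }


# Illustrative identifiers only; repeated sign-ins collapse into one entry.
INVENTORY = ["LAP-001", "LAP-002", "LAP-003"]
SIGN_INS = ["LAP-001", "LAP-003", "PHONE-77", "LAP-003", "UNMANAGED-PC"]

report = reconcile(INVENTORY, SIGN_INS)
for bucket, devices in report.items():
    print(bucket, devices)
```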
---
### 6. Every incident must change the structure
This is the actual definition of antifragile: *gaining from disorder.* A robust system survives a shock unchanged. An antifragile system comes out **structurally different and harder to hit the same way twice.** Pain that closes a ticket without changing the architecture is wasted pain, and it guarantees the same incident again.

After every incident, near-miss, failed game-day, or even a noisy false positive: what *structural* thing changes? Not "we reminded users to be careful." A removed permission, a severed coupling, a new firebreak, a deleted feature.

**Judgement prompts**
- For the last three incidents (or alerts) here — what changed in the *structure* afterwards? If the answer is "a training reminder," nothing changed.
- Does this organisation treat incidents as embarrassments to bury or as fuel? (Blameless on people, ruthless on structure.)
- Are we fixing the instance or the class? Patching this account, or removing the pattern that made it possible?
- What did the last false positive *teach* us that we threw away?
---
### 7. Convexity — prefer bounded cost, unbounded upside
Choose controls whose downside is small and known, and whose upside is large and broad. Conditional Access is convex: cheap to run, fails gently, and one good policy blocks whole classes of attack. A sprawling, hand-tuned DLP ruleset is concave: expensive to maintain, brittle, and it fails in surprising, expensive ways at the worst moment.

Favour the convex. Be deeply suspicious of any control that needs constant tending to keep working.

**Judgement prompts**
- When this control fails, does it fail *safe and quietly*, or *open and catastrophically*? (Fail-open is concave and usually a trap.)
- How much ongoing care does this need to keep working? High-maintenance controls rot the moment attention moves on.
- Does this control block a *class* of attacks or just one specific instance? Prefer the class.
- Are we buying a complex product to solve a problem that one CA policy and a deletion would solve?
---
### 8. Lindy — trust what has survived
The longer a mechanism has survived, the longer it's likely to keep working. Boring, time-tested controls (least privilege, network segmentation done right, hardware-backed keys, tiered admin) beat the newest preview blade in the portal. New features arrive with unknown failure modes and unknown attack surface; they have not yet been stress-tested by the world. Use them when they earn it, not because they're new.

Equally: an attack technique that has worked for many years (NTLM relay, Kerberoasting, consent phishing) will probably work next year, so prioritise accordingly.

**Judgement prompts**
- Is this control time-tested, or are we the QA team for a feature that shipped last month?
- What are the oldest, most reliable attacks against this estate — and have we actually closed them, or chased novel ones while the classics stay open?
- If this shiny feature vanished tomorrow, would we be exposed? If yes, we built on sand.
- Are we solving a 2015 problem with a 2026 product because the product is new?
---
### 9. Skin in the game — whoever designs it, lives with it
Security theatre is what happens when the people imposing controls never carry the pager. A consultant who recommends a control they'd never have to operate is selling fragility dressed as diligence. The person who designs the break-glass process should be woken up by the drill. The architect who couples AD to Entra should be the one who has to uncouple it under fire.

This applies to you. Don't recommend what you wouldn't run. Don't hand a client a 40-page hardening guide you've never operated. Your reputation is your skin in the game, so stake it on advice that survives contact with reality.

**Judgement prompts**
- Does the person who designed this control have to live with its consequences? If not, expect theatre.
- Am I recommending this because it's right, or because it's defensible if something goes wrong? (Defensive medicine is fragility you can bill for.)
- Would I bet my own reputation that this works under real attack? If I hesitate, why am I asking the client to bet theirs?
- Who gets the 3 a.m. call when this fails — and were they in the room when it was designed?
---
## How to spot fragility (the field skill)
You will walk into estates with no documentation and no time. Fragility has a smell. Train your nose on these tells:
- **Folklore.** Configurations only one person understands, justified by "we've always done it that way." If they leave, it becomes un-auditable. Folklore is fragility with tenure.
- **Single points of failure wearing a uniform.** One service account that runs everything. One admin who holds all the keys. One unreplicated DC. One sync server treated as cattle but actually a pet.
- **Tight coupling.** Compromise one thing → automatically own a second. AD↔Entra, identity-device-MFA all on one phone, prod and admin in one forest.
- **Things never tested.** Backups never restored. Break-glass never used. DR plans never run. "It should work" is the sound of a fragile system.
- **Permanent "temporary."** Exclusions, exceptions, pilot configs, and risk acceptances older than 18 months.
- **Even spreading.** Effort distributed uniformly is a sign nobody asked what matters. The barbell is missing.
- **Green dashboards, untested reality.** Everything compliant, nothing ever stress-tested. The most dangerous estate of all, because it feels safe.
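
The "permanent temporary" tell is easy to test for once exceptions are listed with a creation date. The sketch below is illustrative only: the exception names and dates are invented, both functions are hypothetical helpers, and the 18-month threshold is taken from the tell above. It assumes you can get a creation date at all, which for old exclusions is often the first finding.

```python
from datetime import date


def age_in_months(created, today):
    """Whole calendar months elapsed between two dates."""
    months = (today.year - created.year) * 12 + (today.month - created.month)
    # Do not count the current month until its anniversary day is reached.
    return months - 1 if today.day < created.day else months


def permanent_temporaries(exceptions, today, max_months=18):
    """Return (name, age_in_months) for exceptions past the threshold, oldest first."""
    aged = [(name, age_in_months(created, today)) for name, created in exceptions]
    return sorted((item for item in aged if item[1] > max_months), key=lambda item: -item[1])


# Invented register of exceptions: (description, date granted).
EXCEPTIONS = [
    ("CA exclusion: legacy scanner", date(2021, 2, 10)),
    ("Firewall rule: vendor pilot", date(2024, 11, 1)),
    ("Risk acceptance: SMBv1 on file server", date(2023, 6, 15)),
]

stale = permanent_temporaries(EXCEPTIONS, today=date(2025, 6, 1))
for name, months in stale:
    print(f"{months:>3} months  {name}")
```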
---
## The anti-benchmark: what we measure instead of compliance %
We don't score controls passed. If the client needs a number, give them these — and explain why each beats a compliance percentage:
- **Blast radius**: from a single phished standard user, how many independent barriers stand before tenant/domain dominance? (More barriers mean a smaller blast radius, so higher is better. Most estates: zero or one.)
- **Mean time to recover**: measured by *actually doing it* in a drill, not by the RTO written in a policy.
- **Single points of failure**: counted, named, and owned. The goal is a shrinking list, not a green tick.
- **Untested assumptions**: the number of load-bearing beliefs never verified by observation. The goal is to drive this toward zero.
- **Time-to-remove**: how fast can we delete a fragilizer (legacy auth, a standing admin) once found? Velocity *is* a security metric.

None of these are easy to fake, which is exactly why they're worth measuring.

---
## How to use this handbook
Book I is the lens. The domain books that follow — Hybrid Identity, Privileged Access, Devices, Data & Collaboration, Recovery, Detection-as-feedback — each apply this same lens in the same shape:
1. **Fragility inventory**: where does this domain break, and what's the blast radius?
2. **Via negativa**: what do we remove first?
3. **The barbell**: what gets paranoid protection, what stays cheap?
4. **Optionality & recovery**: what are the escape hatches, and are they tested?
5. **Stressor**: how do we deliberately break this to learn?

If you ever find yourself reaching for "because the benchmark says so," stop. Go back to the governing question. Draw the wall. If you can't draw it, you've found your work.

---
*Book I of the Antifragile Handbook. Principles over checklists. Judgement over obedience. Move fast and fix things.*