# The Antifragile Handbook for M365 & Active Directory
## Book IV — Devices & Endpoint (Intune)
> *The device will be compromised. Compliant is not the same as secure, and the portal toggle is not the same as the device's behaviour. Build for the compromise, not against it.*
---
## The governing question
Most endpoint programmes are built on a wish: *make the device trusted.* That wish is unwinnable — a device in a user's hand, on a network you don't control, running an OS you didn't write, will eventually be compromised, and no amount of hardening changes that. So flip the question:
> **Assume every device is already compromised. What still holds?**
If the answer is "nothing, because a compromised-but-compliant device gets full access," you've built fragility with a green tick on it. The antifragile endpoint posture stops trying to own the device and instead builds a boundary that **survives an untrusted device**: the data lives behind a wall, the device is cheap and disposable, and "compliant" is treated as what it actually is — a *signal that can be wrong*, not a guarantee.
That reframe — **compliance is a signal, not a checkbox** — is the spine of this whole book.
---
## 1. Fragility inventory — where the endpoint betrays you
### The fleet is a fiction: managed, unmanaged, shadow, dark
Before any of the controls below mean anything, confront the foundational lie of endpoint security: **you do not know your fleet.** The whole book so far has said "the managed devices" as if that set is the fleet. It isn't. The managed devices are the part you *chose to count* — and in most estates they're the bigger part only *if you're lucky.* The blast radius lives in everything else.
The honest spectrum of what touches your data:
- **Managed** — enrolled (MDM) or app-managed (MAM). The devices you can see and control. The part the programme is about, and the part everyone fixates on.
- **Known-but-unmanaged** — devices that authenticate and reach data but aren't managed. Entra-registered-but-not-compliant, BYOD that hit OWA or a SharePoint link in a browser. They're in the sign-in logs; they're not under your control.
- **Shadow** — devices the org never sanctioned but users brought anyway: a personal phone, a contractor's laptop, a home PC pulling files through the web client. Shadow IT at the device layer.
- **Dark** — access you have *no device-level visibility into at all.* Legacy-protocol sign-ins that bypass Conditional Access and never produce a clean device signal. Long-lived tokens issued once and never re-evaluated. App passwords. Service principals and automation that aren't devices but reach data like one (the "dark matter" of Book III, wearing a different hat). This is the end of the spectrum that should frighten you, because it never trips a sensor.
And the inventory of record — the CMDB — is almost always **more wish than reality.** It's populated by *process* (someone files a ticket), and process decays the moment attention moves on. The real device population is populated by *behaviour* — what is actually authenticating right now. The gap between those two is precisely your shadow and dark population, and it's invisible exactly where it matters most.
This is the Book I corollary made flesh: **the inventory is a claim; the sign-in log is the fact.** Stop deriving your fleet from the CMDB (declarative, decaying, wishful) and start deriving it from observed authentication (behavioural, current, honest). You can't manage what you can't see, and you can't see what you decided not to look at.
The reframe that saves you is the same barbell from §3: the goal is **not** to manage every device — that's impossible, and chasing it is fragile. The goal is (a) to *know the real population* by observation, and (b) to *gate the data* so that an unmanaged or unknown device gets limited, app-contained, or no access. The question was never "is this device managed." It's **"can a device I don't control reach the data, and what happens when it does?"** An unmanaged device forced through an app-protection boundary or a browser session control is contained. An unmanaged device holding a fat client and a never-re-evaluated token is a hole in the wall you didn't know was open.
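Deriving the fleet from observation instead of from the CMDB comes down to a handful of set differences. The sketch below is illustrative only: the device IDs, the record shape (`device_id` on each sign-in), and the names `FleetDeltas` and `reconcile` are assumptions for the example, not an Intune or Entra export format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FleetDeltas:
    authenticating_unmanaged: frozenset[str]  # signs in, but is not enrolled
    cmdb_never_seen: frozenset[str]           # on paper only: in the CMDB, never authenticates
    registered_not_managed: frozenset[str]    # Entra-registered, not enrolled
    deviceless_signins: int                   # sign-ins carrying no device ID (dark candidates)


def reconcile(intune: set[str], entra: set[str], cmdb: set[str],
              signin_records: list[dict]) -> FleetDeltas:
    """Build the observed population from sign-ins, then diff it against the claims."""
    observed = {r["device_id"] for r in signin_records if r.get("device_id")}
    deviceless = sum(1 for r in signin_records if not r.get("device_id"))
    return FleetDeltas(
        authenticating_unmanaged=frozenset(observed - intune),
        cmdb_never_seen=frozenset(cmdb - observed),
        registered_not_managed=frozenset(entra - intune),
        deviceless_signins=deviceless,
    )


# Made-up sample data, purely illustrative.
signins = [
    {"user": "alice", "device_id": "dev-1"},
    {"user": "bob", "device_id": "dev-9"},       # nobody manages dev-9
    {"user": "svc-backup", "device_id": None},   # no device signal at all
]
deltas = reconcile(
    intune={"dev-1", "dev-2"},
    entra={"dev-1", "dev-2", "dev-3"},
    cmdb={"dev-1", "dev-2", "dev-4"},
    signin_records=signins,
)
print(deltas)
```

Every non-empty field is a finding. The `deviceless_signins` count is only a rough proxy for the dark end of the spectrum; a real analysis would also break it down by protocol and client app.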
### The compliance signal lies (in both directions)
"Require compliant device" in Conditional Access is the real control. But the compliance signal underneath it is softer than the toggle suggests:
- **It's stale.** Compliance is evaluated on a check-in cadence, not continuously. There's a window where a device falls out of compliance — gets rooted, drops encryption, falls behind on patches — and still carries a "compliant" state and a valid token. The signal lags reality.
- **It's spoofable.** Root/jailbreak detection is an arms race, not a wall. A motivated attacker (or a determined user with a YouTube tutorial) steps over the tripwire. Treat detection as a tripwire, never as a barrier.
- **It's shallow.** "Compliant" usually means a handful of boxes — PIN set, encrypted, OS version, not-jailbroken. None of those stop malware running with the user's own token on a device that passed every check.
- **It fails both ways.** A false *compliant* over-trusts a hostile device. A false *non-compliant* locks a legitimate user out at the worst possible moment — and anyone who's run endpoint at scale has watched a flaky signal brick access for someone important mid-flight. Both failure modes are real; design for both.
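One way to treat compliance as a signal rather than a checkbox is to always read it together with its age. The sketch below assumes you can obtain a compliance state and a last check-in timestamp per device; the function name, the state strings, and the eight-hour threshold are hypothetical choices for illustration, not Intune defaults.

```python
from datetime import datetime, timedelta, timezone


def classify_signal(state: str, last_checkin: datetime, now: datetime,
                    max_age: timedelta) -> str:
    """Combine the reported state with how old the report is."""
    if state != "compliant":
        return "noncompliant"
    if now - last_checkin > max_age:
        # The device *was* compliant when it last reported; nothing is known since.
        return "stale-compliant"
    return "fresh-compliant"


now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
max_age = timedelta(hours=8)  # assumed tolerance, tune per estate

print(classify_signal("compliant", now - timedelta(hours=1), now, max_age))
print(classify_signal("compliant", now - timedelta(days=3), now, max_age))
print(classify_signal("noncompliant", now - timedelta(minutes=5), now, max_age))
```

A report listing how many "compliant" devices are actually "stale-compliant" makes the lag visible instead of hiding it behind a green tick.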
### The ghost policy: displayed config ≠ enforced config
This one is field-earned and genuinely frightening, because it defeats every form of inspection there is. A Conditional Access policy can show a **perfectly correct configuration in the portal** — every condition, assignment, and grant exactly as intended — and yet **never enforce anything.** The backend state has desynced or corrupted; the object you're *looking at* is not the object being *evaluated*. Recreating the policy from scratch with byte-identical parameters restores enforcement. Nothing in the displayed config ever told you it was broken.
Sit with what that means. A config review passes. An export-and-diff passes. A CIS audit ticks it green. Every parameter is "correct." And the control is doing nothing — a CA policy that **fails open, silently.** This is the worst failure on the convexity axis: the control you trusted to be convex (fails safe, blocks a class) is quietly behaving concave (fails open, protects nothing), and *no artefact you can read reveals it.* A benchmark cannot catch this. It is invisible to inspection by construction.
There is exactly one thing on earth that detects it: **observed enforcement under test.** This is not an edge case to file away — it is the single hardest piece of evidence for why the entire stressor discipline in this handbook exists. The iron rule that follows (and it is non-negotiable):
> **A CA policy's displayed configuration is a claim, never proof. The only proof is a real sign-in producing the expected outcome. Define the expected results *before* you build or change the policy, and test against them every time.**
Concretely: for the users and conditions that matter, write down the required outcome first — *user X, condition Y → MUST be blocked / granted / MFA-prompted* — so you're testing against a pre-committed expectation, not rationalising whatever you observe. Use the What If tool as a first pass, but understand its limit: What If evaluates the *configuration logic*, so it will happily tell you a ghost policy "applies" while the live evaluator ignores it. **Only a real authentication attempt is proof.** And when behaviour and config disagree, **recreate the policy from scratch — do not re-edit it**, because editing a corrupt object can carry the corruption forward. Re-test after tenant-level changes too, not just after policy edits; the desync can appear without you having touched the policy at all.
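The "write the expected result first" discipline can be kept honest with something as small as the following. It is a minimal sketch: the `Expectation` type, the outcome strings, and the shape of `observed` are assumptions made for the example, and the observed outcomes are meant to come from real sign-in attempts, not from What If.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expectation:
    user: str
    condition: str
    must: str  # "block", "grant", or "mfa"


def verify(expectations: list[Expectation],
           observed: dict[tuple[str, str], str]) -> list[tuple[Expectation, str]]:
    """Compare pre-committed expectations with outcomes observed from real sign-ins."""
    findings = []
    for exp in expectations:
        got = observed.get((exp.user, exp.condition))
        if got is None:
            findings.append((exp, "untested"))  # no real sign-in yet, so no proof
        elif got != exp.must:
            findings.append((exp, f"expected {exp.must}, observed {got}"))
    return findings


# Written down BEFORE the policy is built or changed.
expected = [
    Expectation("admin@example.test", "unmanaged device", "block"),
    Expectation("user@example.test", "compliant device", "grant"),
    Expectation("user@example.test", "new location", "mfa"),
]
# Filled in afterwards from real authentication attempts (sample values).
observed = {
    ("admin@example.test", "unmanaged device"): "grant",  # ghost policy: fails open
    ("user@example.test", "compliant device"): "grant",
}
for exp, problem in verify(expected, observed):
    print(f"{exp.user} / {exp.condition}: {problem}")
```

An "untested" row is reported as a finding on purpose: a policy with no observed sign-in behind it is still only a claim.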
### The join-state coupling (Book II reaches the desktop)
Entra hybrid join drags the Book II fragility down to the device: the device identity now depends on on-prem AD, the SCP, the sync, and line-of-sight to a DC for some flows. It's the device-layer version of "one organism, two badges," and it exists almost entirely to service legacy app/auth dependencies. Pure Entra join + Intune is the cloud-native path that severs that coupling.
### The PRT is the device's golden ticket
The Primary Refresh Token on a managed device is its key to seamless cloud SSO. A compromised endpoint with a live PRT is a serious blast-radius problem. TPM binding (the session key sealed in hardware) is what raises the cost of stealing it — so "is the PRT TPM-bound?" is a real question, not a checkbox.
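On Windows, the join state and PRT questions can be checked on the device itself with `dsregcmd /status`. The sketch below parses that output and flags the combinations discussed above. The field names (`AzureAdJoined`, `DomainJoined`, `AzureAdPrt`, `TpmProtected`) are written from memory of that command's output and should be confirmed on a current build; `TpmProtected` describes the device key, so treat it as a proxy for PRT protection rather than a direct statement about the session key. The sample text is invented.

```python
def parse_dsregcmd(text: str) -> dict[str, str]:
    """Collect 'Key : Value' lines, ignoring section banners."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        key = key.strip()
        if sep and key and not key.startswith(("|", "+")):
            fields[key] = value.strip()
    return fields


def join_and_prt(fields: dict[str, str]) -> dict[str, object]:
    def yes(name: str) -> bool:
        return fields.get(name, "").upper() == "YES"

    if yes("AzureAdJoined") and yes("DomainJoined"):
        join = "hybrid"        # carries the on-prem coupling
    elif yes("AzureAdJoined"):
        join = "entra"         # cloud-native
    elif yes("DomainJoined"):
        join = "domain-only"
    else:
        join = "none"
    return {
        "join": join,
        "prt_present": yes("AzureAdPrt"),
        "tpm_protected": yes("TpmProtected"),
    }


SAMPLE = """
             AzureAdJoined : YES
              DomainJoined : YES
              TpmProtected : NO
                AzureAdPrt : YES
      AzureAdPrtUpdateTime : 2025-01-10 09:14:02.000 UTC
"""
status = join_and_prt(parse_dsregcmd(SAMPLE))
print(status)
if status["prt_present"] and not status["tpm_protected"]:
    print("finding: live PRT on a device without TPM protection")
```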
### MAM / App Protection is a *porous* boundary
Managing the data layer without owning the device (MAM-WE / App Protection Policies) is the right idea — wall the data, don't try to own a personal phone. But the wall has seams, and the data leaks through them: the OS share sheet, copy/paste where it isn't blocked, screenshots, "open in unmanaged app," local save paths, backups and cloud sync, and unmanaged browsers. A **"Block" in the policy is a claim, not a guarantee** — there are documented cases where the data goes out a path the policy was supposed to close. And enforcement is **not symmetric across iOS and Android**: different OS capabilities, different companion app requirements, different gaps that shift release to release. Never assume parity, and never trust the toggle without watching the device.
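Because the seams and the platforms multiply, it helps to track the boundary as an explicit test matrix instead of a memory of what was tried. The sketch below is one possible shape: the seam list is taken from the paragraph above, while the `True`/`False`/`None` encoding and the function names are assumptions for illustration. Results are meant to be filled in by hand from tests on real devices.

```python
from itertools import product

SEAMS = ["share sheet", "copy/paste", "screenshot", "open in unmanaged app",
         "local save", "backup/sync", "unmanaged browser"]
PLATFORMS = ["iOS", "Android"]


def empty_matrix() -> dict:
    """One cell per (platform, seam). None = untested, True = blocked, False = leaked."""
    return {(p, s): None for p, s in product(PLATFORMS, SEAMS)}


def summarise(results: dict) -> dict:
    leaks = sorted(k for k, blocked in results.items() if blocked is False)
    untested = sorted(k for k, blocked in results.items() if blocked is None)
    asymmetric = sorted(
        seam for seam in SEAMS
        if None not in (results[("iOS", seam)], results[("Android", seam)])
        and results[("iOS", seam)] != results[("Android", seam)]
    )
    return {"leaks": leaks, "untested": untested, "asymmetric": asymmetric}


# Sample results, invented for the example.
results = empty_matrix()
for cell in results:
    results[cell] = True                     # pretend everything blocked...
results[("Android", "screenshot")] = False   # ...except one observed leak
results[("iOS", "backup/sync")] = None       # ...and one cell nobody tested

print(summarise(results))
```

The `asymmetric` list is the part most programmes never produce: seams where the same policy behaves differently on iOS and Android.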
### Enrollment is a trust-establishment moment
Autopilot and enrollment are when a device becomes "trusted." That makes the enrollment path — tokens, the Autopilot device list, enrollment restrictions — a target: hijack it and you enrol a hostile device as a friend. Most programmes harden the device after enrollment and never look hard at the enrollment trust itself.
### The legacy and standing-privilege drag
- **GPO + co-management overlap** — on-prem-coupled config (Book II again), conflicts with Intune, and a migration most estates have half-finished for years.
- **Standing local admin** on endpoints — the device-layer version of Book III's original sin; one cracked local admin path = lateral movement.
- **Legacy auth that bypasses CA entirely** — the device controls are irrelevant on a protocol that never consults Conditional Access.
### Patch velocity, and its evil twin
A fleet you can patch in 24 hours is antifragile; one that takes six weeks of change control is fragile, and the attackers know your patch latency better than you do. But the *opposite* failure is just as real: a fast push to **everything at once** with no staging is how a single bad update bricks an entire fleet — the 2024 CrowdStrike mass-BSOD event was exactly this, a security vendor's own update shipped fast to everyone with no canary. Velocity without an escape hatch is concave (see §4).
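The difference between a staged push and an all-at-once push is easy to see in a toy model. The sketch below is a simulation under stated assumptions, not a description of how Intune update rings or any vendor's pipeline actually work: ring names and sizes, the failure-rate threshold, and the `failures_after_soak` callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Ring:
    name: str
    devices: int


def staged_push(rings: list[Ring],
                failures_after_soak: Callable[[Ring], int],
                max_failure_rate: float) -> dict:
    """Push ring by ring; halt as soon as a ring's failure rate exceeds the threshold."""
    exposed = 0
    broken = 0
    for ring in rings:
        exposed += ring.devices
        failed = failures_after_soak(ring)   # observed after the soak period
        broken += failed
        if failed / ring.devices > max_failure_rate:
            return {"halted_at": ring.name, "exposed": exposed, "broken": broken}
    return {"halted_at": None, "exposed": exposed, "broken": broken}


def bad_update(ring: Ring) -> int:
    return ring.devices // 2                 # a bad update that bricks half of what it touches


staged = staged_push(
    [Ring("canary", 20), Ring("pilot", 200), Ring("broad", 5000)],
    bad_update, max_failure_rate=0.05)
all_at_once = staged_push([Ring("everyone", 5220)], bad_update, max_failure_rate=0.05)

print("staged:     ", staged)
print("all at once:", all_at_once)
```

Both runs detect the bad update. Only the staged run detects it while the damage is still bounded by the size of the canary ring, which is the whole argument for velocity with a brake.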
---
## 2. Via negativa — what to remove
1. **Go cloud-native.** Move to Entra join + Intune + Autopilot and retire hybrid join, domain join, and GPO wherever the legacy dependency can actually be killed. This severs the Book II coupling at the device layer and deletes a whole class of "the desktop broke because the DC/sync/SCP did" failures.
2. **Stop trying to trust the device.** This is a *deletion* — stop pouring effort into making BYOD a trusted device. Wall the data instead (MAM/App Protection) and treat the device as untrusted by default. Subtracting the impossible goal is the move.
3. **Remove data from the endpoint.** If the data lives in managed apps and the cloud, there's less on the device to leak or lose. Shrink the local footprint and the compromise gets cheaper to absorb.
4. **Remove standing local admin.** JIT elevation (Endpoint Privilege Management) instead — Book III's "shrink the time" at the desktop.
5. **Kill legacy auth and the protocols that bypass CA.** A device control you can route around isn't a control.
6. **Prune the cruft** — conflicting/duplicate config profiles, dead enrollment profiles, stale Autopilot registrations, orphaned compliance policies nobody can explain. Each one is drift waiting to surprise you.
---
## 3. The barbell — cheap devices, protected boundary
**The device is cattle, not a pet.** This is the central barbell of the book. A lost, stolen, or compromised endpoint should be a **shrug**: selective-wipe the corporate data (BYOD) or full-wipe and re-provision via Autopilot in about an hour (corporate). If losing a laptop is a crisis, you've made the device irreplaceable — which means you protected the wrong thing.
**Protect the irreplaceable boundary instead:**
- **The access decision** — Conditional Access. This is the convex control of the endpoint world (Book I): one well-built policy blocks whole classes of attack, cheaply. It is also one of the few things that can brick an entire tenant if misconfigured, so it gets paranoid change discipline (§4).
- **The data boundary** — the managed-app container / App Protection policy set, tested at the seams (§5), not trusted at the toggle.
- **The PRT and enrollment trust** — TPM-bound credentials, hardened enrollment restrictions, device-bound phishing-resistant auth (links Book III).
**Don't gold-plate the disposable.** Spending weeks locking down a kiosk's wallpaper policy while the CA policy set has a legacy-auth hole is the endpoint version of even-spreading. Concentrate on the decision and the data wall.
---
## 4. Optionality & recovery — escape hatches, tested
- **Wipe-and-reprovision as the recovery primitive.** Autopilot makes the device replaceable; *that* is your endpoint recovery plan. But "replaceable in an hour" is a slide claim until you've timed it on a real device. Drill it.
- **Selective wipe for BYOD** — the clean escape hatch that pulls corporate data without touching the user's photos. The thing that makes MAM politically survivable.
- **Update rings and canaries — velocity *with* a brake.** The answer to the CrowdStrike failure mode isn't "patch slowly," it's "patch fast through rings with a real canary, and keep the ability to **halt or roll back** a bad push before it reaches everyone." Fast *and* reversible. This is the barbell and optionality fused: speed on the upside, a bounded blast radius on the downside.
- **Break-glass exclusion from device requirements.** A flaky compliance signal must never lock out recovery. The break-glass accounts (Book I/III) sit outside the "require compliant device" gate — and that exclusion is monitored, not forgotten.
- **Fast device-trust revocation.** A one-move way to disable a device, revoke its tokens, and drop it from CA trust. Rehearse it.
- **Continuous Access Evaluation** is the mechanism shrinking the stale-token window — near-real-time response to critical events instead of waiting for token expiry. It narrows §1's "the signal is stale" gap. Coverage is not universal across every app and flow (verify current state; see *Honest uncertainty* below).
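The "one-move" revocation in the list above is worth writing down as an explicit, ordered plan before it is needed. The sketch below only builds the list of requests and sends nothing. The Microsoft Graph v1.0 paths shown (disable the device object, revoke the user's sign-in sessions, retire the managed device) are written from my understanding of the API; the IDs are placeholders, the required permissions are not shown, and the exact paths and their effect on live tokens should be verified against current documentation and in a drill.

```python
GRAPH = "https://graph.microsoft.com/v1.0"


def revocation_plan(user_id: str, entra_device_object_id: str,
                    intune_managed_device_id: str) -> list[dict]:
    """Return the ordered requests for a device-trust revocation. Nothing is sent."""
    return [
        {   # Stop the device identity from authenticating.
            "step": "disable device object",
            "method": "PATCH",
            "url": f"{GRAPH}/devices/{entra_device_object_id}",
            "json": {"accountEnabled": False},
        },
        {   # Invalidate the user's refresh tokens.
            "step": "revoke sign-in sessions",
            "method": "POST",
            "url": f"{GRAPH}/users/{user_id}/revokeSignInSessions",
            "json": None,
        },
        {   # Pull corporate data and management from the device.
            "step": "retire managed device",
            "method": "POST",
            "url": f"{GRAPH}/deviceManagement/managedDevices/{intune_managed_device_id}/retire",
            "json": None,
        },
    ]


# Placeholder IDs, purely illustrative.
plan = revocation_plan("user-object-id", "device-object-id", "managed-device-id")
for request in plan:
    print(f'{request["step"]}: {request["method"]} {request["url"]}')
```

Keeping the plan as data makes it reviewable and rehearsable; the drill then consists of executing it against a test device and timing how long access actually survives.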
---
## 5. Stressor — break it on purpose
This domain rewards hands-on stress more than any other, because the gap between *policy* and *behaviour* only shows up on a real device.
- **Reconcile the four lists and hunt the deltas.** Pull Intune-enrolled devices, Entra-registered devices, devices appearing in sign-in logs, and the CMDB. None of them will agree. The **disagreements are the findings**: devices authenticating that nobody manages, CMDB entries that never sign in, registered devices that fell out of management. Then go further — count legacy-auth sign-ins and long-lived sessions (the dark end), and run network device discovery for the unmanaged things on the wire. The size of the gap between "the fleet we think we have" and "the population actually touching data" is one of the most honest metrics you can put in a report.
- **Attack your own MAM boundary, per platform.** Try to get corporate data out through every seam: share sheet, copy/paste, screenshot, save-as-local, open-in-unmanaged-app, backup/sync, an unmanaged browser. Find where "Block" doesn't actually block. Do it **separately on iOS and Android** — they will not behave the same, and the difference is the finding. (When you find a gap that survives reinstall and reset, that's an escalation to the vendor, not a config you missed.)
- **Spoof the compliance signal.** Root/jailbreak a test device. Is it caught? How long until the signal flips and CA reacts? That latency is your real exposure window.
- **Prove every CA policy actually enforces.** Never sign off a policy on its displayed config. With expected results written down beforehand, drive real sign-ins for each user/condition that matters and confirm the *observed* outcome matches. Treat What If as a hint, not proof. If a policy that looks correct doesn't enforce, recreate it from scratch rather than editing — the displayed object and the evaluated object can diverge silently, and a ghost policy fails open without ever telling you.
- **Lock yourself out on purpose.** In report-only mode, simulate a false non-compliant on a privileged user. Watch the CA decision. Confirm break-glass sails through. Better to find the lockout in a drill than during an outage.
- **Push a deliberately bad config/update to the canary ring.** Confirm the ring *contains* it and that halt/rollback works. An untested canary is just the first domino with a friendly name.
- **Time a wipe-and-reprovision.** Is the device truly replaceable in an hour, or is that a fiction the recovery plan rests on?
- **Compromise a test endpoint.** What does its PRT reach? Does EDR detect it? Does the device-risk signal actually flow into CA and revoke access — or does it stop at a dashboard nobody watches?
Per Book I principle 6: every gap found becomes a **structural** change — a closed seam, a tightened ring, a severed coupling, an escalation raised — not a line in a test log that dies there.
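For the compliance-spoof and compromise drills, the number that matters is how long access survived. A minimal sketch for recording it follows; the three timestamps are assumed to be noted by hand during the drill, and the function and key names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def exposure_window(compromised_at: datetime, signal_flipped_at: datetime,
                    access_blocked_at: datetime) -> dict[str, timedelta]:
    """Split the total exposure into detection lag and enforcement lag."""
    return {
        "signal_lag": signal_flipped_at - compromised_at,         # until the signal flips
        "enforcement_lag": access_blocked_at - signal_flipped_at,  # until access actually stops
        "total_exposure": access_blocked_at - compromised_at,
    }


# Sample drill timings, invented for the example.
t0 = datetime(2025, 3, 4, 9, 0, tzinfo=timezone.utc)
window = exposure_window(
    compromised_at=t0,
    signal_flipped_at=t0 + timedelta(hours=6, minutes=30),
    access_blocked_at=t0 + timedelta(hours=7, minutes=15),
)
for name, value in window.items():
    print(f"{name}: {value}")
```

Splitting the window shows where the structural fix belongs: a long `signal_lag` points at check-in cadence or detection, a long `enforcement_lag` at token lifetime or the path from signal to access decision.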
---
## Honest uncertainty (endpoints are the worst offender — verify on a real device)
Stable and Lindy (teach with confidence): the device will be compromised; trust the boundary, not the device; cheap-and-reprovisionable beats hardened-and-precious; compliance is a signal; velocity needs a brake. None of that churns.
What moves — and on the endpoint, it moves *faster and more quietly* than anywhere else in this handbook:
- **MAM / App Protection enforcement is version-, platform-, and OS-build-dependent, and it has gaps that shift release to release.** iOS and Android are not symmetric and never have been; companion app requirements and managed-browser support change. The portal will tell you a policy is enforced while the device quietly does something else. **The only reliable test is on a real device, on the current OS build, every release** — the documentation and the hardware disagree more than Microsoft likes to admit. If you live anywhere in this handbook, live here.
- **Continuous Access Evaluation coverage** is expanding but not universal — which apps and flows honour near-real-time revocation changes; verify current coverage before you promise it closes the stale-token window.
- **Windows LAPS, Endpoint Privilege Management, Autopatch, Smart App Control / WDAC** capabilities and management surfaces all evolve; confirm current state and licensing before recommending.
- **Cloud-native vs hybrid-join guidance and the GPO→Settings-Catalog migration tooling** keep shifting toward cloud-native; check what's actually supported for the client's app estate before promising the coupling can be cut.
If a client's safety hinges on a specific enforcement behaviour, **test it on the device and, if needed, cite the current Microsoft doc** — and when the device behaviour contradicts the doc, believe the device. Confident-but-wrong about an endpoint control is how data walks out a seam everyone swore was closed.
---
## Consolidated judgement prompts
- If this device is compromised right now, what does the attacker get, how fast do we know, and how fast is it gone? Is the device a shrug or a crisis?
- Do we know our *real* device population — derived from what's authenticating — or are we trusting a CMDB that's more wish than reality? How big is the gap between managed, known-unmanaged, shadow, and dark? What dark access bypasses CA entirely?
- Is "compliant" being treated as a guarantee or as a signal that can be stale, spoofed, or shallow? What happens when it's wrong — in *both* directions?
- Is the boundary the data (MAM/CA) or the device? Have we tested the data wall at every seam, on every platform, on the current OS build — or just toggled it?
- Are devices hybrid-joined out of genuine need, or out of habit? What would it take to go cloud-native and cut the Book II coupling?
- Can we patch the fleet fast — and can we *halt* a bad push before it reaches everyone? Do we have rings and a real canary, or hope?
- Is the PRT TPM-bound? Is enrollment trust hardened, or can a hostile device enrol as a friend?
- Does standing local admin still exist? Does legacy auth still bypass CA?
- For every CA policy that matters: has it been proven to enforce by a *real sign-in* against pre-written expected results — or are we trusting the displayed config of a policy that might be a ghost?
- Has anyone timed a wipe-and-reprovision, tested break-glass against the device gate, or watched the device-risk signal actually reach a CA decision?
---
*Book IV of the Antifragile Handbook. Stop defending the device; assume it's already lost and build the boundary that survives it. Trust the device behaviour over the portal toggle, every time. Move fast and fix things.*