feat: Add management overlay pattern (Nebula T0 / Tailscale T1) and cloud admin VM guidance

2026-06-09 14:40:34 +02:00
parent 5264f7b439
commit 7ff4fad953
4 changed files with 173 additions and 20 deletions
@@ -221,11 +221,65 @@ ADCS is Tier 0. It sits on whatever server it runs on, and that server should ha
---
### Privileged Access Workstations — scope the conversation honestly
### Admin workstations — the cloud VM is the deployable PAW
PAWs are right in principle. In 2026, the practical conversation with most mid-market clients is: **dedicated devices for Tier 0 administration** (Global Admins and Domain Admins use a separate machine for those tasks, even if that machine is just a hardened Windows device or a VM they launch for admin work).
Physical PAWs are right in principle and almost never get deployed. Hardware procurement, second device, behaviour change — the project does not survive contact with a real IT budget. Do not open the conversation with "you need a dedicated PAW laptop." Open it with the cloud admin VM.
The minimum viable version: a dedicated Intune-enrolled, Entra-joined device with no email, no browser for general use, and a Conditional Access policy that restricts Global Admin and Domain Admin-equivalent activity to that device only. Not perfect PAW architecture but a massive improvement over "I use my laptop for everything."
**The cloud admin VM:** a Windows 365 or Azure Virtual Desktop instance provisioned from a hardened template. The admin connects from their normal device via browser or RDP. Privileged credentials, including the key material for the management overlay (Nebula node certificates and keys, Tailscale WireGuard node keys), live in the cloud VM, not on the admin's local device. Compromise response: wipe it, reprovision from template in under 20 minutes.
**Provisioning the cloud admin VM:**
1. Create a Windows 365 or AVD instance from a hardened base image (CIS L2 baseline or equivalent)
2. Enrol in Intune, apply a configuration profile: no internet browsing, no personal email, no Microsoft Store apps, screen lock on idle, BitLocker enforced
3. Scope a CA policy restricting Global Admin and privileged role activation to this device (device compliance + named Intune group)
4. Install the Nebula client (if deploying T0 overlay) and distribute the pre-signed node certificate
5. Install the Tailscale client (if deploying T1 overlay) and enrol with the Entra OIDC identity
**Minimum viable without the overlay:** a dedicated Intune-enrolled, Entra-joined cloud VM with no email and no general browsing, and a CA policy restricting GA activation to it. Not perfect, but it will actually get deployed and maintained.
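A minimal sketch of what the CA policy could look like as a Microsoft Graph request body, built in Python so it can be reviewed before anyone posts it. Assumptions, none of which come from the text above: the cloud admin VMs are marked with `extensionAttribute1 = "AdminVM"` (CA device filters match device properties, not group membership), the policy starts in report-only mode, and the field names follow the Graph `conditionalAccessPolicy` resource as commonly documented. Verify against the current Graph reference before use.

```python
import json

# Well-known role template ID for Global Administrator in Entra ID.
GLOBAL_ADMIN_ROLE_ID = "62e90394-69f5-4237-9190-012177145e10"


def build_admin_vm_ca_policy(marker_value="AdminVM"):
    """Return a CA policy body that blocks Global Admin sign-ins from any
    device NOT tagged as a cloud admin VM (marker_value is hypothetical)."""
    return {
        "displayName": "Block Global Admin outside cloud admin VMs",
        # Report-only first, so the effect can be reviewed before enforcement.
        "state": "enabledForReportingButNotEnforced",
        "conditions": {
            "clientAppTypes": ["all"],
            "users": {"includeRoles": [GLOBAL_ADMIN_ROLE_ID]},
            "applications": {"includeApplications": ["All"]},
            "devices": {
                "deviceFilter": {
                    # "exclude" means tagged admin VMs are exempt from the block.
                    "mode": "exclude",
                    "rule": f'device.extensionAttribute1 -eq "{marker_value}"',
                }
            },
        },
        "grantControls": {"operator": "OR", "builtInControls": ["block"]},
    }


policy = build_admin_vm_ca_policy()
print(json.dumps(policy, indent=2))
```

Exclude the emergency-access accounts before switching the state to enforced; a block policy scoped to Global Admins with no exclusion can lock the tenant out.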
---
### Management overlay — Nebula for T0, Tailscale for T1
**When a client needs this:** SME and mid-market clients with multi-cloud resources, DevOps workloads, or remote admins — and no physical data centre with a proper management VLAN. The overlay builds the management plane that the physical network cannot provide.
**When a client does not need this:** organisations with their own data centres and physical network infrastructure already in place. Traditional management VLAN segmentation plus jump boxes is the right answer there. Adding an overlay creates a new Tier 0 component without proportional benefit.
**The T0 overlay — Nebula:**
Nebula has no coordinator in the runtime path. Once certificates are distributed, the overlay runs with zero external dependencies. This is the right property for T0: a compromised or unavailable external service cannot affect access to your domain controllers.
Deployment steps:
1. Provision the Nebula CA on a dedicated air-gapped machine (a dedicated laptop that is never networked, or a cheap PC kept in a drawer)
2. Generate and sign node certificates for each T0 node (DCs, sync server, ADCS, cloud admin VMs/PAWs)
3. Distribute the signed certificates and the CA certificate to each node
4. Configure the Nebula firewall rules (the overlay's ACL policy): cloud admin VMs can reach DCs on ports 3389 (RDP) and 5985/5986 (WinRM); nothing else. DCs do not reach each other through Nebula (they have their own replication channel)
5. Start the Nebula service on each node. Test connectivity from the cloud admin VM to a DC
6. Document the CA signing ceremony: who can sign new certs, what approval is needed, where the CA key is stored, how to revoke (distribute updated blocklist to all nodes)
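
The ACL in step 4 can be sketched as a generated `firewall:` section for a DC node's Nebula config. This is an illustration, not a reviewed policy: the group name `admin-vm` is hypothetical and has to match a group embedded in the admin VM certificates at signing time, and the rendering is hand-rolled to avoid depending on a YAML library.

```python
# Management ports from the ACL step: RDP and WinRM (HTTP/HTTPS).
MGMT_PORTS = (3389, 5985, 5986)


def dc_inbound_rules(admin_group="admin-vm"):
    """Inbound rules for a DC node: one rule per management port, each
    limited to peers whose certificate carries admin_group."""
    return [{"port": p, "proto": "tcp", "group": admin_group} for p in MGMT_PORTS]


def render_firewall(inbound):
    """Render a `firewall:` section as YAML text."""
    lines = [
        "firewall:",
        "  outbound:",
        # Outbound is left open here; peers' inbound rules do the denying.
        "    - port: any",
        "      proto: any",
        "      host: any",
        "  inbound:",
    ]
    for rule in inbound:
        items = list(rule.items())
        first_key, first_val = items[0]
        lines.append(f"    - {first_key}: {first_val}")
        for key, val in items[1:]:
            lines.append(f"      {key}: {val}")
    return "\n".join(lines)


print(render_firewall(dc_inbound_rules()))
```

Because no inbound rule names a DC group, DC-to-DC traffic over the overlay is dropped by default, which matches the intent that replication stays on its own channel.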
**Realistic T0 node count:** 15–25 nodes for a 5,000-person organisation. Certificate management is a documented ceremony run a few times a year, not an ongoing operational burden.
**The T1 overlay — Tailscale:**
Tailscale with Entra OIDC + key expiry gives you device trust (WireGuard node key) plus per-session identity assertion (Entra MFA on re-authentication). Configure key expiry to force re-authentication on a schedule aligned with the session risk tolerance (8–24 hours for admin access).
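
To make the effect of key expiry concrete, here is a small model of the rule in Python. It is only an illustration of the timing logic, not Tailscale's implementation; the 8-hour window is an assumed value for admin nodes and the timestamps are made up.

```python
from datetime import datetime, timedelta, timezone

# Assumed expiry window for admin nodes (hypothetical choice).
ADMIN_KEY_EXPIRY = timedelta(hours=8)


def needs_reauth(last_auth, now, expiry=ADMIN_KEY_EXPIRY):
    """True once the node key is at least as old as the expiry window,
    i.e. the admin must pass Entra MFA again before reconnecting."""
    return now - last_auth >= expiry


morning = datetime(2026, 6, 9, 8, 0, tzinfo=timezone.utc)
print(needs_reauth(morning, morning + timedelta(hours=7)))  # False
print(needs_reauth(morning, morning + timedelta(hours=9)))  # True
```

A shorter window means a stolen node key is useful for less time, at the cost of more frequent MFA prompts for the admin.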
Deployment steps:
1. Create a Tailscale account or deploy Headscale (for sovereign requirements)
2. Configure the OIDC integration with Entra ID. Set the MFA requirement to phishing-resistant (FIDO2) in the Entra Conditional Access policy that governs Tailscale authentication
3. Set key expiry: 8–24 hours for admin nodes, 24–72 hours for standard nodes
4. Define ACL policy: cloud admin VMs reach T1 servers on management ports only; standard user devices do not appear in the T1 ACL
5. Enrol cloud admin VMs as nodes. Enrol T1 servers (member servers, cloud management hosts, K8s API server endpoints)
6. Test: attempt to reach a T1 server from a non-enrolled device. Expected: no route. From an enrolled cloud admin VM: connected
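
Steps 4 and 6 can be sketched together: a policy document in the shape of a Tailscale ACL file, plus a deliberately simplified default-deny check that mirrors the expected test outcome. The tag names, the `autogroup:admin` owner, and the port list (22, 3389, 5985, 5986, and 6443 for a K8s API server) are assumptions for illustration. The checker is not Tailscale's evaluation engine; it only handles the exact `tag:port,port` form used here.

```python
import json

policy = {
    "tagOwners": {
        "tag:admin-vm": ["autogroup:admin"],
        "tag:t1-server": ["autogroup:admin"],
    },
    "acls": [
        {
            "action": "accept",
            "src": ["tag:admin-vm"],
            # Management ports only; standard user devices appear nowhere.
            "dst": ["tag:t1-server:22,3389,5985,5986,6443"],
        },
    ],
}


def is_allowed(policy, src_tag, dst_tag, port):
    """Default deny: allowed only if an accept rule lists src_tag and
    names dst_tag together with this port."""
    for rule in policy["acls"]:
        if rule["action"] != "accept" or src_tag not in rule["src"]:
            continue
        for dst in rule["dst"]:
            target, _, ports = dst.rpartition(":")
            if target == dst_tag and str(port) in ports.split(","):
                return True
    return False


print(json.dumps(policy, indent=2))
print(is_allowed(policy, "tag:admin-vm", "tag:t1-server", 3389))
print(is_allowed(policy, "tag:user-laptop", "tag:t1-server", 3389))
```

The real test in step 6 still has to be run from actual devices; this only checks that the written policy says what you intend before it is uploaded.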
**What Tailscale carries for multi-cloud:** kubectl access to K8s clusters, SSH/RDP to member servers and cloud VMs, cloud CLI access where the management API is behind a private endpoint. It does not carry M365 admin traffic — that goes direct to Microsoft over the internet, gated by Conditional Access.
**The Nebula CA — the one critical operation:**
The CA key is the trust anchor for the entire T0 overlay. Its compromise means an attacker can enrol their own node and grant it access to every DC. Treat it accordingly:
- Air-gapped machine, never networked after initial setup
- CA key encrypted at rest on the machine and backed up separately
- Certificate lifetime: 180 days maximum, so non-renewal handles most revocation cases
- Revocation: if a PAW is lost or an admin departs before cert expiry, add the certificate's fingerprint to the `pki.blocklist` list in every node's config and reload Nebula
- At least two named people who know the ceremony and can perform it
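
The lifetime cap and the revocation rule can be sketched as two small helpers for the ceremony checklist. The first builds, but does not run, a `nebula-cert sign` command line and refuses lifetimes above 180 days; the second answers whether a lost or departed node still needs a blocklist entry. The node name, overlay IP, and group are hypothetical, and the flag names are those of nebula-cert 1.x releases as I understand them, so check `nebula-cert sign -h` on the version you deploy.

```python
from datetime import datetime, timedelta, timezone

MAX_LIFETIME = timedelta(days=180)


def sign_command(name, overlay_ip, groups, lifetime=MAX_LIFETIME):
    """Build a nebula-cert sign argv list, enforcing the 180-day cap."""
    if lifetime > MAX_LIFETIME:
        raise ValueError("certificate lifetime exceeds 180 days")
    hours = int(lifetime.total_seconds() // 3600)
    return [
        "nebula-cert", "sign",
        "-name", name,
        "-ip", overlay_ip,
        "-groups", ",".join(groups),
        "-duration", f"{hours}h",
    ]


def needs_blocklisting(not_after, now):
    """A revoked node only needs a blocklist entry while its certificate
    is still inside its validity period; after that, expiry covers it."""
    return now < not_after


# Hypothetical node: a DC at 10.42.0.10 in group "dc".
print(" ".join(sign_command("dc01", "10.42.0.10/24", ["dc"])))

issued = datetime(2026, 6, 9, tzinfo=timezone.utc)
print(needs_blocklisting(issued + MAX_LIFETIME, issued + timedelta(days=30)))
```

Keeping the cap in code rather than in people's memory means a rushed ceremony cannot quietly issue a multi-year certificate.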
---