feat: Add management overlay pattern (Nebula T0 / Tailscale T1) and cloud admin VM guidance

2026-06-09 14:40:34 +02:00
parent 5264f7b439
commit 7ff4fad953
4 changed files with 173 additions and 20 deletions
@@ -17,6 +17,51 @@ The antifragile answer is a two-layer architecture: **network access** (Tailscal
---
## When overlay management networks help — and when they don't
**Enterprises with their own data centres** already have the physical substrate for a proper management network: dedicated VLANs, hardware segmentation, jump boxes. Adding an overlay management network introduces a new Tier 0 component (the coordinator) on top of infrastructure that already solves the problem. The complexity cost outweighs the benefit. Traditional management VLAN segmentation, done properly, is the right answer.
**SME clients with multi-cloud resources, containers, and DevOps workloads** have a different problem: there is no physical network to segment. Resources are scattered across Azure, AWS, a colo, and maybe on-prem. The management plane does not exist yet — you are building it. An overlay is how you build it, and it is the right answer for this context.
**The T0/T1 split** — applying the tier model to the overlay itself:
- **T0 systems** (domain controllers, ADCS, Entra Connect sync server — the identity control plane): use **Nebula**. No coordinator in the runtime path — once certificates are distributed, the overlay functions with zero external dependencies. The Nebula CA is the only Tier 0 component, and it can be kept offline. This means no coordinator to compromise, no external API call, no cloud service availability dependency for reaching your most critical systems.
- **T1 systems** (member servers, cloud workloads, Kubernetes clusters, multi-cloud management): use **Tailscale** (or Headscale for sovereign requirements). Per-node ACLs, Entra OIDC integration, per-session MFA via key expiry and IdP enforcement. The coordinator trust concern is more acceptable at T1 — a compromised coordinator affects T1 access, not T0.
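The split above can be sketched as a small lookup. This is a minimal illustration, not a prescribed implementation: the role names, `T0_ROLES`, `OverlayChoice`, and `choose_overlay` are hypothetical, and anything not listed as T0 is simply assumed to be T1.

```python
from dataclasses import dataclass

# Hypothetical role identifiers for the identity control plane (T0).
# The membership mirrors the T0 examples named in the text.
T0_ROLES = {"domain_controller", "adcs", "entra_connect_sync"}


@dataclass(frozen=True)
class OverlayChoice:
    tier: str
    overlay: str


def choose_overlay(role: str, sovereign: bool = False) -> OverlayChoice:
    """Map a system role to its tier and management overlay."""
    if role in T0_ROLES:
        # T0: no coordinator in the runtime path, so Nebula.
        return OverlayChoice("T0", "Nebula")
    # Assumption: every other role is treated as T1 in this sketch.
    # Headscale replaces Tailscale when sovereignty is required.
    return OverlayChoice("T1", "Headscale" if sovereign else "Tailscale")
```

For example, `choose_overlay("kubernetes_cluster", sovereign=True)` yields a T1 choice backed by Headscale, while any role in `T0_ROLES` always maps to Nebula regardless of the `sovereign` flag.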
**The T0 node count is not scary.** For a 5,000-person organisation, the realistic T0 Nebula population is:
| Component | Count |
|-----------|-------|
| Domain Controllers | 4-8 |
| Entra Connect / Cloud Sync server | 1-2 |
| ADCS issuing CA | 1-2 |
| AD FS servers (if not yet removed) | 0-4 |
| Cloud admin VMs / PAWs | 5-10 |
| **Total** | **~15-25 nodes** |
Certificate management for 15-25 nodes is a documented procedure, not an operational burden. The CA signing ceremony happens a few times a year when a PAW is replaced or an admin leaves. This is tractable.
---
## The PAW problem and the cloud admin VM
Physical privileged access workstations (PAWs) are the right principle. They almost never get deployed. Hardware procurement, second device on the desk, behaviour change: the project dies before it starts.
The **cloud-hosted admin workstation** preserves the essential security properties without the hardware problem:
- A Windows 365 or Azure Virtual Desktop VM provisioned from a hardened template
- Used only for privileged tasks (no email, no general browsing)
- Connected to the Nebula T0 overlay (for DC access) and Tailscale T1 overlay (for server/cloud access)
- Accessed by the admin from their normal device via browser or RDP client
- Privileged credentials live in the cloud VM, not on the admin's local device
- Compromise response: wipe the VM, reprovision from template in 20 minutes
The security property that matters is preserved: privileged credentials do not touch the device used for email and browsing. An attacker who compromises the admin's local device gets a browser session to a cloud VM that requires phishing-resistant MFA to reach. They do not get cached credentials, session tokens, or the key material for either management overlay (Nebula certificates and private keys, or Tailscale's WireGuard node keys).
**When to use a physical PAW instead:** clients with a strong security culture and genuine appetite for the operational overhead, OT/ICS environments where the management workstation may need to be air-gapped, or engagements where the threat model includes a sophisticated attacker who would attempt to compromise the RDP session interactively.
---
## The Two Layers
### Layer 1: Network Access — Tailscale / Headscale + WireGuard
@@ -130,6 +175,30 @@ This catches more clients than it appears. A manufacturing company with 800 empl
---
### Nebula — T0 Management Overlay
| Attribute | Detail |
|-----------|--------|
| **What it does** | Overlay mesh built on the Noise Protocol Framework (not WireGuard), with no coordinator in the runtime path. Nodes authenticate via pre-distributed certificates signed by a local CA. Lighthouse nodes handle peer discovery and NAT traversal only; they are not in the authentication path. |
| **Why it is right for T0** | No external runtime dependency. A compromised or unavailable coordinator cannot affect T0 access. The CA (the actual trust anchor) can be kept offline and brought up only for certificate issuance. |
| **Trade-off vs Tailscale** | No dynamic node management (adding/removing a node requires a CA operation and cert redistribution); no cloud-managed control plane; higher initial setup complexity; certificate revocation requires distributing an updated blocklist |
| **Why the trade-off is acceptable for T0** | T0 node population is small (15-25 nodes) and stable. Revocation events (lost PAW, departing admin) are rare and known immediately. The operational overhead is a documented ceremony run a few times a year, not a recurring burden. |
| **Antifragile pillar** | Structural Decoupling, Sovereign Intelligence |
| **When to deploy** | T0 systems (DCs, sync server, ADCS) in any estate; air-gapped or restricted environments; clients where the management plane must have zero external runtime dependencies |
**Nebula CA management — the one non-trivial operation:**
The Nebula CA private key is the trust anchor for the entire T0 overlay. It must be treated accordingly:
- Air-gapped machine (a dedicated laptop that is never networked, or a hardware security module)
- Documented signing ceremony: who is authorised to sign a new certificate, what approval is required, what the procedure is
- Named individuals (minimum two) who know the procedure and can perform it
- CA key backup: encrypted, stored separately from the signing machine, tested
- Short certificate lifetimes (90-180 days) so revocation is handled implicitly by non-renewal as much as by explicit blocklist distribution
This is the same discipline as an offline root CA — because that is functionally what it is.
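Short lifetimes only work if renewals are tracked, so a small check like the following could feed the signing ceremony's schedule. It is a sketch under stated assumptions: `cert_status`, `lifetime_days`, and `renew_window_days` are hypothetical names, the 14-day renewal window is an arbitrary illustration, and the issue date is assumed to come from whatever inventory the team keeps, not from any Nebula API.

```python
from datetime import datetime, timedelta, timezone


def cert_status(issued_at: datetime, lifetime_days: int,
                now: datetime, renew_window_days: int = 14) -> str:
    """Classify a certificate as 'valid', 'renew', or 'expired'."""
    expires_at = issued_at + timedelta(days=lifetime_days)
    if now >= expires_at:
        # Not renewed in time: the node drops off the overlay,
        # which is the implicit revocation path described above.
        return "expired"
    if now >= expires_at - timedelta(days=renew_window_days):
        # Inside the renewal window: schedule a signing ceremony.
        return "renew"
    return "valid"


# Illustrative values only: a certificate issued with a 90-day lifetime.
issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
status_at_day_80 = cert_status(issued, 90, issued + timedelta(days=80))
```

A departing admin's certificate that is never re-signed ends up in the `expired` state without further action; the explicit blocklist covers the gap until then.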
---
### Smallstep — Certificate-Based SSH Access
| Attribute | Detail |
@@ -145,20 +214,34 @@ This catches more clients than it appears. A manufacturing company with 800 empl
## The Decision Framework
```
Does the client have legacy VPN sprawl or flat-network vendor access?
├── YES → Deploy Layer 1 (network access) first
├── Wants managed service + commercial support → Tailscale (partnership)
Does the client have their own data centre with physical network infrastructure?
├── YES → Traditional management VLAN segmentation + jump box
Overlay adds complexity without proportional benefit here
└── NO / Multi-cloud / Scattered resources → Overlay is the right management plane
Does the client need a T0 management overlay (DC, ADCS, sync server access)?
├── YES → Nebula (no external runtime dependency, CA offline)
│ └── Admin workstation: cloud admin VM (W365/AVD) or physical PAW, enrolled in Nebula
Does the client need a T1 overlay (servers, cloud workloads, K8s, DevOps)?
├── YES → Layer 1 (network access)
│ ├── Wants managed service + commercial support → Tailscale + Entra OIDC + key expiry MFA
│ └── Wants full sovereignty / data residency → Headscale + WireGuard
Does the client need protocol-aware session recording / JIT / DB access?
├── YES → Add Layer 2 (PAM)
│ ├── < 100 employees AND < $10M revenue → Teleport CE (free, self-hosted)
│ ├── Larger org / needs support → Teleport Enterprise (commercial)
│ └── SSH-only, budget-constrained → Smallstep (certificates only)
│ ├── Larger org / needs support → Teleport Enterprise (commercial, verify current pricing)
│ └── SSH-only, budget-constrained → Smallstep (certificates only, no session recording)
Does the client need both layers?
├── MOST CLIENTS → Tailscale (network) + Teleport CE/Enterprise (PAM)
└── OT/CRITICAL INFRA → Headscale (sovereign network) + Teleport (recorded vendor access)
Typical SME multi-cloud client:
├── T0: Nebula + cloud admin VMs
├── T1: Tailscale + Entra OIDC
└── Session recording: Teleport CE if eligible, otherwise accept the gap and compensate with
cloud VM audit logging and Tailscale connection logs
OT / Critical infrastructure:
└── Headscale (sovereign T1) + Nebula (T0 where applicable) + Teleport (vendor session recording)
```
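The updated branches of the framework can also be read as a function from client attributes to recommendations. The sketch below is one possible encoding, not part of the framework itself: the `Client` fields and `recommend` are hypothetical names, the Teleport CE thresholds are copied from the tree, and checking `ssh_only` before the size thresholds is an assumption about branch order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Client:
    own_data_centre: bool
    needs_t0_overlay: bool
    needs_t1_overlay: bool
    wants_sovereignty: bool
    needs_pam: bool
    employees: int
    revenue_usd: float
    ssh_only: bool = False


def recommend(c: Client) -> List[str]:
    """Return the recommended components for a client profile."""
    picks: List[str] = []
    if c.own_data_centre:
        # Physical substrate already exists; an overlay adds complexity.
        picks.append("Management VLAN segmentation + jump box")
    else:
        if c.needs_t0_overlay:
            picks.append("Nebula (T0) + cloud admin VM or physical PAW")
        if c.needs_t1_overlay:
            picks.append("Headscale + WireGuard (T1)" if c.wants_sovereignty
                         else "Tailscale + Entra OIDC (T1)")
    if c.needs_pam:
        if c.ssh_only:
            picks.append("Smallstep (certificates only, no session recording)")
        elif c.employees < 100 and c.revenue_usd < 10_000_000:
            picks.append("Teleport CE")
        else:
            picks.append("Teleport Enterprise")
    return picks
```

A small multi-cloud client that needs both overlays and session recording would come out as Nebula, Tailscale, and Teleport CE, which matches the "Typical SME multi-cloud client" summary in the tree.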
---