Privileged Access Architecture

"Your VPN authenticates people to your network. Your PAM authenticates people to specific resources inside it. Most organisations deploy neither correctly. The result is a flat network where a compromised laptop reaches every server, and a stolen VPN credential reaches everything else."

For the Executive Reader

Every organisation has two access control problems hiding behind the label "VPN":

  1. Who can reach the network? The VPN problem — getting authorised people onto the network at all.
  2. Who can touch which specific systems? The PAM problem — ensuring that once inside the network, users can only reach what they need, nothing more, and every action is recorded.

Most organisations solve the first problem badly (legacy VPN, IP whitelisting, overlapping access methods nobody remembers creating) and ignore the second entirely. An attacker who compromises a VPN credential in this configuration has access to everything.

The antifragile answer is a two-layer architecture: network access (Tailscale or Headscale) sitting in front of protocol-aware privileged access (Teleport). Each layer can be deployed independently. Together they close the most common kill chain in the playbook.

For module selection, see Modular Engagements. For the asset classification that determines which systems require PAM, see T0 Asset Framework.


When overlay management networks help — and when they don't

Enterprises with their own data centres already have the physical substrate for a proper management network: dedicated VLANs, hardware segmentation, jump boxes. Adding an overlay management network introduces a new Tier 0 component (the coordinator) on top of infrastructure that already solves the problem. The complexity cost outweighs the benefit. Traditional management VLAN segmentation, done properly, is the right answer.

SME clients with multi-cloud resources, containers, and DevOps workloads have a different problem: there is no physical network to segment. Resources are scattered across Azure, AWS, a colo, and maybe on-prem. The management plane does not exist yet — you are building it. An overlay is how you build it, and it is the right answer for this context.

The T0/T1 split — applying the tier model to the overlay itself:

  • T0 systems (domain controllers, ADCS, Entra Connect sync server — the identity control plane): use Nebula. No coordinator in the runtime path — once certificates are distributed, the overlay functions with zero external dependencies. The Nebula CA is the only Tier 0 component, and it can be kept offline. This means no coordinator to compromise, no external API call, no cloud service availability dependency for reaching your most critical systems.
  • T1 systems (member servers, cloud workloads, Kubernetes clusters, multi-cloud management): use Tailscale (or Headscale for sovereign requirements). Per-node ACLs, Entra OIDC integration, per-session MFA via key expiry and IdP enforcement. The coordinator trust concern is more acceptable at T1 — a compromised coordinator affects T1 access, not T0.

The T0 node count is not scary. For a 5,000-person organisation, the realistic T0 Nebula population is:

| Component | Count |
| --- | --- |
| Domain Controllers | 4–8 |
| Entra Connect / Cloud Sync server | 1–2 |
| ADCS issuing CA | 1–2 |
| AD FS servers (if not yet removed) | 0–4 |
| Cloud admin VMs / PAWs | 5–10 |
| Total | ~15–25 nodes |

Certificate management for 15–25 nodes is a documented procedure, not an operational burden. The CA signing ceremony happens a few times a year when a PAW is replaced or an admin leaves. This is tractable.
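
The total can be sanity-checked by summing the per-component bounds. The sketch below is illustrative only: it reads the table's counts as ranges (4–8 domain controllers, 1–2 sync servers, 1–2 issuing CAs, 0–4 AD FS servers, 5–10 admin workstations), and the dictionary and function names are hypothetical.

```python
# Per-component T0 node ranges as (low, high), read from the table above.
T0_COMPONENTS = {
    "Domain Controllers": (4, 8),
    "Entra Connect / Cloud Sync server": (1, 2),
    "ADCS issuing CA": (1, 2),
    "AD FS servers (if not yet removed)": (0, 4),
    "Cloud admin VMs / PAWs": (5, 10),
}


def total_range(components):
    """Sum the low bounds and the high bounds across all components."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


low, high = total_range(T0_COMPONENTS)
print(f"T0 Nebula population: {low} to {high} nodes")
```

The strict bounds come out at 11 to 26 nodes, which is consistent with the table's rounded total of roughly 15–25.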


The PAW problem and the cloud admin VM

Physical PAWs are the right principle. They almost never get deployed. Hardware procurement, second device on the desk, behaviour change — the project dies before it starts.

The cloud-hosted admin workstation preserves the essential security properties without the hardware problem:

  • A Windows 365 or Azure Virtual Desktop VM provisioned from a hardened template
  • Used only for privileged tasks (no email, no general browsing)
  • Connected to the Nebula T0 overlay (for DC access) and Tailscale T1 overlay (for server/cloud access)
  • Accessed by the admin from their normal device via browser or RDP client
  • Privileged credentials live in the cloud VM, not on the admin's local device
  • Compromise response: wipe the VM, reprovision from template in 20 minutes

The security property that matters is preserved: privileged credentials do not touch the device used for email and browsing. An attacker who compromises the admin's local device gets a browser session to a cloud VM that requires phishing-resistant MFA to reach. They do not get cached credentials, session tokens, or the private keys for either management overlay.

When to use a physical PAW instead: clients with a strong security culture and genuine appetite for the operational overhead, OT/ICS environments where the management workstation may need to be air-gapped, or engagements where the threat model includes a sophisticated attacker who would attempt to compromise the RDP session interactively.


The Two Layers

Layer 1: Network Access — Tailscale / Headscale + WireGuard

What it solves: Replace the legacy VPN sprawl. Admins and remote workers get secure, identity-aware access to internal networks without exposing services to the internet.

How it works: WireGuard mesh VPN managed by a control plane (Tailscale as a service, or Headscale self-hosted). Every device gets a node identity. Access is controlled by ACL policies, not IP rules. No open firewall ports required on servers.

Why it matters for security:

  • Eliminates the "VPN = everything" flat-network problem via ACL policies
  • Every connection is mutually authenticated (node key + user identity)
  • Audit log of who connected to what, when
  • Access can be revoked instantly by removing a node from the control plane
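
The difference between ACL policies and a flat network can be illustrated with a toy default-deny evaluator. This is a minimal sketch, not Tailscale's policy engine or its policy syntax; the `AclRule` type, the group and tag names, and the `is_allowed` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclRule:
    src_group: str  # hypothetical identity group, e.g. "group:admins"
    dst_tag: str    # hypothetical resource tag, e.g. "tag:server"
    port: int


# Example policy: admins may SSH to servers; vendors may reach one web app.
RULES = [
    AclRule("group:admins", "tag:server", 22),
    AclRule("group:vendors", "tag:vendor-app", 443),
]


def is_allowed(user_groups, dst_tag, port, rules=RULES):
    """Default deny: a connection is allowed only if some rule matches it."""
    return any(
        r.src_group in user_groups and r.dst_tag == dst_tag and r.port == port
        for r in rules
    )


print(is_allowed({"group:admins"}, "tag:server", 22))
print(is_allowed({"group:vendors"}, "tag:server", 22))
```

In this model an admin reaches SSH on tagged servers, while a vendor identity is refused on the same servers because no rule grants it. Anything not explicitly listed is denied, which is the property a flat VPN lacks.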

Layer 2: Protocol-Aware PAM — Teleport

What it solves: Once someone is on the network, enforce which specific servers, databases, and Kubernetes clusters they can access — and record every session in a tamper-evident audit trail.

How it works: Teleport proxies connections to SSH servers, Windows hosts (RDP), Kubernetes clusters, and databases. Users authenticate once (SSO/MFA); Teleport issues short-lived certificates. Sessions are recorded and searchable. No static credentials stored on servers.

Why it matters for security:

  • Eliminates shared/static credentials on servers (root, administrator)
  • Just-in-time access: permissions expire, removing standing access
  • Session recording: every sudo, every SQL query, every RDP session
  • Auditor-ready evidence: access logs that regulators actually accept
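
Just-in-time access reduces to a simple idea: a grant carries its own expiry, so standing access never accumulates. The sketch below models that idea only. It is not Teleport's API; the `AccessGrant` type, the `approve` and `is_active` helpers, and the four-hour default lifetime are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AccessGrant:
    user: str
    resource: str
    approved_by: str
    expires_at: datetime


def approve(user, resource, approver, now, ttl=timedelta(hours=4)):
    """Create a time-boxed grant; the 4-hour default is an assumed value."""
    return AccessGrant(user, resource, approver, now + ttl)


def is_active(grant, now):
    """A grant is usable only before its expiry; nothing needs to be revoked."""
    return now < grant.expires_at


start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
grant = approve("alice", "db-prod", "bob", start)
print(is_active(grant, start + timedelta(hours=1)))
print(is_active(grant, start + timedelta(hours=5)))
```

The grant is active one hour after approval and inactive five hours after, without any explicit revocation step.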

Tool Details

Teleport

| Attribute | Detail |
| --- | --- |
| What it does | Protocol-aware privileged access proxy for SSH, RDP, Kubernetes, databases, and internal web applications. Short-lived certificates. Full session recording. |
| Antifragile pillar | Sovereign Intelligence, Structural Decoupling |
| Open-source status | Community Edition (CE) is open-source and self-hosted |

CE Eligibility — Be Honest With Clients

Teleport CE is an excellent, capable product. The licensing constraint is important to communicate clearly:

Teleport CE is free for organisations with fewer than 100 employees AND less than $10M annual revenue. Both conditions must be met.

This catches more clients than it appears. A manufacturing company with 800 employees and 6 administrators who would touch Teleport cannot legally deploy CE, even though it would work perfectly for their use case. When in doubt, check with the client's legal team before deploying CE at scale.
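
The eligibility rule is a strict conjunction, which is easy to get wrong in conversation. A minimal sketch, using the thresholds stated above; the function and constant names are hypothetical, and the current license text should always be checked before relying on it.

```python
CE_EMPLOYEE_LIMIT = 100          # must have fewer employees than this
CE_REVENUE_LIMIT_USD = 10_000_000  # must have less annual revenue than this


def teleport_ce_eligible(employees: int, annual_revenue_usd: float) -> bool:
    """Both conditions must hold; the number of Teleport users is irrelevant."""
    return employees < CE_EMPLOYEE_LIMIT and annual_revenue_usd < CE_REVENUE_LIMIT_USD


print(teleport_ce_eligible(50, 5_000_000))    # small consultancy
print(teleport_ce_eligible(800, 6_000_000))   # the manufacturing example
print(teleport_ce_eligible(40, 12_000_000))   # small headcount, high revenue
```

Only the first case qualifies. The manufacturing company fails on headcount even though just 6 administrators would use the tool, and the third case fails on revenue alone.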

| Scenario | Recommendation |
| --- | --- |
| < 100 employees AND < $10M revenue | Teleport CE: free, self-hosted, full feature set for this scale |
| ≥ 100 employees OR ≥ $10M revenue | Teleport Enterprise (commercial) or see Alternatives below |
| Client needs vendor support | Teleport Enterprise regardless of size |
| Client has sovereign data mandate | Teleport CE or Enterprise self-hosted (both are self-hosted) |
| OT/SCADA vendor remote access at scale | Teleport Enterprise: session recording and just-in-time access are critical |

Teleport CE vs Enterprise Feature Comparison

Feature CE Enterprise
SSH, RDP, K8s, DB access proxying
Session recording
Short-lived certificates
SSO integration
Just-in-time (JIT) access
Access request workflows
Device trust (trusted devices only) Limited
Access monitoring & alerts Limited
FedRAMP / compliance reports
Commercial support SLA
High availability clustering Limited
License restriction < 100 employees AND < $10M revenue None

The conversation for non-qualifying clients:

"Teleport CE would work technically — your admins would love it. The license terms prohibit it for organisations your size. We can deploy Teleport Enterprise (priced per protected resource, not per user), or we can architect the network access layer with Tailscale and use certificate-based SSH access for the protocol layer. Both are valid paths. The right choice depends on whether session recording and JIT workflows are on your auditor's checklist."


Tailscale — Commercial Partnership

| Attribute | Detail |
| --- | --- |
| What it does | Managed WireGuard mesh VPN. Every device gets a node identity. Access controlled by ACL policies. Works on any device, any OS, any cloud. |
| Why we partner | Tailscale provides the managed control plane, commercial support, and SSO integrations that make enterprise deployment painless. Per-user pricing is predictable. |
| Sovereign alternative | Headscale (open-source self-hosted control plane for WireGuard); see below |
| Antifragile pillar | Structural Decoupling, Optionality Preservation |
| Engagement modules | Module 2 (Identity Security), Module 6 (AD Hardening), Module 8 (OT Security), Module 13 (this module) |

When to recommend Tailscale (commercial):

  • Client wants commercial support and SLA
  • Client needs Tailscale's SSO integrations (Okta, Microsoft Entra ID, Google)
  • Client has a mixed-device estate that benefits from Tailscale's client apps
  • Client's procurement requires a vendor contract

The conversation:

"You currently have a legacy VPN that requires a specific client, routes all traffic through your data centre, and gives everyone access to the same network. Tailscale replaces it with a mesh that puts every authorised device directly in contact with every authorised resource — no central bottleneck, no broad network exposure. An admin in Prague connects to the server in Vienna as if they are on the same LAN. A supplier accesses only the one application they need, nothing else. When you revoke access, it is immediate and complete."


Headscale + WireGuard — Sovereign Alternative

| Attribute | Detail |
| --- | --- |
| What it does | Self-hosted, open-source implementation of the Tailscale control server (Headscale) for WireGuard mesh networks, used with the standard Tailscale clients. Covers the core Tailscale functionality without the external control plane. Data never leaves client infrastructure. |
| Why we use it | For clients with sovereign-data mandates, air-gapped environments, or regulated industries where data about network topology and device identities cannot reside with a third party. |
| Trade-off vs Tailscale | More engineering overhead; no managed apps; SSO integration requires custom OIDC configuration; no commercial support |
| Antifragile pillar | Sovereign Intelligence, Structural Decoupling |
| When to deploy | Clients with NIS2/DORA requirements on data residency; utilities/OT environments; clients who have explicitly declined SaaS control planes |

Deployment model: Headscale server on client infrastructure or CQRE-managed VM; Tailscale clients on all devices, registered against the Headscale server. Managed by us as a retained service or handed over to the client's infrastructure team.


Nebula — T0 Management Overlay

| Attribute | Detail |
| --- | --- |
| What it does | Overlay mesh built on the Noise Protocol Framework (not WireGuard), with no coordinator in the runtime path. Nodes authenticate via pre-distributed certificates signed by a local CA. Lighthouse nodes handle peer discovery and NAT traversal only; they are not in the authentication path. |
| Why it is right for T0 | No external runtime dependency. A compromised or unavailable coordinator cannot affect T0 access. The CA (the actual trust anchor) can be kept offline and brought up only for certificate issuance. |
| Trade-off vs Tailscale | No dynamic node management (adding/removing a node requires a CA operation and cert redistribution); no cloud-managed control plane; higher initial setup complexity; certificate revocation requires distributing an updated blocklist |
| Why the trade-off is acceptable for T0 | T0 node population is small (15–25 nodes) and stable. Revocation events (lost PAW, departing admin) are rare and known immediately. The operational overhead is a documented ceremony run a few times a year, not a recurring burden. |
| Antifragile pillar | Structural Decoupling, Sovereign Intelligence |
| When to deploy | T0 systems (DCs, sync server, ADCS) in any estate; air-gapped or restricted environments; clients where the management plane must have zero external runtime dependencies |

Nebula CA management — the one non-trivial operation:

The Nebula CA private key is the trust anchor for the entire T0 overlay. It must be treated accordingly:

  • Air-gapped machine (a dedicated laptop that is never networked, or a hardware security module)
  • Documented signing ceremony: who is authorised to sign a new certificate, what approval is required, what the procedure is
  • Named individuals (minimum two) who know the procedure and can perform it
  • CA key backup: encrypted, stored separately from the signing machine, tested
  • Short certificate lifetimes (90–180 days) so revocation is handled implicitly by non-renewal as much as by explicit blocklist distribution

This is the same discipline as an offline root CA — because that is functionally what it is.
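
Short lifetimes only work if renewals are tracked. The sketch below shows one way to flag certificates that are approaching expiry, so that a deliberate non-renewal acts as revocation. It does not call any Nebula tooling; the function names, the 90-day default lifetime (the lower end of the range above), and the 14-day renewal window are assumptions.

```python
from datetime import date, timedelta


def expiry(issued: date, lifetime_days: int = 90) -> date:
    """Expiry date for a certificate issued on `issued`."""
    return issued + timedelta(days=lifetime_days)


def status(issued: date, today: date, lifetime_days: int = 90,
           renew_window_days: int = 14) -> str:
    """Classify a certificate as 'valid', 'renew' (inside the window), or 'expired'."""
    expires = expiry(issued, lifetime_days)
    if today >= expires:
        return "expired"
    if today >= expires - timedelta(days=renew_window_days):
        return "renew"
    return "valid"


issued = date(2024, 1, 1)
for offset in (10, 80, 90):
    print(offset, status(issued, issued + timedelta(days=offset)))
```

A node whose certificate reaches "renew" goes on the agenda for the next signing ceremony. A departed admin's certificate is simply left to reach "expired", with blocklist distribution covering the gap until then.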


Smallstep — Certificate-Based SSH Access

| Attribute | Detail |
| --- | --- |
| What it does | SSH certificate authority. Issues short-lived SSH certificates tied to identity (SSO/OIDC). Eliminates static SSH keys. No agent required on target servers. |
| Why we use it | For clients who need certificate-based SSH access control but cannot justify Teleport. Covers the most common privileged access vector (SSH) at low cost and complexity. |
| Limitation vs Teleport | No session recording; no RDP/Kubernetes/DB proxying; no GUI |
| Antifragile pillar | Sovereign Intelligence, Structural Decoupling |
| When to deploy | Linux-heavy clients; DevOps teams; as a stepping stone before Teleport |

The Decision Framework

Does the client have their own data centre with physical network infrastructure?
├── YES → Traditional management VLAN segmentation + jump box
│          Overlay adds complexity without proportional benefit here
└── NO / Multi-cloud / Scattered resources → Overlay is the right management plane

Does the client need a T0 management overlay (DC, ADCS, sync server access)?
├── YES → Nebula (no external runtime dependency, CA offline)
│   └── Admin workstation: cloud admin VM (W365/AVD) or physical PAW, enrolled in Nebula
│
Does the client need a T1 overlay (servers, cloud workloads, K8s, DevOps)?
├── YES → Layer 1 (network access)
│   ├── Wants managed service + commercial support → Tailscale + Entra OIDC + key expiry MFA
│   └── Wants full sovereignty / data residency → Headscale + WireGuard
│
Does the client need protocol-aware session recording / JIT / DB access?
├── YES → Add Layer 2 (PAM)
│   ├── < 100 employees AND < $10M revenue → Teleport CE (free, self-hosted)
│   ├── Larger org / needs support → Teleport Enterprise (commercial, verify current pricing)
│   └── SSH-only, budget-constrained → Smallstep (certificates only, no session recording)
│
Typical SME multi-cloud client:
├── T0: Nebula + cloud admin VMs
├── T1: Tailscale + Entra OIDC
└── Session recording: Teleport CE if eligible, otherwise accept the gap and compensate with
    cloud VM audit logging and Tailscale connection logs

OT / Critical infrastructure:
└── Headscale (sovereign T1) + Nebula (T0 where applicable) + Teleport (vendor session recording)
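
The tree above can also be read as a small function, which is useful for checking that an intake questionnaire covers every branch. This is one possible reading, not a definitive encoding: the `Client` fields and the `recommend` function are hypothetical, and the sketch assumes that owning a data centre replaces the overlay questions but leaves the PAM question open.

```python
from dataclasses import dataclass


@dataclass
class Client:
    """Hypothetical intake answers; field names are illustrative."""
    own_data_centre: bool
    needs_t0_overlay: bool
    needs_t1_overlay: bool
    wants_sovereignty: bool
    needs_pam: bool
    employees: int
    annual_revenue_usd: float
    needs_vendor_support: bool = False
    ssh_only_budget: bool = False


def recommend(c: Client) -> list[str]:
    """Walk the four questions in order and collect the recommended components."""
    stack = []
    if c.own_data_centre:
        # Assumption: the YES branch replaces the overlay questions entirely.
        stack.append("Management VLAN segmentation + jump box")
    else:
        if c.needs_t0_overlay:
            stack.append("Nebula (T0) + cloud admin VM or physical PAW")
        if c.needs_t1_overlay:
            stack.append("Headscale + WireGuard" if c.wants_sovereignty
                         else "Tailscale + Entra OIDC")
    if c.needs_pam:
        if c.ssh_only_budget:
            stack.append("Smallstep (certificates only, no session recording)")
        elif (c.employees < 100 and c.annual_revenue_usd < 10_000_000
              and not c.needs_vendor_support):
            stack.append("Teleport CE")
        else:
            stack.append("Teleport Enterprise")
    return stack


sme = Client(own_data_centre=False, needs_t0_overlay=True, needs_t1_overlay=True,
             wants_sovereignty=False, needs_pam=True,
             employees=60, annual_revenue_usd=4_000_000)
print(recommend(sme))
```

For the typical SME multi-cloud client this yields Nebula for T0, Tailscale for T1, and Teleport CE. The "accept the gap and compensate with logging" option for ineligible clients is a judgment call that the sketch deliberately leaves out.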

OT and Critical Infrastructure Considerations

This module is especially valuable for Module 8 (OT Security Assessment) clients. The most common and dangerous finding in OT environments is uncontrolled vendor remote access: SCADA vendors, maintenance contractors, and automation engineers with persistent VPN credentials and no session recording.

The OT-specific requirements:

| Requirement | Solution |
| --- | --- |
| Vendor access without standing credentials | Teleport JIT access: vendor requests access, engineer approves, session recorded, credential expires |
| No persistent VPN for OT networks | Tailscale/Headscale ACL policy: vendor node can reach only the specific OT asset, nothing else |
| Auditability for regulators (NIS2, CER) | Teleport session recordings: complete record of every vendor action on every OT system |
| Air-gapped or restricted networks | Headscale on-premise: no outbound control plane dependency |
| Separation of IT and OT access | Separate Tailscale/Headscale networks with explicit, audited bridge points |

The executive pitch for utilities and telco:

"Your SCADA vendor has a VPN credential that gives them access to your control network. It has not been rotated in three years. You do not know when they last used it or what they did. If that credential is compromised, an attacker has access to your control systems without ever touching your IT network. We replace that with a session that the vendor requests on the day they need it, that an engineer approves, that is recorded start to finish, and that expires the moment the maintenance window closes. This is not extra bureaucracy. This is the audit evidence your regulator will ask for under NIS2."


CQRE Deployment Tiers

| Tier | Description | Best for |
| --- | --- | --- |
| Assessment & Design | Architecture review, tool selection, design document, implementation roadmap | Clients with existing VPN/PAM debt; pre-deployment planning |
| Managed Deployment | CQRE installs and configures the chosen stack; hands over to client team | Clients without internal infrastructure expertise |
| Fully Managed Service | CQRE operates the network access and PAM layer as a managed service | Clients who want the capability without the operational burden |
| Retained Advisory | Quarterly reviews, policy updates, incident support | Clients who have deployed and want ongoing assurance |

Per-Module Tool Pairing

| Module | Access Architecture Role |
| --- | --- |
| Module 2: M365 Identity Security | Tailscale/Headscale for admin access to cloud management plane; Teleport for server access |
| Module 6: On-Premise AD Hardening | Teleport CE as PAW replacement for domain controller access; recorded sessions for all Tier 0 admin activity |
| Module 8: OT Security Assessment | Headscale for sovereign OT network access; Teleport for vendor access with full session recording |
| Module 10: Red Team & Validation | Verify that Tailscale ACLs actually enforce segmentation; test Teleport JIT bypass scenarios |
| Module 13: This module | Full deployment of chosen network + PAM stack |

Integration With Existing Frameworks

| Document | Integration |
| --- | --- |
| T0 Asset Framework | T0 assets (domain controllers, key servers, OT controllers) require Teleport session recording; Tailscale ACLs isolate T0 network segments |
| AD and Endpoint Hardening | PAW architecture is enhanced by Teleport; privileged accounts should authenticate through PAM, not direct RDP |
| Sovereign Tool Stack | Tailscale/Headscale extends the network access layer; Teleport extends the identity and session intelligence layer |
| Vertical: Power and Utilities | Vendor remote access to OT is addressed directly by this module |
| Vertical: Telco | Network operations centre access, vendor access to network elements |

For the OT security context, see Vertical: Power and Utilities. For identity and T0 asset protection, see T0 Asset Framework. For the full module menu, see Modular Engagements.