Initial commit: antifragile cybersecurity consulting blueprint

Complete repository of frameworks, playbooks, and assessment resources
for cybersecurity consultations focused on antifragile enterprise design.

Includes:
- Core philosophy and manifest (5 pillars)
- 12 modular engagement packages
- AI sovereignty and operations frameworks
- Zero-budget vulnerability discovery and hardening playbooks
- M365 E3 hardening and antifragile project plans
- Osquery sovereign discovery platform blueprint
- Perimeter scanning capability guide
- AI-assisted TVM blueprint for AI-powered adversaries
- Vertical specializations: banking, telco, power/utilities
- CIS Controls v8 and NIST CSF 2.0 mappings
- Risk registers and assessment templates
- C-suite conversation guide and business case templates
2026-05-09 16:53:22 +02:00
commit 763da003d3
35 changed files with 9711 additions and 0 deletions


@@ -0,0 +1,108 @@
# Antifragile Enterprise Consulting Repository
> *"Wind extinguishes a candle and energizes fire. You want to be the fire and wish for the wind."* — Nassim Nicholas Taleb
This repository contains reusable frameworks, playbooks, and assessment resources for consulting engagements focused on building **antifragile organizations**—enterprises that do not merely survive disruption but grow stronger from it.
## What Is Antifragile?
Most security and resilience frameworks optimize for **robustness**—the ability to withstand shocks. Antifragility goes further. An antifragile system:
- **Benefits from volatility** and stressors
- **Learns faster** from failures than from successes
- **Decentralizes critical functions** to avoid single points of failure
- **Treats optionality as a strategic asset**, not overhead
## Repository Structure
```
├── core/ # Foundational frameworks and principles
│ ├── move-fast-and-fix-things.md # Company philosophy: speed, repair, existing tools
│ ├── antifragile-manifest.md # The five pillars of antifragile enterprise
│ ├── modular-engagements.md # Menu of independent, self-contained modules
│ ├── ai-sovereignty-framework.md # AI sovereignty as a strategic mandate
│ ├── ai-operations-inevitability.md # Why defensive AI is inevitable (business AI is optional)
│ ├── azure-openai-sovereignty-bridge.md # Azure OpenAI/Foundry as sovereignty stepping stone
│ ├── organizational-resilience.md # Dev/Sec/Ops merger and shift-left arguments
│ ├── quality-management-engagement.md # Embedded process assurance for teams feeling "not in control"
│ ├── blue-purple-team-foundation.md # Building defensive capability from existing tools
│ ├── retained-capability.md # What to keep in-house when outsourcing security (MSSP, pentest, compliance)
│ ├── executive-summary.md # One-page board brief
│ ├── c-suite-conversation-guide.md # Persuasion scripts for top management
│ └── t0-asset-framework.md # Tier 0 asset classification and protection
├── playbooks/ # Executable modernisation and response plans
│ ├── rapid-modernisation-plan.md # 30-60-90-180 day transformation roadmap
│ ├── endpoint-management-entry-vector.md # Intune/device management as engagement entry point
│ ├── ai-assisted-tvm.md # AI-powered vulnerability management blueprint
│ ├── zero-budget-vulnerability-discovery.md # Script-based vuln discovery without commercial scanners
│ ├── perimeter-scanning-capability.md # External attack surface scanning strategy
│ ├── osquery-custom-platform.md # Build a sovereign vuln/asset discovery platform on osquery
│ ├── m365-antifragile-project.md # M365 greenfield/modernisation with antifragile design
│ ├── m365-e3-hardening.md # M365 E3-specific tactical hardening
│ ├── ad-endpoint-hardening.md # On-prem AD, Windows endpoint, hybrid identity
│ ├── zero-budget-hardening.md # Maximize existing tool investment
│ ├── implementation-playbook.md # Step-by-step operational guide
│ └── business-case-template.md # Financial justification and ROI framework
├── assessment-templates/ # Diagnostic tools and maturity models
│ ├── README.md # Assessment roadmap and development plan
│ ├── antifragile-risk-register.md # Antifragile risk taxonomy and register template
│ └── m365-project-risk-register.md # M365 project-specific risk register
├── reference/ # External standards, mappings, and citations
│ ├── cis-controls-mapping.md # CIS Controls v8 alignment
│ ├── nist-csf-mapping.md # NIST CSF 2.0 alignment
│ ├── vertical-power-utilities.md # Power generation, transmission, water utilities
│ ├── vertical-telco.md # Telecommunications and mobile operators
│ └── vertical-banking.md # Financial services regulatory alignment
└── assets/ # Diagrams, visuals, and presentation materials
```
## Our Posture: Move Fast and Fix Things
This practice is built on a simple, actionable stance: **move fast and fix things**. We do not wait for perfect plans. We identify the kill chain, extract value from existing investments, and close existential gaps before they become incidents.
- **Speed is a security control.** A 90% solution deployed today outperforms a 100% solution that ships in six months.
- **Work beats purchases.** Most organizations own 60-80% of the capabilities they need. We configure and operationalize before we shop.
- **Every fix must produce a signal.** A remediation without telemetry is a remediation that will rot.
Read the full [Move Fast and Fix Things](core/move-fast-and-fix-things.md) philosophy.
## Core Pillars
1. **[Structural Decoupling](core/antifragile-manifest.md#pillar-1-structural-decoupling)** — Remove hidden dependencies before they become fatal ones
2. **[Optionality Preservation](core/antifragile-manifest.md#pillar-2-optionality-preservation)** — Maintain strategic exits and alternatives at every layer
3. **[Stress-to-Signal Conversion](core/antifragile-manifest.md#pillar-3-stress-to-signal-conversion)** — Turn failures, attacks, and outages into intelligence
4. **[Sovereign Intelligence](core/antifragile-manifest.md#pillar-4-sovereign-intelligence)** — Own your cognitive infrastructure; never rent your ability to think
5. **[Asymmetric Payoff Design](core/antifragile-manifest.md#pillar-5-asymmetric-payoff-design)** — Engineer outcomes where small investments yield disproportionate protection
## Standards Alignment
Our approach is not an alternative to established frameworks. It is the fastest path to meeting them while building real resilience:
- **[CIS Controls v8](reference/cis-controls-mapping.md)** — IG1 as a non-negotiable 90-day floor, achieved primarily through existing tool configuration
- **[NIST CSF 2.0](reference/nist-csf-mapping.md)** — All six functions addressed with emphasis on GOVERN as the missing keystone
## Quick Start for Executives and Board Members
1. **Read** [Executive Summary](core/executive-summary.md) — one page, five minutes, the full case
2. **Review** [Business Case Template](playbooks/business-case-template.md) — financial justification, ROI, and risk quantification
3. **Browse** [C-Suite Conversation Guide](core/c-suite-conversation-guide.md) — how your advisors should frame the conversation
## Quick Start for Consultants
1. **Open** `core/move-fast-and-fix-things.md` — understand the engagement posture
2. **Read** `core/antifragile-manifest.md` — understand the philosophy
3. **Study** `playbooks/m365-e3-hardening.md` — master the primary client environment (most clients are E3)
4. **Study** `playbooks/ad-endpoint-hardening.md` — cover on-premises AD and endpoint gaps
5. **Study** `playbooks/zero-budget-hardening.md` — extract value from existing tools in 30 days
6. **Deploy** `playbooks/rapid-modernisation-plan.md` — run the 30-60-90-180 day roadmap
7. **Reference** `core/t0-asset-framework.md` and `core/ai-sovereignty-framework.md` — classify assets and own intelligence
8. **Map** `reference/cis-controls-mapping.md` and `reference/nist-csf-mapping.md` — align to standards
9. **Adapt** `reference/vertical-power-utilities.md`, `reference/vertical-telco.md`, or `reference/vertical-banking.md` — tailor for regulated critical infrastructure clients
## Usage and Licensing
These documents are designed for reuse across client engagements. Adapt, remix, and extend. Credit the framework when presenting externally.
---
*Built for practitioners who defend the future, not just the perimeter.*


@@ -0,0 +1,77 @@
# Assessment Templates
> *"What gets measured gets managed. What gets managed honestly becomes antifragile."*
This directory contains diagnostic tools, maturity models, and assessment resources for evaluating organizational antifragility. Two production-ready tools are available now; additional assessments are in active development.
## Planned Assessments
### 1. Antifragile Maturity Model (AF-MM)
A five-level maturity model covering:
- **Level 1: Fragile** — Reactive, undocumented, dependent on single vendors
- **Level 2: Robust** — Documented, monitored, but static
- **Level 3: Resilient** — Automated recovery, tested backups, incident response operational
- **Level 4: Adaptive** — Chaos engineering, continuous learning, structural improvement from failure
- **Level 5: Antifragile** — Volatility is exploited for gain, optionality is strategic, intelligence is sovereign
### 2. AI Sovereignty Readiness Assessment
Evaluates:
- Current AI usage inventory completeness
- Data classification and leakage risk
- Local infrastructure readiness
- Vendor dependency and exit feasibility
- Regulatory compliance posture
### 3. T0 Asset Discovery Scanner
Planned scripted assessment to:
- Enumerate critical assets across on-premises and cloud environments
- Classify assets by tier based on dependency mapping
- Identify gaps in protection, monitoring, and recovery
- Generate prioritized remediation roadmap
### 4. Dependency Risk Mapper
Planned tool to:
- Map vendor and technology dependencies
- Calculate coupling depth and exit difficulty
- Identify hidden single points of failure
- Simulate failure cascades
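The core of such a mapper can be sketched in a few lines of Python. Everything below is a hypothetical illustration — the system names and dependency edges are invented, and a production version would use a proper graph library as the roadmap suggests:

```python
# Hypothetical sketch: find hidden single points of failure by removing each
# node and checking whether any other system loses a dependency it could
# previously reach. All system names and edges are invented examples.
from collections import defaultdict

# Edges point from a dependent system to what it depends on.
deps = [
    ("payroll", "erp"), ("erp", "sql-cluster"), ("crm", "sql-cluster"),
    ("sql-cluster", "san-storage"), ("intranet", "sso"), ("crm", "sso"),
]

graph = defaultdict(set)
for a, b in deps:
    graph[a].add(b)

def reachable(start, removed):
    """All dependencies transitively reachable from `start`, skipping `removed`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

systems = {n for edge in deps for n in edge}
baseline = {s: reachable(s, None) for s in systems}

# A node is a *hidden* single point of failure if removing it cuts some other
# system off from a dependency beyond the removed node itself.
spofs = {
    r for r in systems
    if any(baseline[s] - {r} != reachable(s, r) for s in systems if s != r)
}
print(sorted(spofs))  # ['erp', 'sql-cluster']
```

Note the definition deliberately excludes leaf dependencies (everyone can see those); it surfaces the intermediate nodes whose failure cascades.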
### 5. Incident Learning Index
Measures the organization's ability to convert incidents into structural improvements:
- Mean time to structural fix
- Post-mortem completion rate
- Structural changes implemented per incident
- Repeat incident rate
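These metrics fall out of any structured incident log. A minimal sketch — the field layout and records are invented examples, not a prescribed schema:

```python
# Hypothetical sketch: compute Incident Learning Index metrics from an
# incident log. All field names and records are invented examples.
from datetime import date
from statistics import mean

incidents = [
    # (id, occurred, structural_fix_date or None, postmortem_done, root_cause)
    ("INC-1", date(2024, 1, 10), date(2024, 2, 1), True,  "stale-guest"),
    ("INC-2", date(2024, 3, 5),  None,             True,  "phishing"),
    ("INC-3", date(2024, 4, 12), date(2024, 4, 20), False, "stale-guest"),
]

# Mean time to structural fix, over incidents that actually produced one.
fixed = [(f - o).days for _, o, f, _, _ in incidents if f]
mttsf = mean(fixed)

# Post-mortem completion rate.
pm_rate = sum(p for *_, p, _ in incidents) / len(incidents)

# Repeat incident rate: share of incidents whose root cause recurred.
causes = [c for *_, c in incidents]
repeat_rate = 1 - len(set(causes)) / len(causes)

print(f"MTTSF: {mttsf:.1f} days, post-mortems: {pm_rate:.0%}, "
      f"repeats: {repeat_rate:.0%}")
```

An unfixed incident (like INC-2) simply drops out of MTTSF but still drags down the other two metrics — which is the behaviour you want from an honesty-forcing index.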
## Development Roadmap
| Quarter | Deliverable | Format |
|---------|-------------|--------|
| Q1 | AF-MM v1.0 questionnaire and scoring guide | Markdown + spreadsheet |
| Q2 | AI Sovereignty Readiness Assessment v1.0 | Interactive web form or CLI tool |
| Q3 | T0 Asset Discovery Scanner v0.1 | Python script (cloud APIs + on-premises) |
| Q4 | Dependency Risk Mapper v0.1 | Python + network analysis libraries |
## Contributing
When adding new assessments:
1. Document the purpose, methodology, and limitations
2. Include scoring rubrics with clear criteria
3. Provide sample outputs and interpretation guidance
4. Version assessments and maintain changelogs
5. Test on at least two different organizational profiles before release
---
*Return to [Repository Index](../README.md)*


@@ -0,0 +1,204 @@
# Antifragile Risk Register Template
> *"Traditional risk registers count vulnerabilities. Antifragile risk registers map the kill chain, preserve optionality, and engineer convexity."*
This template replaces conventional risk management with an antifragile approach. It is designed to identify not just what can go wrong, but **how the organization benefits from addressing it**—and what structural improvement emerges from each risk realization.
---
## The Antifragile Risk Dimensions
Traditional risk registers track Probability and Impact. We add five antifragile dimensions:
| Dimension | Traditional Equivalent | Antifragile Question |
|-----------|----------------------|---------------------|
| **Kill Chain Position** | Asset location | "If this risk materializes, what is the shortest path to organizational failure?" |
| **Optionality Impact** | N/A | "Does this risk, if unaddressed, remove our ability to change direction?" |
| **Convexity** | Risk score | "Is the payoff asymmetric—small investment to prevent, catastrophic cost if realized?" |
| **Stress-to-Signal** | Lessons learned | "If this risk materializes, what structural improvement must result?" |
| **T0 Classification** | Criticality | "Is this existential (T0), major (T1), significant (T2), or standard (T3)?" |
---
## Risk Register Template
### Metadata
```
Organization: ________________________________
Assessment Date: ________________________________
Assessor: ________________________________
Review Cadence: Monthly / Quarterly
Next Review Date: ________________________________
```
### Risk Entries
| Field | Description | Example |
|-------|-------------|---------|
| **Risk ID** | Unique identifier (e.g., AF-2024-001) | AF-2024-001 |
| **Risk Name** | Short, specific description | Domain Admin Account Compromise |
| **Description** | Detailed scenario | A standing Domain Admin account is compromised via phishing, allowing an adversary to establish persistent access and exfiltrate data |
| **T0 / T1 / T2 / T3** | Tier classification | T0 |
| **Kill Chain Position** | Shortest path to failure | Direct: compromised admin → full domain takeover → all systems compromised |
| **Probability** | Likelihood (1-5) | 4 (High: admin accounts are high-value phishing targets) |
| **Impact** | Consequence (1-5) | 5 (Existential: total organizational compromise) |
| **Traditional Risk Score** | P × I | 20 (Critical) |
| **Optionality Impact** | Does this remove strategic options? | High: if AD is compromised, migration to cloud-native identity becomes impossible until recovery |
| **Convexity** | Asymmetric payoff? | Extreme: MFA deployment costs €0 (E3); domain compromise costs €500K+ |
| **Current Control** | What exists today? | Password policy; no MFA on admin accounts; no PIM |
| **Antifragile Move** | What structural change is required? | 1. Remove standing Domain Admin assignments 2. Deploy PIM (or manual JIT process) 3. Enforce MFA with hardware tokens 4. Deploy PAWs for all admin activity |
| **Owner** | Who is accountable? | CISO |
| **Target Date** | When must this be addressed? | 14 days |
| **Status** | Open / In Progress / Closed / Accepted / Transferred | Open |
| **Stress-to-Signal Mandate** | If this risk materializes, what must change? | Post-incident: all admin activity permanently moved to PAWs; quarterly access reviews institutionalized; admin accounts reduced to minimum viable count |
| **Verification Method** | How do we prove the fix works? | Monthly PIM audit; quarterly red team targeting admin credentials; Secure Score admin control metric |
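Entries in this shape stay machine-readable if captured as structured records rather than prose. A sketch using the example entry above — the class and field names are my own, and the numeric mapping of "High"/"Extreme" to 5 is an assumption consistent with the scoring section later in this document:

```python
# Hypothetical sketch: one register entry as a structured record, populated
# with the AF-2024-001 example from the table above.
from dataclasses import dataclass

@dataclass
class RiskEntry:
    risk_id: str
    name: str
    tier: str            # T0 / T1 / T2 / T3
    probability: int     # 1-5
    impact: int          # 1-5
    optionality: int     # 0-5 antifragile dimension (assumed: High -> 5)
    convexity: int       # 0-5 antifragile dimension (assumed: Extreme -> 5)

    @property
    def traditional_score(self) -> int:
        return self.probability * self.impact

    @property
    def antifragile_score(self) -> int:
        return self.traditional_score + self.optionality + self.convexity

entry = RiskEntry("AF-2024-001", "Domain Admin Account Compromise",
                  "T0", probability=4, impact=5, optionality=5, convexity=5)
print(entry.traditional_score, entry.antifragile_score)  # 20 30
```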
---
## Risk Categories (Antifragile Taxonomy)
### Category 1: Sovereignty Risks
Risks related to loss of control over data, intelligence, or infrastructure.
| Risk | Kill Chain | T0? | Antifragile Move |
|------|-----------|-----|-----------------|
| Proprietary data trains competitor AI models | Data → cloud AI → model improvement → competitive erosion | Yes | Deploy local or Azure OpenAI with data protection guarantees; classify AI data flows |
| Cloud vendor changes terms or pricing | Terms change → operational disruption → forced migration under duress | Yes | Document exit architecture; maintain data portability; dual-vendor readiness |
| Vendor discontinues critical service | Service ends → workflow collapse → emergency procurement | T1 | Maintain abstraction layers; escrow agreements; 90-day exit plans |
### Category 2: Identity Risks
Risks related to authentication, authorization, and account lifecycle.
| Risk | Kill Chain | T0? | Antifragile Move |
|------|-----------|-----|-----------------|
| Standing privileged account compromise | Phish → admin account → lateral movement → domain takeover | Yes | Eliminate standing privileges; deploy PIM or manual JIT; PAWs |
| Orphaned account resurrection | Former employee account not disabled → credential sale → unauthorized access | T1 | Automated orphan detection; quarterly access reviews; offboarding workflow tied to HR |
| MFA bypass via legacy authentication | Legacy protocol → password spray → account access without MFA | T1 | Block legacy auth tenant-wide; monitor for legacy auth attempts |
### Category 3: Resilience Risks
Risks related to the organization's ability to survive and recover from failure.
| Risk | Kill Chain | T0? | Antifragile Move |
|------|-----------|-----|-----------------|
| Backups unrecoverable | Ransomware → backup failure → data loss → business termination | Yes | Quarterly recovery drills; immutable backups; tested runbooks |
| Single point of failure in critical system | Component failure → cascade → service outage | T1 | Chaos engineering; redundancy; graceful degradation design |
| Untested disaster recovery plan | Incident → DR plan fails → extended outage → regulatory fine | T1 | Quarterly DR drills; documented and practiced runbooks; automated failover where possible |
### Category 4: Organizational Risks
Risks related to structure, culture, and process.
| Risk | Kill Chain | T0? | Antifragile Move |
|------|-----------|-----|-----------------|
| Security team as gatekeeper, not enabler | Security blocks releases → development bypasses controls → shadow IT proliferation | T1 | Embed security in teams; shared metrics; automated security gates in CI/CD |
| Knowledge concentrated in single individual | Key person departure → operational paralysis → recovery delay | T1 | Cross-training; runbook documentation; bus factor > 1 for all critical functions |
| Incident findings not converted to structure | Incident occurs → post-mortem written → no changes made → repeat incident | T1 | Blameless post-mortems with structural mandates; mean-time-to-structural-fix metric |
### Category 5: AI-Specific Risks
Risks introduced by artificial intelligence adoption.
| Risk | Kill Chain | T0? | Antifragile Move |
|------|-----------|-----|-----------------|
| Prompt injection on business-critical AI workflow | Malicious input → AI generates harmful output → business decision based on bad data | T1 | Input validation; output filtering; human-in-the-loop for critical decisions |
| AI model poisoning via training data | Adversarial training data → model behaviour change → security control failure | Yes | Data provenance tracking; training data validation; model integrity monitoring |
| Shadow AI usage leaks crown jewels | Employee uses public AI → proprietary data exfiltrated → competitive disadvantage | Yes | Sanctioned AI alternative (Azure OpenAI bridge); DLP monitoring; user education |
---
## The Kill Chain Risk Register
For the highest-priority risks, map the full kill chain:
```
RISK ID: ________________
RISK NAME: ________________
KILL CHAIN ANALYSIS:
Step 1 (Initial Access): ________________________________________________
Step 2 (Persistence): ________________________________________________
Step 3 (Privilege Escalation): ________________________________________________
Step 4 (Lateral Movement): ________________________________________________
Step 5 (Impact): ________________________________________________
SHORTEST PATH TO FAILURE: _____ steps
CRITICAL NODE (break the chain here): ___________________________________
ANTIFRAGILE MOVE AT CRITICAL NODE: _____________________________________
VERIFICATION: __________________________________________________________
```
---
## Scoring and Prioritization
### Traditional Score
```
Risk Score = Probability (1-5) × Impact (1-5)
```
| Score | Priority |
|-------|----------|
| 20-25 | P0 — Address within 14 days |
| 15-19 | P1 — Address within 30 days |
| 10-14 | P2 — Address within 90 days |
| 5-9 | P3 — Address within 180 days |
| 1-4 | P4 — Monitor and schedule |
### Antifragile Score (Supplemental)
```
Antifragile Priority = Traditional Score + Optionality Impact (0-5) + Convexity (0-5)
```
Risks that remove optionality or have extreme convexity receive elevated priority even if traditional probability is moderate.
| Antifragile Score | Interpretation |
|-------------------|----------------|
| 30+ | Existential + optionality-destroying. Address immediately. |
| 25-29 | High risk with structural implications. Address within 30 days. |
| 20-24 | Significant risk. Address within standard timeline. |
| < 20 | Manage through existing controls. |
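The two scoring formulas and their priority bands can be expressed as small functions — a sketch, with the band boundaries taken directly from the tables above:

```python
# Sketch of the scoring rules defined above; band labels abbreviate the tables.
def traditional_priority(probability: int, impact: int) -> tuple[int, str]:
    score = probability * impact
    if score >= 20:   band = "P0 - 14 days"
    elif score >= 15: band = "P1 - 30 days"
    elif score >= 10: band = "P2 - 90 days"
    elif score >= 5:  band = "P3 - 180 days"
    else:             band = "P4 - monitor"
    return score, band

def antifragile_priority(probability: int, impact: int,
                         optionality: int, convexity: int) -> tuple[int, str]:
    score = probability * impact + optionality + convexity
    if score >= 30:   band = "Existential + optionality-destroying; address immediately"
    elif score >= 25: band = "High risk with structural implications; 30 days"
    elif score >= 20: band = "Significant; standard timeline"
    else:             band = "Manage through existing controls"
    return score, band

# Example entry AF-2024-001: P=4, I=5, optionality=5, convexity=5
print(traditional_priority(4, 5))        # (20, 'P0 - 14 days')
print(antifragile_priority(4, 5, 5, 5))  # score 30 -> address immediately
```

Note how a moderate traditional score can still cross the 25-or-30 line once optionality and convexity are added — which is exactly the elevation the supplemental score is meant to force.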
---
## Review and Governance
### Monthly Tactical Review
- Open risks: status, blockers, escalation needs
- Closed risks: verification that controls are working
- New risks: emerging from incidents, changes, or threat intelligence
### Quarterly Strategic Review
- Risk trend: Are we reducing existential risks faster than new ones emerge?
- Kill chain coverage: Are there unprotected paths we have not mapped?
- Optionality audit: Have any changes reduced our strategic flexibility?
- Stress-to-signal conversion: How many incidents produced structural improvements?
### Annual Board Review
- Risk register summary: T0 risks, open vs. closed, trend
- Kill chain assurance: Independent validation of critical node protection
- Antifragile maturity: Mean time to structural fix, chaos experiment results, recovery drill outcomes
---
## Integration With Other Documents
| Document | Integration |
|----------|-------------|
| [T0 Asset Framework](../core/t0-asset-framework.md) | T0 classification determines which risks are existential |
| [Rapid Modernisation Plan](../playbooks/rapid-modernisation-plan.md) | Phase priorities map directly to P0/P1/P2 risk closure |
| [C-Suite Conversation Guide](../core/c-suite-conversation-guide.md) | Risk register produces the "cost of inaction" narrative |
| [Business Case Template](../playbooks/business-case-template.md) | Risk scores convert to expected financial loss |
---
*For the M365-specific risk register, see [M365 Project Risk Register](m365-project-risk-register.md).*


@@ -0,0 +1,170 @@
# M365 Project Risk Register
> *"Most M365 projects fail not because Teams does not work, but because governance was an afterthought and the tenant became an ungovernable monoculture."*
This risk register applies the antifragile risk methodology specifically to Microsoft 365 projects—greenfield deployments, tenant modernisations, migrations, and consolidations. It is designed for M365/Azure consultancies to identify, classify, and mitigate project-specific risks before they become tenant-wide liabilities.
---
## M365-Specific Risk Taxonomy
### Category 1: Identity and Access Risks
| Risk ID | Risk Name | Description | T0/T1/T2 | Kill Chain | Antifragile Move | Owner |
|---------|-----------|-------------|----------|-----------|-----------------|-------|
| M365-001 | Excessive Global Admins | More than 3-5 Global Admins with standing access | T0 | Compromise any admin → full tenant control → data exfiltration / deletion | Reduce to minimum; deploy PIM; use delegated roles | Identity Team |
| M365-002 | No MFA on Admin Accounts | Admin accounts lack multi-factor authentication | T0 | Phish password → direct tenant access → no second factor to stop | Enforce MFA for all admins; hardware tokens for break-glass | Security |
| M365-003 | Legacy Authentication Enabled | Legacy auth protocols allow MFA bypass | T1 | Password spray via IMAP/POP3/SMTP → account access without MFA | Block legacy auth tenant-wide; monitor for attempts | Security |
| M365-004 | Stale Guest Accounts | Former partners/vendors retain guest access indefinitely | T1 | Stale guest → credential compromise → Teams/SharePoint access | Quarterly guest access review; time-bounded invitations | Collaboration Team |
| M365-005 | Unmanaged OAuth Consents | Users granted permissions to unauthorized applications | T1 | Malicious app → mailbox access / data exfiltration / phishing | Disable user consent; admin consent workflow; quarterly audit | Security |
| M365-006 | Shared Mailboxes with Login | Shared mailboxes configured with user passwords and sign-in enabled | T2 | Shared credential compromise → email access → BEC / data theft | Disable sign-in on shared mailboxes; convert to proper delegation | Exchange Team |
| M365-007 | No Conditional Access (requires Entra ID P1) | Missing location, device, or risk-based access controls | T1 | Compromised credentials usable from any device, any location | Deploy conditional access: MFA, device compliance, location, risk | Identity Team |
| M365-008 | Hybrid Identity Stuck | AAD Connect configured with no plan to migrate to cloud-native | T1 | AAD Connect compromise → cloud identity manipulation → tenant takeover | Document cloud-native migration path; secure AAD Connect server | Identity Team |
### Category 2: Data Governance Risks
| Risk ID | Risk Name | Description | T0/T1/T2 | Kill Chain | Antifragile Move | Owner |
|---------|-----------|-------------|----------|-----------|-----------------|-------|
| M365-009 | No Data Classification | Documents and emails stored without sensitivity labels | T1 | Proprietary/confidential data mixed with public data → uncontrolled sharing → leakage | Deploy sensitivity labels (Purview) or manual classification guidance | Compliance |
| M365-010 | Open External Sharing | SharePoint/OneDrive default allows anyone-links or external sharing | T1 | Accidental or malicious public link → data exposure → regulatory fine / reputational damage | Default sharing: internal only; anyone-links disabled; per-site justification | SharePoint Team |
| M365-011 | No Retention Policy | No defined retention for email, Teams, or files; data accumulates indefinitely | T2 | Excessive data → discovery cost → compliance failure → inability to respond to legal hold | Deploy retention policies for all workloads; legal hold procedures | Compliance |
| M365-012 | Teams Channel Sprawl | Uncontrolled team creation; stale teams with sensitive data | T2 | Stale team with external access → forgotten but accessible → data leakage | Governed team creation; expiration policies; access reviews | Collaboration Team |
| M365-013 | OneDrive as Shadow IT | Users store business-critical data in personal OneDrive without backup | T1 | User departure / account deletion → data loss; no organizational recovery | Migrate business data to SharePoint; backup strategy; user education | SharePoint Team |
| M365-014 | Copilot Without Governance | Microsoft 365 Copilot deployed without data governance baseline | T0 | Copilot surfaces sensitive data to unauthorized users → internal data breach | Deploy sensitivity labels BEFORE Copilot; conditional access; user training | Security / Compliance |
| M365-015 | eDiscovery Unprepared | No eDiscovery processes, legal hold capability, or retention for litigation | T2 | Litigation → inability to produce documents → adverse inference / sanctions | eDiscovery training; retention hold procedures; Purview eDiscovery licensing | Legal / Compliance |
### Category 3: Security and Threat Risks
| Risk ID | Risk Name | Description | T0/T1/T2 | Kill Chain | Antifragile Move | Owner |
|---------|-----------|-------------|----------|-----------|-----------------|-------|
| M365-016 | Business Email Compromise (BEC) | Executive mailbox compromised; fraudulent payment instructions sent | T1 | Phish executive → mailbox control → invoice fraud / wire transfer | Impersonation protection; mailbox auditing; MFA; financial process verification | Security |
| M365-017 | EOP Misconfiguration | Basic Exchange Online Protection not tuned for client's threat profile | T1 | Phishing email reaches inbox → user compromise → lateral movement | Tune anti-phishing, anti-malware, anti-spam; impersonation protection | Security |
| M365-018 | No Audit Logging | Unified Audit Log disabled or unmonitored | T1 | Incident occurs → no forensic evidence → cannot determine scope or contain | Enable UAL immediately; forward to SIEM; 90-day minimum retention | Security |
| M365-019 | Device Unmanaged | Corporate devices accessing M365 without MDM or compliance policy | T1 | Compromised personal device → M365 access → data exfiltration | Intune enrollment; conditional access requiring compliance | Endpoint Team |
| M365-020 | No Backup Beyond Native | Reliance on recycle bin and soft delete as "backup" | T1 | Ransomware / malicious admin / sync error → data loss → no recovery | Third-party immutable backup; quarterly recovery testing | Backup Team |
### Category 4: AI and Emerging Technology Risks
| Risk ID | Risk Name | Description | T0/T1/T2 | Kill Chain | Antifragile Move | Owner |
|---------|-----------|-------------|----------|-----------|-----------------|-------|
| M365-021 | Shadow AI via M365 Apps | Employees paste proprietary data into Copilot, Bing, or third-party AI through browser | T0 | Proprietary data → public AI model → competitive intelligence loss | Deploy Azure OpenAI bridge; DLP policies blocking AI uploads; user education | Security |
| M365-022 | Copilot Data Overexposure | Copilot synthesizes and surfaces data the user should not have access to | T1 | Overpermissioned user → Copilot reveals sensitive synthesis → internal breach | Zero-trust permissions review; sensitivity labels; just-in-time access | Security |
| M365-023 | AI-Generated Misinformation | Users make business decisions based on unverified AI-generated content | T2 | AI hallucination → bad decision → financial loss / compliance failure | Human-in-the-loop for critical decisions; source attribution requirements; user training | Compliance |
| M365-024 | No AI Governance Policy | Organization has no policy for approved AI tools, data handling, or vendor evaluation | T1 | Uncontrolled AI adoption → data leakage → regulatory / legal exposure | AI governance framework; approved tool list; data classification for AI inputs | Security / Legal |
### Category 5: Project and Organizational Risks
| Risk ID | Risk Name | Description | T0/T1/T2 | Kill Chain | Antifragile Move | Owner |
|---------|-----------|-------------|----------|-----------|-----------------|-------|
| M365-025 | Tenant as Monoculture | All data, identity, and collaboration in one tenant with no exit architecture | T0 | Tenant compromise / lockout / vendor change → total organizational paralysis | Domain ownership by client; data portability architecture; documented tenant exit | Architecture |
| M365-026 | Scope Creep Without Governance | Workloads deployed incrementally without security review | T2 | New app/service → unmapped risk → incident | Governance gate before new workload; security review checklist | Project Manager |
| M365-027 | Insufficient Admin Training | Client team lacks skills to operate and secure the tenant post-handover | T2 | Misconfiguration → vulnerability → incident | Structured training program; runbook documentation; knowledge transfer sessions | Training |
| M365-028 | Power Platform Shadow IT | Citizen developers create apps and flows with ungoverned data access | T1 | Unmanaged flow → external data sharing / credential exposure → breach | DLP policies; environment governance; citizen developer training | Power Platform Team |
| M365-029 | Migration Data Loss | Legacy data lost or corrupted during migration to M365 | T1 | Corrupted migration → missing records → compliance / operational failure | Pre-migration backup; validation sampling; rollback plan | Migration Team |
| M365-030 | Vendor Lock-in via Add-ons | Heavy reliance on third-party M365 add-ins that create dependency | T2 | Add-on vendor discontinues / changes terms → workflow collapse | Evaluate add-ons for portability; maintain native fallback; contractual exit clauses | Procurement |
---
## Risk Scoring for M365 Projects
### Probability Scale
| Score | Definition | M365 Example |
|-------|-----------|--------------|
| 1 | Rare (< 1% annually) | Total Azure region failure |
| 2 | Unlikely (1-10%) | Major zero-day in Exchange Online |
| 3 | Possible (10-50%) | Successful phishing campaign against users |
| 4 | Likely (50-90%) | Stale guest account remains accessible |
| 5 | Almost certain (> 90%) | Shadow AI usage if no sanctioned alternative |
### Impact Scale
| Score | Definition | M365 Example |
|-------|-----------|--------------|
| 1 | Negligible | Minor inconvenience; no data loss |
| 2 | Minor | Single user/service affected; recoverable in hours |
| 3 | Moderate | Departmental impact; recoverable in days; potential compliance notice |
| 4 | Major | Organizational impact; recoverable in weeks; regulatory fine likely |
| 5 | Catastrophic | Existential threat; business termination possible; criminal liability |
### M365-Specific Convexity Assessment
| Convexity | Definition | M365 Example |
|-----------|-----------|--------------|
| **Extreme** | €0 control prevents €500K+ loss | Enabling MFA (free in E3) prevents total tenant compromise |
| **High** | Small labor investment prevents major incident | Quarterly guest access review prevents data breach via stale account |
| **Moderate** | Moderate investment prevents significant loss | Third-party backup prevents data loss from ransomware |
| **Low** | Investment comparable to potential loss | Advanced threat protection add-on vs. basic EOP |
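The scales above can be combined into a single ranking number. The sketch below is one illustrative way to do it, assuming a base score of probability × impact boosted by a convexity multiplier so that cheap, high-leverage controls rise to the top; the multiplier values and example risks are assumptions for demonstration, not part of the register methodology.

```python
# Illustrative sketch: combine the probability and impact scales above
# with a convexity multiplier. Multiplier values are assumptions.
CONVEXITY_WEIGHT = {"extreme": 4.0, "high": 2.0, "moderate": 1.5, "low": 1.0}

def risk_priority(probability: int, impact: int, convexity: str) -> float:
    """Base score is probability x impact (1-25); convex mitigations
    are boosted so cheap, high-leverage controls sort first."""
    if not (1 <= probability <= 5 and 1 <= impact <= 5):
        raise ValueError("probability and impact must be 1-5")
    return probability * impact * CONVEXITY_WEIGHT[convexity]

# Placeholder assessments, not real register scores.
risks = [
    ("M365-002 No MFA", 4, 5, "extreme"),
    ("M365-020 No backup", 3, 4, "moderate"),
    ("M365-003 Stale guests", 4, 3, "high"),
]
for name, p, i, c in sorted(risks, key=lambda r: -risk_priority(r[1], r[2], r[3])):
    print(f"{risk_priority(p, i, c):6.1f}  {name}")
```

The multiplier deliberately overweights convexity: a free control that prevents a €500K+ loss should outrank a moderately likely risk with an expensive mitigation.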
---
## Project Phase Risk Gates
### Greenfield Deployment Gates
| Phase | Gate | Risk Closure Requirement |
|-------|------|-------------------------|
| **Architecture** | Go/No-Go before provisioning | M365-025 (tenant monoculture) assessed and mitigated; M365-030 (add-on lock-in) evaluated |
| **Foundation** | Go/No-Go before user onboarding | M365-001 (excessive admins), M365-002 (no MFA), M365-018 (no audit) closed |
| **Workload Rollout** | Go/No-Go per workload | M365-009 (no classification), M365-010 (open sharing), M365-028 (Power Platform) addressed |
| **Go-Live** | Go/No-Go before production | M365-016 (BEC), M365-017 (EOP), M365-020 (no backup) mitigated; M365-027 (training) completed |
| **30-Day Post** | Review | M365-021 (shadow AI) inventoried; M365-024 (AI governance) drafted |
### Modernisation Gates
| Phase | Gate | Risk Closure Requirement |
|-------|------|-------------------------|
| **Audit** | Complete before changes | All 30 risks assessed; T0 and T1 risks prioritized |
| **Kill Chain Closure** | Day 30 checkpoint | All T0 risks closed or accepted with board sign-off |
| **Governance Deployment** | Day 60 checkpoint | All T1 identity and data risks closed |
| **Sovereignty** | Day 90 checkpoint | M365-021 (shadow AI) mitigated via sanctioned alternative; M365-020 (backup) tested |
| **Antifragility** | Day 180 checkpoint | Automated monitoring for M365-003, M365-005, M365-010; quarterly review cadence established |
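The automated-monitoring checkpoint above can start very small. A minimal sketch of the stale-guest check (M365-003) follows; in production the input would come from Entra ID sign-in logs (for example via Microsoft Graph), and the field names, threshold, and sample accounts here are illustrative assumptions.

```python
# Sketch of the stale-guest check behind the quarterly review cadence.
# Field names and the 90-day threshold are illustrative assumptions.
from datetime import date, timedelta

STALE_AFTER_DAYS = 90  # assumption: one review cycle with no sign-in

def stale_guests(guests: list[dict], today: date) -> list[str]:
    """Return UPNs of guest accounts whose last sign-in is older than
    the threshold, or missing entirely."""
    cutoff = today - timedelta(days=STALE_AFTER_DAYS)
    return [
        g["upn"] for g in guests
        if g.get("last_sign_in") is None or g["last_sign_in"] < cutoff
    ]

guests = [
    {"upn": "partner@example.com", "last_sign_in": date(2026, 4, 1)},
    {"upn": "old-vendor@example.com", "last_sign_in": date(2025, 6, 1)},
    {"upn": "never-used@example.com", "last_sign_in": None},
]
print(stale_guests(guests, today=date(2026, 5, 9)))
```

Accounts with no recorded sign-in at all are flagged deliberately: an invitation that was never redeemed is still an open door.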
---
## The M365 Risk Dashboard (For Steering Committee)
```
M365 PROJECT RISK DASHBOARD — [Client] — [Date]
T0 RISKS (Existential)
├─ Open: [X] ├─ In Progress: [X] └─ Closed: [X]
├─ [Risk ID] [Name] — Owner: [Name] — Target: [Date]
└─ [Risk ID] [Name] — Owner: [Name] — Target: [Date]
T1 RISKS (Major)
├─ Open: [X] ├─ In Progress: [X] └─ Closed: [X]
├─ [Risk ID] [Name] — Owner: [Name] — Target: [Date]
└─ [Risk ID] [Name] — Owner: [Name] — Target: [Date]
IDENTITY & ACCESS [████░░░░░░] [X]% mitigated
DATA GOVERNANCE [██████░░░░] [X]% mitigated
SECURITY & THREATS [█████░░░░░] [X]% mitigated
AI & EMERGING TECH [███░░░░░░░] [X]% mitigated
PROJECT & ORGANIZATIONAL [███████░░░] [X]% mitigated
TOP 3 RISKS REQUIRING ESCALATION
1. [Risk ID] — [Reason for escalation]
2. [Risk ID] — [Reason for escalation]
3. [Risk ID] — [Reason for escalation]
RECOMMENDATION: [Proceed / Pause / Escalate]
```
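The per-category bars in the template can be generated directly from register counts rather than maintained by hand. A minimal sketch, assuming placeholder counts and the category names shown above:

```python
# Sketch: render the dashboard's mitigation bars from register counts.
# Counts are placeholder data; category names match the template above.
def bar(mitigated: int, total: int, width: int = 10) -> str:
    """Ten-slot progress bar plus percentage, as in the dashboard."""
    pct = 0 if total == 0 else round(100 * mitigated / total)
    filled = round(width * pct / 100)
    return "[" + "█" * filled + "░" * (width - filled) + f"] {pct}% mitigated"

categories = {
    "IDENTITY & ACCESS": (4, 10),
    "DATA GOVERNANCE": (6, 10),
    "AI & EMERGING TECH": (1, 4),
}
for name, (done, total) in categories.items():
    print(f"{name:<25} {bar(done, total)}")
```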
---
## Integration With Project Deliverables
| Deliverable | Risk Register Integration |
|------------|--------------------------|
| **Project charter** | Include T0 risk identification as success criterion |
| **Architecture document** | Map each design decision to risk mitigation |
| **Configuration baselines** | Reference risk IDs in change justification |
| **Test plan** | Include recovery drills for M365-020; penetration testing for M365-016 |
| **Training plan** | Address M365-027; include AI governance for M365-024 |
| **Handover document** | Transfer risk ownership to client team with review cadence |
---
*For the general antifragile risk register methodology, see [Antifragile Risk Register](antifragile-risk-register.md).*
*For the M365 antifragile project playbook, see [M365 Antifragile Project](../playbooks/m365-antifragile-project.md).*
# AI for Operations and Security: The Inevitable Imperative
> *"We are not here to sell you AI. We are here to tell you that your adversaries are already using it—and that operational AI is no longer optional for defenders."*
This document clarifies the antifragile position on artificial intelligence adoption: **business-facing AI pilots are optional and should be evaluated on their merits; AI for security, operations, and resilience is becoming inevitable.** The two must not be confused.
---
## The Distinction That Matters
Most of your clients are currently running AI pilots for **business tools**: chatbots for customer service, content generation for marketing, summarization for legal, coding assistants for engineering. These are **revenue-adjacent experiments**. They should be evaluated like any other business investment—ROI, risk, strategic fit.
**This document is not about those pilots.**
This document is about **operational AI**: the use of artificial intelligence to defend systems, detect anomalies, prioritize vulnerabilities, accelerate incident response, and maintain operational continuity. This category is not an experiment. It is becoming **table stakes** for organizational survival.
| Category | Examples | Strategic Posture |
|----------|----------|-------------------|
| **Business AI** | Customer chatbots, marketing content, sales outreach, HR screening | Optional. Evaluate per pilot. Sovereign if proprietary. |
| **Operational AI** | Log anomaly detection, vulnerability prioritization, threat hunting, code security review, incident triage | Inevitable. The question is not whether, but who owns the models and the data. |
| **Strategic AI** | Competitive intelligence, scenario modeling, board decision support | High-value, high-risk. Must be sovereign. |
| **TVM / Vulnerability Management** | Vulnerability prioritization, exploit prediction, remediation generation, attack surface mapping | Inevitable. AI-powered adversaries scan faster than human teams. AI-assisted TVM is the only asymmetric response. |
---
## Why Operational AI Is Inevitable
### 1. The Attackers Are Already Using It
Adversaries—criminal and state-sponsored—are deploying AI to:
- **Generate polymorphic malware** that evades signature-based detection
- **Craft spear-phishing campaigns** at scale, personalized by scraped social media and leaked databases
- **Automate reconnaissance** of target infrastructure, identifying the weakest paths in hours rather than weeks
- **Bypass CAPTCHAs, behavioural biometrics, and traditional fraud controls**
A defender operating without AI assistance is now fighting an **asymmetric battle**: human analysts versus machine-scale adversaries. The math does not favor the humans.
**The executive framing**:
> *"Your security team is not slower than the adversary. Your security team is smaller. AI is how we scale human judgment without scaling human headcount."*
### 2. The Volume Problem Is Unsolvable Without Machine Assistance
Modern enterprises generate:
- Billions of log events per day
- Hundreds of thousands of endpoint telemetry signals
- Tens of thousands of vulnerability findings
- Thousands of identity access events
- Hundreds of third-party risk indicators
No human team can review this volume. Current approaches rely on **rules and thresholds**—which adversaries study and evade. AI-driven detection looks for **behavioural anomalies** that rules cannot express.
**The executive framing**:
> *"We are not buying AI to replace your analysts. We are buying AI to ensure your analysts see the one signal that matters instead of drowning in a thousand false alarms."*
### 3. The Mythos Lesson: Technical Debt at Scale
The Anthropic Mythos incident demonstrated that even sophisticated AI providers carry **technical debt that can be weaponized**. The response to Mythos was not to abandon AI—it was to accelerate **defensive AI capabilities** that can scan, detect, and remediate faster than human teams.
Your clients' current vulnerability backlogs span months or years. A small team with reasonable AI tooling can:
- **Scan and prioritize** vulnerabilities across the entire estate in hours, not weeks
- **Identify configuration drift** before it becomes an incident
- **Generate and validate** remediation code for common misconfigurations
- **Simulate adversarial paths** through the environment to find the real kill chain
This is not science fiction. It is **defensive AI pilot territory**—and it is the fastest way to address decades of accumulated technical debt.
**The executive framing**:
> *"We cannot clear twenty years of technical debt with human labor alone. But a small team with defensive AI can do the work of dozens—finding, prioritizing, and proposing fixes for the vulnerabilities that actually matter."*
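The "scan and prioritize" capability reduces, at its core, to a scoring problem. The sketch below shows the shape of that scoring step only; the weights and field names are assumptions, and a real pipeline would draw exploit likelihood from a model or an EPSS feed and asset criticality from the CMDB.

```python
# Sketch of the prioritization core: rank findings by exploit
# likelihood x asset criticality x exposure. Weights, field names,
# and the sample findings are illustrative assumptions.
def priority(finding: dict) -> float:
    exposure = 1.5 if finding["internet_facing"] else 1.0
    return finding["exploit_likelihood"] * finding["asset_criticality"] * exposure

findings = [
    {"id": "CVE-A", "exploit_likelihood": 0.9, "asset_criticality": 5, "internet_facing": True},
    {"id": "CVE-B", "exploit_likelihood": 0.2, "asset_criticality": 5, "internet_facing": False},
    {"id": "CVE-C", "exploit_likelihood": 0.7, "asset_criticality": 2, "internet_facing": True},
]
for f in sorted(findings, key=priority, reverse=True):
    print(f"{priority(f):5.2f}  {f['id']}")
```

Note the effect: a moderately severe finding on an exposed, critical asset outranks a "critical" CVE nobody can reach. That inversion is the entire value of context-aware prioritization.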
### 4. Regulatory Pressure Is Coming
Regulators are beginning to mandate **continuous monitoring and rapid remediation**:
- **DORA** requires ICT risk management that can adapt to evolving threats
- **NIS2** demands vulnerability handling with demonstrable timelines
- **Banking regulators** increasingly expect AI-assisted fraud detection and anomaly monitoring
- **Cyber insurers** are pricing premiums based on mean-time-to-remediate; AI-assisted prioritization directly reduces this metric
Organizations that cannot demonstrate AI-assisted security operations will face **higher premiums, stricter scrutiny, and competitive disadvantage** in regulated procurement.
---
## The Sovereignty Requirement for Operational AI
Here is where the antifragile posture becomes non-negotiable:
**Operational AI must be sovereign.**
When you use cloud AI for security operations, you are sending your **vulnerability data, your configuration details, your incident artifacts, and your network topology** to a third party. That third party is:
- Training its models on your defensive posture
- Potentially subject to jurisdictional access (e.g., CLOUD Act)
- Able to change terms, pricing, or availability without your consent
- A target for adversaries who understand that compromising the AI provider gives them insight into thousands of customers
**The rule**: Business AI pilots can be evaluated case-by-case. Operational AI must run on infrastructure you control, with data that never leaves your perimeter.
| AI Use Case | Can It Run in the Cloud? | Must It Be Local? |
|------------|-------------------------|-------------------|
| Marketing content generation | Yes (if no proprietary data) | No |
| Public-facing chatbot | Yes | No |
| Internal code review | **No** | **Yes** |
| Vulnerability scanning and prioritization | **No** | **Yes** |
| Security log anomaly detection | **No** | **Yes** |
| Incident response triage | **No** | **Yes** |
| Threat intelligence analysis | **No** | **Yes** |
| OT anomaly detection (power/telco) | **Absolutely no** | **Absolutely yes** |
---
## The "Not AI for Everything" Position
When clients ask why you are not pushing AI across every department, your answer is:
> *"AI is a tool, not a strategy. We support business AI pilots where they make sense and where data can be protected. But we are not here to automate your culture. We are here to ensure that the systems protecting your business can keep pace with the adversaries attacking it. Operational AI is not an experiment. It is infrastructure."*
This position:
- **Builds credibility**: You are not an AI hype merchant. You are a security architect.
- **Preserves trust**: Clients do not feel pressured to adopt AI in areas where it adds no value.
- **Concentrates investment**: Resources flow to operational AI where the return is survival, not convenience.
### The Contrast Statement
Use this to differentiate from AI consultants who push indiscriminate adoption:
> *"Most AI consultants are here to increase your consumption of cloud APIs. We are here to ensure your defensive capabilities match your adversaries' offensive capabilities. If a business AI pilot does not protect revenue, reduce risk, or create a defensible moat, it is not our priority. If a defensive AI pilot reduces your vulnerability backlog from months to weeks, it is not optional."*
---
## Implementation Posture: Operational AI
### Immediate (0-30 days): Assessment and Pilot Scope
- Inventory current AI usage: business vs. operational vs. shadow
- Identify the highest-volume, lowest-signal security workflow (usually vulnerability management or log review)
- Select one defensive AI pilot with clear success metrics
- **For vulnerability management**: Launch AI-assisted TVM baseline sprint. See [AI-Assisted TVM Blueprint](../playbooks/ai-assisted-tvm.md).
- Establish the sovereignty boundary: no security data leaves the perimeter
### Short-term (30-90 days): Defensive AI Pilot
- Deploy local inference for one security use case:
- **Vulnerability prioritization**: AI-assisted ranking of scan results by exploitability, asset criticality, and business context. See [AI-Assisted TVM Blueprint](../playbooks/ai-assisted-tvm.md) for the full 30-60-90 day program.
- **Log anomaly detection**: Baseline normal behaviour; alert on deviations
- **Code security review**: Local model trained on your codebase finds patterns human reviewers miss
- Measure: false positive rate, analyst time saved, mean time to prioritize
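"Baseline normal behaviour; alert on deviations" can be illustrated with the simplest possible detector: a z-score against a trailing window of event counts. Real deployments use far richer models; this sketch, with invented sample data, shows only the shape of the idea.

```python
# Minimal illustration of baseline-and-deviate detection: flag points
# more than `threshold` standard deviations from a trailing window.
# Window size, threshold, and the sample series are assumptions.
from statistics import mean, stdev

def anomalies(counts: list[int], window: int = 5, threshold: float = 3.0) -> list[int]:
    """Return indices whose count deviates > threshold sigma from
    the trailing window's baseline."""
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hourly failed-logon counts; the spike at index 7 is the anomaly.
series = [12, 10, 11, 13, 12, 11, 12, 95, 12, 11]
print(anomalies(series))
```

The same structure scales from ten data points to billions; what changes in production is the baseline model, not the alert-on-deviation logic.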
### Medium-term (90-180 days): Expansion
- Integrate defensive AI into SOC workflow: triage, enrichment, initial response recommendations
- Deploy OT anomaly detection for critical infrastructure clients (power, telco)
- Build internal capability to fine-tune models on proprietary data
### Long-term (180+ days): Autonomous Operations
- Closed-loop remediation: AI identifies, proposes, and (with human approval) applies fixes for common misconfigurations
- Predictive maintenance: AI forecasts system failures before they impact operations
- Continuous red-teaming: AI agents perpetually probe defenses and report findings
---
## Talking Points for the Board
| Concern | Response |
|---------|----------|
| "We are already running AI pilots in marketing and sales." | "Those are business experiments. This is defensive infrastructure. The question is not whether to adopt AI. It is whether your defenders can keep pace with AI-powered attackers." |
| "This sounds like another expensive technology project." | "Defensive AI runs on the same local infrastructure we are already proposing for sovereignty. The incremental cost is minimal. The incremental protection is disproportionate." |
| "Our security team is skeptical of AI hype." | "Good. Skepticism is warranted for business AI. It is not warranted for operational AI when adversaries are already using it against you. We will prove value with a bounded pilot before any expansion." |
| "We do not have the expertise to run AI models." | "Modern tooling has reduced the barrier dramatically. We are not training foundation models. We are deploying quantized open models on your hardware with your data. This is achievable in weeks, not years." |
| "Will this replace our security team?" | "No. It will make them effective. Your analysts currently spend 80% of their time on noise. AI reduces the noise so they can focus on judgment, investigation, and structural improvement." |
---
## Integration With Existing Frameworks
### The Rapid Modernisation Plan
Operational AI appears in:
- **Phase 1 (Hygiene)**: AI-assisted identity and asset discovery
- **Phase 2 (Control)**: AI-assisted vulnerability prioritization and configuration validation
- **Phase 3 (Sovereignty)**: Local AI infrastructure deployment; defensive AI pilot
- **Phase 4 (Antifragility)**: Continuous AI-assisted red teaming; autonomous remediation loops
### The Zero-Budget Hardening Approach
Defensive AI can run on:
- Existing server hardware (quantized models require modest resources)
- Retired workstations with GPU
- Sovereign cloud instances (for clients without on-premises capacity)
The incremental cost is primarily **labor for configuration**, not hardware or licensing.
---
*For the AI sovereignty strategic argument, see [AI Sovereignty Framework](ai-sovereignty-framework.md).*
*For the business case including defensive AI ROI, see [Business Case Template](../playbooks/business-case-template.md).*
# AI Sovereignty Framework
> *"The cloud model is smarter at everything, which makes it dumb at your specific thing."*
## For the Executive Reader
Your organization is currently engaged in a **massive, unpaid research project for cloud AI providers**. Every proprietary document, every strategic query, every operational workflow sent to a third-party AI becomes training data for models that will eventually be sold to your competitors.
AI sovereignty is not an IT project. It is a **strategic asset protection mandate**. By running artificial intelligence on infrastructure you control, you:
- **Stop funding your competitors** through proprietary data leakage
- **Eliminate vendor lock-in** for your organization's cognitive infrastructure
- **Reduce long-term costs** from unpredictable per-query pricing to fixed capital
- **Demonstrate regulatory maturity** on data residency and third-party risk
**The economic argument**: A mid-sized organization spending €5,000-€15,000 monthly on cloud AI APIs will break even on local infrastructure within 12-18 months. After break-even, the cost is a fraction of cloud pricing—and the data remains exclusively yours.
**The competitive argument**: A fine-tuned local model trained on your proprietary data will outperform a general cloud model on your specific workflows. The cloud model improves at everyone's tasks. Your local model improves at only your tasks. That is sustainable differentiation.
*For board conversation scripts, see [C-Suite Conversation Guide](c-suite-conversation-guide.md).*
*For financial justification, see [Business Case Template](../playbooks/business-case-template.md).*
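The break-even claim above is simple arithmetic. A back-of-envelope sketch follows; all figures are illustrative assumptions to be replaced with real quotes in the business case.

```python
# Back-of-envelope version of the economic argument above.
# All figures are illustrative assumptions; plug in real quotes.
def breakeven_months(capex: float, local_opex: float, cloud_monthly: float) -> float:
    """Months until cumulative local cost drops below cloud spend."""
    monthly_saving = cloud_monthly - local_opex
    if monthly_saving <= 0:
        raise ValueError("local opex must undercut cloud spend")
    return capex / monthly_saving

capex = 120_000         # assumption: GPU server plus setup labor
local_opex = 1_500      # assumption: power, maintenance, amortized admin
cloud_monthly = 10_000  # mid-point of the EUR 5,000-15,000 range above
print(f"break-even after {breakeven_months(capex, local_opex, cloud_monthly):.1f} months")
```

With these assumed inputs the model lands at roughly 14 months, consistent with the 12-18 month range stated above; every month thereafter, the saving compounds while the data stays inside.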
---
## For the Practitioner
This framework provides the strategic, technical, and ethical arguments for treating artificial intelligence as **sovereign infrastructure** rather than rented utility. It is designed for consultants and architects who must persuade boards, CISOs, and engineering leaders to invest in locally controlled intelligence.
---
## Executive Summary
Most organizations are currently engaged in a **massive, unpaid R&D project for cloud AI providers**. Every proprietary prompt, every internal document fed into a third-party model, every workflow built on an external API is a transfer of intellectual capital to an entity whose interests are not aligned with the organization's survival.
AI sovereignty reverses this extraction. It restores the boundary of trust. It converts intelligence from a rented commodity into an owned asset.
---
## The Five Strategic Arguments
### 1. The Data Sovereignty Argument (The Trojan Horse)
**The Problem**
When proprietary data is sent to cloud AI providers, it does not merely get "processed." It becomes part of a feedback loop that improves general models—models that will eventually be sold to competitors, used to commoditize the client's industry, or deployed to replicate the client's unique edge.
Every query is a lesson. Every document is a training sample. The client is not a customer; they are an **uncompensated research contributor**.
**The Pitch**
> *"By sending our internal data to the cloud, we are effectively training the very system that will eventually commoditize our industry and replace our proprietary edge. We are not just 'using' AI; we are contributing our secrets to the public model."*
**The Antifragile Move**
Running local models creates a **closed intellectual loop**. The organization's data remains an asset, not a training set for a competitor. It creates a moat that cloud giants cannot cross because they never receive the raw material to replicate it.
**Key Points for the Room**
- Cloud AI providers are incentivized to aggregate and generalize. You are incentivized to differentiate and protect.
- What you consider proprietary operational data, they consider valuable training signal.
- A local model trained on your data becomes *better* at your workflows over time. A cloud model becomes *better at everyone's workflows*, diluting your advantage.
---
### 2. The Operational Resilience Argument (The "Pulling the Plug" Scenario)
**The Problem**
Cloud AI is a dependency with no service-level guarantee of continuity. Terms of service change. Pricing changes. API versions are deprecated. Geopolitical events disable access. "Safety" filters are updated to censor specific industries or use cases. The organization's core operations are, in effect, an application running on someone else's brain.
**The Pitch**
> *"What happens to our core operations if the cloud-AI provider changes its Terms of Service, raises prices by 1000%, or suffers a geopolitical blackout that disables their API? Our entire business model should not be an app running on someone else's brain."*
**The Antifragile Move**
Local models are **sovereign infrastructure**. They operate when:
- The internet is degraded or unavailable
- The provider is down, acquired, or embargoed
- The "safety" filters have been updated to block your use case
- Pricing has been restructured beyond recognition
This is the ultimate insurance policy—not against data loss, but against **capability loss**.
**Key Points for the Room**
- Vendor lock-in for compute is expensive. Vendor lock-in for *intelligence* is existential.
- Recovery from a cloud exit is measured in quarters if workflows are deeply integrated. Failover to a local model is measured in minutes.
- Resilience is not about having a backup. It is about having no single point of failure in your cognitive pipeline.
---
### 3. The Intellectual Property Argument (The Asset Protection)
**The Problem**
When an organization uses cloud AI, it owns neither the weights, the architecture, nor the deterministic behaviour of the system. It cannot audit the reasoning. It cannot guarantee that the same prompt will produce the same result tomorrow. It cannot prevent its proprietary workflows from being absorbed into a general model.
**The Pitch**
> *"When we run models locally, we own the weights, the architecture, and the outputs. We are not tenants of an intelligence; we are the owners of it. We can tune it for our specific tasks, not the generic tasks the cloud provider cares about."*
**The Antifragile Move**
The organization moves from being a **consumer of AI** to a **manufacturer of its own intelligence**.
This is the difference between:
- A farm that buys seeds every year (cloud AI)
- A farm that saves, selects, and breeds its own (sovereign AI)
Over time, the sovereign farm develops cultivars perfectly adapted to its soil. The seed-buying farm is at the mercy of the seed catalog.
**Key Points for the Room**
- Fine-tuned local models on proprietary data outperform general models on domain-specific tasks.
- You can version, audit, and legally defend a local model. You cannot audit a cloud black box.
- The outputs of your local model are your intellectual property, unencumbered by third-party terms.
---
### 4. Overcoming the Complexity Objection
**The Objection**
> *"But the cloud models are smarter. And local deployment is complex."*
**The Counter**
Cloud models are smarter at *everything*, which makes them *dumb* at your specific thing. A general-purpose model optimized for broad benchmarks is not optimized for your internal processes, your data schemas, your regulatory constraints, or your proprietary logic.
By training or fine-tuning a smaller, local model on specific, proprietary data, the organization can achieve:
| Metric | Cloud General Model | Local Fine-Tuned Model |
|--------|--------------------|------------------------|
| Performance on generic tasks | 95% | 70% |
| Performance on proprietary tasks | 60% | 90% |
| Cost at scale | Linear / unpredictable | Sub-linear / fixed |
| Data leakage risk | Non-zero and growing | Zero |
| Operational ownership | None | Complete |
**The Honest Reframe**
> *"Most businesses do not need a model that can write Shakespeare. They need a model that knows their internal processes, their data, and their specific workflow. Local models are better at that—and they get better every day you feed them proprietary signal."*
**Technical Reality**
Modern quantized models, parameter-efficient fine-tuning (LoRA, QLoRA), and retrieval-augmented generation (RAG) have reduced the barrier to local deployment dramatically. A reasonable AI budget today can achieve what required a dedicated team two years ago.
---
### 5. The Professional Responsibility Angle
**The Problem**
As a security architect, consultant, or technical leader, you are the steward of the organization's crown jewels. Recommending that proprietary strategic intelligence be outsourced to an unauditable third-party black box is not a neutral technical decision. It is a **breach of fiduciary responsibility**.
**The Pitch**
> *"I cannot in good faith recommend that we outsource our strategic intelligence to a third-party black box that we cannot audit and that is actively incentivized to commoditize our data."*
**The Outcome**
This framing elevates the advisor from a "technical implementer" to a **Strategic Defender of the Company's Future**. It positions the recommendation not as a preference for complexity, but as a principled stand for structural integrity.
**Key Points for the Room**
- You are not selling local AI. You are protecting the organization's ability to think independently.
- The conflict of interest is real: cloud AI consultants are often incentivized by provider partnerships. Independent architects have no such conflict.
- This is the same logic that demands on-premises key management for cryptography. Intelligence is no different.
---
## The T0 Asset Classification
In cybersecurity and architecture, a **Tier 0 (T0) asset** is something that, if compromised, destroys the entire operation.
Local AI must be classified as T0. This framing speaks the language of high-stakes infrastructure and immediately elevates the conversation from "tech project" to **foundational pillar of survival**.
### Why T0?
1. **It defines the boundary of trust**: Moving intelligence inside the firewall re-establishes a perimeter that has been silently dissolving.
2. **It removes vendor risk**: A local model is vendor-independent. It remains functional regardless of Silicon Valley boardroom decisions.
3. **It signals strategic maturity**: While competitors chase shiny APIs, the T0 advocate is building durable infrastructure for a 5-to-10-year horizon.
See the full [T0 Asset Framework](t0-asset-framework.md) for implementation guidance.
---
## Implementation Posture
### Immediate (0-30 days)
- **Inventory**: Map all current AI usage—approved and shadow. Identify what data is leaving the perimeter.
- **Classify**: Label workflows by sensitivity. Anything involving IP, strategy, or customer data is a sovereignty candidate.
- **Pilot scope**: Select one non-critical, high-signal workflow for local model proof-of-concept.
### Short-term (30-90 days)
- **Deploy local inference**: Establish on-premises or sovereign-cloud inference infrastructure.
- **Fine-tune**: Train a small model (7B-13B parameters) on proprietary data for the pilot workflow.
- **Measure**: Compare accuracy, latency, cost, and leakage risk against the cloud baseline.
### Medium-term (90-180 days)
- **Expand**: Migrate additional workflows based on pilot results.
- **Integrate**: Connect local models to internal data pipelines, CMDB, and security tooling.
- **Govern**: Establish policies for approved AI usage, data handling, and model versioning.
### Long-term (180+ days)
- **Manufacture**: Build internal capability to train, evaluate, and deploy domain-specific models.
- **Distribute**: Extend sovereign intelligence to edge locations, OT environments, and disconnected operations.
- **Monetize**: Consider whether proprietary model capabilities represent a productizable asset.
---
## Common Objections and Responses
| Objection | Response |
|-----------|----------|
| "Cloud models are more capable." | For generic tasks, yes. For your proprietary tasks, a fine-tuned local model will outperform them—while keeping your data inside. |
| "Local deployment is too expensive." | Cloud AI pricing is linear with usage and unpredictable. Local is a fixed capital expense with predictable operating costs. At scale, it is cheaper. |
| "We don't have the expertise." | Start with a pilot. Modern tooling has reduced the expertise barrier dramatically. Partner for setup, own for operations. |
| "Our vendor says they don't train on our data." | Terms of service change. Verbal assurances are not architecture. If the data leaves your perimeter, you have lost control regardless of current policy. |
| "This will slow us down." | A temporary reduction in velocity is preferable to a permanent loss of strategic optionality. Build the vault first; fill it quickly after. |
---
## The Builder's Mandate
By pushing for local AI infrastructure in the corporate world, you are **decentralizing the Machine**. You are taking the intelligence that centralized cloud platforms are trying to monopolize and distributing it to the edges—where human-scale organizations live and operate.
You are building the infrastructure that allows businesses to remain **sovereign entities** rather than terminal sinks for centralized AI extraction.
This is the most responsible architecture work possible right now.
---
*Next: [T0 Asset Framework](t0-asset-framework.md)*
*Previous: [Antifragile Manifest](antifragile-manifest.md)*
# The Antifragile Enterprise Manifest
> *"Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors."*
## For the Executive Reader
An antifragile enterprise is one that does not merely survive disruption—it grows stronger from it. While competitors panic when markets shift, regulators tighten, or adversaries strike, the antifragile organization converts each shock into structural improvement, competitive distance, and operational advantage.
This is not a security framework. It is a **strategic operating philosophy** for boards and executives who intend to outlast their competitors, their regulators, and their own assumptions.
**The business case in three sentences**:
1. Your organization is currently transferring proprietary intelligence to competitors through cloud AI usage.
2. Your operational continuity depends on vendors whose interests are not aligned with your survival.
3. In 180 days, we can reverse both conditions—primarily with configuration, not procurement—and produce the evidence regulators now demand.
*For the full executive summary, see [Executive Summary](executive-summary.md).*
*For board conversation guidance, see [C-Suite Conversation Guide](c-suite-conversation-guide.md).*
---
## For the Practitioner
This manifest defines the five foundational pillars of an antifragile enterprise. It is not a security framework. It is a **strategic operating philosophy** for organizations that intend to outlast their competitors, their regulators, and their own assumptions.
---
## Pillar 1: Structural Decoupling
### Principle
The most dangerous dependencies are the ones you have not mapped. An antifragile enterprise treats every integration, vendor relationship, and shared service as a **latent single point of failure** until proven otherwise.
### The Argument
Cloud architectures have created an illusion of resilience through scale. In reality, most organizations have become **deeply coupled** to opaque platforms whose incentives are not aligned with their survival. When a critical API changes its terms, pricing model, or availability, the dependent organization has no negotiation leverage—only panic.
### Antifragile Moves
- **Map the hidden coupling graph**: Inventory every third-party dependency that touches revenue-critical workflows. Include SaaS, PaaS, APIs, identity providers, and data pipelines.
- **Design graceful degradation**: Every critical function must have a fallback mode that operates at reduced capacity without the external dependency.
- **Practice controlled failure**: Introduce chaos into non-production environments. If a system cannot survive the simulated failure of a dependency, it will not survive the real one.
- **Establish exit architectures**: For every major platform dependency, maintain a technical and procedural path to migration that can be executed within 90 days.
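The coupling-graph inventory described above can be sketched as a small data model. Everything here is an illustrative assumption, not a prescribed schema: the point is that a dependency consumed by a revenue-critical workflow, with no documented fallback, is a latent single point of failure by definition.

```python
# Sketch: minimal coupling-graph inventory for Pillar 1.
# The classes and field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Dependency:
    name: str                      # e.g. a SaaS API or identity provider
    has_fallback: bool = False     # does a graceful-degradation path exist?

@dataclass
class Workflow:
    name: str
    revenue_critical: bool
    dependencies: list = field(default_factory=list)

def latent_single_points_of_failure(workflows):
    """Return dependencies that revenue-critical workflows rely on
    without any documented fallback mode."""
    spofs = set()
    for wf in workflows:
        if not wf.revenue_critical:
            continue
        for dep in wf.dependencies:
            if not dep.has_fallback:
                spofs.add(dep.name)
    return sorted(spofs)

idp = Dependency("cloud-identity-provider")             # no fallback yet
pay = Dependency("payment-gateway", has_fallback=True)  # degraded mode exists
billing = Workflow("billing", revenue_critical=True, dependencies=[idp, pay])
print(latent_single_points_of_failure([billing]))  # ['cloud-identity-provider']
```

In a real engagement the inventory would be fed from CMDB exports and network telemetry; the logic stays the same.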
### Executive Framing
> *"Every vendor relationship is a potential monopoly waiting to happen. We architect the organization so that no single vendor can hold us hostage."*
### Consultant Framing
> *"We are not optimizing for uptime. We are optimizing for the speed at which we can replace anything that fails us."*
---
## Pillar 2: Optionality Preservation
### Principle
Optionality is the right, but not the obligation, to take action. In antifragile systems, optionality is not a luxury—it is the primary store of value. Every decision that removes options is a decision that increases fragility.
### The Argument
Vendor lock-in is the most common and least visible form of optionality destruction. Organizations sign multi-year enterprise agreements, build deep technical integrations, and train their workforce on proprietary tools—then discover they cannot leave without existential disruption. The cost of exit becomes a weapon the vendor can wield.
### Antifragile Moves
- **Prefer open standards over proprietary APIs**: Where proprietary integration is unavoidable, abstract it behind internal interfaces.
- **Maintain dual-vendor readiness for critical categories**: Even if you do not split spend, maintain the technical capability to switch.
- **Keep data portable**: Store data in formats and locations that do not require a specific vendor to interpret or access.
- **Structure contracts for exit**: Negotiate data export, transition assistance, and escrow clauses as primary terms, not afterthoughts.
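The "abstract it behind internal interfaces" move can be illustrated with a minimal adapter sketch. The `DocumentStore` interface and both backends are hypothetical stand-ins for a real vendor SDK; the design point is that callers never import the vendor directly, so a 90-day switch means writing one new class, not rewriting every workflow.

```python
# Sketch: abstracting a proprietary vendor API behind an internal interface.
# All names are illustrative assumptions, not a real SDK.
from abc import ABC, abstractmethod

class DocumentStore(ABC):
    """Internal interface; application code depends only on this."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class VendorAStore(DocumentStore):
    def __init__(self):
        self._blobs = {}          # stand-in for proprietary SDK calls
    def put(self, key, data): self._blobs[key] = data
    def get(self, key): return self._blobs[key]

class LocalStore(DocumentStore):
    def __init__(self):
        self._files = {}          # stand-in for on-prem storage
    def put(self, key, data): self._files[key] = data
    def get(self, key): return self._files[key]

def archive_contract(store: DocumentStore, contract_id: str, body: bytes):
    store.put(f"contracts/{contract_id}", body)   # vendor-agnostic caller

for backend in (VendorAStore(), LocalStore()):    # swapping = one new class
    archive_contract(backend, "C-1001", b"...")
    assert backend.get("contracts/C-1001") == b"..."
```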
### Executive Framing
> *"The most expensive decision is not the tool you buy. It is the tool that makes leaving impossible. We preserve your right to change direction in 90 days."*
### Consultant Framing
> *"The most expensive technology decision you will ever make is the one that makes your next technology decision impossible."*
---
## Pillar 3: Stress-to-Signal Conversion
### Principle
Failure is not the opposite of success; it is the raw material of it. Antifragile organizations do not merely tolerate failure—they instrument it, measure it, and convert it into structural improvements faster than their competitors.
### The Argument
Most enterprises operate in **reactive mode**: detect, respond, recover, forget. The lessons of an incident dissipate into post-mortem documents that no one reads. The same failures recur because the organization has no mechanism for converting stress into signal and signal into structure.
### Antifragile Moves
- **Instrument everything that can fail**: If you cannot measure the pre-failure state, you cannot learn from the failure.
- **Run blameless post-mortems with structural mandates**: Every significant incident must produce at least one structural change—policy, architecture, or procedure.
- **Deploy chaos engineering in production**: Synthetic failures reveal weaknesses that testing environments cannot.
- **Build feedback loops shorter than your mean time to recovery**: If your feedback loop is slower than your recovery, you are learning too late.
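The last rule above is checkable arithmetic: if the cadence at which lessons are reviewed exceeds the mean time to recovery, incidents are resolved faster than they are learned from. A minimal sketch with illustrative incident data:

```python
# Sketch: "feedback loop shorter than mean time to recovery" as a check.
# Incident timestamps and the review cadence are illustrative assumptions.
from datetime import datetime, timedelta

def mean_time_to_recover(incidents):
    """incidents: list of (detected_at, recovered_at) pairs."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)

incidents = [
    (datetime(2025, 1, 3, 9, 0), datetime(2025, 1, 3, 13, 0)),   # 4 h outage
    (datetime(2025, 2, 7, 22, 0), datetime(2025, 2, 8, 4, 0)),   # 6 h outage
]
mttr = mean_time_to_recover(incidents)
post_mortem_cadence = timedelta(days=14)        # bi-weekly review cycle
learning_too_late = post_mortem_cadence > mttr  # lessons lag behind recovery
print(mttr, learning_too_late)  # 5:00:00 True
```

Here a 14-day review cycle against a 5-hour MTTR signals that the learning loop, not the recovery loop, is the bottleneck.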
### Executive Framing
> *"Every failure is free intelligence. The organizations that learn fastest from setbacks outperform those that merely prevent them."*
### Consultant Framing
> *"We do not want fewer incidents. We want incidents that teach us something we could not have learned any other way."*
---
## Pillar 4: Sovereign Intelligence
### Principle
An organization that outsources its cognition outsources its future. Sovereign intelligence means owning the models, data, and reasoning infrastructure that drive strategic and operational decisions.
### The Argument
The current AI paradigm is extractive. Every prompt sent to a cloud AI is a contribution to a competitor's training set. Every workflow built on a third-party model is a dependency on an intelligence you do not control, cannot audit, and cannot guarantee will serve your interests tomorrow. This is not a privacy concern. It is a **survival concern**.
Sovereign intelligence is the antifragile response: local models, proprietary data loops, and owned reasoning infrastructure that improves with use rather than leaking value to external platforms.
### Antifragile Moves
- **Classify intelligence as a Tier 0 asset**: Treat proprietary models, fine-tuned weights, and reasoning pipelines with the same protective rigor as cryptographic keys.
- **Deploy local AI infrastructure for sensitive workflows**: Run models on hardware you control, behind your own perimeter.
- **Close the data loop**: Ensure proprietary data used for training or inference never leaves your environment.
- **Build internal model manufacturing capability**: Move from consuming AI to producing intelligence tailored to your domain.
### Executive Framing
> *"You would not store your physical cash in a bank that lends it to competitors and reserves the right to change the currency. Your intellectual capital deserves the same protection. Local AI is the vault."*
### Consultant Framing
> *"If our company's intelligence were a physical pile of cash, would we store it in a public bank that takes a 'training fee' off every dollar and reserves the right to change the currency? Or would we keep it in our own vault?"*
See the full [AI Sovereignty Framework](ai-sovereignty-framework.md) for detailed arguments, counter-objections, and implementation guidance.
For the distinction between optional business AI and inevitable operational AI, see [AI Operations Inevitability](ai-operations-inevitability.md).
---
## Pillar 5: Asymmetric Payoff Design
### Principle
Antifragile systems are engineered so that small investments in protection yield disproportionately large reductions in catastrophic risk. The goal is not to eliminate all risk—it is to ensure that the remaining risks are **convex**: limited downside, unlimited upside from learning.
### The Argument
Traditional risk management treats all risks as equally worth mitigating. This is inefficient. An antifragile enterprise identifies the **small number of decisions and dependencies** whose failure would be existential, and concentrates disproportionate investment there. Everything else is managed with optionality and rapid recovery.
### Antifragile Moves
- **Identify your "kill chain"**: Map the shortest sequence of failures that would end the organization. Protect those nodes above all others.
- **Invest in recovery over prevention**: For complex systems, perfect prevention is impossible. Sub-second detection and minute-level recovery are achievable and more valuable.
- **Create convex experiments**: Run small, bounded-risk pilots that expose asymmetric upside—new capabilities discovered through controlled stress.
- **Never spend more preventing a risk than the risk would cost if realized**: The one exception is the kill chain, where the potential cost is existential and the cap does not apply.
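The spending rule above reduces to a one-line check with a kill-chain exception. The loss and cost figures below are illustrative assumptions:

```python
# Sketch: the asymmetric-payoff spending rule from the list above.
# Figures are illustrative; "on_kill_chain" marks existentially
# critical nodes, which are exempt from the cap.
def spend_is_justified(annualized_loss, mitigation_cost, on_kill_chain=False):
    """Cap prevention spend at the expected loss, except on the kill chain."""
    if on_kill_chain:
        return True               # existential nodes: protect regardless
    return mitigation_cost <= annualized_loss

assert spend_is_justified(annualized_loss=500_000, mitigation_cost=120_000)
assert not spend_is_justified(annualized_loss=50_000, mitigation_cost=300_000)
assert spend_is_justified(50_000, 300_000, on_kill_chain=True)  # exception
```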
### Executive Framing
> *"We are not buying insurance. We are engineering the geometry of risk so that market volatility, regulatory pressure, and competitive threats strengthen our position rather than weaken it."*
### Consultant Framing
> *"We are not buying insurance. We are engineering the geometry of our risk so that volatility makes us richer, not poorer."*
---
## Living Document
This manifest is a living framework. Each engagement will surface new stressors, new patterns, and new refinements. Update it. Challenge it. Make it stronger.
---
*Next: [AI Sovereignty Framework](ai-sovereignty-framework.md)*


# Azure OpenAI / Foundry: The Sovereignty Bridge
> *"Full sovereignty tomorrow is impossible if you refuse to move today. Azure OpenAI is not the destination. It is the bridge that gets your organization walking in the right direction."*
This document provides the strategic framing, technical positioning, and migration pathway for consultants who want to move clients **away from public cloud AI APIs** and **toward controlled, resident AI infrastructure**—using Microsoft Azure OpenAI Service and Azure AI Foundry as the pragmatic intermediate step.
It is designed for M365/Azure consultancies whose clients are not ready for on-premises GPU clusters but must stop leaking proprietary data to public AI models.
---
## The Executive Summary
Your clients are likely using ChatGPT, Claude, or Gemini via public APIs and consumer accounts. Every prompt leaves their perimeter, and the terms of service allow model improvement using that data. This is the worst possible posture.
**Azure OpenAI Service is not fully sovereign.** Microsoft operates the infrastructure. The underlying models are shared. But it offers something critical that public APIs do not:
- **Your data does not train foundation models.** Microsoft's data processing agreement explicitly states that Azure OpenAI Service data is not used to train OpenAI's models.
- **Data residency.** Prompts and completions remain in your Azure region (EU, US, etc.).
- **Network isolation.** Private endpoints, VNet integration, and no public internet exposure.
- **Encryption with customer-managed keys.** You control the keys that encrypt your data at rest.
- **Audit and governance.** Full logging through Azure Monitor, diagnostic settings, and Microsoft Purview.
- **Path to future sovereignty.** Fine-tuned models, custom datasets, and embeddings remain portable assets that can migrate to local inference later.
**The argument**: Azure OpenAI is the **sovereignty bridge**. It is not the vault. But it moves the client from the public street into a **leased apartment in their own building**—and from there, they can build their own vault when ready.
---
## The Public API vs. Azure OpenAI vs. Local Spectrum
| Dimension | Public API (ChatGPT, Claude, etc.) | Azure OpenAI / Foundry | Local / Sovereign AI |
|-----------|-----------------------------------|------------------------|---------------------|
| **Data trains foundation models?** | Yes (check current terms; subject to change) | **No** (Microsoft DPA) | No |
| **Data residency** | Unknown / US-centric | **Customer's Azure region** | Your data centre |
| **Network exposure** | Public internet | **Private endpoints / VNet** | Air-gapped possible |
| **Encryption control** | Provider-managed | **Customer-managed keys (CMK)** | Full control |
| **Model customization** | Limited (prompt engineering) | **Fine-tuning, embeddings, RAG** | Full weights, architecture, training |
| **Auditability** | None | **Full Azure logging** | Complete |
| **Vendor lock-in** | Extreme | **Moderate (portable models)** | Minimal |
| **Operational cost** | Variable, unpredictable | **Predictable, metered** | Fixed capital |
| **Setup complexity** | Low | **Medium** | Higher |
| **Sovereignty maturity** | 0% | **60-70%** | 100% |
**The pitch**:
> *"Public APIs are a taxi: convenient, but you do not own the car, the driver works for someone else, and everything you say in the back seat becomes part of the driver's training. Azure OpenAI is a leased car in your garage: you control the keys, the trips stay in your neighborhood, and the driver does not learn from your conversations. Local AI is building your own car. We start with the leased car because it stops the bleeding today, and it keeps your options open for building your own tomorrow."*
---
## The Three Arguments for Azure OpenAI as a Bridge
### 1. Stop the Hemorrhage Now
**The Problem**: Shadow AI usage is rampant. Employees use personal ChatGPT accounts for code review, contract analysis, strategy documents, and customer data. This data is leaving the perimeter continuously.
**The Bridge**: Azure OpenAI Service deployed with private endpoints and conditional access gives employees a **sanctioned, governed alternative** that stops the shadow usage.
**The Metrics**:
- Week 1: Inventory shadow AI usage via proxy logs and surveys
- Week 2: Deploy Azure OpenAI with restricted access
- Week 4: Measure reduction in public API traffic; measure increase in sanctioned usage
**The executive framing**:
> *"We cannot achieve full sovereignty in 30 days. But we can stop funding your competitors' R&D in 30 days. Azure OpenAI gives your teams a better tool than the public API, with the guarantee that your data is not training anyone else's model."*
---
### 2. Build Portable Assets
**The Problem**: When a client uses public APIs, they own nothing. No models, no weights, no training data, no embeddings. They are pure consumers.
**The Bridge**: Azure AI Foundry (formerly Azure AI Studio) allows clients to:
- Create **custom fine-tuned models** on proprietary data
- Build **vector indexes and embeddings** from internal documents
- Develop **RAG pipelines** that combine retrieval with generation
- Export **datasets and fine-tuned open-model weights** (Llama, Mistral, Phi) for future migration; fine-tunes of OpenAI models remain hosted in Azure
These are **assets**, not expenses. A fine-tuned model trained on a client's proprietary data is intellectual property that improves over time and can be moved to local infrastructure later.
**The executive framing**:
> *"With Azure Foundry, every prompt improves your internal capabilities. You build vector stores of your documents, fine-tuned models of your domain, and RAG pipelines of your workflows. When you are ready to move fully on-premises, you pack these assets and migrate them. You are not renting intelligence. You are building it—and storing it in a Microsoft warehouse until your own vault is ready."*
---
### 3. Maintain Optionality for Full Sovereignty
**The Problem**: Clients fear that choosing Azure OpenAI now will lock them into Microsoft forever, preventing a future move to local AI.
**The Bridge**: Azure OpenAI actually **preserves optionality** compared to public APIs because:
- Fine-tuned **open models** (Llama, Mistral, Phi from the Foundry catalog) can be exported and converted to ONNX or other formats; fine-tunes of OpenAI models cannot, so plan portability-critical customization on open models
- Embedding vectors are plain numeric arrays that any local vector database can store, though switching embedding models later requires re-indexing the corpus
- RAG pipelines built on LangChain or Semantic Kernel are portable across inference backends
- Prompt templates and evaluation datasets are vendor-agnostic
**The Migration Path**:
```
Month 0-3: Azure OpenAI Service (sanctioned replacement for public APIs)
→ Private endpoints, CMK, conditional access, Purview governance
Month 3-6: Azure AI Foundry (customization)
→ Fine-tuning on proprietary data, RAG pipelines, vector stores
Month 6-12: Hybrid architecture
→ Sensitive workloads on local inference (Ollama, vLLM)
→ General workloads on Azure OpenAI
Month 12-24: Full sovereignty (if justified)
→ Local inference cluster for all proprietary workloads
→ Azure OpenAI retained only for non-sensitive, generic tasks
```
**The executive framing**:
> *"We are not betting on Microsoft forever. We are using Microsoft to stop the bleeding, build portable assets, and train your team on AI operations. When your local infrastructure is ready, your models, your embeddings, and your pipelines move with you. That is optionality preservation."*
---
## Technical Positioning for Security-Conscious Clients
### Data Protection Architecture
| Control | Azure OpenAI Capability | Configuration Required |
|---------|------------------------|------------------------|
| **Data residency** | Regional deployment | Deploy to client's primary Azure region (e.g., West Europe, Germany West Central) |
| **Network isolation** | Private Link / Private Endpoints | Disable public network access; route all traffic through VNet |
| **Encryption at rest** | Microsoft-managed or CMK | Enable customer-managed keys in Azure Key Vault |
| **Encryption in transit** | TLS 1.2+ | Enforce minimum TLS version |
| **Access control** | Azure RBAC | Role-based access with least privilege; no standing admin access |
| **Audit logging** | Azure Monitor, Diagnostic Settings | Enable all diagnostic logs; forward to SIEM |
| **Data loss prevention** | Microsoft Purview | Classify data; block high-sensitivity data from AI endpoints if required |
| **Retention** | Configurable | Set retention policies aligned with data governance |
### The Foundry / AI Studio Value Proposition
Azure AI Foundry provides:
- **Model catalog**: GPT-4, GPT-3.5, Embeddings, DALL-E, plus open models (Llama, Mistral, Phi) deployable in your Azure tenant
- **Prompt flow**: Visual pipeline builder for RAG and agent workflows
- **Evaluation tools**: Built-in evaluation for model performance, safety, and groundedness
- **Content safety**: Built-in filtering for harmful content, PII detection
- **Tracing and observability**: Full lineage of prompts, responses, and intermediate steps
**The security argument**: Foundry gives you **governance tooling that public APIs lack**. You can see who is asking what, evaluate whether responses are grounded in your data, and enforce content policies.
---
## When Azure OpenAI Is NOT Enough
Be honest with clients. Azure OpenAI has limits:
| Limitation | Implication | When to Escalate to Local |
|-----------|-------------|---------------------------|
| Microsoft still operates the infrastructure | Subpoena risk, geopolitical access | When handling classified, state-secret, or criminal-defense data |
| Shared model weights (for base models) | Other tenants use the same underlying model | When model behaviour must be fully deterministic and auditable |
| Requires internet connectivity (even with private endpoints) | Azure backbone dependency | For fully air-gapped environments (submarines, defense, some OT) |
| Per-token pricing for inference | Cost scales with usage | At very high volume, local inference becomes cheaper |
| Limited to Azure regions | Some nations require domestic cloud | When data sovereignty laws mandate in-country infrastructure not served by Azure |
**The honest pitch**:
> *"Azure OpenAI is not perfect sovereignty. It is 70% sovereignty. For most organizations, that is the right starting point because it stops the worst leakage immediately while you build toward 100%. If you handle state secrets or operate in fully air-gapped environments, we skip this step and go straight to local. For everyone else, the bridge is the fastest path to safety."*
---
## Integration With Existing Frameworks
### The AI Sovereignty Framework
Azure OpenAI maps to the sovereignty framework as **Phase 1** of the journey:
| Sovereignty Phase | Implementation |
|-------------------|----------------|
| **Phase 0 (Current)** | Public APIs, consumer accounts, shadow AI |
| **Phase 1 (Azure OpenAI)** | Sanctioned, governed, resident AI with data protection guarantees |
| **Phase 2 (Hybrid)** | Sensitive workloads local; general workloads on Azure |
| **Phase 3 (Full Sovereign)** | All proprietary workloads on local inference; Azure retained for generic tasks only |
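The Phase 2 hybrid split can be sketched as a classification-based router. The labels and backend names below are assumptions; in practice the classification would come from Purview sensitivity labels, and an unknown label should fail closed to the sovereign backend.

```python
# Sketch: Phase 2 hybrid routing by data classification.
# Labels and backend names are illustrative assumptions.
LOCAL, AZURE = "local-inference", "azure-openai"

ROUTING = {
    "public": AZURE,
    "internal": AZURE,
    "confidential": LOCAL,   # proprietary data never leaves the perimeter
    "restricted": LOCAL,
}

def route(classification: str) -> str:
    # Fail closed: anything unlabeled goes to the sovereign backend.
    return ROUTING.get(classification, LOCAL)

assert route("internal") == AZURE
assert route("restricted") == LOCAL
assert route("unlabeled") == LOCAL   # fail-closed default
```

The fail-closed default is the design choice worth defending to the client: misclassification cost is asymmetric, so ambiguity should always resolve toward sovereignty.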
### The Rapid Modernisation Plan
| Rapid Modernisation Phase | Azure OpenAI Integration |
|--------------------------|-------------------------|
| **Hygiene (Days 0-30)** | Inventory shadow AI; deploy Azure OpenAI as sanctioned alternative |
| **Control (Days 30-60)** | Private endpoints, CMK, RBAC, conditional access, Purview governance |
| **Sovereignty (Days 60-90)** | Foundry pilot: fine-tuning, RAG, vector store on proprietary data |
| **Antifragility (Days 90-180)** | Evaluate migration of high-sensitivity workloads to local inference; retain Azure for lower-sensitivity use cases |
### The M365 E3 Hardening Playbook
For E3 clients, Azure OpenAI is consumed through an **Azure subscription**, billed separately from M365 licensing; it does not require E5. The key integration points:
- **Entra ID conditional access**: Restrict Azure OpenAI access to compliant devices, trusted locations, and specific user groups
- **Microsoft Purview**: Classify documents before they enter RAG pipelines (requires Purview licensing)
- **Defender for Cloud Apps**: Monitor and control shadow AI usage alongside sanctioned Azure OpenAI usage
---
## Talking Points for the C-Suite
| Concern | Response |
|---------|----------|
| "Is this just another Microsoft lock-in?" | "It reduces lock-in compared to public APIs because your fine-tuned models, embeddings, and RAG pipelines are portable assets. When you are ready for full local AI, you migrate them. We are using Azure as a warehouse, not a prison." |
| "Why not go straight to local AI?" | "Local AI requires hardware procurement, infrastructure setup, and expertise development—typically 3-6 months. Azure OpenAI stops the data leakage in 2 weeks while we build the local capability in parallel." |
| "How is this different from just using ChatGPT?" | "ChatGPT trains on your data. Azure OpenAI explicitly does not. ChatGPT has no audit trail. Azure OpenAI logs every prompt. ChatGPT offers no data residency guarantee. Azure OpenAI keeps your data in your region. The difference is governance, not capability." |
| "What if Microsoft changes the terms?" | "The data processing agreement is contractually binding. More importantly, the assets we build in Foundry are portable. If terms change unfavorably, we exercise the exit option we have been building toward all along." |
| "Will this slow down our AI adoption?" | "It will accelerate safe adoption. Employees currently use unauthorized AI because there is no sanctioned alternative. Azure OpenAI gives them a better, safer tool. Adoption goes up; risk goes down." |
---
*For the full AI sovereignty argument, see [AI Sovereignty Framework](ai-sovereignty-framework.md).*
*For the operational AI inevitability argument, see [AI Operations Inevitability](ai-operations-inevitability.md).*
*For the M365 integration specifics, see [M365 Antifragile Project](../playbooks/m365-antifragile-project.md).*


# Blue / Purple Team Foundation
> *"Most organizations own a Ferrari-grade security stack and drive it like a rental car. The tools are not the problem. The team's ability to use them is."*
This document defines an engagement model for building **sustainable defensive capability**—not by selling more tools, but by operationalizing what the client already owns. It is designed for Heads of Security who feel they are not in control despite owning Microsoft Defender, Sentinel, and other advanced security platforms.
The focus is on **Defender Exposure Management** (formerly Microsoft Defender Threat & Vulnerability Management / Secure Score), **Sentinel** (if deployed), and the **people and processes** required to turn telemetry into action.
---
## The "Tools-Without-Capability" Trap
Many organizations have purchased or inherited an impressive security stack:
| Tool | Typical Ownership State | What the Head of Security Feels |
|------|------------------------|--------------------------------|
| **Microsoft Defender for Endpoint** (E5) | Installed on 60% of endpoints; ASR rules in audit mode; alerts ignored | "We have EDR but nobody hunts" |
| **Microsoft Sentinel** | Log ingestion configured; 47 built-in analytics rules active; 200 alerts/day; 2 analysts | "Sentinel generates noise, not intelligence" |
| **Defender for Office 365** | Safe Links enabled; 10,000 quarantined emails/month; no review process | "We catch threats but do not learn from them" |
| **Defender for Cloud / Exposure Management** | Secure Score visible; recommendations listed; remediation rate < 20% | "We know what is wrong but cannot fix it fast enough" |
| **Entra ID Identity Protection** | Risk detections logged; no automated response; manual review weekly | "We detect risky sign-ins but respond too slowly" |
**The pattern**: They own the tools. They lack the **operating rhythm**.
- No tiered alert triage (everything is "P1" or nothing is)
- No hunt hypothesis (analysts wait for alerts, they do not seek anomalies)
- No metrics that matter (SOC reports ticket volume, not mean-time-to-contain)
- No purple team culture (offence and defence never talk)
- No continuous improvement loop (findings do not produce structural change)
---
## The Engagement Model: From Tool Ownership to Operational Capability
### Phase 1: Capability Audit (Week 1-2)
**Objective**: Assess not the tools, but the **team's ability to use them**.
> **Critical distinction for outsourced SOCs**: If the client uses an MSSP, the capability audit must assess the **MSSP's detection coverage in the client's environment**, not just the client's internal team. See [Retained Capability](retained-capability.md) for the full MSSP co-management model.
**Tool Capability Assessment**:
| Capability | Maturity Question | Score (1-5) |
|-----------|-------------------|-------------|
| **Alert Triage** | Can a Tier-1 analyst correctly prioritize a Defender alert without escalating? | |
| **Threat Hunting** | Has the team run a proactive hunt in the last 30 days? | |
| **Incident Response** | Is there a documented, tested IR playbook for M365 compromise? | |
| **Vulnerability Management** | Is there an SLA for critical vulnerability remediation? | |
| **Exposure Management** | Is Secure Score reviewed weekly with ownership assignments? | |
| **Metrics & Reporting** | Does the SOC report mean-time-to-detect and mean-time-to-contain? | |
| **Purple Team** | Have red and blue teams collaborated in the last 90 days? | |
| **Automation** | Are repeatable tasks automated (isolation, disable account, enrich alert)? | |
| **MSSP Detection Coverage** | If using an MSSP: have they detected >70% of emulated TTPs in your environment? | |
**Deliverable**: Capability Gap Report
- Current maturity score per capability
- Target maturity score (realistic 12-month goal)
- Priority gaps: which missing capabilities create the most risk?
- Tool utilization heatmap: which purchased features are unused?
**The conversation (in-house SOC)**:
> *"Your Defender Secure Score is 42 out of 100. But the score itself is not the problem. The problem is that you have 38 open recommendations, 12 of them critical, and no one owns the remediation of any of them. We are not here to raise your score. We are here to build the operating rhythm that keeps your score rising without consultant dependency."*
**The conversation (outsourced SOC / MSSP)**:
> *"Your MSSP generates 200 tickets per month and meets every SLA. But when we emulated five common attack techniques last week, the MSSP detected only two. The other three—lateral movement via RDP, data staging in unusual locations, and exfiltration via personal cloud storage—were invisible to them. Not because they are incompetent, but because their generic rules do not know your environment. We do not replace the MSSP. We build the 1.5-person detection engineering cell that writes custom rules for your environment and makes the MSSP actually effective."*
---
### Phase 2: Quick Wins & Operating Rhythm (Week 3-6)
**Objective**: Build the basic operating rhythm that makes the tools useful.
#### 2A: Defender Exposure Management Operationalization
**The tool**: Defender Exposure Management (formerly TVM / Secure Score) provides:
- Vulnerability inventory across endpoints
- Misconfiguration detection (Secure Score)
- Attack surface reduction recommendations
- Threat analytics and vulnerability exploitation intelligence
**What most organizations do**: Look at the dashboard once a quarter.
**What we implement**:
| Activity | Frequency | Owner | Output |
|----------|-----------|-------|--------|
| Secure Score review | Weekly | Security lead + IT owner | 3 prioritized remediation actions |
| Vulnerability prioritization | Weekly | Vuln management analyst | Risk-ranked list: exploitability × asset criticality |
| Exposure remediation sprint | Bi-weekly | IT + Security | Closed vulnerabilities, validated |
| Threat intelligence brief | Weekly | Threat intel analyst | New CVEs affecting our estate; hunting hypotheses |
| ASR rule review | Monthly | Endpoint security admin | Audit-mode hits analyzed; block-mode rules justified |
**The key discipline**: Every open recommendation must have an owner and a due date. No orphaned findings.
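The "no orphaned findings" discipline is mechanically checkable. A minimal sketch, assuming a simple record shape rather than the actual Defender recommendations schema:

```python
# Sketch: enforcing "every open recommendation has an owner and a due date".
# The record shape is an assumption, not the Defender API schema.
from datetime import date

def orphaned(recommendations):
    """Return titles of open recommendations missing an owner or a due date."""
    return [r["title"] for r in recommendations
            if r["state"] == "open" and not (r.get("owner") and r.get("due"))]

recs = [
    {"title": "Enable MFA for admins", "state": "open",
     "owner": "idteam", "due": date(2025, 3, 1)},
    {"title": "Patch CVE-2025-1234", "state": "open", "owner": None, "due": None},
    {"title": "Disable legacy auth", "state": "closed"},
]
print(orphaned(recs))  # ['Patch CVE-2025-1234']
```

Run weekly alongside the Secure Score review, the output is the list of findings the meeting must assign before it ends.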
#### 2B: Alert Triage & Enrichment
**What most organizations do**: Alert arrives → analyst reads it → creates ticket → waits for senior analyst.
**What we implement**:
- **Tier-1 triage playbook**: Decision tree for common Defender alerts (suspicious PowerShell, credential dumping, lateral movement)
- **Automated enrichment**: Logic App or Power Automate flow that enriches alerts with user info, device info, recent sign-ins, geo-location
- **Auto-response for high-confidence alerts**: Isolate device, disable user, block IP for confirmed malicious indicators
- **Alert tuning**: Disable or suppress noisy rules; customize thresholds per client environment
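A tier-1 triage decision tree is easiest to keep consistent when it is written down as code rather than prose. The severity labels, alert fields, and thresholds below are illustrative assumptions, not Defender's actual alert schema:

```python
# Sketch: a simplified tier-1 triage decision tree.
# Field names, categories, and priorities are illustrative assumptions.
def triage(alert):
    """Return (priority, action) for a simplified Defender-style alert."""
    if alert.get("confirmed_malicious"):
        return "P1", "auto-isolate device, disable user"
    if alert["category"] in {"credential_dumping", "lateral_movement"}:
        return "P1", "escalate to tier 2 with enrichment"
    if alert["category"] == "suspicious_powershell" and alert["asset_critical"]:
        return "P2", "enrich and review within 4 hours"
    return "P3", "batch review in daily triage"

a = {"category": "lateral_movement", "asset_critical": False}
print(triage(a))  # ('P1', 'escalate to tier 2 with enrichment')
```

The same branches map directly onto Logic App conditions when the auto-response step is automated.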
#### 2C: The First Hunt
**What most organizations do**: "We would hunt if we had time."
**What we implement**:
- **Hunt hypothesis workshop**: 2-hour session where blue team proposes 3 hypotheses based on recent threat intelligence
- **Guided first hunt**: Consultant and blue team analyst pair on one hypothesis
- Example: "We believe an adversary might be using living-off-the-land binaries (LOLBins) for reconnaissance. Let us hunt for unusual WMIC, net.exe, or nltest usage."
- **Hunt report template**: Documented findings, evidence, and structural improvements (not just "found nothing")
- **Hunt calendar**: Commit to one hunt per month for the next quarter
**For MSSP clients**: The first hunt often reveals gaps in MSSP detection coverage. These gaps become the first custom detection rules the retained capability cell writes and deploys.
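In a Defender estate the LOLBin hypothesis would normally be expressed as a KQL advanced-hunting query; the sketch below runs the same logic in Python over illustrative process events, with assumed field names, to make the hunt's shape concrete: filter to the binaries of interest, then subtract the known-good baseline.

```python
# Sketch: the LOLBin hunt hypothesis from 2C, run over sample events.
# Event field names and the baseline are illustrative assumptions.
LOLBINS = {"wmic.exe", "nltest.exe", "net.exe", "net1.exe"}

def suspicious_lolbin_events(events, baseline_users):
    """Flag LOLBin executions by accounts outside the known-admin baseline."""
    return [e for e in events
            if e["process"].lower() in LOLBINS
            and e["user"] not in baseline_users]

events = [
    {"process": "wmic.exe",   "user": "svc-helpdesk", "host": "FIN-PC-07"},
    {"process": "net.exe",    "user": "it-admin",     "host": "DC-01"},
    {"process": "nltest.exe", "user": "j.doe",        "host": "HR-PC-11"},
]
hits = suspicious_lolbin_events(events, baseline_users={"it-admin"})
print([h["host"] for h in hits])  # ['FIN-PC-07', 'HR-PC-11']
```

The baseline subtraction is the part that makes this a hunt rather than an alert: the goal is anomalous usage, not all usage.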
**Deliverable**: Operating Rhythm Playbook
- Weekly, bi-weekly, and monthly cadence definitions
- RACI matrix for each activity
- Dashboard definitions and data sources
- Automated enrichment and response runbooks
---
### Phase 3: Purple Team Foundation (Week 7-10)
**Objective**: Break the silo between offence and defence. Build collaborative muscle.
#### The Purple Team Exercise
Unlike a red team (adversarial, stealthy) or a blue team (defensive, reactive), a purple team is **collaborative and educational**:
| Phase | Red Team Action | Blue Team Action | Purple Team Outcome |
|-------|---------------|------------------|---------------------|
| **Plan** | Propose 3 TTPs to test | Evaluate detection coverage for each TTP | Agreed scope: which TTPs, which tools, which metrics |
| **Execute** | Attempt TTP in controlled manner | Observe and document what their tools see | Real-time comparison: what was expected vs. what was detected |
| **Analyze** | Explain technique and evasion methods | Explain detection logic and gaps | Shared understanding of why something was missed |
| **Improve** | Suggest additional TTPs for future | Implement detection rules, tuning, or architectural changes | Closed-loop: every missed detection becomes a structural fix |
#### First Purple Team Exercise (Example)
**Scope**: M365 identity compromise simulation
| TTP | Red Team Action | Blue Team Detection Target | Outcome |
|-----|---------------|---------------------------|---------|
| Password spray | Attempt 50 logins against 10 accounts | Entra ID Identity Protection risky sign-in alert | Did alert fire? Was it tuned? Was response automated? |
| OAuth consent grant | Create malicious enterprise app; trick user into consent | Defender for Cloud Apps anomaly alert | Is user consent blocked? Is app inventory current? |
| Mailbox rule manipulation | Create forwarding rule to external address | Defender for Office 365 alert | Is alert enabled? Who responds? How fast? |
| Lateral movement via Teams | Exfiltrate files via Teams external share | DLP / sharing anomaly alert | Are sharing policies enforced? Is external sharing monitored? |
**Duration**: One day (not a month-long red team)
**Audience**: Blue team analysts, IT admins, security architect
**Output**: Detection gap matrix; prioritized improvements; next exercise scheduled
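The detection gap matrix named in the output above can be kept as a small structured artifact rather than a slide. A minimal sketch in Python, where the `GapEntry` fields and the sort order are illustrative assumptions, not a prescribed schema:

```python
# Capture a purple team exercise as a detection gap matrix.
# Field names, TTPs, and the sort order are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class GapEntry:
    ttp: str                  # technique exercised by the red team
    expected_alert: str       # detection the blue team expected to fire
    detected: bool            # did the alert actually fire?
    response_automated: bool  # was containment automated?

def prioritize(matrix):
    """Undetected TTPs first, then detected-but-manual: the next sprint's fix list."""
    return sorted(matrix, key=lambda e: (e.detected, e.response_automated))

matrix = [
    GapEntry("Password spray", "Entra ID risky sign-in", True, False),
    GapEntry("OAuth consent grant", "Defender for Cloud Apps anomaly", False, False),
    GapEntry("Mailbox forwarding rule", "Defender for Office 365 alert", True, True),
]

for entry in prioritize(matrix):
    print(f"[{'OK ' if entry.detected else 'GAP'}] {entry.ttp} -> {entry.expected_alert}")
```

Sorting undetected TTPs first turns the matrix directly into the prioritized improvement list the exercise is supposed to produce.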
#### Building the Purple Team Habit
| Cadence | Activity | Participants |
|---------|----------|--------------|
| Monthly | Purple team exercise (half-day) | 1 red teamer + 2-3 blue teamers + observer |
| Monthly | Threat intel brief + hunt hypothesis | Threat intel + SOC + IT |
| Quarterly | Tabletop exercise (ransomware, BEC, insider threat) | Security + IT + Legal + Comms + Executive |
| Quarterly | Detection engineering sprint | SOC + IT + Consultant |
**Deliverable**: Purple Team Charter
- Scope rules (what is in-bounds, what is out-of-bounds)
- Cadence calendar
- Metrics: detection rate, mean-time-to-detect, false positive rate, improvement closure rate
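Three of the four charter metrics can be derived mechanically from a running exercise log (false positive rate is measured from the SOC alert log instead). A minimal sketch, where the log schema and the numbers are illustrative assumptions:

```python
# Derive Purple Team Charter metrics from an exercise log.
# The log schema and values below are illustrative assumptions, not a standard.

log = [
    # one row per TTP tested across the monthly exercises
    {"detected": True,  "minutes_to_detect": 12,   "fix_closed": True},
    {"detected": False, "minutes_to_detect": None, "fix_closed": True},
    {"detected": True,  "minutes_to_detect": 45,   "fix_closed": False},
    {"detected": True,  "minutes_to_detect": 8,    "fix_closed": True},
]

detection_rate = sum(r["detected"] for r in log) / len(log)
delays = [r["minutes_to_detect"] for r in log if r["detected"]]
mean_time_to_detect = sum(delays) / len(delays)
closure_rate = sum(r["fix_closed"] for r in log) / len(log)

print(f"Detection rate: {detection_rate:.0%}")            # 75%
print(f"Mean time to detect: {mean_time_to_detect:.1f} min")
print(f"Improvement closure rate: {closure_rate:.0%}")    # 75%
```

Trending these three numbers month over month is what makes the cadence calendar above auditable rather than ceremonial.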
---
### Phase 4: Roadmap & Handover (Week 11-12)
**Objective**: The team owns the capability. The consultant provides advisory oversight only.
**Activities**:
- **12-month roadmap**: Prioritized capability improvements with timelines and resource estimates
- Month 1-3: Operating rhythm stabilized; weekly Secure Score reviews; monthly hunts
- Month 4-6: Automated response for tier-1 alerts; SOAR playbooks (or Logic Apps)
- Month 7-9: Advanced hunting training; custom KQL detection rules
- Month 10-12: Full purple team program; quarterly adversarial simulation; threat-led penetration testing (DORA)
- **Knowledge transfer**: Document every custom query, playbook, and tuning decision
- **Metrics baseline**: Establish the metrics dashboard the team will use to self-assess
- **Advisory retainer**: Optional monthly 4-hour check-in for escalation support and advanced scenarios
**Deliverable**: Blue Team Capability Roadmap
- Maturity targets per capability
- Resource requirements (headcount, training, tooling)
- Quarterly milestones and validation criteria
- RACI for ongoing operations
---
## Specific Tool Deep-Dives
### Defender Exposure Management (Secure Score + TVM)
**Current state at most clients**: Secure Score is a number they see but do not act on.
**Operationalization**:
1. **Weekly Secure Score standup** (15 minutes):
- What changed since last week?
- What are the top 3 easiest wins?
- What is blocked and needs escalation?
2. **Vulnerability SLA**:
- Critical (exploited in the wild): 48 hours
- High (exploit available): 7 days
- Medium: 30 days
- Low: 90 days
3. **Exposure-based prioritization**:
- Do not patch everything. Patch the vulnerabilities on the assets that are:
- Internet-facing
- Used for privileged access
- Unprotected by compensating controls
4. **Threat analytics integration**:
- Review Defender Threat Analytics weekly
- Map active threat actor TTPs to your environment
- Generate hunt hypotheses from threat intelligence
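The SLA tiers (step 2) and exposure criteria (step 3) combine into one triage rule. A minimal sketch, where the scoring weights and the `triage` helper are illustrative assumptions, not a Defender feature:

```python
# Turn the vulnerability SLA tiers and exposure criteria into a triage rule.
# Severity labels and asset attributes mirror the lists above; the scoring
# weights are illustrative assumptions, not a standard.
from datetime import date, timedelta

SLA_DAYS = {"critical": 2, "high": 7, "medium": 30, "low": 90}

def triage(severity, internet_facing, privileged, compensating_controls,
           found=date(2025, 1, 1)):
    """Return (due_date, exposure_score). Higher score = patch first."""
    due = found + timedelta(days=SLA_DAYS[severity])
    score = (2 * internet_facing             # reachable by anyone
             + 2 * privileged                # compromise yields admin access
             + (not compensating_controls))  # nothing else mitigates it
    return due, score

due, score = triage("critical", internet_facing=True,
                    privileged=False, compensating_controls=False)
print(due, score)  # 2025-01-03 3
```

A critical finding on an internet-facing asset with no compensating control gets a 48-hour deadline and the top exposure score; a low finding on a protected internal asset can safely wait its 90 days.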
### Microsoft Sentinel (If Deployed)
**Current state at most clients**: Ingesting logs; generating alerts; drowning in noise.
**Operationalization**:
1. **Alert quality audit**:
- Review last 30 days of alerts
- Categorize: true positive, false positive, benign positive
- Target: >70% true positive rate before adding new rules
2. **Tiered response model**:
- Tier 1 (L1): Triage, enrichment, initial containment
- Tier 2 (L2): Investigation, deeper analysis, escalation
- Tier 3 (L3): Threat hunting, detection engineering, purple team
3. **Automation first**:
- Automate enrichment before human sees alert
- Automate containment for high-confidence indicators
- Automate closure documentation
4. **Custom detection rules**:
- Start with 3-5 high-value custom KQL rules based on your environment
- Example: "Detect login from impossible travel + sensitive file download"
- Validate with purple team exercise
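The alert quality audit in step 1 is a counting exercise against the >70% target. A minimal sketch with made-up disposition counts, using the three categories listed above:

```python
# Audit the last 30 days of alert dispositions against the >70% TP target.
# The disposition counts are made-up illustration data.
from collections import Counter

dispositions = Counter(true_positive=150, false_positive=40, benign_positive=10)

total = sum(dispositions.values())
tp_rate = dispositions["true_positive"] / total
print(f"True positive rate: {tp_rate:.0%}")  # 75%
if tp_rate <= 0.70:
    print("Tune or retire noisy rules before adding new detections.")
else:
    print("Quality bar met: safe to add new custom rules.")
```

Running this audit before every detection engineering sprint keeps the rule set honest: new rules are earned by cleaning up old ones.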
---
## Talking Points for the Head of Security
**When they say**: *"We have all these tools but I still do not feel in control."*
**You respond**:
> *"That is because tools do not create control. Operating rhythm creates control. You have a Ferrari but no one taught your team to drive it. I help you build the weekly cadence, the tiered response, the hunt discipline, and the purple team culture that turns telemetry into action. In 12 weeks, your team will not just own the tools. They will own the capability."*
**When they say**: *"My analysts are overwhelmed."*
**You respond**:
> *"Overwhelmed analysts are usually drowning in noise. We tune the alerts, automate the enrichment, and build a triage playbook so your Tier-1 analysts know exactly what to do with the 20 alerts they see each morning. The goal is not fewer alerts. It is more actionable alerts."*
**When they say**: *"We cannot afford a 24/7 SOC."*
**You respond**:
> *"Most organizations do not need a 24/7 SOC. They need a team that can detect, contain, and recover during business hours—and automated response for the hours they are not watching. We design for your reality, not for a Gartner ideal."*
**When they say**: *"We have never done threat hunting."*
**You respond**:
> *"Perfect. We start with one guided hunt. A 4-hour session with a hypothesis, a search, and a finding. Most teams discover something they did not know within the first two hours. Hunting is not magic. It is structured curiosity. We teach the structure."*
**When they say**: *"Our red team and blue team do not talk."*
**You respond**:
> *"That is the norm, and it is destructive. Red team thinks blue team is incompetent. Blue team thinks red team is reckless. Purple team fixes both: red team teaches technique; blue team learns to detect; both improve. We run your first purple team exercise in Week 7. It is usually the most productive security meeting the organization has had all year."*
**When they say**: *"Our outsourced SOC underperforms."*
**You respond**:
> *"Your MSSP is not failing you. You are failing to give them the context and custom detection rules they need to succeed in your environment. They run generic rules for 200 clients. Generic rules catch generic threats. Your adversaries are not generic. We do not fire the MSSP. We build a 2-person detection engineering cell inside your organization that writes custom rules for your environment, audits the MSSP's coverage quarterly, and makes your existing €600K SOC spend actually work. For the cost of one senior analyst, you transform insurance theater into actual protection."*
---
## Metrics That Prove Capability
| Before | After | What It Measures |
|--------|-------|-----------------|
| "We have 200 Sentinel alerts per day" | "We have 12 actionable alerts per day; 88% are true positives" | Alert quality |
| "Mean time to respond: 4 hours" | "Mean time to contain: 15 minutes for high-confidence alerts" | Response speed |
| "We have never hunted" | "We run one hunt per month; last hunt found 3 dormant accounts" | Proactive defence |
| "Secure Score is 42 and falling" | "Secure Score is 72 and rising; remediation SLA is 90%" | Exposure management |
| "Red team findings sit in a PDF" | "Red team findings become detection rules within 2 weeks" | Closed-loop improvement |
| "Analyst turnover is high" | "Analysts report higher satisfaction; they feel effective" | Team health |
---
## Integration With Modular Engagements
This module naturally connects to technical hardening and validation:
```
Module 3 (M365 Security Hardening) or Module 6 (On-Premise AD Hardening)
↓ Tools deployed but underutilized
Module 12 (Blue/Purple Team Foundation)
↓ Team learns to operationalize tools; builds sustainable capability
Module 10 (Red Team & Validation)
↓ Independent validation proves the capability works
```
It can also follow endpoint management:
```
Module 1 (Endpoint Management)
↓ Devices visible and compliant
Module 12 (Blue/Purple Team Foundation)
↓ EDR alerts now actionable; hunt on endpoint telemetry
```
---
*For the modular engagement menu, see [Modular Engagements](modular-engagements.md).*
*For embedded process assurance, see [Embedded Quality & Process Assurance](quality-management-engagement.md).*
*For organizational structure transformation, see [Organizational Resilience](organizational-resilience.md).*


@@ -0,0 +1,294 @@
# C-Suite Conversation Guide
> *"You are not selling security. You are selling survival, speed, and strategic optionality."*
This guide prepares consultants for conversations with CEOs, CFOs, COOs, board members, and divisional presidents. It translates every technical control into a business decision and provides scripts, objection handling, and psychological framing tested in regulated, high-stakes environments.
---
## The Golden Rule of Executive Communication
**Never lead with technology. Always lead with consequence.**
| Bad Opening | Good Opening |
|------------|-------------|
| "We need to deploy ASR rules and enable PIM." | "There are currently 12 administrator accounts that, if compromised, would allow an attacker to delete our entire digital operation in under an hour. We can eliminate that exposure in two weeks with tools you already own." |
| "We should implement local AI inference." | "Every strategic document your teams paste into ChatGPT is training data for a model that will eventually be sold to your competitors. We can stop that leakage this quarter for less than the cost of one mid-level hire." |
| "Your CIS Controls gap is significant." | "Regulators now treat cybersecurity gaps as governance failures. The board's personal liability exposure under NIS2 and DORA begins the day an incident is proven preventable." |
---
## Know Your Audience
### The CEO
**Primary concern**: Reputation, competitive position, speed of execution.
**Frame**: This is not an IT project. It is a **strategic repositioning** that makes the organization faster, more independent, and harder to replicate.
**Key messages**:
- "Your competitors are building on cloud AI. You are feeding it. Reversing that creates a moat."
- "We can demonstrate measurable risk reduction in 30 days. Most consultants need 90 days to produce a report."
- "This positions you as the strategic defender of the company's future, not just its perimeter."
**What to avoid**: Technical jargon, long timelines, requests for blanket budget approval.
**The ask**: Executive sponsorship, authority to make disruptive changes in the first 30 days, and a weekly 30-minute steering committee slot.
---
### The CFO
**Primary concern**: Cost, ROI, predictability of spend, regulatory liability.
**Frame**: This is **the highest-return risk reduction available** because it leverages existing investments before requesting new ones.
**Key messages**:
- "We start with configuration, not procurement. Most of the value comes from turning on what you have already paid for."
- "Cloud AI pricing is linear and unpredictable. Local AI is a fixed capital expense with zero per-query leakage risk."
- "NIS2 fines reach up to 2% of global turnover, and both NIS2 and DORA expose board members to personal liability. The cost of this program is a fraction of one regulatory penalty."
- "We will produce a before-and-after risk quantification in 60 days. You will see the financial equivalent of what we have fixed."
**What to avoid**: Vague security promises, unlimited budgets, multi-year commitments without phase gates.
**The ask**: Approval for a 90-day pilot with a hard stop for financial review before any significant capital expenditure.
---
### The COO / Operations Director
**Primary concern**: Uptime, operational disruption, supply chain stability, workforce impact.
**Frame**: This reduces operational fragility and ensures the organization can continue functioning even when primary systems or vendors fail.
**Key messages**:
- "We are not adding complexity. We are removing hidden dependencies that currently threaten continuity."
- "If your primary cloud AI provider raises prices 500% tomorrow, what happens to the workflows built on it? We eliminate that single point of failure."
- "In the first 30 days, we will test your ability to recover one critical system from backup. Most organizations discover they cannot. We fix that before it matters."
- "OT and IT separation is not bureaucracy. It is what keeps a malware infection in accounting from reaching the control room."
**What to avoid**: Technical depth on endpoint policies, abstract risk discussions without operational context.
**The ask**: Authority to run a controlled recovery drill, permission to temporarily disable unused accounts and access paths, and operations team participation in the 30-day sprint.
---
### The Board / Audit Committee
**Primary concern**: Governance, liability, regulatory compliance, shareholder value.
**Frame**: This is **governance enhancement** with evidence-based risk reduction and full regulatory alignment.
**Key messages**:
- "The board's duty of care now explicitly includes cybersecurity under NIS2 and DORA. This program produces the evidence that duty is being met."
- "We classify intelligence as a Tier 0 asset—the same category as domain controllers and root certificate authorities. That classification elevates the conversation from IT to strategic asset protection."
- "Our 180-day roadmap maps directly to CIS Controls, NIST CSF, and DORA requirements. At each phase gate, we produce auditable evidence."
- "We conduct quarterly antifragility assessments that trend the organization's resilience over time. The board receives a single-page dashboard."
**What to avoid**: Operational detail, tool-specific discussions, anything that sounds like IT outsourcing.
**The ask**: Board-level endorsement of the antifragile mandate, quarterly 15-minute briefings, and support for the executive sponsor's authority.
---
## The Seven Strategic Arguments
### 1. The Competitive Moat Argument
**The Frame**: Your data is your only sustainable advantage. Giving it to cloud AI providers is arming your competitors.
**The Script**:
> *"When your engineering team sends proprietary code to a cloud AI for review, that code improves a model that will eventually be sold to your competitors. When your strategy team asks an AI to analyze market positioning, that reasoning becomes training signal for a general model. You are not using AI. You are contributing to a public good that erodes your private advantage. Local AI closes that loop. Your data improves only your model. That is a moat no competitor can cross."*
**Who it moves**: CEOs, CTOs, heads of strategy, product leaders.
---
### 2. The Regulatory License Argument
**The Frame**: Compliance is no longer about paperwork. It is about demonstrable resilience. Regulators are now empowered to fine boards personally.
**The Script**:
> *"DORA, NIS2, and national critical infrastructure laws have changed the game. A preventable incident is now a governance failure, not a technical one. The board's personal liability is on the line. Our program does not produce policies. It produces evidence: recovery drills, chaos experiments, tested backups, and vendor exit architectures. When the regulator asks, you show them proof—not promises."*
**Who it moves**: Board members, general counsel, chief risk officers, compliance heads.
---
### 3. The Insurance Policy Argument
**The Frame**: This is not an upgrade. It is an insurance policy against the obsolescence of your own company.
**The Script**:
> *"Think of local AI as a vault. Yes, it costs something to build. But if your company's intelligence were physical cash, would you store it in a public bank that charges a training fee on every deposit and reserves the right to change the currency overnight? Or would you keep it in your own vault, where you control the security, the access, and the value? We are building the vault."*
**Who it moves**: CFOs, risk committees, conservative boards.
---
### 4. The Speed Argument
**The Frame**: The organizations that survive are not the most protected. They are the fastest to adapt.
**The Script**:
> *"Your industry is being disrupted by companies that can reorient in weeks while their competitors need quarters. Antifragility is not about preventing change. It is about engineering systems that improve when change happens. Every incident becomes a lesson. Every vendor failure becomes an opportunity to switch. Every regulatory demand becomes a competitive differentiator. We make you the company that moves faster than the disruption."*
**Who it moves**: CEOs, COOs, digital transformation leaders.
---
### 5. The Cost-of-Inaction Argument
**The Frame**: The price of doing nothing is no longer hypothetical. It is quantifiable and catastrophic.
**The Script**:
> *"The average ransomware recovery cost in Europe is now €4.5 million. That does not include regulatory fines, customer churn, or litigation. A single NIS2 fine can reach 2% of global turnover. One compromised cloud AI workflow can leak your entire product roadmap. The question is not whether you can afford this program. The question is whether you can afford to discover your vulnerabilities the way most companies do: at 3 AM, during an active breach, with no recovery plan."*
**Who it moves**: CFOs, boards, risk committees.
---
### 6. The Talent Argument
**The Frame**: The best security and engineering talent wants to work for organizations that take resilience seriously.
**The Script**:
> *"Engineers and security professionals have choices. They want to work where their work matters, where systems are designed intelligently, and where they are not fighting fires caused by decades of neglect. An antifragile posture is a recruiting advantage. It signals that this organization respects craft, invests in durability, and operates at a strategic level—not a reactive one."*
**Who it moves**: CHROs, CTOs, CEOs in competitive labor markets.
---
### 7. The Professional Responsibility Argument
**The Frame**: As advisors, we cannot in good conscience recommend that you outsource your strategic intelligence to unauditable third parties.
**The Script**:
> *"I am not a reseller. I am an independent architect. My fiduciary responsibility is to your organization's survival. I cannot recommend that you continue sending proprietary strategy to a black box you cannot audit, that is actively incentivized to commoditize your data, and that can change its terms overnight. That is not technology adoption. That is strategic self-harm. My recommendation is to own your intelligence. I will show you exactly how."*
**Who it moves**: CEOs, boards, anyone who has been burned by vendor lock-in before.
---
## Objection Handling for the C-Suite
| Objection | Response | Follow-Up |
|-----------|----------|-----------|
| "We already have a security team." | "This does not replace them. It accelerates them. Most internal teams are underwater with incidents. We provide focus, methodology, and executive air cover for 180 days." | "Let us meet your CISO and identify the one project they have been trying to get approved for six months. We will deliver it in 30 days." |
| "Our auditors just signed off." | "Auditors verify that controls exist. We verify that they work under stress. Compliance is the floor. Resilience is the ceiling." | "When was your last live recovery drill? When did you last test a vendor exit?" |
| "This sounds expensive." | "The first 30 days are primarily configuration of existing tools. We extract value you have already paid for before recommending any purchase." | "Let us run a 30-day sprint. If you do not see measurable risk reduction, we stop." |
| "We are in the middle of a cloud migration." | "Perfect. Security should be architected in, not bolted on. We embed antifragile principles into the migration so you do not recreate the same dependencies in the cloud." | "Let us review your cloud architecture for hidden single points of failure." |
| "Our industry is different." | "The principles are universal. The implementation is tailored. We have specific playbooks for telco, power, and banking—regulatory alignment included." | "Which regulation keeps you awake at night? DORA? NIS2? SWIFT CSP? We map directly to all of them." |
| "We tried a security program before and it failed." | "Most programs fail because they are indefinite, untethered from business outcomes, and measured in compliance checkboxes. Ours is 180 days, phase-gated, and measured in risk reduction." | "What failed last time? Timeline? Budget? Executive support? We design specifically to avoid those failure modes." |
| "The board will never approve this." | "The board will approve evidence. We produce a one-page risk dashboard in 30 days. That dashboard is your approval mechanism." | "Let us schedule a 20-minute briefing. I will show you what other boards have seen—and approved." |
---
## The 20-Minute Board Briefing Structure
When you get 20 minutes with the board, use this structure:
**Minutes 0-3: The Threat**
- One sentence: "Your proprietary intelligence is currently training your competitors."
- One statistic: "The average ransomware recovery is €4.5M, and that does not include regulatory fines."
- One story: A comparable organization that suffered a preventable failure.
**Minutes 3-8: The Alternative**
- Introduce antifragility: "Systems that grow stronger from disruption."
- The five pillars in business language (see table above).
- AI sovereignty as the strategic differentiator.
**Minutes 8-13: The Program**
- 180 days, four phases, measurable outcomes.
- Existing tools first, purchases only if justified.
- Regulatory alignment: DORA, NIS2, CIS, NIST.
- **Modularity**: "We do not require a 180-day commitment upfront. We offer specific, bounded modules. You choose the one that solves your most urgent pain. If it works, we add the next one."
**Minutes 13-17: The Evidence**
- Week 1: Kill chain identified.
- Week 4: First recovery drill completed.
- Week 12: Local AI pilot operational.
- Week 24: Board dashboard with trending resilience metrics.
**Minutes 17-20: The Ask**
- Executive sponsor with authority.
- Weekly 30-minute steering committee.
- Tolerance for temporary disruption in days 1-30.
- Phase-gated budget: approve one module at a time.
**Leave behind**: The [Executive Summary](executive-summary.md) printed on one page, plus the [Modular Engagements](modular-engagements.md) module menu.
---
## The One-Page Dashboard (Month 1)
After the first month, produce a single-page dashboard for the executive sponsor and board:
```
ANTIFRAGILE DASHBOARD — [Client Name] — Month 1
RISK REDUCTION
├─ Critical identities secured: [X] of [Y] (target: 100%)
├─ Public-facing assets mapped: [X] of [Y]
├─ T0 assets identified: [X]
├─ Mean time to recover (tested): [X] hours (target: < 4)
└─ Vendor dependencies without exit plan: [X] (target: 0)
REGULATORY EVIDENCE
├─ CIS IG1 safeguards implemented: [X] of 56
├─ Recovery drill completed: [Yes / No]
├─ Incident response runbook tested: [Yes / No]
└─ AI sovereignty pilot operational: [Yes / No]
INVESTMENT
├─ New tooling purchased: €0 (Month 1)
├─ Existing tools activated: [X] capabilities
└─ Next phase budget required: €[X] (if any)
TOP 3 RISKS REMAINING
1. [Risk] — Mitigation timeline: [Date]
2. [Risk] — Mitigation timeline: [Date]
3. [Risk] — Mitigation timeline: [Date]
RECOMMENDATION: [Proceed to Month 2 / Pause and remediate / Escalate]
```
---
## Psychological Framing
### Loss Aversion
Executives feel losses more acutely than equivalent gains. Frame inaction as a loss:
> *"Every day you continue sending proprietary data to cloud AI, you are transferring intellectual capital to entities that will eventually compete with you. That is not a future risk. That is a current hemorrhage."*
### Social Proof
Use comparable organizations (anonymized if necessary):
> *"The power utility we worked with last quarter discovered they could not recover their Active Directory from backup. Their €50,000 program fixed that in 14 days. The test alone was worth the engagement."*
### Authority and Independence
Differentiate from vendor-aligned consultants:
> *"I do not represent Microsoft, AWS, or any AI provider. My only incentive is your resilience. If I recommend a purchase, it is because the gap genuinely requires it—not because I have a quota."*
### Urgency Without Panic
Create bounded urgency:
> *"We do not need to fix everything this quarter. We need to fix the kill chain this month. The rest can wait. But the kill chain cannot."*
---
*For the financial justification, see [Business Case Template](../playbooks/business-case-template.md).*
*For the strategic foundation, see [The Antifragile Manifest](antifragile-manifest.md).*


@@ -0,0 +1,75 @@
# Executive Summary: The Antifragile Enterprise
> *For the Board, the CEO, and the Executive Committee. One page. Five minutes. A decision that determines whether the organization survives its next disruption.*
---
## The Problem in One Sentence
Your organization is currently engaged in a **massive, unpaid research project for its competitors**—sending proprietary data, strategic reasoning, and operational intelligence to cloud platforms that are incentivized to commoditize your industry.
## What Is at Stake
| Asset Category | Current Risk | If Compromised or Extracted |
|---------------|-------------|----------------------------|
| Strategic intelligence | Rented from cloud AI providers | Competitors replicate your edge; your strategy becomes public model training data |
| Customer trust | Protected by compliance theater | Regulatory fines, class-action liability, irreversible reputational damage |
| Operational continuity | Dependent on vendor stability | Single API change or geopolitical event halts revenue-critical workflows |
| Technical talent | Wasted on maintenance of fragile systems | Burnout, attrition, inability to attract security-conscious engineers |
| Regulatory license | Assumed, not proven | DORA, NIS2, PSD2, and national regulators now demand demonstrable resilience—not paperwork |
## The Antifragile Alternative
An antifragile organization does not merely survive shocks. It **grows stronger from them**. Every incident produces structural improvement. Every competitor's failure creates market opportunity. Every regulatory demand is met with evidence, not promises.
### The Five Pillars (Business Translation)
| Pillar | What the Board Hears |
|--------|---------------------|
| **Structural Decoupling** | "We will never again be held hostage by a single vendor's pricing, terms, or existence." |
| **Optionality Preservation** | "We maintain the right to change direction in 90 days, not 9 months." |
| **Stress-to-Signal Conversion** | "Every failure makes us smarter and structurally stronger." |
| **Sovereign Intelligence** | "Our proprietary data improves our own models, not our competitors'." |
| **Asymmetric Payoff Design** | "Small, focused investments protect us against existential risks." |
## The Strategic Mandate: AI Sovereignty
The current AI paradigm is **extractive**. Every prompt sent to a cloud AI teaches that system how to replace you. By running artificial intelligence on infrastructure you control, you:
- **Protect your intellectual property** from becoming public training data
- **Ensure operational continuity** regardless of vendor decisions, geopolitics, or API changes
- **Reduce long-term costs** from unpredictable per-token pricing to fixed infrastructure
- **Demonstrate regulatory maturity** to auditors who increasingly scrutinize data residency and third-party risk
> *"If our company's intelligence were a physical pile of cash, would we store it in a public bank that takes a 'training fee' off every dollar and reserves the right to change the currency? Or would we keep it in our own vault?"*
Local AI is the vault.
## The 180-Day Commitment
We do not propose a three-year transformation. We propose **four phases, 180 days, measurable outcomes**:
| Phase | Timeline | Business Outcome |
|-------|----------|-----------------|
| **Hygiene** | Days 0-30 | Visibility. We see every identity, every asset, every gap that could end the company. |
| **Control** | Days 30-60 | Containment. We close the highest-risk exposure with existing tools—no new procurement. |
| **Sovereignty** | Days 60-90 | Ownership. We reclaim proprietary intelligence and validate that we can recover from disaster. |
| **Antifragility** | Days 90-180 | Advantage. We convert disruption into learning, and learning into market position. |
## The Investment Framing
This is not a cost centre. It is **optionality insurance**.
- **Cost of the program**: Primarily configuration and process—existing tools are leveraged first.
- **Cost of inaction**: A single ransomware incident averages €4.5M in recovery. A single regulatory fine under NIS2 can reach 2% of global turnover. A single competitor trained on your data renders your proprietary advantage worthless.
- **ROI timeline**: Risk reduction is visible in 30 days. Regulatory evidence is demonstrable in 90 days. Competitive advantage from sovereign intelligence compounds over 12-24 months.
## The Decision Required
We need **one executive sponsor with authority**, **one steering committee meeting per week**, and **tolerance for temporary disruption** in the first 30 days. The alternative is to continue operating with unseen dependencies, unmapped risks, and an intelligence strategy that enriches competitors.
---
*For the detailed strategic argument, see [The Antifragile Manifest](antifragile-manifest.md).*
*For the board conversation guide, see [C-Suite Conversation Guide](c-suite-conversation-guide.md).*
*For financial justification, see [Business Case Template](../playbooks/business-case-template.md).*


@@ -0,0 +1,569 @@
# Modular Engagement Architecture
> *"Not every client is ready for the full journey. Some need to solve one burning problem first. The antifragile approach is architected so that every module stands alone—and every module makes the next one easier."*
This document defines the antifragile consulting portfolio as a **menu of independent, self-contained modules**. Clients can purchase any module without committing to the full 180-day program. Each module delivers measurable value, produces transferable assets, and creates natural appetite for the next phase.
---
## The Philosophy: Progressive Resilience
We do not sell monolithic transformation projects. We sell **building blocks** that stack.
| Approach | Traditional Consulting | Antifragile Modular |
|----------|----------------------|---------------------|
| Sales motion | Sell a 12-month program or nothing | Sell a 30-day module; expand based on proven value |
| Client commitment | All-in or walk away | Start where the pain is highest |
| Risk to client | High (unknown ROI until month 6+) | Low (measurable value in 30 days) |
| Risk to consultant | High (scope creep, payment delays) | Low (bounded scope, phase-gated payment) |
| Political capital | Consumed defending the program | Generated by visible early wins |
**The rule**: Every module must be **sellable on its own**, **deliverable in 90 days or less**, and **must produce evidence that the next module is warranted**.
---
## The Module Menu
### Module 1: Endpoint Management Foundation
**The Entry Vector. The Most Common Starting Point.**
| Attribute | Detail |
|-----------|--------|
| **Typical duration** | 30-45 days |
| **Typical investment** | Low (labor only; Intune included in E3) |
| **Prerequisites** | M365 E3 or higher; Azure AD tenant |
| **Standalone value** | Full device visibility; compliance enforcement; remote management capability |
| **Typical client** | Remote-first organization; SCCM retirement; compliance-driven; Intune shelfware |
**What is delivered**:
- Device inventory and enrollment campaign (Windows, macOS, iOS, Android)
- Compliance baseline: encryption, OS version, password policy, firewall
- Application inventory and shadow IT discovery
- Basic conditional access integration (compliant device required for M365 access)
- Admin training and operational handover
**Executive pitch**:
> *"Your devices are in home offices, airports, and coffee shops. In 30 days, we will know exactly what you have, whether it is secure, and how to fix what is not. This is not surveillance. It is ensuring that only healthy devices access your data—wherever they are."*
**Natural next modules**: Module 2 (Identity Security), Module 5 (AI Sovereignty Bridge), Module 6 (On-Premise AD)
**See**: [Endpoint Management Entry Vector](../playbooks/endpoint-management-entry-vector.md)
---
### Module 2: M365 Identity Security
**The Foundation of Everything. The Most Undervalued Module.**
| Attribute | Detail |
|-----------|--------|
| **Typical duration** | 30-60 days |
| **Typical investment** | Low to medium (labor; E5/P2 licensing upgrade may be recommended selectively) |
| **Prerequisites** | M365 tenant (E3 minimum); administrative access |
| **Standalone value** | Elimination of standing privileged access; MFA enforcement; legacy auth blocked; guest access governed |
| **Typical client** | Post-breach hardening; auditor findings; rapid growth with identity debt; privileged account compromise |
**What is delivered**:
- Full identity census: human accounts, service accounts, guests, enterprise apps
- MFA enforcement for 100% of users (conditional access policies, included with E3 via Entra ID P1; risk-based policies require E5/P2)
- Legacy authentication blocked tenant-wide
- Privileged access workstation (PAW) architecture for admins
- PIM deployment (if E5/Entra ID P2) or manual JIT process (if E3)
- Guest access audit and time-bounding
- OAuth consent governance
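The legacy-authentication block above usually reduces to a single conditional access policy. A minimal sketch of the policy body one might POST to Microsoft Graph's `/identity/conditionalAccess/policies` endpoint, assuming the public `conditionalAccessPolicy` schema; the break-glass account ID is a placeholder, and the policy starts in report-only mode:

```python
# Sketch of a conditional access policy that blocks legacy authentication
# (IMAP, POP, SMTP AUTH, Exchange ActiveSync). Field names follow the Microsoft
# Graph conditionalAccessPolicy schema; start in report-only mode and review
# sign-in logs before flipping state to "enabled".

def legacy_auth_block_policy(excluded_break_glass_ids):
    """Build the policy body; excluded_break_glass_ids are emergency accounts."""
    return {
        "displayName": "BLOCK - Legacy authentication (all users)",
        "state": "enabledForReportingButNotEnforced",  # flip to "enabled" after review
        "conditions": {
            "clientAppTypes": ["exchangeActiveSync", "other"],  # "other" = legacy auth
            "users": {
                "includeUsers": ["All"],
                "excludeUsers": excluded_break_glass_ids,
            },
            "applications": {"includeApplications": ["All"]},
        },
        "grantControls": {"operator": "OR", "builtInControls": ["block"]},
    }

policy = legacy_auth_block_policy(["<break-glass-account-object-id>"])
```

Excluding at least one break-glass account is deliberate: a tenant-wide block with no exclusions can lock out the very admins who must fix it.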
**Executive pitch**:
> *"There are currently [X] administrator accounts in your tenant. If any one of them is compromised, an attacker owns your email, your documents, and your identity system. In 30 days, we reduce that to the minimum viable number, enforce multi-factor authentication, and ensure no admin ever logs in from a workstation with email and browsing."*
**Natural next modules**: Module 3 (M365 Security Hardening), Module 6 (On-Premise AD), Module 7 (Recovery & Resilience)
---
### Module 3: M365 Security Hardening
**The E3 Maximization Play. Configuration, Not Procurement.**
| Attribute | Detail |
|-----------|--------|
| **Typical duration** | 30-60 days |
| **Typical investment** | Low (primarily labor; no new licensing required for E3 clients) |
| **Prerequisites** | M365 tenant; Module 2 (Identity Security) strongly recommended first |
| **Standalone value** | EOP tuned to maximum aggression; audit logging operational; Secure Score trending upward; ASR rules (if E5) |
| **Typical client** | E3 clients with untapped security potential; post-M365-deployment hardening; Secure Score below 50 |
**What is delivered**:
- Exchange Online Protection tuning: anti-phishing, anti-malware, anti-spam
- Mailbox auditing enabled for all users
- Unified Audit Log enabled and forwarded to SIEM
- Microsoft Secure Score baseline and improvement plan
- ASR rule deployment in audit mode (E5) or Defender Antivirus maximization (E3)
- Windows Defender Firewall and exploit protection baseline
- LAPS deployment for local admin password randomization
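The audit-mode ASR rollout can be sketched as a staged rule-to-mode plan: every rule starts in audit, and is promoted to block only after a cycle with zero business-impact events. Rule names below follow Microsoft's ASR reference, but look up the current rule GUIDs there before deploying; mode values follow the `Set-MpPreference` convention:

```python
# Sketch of a staged ASR rollout: each rule starts in audit mode (2) and is
# promoted to block mode (1) only after a review cycle with no observed
# business-impact events. Mode values per Set-MpPreference:
# 0 = disabled, 1 = block, 2 = audit, 6 = warn.
AUDIT, BLOCK = 2, 1

asr_rules = {  # names per Microsoft's ASR reference; map to GUIDs from that doc
    "Block Office applications from creating child processes": AUDIT,
    "Block credential stealing from LSASS": AUDIT,
    "Block execution of potentially obfuscated scripts": AUDIT,
}

def promote(rules, impact_events):
    """Move audit-mode rules with zero observed impact events to block mode."""
    return {
        name: (BLOCK if mode == AUDIT and impact_events.get(name, 0) == 0 else mode)
        for name, mode in rules.items()
    }

# One rule generated noise during audit; it stays in audit for another cycle.
after_30_days = promote(asr_rules, {"Block execution of potentially obfuscated scripts": 4})
```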
**Executive pitch**:
> *"You own E3, which includes enterprise-grade antivirus, email filtering, and audit logging. Most organizations use less than 30% of these capabilities because no one configured them. We turn every available security control to maximum—and prove the improvement with before-and-after metrics. No new software. Just expertise applied to what you already paid for."*
**Natural next modules**: Module 4 (Data Governance), Module 5 (AI Sovereignty Bridge), Module 10 (Red Team & Validation)
**See**: [M365 E3 Hardening](../playbooks/m365-e3-hardening.md), [Zero-Budget Hardening](../playbooks/zero-budget-hardening.md)
---
### Module 4: Data Governance & Compliance
**The Regulatory Survival Module.**
| Attribute | Detail |
|-----------|--------|
| **Typical duration** | 45-90 days |
| **Typical investment** | Medium (labor; Purview licensing may be required for advanced features) |
| **Prerequisites** | M365 tenant; Module 3 (Security Hardening) recommended |
| **Standalone value** | Data classification deployed; retention policies enforced; DLP active; eDiscovery ready; regulatory evidence produced |
| **Typical client** | Regulated industries (banking, healthcare, critical infrastructure); litigation hold requirements; GDPR/DORA/NIS2 compliance |
**What is delivered**:
- Sensitivity label deployment (Public, Internal, Confidential, Highly Confidential)
- Retention policies for all M365 workloads (email, Teams, SharePoint, OneDrive)
- Data Loss Prevention (DLP) policies for high-sensitivity data types
- External sharing lockdown and per-site governance
- eDiscovery readiness: legal hold procedures, retention hold capability
- Teams governance: controlled creation, expiration, access reviews
- SharePoint site provisioning governance
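The sensitivity-label deployment rests on classification logic like the sketch below: map detected patterns to the four-tier taxonomy and always prefer the more restrictive label. Real deployments use Purview's built-in sensitive information types; the regexes here are deliberately simplified illustrations, not production detectors:

```python
import re

# Sketch of auto-labeling logic: match simplified sensitive-data patterns and
# return the most restrictive matching label from the four-tier taxonomy.
# Purview's built-in sensitive information types replace these toy regexes.
PATTERNS = {
    "Highly Confidential": re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),  # card-like number
    "Confidential": re.compile(r"\b(salary|contract value|M&A)\b", re.IGNORECASE),
}

def suggest_label(text):
    """Return the most restrictive label whose pattern matches, else 'Internal'."""
    for label in ("Highly Confidential", "Confidential"):
        if PATTERNS[label].search(text):
            return label
    return "Internal"
```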
**Executive pitch**:
> *"Your auditor does not want to see a policy document. They want to see evidence that sensitive data is classified, that emails are retained according to regulation, and that you can produce documents for legal hold within 48 hours. We build the evidence—not the theater."*
**Natural next modules**: Module 5 (AI Sovereignty Bridge), Module 7 (Recovery & Resilience), Module 10 (Red Team & Validation)
---
### Module 5: AI Sovereignty Bridge
**The Strategic Differentiator. The Conversation Starter.**
| Attribute | Detail |
|-----------|--------|
| **Typical duration** | 30-60 days |
| **Typical investment** | Low to medium (labor; Azure OpenAI consumption; optional local inference hardware) |
| **Prerequisites** | M365 tenant; Azure subscription; data governance baseline strongly recommended |
| **Standalone value** | Shadow AI eliminated; sanctioned Azure OpenAI deployed; proprietary data protected; first custom model or RAG pipeline operational |
| **Typical client** | Organizations using ChatGPT/Claude/Gemini without governance; leadership asking "what is our AI strategy?"; competitors investing in AI |
**What is delivered**:
- Shadow AI usage inventory (proxy logs, endpoint scans, surveys)
- Azure OpenAI Service deployment with private endpoints and customer-managed keys
- Conditional access policies restricting AI access to approved users and devices
- Azure AI Foundry pilot: one RAG pipeline or fine-tuned model on proprietary data
- AI governance policy: approved use cases, prohibited data types, human-in-the-loop requirements
- User education: why sanctioned AI is safer and often better than public alternatives
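The shadow AI inventory pass over proxy logs can be sketched as a simple domain filter: count hits against known public AI endpoints and record which users generated them. The domain list and log format below are illustrative assumptions; extend the list from your proxy vendor's category feed:

```python
from collections import Counter
from urllib.parse import urlparse

# Sketch of the shadow AI inventory: count proxy-log hits against known public
# AI endpoints and track which users generated them. Domain list is illustrative.
AI_DOMAINS = {"chat.openai.com", "chatgpt.com", "claude.ai", "gemini.google.com"}

def shadow_ai_report(log_lines):
    """Each line: '<user> <url>'. Returns (hit counts, users per domain)."""
    hits = Counter()
    users_per_domain = {}
    for line in log_lines:
        user, url = line.split(maxsplit=1)
        host = urlparse(url).netloc.lower()
        if host in AI_DOMAINS:
            hits[host] += 1
            users_per_domain.setdefault(host, set()).add(user)
    return hits, users_per_domain

logs = [
    "alice https://chat.openai.com/c/123",
    "bob https://claude.ai/chat",
    "alice https://intranet.example.com/home",
    "carol https://chat.openai.com/c/456",
]
hits, users = shadow_ai_report(logs)
```

The output is the opening slide of the governance conversation: named domains, hit volume, and headcount, not abstract risk.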
**Executive pitch**:
> *"Your teams are already using AI—through personal accounts, browser tabs, and mobile apps. Every proprietary document they paste into ChatGPT trains a model that will eventually be sold to your competitors. We stop that leakage in two weeks by giving them a better, safer alternative. Then we build your first custom AI asset on data that never leaves your Azure region."*
**Natural next modules**: Module 9 (Organizational Resilience), Module 4 (Data Governance), Module 10 (Red Team & Validation)
**See**: [Azure OpenAI Sovereignty Bridge](azure-openai-sovereignty-bridge.md), [AI Sovereignty Framework](ai-sovereignty-framework.md)
---
### Module 6: On-Premise AD & Endpoint Hardening
**The Legacy Debt Cleanup. For Organizations with Feet in Both Worlds.**
| Attribute | Detail |
|-----------|--------|
| **Typical duration** | 45-60 days |
| **Typical investment** | Medium (labor; Sysmon/Wazuh deployment; possible hardware for PAWs) |
| **Prerequisites** | On-premise Active Directory; administrative access to domain controllers |
| **Standalone value** | KRBTGT rotated; LAPS deployed; Sysmon operational; privileged access tiered; Azure AD Connect secured |
| **Typical client** | Hybrid identity environments; SCCM/AD shops; post-Active-Directory-compromise recovery; NIS2-critical infrastructure |
**What is delivered**:
- Full AD identity census with orphan and privilege analysis
- KRBTGT password rotation (if > 180 days stale)
- LAPS deployment to all domain-joined workstations
- Sysmon deployment with SwiftOnSecurity configuration
- Privileged Access Workstation (PAW) architecture for Tier 0 admins
- Azure AD Connect hardening and audit
- AD FS security review (if present)
- Windows Defender maximization and firewall hardening
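The KRBTGT staleness check above is simple arithmetic on the account's `pwdLastSet` attribute; in production that value comes from an LDAP query against a domain controller, and rotation is done twice with replication in between. A minimal sketch, assuming a 180-day threshold:

```python
from datetime import datetime, timedelta, timezone

# Sketch of the KRBTGT staleness check: flag the account for double rotation
# (with replication in between) when its password is older than the threshold.
# In production, pwd_last_set comes from the pwdLastSet attribute via LDAP.
THRESHOLD = timedelta(days=180)

def krbtgt_needs_rotation(pwd_last_set, now=None):
    now = now or datetime.now(timezone.utc)
    return (now - pwd_last_set) > THRESHOLD

stale = krbtgt_needs_rotation(
    pwd_last_set=datetime(2020, 1, 1, tzinfo=timezone.utc),
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
fresh = krbtgt_needs_rotation(
    pwd_last_set=datetime(2023, 12, 15, tzinfo=timezone.utc),
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```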
**Executive pitch**:
> *"Your Active Directory has been running for fifteen years. It has accounts from employees who left a decade ago, service accounts with passwords that never expire, and administrator accounts that log in from the same laptops used for email and browsing. In 45 days, we clean the foundation—and make it significantly harder for an adversary to gain a foothold."*
**Natural next modules**: Module 2 (Identity Security), Module 7 (Recovery & Resilience), Module 8 (OT Security Assessment)
**See**: [AD and Endpoint Hardening](../playbooks/ad-endpoint-hardening.md)
---
### Module 7: Recovery & Resilience Validation
**The Insurance Policy. Prove You Can Rebuild Before You Need To.**
| Attribute | Detail |
|-----------|--------|
| **Typical duration** | 30-45 days |
| **Typical investment** | Low to medium (labor; third-party backup if not already owned) |
| **Prerequisites** | Backup solution in place (even if untested); administrative access to critical systems |
| **Standalone value** | One critical system recovered from backup; runbooks documented; CMDB seeded; quarterly drill cadence established |
| **Typical client** | Organizations that have never tested recovery; recent ransomware scare; DORA/NIS2 compliance preparation; board demanding evidence |
**What is delivered**:
- Backup coverage inventory: what is backed up, how often, where, by what mechanism
- Recovery drill: one critical system restored to isolated environment with full validation
- CMDB seeding: T0 and T1 assets documented with owners, dependencies, and recovery requirements
- Recovery runbooks: documented, tested, and transferable to non-designers
- Immutable backup validation: ensure backups cannot be deleted by compromised admin accounts
- Quarterly recovery drill calendar established
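The restore-validation step of the drill can be sketched as checksum comparison: a restore passes only when every manifest entry captured at backup time is present and byte-identical in the restored set. File paths and contents below are placeholders:

```python
import hashlib

# Sketch of restore validation: compare checksums of restored files against
# the manifest captured at backup time. An empty failure list means the drill
# passed; anything else is a finding to fix during business hours.
def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def validate_restore(manifest, restored_files):
    """manifest: {path: expected_hash}; restored_files: {path: bytes}."""
    failures = []
    for path, expected in manifest.items():
        blob = restored_files.get(path)
        if blob is None or sha256(blob) != expected:
            failures.append(path)
    return failures

manifest = {"db/orders.bak": sha256(b"order-data"), "etc/config.yml": sha256(b"cfg")}
restored = {"db/orders.bak": b"order-data", "etc/config.yml": b"CORRUPTED"}
failures = validate_restore(manifest, restored)
```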
**Executive pitch**:
> *"Most organizations discover they cannot recover from backup at 3 AM during an active ransomware incident. We discover it in a controlled test during business hours—when we can fix it without pressure. The question is not whether you have backups. The question is whether you have ever proven they work. We prove it."*
**Natural next modules**: Module 10 (Red Team & Validation), Module 8 (OT Security Assessment), Module 3 (M365 Security Hardening)
---
### Module 8: OT Security Assessment
**The Critical Infrastructure Module. For Power, Utilities, and Telco.**
| Attribute | Detail |
|-----------|--------|
| **Typical duration** | 45-90 days |
| **Typical investment** | Medium to high (labor; potential network hardware for segmentation) |
| **Prerequisites** | OT network access; cooperation from operations and engineering teams |
| **Standalone value** | IT/OT connection matrix; vendor access audit; manual override procedures validated; NIS2 evidence produced |
| **Typical client** | Power utilities; water/wastewater; telecommunications; manufacturing with SCADA/DCS |
**What is delivered**:
- OT asset inventory: SCADA, DCS, EMS, protection relays, RTUs, AMI
- IT-to-OT network connection mapping with business justification
- Vendor remote access audit and time-bounding
- Network segmentation plan: IT/OT DMZ, unidirectional gateway recommendations
- Manual override procedure documentation and validation
- NIS2/CER compliance evidence package
- Black start / islanding procedure test (power utilities)
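The IT-to-OT connection mapping reduces to a register comparison: every observed boundary-crossing flow must appear in the justified-connection register with a business justification, or it is flagged for elimination. Zone names, ports, and justifications below are illustrative assumptions:

```python
# Sketch of the IT/OT connection review: flows crossing the boundary must be
# in the justified register (with change control) or they are flagged.
# Flow tuples are (source_zone, dest_zone, port); all values illustrative.
JUSTIFIED = {
    ("it-dmz", "ot-dmz", 443),    # historian replication, change-controlled
    ("ot-dmz", "ot-scada", 102),  # engineering access, vendor-audited
}

def unjustified_flows(observed):
    return [flow for flow in observed if flow not in JUSTIFIED]

observed = [
    ("it-dmz", "ot-dmz", 443),
    ("it-corp", "ot-scada", 3389),  # RDP straight into SCADA: kill-chain material
]
flagged = unjustified_flows(observed)
```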
**Executive pitch**:
> *"Your control room does not need email. Your protection relays do not need internet access. Every connection between IT and OT is a bridge an adversary can cross. We map those bridges, justify the ones that must remain, and eliminate the ones that put physical safety at risk. This is not IT security. This is operational survival."*
**Natural next modules**: Module 6 (On-Premise AD), Module 7 (Recovery & Resilience), Module 10 (Red Team & Validation)
**See**: [Vertical: Power and Utilities](../reference/vertical-power-utilities.md), [Vertical: Telco](../reference/vertical-telco.md)
---
### Module 9: Organizational Resilience
**The People and Process Module. Fix the Structure, Not Just the Tools.**
| Attribute | Detail |
|-----------|--------|
| **Typical duration** | 60-90 days |
| **Typical investment** | Medium (labor; no tooling cost) |
| **Prerequisites** | Executive sponsor with authority; willingness to experiment with team structure |
| **Standalone value** | One product team with embedded security; shift-left pilot operational; shared metrics proving velocity and security can coexist |
| **Typical client** | Organizations with siloed Dev/Sec/Ops; slow release cycles blamed on security gates; talent retention problems |
**What is delivered**:
- Current-state Dev/Sec/Ops friction mapping
- Pilot team selection and embedded security engineer placement
- CI/CD security gate deployment (automated scanning, not manual review)
- Shared OKR definition: team owns vulnerability count, change failure rate, recovery time
- Platform team or SRE team architecture (if appropriate)
- Blameless post-mortem process with structural mandate
- 90-day metrics report: before-and-after velocity, defect rates, team satisfaction
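The shared OKRs the pilot team owns are computable from a plain deployment log; the sketch below shows change failure rate and mean recovery time, with a simplified record format assumed for illustration:

```python
# Sketch of the shared team metrics: change failure rate and mean recovery
# time from a deployment log. Each record: (deploy_id, failed, minutes_to_restore).
def team_metrics(deploys):
    failures = [d for d in deploys if d[1]]
    cfr = len(failures) / len(deploys)
    mttr = (sum(d[2] for d in failures) / len(failures)) if failures else 0.0
    return {"change_failure_rate": cfr, "mean_recovery_minutes": mttr}

metrics = team_metrics([
    ("d1", False, None),
    ("d2", True, 30),
    ("d3", False, None),
    ("d4", True, 90),
])
```

Because the whole team owns both numbers, shipping faster at the cost of breakage (or vice versa) shows up immediately in its own scorecard.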
**Executive pitch**:
> *"Your development team ships fast. Your security team says no. Your operations team keeps the lights on. None of them are wrong—but the organizational boundary between them destroys all three goals. We do not reorganize your departments on day one. We embed security into one product team, measure the results, and let the metrics make the case for broader change."*
**Natural next modules**: Module 2 (Identity Security), Module 5 (AI Sovereignty Bridge), Module 10 (Red Team & Validation)
**See**: [Organizational Resilience](organizational-resilience.md)
---
### Module 10: Red Team & Validation
**The Proof Module. Validate Everything You Have Built.**
| Attribute | Detail |
|-----------|--------|
| **Typical duration** | 15-30 days (engagement) + quarterly re-testing |
| **Typical investment** | Medium to high (external red team; internal coordination) |
| **Prerequisites** | At least one other module deployed; operational incident response capability |
| **Standalone value** | Independent validation of security posture; kill chain identification; board-ready evidence |
| **Typical client** | Regulated industries requiring annual penetration testing; post-transformation validation; boards demanding proof |
**What is delivered**:
- Scoping and rules of engagement (aligned to DORA TLPT or CIS requirements)
- Adversarial simulation: external reconnaissance, initial access, lateral movement, impact
- M365-specific attack paths: BEC, OAuth consent abuse, conditional access bypass attempts
- OT-bounded red team (for critical infrastructure clients)
- Report with kill chain analysis and prioritized remediation
- Board presentation: findings, risk quantification, and evidence of control effectiveness
- Quarterly purple team exercises (optional retainer)
**Executive pitch**:
> *"You have invested in security controls. But controls that have not been tested are assumptions, not facts. A red team exercise is a controlled failure that proves whether your defenses work before a real adversary tests them. The board receives independent evidence—not consultant promises."*
**Natural next modules**: Any module where gaps were identified; typically cycles back to hardening modules.
---
### Module 11: Embedded Quality & Process Assurance
**The Presence Module. For Leaders Who Feel They Are Not in Control.**
| Attribute | Detail |
|-----------|--------|
| **Typical duration** | 60-90 days (12 weeks embedded) |
| **Typical investment** | Medium (labor; no tooling cost) |
| **Prerequisites** | Executive sponsor; team willing to be observed; tolerance for process change |
| **Standalone value** | Repeatable processes; accurate documentation; team confidence; friction reduction |
| **Typical client** | Heads of Security or Operations who say "we don't feel in control"; project teams behind schedule; teams with tool-shelfware |
**What is delivered**:
- Immersion report: formal vs. actual process map; invisible risks identified
- Friction reduction: fast wins that reduce daily pain and vulnerability
- Capability handover: team-owned documentation, self-assessment checklists, metrics dashboard
- Validation: team operates independently for one week; consultant steps back to advisory
**Executive pitch**:
> *"You have capable people, but the gap between what is documented and what is actually happening has grown too wide. I do not audit you. I join your team for 12 weeks, observe the reality of daily work, and help you close that gap. You will have repeatable processes, accurate documentation, and a team that trusts its own capability."*
**Natural next modules**: Module 9 (Organizational Resilience), Module 12 (Blue/Purple Team Foundation), Module 3 (M365 Security Hardening)
**See**: [Embedded Quality & Process Assurance](quality-management-engagement.md)
---
### Module 12: Blue / Purple Team Foundation
**The Capability Module. From Tool Ownership to Operational Defense.**
| Attribute | Detail |
|-----------|--------|
| **Typical duration** | 60-90 days |
| **Typical investment** | Medium (labor; leverages existing Microsoft security stack) |
| **Prerequisites** | Microsoft Defender (E5) or equivalent EDR; at least one security analyst; willingness to learn |
| **Standalone value** | Operating rhythm for SOC; first guided threat hunt; purple team charter; 12-month capability roadmap |
| **Typical client** | Organizations that own E5/Defender/Sentinel but underutilize them; SOC drowning in noise; no hunt discipline; red and blue teams do not collaborate |
**What is delivered**:
- Capability audit: maturity assessment of detection, response, hunting, and metrics
- Operating rhythm: weekly Secure Score reviews, alert triage playbooks, automated enrichment
- First guided threat hunt: hypothesis-driven search with documented methodology
- Purple team exercise: collaborative attack/defense simulation with detection gap analysis
- 12-month roadmap: prioritized capability improvements with resource requirements
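The guided hunt is hypothesis-driven, e.g. "Office applications should never spawn command interpreters." The sketch below runs that hypothesis over simplified process-creation events; in practice the same filter is expressed as a KQL or EDR query over real telemetry:

```python
# Sketch of a hypothesis-driven hunt: flag process-creation events where an
# Office application spawns a command interpreter. Events are simplified
# dicts standing in for EDR telemetry.
OFFICE = {"winword.exe", "excel.exe", "powerpnt.exe"}
INTERPRETERS = {"cmd.exe", "powershell.exe", "wscript.exe"}

def hunt_office_spawns(events):
    return [
        e for e in events
        if e["parent"].lower() in OFFICE and e["child"].lower() in INTERPRETERS
    ]

events = [
    {"host": "ws01", "parent": "WINWORD.EXE", "child": "powershell.exe"},
    {"host": "ws02", "parent": "explorer.exe", "child": "cmd.exe"},
]
findings = hunt_office_spawns(events)
```

Documenting the hypothesis, the query, and the outcome (even a null result) is what turns one hunt into a repeatable methodology.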
**Executive pitch**:
> *"You have a Ferrari-grade security stack and drive it like a rental car. The tools are not the problem—the team's ability to use them is. I help you build the weekly cadence, the hunt discipline, and the purple team culture that turns telemetry into action. In 12 weeks, your team owns the capability, not just the licenses."*
**Natural next modules**: Module 10 (Red Team & Validation), Module 3 (M365 Security Hardening), Module 7 (Recovery & Resilience)
**See**: [Blue/Purple Team Foundation](blue-purple-team-foundation.md)
**Also see**: [Retained Capability](retained-capability.md) for the MSSP co-management and detection engineering model.
---
## Module Selection Guide
### For the Client Who Knows Their Pain
| Client Says | Start With Module | Typical Duration |
|-------------|-------------------|-----------------|
| "We need to manage remote devices" | Module 1: Endpoint Management | 30-45 days |
| "We had a phishing incident" | Module 2: Identity Security | 30-60 days |
| "Our E3 licenses feel wasted" | Module 3: M365 Security Hardening | 30-60 days |
| "The auditor is coming" | Module 4: Data Governance | 45-90 days |
| "What is our AI strategy?" | Module 5: AI Sovereignty Bridge | 30-60 days |
| "Our AD is a mess" | Module 6: On-Premise AD Hardening | 45-60 days |
| "Can we actually recover from backup?" | Module 7: Recovery & Resilience | 30-45 days |
| "We operate critical infrastructure" | Module 8: OT Security Assessment | 45-90 days |
| "Security slows us down" | Module 9: Organizational Resilience | 60-90 days |
| "Prove our security works" | Module 10: Red Team & Validation | 15-30 days |
| "We don't feel in control" | Module 11: Embedded Quality Assurance | 60-90 days |
| "We own tools but can't use them" | Module 12: Blue/Purple Team Foundation | 60-90 days |
| "Our outsourced SOC underperforms" | Module 12 (+ Retained Capability Audit) | 60-90 days |
| "Mythos/AI will find all our vulnerabilities" | AI-assisted TVM Sprint | 30-90 days |
### For the Client Who Does Not Know Where to Start
**The Diagnostic Path**:
1. **Week 1: Kill Chain Assessment** (included in scoping; no charge)
- Interview stakeholders
- Identify the shortest path to organizational failure
- Recommend the module that closes the most critical gap
2. **Module selection based on kill chain**:
- Kill chain starts with compromised endpoint → Module 1
- Kill chain starts with stolen credentials → Module 2
- Kill chain starts with unrecoverable systems → Module 7
- Kill chain starts with OT bridge → Module 8
---
## Progressive Enhancement: How Modules Stack
### Path A: The M365-First Organization
```
Month 1-2: Module 1 (Endpoint Management)
↓ Discovers identity and AI gaps
Month 2-3: Module 2 (Identity Security)
↓ Discovers compliance and data gaps
Month 4-5: Module 4 (Data Governance)
↓ Discovers AI shadow usage
Month 5-6: Module 5 (AI Sovereignty Bridge)
↓ Discovers architectural fragility
Month 7-12: Module 10 (Red Team) + selected hardening
```
### Path B: The Hybrid Infrastructure Organization
```
Month 1-2: Module 6 (On-Premise AD Hardening)
↓ Discovers recovery and identity gaps
Month 2-3: Module 2 (Identity Security)
↓ Discovers endpoint visibility gap
Month 3-4: Module 1 (Endpoint Management)
↓ Discovers AI and data gaps
Month 5-8: Module 5 (AI Sovereignty) + Module 4 (Data Governance)
Month 9-12: Module 7 (Recovery Validation) + Module 10 (Red Team)
```
### Path C: The Critical Infrastructure Organization
```
Month 1-2: Module 8 (OT Security Assessment)
↓ Discovers IT/OT identity and recovery gaps
Month 2-3: Module 6 (On-Premise AD) + Module 2 (Identity Security)
Month 4-5: Module 7 (Recovery & Resilience)
↓ Validates black start, DR procedures
Month 6-9: Module 1 (Endpoint Management) + Module 3 (M365 Hardening)
Month 10-12: Module 10 (Red Team with OT scope)
```
### Path D: The "Not in Control" Organization
```
Month 1-3: Module 11 (Embedded Quality & Process Assurance)
↓ Discovers that tools are underutilized because processes are broken
Month 3-5: Module 12 (Blue/Purple Team Foundation)
↓ Builds operating rhythm for existing security stack
Month 5-7: Module 2 (Identity Security) + Module 3 (M365 Hardening)
↓ Technical fixes now stick because processes support them
Month 8-12: Module 10 (Red Team) + continuous improvement retainer
```
### Path E: The "Mythos / AI Vulnerability Panic" Organization
```
Week 1-2: AI-assisted TVM Baseline Sprint
↓ Discovers actual exploitable attack surface; beats adversary AI to first move
Month 1-2: Module 1 (Endpoint Management) + Module 2 (Identity Security)
↓ Closes the highest-risk doors while AI TVM operationalizes
Month 2-3: Module 3 (M365 Security Hardening) + AI TVM operationalization
↓ Automated remediation pipeline; <48h critical CVE response
Month 3-6: Module 12 (Blue/Purple Team) + continuous AI TVM improvement
↓ Purple team validates that open vulnerabilities are detected and contained
```
---
## Pricing and Engagement Structure
### Fixed-Scope Modules
Each module is sold with:
- **Fixed price** (or fixed daily rate with capped days)
- **Fixed duration** (hard stop)
- **Defined deliverables** (checklist)
- **Go/no-go gate** before any expansion
**Example module statement of work**:
```
Module: Endpoint Management Foundation
Duration: 30 business days
Investment: €[X]
Deliverables:
[ ] Device inventory: 100% of corporate devices identified
[ ] Enrollment: 90%+ of corporate devices managed
[ ] Compliance baseline: encryption, OS version, password policy deployed
[ ] Application inventory: shadow IT report delivered
[ ] Conditional access: compliant device required for M365
[ ] Training: client admin team operational
[ ] Handover: runbooks and monitoring dashboard
Go/No-Go Gate: Day 30 steering committee
→ If value demonstrated: propose Module 2 (Identity Security)
→ If value not demonstrated: engagement concludes with findings report
```
### Module Bundles (Optional)
For clients ready to commit to a multi-module journey, offer **discounted bundles**:
| Bundle | Modules | Discount | Typical Timeline |
|--------|---------|----------|-----------------|
| **M365 Foundation** | 1 + 2 + 3 | 10% | 90-120 days |
| **M365 Secure** | 1 + 2 + 3 + 4 + 5 | 15% | 180 days |
| **Hybrid Hardening** | 1 + 2 + 3 + 6 + 7 | 15% | 180 days |
| **Critical Infrastructure** | 1 + 2 + 6 + 7 + 8 + 10 | 20% | 270 days |
| **Capability Building** | 11 + 12 + 2 + 3 | 15% | 180 days |
| **MSSP Optimization** | Retained Capability Audit + 12 + 10 | 15% | 120-180 days |
| **AI TVM Sprint** | AI-assisted TVM + 1 + 2 + 3 | 15% | 90-120 days |
**The rule**: Bundles are discounted but still phase-gated. Each module has its own go/no-go. The client can pause or stop after any module.
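The bundle arithmetic is straightforward: sum the per-module prices and apply the discount, while each module remains separately invoiced at its own gate. Module prices below are placeholders for illustration:

```python
# Sketch of bundle pricing: sum of module prices minus the bundle discount.
# Each module is still invoiced at its own go/no-go gate. Prices are placeholders.
def bundle_price(module_prices, discount):
    gross = sum(module_prices.values())
    return round(gross * (1 - discount), 2)

m365_foundation = {"Module 1": 20_000, "Module 2": 25_000, "Module 3": 22_000}
price = bundle_price(m365_foundation, discount=0.10)  # 10% off a 67,000 gross
```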
---
## Sales Enablement
### The Modular Pitch
> *"We do not sell one-size-fits-all transformation programs. We sell specific, bounded modules that solve specific problems. You can start with any module—whichever pain is keeping you awake at night. Each module delivers measurable value in 30-60 days. If you like the results, we add the next module. If you do not, we stop. No long-term commitment. No sunk cost. Just building blocks that make your organization stronger."*
### The Discovery Question Sequence
1. *"What is the shortest path to a business-ending incident here?"* (Identifies kill chain)
2. *"Which of your security investments are you least sure about?"* (Identifies untapped tooling)
3. *"If you could fix one thing in the next 60 days, what would it be?"* (Identifies module selection)
4. *"What have you tried before that did not work?"* (Avoids repeating failures)
5. *"What would make you confident enough to expand to the next phase?"* (Defines go/no-go criteria)
---
## Integration With Existing Frameworks
| Document | Integration |
|----------|-------------|
| [Rapid Modernisation Plan](../playbooks/rapid-modernisation-plan.md) | Each module maps to one or more rapid modernisation phases |
| [Business Case Template](../playbooks/business-case-template.md) | Modular pricing structure; per-module ROI |
| [C-Suite Conversation Guide](c-suite-conversation-guide.md) | Modular pitching scripts and objection handling |
| [M365 Antifragile Project](../playbooks/m365-antifragile-project.md) | Modules 1-5 map directly to M365 project workstreams |
| [Antifragile Risk Register](../assessment-templates/antifragile-risk-register.md) | Each module closes a defined risk category |
---
*For the full 180-day rapid modernisation plan, see [Rapid Modernisation Plan](../playbooks/rapid-modernisation-plan.md).*
*For module-specific tactical guidance, see the linked playbooks in each module description.*

# Move Fast and Fix Things
> *"The best time to plant a tree was 20 years ago. The second best time is now. The worst time is after the storm has already knocked it down."*
This document anchors the antifragile consulting practice in a single, actionable posture: **move fast and fix things**. It is not a contradiction of Taleb's philosophy—it is its operational expression. Antifragility is not achieved by standing still and theorizing. It is earned by rapid iteration, honest repair, and the refusal to let perfect be the enemy of resilient.
---
## The Philosophy
### Speed Is a Security Control
The organizations that survive are not the ones with the most comprehensive plans. They are the ones that **execute fastest** against the gaps that actually matter. A 90% solution deployed today outperforms a 100% solution that ships in six months—because the attacker does not wait for your roadmap.
### Fixing Things Is Strategic
Every unfixed vulnerability, orphaned account, and untested backup is a **compounding liability**. Technical debt in security does not accrue interest linearly. It accrues catastrophically. The longer a gap exists, the more likely it becomes the entry point for an existential incident.
Fixing things is not maintenance. It is **risk reduction at velocity**.
### Work Beats Purchases
Most organizations do not have a tools problem. They have a **utilization problem**. They own EDR but have 40% coverage. They own a SIEM but log only 20% of critical systems. They own a PAM solution but have not onboarded privileged accounts. They own backup software but have never tested a restore.
The antifragile consultant's first duty is not to recommend new spending. It is to **extract the value already paid for**.
---
## The Three Rules
### Rule 1: Start With What You Own
Before any new purchase is discussed, exhaust the capabilities of existing tooling. This is not cheapness. It is **optionality preservation**: every dollar not spent on redundant tooling is a dollar available for structural improvement.
| Common Underutilized Asset | What Most Organizations Do | What We Do |
|---------------------------|---------------------------|------------|
| Microsoft E5 / Defender suite | Buy additional EDR, SIEM, CASB | Maximize Defender for Endpoint, Sentinel, Entra ID PIM, Purview |
| Existing firewall / IDS | Buy another "next-gen" platform | Audit rules, enable logging, integrate with SOC workflow |
| Active Directory | Add third-party IAM | Cleanse accounts, implement PAWs, enforce conditional access |
| Backup solution | Buy additional DRaaS | Test restores, document runbooks, automate verification |
| CMDB / ITAM tool | Start a new discovery project | Populate with T0 assets, enforce ownership, feed security workflow |
### Rule 2: Fix the Kill Chain First
Not all debt is equal. We identify the shortest sequence of failures that would end the organization—the **kill chain**—and we fix those nodes with extreme prejudice. Everything else waits.
This requires brutal honesty:
- If your domain admins are logging in from workstations with email and browsing, that is the kill chain.
- If your backups have never been restored, that is the kill chain.
- If your cloud storage bucket is public and contains customer data, that is the kill chain.
- If your CEO's email has no MFA, that is the kill chain.
We do not fix everything. We fix the **existential** things. Fast.
### Rule 3: Every Fix Must Produce a Signal
A fix that does not generate intelligence is a fix that will rot. Every remediation must produce a **signal**: a metric, an alert, a log entry, or a structural change that prevents recurrence.
| Bad Fix | Good Fix |
|---------|----------|
| "We disabled the old account." | "We disabled the old account and implemented automated orphan detection." |
| "We patched the server." | "We patched the server and added it to automated vulnerability management." |
| "We rotated the password." | "We rotated the password and vaulted it in the PAM with checkout logging." |
| "We fixed the firewall rule." | "We fixed the firewall rule and added a monthly rule review to the change process." |
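The first "good fix" in the table, automated orphan detection, can be sketched as a recurring check: flag any enabled account with no logon inside the threshold window. The 90-day threshold and account records below are illustrative assumptions; in production the data comes from AD or Entra ID sign-in attributes:

```python
from datetime import datetime, timedelta, timezone

# Sketch of the orphan-detection signal left behind after disabling a stale
# account: flag enabled accounts with no logon inside the threshold window.
THRESHOLD = timedelta(days=90)

def find_orphans(accounts, now):
    """accounts: [{'name', 'enabled', 'last_logon'}]; last_logon may be None."""
    return [
        a["name"] for a in accounts
        if a["enabled"] and (a["last_logon"] is None or now - a["last_logon"] > THRESHOLD)
    ]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
accounts = [
    {"name": "svc-legacy", "enabled": True, "last_logon": None},
    {"name": "j.doe", "enabled": True, "last_logon": now - timedelta(days=10)},
    {"name": "old-intern", "enabled": False, "last_logon": now - timedelta(days=400)},
]
orphans = find_orphans(accounts, now)
```

Run on a schedule, this is exactly the kind of signal Rule 3 demands: the one-off cleanup becomes a structural control.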
---
## Mapping to Antifragile Pillars
| Antifragile Pillar | Move Fast and Fix Things Expression |
|-------------------|-------------------------------------|
| **Structural Decoupling** | Identify and eliminate hidden dependencies before they become fatal. Do not add new platforms to solve problems that abstraction can solve. |
| **Optionality Preservation** | Maximize existing investments to preserve budget for strategic optionality. Every unnecessary purchase reduces your ability to pivot. |
| **Stress-to-Signal Conversion** | Every fix must generate telemetry. Incidents are not failures; they are unpaid penetration tests. Convert their lessons into structure. |
| **Sovereign Intelligence** | Use what you own first. Local AI on existing hardware beats cloud AI on a credit card. Your data should improve your models, not someone else's. |
| **Asymmetric Payoff Design** | Small, fast fixes on the kill chain yield disproportionate risk reduction. Do not distribute effort evenly; concentrate it where failure is existential. |
---
## Mapping to Standards
We do not treat compliance as the goal. We treat it as a **side effect of doing the right things fast**.
| Standard | How We Map |
|----------|-----------|
| **CIS Controls v8** | IG1 is the floor, not the ceiling. We aim for IG1 completeness in 90 days because it is the minimum viable security posture. See [CIS Controls Mapping](../reference/cis-controls-mapping.md). |
| **NIST CSF 2.0** | We align to Identify, Protect, Detect, Respond, Recover—but we emphasize GOVERN as the missing piece in most organizations. See [NIST CSF Mapping](../reference/nist-csf-mapping.md). |
| **ISO 27001** | Annex A controls are addressed through the kill chain-first methodology, not checklist compliance. |
| **DORA / NIS2** | Operational resilience and ICT risk management are natural outcomes of the antifragile rapid-modernisation approach. |
---
## The Consultant's Stance
When you walk into a client environment, bring these assumptions:
1. **They already own enough software.** Your job is to configure, integrate, and operationalize—not to shop.
2. **Their technical debt is worse than they admit.** Your job is to find the kill chain and fix it without shaming.
3. **Speed builds trust.** A visible fix in week one is worth more than a perfect report in week twelve.
4. **Honesty is the product.** You are not a reseller. You are an independent advisor. Say what you would do with your own company's data.
### The Opening Pitch
> *"Most consultants will sell you a shopping list. We start with what you already bought. Our job is to find the gaps that matter, fix them fast, and make sure they stay fixed. We move fast. We fix things. And we do it with the tools you already own."*
---
## Engagement Principles
### Week 1: Brutal Honesty Audit
- Inventory existing tooling and its utilization rate
- Identify the kill chain
- Pick three fixes that can be completed before the next steering committee
- Execute them
### Month 1: Momentum Through Visibility
- Show the client what they could not see before
- Close the highest-risk gaps
- Demonstrate value from existing tools
- Build political capital for harder changes
### Quarter 1: Structural Change
- Convert fixes into process
- Automate detection and response
- Establish the antifragile feedback loop: incident → learning → structure
---
## Contrast With "Move Fast and Break Things"
The Silicon Valley mantra was an excuse for externalizing harm. "Move fast and fix things" is its responsible successor:
| Move Fast and Break Things | Move Fast and Fix Things |
|---------------------------|--------------------------|
| Ship now, fix later | Fix now, ship sustainably |
| Externalize risk to users | Internalize risk and reduce it |
| Growth at all costs | Resilience as the foundation of growth |
| Ignore technical debt | Pay down the highest-interest debt first |
| Disrupt without accountability | Build trust through visible repair |
---
*Next: [CIS Controls Mapping](../reference/cis-controls-mapping.md)*
*Previous: [Antifragile Manifest](antifragile-manifest.md)*

# Organizational Resilience: Breaking the Dev / Sec / Ops Silos
> *"You do not have a tools problem. You have a handoff problem. Every boundary between departments is a boundary where accountability dies."*
This document provides the strategic arguments, talking points, and implementation roadmap for organizational structures that produce resilient systems. It addresses two related transformations:
1. **Shift Left**: Moving security, reliability, and operational concerns earlier in the development lifecycle
2. **Merge Dev / Sec / Ops**: Eliminating the structural boundaries that create blame, delay, and fragility
It is designed for consultants who must persuade executives that **organizational design is a security control**—and that siloed departments are a latent single point of failure.
---
## The Executive Summary
Your clients likely have three departments that do not talk to each other:
- **Development** builds features and ships code
- **Security** reviews code after it is built and blocks releases
- **Operations** runs the systems and is blamed when they fail
The result is predictable: slow releases, adversarial relationships, security findings that are too late to fix economically, and operational failures that no one owns.
The antifragile alternative is not a new tool. It is a **new structure**: shared accountability, integrated workflows, and teams that own their systems from commit to retirement.
**The business case**:
- **Speed**: Releases move from quarterly to weekly—or daily—because there are no handoff queues
- **Cost**: Security findings fixed in development cost 1/100th of what they cost in production
- **Resilience**: Teams that own operations design systems that do not fail; teams that only build features design systems that look good on demo day
- **Talent**: Engineers want to work in high-trust, high-ownership environments
---
## Part 1: Shift Left — The Argument
### What "Shift Left" Actually Means
"Shift left" means moving quality, security, and operational concerns **earlier in the lifecycle**—from production to pre-production, from pre-production to build, from build to design, from design to requirements.
| Stage | Traditional Timing | Shift-Left Timing | Cost to Fix |
|-------|-------------------|-------------------|-------------|
| Requirements | Never | During specification | 1x |
| Design | Never | During architecture review | 5x |
| Development | Post-build (security scan) | During coding (IDE integration) | 10x |
| Build / CI | Post-commit | Pre-commit hooks, automated gates | 15x |
| Test | Pre-release | Continuous automated testing | 25x |
| Production | Post-incident | Continuous monitoring, chaos engineering | 100x+ |
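The "pre-commit hooks, automated gates" row can be sketched in a few lines. This is an illustrative gate, assuming scanner findings have been normalized to dicts with an `id` and a `severity`; only critical findings block the commit, which keeps the gate fast and low-noise:

```python
import sys

CRITICAL = "critical"

def gate(findings):
    """Return a non-zero exit code if any finding is critical.

    `findings` is whatever your scanner emits, normalized to dicts with
    'id' and 'severity'. Only criticals block the commit, so developers
    are not flooded with low-severity noise at commit time.
    """
    blockers = [f["id"] for f in findings if f["severity"] == CRITICAL]
    for fid in blockers:
        print(f"BLOCKED: critical finding {fid}", file=sys.stderr)
    return 1 if blockers else 0

findings = [
    {"id": "SAST-101", "severity": "low"},
    {"id": "SAST-207", "severity": "critical"},
]
print(gate(findings))  # non-zero exit code → commit rejected
```

Wired into a pre-commit hook or CI step, the non-zero return stops the pipeline; everything below critical flows into normal backlog triage instead.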
### The Executive Framing
> *"Every security finding discovered in production is a finding that should have been caught in development—at one percent of the cost. Shift left is not a security initiative. It is a cost-reduction initiative with security as the primary beneficiary."*
### Why Most "Shift Left" Programs Fail
| Failure Mode | Root Cause | Antifragile Fix |
|-------------|-----------|----------------|
| Security scans produce thousands of findings | Scans run too late; debt accumulates | Run lightweight scans in IDE; gate commits on critical severity |
| Developers ignore security alerts | Security is not measured in their objectives | Shared OKR: team owns vulnerability count, not just security team |
| Security team becomes the "department of no" | Security is a gate, not a service | Embed security engineers in development teams as consultants |
| Operational issues discovered after release | Operations is not involved in design | Require operational readiness review before release |
---
## Part 2: Merging Dev / Sec / Ops — The Argument
### The Case for Integration
Separate departments create **perverse incentives**:
| Department | Incentive | Resulting Fragility |
|-----------|-----------|---------------------|
| Development | Ship features fast | Security and reliability deferred |
| Security | Prevent breaches | Block releases, slow innovation, become adversarial |
| Operations | Keep systems stable | Resist change, accumulate undocumented workarounds |
When these departments merge into **platform teams** or **product-aligned teams** with end-to-end ownership, incentives align:
| Integrated Team | Incentive | Resulting Resilience |
|----------------|-----------|---------------------|
| Platform team | Reliable, secure, fast infrastructure | Builds guardrails, not gates |
| Product team | Working software in production | Owns security, performance, and operability |
| SRE team | System reliability via engineering | Automates toil, designs for failure |
### The Executive Framing
> *"You currently have three departments optimizing for three different outcomes. Development ships fast. Security says no. Operations keeps the lights on. The result is that nobody optimizes for the only outcome that matters: working, secure, reliable software in production. Merging them does not eliminate specialization. It aligns specialization toward a shared goal."*
### The Three Models (Progressive Integration)
We do not demand full merger on day one. We propose a **progressive path**:
#### Model 1: Shift Left with Embedded Security (Months 1-6)
- Security engineers embed in development teams 2-3 days per week
- Security tooling integrated into IDE and CI/CD pipeline
- Shared vulnerability metrics: team owns count, not security department
- Operational readiness checklist required before release
**What changes**: Process and proximity. No headcount reorganization.
#### Model 2: Platform Teams with SRE (Months 6-12)
- Create platform teams that own infrastructure, tooling, and developer experience
- SREs embed in product teams or form dedicated reliability teams
- Security becomes a **platform capability**: secure defaults, automated scanning, policy-as-code
- Operations becomes a **platform capability**: observability, incident management, runbook automation
**What changes**: Structural realignment of infrastructure and tooling teams.
#### Model 3: Product-Aligned Teams with Full Ownership (Months 12-24)
- Product teams own their entire stack: code, security, operations, on-call
- Platform teams provide paved roads, not mandatory highways
- Security team becomes a **centre of excellence**: threat intelligence, advanced hunting, policy governance
- Operations becomes a **centre of excellence**: architecture review, chaos engineering, capacity planning
**What changes**: Full organizational transformation. Teams own outcomes, not functions.
---
## Talking Points for Executives
### For the CEO
> *"Your competitors are releasing features weekly while your teams debate whether a security scan finding should block a quarterly release. The organizations that win are not the ones with the best security department. They are the ones where security is so integrated that it does not slow anyone down."*
**Key points**:
- Speed and security are not trade-offs. They are complements when the structure is right.
- Talent retention: the best engineers will not work in slow, adversarial environments.
- Competitive velocity: every month spent in release queue is a month competitors gain.
### For the CFO
> *"A vulnerability found in development costs approximately €500 to fix. The same vulnerability found in production costs €50,000—plus incident response, customer notification, potential regulatory fines, and reputational damage. Shift left is the highest-return cost reduction available in your technology budget."*
**Key points**:
- Quantify current rework: What % of development capacity is spent on post-release fixes?
- Quantify delay cost: What is the revenue impact of a delayed release?
- Quantify incident cost: What was the last production security finding's total cost?
### For the CTO / Engineering Lead
> *"Your development teams want to build great software. Your security team wants to protect the company. Your ops team wants stability. None of them are wrong. But the organizational boundary between them creates friction that destroys all three goals. We are not asking you to hire different people. We are asking you to let them sit together and share a target."*
**Key points**:
- Shared ownership reduces blame and accelerates learning.
- Platform teams reduce cognitive load: developers focus on features, platform teams handle infrastructure.
- SRE practices (error budgets, SLOs) align reliability and velocity mathematically.
### For the CISO
> *"You cannot scale security by adding reviewers. You scale security by making the secure path the easy path. A merged structure does not reduce your authority. It increases your leverage—by embedding security into the workflow rather than standing at the gate."*
**Key points**:
- Security team becomes strategic: threat hunting, intelligence, architecture governance
- Embedded security engineers become force multipliers, not bottlenecks
- Metrics shift from "findings blocked" to "vulnerabilities prevented"
### For the Head of Operations
> *"Operations is not a cost centre. It is the place where software meets reality. When operations is separate from development, developers ship software they do not understand, and operations maintains systems they did not design. The result is burnout, outages, and undocumented fixes. Integrated teams own the full lifecycle. That ownership produces better design and fewer surprises."*
**Key points**:
- SRE principles reduce toil through automation
- Teams that own on-call design systems that fail gracefully
- Operational expertise upstream prevents downstream emergencies
---
## Objection Handling
| Objection | Response | Follow-Up |
|-----------|----------|-----------|
| "Our departments are too big to merge." | "We are not proposing a reorganization on day one. We are proposing embedded collaboration and shared metrics as the first step. Structure follows behaviour." | "Let us pilot with one product team and measure velocity and defect rates before and after." |
| "Security will lose independence." | "Independence does not require separation. Auditors can review integrated teams. The security function retains policy authority while embedding execution." | "The security team sets the guardrails. The product team drives within them. That is independence with collaboration." |
| "Developers do not want to do security." | "Developers do not want to do security theater. They want to ship working software. When security is automated, contextual, and fast, developers embrace it. When security is a quarterly scan with 500 false positives, they ignore it." | "Let us show them an IDE plugin that finds vulnerabilities as they type, with suggested fixes. That changes the conversation." |
| "Operations will resist losing control." | "Operations is not losing control. It is gaining influence earlier in the lifecycle. The operational readiness review becomes a design input, not a release gate." | "Your ops engineers have invaluable production knowledge. We want that knowledge in the architecture review, not just the war room." |
| "We tried DevOps before and it failed." | "Most 'DevOps' failures are actually 'DevOps theater': renaming teams without changing incentives or accountability. We measure outcomes—release frequency, change failure rate, mean time to recovery—not labels." | "What failed last time? Tools? Training? Executive support? We design specifically to avoid those failure modes." |
| "Regulators require segregation of duties." | "Segregation of duties does not require segregation of departments. It requires that no single person can approve and execute a critical change without review. Integrated teams can maintain segregation through workflow and tooling." | "Banking regulators increasingly accept policy-as-code and automated approval chains as valid segregation controls." |
| "This would require massive retraining." | "The first phase requires no retraining. It requires proximity: security engineers sitting with developers, ops engineers joining design reviews. Training follows need, not mandate." | "We will identify skill gaps in the pilot and target training precisely." |
---
## The 90-Day Organizational Pilot
We do not propose a full merger in 90 days. We propose a **pilot that proves the concept**.
### Week 1-2: Select the Pilot Team
- Criteria:
- High release frequency (or high desire for it)
- Moderate security exposure (not the most critical system, not the least)
- Willing engineering manager
- Existing CI/CD pipeline
### Week 3-4: Embed and Integrate
- Security engineer: 2-3 days per week with the team
- SRE / ops representative: joins sprint planning and retrospectives
- Shared Slack/Teams channel: no more ticket-based handoffs for routine questions
- Joint OKR: team owns vulnerability count, change failure rate, and mean time to recovery
### Week 5-8: Tooling and Automation
- Security scanning in IDE and CI pipeline
- Operational readiness checklist (automated where possible)
- Runbook for common operational tasks owned by the team
- Error budget defined: reliability target that allows velocity
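The "error budget defined" step above has a simple arithmetic core: the budget is the downtime the SLO permits per window. A sketch, assuming a 30-day window:

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime per window for a given availability SLO."""
    return (1 - slo) * window_days * 24 * 60

# A 99.9% SLO leaves roughly 43.2 minutes of downtime per 30-day window.
print(round(error_budget_minutes(0.999), 1))  # → 43.2
```

The team spends that budget on releases and experiments; when it is exhausted, velocity yields to reliability work. That is the mathematical alignment of reliability and speed referred to elsewhere in this document.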
### Week 9-12: Measure and Report
| Metric | Before | After | Target |
|--------|--------|-------|--------|
| Release frequency | X/quarter | Y/week | 1+ per week |
| Lead time for changes | X days | Y days | < 3 days |
| Change failure rate | X% | Y% | < 15% |
| Mean time to recovery | X hours | Y hours | < 1 hour |
| Critical vulnerabilities in production | X | Y | 0 |
| Security review cycle time | X days | Y days | < 1 day |
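Two of the metrics above, change failure rate and mean time to recovery, fall directly out of deployment records. A sketch with a hypothetical record shape:

```python
def dora_metrics(deploys):
    """Change failure rate and mean time to recovery from deploy records.

    Each record: {'failed': bool, 'recovery_minutes': float or None}.
    """
    failures = [d for d in deploys if d["failed"]]
    cfr = len(failures) / len(deploys)
    mttr = (sum(d["recovery_minutes"] for d in failures) / len(failures)
            if failures else 0.0)
    return cfr, mttr

deploys = [
    {"failed": False, "recovery_minutes": None},
    {"failed": True,  "recovery_minutes": 45.0},
    {"failed": False, "recovery_minutes": None},
    {"failed": True,  "recovery_minutes": 15.0},
]
cfr, mttr = dora_metrics(deploys)
print(f"CFR={cfr:.0%}, MTTR={mttr:.0f} min")  # → CFR=50%, MTTR=30 min
```

Even a crude script like this, run weekly against the pipeline's deployment log, gives the steering committee a trend line rather than anecdotes.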
### Week 12: Steering Committee Presentation
- Show metrics
- Team testimonials
- Recommendation: expand to N teams, or adjust and retry
---
## Regulatory Alignment
### DORA and ICT Risk Management
DORA Article 6 (ICT risk management framework) implicitly requires:
- Integrated risk assessment across development, operations, and security
- Continuous monitoring that spans the full lifecycle
- Incident learning that produces structural improvements
A siloed organization struggles to demonstrate this integration. A merged structure produces the evidence naturally.
### Banking: Segregation of Duties
Banking regulators require segregation between:
- Development and production access
- Security policy and security operations
- Change approval and change execution
**These can be maintained in integrated teams through**:
- Policy-as-code (security rules encoded in pipeline)
- Automated approval workflows (no single person can deploy critical changes)
- Independent audit function (separate from operational teams)
- Immutable logging (all actions recorded, tamper-evident)
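The "automated approval workflows" bullet can be enforced in code rather than policy text. A minimal sketch with a hypothetical change-record shape: a critical change passes only if at least one approver is not the author, which is segregation of duties expressed as a workflow rule:

```python
def change_allowed(change):
    """Enforce segregation of duties for a change record.

    A critical change needs at least one approver who is not the author;
    non-critical changes pass through. Record shape (hypothetical):
    {'author': str, 'approvers': set[str], 'critical': bool}.
    """
    if not change["critical"]:
        return True
    independent = change["approvers"] - {change["author"]}
    return len(independent) >= 1

ok = change_allowed(
    {"author": "alice", "approvers": {"bob"}, "critical": True})
blocked = change_allowed(
    {"author": "alice", "approvers": {"alice"}, "critical": True})
print(ok, blocked)  # → True False
```

Combined with immutable logging of each decision, this is the kind of evidence auditors can verify without requiring the departments themselves to be separate.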
### Critical Infrastructure: Safety and Security
In power and telco, safety systems must be protected from IT changes. This does not require organizational separation. It requires:
- **Technical separation**: Air gaps, unidirectional gateways, safety-certified systems
- **Change control**: Independent safety review for changes touching safety-critical functions
- **Operational discipline**: Procedures that are followed regardless of organizational structure
---
## Integration With the Rapid Modernisation Plan
Organizational resilience runs parallel to technical hardening:
| Rapid Modernisation Phase | Organizational Parallel |
|--------------------------|------------------------|
| Hygiene (Days 0-30) | Map current Dev/Sec/Ops handoffs; identify highest-friction boundary |
| Control (Days 30-60) | Embed security in pilot team; automate first security gate in CI/CD |
| Sovereignty (Days 60-90) | Pilot team owns full lifecycle; measure release frequency and recovery time |
| Antifragility (Days 90-180) | Expand to additional teams; platform team provides paved roads; centre of excellence formed |
---
*For the C-suite conversation guide, see [C-Suite Conversation Guide](c-suite-conversation-guide.md).*
*For the business case including organizational ROI, see [Business Case Template](../playbooks/business-case-template.md).*

# Embedded Quality & Process Assurance
> *"You do not need another audit. You need someone to sit with your team, watch them work, and help them fix the friction that slows them down and creates vulnerabilities."*
This document defines an engagement model for clients who feel they are **not truly in control** of their projects, teams, or operations. It is not an audit. It is not a penetration test. It is **embedded process assurance**: an experienced advisor joins the team, observes the reality of daily work, identifies the gaps between intent and execution, and co-creates improvements that stick.
It is designed for Heads of Security and Heads of Operations who have tools, policies, and headcount—but still feel that something is slipping through the cracks.
---
## The "Not in Control" Posture
### What They Actually Mean
When a Head of Security or Head of Operations says *"we don't believe we are truly in control of what we have / what we are doing,"* they are usually describing one or more of these conditions:
| Symptom | What Is Actually Happening |
|---------|---------------------------|
| "We have policies but nobody follows them" | Process-theater: documents exist, behaviour is unchanged |
| "We bought tools but they are not configured" | Shelfware: purchased capability, never operationalized |
| "I find out about changes after they happen" | Visibility gap: no governance gates, no change notification |
| "The same incident keeps happening" | Learning failure: post-mortems are written, nothing structural changes |
| "My team is busy but I cannot tell you what they achieved" | Activity without outcomes: metrics measure effort, not risk reduction |
| "We have a project plan but it does not match reality" | Planning fantasy: Gantt charts assume perfect conditions; reality is messier |
| "I do not trust our own documentation" | Drift: systems were documented once; they have changed dozens of times since |
**The insight**: These leaders do not need more tools. They need **someone to help them see what is actually happening** and **someone to help them fix it in the context of real work**.
---
## What This Is Not
| Traditional Approach | Embedded Quality Assurance |
|---------------------|---------------------------|
| **Audit** (arrives, checks boxes, leaves a report) | **Presence** (stays, observes work, fixes friction in real time) |
| **External assessment** (interviews, surveys, sampling) | **Embedded observation** (attends standups, watches deployments, reads actual tickets) |
| **Recommendations** (list of things to do, no help doing them) | **Co-implementation** (suggests improvement, helps implement it, validates it works) |
| **Quarterly review** (one meeting, static snapshot) | **Continuous calibration** (weekly check-ins, daily Slack/Teams presence, adaptive focus) |
| **Generalist consultant** (knows frameworks, not your stack) | **Practitioner-advisor** (knows M365, Azure, Defender, Intune, and how teams actually use them) |
---
## The Engagement Model
### Phase 1: Immersion (Week 1-2)
**Objective**: Understand the reality of how work happens—not how it is documented.
**Activities**:
- Attend team standups, sprint planning, retrospectives (for agile teams)
- Attend change advisory board, incident review, capacity planning (for operations teams)
- Shadow key personnel: senior engineer, security analyst, ops lead, project manager
- Review actual work artifacts: recent tickets, pull requests, incident post-mortems, change records
- Observe tool usage: how they actually use Intune, Sentinel, Defender, Azure AD—not how the manual says they should
- Map the **formal process** (documented) against the **informal process** (actual)
**Deliverable**: Immersion Report
- Formal vs. actual process map
- Top 5 friction points (not failures—friction)
- Top 3 "invisible risks" (things that are not tracked but should be)
- Team sentiment: what do they believe is broken that leadership does not see?
**The conversation at Week 2**:
> *"Your policy says all changes require CAB approval. In reality, 60% of Azure policy changes happen via direct portal access by two senior engineers who document them after the fact. That is not non-compliance. That is a signal that your CAB process is too slow for operational reality. We fix the process, not the people."*
---
### Phase 2: Friction Reduction (Week 3-6)
**Objective**: Fix the highest-friction gaps that create both inefficiency and vulnerability.
**Activities**:
- Implement **fast wins** that reduce daily pain:
- Automate a manual provisioning step
- Create a runbook for a recurring but undocumented task
  - Simplify an approval workflow that takes 3 days and 4 people
- Standardize a configuration that is currently done differently on every deployment
- Introduce **guardrails, not gates**:
- Replace pre-deployment security review with automated scanning in CI/CD
- Replace quarterly access review with monthly automated report + exception tracking
  - Replace post-incident blame with blameless post-mortems that mandate structural change
- Build **visibility where there is blindness**:
- Dashboard showing actual vs. planned changes
- Alert when configuration drifts from baseline
- Weekly "what changed" report for leadership
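The "alert when configuration drifts from baseline" item reduces to a comparison between an approved baseline and the current state. A minimal sketch over flat key/value settings (real tenant settings are nested and would need flattening first; the setting names here are illustrative):

```python
def drift(baseline, current):
    """Report settings that differ from the approved baseline."""
    changed = {k: (baseline[k], current.get(k))
               for k in baseline if current.get(k) != baseline[k]}
    added = {k: current[k] for k in current if k not in baseline}
    return changed, added

baseline = {"mfa_required": True, "legacy_auth": False}
current  = {"mfa_required": True, "legacy_auth": True, "guest_access": True}
changed, added = drift(baseline, current)
print(changed)  # → {'legacy_auth': (False, True)}
print(added)    # → {'guest_access': True}
```

Run on a schedule against an exported configuration, the non-empty result becomes both the drift alert and the raw material for the weekly "what changed" report.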
**Deliverable**: Friction Reduction Report
- Before/after metrics: time saved, errors reduced, visibility gained
- Implemented improvements with ownership and maintenance plan
- Remaining friction points for next phase
---
### Phase 3: Capability Building (Week 7-10)
**Objective**: Ensure the team can sustain and extend improvements without permanent consultant dependency.
**Activities**:
- **Knowledge transfer sessions**: Teach the team why each improvement was made, not just how it works
- **Documentation-as-code**: Move runbooks and procedures into version-controlled, executable formats where possible
- **Metrics definition**: Help the team define their own success metrics (not consultant-imposed ones)
- **Self-assessment tools**: Give the team checklists and templates to continue the work
- **Mentoring**: Pair junior team members with consultant for specific skills (KQL query writing, Intune policy authoring, incident response triage)
**Deliverable**: Capability Handover Package
- Team-owned process documentation
- Self-assessment checklist
- Metrics dashboard maintained by the team
- 90-day improvement roadmap drafted by the team (not the consultant)
---
### Phase 4: Validation (Week 11-12)
**Objective**: Prove that the team is now in control—and that the consultant can leave.
**Activities**:
- Consultant steps back to advisory-only presence
- Team runs a week independently; consultant observes from distance
- Validation exercise: team handles a simulated incident, change, or deployment without consultant help
- Retrospective: what worked, what still needs work, what the team will tackle next
**Deliverable**: Validation Report
- Independent operation confirmation
- Remaining gaps (honest assessment)
- Recommended next module or engagement type
---
## Application Contexts
### Context 1: M365 Project Team
**Profile**: Client is deploying M365 (greenfield or migration); project is behind schedule; team is overwhelmed; quality is slipping.
**Embedded assurance activities**:
- Observe provisioning workflow: are users created consistently? Are licenses assigned correctly? Are permissions documented?
- Observe change control: is every tenant change tracked? Is there rollback capability?
- Observe communication: does the project team know what the security team needs? Does security know what the project team is changing?
- Implement: standard user provisioning template, automated license reconciliation, change log in shared channel
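Automated license reconciliation, mentioned above, is at its core a set comparison between the HR roster and the licensed accounts. A sketch with illustrative usernames; in a real engagement the inputs would come from exports out of the HR system and the tenant, not hardcoded sets:

```python
def reconcile(hr_roster, licensed_users):
    """Find licensing gaps in both directions.

    Returns (missing, orphaned): active staff without a license, and
    licensed accounts with no matching active employee (waste or risk).
    """
    missing = sorted(hr_roster - licensed_users)
    orphaned = sorted(licensed_users - hr_roster)
    return missing, orphaned

hr = {"a.ng", "b.ruiz", "c.ito"}
licensed = {"a.ng", "c.ito", "ex.employee"}
missing, orphaned = reconcile(hr, licensed)
print(missing)   # → ['b.ruiz']
print(orphaned)  # → ['ex.employee']
```

The orphaned side is the interesting one for this engagement: every stale licensed account is both wasted spend and an attack surface nobody is watching.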
**The pitch**:
> *"Your M365 project is not failing because your team is incompetent. It is failing because the gap between what they know and what they are expected to deliver is too wide. I join the team for 12 weeks, help them close that gap, and leave them with processes they can sustain."*
### Context 2: Security Operations Team
**Profile**: SOC or security team has tools but no rhythm; alerts are ignored; incidents are reactive; burnout is high.
**Embedded assurance activities**:
- Observe alert triage: which alerts are ignored? Why? (False positive? No runbook? No authority to act?)
- Observe incident response: who is called? When? How is information shared? Where does the process stall?
- Observe shift handoffs: what is lost between shifts?
- Implement: alert tuning playbook, tier-1 triage runbook, automated enrichment, shift handoff template
**The pitch**:
> *"Your security team is drowning in noise. They do not need another SIEM. They need someone to help them turn that noise into signal, build repeatable processes, and regain the confidence that they are seeing what matters. I sit with them, watch their shifts, and help them build a rhythm."*
### Context 3: Infrastructure / Operations Team
**Profile**: Ops team maintains critical systems; changes are ad-hoc; documentation is stale; knowledge is concentrated in one or two people.
**Embedded assurance activities**:
- Observe change execution: how is a firewall rule added? A DNS record changed? A certificate renewed?
- Observe monitoring: what is watched? What is not? Who responds to alerts at 2 AM?
- Observe documentation: is it accurate? Do people use it? When was it last updated?
- Implement: change automation for high-frequency tasks, monitoring dashboard, living documentation process, cross-training plan
**The pitch**:
> *"Your ops team knows the systems better than anyone—but that knowledge lives in their heads. If one person leaves, the organization loses critical capability. I help them externalize that knowledge into repeatable, documented, automatable processes. The team becomes stronger, not more dependent."*
### Context 4: Development Team
**Profile**: Dev team ships code but security is a bottleneck; vulnerabilities found late; releases are stressful.
**Embedded assurance activities**:
- Observe the "security moment": when does security enter the conversation? Day 1 or day 45?
- Observe the deployment pipeline: what checks exist? Which are bypassed? Why?
- Observe the feedback loop: when a vulnerability is found, how long until it is fixed? What prevents faster resolution?
- Implement: security checks in IDE, automated SAST in CI/CD, vulnerability prioritization aligned with business impact, shared metrics
**The pitch**:
> *"Your developers want to ship secure code. Your security team wants to prevent breaches. Both are frustrated because they work in separate rooms with separate metrics. I embed with the dev team for 12 weeks, make security part of their daily workflow instead of a quarterly gate, and prove that speed and security are complements—not trade-offs."*
---
## Talking Points for the Head of Security / Head of Operations
**When they say**: *"We don't believe we are truly in control of what we have."*
**You respond**:
> *"That feeling is usually accurate—and it is not a tool problem. It is a visibility and process problem. You have capable people, but the gap between what is documented and what is actually happening has grown too wide. I do not audit you. I join your team, observe the reality, and help you close that gap. In 12 weeks, you will have repeatable processes, accurate documentation, and a team that trusts its own capability."*
**When they say**: *"We have tried consultants before and nothing changed."*
**You respond**:
> *"Most consultants deliver a report and leave. I deliver presence. I attend your standups, read your tickets, and help fix things while I am there. The difference is not the findings—it is the implementation. You will see changes in the first two weeks, not in a final deck."*
**When they say**: *"We don't have budget for a long engagement."*
**You respond**:
> *"This is 12 weeks, fixed scope. But the first deliverable—the immersion report—is available in Week 2. If you do not see value by then, we stop. Most clients see enough value in the first two weeks to justify the full engagement."*
**When they say**: *"My team will feel judged if someone is watching them."*
**You respond**:
> *"I am not there to evaluate individuals. I am there to evaluate the system: the processes, the tools, the handoffs, the invisible workarounds. Every team has workarounds—they exist because the formal process does not match reality. My job is to make the formal process match reality, not to shame anyone for adapting."*
---
## Metrics That Prove Control
| Before | After | What It Measures |
|--------|-------|-----------------|
| "We think our config is standard" | "We can show the drift from baseline in real time" | Visibility |
| "Changes happen, we find out later" | "Every change is logged, notified, and rollback-ready" | Control |
| "The same alert fires 50 times a day" | "We tuned the alert; it now fires 3 times, and each is actionable" | Signal quality |
| "Incidents take 4 hours to escalate" | "Incidents auto-enrich and route in 15 minutes" | Response speed |
| "Two people know how to do X" | "Anyone on the team can do X from the runbook" | Resilience |
| "We have 20 open critical vulnerabilities" | "We have 3; the other 17 were false positives or already mitigated" | Accuracy |
| "I do not know what the team did this week" | "I can see risk reduction, process improvement, and blockers" | Transparency |
---
## Integration With Modular Engagements
This module sits naturally between **technical hardening** and **organizational transformation**:
```
Module 1 (Endpoint Management) or Module 3 (M365 Hardening)
↓ Reveals process gaps the tools cannot fix
Module 11 (Embedded Quality & Process Assurance)
↓ Builds team capability to sustain improvements
Module 9 (Organizational Resilience) or Module 12 (Blue/Purple Team Foundation)
↓ Scales the capability across the organization
```
It can also precede technical work:
```
Module 11 (Embedded Quality & Process Assurance)
↓ Discovers that tools are misconfigured because processes are broken
Module 2 (Identity Security) or Module 3 (M365 Hardening)
↓ Technical fixes now stick because processes support them
```
---
*For the modular engagement menu, see [Modular Engagements](modular-engagements.md).*
*For organizational structure transformation, see [Organizational Resilience](organizational-resilience.md).*
*For blue/purple team capability building, see [Blue/Purple Team Foundation](blue-purple-team-foundation.md).*

# Retained Capability: What to Keep In-House When You Outsource Security
> *"Outsourcing your SOC does not outsource your risk. It outsources your alert triage. The thinking—the detection engineering, the threat modeling, the business-context awareness—must stay inside your walls. Otherwise you are paying for someone else's generic playbook applied to your specific threat landscape."*
This document addresses one of the most common and expensive misconceptions in enterprise security: the belief that outsourcing a security function means outsourcing the expertise required to make that function effective. It is designed for clients who have engaged an MSSP (Managed Security Service Provider) or outsourced SOC, who feel the service underperforms, and who do not realize that the performance gap is largely within their own control.
---
## The MSSP Illusion
### What the Client Believes
> *"We pay a SOC provider €50,000 per month. They have 200 analysts and advanced tools. Our security is handled."*
### What Is Actually Happening
| Client Assumption | MSSP Reality |
|------------------|--------------|
| "They monitor our environment 24/7" | They monitor the alerts their generic rules generate. Rules tuned to their entire client base, not to your environment. |
| "They have threat intelligence" | They consume commercial threat feeds. They do not have intelligence about *your* specific adversaries, your *industry's* TTPs, or your *proprietary* attack surface. |
| "They investigate incidents" | They triage alerts based on severity. True investigation—understanding *why* an anomaly matters to *your* business—is rarely within scope. |
| "They improve over time" | They improve their own margins by standardizing. Customization for your environment costs them money. |
| "We can hold them accountable" | Your SLA measures ticket volume and response time, not detection quality, mean-time-to-contain, or adversary emulation success rate. |
**The hard truth**: Most MSSP underperformance is not the MSSP's fault. It is the client's fault for outsourcing the execution **and** the thinking.
---
## The Retained Capability Model
When you outsource a security function, you should retain three capabilities internally:
| Retained Capability | Why It Cannot Be Outsourced | What It Produces |
|--------------------|---------------------------|------------------|
| **Detection Engineering** | Only you know what "normal" looks like in your environment. Only you can write rules that detect anomalies specific to your architecture, your applications, and your user behaviours. | Custom detection rules (KQL, Sigma, YARA) that catch threats generic rules miss |
| **Threat Context & Prioritization** | Only you know which assets are crown jewels. Only you can prioritize a vulnerability on your payment gateway over a vulnerability on your marketing blog. | Risk-ranked remediation that aligns with business impact |
| **Integration & Orchestration** | Only you can connect the SOC to your change management, your identity team, your OT engineers, and your executives. | Closed-loop incident response that produces structural improvement |
**The analogy**:
> *"An MSSP is like a security guard in your building. They watch the cameras, patrol the halls, and call the police when they see something. But they do not design the building's security architecture. They do not know which rooms contain the crown jewels. They do not decide whether a new wing needs stronger locks. Those decisions require someone who understands the building, its occupants, and its valuables. That someone must be you."*
---
## The Detection Engineering Gap (SOC-Specific)
### What Generic MSSP Rules Detect
- Known malware signatures
- Common phishing indicators
- Brute-force login attempts
- Known-bad IP addresses and domains
- Standard persistence techniques
### What Generic MSSP Rules Miss
| Threat | Why Generic Rules Miss It | What Custom Detection Would Catch |
|--------|--------------------------|-----------------------------------|
| **Insider threat**: Employee exfiltrating data via sanctioned cloud storage | The activity looks like normal business use | Unusual volume, timing, or destination for that specific user role |
| **Living-off-the-land**: Attacker using native tools (WMIC, net.exe, PowerShell) | These are legitimate administrative tools | Execution context, parent-child process relationships, and command-line arguments specific to your environment |
| **Compromised service account**: Non-interactive account suddenly interactive | Service accounts are rarely monitored individually | Any interactive login from a known service account |
| **Supply chain compromise**: Vendor VPN used at 3 AM from new geography | Vendor access is pre-authorized | Time-of-day and geo anomalies for specific vendor accounts |
| **OT reconnaissance**: IT network scanning targeting OT VLANs | Standard IT scanning is normal | Scanning traffic crossing the IT/OT boundary |
| **AI-enabled fraud**: Deepfake voice call authorizing wire transfer | Traditional fraud controls do not detect synthetic media | Anomaly in voice authentication + financial authorization workflow |
**The insight**: Every environment has a unique "attack surface fingerprint." An MSSP serving 200 clients cannot maintain 200 custom detection rulebooks. They maintain one rulebook and apply it everywhere. The gaps are yours to fill.
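The "compromised service account" row above is the simplest of these gaps to express as detection logic. A production rule would be written in KQL or Sigma; the Python sketch below only illustrates the logic, and the event field names and account list are illustrative assumptions, not any specific SIEM schema.

```python
# Sketch of the "compromised service account" detection from the table above.
# Field names and the account list are hypothetical; production rules would
# be KQL or Sigma. Logon types 2 (console) and 10 (RDP) are the interactive
# types in Windows event 4624 terms.

KNOWN_SERVICE_ACCOUNTS = {"svc-backup", "svc-sql", "svc-monitoring"}
INTERACTIVE_LOGON_TYPES = {2, 10}

def is_suspicious(event: dict) -> bool:
    """Flag any interactive logon from a known service account."""
    return (
        event.get("account", "").lower() in KNOWN_SERVICE_ACCOUNTS
        and event.get("logon_type") in INTERACTIVE_LOGON_TYPES
    )

events = [
    {"account": "svc-backup", "logon_type": 3},   # network logon: expected
    {"account": "svc-backup", "logon_type": 10},  # RDP logon: suspicious
    {"account": "j.doe", "logon_type": 2},        # human console logon: normal
]
alerts = [e for e in events if is_suspicious(e)]  # one alert: the RDP logon
```

A generic MSSP rulebook cannot ship this rule because `KNOWN_SERVICE_ACCOUNTS` only exists in your environment. That list is exactly the context a retained detection engineer provides.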
---
## The Minimum Viable In-House Capability
You do not need a 20-person SOC to make an MSSP effective. You need a **minimum viable retained capability**:
### For Outsourced SOC: The Detection Engineering Cell
| Role | FTE | Responsibility |
|------|-----|---------------|
| **Detection Engineer** | 0.5-1.0 | Writes custom KQL/Sigma rules; tunes MSSP alert thresholds; validates MSSP detection coverage |
| **Threat Context Analyst** | 0.5-1.0 | Prioritizes MSSP findings by business impact; provides environment-specific context; hunts for gaps |
| **Integration Lead** | 0.25-0.5 | Ensures SOC feeds into change management, incident response, and governance; owns the MSSP relationship |
**Total: 1.5-2.5 FTEs** (can be part-time across existing staff or a single senior analyst)
**What this cell does weekly**:
- Reviews MSSP closed tickets: were they true positives? Were any missed?
- Reviews MSSP open tickets: are they stuck waiting for context the MSSP does not have?
- Reviews new threats: would our MSSP detect this? If not, what custom rule do we need?
- Conducts one hunt: proactive search for threats the MSSP is not configured to see
- Meets with MSSP: provides feedback, requests tuning, shares environment changes
---
## How to Audit Your MSSP's Detection Coverage
### The Purple Team Test for MSSPs
Most clients evaluate MSSPs on **response time** and **ticket volume**. These are the wrong metrics. Evaluate them on **detection coverage**.
**The test**:
1. **Select 5 TTPs** relevant to your threat model:
- One initial access vector (e.g., phishing with embedded macro)
- One persistence technique (e.g., scheduled task creation)
- One lateral movement technique (e.g., RDP hijacking)
- One data collection technique (e.g., large ZIP creation)
- One exfiltration technique (e.g., upload to personal cloud storage)
2. **Execute them in a controlled environment** (or simulate them with purple team tools)
3. **Measure**:
- Did the MSSP detect the activity?
- How long from execution to alert?
- Was the alert accurate and actionable?
- Did the MSSP understand the business impact?
4. **Gap analysis**: For every undetected TTP, determine:
- Is the MSSP capable of detecting this but not tuned for our environment?
- Is this beyond the MSSP's generic capability?
- What custom detection rule would close the gap?
**Deliverable**: Detection Coverage Matrix
| TTP | Generic MSSP Detection | Custom Rule Required | Owner | Priority |
|-----|----------------------|---------------------|-------|----------|
| Phishing with macro | Yes (standard) | No | MSSP | — |
| Scheduled task persistence | Partial (noisy) | Yes: parent process + user context | Client Detection Engineer | P1 |
| RDP hijacking | No | Yes: concurrent sessions + unusual source | Client Detection Engineer | P1 |
| Large ZIP creation | No | Yes: volume threshold + destination | Client Detection Engineer | P2 |
| Personal cloud upload | Partial (known apps only) | Yes: DLP + user behaviour baseline | Client Detection Engineer | P1 |
---
## The MSSP Relationship Redesign
Most MSSP contracts are structured as **black boxes**: the client sends logs; the MSSP sends tickets. This model guarantees mediocrity.
**The antifragile alternative**: Co-managed SOC with clear capability boundaries.
| Function | MSSP Responsibility | Client Responsibility | Collaboration Model |
|----------|--------------------|----------------------|---------------------|
| **Log ingestion & platform ops** | Own the SIEM/SOAR infrastructure | Provide logs, verify completeness | Monthly log source audit |
| **Alert triage (Tier 1)** | Initial assessment, enrichment, false positive closure | Provide context, approve escalations | Shared Slack/Teams channel |
| **Investigation (Tier 2)** | Technical analysis, scope assessment | Business impact assessment, stakeholder notification | Joint incident bridge |
| **Detection engineering** | Maintain generic rulebook | Write custom rules, tune thresholds, validate coverage | Bi-weekly detection review |
| **Threat hunting** | Hunt on MSSP-wide intelligence | Hunt on client-specific intelligence and anomalies | Monthly hunt hypothesis workshop |
| **Incident response** | Contain and eradicate (with approval) | Strategic decisions, regulatory notification, communications | Pre-approved containment playbooks |
| **Reporting & metrics** | Ticket volume, response time, closed alerts | Detection coverage, mean-time-to-contain, business impact | Joint monthly metrics review |
| **Continuous improvement** | Platform updates, threat feed integration | Architecture changes, detection gap closure, purple team | Quarterly capability review |
**The contract amendment**:
> *"Your MSSP contract currently measures response time and ticket volume. We propose adding two metrics: (1) Detection Coverage Rate—the percentage of emulated TTPs your MSSP detects in our environment, and (2) Custom Rule Integration Time—the days between us submitting a detection rule and your team deploying it. These metrics align your incentives with our actual security outcomes."*
---
## Generalizing Beyond SOC
The retained capability principle applies to any outsourced security function:
### Outsourced Penetration Testing
| What the Vendor Does Well | What You Must Retain |
|---------------------------|---------------------|
| Execute standardized test methodology | Define scope based on your actual threat model |
| Find common vulnerabilities | Prioritize findings by business impact |
| Write exploit proof-of-concepts | Validate whether a finding is truly exploitable in *your* architecture |
| Produce a report | Convert findings into a structural improvement roadmap |
**The gap**: Most pentest reports sit unread. Without internal capability to validate, prioritize, and remediate, the test is theater.
### Outsourced Compliance Auditing
| What the Vendor Does Well | What You Must Retain |
|---------------------------|---------------------|
| Check control existence against framework | Define which controls actually reduce your risk |
| Sample evidence | Ensure evidence represents operational reality, not audit-day fiction |
| Write findings | Convert findings into actionable remediation with business justification |
| Provide certification | Maintain continuous compliance between audits |
**The gap**: Compliance auditors check boxes. They do not know which boxes matter most to your survival.
### Outsourced Cloud Security Posture Management
| What the Vendor Does Well | What You Must Retain |
|---------------------------|---------------------|
| Scan cloud resources against benchmarks | Define which misconfigurations are actually exploitable in your network topology |
| Generate remediation scripts | Validate that remediation does not break production workloads |
| Track drift over time | Understand *why* drift occurs (process failure, shadow IT, emergency change) |
**The gap**: CSPM tools find thousands of "violations." Without internal context, every violation is treated as equally urgent.
### Outsourced Incident Response Retainer
| What the Vendor Does Well | What You Must Retain |
|---------------------------|---------------------|
| Respond to active incidents with specialized expertise | Know your environment well enough to guide the responders to critical systems |
| Forensic acquisition and analysis | Preserve chain of custody and business continuity during investigation |
| Eradication and recovery | Make strategic decisions about containment scope and communication |
**The gap**: External IR firms arrive blind. Without internal documentation and a pre-established relationship, they spend the first 48 hours learning your network.
---
## The Business Case for Retained Capability
### Cost of the Current Model
| Cost Category | Typical Annual Impact |
|--------------|----------------------|
| MSSP subscription (underperforming) | €500K-€2M |
| Missed detections leading to breach | €4.5M average (rare but catastrophic) |
| Alert fatigue: analyst turnover and burnout | €150K per replaced analyst |
| Compliance penalties from undetected control failures | €100K-€2M (regulated industries) |
| **Total risk-adjusted cost** | **€600K-€8M+** |
### Cost of Retained Capability
| Investment | Annual Cost |
|-----------|-------------|
| 1.5-2.5 FTE detection engineering cell | €150K-€300K |
| Detection engineering tooling (free/open-source + Azure) | €10K-€30K |
| Purple team exercises (quarterly) | €20K-€40K |
| Consultant support (detection engineering mentor, quarterly) | €30K-€60K |
| **Total retained capability investment** | **€210K-€430K** |
**ROI**: For a mid-sized organization, retained capability reduces breach probability, improves MSSP effectiveness, and prevents compliance failures. The investment pays for itself if it catches even one threat per year that the MSSP's generic rules would have missed.
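The break-even claim can be made concrete with expected-loss arithmetic. The figures below for cost and breach impact come from the tables above; the breach-probability assumptions are illustrative, not data, and should be replaced with the client's own risk estimates.

```python
# Worked break-even arithmetic for the cost tables above. Probability
# figures are illustrative assumptions, not data.

retained_capability_cost = 300_000   # mid-range of the €210K-€430K investment
avg_breach_cost = 4_500_000          # average breach impact from the table above
baseline_breach_pct = 10             # assumed: 10% annual breach probability
reduced_breach_pct = 3               # assumed: probability with the detection cell

expected_loss_avoided = avg_breach_cost * (baseline_breach_pct - reduced_breach_pct) / 100
net_benefit = expected_loss_avoided - retained_capability_cost
# Breach risk alone roughly covers the investment; analyst-turnover and
# compliance savings from the table above come on top of this.
```

Under these assumptions the avoided expected loss is €315K against a €300K spend, before counting the €150K-per-analyst turnover savings. The conclusion is sensitive to the probability assumptions, which is why the worksheet-style inputs belong in the client's business case, not in a fixed template.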
---
## The Consultant's Role
As an antifragile consultant, you do not replace the MSSP. You make the MSSP effective by:
1. **Auditing detection coverage** (Purple team test for MSSPs)
2. **Building the detection engineering cell** (hiring, training, tooling, process)
3. **Redesigning the MSSP relationship** (metrics, collaboration model, contract amendments)
4. **Writing the first custom rules** (KQL, Sigma, Sentinel analytics rules)
5. **Training internal staff** to sustain and extend the capability
6. **Establishing the operating rhythm** (weekly detection review, monthly hunt, quarterly capability assessment)
**The pitch to the CISO**:
> *"Your MSSP is not failing you. You are failing to give them the context and custom detection rules they need to succeed in your environment. We do not fire the MSSP. We build a 2-person detection engineering cell inside your organization that makes the MSSP 3x more effective. For the cost of one senior analyst, you transform a €600K annual MSSP spend from insurance theater into actual protection."*
**The pitch to the CFO**:
> *"You are spending €600K per year on a SOC provider that runs generic rules. Generic rules catch generic threats. Your adversaries are not generic. A €200K investment in retained detection engineering makes your existing €600K SOC investment actually work. That is not additional spend. That is making current spend effective."*
---
## Integration With Existing Frameworks
| Document | Integration |
|----------|-------------|
| [Blue/Purple Team Foundation](blue-purple-team-foundation.md) | Detection engineering is the core of blue team capability; this document adds the MSSP co-management layer |
| [Modular Engagements](modular-engagements.md) | Retained capability audit can be delivered as a standalone 30-day module; detection engineering cell build is a 60-90 day module |
| [Antifragile Risk Register](../assessment-templates/antifragile-risk-register.md) | "Outsourced SOC with no retained detection engineering" is a T1 risk with extreme optionality impact |
| [Business Case Template](../playbooks/business-case-template.md) | Retained capability ROI calculation |
---
*For building blue team capability from scratch, see [Blue/Purple Team Foundation](blue-purple-team-foundation.md).*
*For the modular engagement menu, see [Modular Engagements](modular-engagements.md).*

# T0 Asset Framework
> *"Local AI is not an upgrade. It is an insurance policy against the obsolescence of your own company."*
This framework defines the **Tier 0 (T0) asset classification** and its application to sovereign intelligence, critical infrastructure, and organizational survival. It translates cybersecurity risk language into strategic architecture decisions.
---
## What Is a T0 Asset?
In enterprise security and infrastructure architecture, assets are commonly tiered by criticality:
| Tier | Definition | Traditional Examples |
|------|-----------|---------------------|
| T3 | Standard business assets | Office productivity, non-critical SaaS |
| T2 | Important operational assets | ERP, CRM, standard customer-facing systems |
| T1 | Critical assets whose failure causes major harm | Financial systems, core production databases, Active Directory |
| **T0** | **Assets whose compromise or loss destroys the entire operation** | **Domain controllers, root certificate authorities, cryptographic key material, sovereign intelligence** |
A T0 asset is not merely "important." It is **existential**. Its loss does not cause downtime; it causes dissolution.
---
## Why Sovereign Intelligence Is T0
Treating local AI infrastructure as Tier 0 reframes the conversation from "technology investment" to **"foundational pillar of survival."**
### 1. T0 Defines the Boundary of Trust
Most organizations have allowed their cognitive perimeter to dissolve. Data flows outward to cloud AI providers through APIs, chat interfaces, and embedded assistants. The boundary of trust—the firewall between "us" and "them"—has been punctured by convenience.
By classifying intelligence as T0 and moving it inside the perimeter, the organization:
- **Re-establishes the boundary of trust**
- **Regains control over what can be known about the organization**
- **Prevents silent exfiltration of strategic reasoning**
> *"Our strategy is now ours again."*
### 2. T0 Removes Vendor Risk
Clients are rightly terrified of vendor lock-in for infrastructure. Yet they are sleepwalking into the ultimate lock-in: **intelligence lock-in**.
If an organization builds workflows around a cloud model, it is renting its ability to think. The vendor controls:
- The model's capabilities and behaviour
- The pricing and availability
- The "alignment" and safety filters
- The terms of service and data usage policies
A local model is **vendor-independent**. It is an asset that remains fully functional regardless of:
- Silicon Valley boardroom decisions
- Geopolitical events affecting API availability
- Pricing restructuring
- Model deprecation or behaviour changes
This is the definition of a T0 asset: **it must survive the failure of any external dependency**.
### 3. T0 Signals Strategic Maturity
Most competitors are pushing shiny cloud APIs because they are easy to implement and make the consultant look "modern."
When you advocate for local T0 infrastructure, you signal that you are not interested in the shiny. You are interested in **durability**. You are optimizing for the organization's viability over a 5-to-10-year horizon, not the next quarterly demo.
Clients who are serious about survival recognize that maturity immediately.
### 4. T0 Elevates the Advisor
The industry is currently filled with "AI consultants" who are essentially glorified sales reps for cloud providers. They have a structural conflict of interest: their revenue depends on your consumption of third-party services.
An independent architect has no such conflict. When you say:
> *"I am not suggesting local AI because it is easy. I am suggesting it because it is the only way to keep our proprietary edge from being harvested."*
You are speaking with the authority of someone who is **on the client's side of the table**.
---
## The T0 Asset Lifecycle
### Identification
Not all AI infrastructure is T0. The classification applies to:
- **Proprietary fine-tuned models** trained on internal data
- **Core reasoning infrastructure** that drives strategic or operational decisions
- **Model weights and architectures** that encode organizational knowledge
- **Training datasets** that represent irreproducible intellectual capital
- **Inference pipelines** that touch classified, regulated, or crown-jewel data
Cloud AI usage for generic, non-proprietary tasks (e.g., drafting public marketing copy) may remain non-T0. The classification is **data- and context-dependent**.
### Protection
T0 assets demand T0 protection:
| Control Layer | Requirement |
|--------------|-------------|
| **Physical** | Local hardware in controlled facilities; no third-party physical access |
| **Network** | Air-gapped or strictly segmented; no direct internet egress from inference hosts |
| **Access** | Zero-trust with just-in-time elevation; multi-party approval for model changes |
| **Cryptographic** | Model weights encrypted at rest and in transit; key material in HSM |
| **Audit** | Complete logging of access, inference, and fine-tuning operations |
| **Backup** | Immutable, geographically distributed backups of weights, data, and configurations |
| **Recovery** | Tested recovery procedures with RPO < 1 hour and RTO < 4 hours |
### Monitoring
T0 assets require continuous validation:
- **Integrity monitoring**: Detect unauthorized changes to model weights or configurations
- **Performance drift monitoring**: Ensure fine-tuned models maintain accuracy over time
- **Access anomaly detection**: Alert on unusual inference patterns or unauthorized access attempts
- **Dependency health**: Monitor supporting infrastructure (GPU, storage, orchestration) with the same rigor as the models themselves
### Recovery
A T0 asset without a tested recovery plan is a liability:
- **Quarterly recovery drills**: Restore model weights and inference pipelines from backup
- **Version rollback capability**: Maintain previous model versions for instant reversion
- **Cross-site redundancy**: Active-passive or active-active deployment across independent facilities
- **Documentation**: Recovery runbooks that can be executed by personnel who did not design the system
---
## The Vault Metaphor
When clients ask why they should accept the "friction" of local hosting, use the vault metaphor:
> *"Think of it like this: If our company's intelligence was a physical pile of cash, would we store it in a public bank that takes a 'training fee' off every dollar we put in and that holds the right to change the currency whenever they want? Or would we keep it in our own vault, where we control the security, the access, and the value?"*
**Local AI is the vault.**
The vault has a cost. It requires space, guards, and maintenance. But it guarantees that:
- The cash is there when you need it
- No one else is lending it out
- The currency does not change overnight
- You can audit the balance at any time
---
## T0 Classification Worksheet
Use this worksheet during client engagements to classify AI and intelligence assets:
```
Asset Name: ________________________________
Description: ________________________________
Data Types Processed: _______________________
[ ] Public information
[ ] Internal operational data
[ ] Customer data
[ ] Financial data
[ ] Strategic / IP data
[ ] Regulated data (specify: _________)
If this asset were unavailable for 24 hours:
[ ] Minor inconvenience
[ ] Operational disruption
[ ] Significant financial loss
[ ] Existential threat to organization
If this asset's data were leaked to a competitor:
[ ] No impact
[ ] Reputational damage
[ ] Competitive disadvantage
[ ] Existential threat to organization
If the vendor discontinued this service tomorrow:
[ ] Easy replacement within 30 days
[ ] Difficult replacement within 90 days
[ ] Replacement requires major re-architecture
[ ] No viable replacement exists
TIER CLASSIFICATION: [ ] T3 [ ] T2 [ ] T1 [ ] T0
Justification: ________________________________
Required Controls: ____________________________
Owner: ______________________________________
Review Date: ________________________________
```
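During an engagement, the worksheet's impact questions can be reduced to a tier with a simple "worst answer wins" rule: one existential answer is enough to force T0. The mapping below is an illustrative assumption; the worksheet itself leaves final classification to the assessor's judgment.

```python
# Sketch of a "worst answer wins" scoring rule for the worksheet above.
# The answer-to-tier mapping is an illustrative assumption.

# Each impact question's answers are ordered least to most severe (0-3),
# mapping onto tiers T3 (least severe) through T0 (most severe).
TIERS = ["T3", "T2", "T1", "T0"]

def classify(unavailable_24h: int, leaked: int, vendor_gone: int) -> str:
    """Answers are severity indices 0-3; the worst answer sets the tier."""
    return TIERS[max(unavailable_24h, leaked, vendor_gone)]

# Example: operational disruption if unavailable (1), existential if leaked
# to a competitor (3), difficult replacement (1). One existential answer is
# enough: the asset is T0.
tier = classify(unavailable_24h=1, leaked=3, vendor_gone=1)
```

The asymmetry is deliberate: averaging the answers would let two benign scores dilute one existential one, which is exactly the mistake the T0 concept exists to prevent.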
---
## Integrating T0 with Existing Frameworks
### NIST Cybersecurity Framework
| NIST Function | T0 Application |
|--------------|----------------|
| Identify | Asset inventory explicitly includes model weights, training data, and inference pipelines |
| Protect | Encryption, access control, and segmentation applied to AI infrastructure at the highest level |
| Detect | Anomaly detection on model access and inference patterns |
| Respond | Incident response plans include model compromise and data poisoning scenarios |
| Recover | Recovery objectives for AI assets match or exceed those of domain controllers |
### CIS Controls
Map T0 AI assets to CIS Control 1 (Inventory and Control of Enterprise Assets) and Control 3 (Data Protection). Treat model weights as sensitive data subject to the same controls as cryptographic key material.
---
## Consultant's Checklist
When presenting the T0 framework to clients:
- [ ] Explain the T0 concept using familiar examples (domain controllers, root CAs)
- [ ] Map the client's current AI usage to the tier classification
- [ ] Identify at least one T0-class intelligence asset the client has not recognized
- [ ] Present the vault metaphor for intuitive understanding
- [ ] Quantify the vendor risk: what happens if the cloud provider changes terms tomorrow?
- [ ] Show the strategic maturity signal: this is what serious organizations do
- [ ] Provide the worksheet for self-assessment
- [ ] Connect T0 classification to immediate next steps in the [Rapid Modernisation Plan](../playbooks/rapid-modernisation-plan.md)
---
*Next: [Rapid Modernisation Plan](../playbooks/rapid-modernisation-plan.md)*
*Previous: [AI Sovereignty Framework](ai-sovereignty-framework.md)*

# Antifragile Enterprise Consulting Repository — Index
## For Executives and Board Members
| Document | Purpose | Audience |
|----------|---------|----------|
| [Executive Summary](core/executive-summary.md) | One-page strategic overview | CEOs, Boards, Executive Committees |
| [Modular Engagements](core/modular-engagements.md) | Menu of independent modules; choose your starting point | CEOs, CFOs, Procurement |
| [C-Suite Conversation Guide](core/c-suite-conversation-guide.md) | Scripts, objection handling, and psychological framing | Executives, Advisors |
| [Business Case Template](playbooks/business-case-template.md) | Financial justification, ROI, and risk quantification | CFOs, Boards, Risk Committees |
| [Antifragile Manifest](core/antifragile-manifest.md) | Core philosophy and five pillars (business translation) | Executives, Architects, Consultants |
## For Practitioners and Consultants
| Document | Purpose | Audience |
|----------|---------|----------|
| [README](README.md) | Repository overview and quick start | Everyone |
| [Move Fast and Fix Things](core/move-fast-and-fix-things.md) | Company motto and engagement posture | Consultants, Executives |
| [Antifragile Manifest](core/antifragile-manifest.md) | Core philosophy and five pillars | Executives, Architects, Consultants |
| [AI Operations Inevitability](core/ai-operations-inevitability.md) | Defensive AI is inevitable; business AI is optional | CISOs, CTOs, Consultants |
| [Azure OpenAI Sovereignty Bridge](core/azure-openai-sovereignty-bridge.md) | Azure OpenAI/Foundry as pragmatic sovereignty step | CTOs, Architects, Consultants |
| [Organizational Resilience](core/organizational-resilience.md) | Shift left and Dev/Sec/Ops merger talking points | CTOs, CISOs, Consultants |
| [Embedded Quality Assurance](core/quality-management-engagement.md) | Process assurance for teams feeling "not in control" | Heads of Security, Operations, Project Leaders |
| [Blue/Purple Team Foundation](core/blue-purple-team-foundation.md) | Building defensive capability from existing tool investments | CISOs, SOC Managers, Security Architects |
| [Retained Capability](core/retained-capability.md) | What to keep in-house when outsourcing SOC, pentest, compliance | CISOs, CFOs, Procurement |
## Core Frameworks
| Document | Purpose | Audience |
|----------|---------|----------|
| [Move Fast and Fix Things](core/move-fast-and-fix-things.md) | Speed, repair, and maximizing existing investment | Consultants, Executives |
| [Antifragile Manifest](core/antifragile-manifest.md) | Five pillars of antifragile enterprise | Executives, Architects, Consultants |
| [AI Sovereignty Framework](core/ai-sovereignty-framework.md) | Strategic arguments and implementation for local AI | CISOs, CTOs, Security Architects |
| [T0 Asset Framework](core/t0-asset-framework.md) | Tier 0 classification and protection for critical assets | Security Architects, Infrastructure Leads |
## Playbooks
| Document | Purpose | Audience |
|----------|---------|----------|
| [Rapid Modernisation Plan](playbooks/rapid-modernisation-plan.md) | 30-60-90-180 day transformation roadmap | Program Managers, Consultants, CISOs |
| [Endpoint Management Entry Vector](playbooks/endpoint-management-entry-vector.md) | Intune/device management as the ideal engagement entry point | M365 Consultants, Account Managers |
| [AI-Assisted TVM Blueprint](playbooks/ai-assisted-tvm.md) | AI-powered vulnerability management for AI-powered adversaries | CTOs, CISOs, Vulnerability Management |
| [Zero-Budget Vulnerability Discovery](playbooks/zero-budget-vulnerability-discovery.md) | Script-based and osquery-based server/container vuln discovery without Tenable/Qualys | Security Engineers, Consultants |
| [Perimeter Scanning Capability](playbooks/perimeter-scanning-capability.md) | External attack surface strategy: build, partner, or hybrid | Security Architects, Consultants |
| [Osquery: The Sovereign Discovery Platform](playbooks/osquery-custom-platform.md) | Build a custom vulnerability and asset inventory platform on osquery | Security Engineers, Consultants, CTOs |
| [M365 Antifragile Project](playbooks/m365-antifragile-project.md) | Greenfield and modernisation with antifragile design | M365 Consultants, Project Managers |
| [M365 E3 Hardening](playbooks/m365-e3-hardening.md) | Tactical hardening for M365 E3 environments | M365 Consultants, Security Engineers |
| [AD and Endpoint Hardening](playbooks/ad-endpoint-hardening.md) | On-prem AD, Windows endpoints, hybrid identity | Infrastructure Consultants, Security Engineers |
| [Zero-Budget Hardening](playbooks/zero-budget-hardening.md) | Maximize existing tools, minimize new purchases | Consultants, CISOs, IT Managers |
| [Implementation Playbook](playbooks/implementation-playbook.md) | Tactical step-by-step delivery guide | Technical Leads, Security Engineers |
| [Business Case Template](playbooks/business-case-template.md) | Financial justification, ROI, risk quantification | CFOs, Boards, Consultants |
## Standards Reference
| Document | Purpose | Audience |
|----------|---------|----------|
| [CIS Controls v8 Mapping](reference/cis-controls-mapping.md) | IG1-IG3 alignment with antifragile actions | Consultants, Auditors, Compliance |
| [NIST CSF 2.0 Mapping](reference/nist-csf-mapping.md) | CSF function mapping and evidence package | Consultants, Auditors, Compliance |
## Vertical References
| Document | Purpose | Audience |
|----------|---------|----------|
| [Vertical: Power and Utilities](reference/vertical-power-utilities.md) | Power generation, transmission, water, OT, NIS2/CER | Consultants in energy/water sectors |
| [Vertical: Telco](reference/vertical-telco.md) | Mobile/fixed operators, signaling security, 5G, fraud | Consultants in telecommunications |
| [Vertical: Banking](reference/vertical-banking.md) | Financial services, DORA, PSD2, SWIFT CSP alignment | Consultants in banking/fintech sectors |
## Assessment and Tools
| Document | Purpose | Audience |
|----------|---------|----------|
| [Antifragile Risk Register](assessment-templates/antifragile-risk-register.md) | Kill chain-aware risk taxonomy and register template | Risk Managers, Consultants |
| [M365 Project Risk Register](assessment-templates/m365-project-risk-register.md) | M365-specific risk register with phase gates | Project Managers, M365 Consultants |
| [Assessment Templates](assessment-templates/README.md) | Future diagnostic tools and maturity models | Consultants, Auditors |
## Navigation by Role
### For the Executive Sponsor
1. [Move Fast and Fix Things](core/move-fast-and-fix-things.md) — understand the engagement posture and speed philosophy
2. [Antifragile Manifest](core/antifragile-manifest.md) — understand the strategic philosophy
3. [AI Sovereignty Framework](core/ai-sovereignty-framework.md) — read the executive summary and five strategic arguments
4. [Rapid Modernisation Plan](playbooks/rapid-modernisation-plan.md) — review phases and governance cadence
5. [Zero-Budget Hardening](playbooks/zero-budget-hardening.md) — understand how existing investments are maximized
### For the Security Architect
1. [T0 Asset Framework](core/t0-asset-framework.md) — master the classification and protection model
2. [Implementation Playbook](playbooks/implementation-playbook.md) — follow the workstreams for identity, perimeter, and resilience
3. [Rapid Modernisation Plan](playbooks/rapid-modernisation-plan.md) — adapt phases to organizational context
### For the Consultant
1. [README](README.md) — repository orientation
2. [Move Fast and Fix Things](core/move-fast-and-fix-things.md) — your opening stance and engagement principles
3. [Modular Engagements](core/modular-engagements.md) — the engagement menu: sell any module standalone
4. [Antifragile Manifest](core/antifragile-manifest.md) — philosophical foundation for client conversations
5. [M365 E3 Hardening](playbooks/m365-e3-hardening.md) — your bread-and-butter: hardening for E3 clients
6. [AD and Endpoint Hardening](playbooks/ad-endpoint-hardening.md) — on-premises identity and endpoint depth
7. [AI Sovereignty Framework](core/ai-sovereignty-framework.md) — persuasive arguments and objection handling
8. [AI Operations Inevitability](core/ai-operations-inevitability.md) — why defensive AI is not optional
9. [Organizational Resilience](core/organizational-resilience.md) — shift left and Dev/Sec/Ops merger talking points
10. [Zero-Budget Hardening](playbooks/zero-budget-hardening.md) — prove value fast without selling
11. [Zero-Budget Vulnerability Discovery](playbooks/zero-budget-vulnerability-discovery.md) — script-based and osquery-based discovery before scanner procurement
12. [Osquery: The Sovereign Discovery Platform](playbooks/osquery-custom-platform.md) — build owned vulnerability and asset inventory capability
13. [Rapid Modernisation Plan](playbooks/rapid-modernisation-plan.md) — structured engagement roadmap
14. [Implementation Playbook](playbooks/implementation-playbook.md) — tactical delivery guidance
15. [Vertical: Power and Utilities](reference/vertical-power-utilities.md), [Vertical: Telco](reference/vertical-telco.md), or [Vertical: Banking](reference/vertical-banking.md) — sector-specific adaptations
16. [CIS Controls Mapping](reference/cis-controls-mapping.md) and [NIST CSF Mapping](reference/nist-csf-mapping.md) — standards alignment for auditors and regulators
---
*This index is updated as the repository grows.*

# On-Premises AD and Endpoint Hardening Playbook
> *"The cloud gets the glory. Active Directory gets compromised."*
This playbook covers the security of on-premises Active Directory, Windows endpoints, and the identity boundary between on-premises and cloud (hybrid identity). It is designed for consulting engagements where the client maintains on-premises infrastructure alongside M365—common in telco, power, and banking environments.
---
## The On-Premises Reality
Most M365 clients did not start in the cloud. They have:
- Active Directory forests with 10+ years of technical debt
- Group Policy objects (GPOs) that no one dares to change
- Service accounts with passwords set to "never expire"
- Admin accounts that log in from the same workstations as regular users
- Backup systems that have never been tested
- KRBTGT accounts that have never been rotated
Our job is not to shame them. Our job is to **fix the kill chain fast** and give them a path to sustainable hygiene.
---
## Phase 1: AD Kill Chain Assessment (Days 1-7)
### Identity Census
**Export and analyze the full AD estate**:
```powershell
# All users with properties relevant to hygiene triage
Get-ADUser -Filter * -Properties LastLogonDate, PasswordLastSet, PasswordNeverExpires, ServicePrincipalName, MemberOf | Export-Csv ad-users.csv -NoTypeInformation
# All groups (especially privileged)
Get-ADGroup -Filter * | Where-Object { $_.Name -match "admin|operator|backup|account|server" } | Export-Csv ad-priv-groups.csv -NoTypeInformation
# All computer accounts
Get-ADComputer -Filter * -Properties LastLogonDate, OperatingSystem | Export-Csv ad-computers.csv -NoTypeInformation
# Service accounts (have an SPN or a description indicating service use)
Get-ADUser -Filter { ServicePrincipalName -like "*" } -Properties ServicePrincipalName | Export-Csv ad-spns.csv -NoTypeInformation
```
**What to look for**:
| Red Flag | Risk | Action |
|----------|------|--------|
| Accounts with PasswordNeverExpires = $true | Leaked or cracked credentials stay valid indefinitely | Force rotation; justify exceptions |
| Admin accounts with last logon > 90 days | Stale, possibly compromised | Disable; verify with owner |
| Users in Domain Admins who should not be | Lateral movement path | Remove; document justification for remaining |
| Computer accounts with last logon > 180 days | Ghost machines, easy targets | Disable; purge after 30 days |
| Service accounts with interactive logon | Violation of principle | Convert to managed service accounts or gMSA |
| Duplicate SPNs | Kerberos authentication failures, potential attack vector | Fix immediately |
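The red flags above can be triaged mechanically from the CSV exports. A minimal Python sketch — the column names match the `Get-ADUser` export, but the date format and the `flag_account`/`triage` helpers are illustrative assumptions to adapt to your locale and tooling:

```python
import csv
from datetime import datetime, timedelta

# Threshold from the red-flag table above
STALE_ADMIN_DAYS = 90

def flag_account(row, now):
    """Return red flags for one row of the ad-users.csv export.

    Assumes the column names produced by the Get-ADUser export above
    (PasswordNeverExpires, LastLogonDate, MemberOf, ServicePrincipalName).
    The date format is locale-dependent; adjust as needed.
    """
    flags = []
    if row.get("PasswordNeverExpires", "").lower() == "true":
        flags.append("password-never-expires")
    last_logon = row.get("LastLogonDate", "")
    if last_logon and "admin" in row.get("MemberOf", "").lower():
        logon = datetime.strptime(last_logon, "%m/%d/%Y %H:%M:%S")
        if now - logon > timedelta(days=STALE_ADMIN_DAYS):
            flags.append("stale-admin")
    if row.get("ServicePrincipalName"):
        flags.append("service-account-review")
    return flags

def triage(csv_path, now=None):
    """Summarize red flags across the whole export, keyed by account."""
    now = now or datetime.now()
    findings = {}
    with open(csv_path, newline="", encoding="utf-8-sig") as fh:
        for row in csv.DictReader(fh):
            flags = flag_account(row, now)
            if flags:
                findings[row.get("SamAccountName", "?")] = flags
    return findings
```

The output feeds directly into the red-flag actions table: one list of accounts per engagement, each tagged with the hygiene failures to discuss with the owner.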
### Privileged Access Assessment
**Map the tier model** (if it exists) or establish one:
| Tier | Scope | Examples |
|------|-------|----------|
| Tier 0 | Controls AD and identity | Domain Admins, Enterprise Admins, Schema Admins, Account Operators, KRBTGT |
| Tier 1 | Controls server workloads | Server Admins, Database Admins, Backup Operators |
| Tier 2 | Controls workstations | Workstation Admins, Help Desk |
**Immediate actions**:
- Remove Account Operators, Backup Operators, Print Operators from Tier 0 equivalents if possible (these groups have dangerous default permissions)
- Ensure no Tier 0 account ever logs on to a Tier 2 device (workstation)
- Document every member of Domain Admins with business justification
### The KRBTGT Account
The KRBTGT account is the **cryptographic foundation of the entire Kerberos realm**. Its password hash is used to sign all Kerberos tickets. If an adversary has this hash, they have permanent golden ticket capability.
**Check last password change**:
```powershell
Get-ADUser krbtgt -Properties PasswordLastSet
```
- If last changed > 180 days ago: **rotate immediately**
- If never changed (common in old forests): **rotate immediately, but plan carefully**
**Rotation procedure** (do not do this during business hours without planning):
```powershell
# Reset-KrbtgtKeyInteractive is a function from Microsoft's KRBTGT reset
# script (linked below), not a built-in cmdlet - dot-source the script first.
# Requires Domain Admin; run twice, waiting at least the maximum ticket
# lifetime (10 hours by default) plus replication between resets
Reset-KrbtgtKeyInteractive -Domain "corp.example.com"
```
Or use the Microsoft KRBTGT rotation script: `https://github.com/microsoft/New-KrbtgtKeys.ps1`
**Warning**: Rotating KRBTGT invalidates all existing Kerberos tickets. Users will need to re-authenticate. Plan for:
- Off-hours execution
- Service account impact (may need restart)
- VPN reconnection requirements
---
## Phase 2: Endpoint Hardening (Days 8-14)
### Microsoft Defender Antivirus (E3 Baseline)
E3 includes Defender Antivirus but **not** the advanced EDR features. Maximize what you have:
**Enable all protection features** (often disabled by previous AV migration):
```powershell
# Check current state
Get-MpPreference | Select-Object Disable*, Exclusion*
# Enable real-time protection
Set-MpPreference -DisableRealtimeMonitoring $false
# Enable behaviour monitoring
Set-MpPreference -DisableBehaviorMonitoring $false
# Enable network protection (blocks malicious IPs/URLs at network layer)
Set-MpPreference -EnableNetworkProtection Enabled
# Configure attack surface reduction (ASR) rules in audit mode.
# Full ASR reporting and management requires Defender for Endpoint, but
# audit-mode events still land in the local Defender operational log on E3.
# Rules are configured per GUID; example: "Block credential stealing from LSASS"
Add-MpPreference -AttackSurfaceReductionRules_Ids 9e6c4e1f-7d60-472f-ba1a-a39ef669e4b2 -AttackSurfaceReductionRules_Actions AuditMode
```
**Update signatures and engine**:
```powershell
# Update-MpSignature pulls the latest security intelligence; the scan
# engine and platform update through Windows Update
Update-MpSignature
```
### Sysmon Deployment (Free Telemetry)
Since E3 lacks EDR, **Sysmon is non-negotiable**. It provides process creation, network connections, driver loading, and file creation telemetry.
**Deployment**:
1. Download Sysmon from Microsoft Sysinternals
2. Use the SwiftOnSecurity configuration: `sysmonconfig-export.xml`
3. Deploy via GPO or Intune:
```cmd
sysmon.exe -accepteula -i sysmonconfig-export.xml
```
**Log forwarding**: Configure Windows Event Forwarding (WEF) or use a free log collector (Wazuh agent, nxlog) to centralize Sysmon logs.
### LAPS (Local Administrator Password Solution)
LAPS is **free from Microsoft** and essential. It randomizes local admin passwords per machine and stores them in AD with restricted read access. Note that current Windows 11 and Windows Server builds ship Windows LAPS in-box; the steps below cover legacy Microsoft LAPS, which remains common in the field.
**Deployment**:
1. Download LAPS from Microsoft
2. Extend AD schema (one-time, irreversible):
```powershell
Update-AdmPwdADSchema
```
3. Set permissions for computer self-write:
```powershell
Set-AdmPwdComputerSelfPermission -OrgUnit "OU=Workstations,DC=corp,DC=example,DC=com"
```
4. Set read permissions for authorized admins only:
```powershell
Set-AdmPwdReadPasswordPermission -OrgUnit "OU=Workstations,DC=corp,DC=example,DC=com" -AllowedPrincipals "HelpDesk-Admins"
```
5. Deploy LAPS client via GPO
**The conversation**:
> *"Every workstation with the same local admin password is a domino. If I compromise one, I own them all. LAPS makes every password unique and rotates it automatically. It is free, from Microsoft, and takes one day to deploy."*
### Windows Firewall Hardening
Enable and log all profiles:
```powershell
# Enable all profiles
Set-NetFirewallProfile -Profile Domain,Public,Private -Enabled True
# Enable logging for dropped packets
Set-NetFirewallProfile -Profile Domain,Public,Private -LogBlocked True -LogFileName "%systemroot%\system32\LogFiles\Firewall\pfirewall.log"
```
**Block inbound by default** except:
- RDP (only via jump host or PAW)
- SMB (only server-to-server, block workstation inbound)
- Required application ports (documented)
### Credential Guard and Device Guard (Where Hardware Supports)
Credential Guard isolates LSASS to prevent credential theft (Mimikatz-style attacks).
**Requirements**: UEFI 2.3.1c+, Secure Boot, TPM 2.0, Hyper-V Hypervisor
**Enable via GPO**:
- Computer Configuration → Administrative Templates → System → Device Guard → Turn On Virtualization Based Security
- Enable Credential Guard
**Banking/telco/power**: These sectors often have hardware that supports Credential Guard. Enable it. It is free and dramatically reduces credential theft risk.
---
## Phase 3: Network Segmentation and Boundary (Days 15-21)
### The Active Directory Perimeter
Most AD environments are "flat": every workstation can reach every server, every VLAN trusts every other VLAN. This is the kill chain.
**Segmentation priorities** (work with existing network team):
| Segment | What It Contains | Access Rules |
|---------|-----------------|--------------|
| Tier 0 | Domain controllers, AD admin jump hosts | No inbound from Tier 1 or 2. Admin access only from PAWs. |
| Tier 1 | Servers, databases, applications | No inbound from Tier 2 (workstations) except required application ports. |
| Tier 2 | Workstations, user devices | Internet and internal app access only. No direct server admin access. |
| Management | Monitoring, backup, patch management | Outbound to all tiers for management traffic. Inbound restricted to admin sources. |
| OT Boundary | SCADA, ICS, control systems | **Air-gapped or one-way diode**. If integration required, use data diode or unidirectional gateway. |
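Proposed firewall changes can be pre-checked against this matrix before they reach the network team. A small Python sketch; the allowed-flow set is a simplified, illustrative reading of the table above, to adapt per environment:

```python
# Allowed flows distilled from the segmentation table (illustrative assumption)
ALLOWED_FLOWS = {
    ("paw", "tier0"),                    # admin access to Tier 0 from PAWs only
    ("tier1", "tier1"),                  # server-to-server
    ("tier2", "internet"),               # user devices out to the internet
    ("mgmt", "tier0"), ("mgmt", "tier1"), ("mgmt", "tier2"),  # management outbound
}

def rule_ok(src: str, dst: str, documented_app_port: bool = False) -> bool:
    """Pre-validate a proposed firewall rule against the tier model.

    Tier 2 -> Tier 1 is allowed only as a documented application-port
    exception, mirroring the access rules in the table.
    """
    if (src, dst) in ALLOWED_FLOWS:
        return True
    return src == "tier2" and dst == "tier1" and documented_app_port
```

Even this crude check catches the most dangerous class of request: anything inbound to Tier 0 from a lower tier.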
### DNS Security
DNS is the most underrated security control. Most malware needs DNS to find its command and control.
**Immediate actions**:
- Point all endpoints to a DNS resolver with filtering:
- **Quad9** (9.9.9.9) — free, blocks known malicious domains
  - **Cloudflare Zero Trust** (formerly Cloudflare for Teams; free tier) — filtering + logging
- **Microsoft DNS security** (if available)
- Enable DNS query logging on internal DNS servers
- Block DNS over HTTPS (DoH) at the firewall unless using a managed DoH provider (prevents DNS tunneling evasion)
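Once query logging is on, a cheap first analytic is entropy analysis of the leftmost label: tunneling and DGA traffic tends to produce long, random-looking subdomains. A hedged Python sketch; the length and entropy thresholds are illustrative starting points to tune against your own logs before alerting:

```python
import math
from collections import Counter

def label_entropy(label: str) -> float:
    """Shannon entropy (bits per character) of one DNS label."""
    if not label:
        return 0.0
    counts = Counter(label.lower())
    n = len(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_like_tunnel(qname: str, min_len: int = 20, threshold: float = 3.5) -> bool:
    """Flag long, high-entropy first labels typical of DNS tunneling or
    DGA beacons. min_len and threshold are tuning assumptions, not
    universal constants."""
    first = qname.split(".")[0]
    return len(first) >= min_len and label_entropy(first) >= threshold
```

Run it over a day of query logs and review the hits by hand before wiring it to an alert: legitimate CDN and telemetry domains will produce some false positives.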
### Network Monitoring on a Budget
**Zeek (formerly Bro)** — open-source network analysis framework:
- Deploy on a SPAN port or network tap at internet boundary
- Provides connection logs, DNS logs, HTTP logs, SSL certificate logs
- Feed into Wazuh, Splunk Free, or Elastic Stack
**Suricata** — open-source IDS/IPS:
- Deploy at internet boundary and critical internal segments
- Use Emerging Threats Open ruleset (free)
- Alert on known malicious indicators
**The conversation**:
> *"You do not need a $100,000 NDR platform to see malicious traffic. You need a SPAN port, an old server, and Zeek. We will show you the connections your firewall is allowing that it should not be."*
---
## Phase 4: Hybrid Identity Security (Days 22-30)
### Azure AD Connect Health
Most on-premises AD environments are synchronized to Entra ID (Azure AD) via Azure AD Connect.
**Immediate hardening**:
- **Secure the Azure AD Connect server**: Treat it as Tier 0. No interactive logon except admins.
- **Enable PTA (Pass-Through Authentication) or PHS (Password Hash Sync) + Seamless SSO**: Evaluate which is appropriate
- PHS: Better resilience (can authenticate even if AAD Connect is down)
- PTA: Passwords never leave premises (some regulatory preference)
- **Enable password hash synchronization even if using PTA**: Provides fallback auth and enables Identity Protection detections if you later upgrade to P2
- **Enable Seamless SSO**: Reduces password prompts, improves MFA adoption
**Azure AD Connect configuration audit**:
```powershell
# On the AAD Connect server
Get-ADSyncScheduler
Get-ADSyncConnector
```
Verify:
- Only required OUs are syncing
- No accidental filtering exclusions that hide accounts
- The sync account has minimal necessary permissions
### AD FS (If Present)
AD FS is a **high-value target**. If compromised, the adversary controls federation for all cloud apps.
**Immediate hardening**:
- **Upgrade to latest supported version** (AD FS 2019 or later)
- **Enable Extranet Smart Lockout (ESL)**: blocks internet-facing brute force while letting users at familiar locations keep signing in
- **Deploy banned-password protection** (Microsoft Entra Password Protection for on-premises AD) to stop weak and reused passwords
- **Require MFA for AD FS extranet access** (if MFA infrastructure exists)
- **Review relying party trusts**: Remove stale or unknown trusts
- **Enable AD FS audit logging**: Forward to SIEM
**The conversation**:
> *"If I compromise AD FS, I do not need to crack your passwords. I just federate myself as an administrator. AD FS is Tier 0. Treat it accordingly."*
---
## OT / Critical Infrastructure Specifics (Telco, Power)
### The IT/OT Boundary
In power and telco environments, the AD forest often extends closer to OT than it should.
**Rules**:
- OT networks must not trust IT AD forests directly
- If Active Directory is required in OT, use a **separate forest** with one-way trust or no trust
- SCCM / Intune patch management for OT systems must be on a separate hierarchy
- Administrative credentials for OT must never be used on IT workstations
### Control System Workstations
- Engineering workstations (EWS) and operator stations (HMI) must run **application whitelisting** (AppLocker or third-party)
- USB ports: disabled or strictly controlled
- No internet access from OT VLANs
- Antivirus signatures updated via offline mechanism, not direct internet
### NIS2 and Critical Infrastructure
For EU critical infrastructure (power, telco):
- Incident reporting to CSIRT/NIS authority within 24-72 hours
- Supply chain security: document every vendor with AD or network access
- Encryption: data at rest and in transit for sensitive systems
- Multi-factor authentication for all remote access to critical systems
See [Vertical: Power Utilities](../reference/vertical-power-utilities.md) for comprehensive OT alignment.
---
## Banking Specifics
### Privileged Access for Financial Data
- Database administrators with access to core banking systems: **vault all credentials**, require dual authorization
- SWIFT infrastructure: isolated network, dedicated workstations, no internet
- Audit trails for all financial transaction system access: immutable, 7+ years retention
### Regulatory Alignment
| Regulation | AD/Endpoint Implication |
|-----------|------------------------|
| **PSD2** | Strong authentication for payment service users; MFA for internal payment systems |
| **DORA** | ICT risk management includes identity and access; recovery testing mandatory |
| **GDPR** | Access to personal data must be logged, justified, and time-bounded |
| **NIS2** (for systemic banks) | Incident reporting, supply chain risk management, encryption |
See [Vertical: Banking](../reference/vertical-banking.md) for comprehensive regulatory alignment.
---
## 30-Day Checklist for AD/Endpoint Engagements
- [ ] Full AD identity census exported and analyzed
- [ ] KRBTGT password rotation completed (or scheduled with plan)
- [ ] All privileged groups documented and justified
- [ ] LAPS deployed to all workstations
- [ ] Sysmon deployed to all Windows endpoints
- [ ] Defender Antivirus fully enabled and updated
- [ ] Windows Firewall enabled and logging on all endpoints
- [ ] DNS filtering deployed (Quad9 / Cloudflare)
- [ ] Network segmentation plan documented (even if not fully implemented)
- [ ] Azure AD Connect server secured and audited
- [ ] AD FS hardened (if present)
- [ ] Backup of AD System State tested (verify you can restore a DC)
- [ ] Credential Guard enabled on capable hardware
---
*Previous: [M365 E3 Hardening](m365-e3-hardening.md)*
*Next: [Implementation Playbook](implementation-playbook.md)*

# AI-Assisted Threat and Vulnerability Management Blueprint
> *"Mythos will scan your entire perimeter in hours, not weeks. But here is the asymmetry: Mythos finds vulnerabilities. AI-assisted TVM finds them first, prioritizes them by exploitability in your specific environment, and generates the remediation code before the adversary writes the exploit."*
This blueprint provides a concrete, board-ready program for organizations facing the reality that AI-powered adversaries—whether criminal tools or agentic systems like Mythos—can discover and weaponize vulnerabilities faster than human teams can patch them.
It is designed for CTOs who need to go to the board with **something tangible**: not just "fix the basics," but an active, modern defensive capability that uses artificial intelligence as a force multiplier against AI-powered offence.
---
## The Problem: AI-Powered Offense Changes the Math
### Traditional Vulnerability Management
| Step | Traditional Timeline | Human Effort |
|------|---------------------|--------------|
| Scan for vulnerabilities | Weekly or monthly | Automated scanner |
| Prioritize findings | Days to weeks | Analyst reads CVSS, debates internally |
| Assess exploitability | Weeks | Manual research, PoC testing |
| Create remediation | Weeks to months | Engineering ticket, backlog queue |
| Validate fix | Months | Re-scan, manual verification |
| **Total cycle** | **3-9 months** | **Heavy human bottlenecks** |
### AI-Powered Offense (Mythos-Class)
| Capability | Impact |
|-----------|--------|
| **Continuous autonomous scanning** | Perimeter scanned daily, not monthly |
| **Intelligent vulnerability chaining** | Identifies kill chains: vuln A + vuln B + misconfiguration C = domain compromise |
| **Automated exploit generation** | Proof-of-concept code generated in minutes for newly disclosed CVEs |
| **Context-aware targeting** | Prioritizes vulnerabilities on internet-facing, privileged, or unmonitored assets |
| **Speed** | What took a human red team weeks takes an AI agent hours |
**The board conversation the CTO fears**:
> *"We have 12,000 open vulnerabilities. Our patching SLA is 90 days for critical. Mythos—or a criminal group using similar tooling—can scan our entire estate, chain our weaknesses, and have an exploit ready before we have even assigned the ticket."*
**The traditional consultant response** (which is correct but insufficient):
> *"We need to implement CIS IG1, clean up our attack surface, and get our house in order."*
**The problem**: The board has heard this before. The CTO has heard this before. It sounds like the same plan that has failed for five years, now with an AI-shaped deadline.
---
## The Asymmetric Response: AI-Assisted TVM
AI-assisted TVM does not replace basic hygiene. It **accelerates it by an order of magnitude**. The goal is not to eliminate all vulnerabilities—that is impossible. The goal is to **compress the find-to-fix cycle so dramatically that the adversary's AI advantage is neutralized**.
| Traditional TVM | AI-Assisted TVM | Speed Multiplier |
|----------------|-----------------|------------------|
| Scan → prioritize by CVSS | Scan → prioritize by **exploitability × asset criticality × active threat intelligence** | 10x faster prioritization |
| Manual research: "Is this actually exploitable?" | AI predicts exploitability from code patterns, social media chatter, and dark web indicators | 100x faster assessment |
| Manual ticket creation and assignment | AI generates **remediation code, GPO scripts, or Intune policies** with human review | 10x faster remediation prep |
| Monthly re-scan to verify | Continuous validation via **agent-based monitoring and drift detection** | Real-time verification |
| Analyst reads 500-page scan report | AI synthesizes **top 10 actions that reduce risk most** into a one-page brief | Board-ready in seconds |
---
## The Architecture
### Layer 1: Discovery and Inventory
**Goal**: Know what you have before the adversary does.
| Source | What It Provides | AI Enhancement |
|--------|-----------------|---------------|
| **Defender Exposure Management** (E5) | Vulnerability inventory, misconfigurations, Secure Score | AI prioritizes recommendations by actual exploitability, not just severity |
| **Network scanners** (Tenable, Qualys, Rapid7, OpenVAS) | Traditional vulnerability scanning | AI correlates scan results with threat intel to predict which vulns will be exploited first |
| **Cloud security posture** (Defender for Cloud, Prisma, Wiz) | Cloud resource misconfigurations | AI identifies cloud-specific kill chains (e.g., overly permissive S3 → compromised IAM → lateral movement) |
| **Zero-budget discovery** (PowerShell, SSH scripts, Syft/Grype, osquery) | Server inventory, SBOMs, package-level CVE correlation | AI aggregates script-based findings into unified risk view. See [Zero-Budget Vulnerability Discovery](zero-budget-vulnerability-discovery.md) |
| **osquery + FleetDM** | Cross-platform endpoint inventory, real-time process/network data, policy compliance | AI queries live endpoint state for prioritization and kill chain simulation. See [Osquery: The Sovereign Discovery Platform](osquery-custom-platform.md) |
| **Attack surface management** (Cortex Xpanse, Shodan, Nuclei, Amass) | External-facing assets unknown to IT | AI maps shadow IT and forgotten assets faster than manual discovery. See [Perimeter Scanning Capability](perimeter-scanning-capability.md) |
| **Software bill of materials (SBOM)** | Known vulnerable components in applications | AI monitors SBOMs against real-time CVE disclosure and exploit availability |
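The SBOM row above reduces to a version join between components and advisories. A simplified Python sketch; the record shapes and the naive dotted-version comparison are illustrative assumptions, not the Syft or OSV schema (real ecosystems need per-ecosystem comparators such as semver or RPM EVR):

```python
def parse_version(v: str):
    """Naive dotted-version parse; illustrative only."""
    return tuple(int(p) for p in v.split("."))

def correlate(sbom, advisories):
    """Match SBOM components against advisories where the installed
    version is older than the fixed version."""
    hits = []
    for comp in sbom:
        for adv in advisories:
            if comp["name"] == adv["package"] and \
               parse_version(comp["version"]) < parse_version(adv["fixed_in"]):
                # Component is vulnerable until upgraded to fixed_in
                hits.append((comp["name"], comp["version"], adv["cve"]))
    return hits
```

The AI layer's role is upstream and downstream of this join: normalizing package names across ecosystems, and ranking the resulting hits by exploitability.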
### Layer 2: Intelligent Prioritization
**Goal**: Stop patching by CVSS. Start patching by **probability of exploitation in your environment**.
| Input | AI Processing | Output |
|-------|--------------|--------|
| CVE database + exploit code availability | Predictive model: will this be exploited in the wild in the next 7/14/30 days? | Risk-ranked vulnerability list |
| Asset criticality (CMDB + business context) | Cross-reference: which vulnerable assets are Tier 0 / Tier 1 / internet-facing? | Environment-specific priority |
| Active threat intelligence (MISP, CISA KEV, vendor advisories) | Correlation: are threat actors currently targeting this vulnerability? | Threat-informed urgency |
| Network topology and segmentation | Kill chain simulation: can this vulnerability be reached from the internet? From a compromised workstation? | Reachability-adjusted risk |
| Compensating controls | Control validation: is the vulnerable host behind WAF? Is EDR monitoring it? | Residual risk calculation |
| External attack surface (perimeter scan findings) | Outside-in risk multiplier: internet-facing vulns weighted 10x higher than internal | Perimeter-aware priority |
**The outside-in weighting**: A vulnerability on an internet-facing server is 10x more urgent than the same vulnerability on an internal workstation because adversary AI scanners find it first. See [Perimeter Scanning Capability](perimeter-scanning-capability.md).
**The result**: Instead of 12,000 vulnerabilities sorted by CVSS, the team sees **the 50 vulnerabilities that matter this week**—ranked by the probability that an AI-powered adversary will exploit them in the client's specific architecture.
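The weighting logic in this layer can be expressed as a simple multiplicative score. A Python sketch; the weights are engagement-tunable assumptions, not a standard:

```python
def risk_score(v: dict) -> float:
    """Sketch of the Layer 2 prioritization: base severity adjusted by
    threat intelligence, asset tier, compensating controls, and the
    outside-in perimeter multiplier described above."""
    score = v.get("cvss", 0.0)
    if v.get("kev"):                   # on the CISA Known Exploited list
        score *= 3.0
    if v.get("exploit_public"):        # PoC or exploit code available
        score *= 2.0
    score *= {0: 3.0, 1: 1.5, 2: 1.0}.get(v.get("tier", 2), 1.0)
    if v.get("internet_facing"):       # the 10x outside-in weighting
        score *= 10.0
    if v.get("compensating_control"):  # WAF/EDR in front: residual risk
        score *= 0.5
    return score

def this_weeks_fifty(vulns):
    """From thousands of findings to the 50 that matter this week."""
    return sorted(vulns, key=risk_score, reverse=True)[:50]
```

The point of writing it down this way is auditability: when the board asks why one vulnerability outranked another, every multiplier is explicit and defensible.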
### Layer 3: Automated Remediation Preparation
**Goal**: Reduce the time from "identified" to "fix ready" from weeks to hours.
| Vulnerability Type | AI-Generated Remediation | Human Review Required |
|-------------------|-------------------------|----------------------|
| Missing OS patch | PowerShell/Intune update policy + deployment ring recommendation | Yes: test and schedule |
| Misconfigured firewall rule | Corrected rule + impact analysis + rollback script | Yes: network team validation |
| Default credential | Password randomization script + vault storage + service restart procedure | Yes: application owner sign-off |
| TLS configuration weakness | Hardened registry settings / nginx config / Azure Front Door policy | Yes: SSL/TLS team validation |
| Cloud IAM over-permission | Least-privilege policy + impact simulation | Yes: cloud team review |
| Container image vulnerability | Updated Dockerfile + base image recommendation | Yes: CI/CD pipeline test |
**Key principle**: AI generates the **draft remediation**. Humans validate, test, and deploy. This is not autonomous patching. It is **augmented patching**—the AI does the research and scripting; the human does the judgment and approval.
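That approval gate is worth making structural rather than procedural. A minimal Python sketch of a remediation object that cannot deploy without a named human sign-off (class and method names are illustrative, not any product's API):

```python
class Remediation:
    """Augmented-patching gate: the AI-generated draft cannot deploy
    until a named human approves it."""

    def __init__(self, vuln_id: str, draft: str):
        self.vuln_id = vuln_id
        self.draft = draft        # AI-generated script or policy text
        self.approved_by = None

    def approve(self, reviewer: str) -> None:
        """Record the human sign-off required by the table above."""
        self.approved_by = reviewer

    def deploy(self) -> str:
        if self.approved_by is None:
            raise PermissionError("human review required before deployment")
        return f"{self.vuln_id} deployed (approved by {self.approved_by})"
```

Encoding the gate in the pipeline, rather than in a runbook, is what keeps "augmented patching" from quietly drifting into autonomous patching.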
### Layer 4: Continuous Validation
**Goal**: Prove that fixes worked and detect drift immediately.
| Validation Method | AI Enhancement |
|-------------------|---------------|
| Re-scan after patch | AI correlates patch deployment with scan results; flags failed patches automatically |
| Configuration drift detection | AI baselines "known good"; alerts on deviation within hours, not months |
| Exploit attempt detection | AI monitors EDR/SIEM for exploitation techniques targeting recently disclosed CVEs |
| Adversarial simulation | AI-driven purple team exercises that target the **exact vulnerabilities** still open |
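Baseline-versus-current comparison is the raw signal behind the drift row above. A minimal Python sketch using a stable hash for "known good" and a key-level diff:

```python
import hashlib
import json

def snapshot(config: dict) -> str:
    """Stable fingerprint of a known-good configuration baseline
    (key order independent)."""
    return hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()).hexdigest()

def drift(baseline: dict, current: dict) -> dict:
    """Keys whose values changed, appeared, or disappeared since the
    baseline - the raw signal behind alerting within hours, not months."""
    keys = set(baseline) | set(current)
    return {k: (baseline.get(k), current.get(k))
            for k in keys if baseline.get(k) != current.get(k)}
```

The AI enhancement sits on top: deciding which of the drifted keys is security-relevant and worth a human's attention.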
---
## The 30-60-90 Day AI-Assisted TVM Sprint
### Phase 1: Baseline and Acceleration (Days 0-30)
**Theme**: *Know your enemy's starting point. Beat them to the first move.*
**Week 1: Threat-Informed Asset Discovery**
- Inventory all vulnerability scanning sources (Defender Exposure Management, Tenable, Qualys, cloud scanners, or zero-budget scripts if no commercial tools exist)
- Identify gaps: which assets are not scanned? Which scans are stale?
- Deploy **attack surface management** scan: discover what the internet sees
- Deploy **Shadow IT discovery**: unknown cloud apps, unapproved infrastructure
- Run **zero-budget discovery sweep** on servers without EDR/scanner coverage. See [Zero-Budget Vulnerability Discovery](zero-budget-vulnerability-discovery.md)
**Deliverable**: Asset and vulnerability inventory with coverage gaps identified
**Week 2: AI-Powered Prioritization Engine**
- Integrate vulnerability data with:
- CISA Known Exploited Vulnerabilities (KEV) catalog
- ExploitDB / GitHub exploit availability
- Dark web chatter monitoring (where feasible)
- Client's CMDB for asset criticality
- Deploy **local AI model** (or Azure OpenAI with structured prompting) to:
- Synthesize scan results into risk-ranked action list
- Predict which vulnerabilities will be exploited in next 30 days
- Generate one-page executive brief weekly
**Deliverable**: AI-prioritized vulnerability list; first executive brief
**Week 3: Remediation Acceleration**
- Select top 20 vulnerabilities from AI-prioritized list
- Use AI to generate remediation scripts/policies for each
- Human review and validation
- Deploy fixes in controlled maintenance windows
- Measure: time from identification to fix ready vs. historical baseline
**Deliverable**: 20 critical vulnerabilities remediated or in controlled deployment
**Week 4: Validation and Board Briefing**
- Re-scan to validate fixes
- AI generates before/after risk dashboard
- Board briefing: "We had 12,000 vulnerabilities. AI identified the 50 that mattered. We fixed the top 20 in 30 days. Here is the trend."
**Deliverable**: Board-ready TVM dashboard; 30-day metrics report
---
### Phase 2: Operationalization (Days 30-60)
**Theme**: *Make AI-assisted TVM the operating rhythm, not a project.*
**Week 5-6: Integration into SOC Workflow**
- Vulnerability alerts feed into SOC triage queue
- AI enriches vulnerability alerts with: exploit availability, asset criticality, business impact
- SOC analysts can escalate high-risk vulnerabilities as incidents
- Automated containment: vulnerable internet-facing assets temporarily restricted pending patch
**Week 7-8: Automated Remediation Pipeline**
- Build CI/CD pipeline for vulnerability remediation:
- AI generates patch policy → security team reviews → automated deployment to test ring → validation → production deployment
- Target: 80% of routine patches (OS, browser, standard apps) automated with human approval
- Exception handling: complex or risky patches remain manual
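The gating logic of such a pipeline can be sketched as a small routing function. The category taxonomy and state names here are assumptions for illustration; the point is that production deployment requires both explicit human approval and a clean test-ring run, and anything outside the routine categories falls back to manual handling.

```python
# Illustrative taxonomy of patch categories eligible for automation
ROUTINE_CATEGORIES = {"os", "browser", "standard_app"}

def remediation_route(category: str, human_approved: bool,
                      test_ring_passed: bool) -> str:
    """Decide how a generated patch moves through the pipeline.

    Routine patches reach production only after human approval
    AND a successful test-ring deployment; complex or risky
    patches stay manual.
    """
    if category not in ROUTINE_CATEGORIES:
        return "manual"              # exception handling: human-driven
    if not human_approved:
        return "awaiting_approval"   # AI-generated, pending review
    if not test_ring_passed:
        return "test_ring"           # approved, validating in test ring
    return "production"

route = remediation_route("os", human_approved=True, test_ring_passed=True)
```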
**Week 9-10: Purple Team Targeting Open Vulnerabilities**
- Purple team exercise: red team attempts to exploit vulnerabilities **still open** from the AI-prioritized list
- Measures: Did the SOC detect the exploitation attempt? Did the vulnerability allow compromise? How fast was response?
- Findings feed back into AI prioritization model
**Deliverable**: Operating rhythm established; automated pipeline operational; first vulnerability-focused purple team complete
---
### Phase 3: Strategic Advantage (Days 60-90)
**Theme**: *Convert vulnerability management from cost centre to competitive advantage.*
**Week 11-12: Predictive and Proactive**
- AI monitors CVE disclosure streams in real time
- Within 24 hours of critical CVE disclosure:
- AI assesses: are we affected? Which assets? What is the exposure?
- AI generates: risk assessment, remediation script, communication draft
- Human team validates and deploys in <48 hours
- Compare: industry average for critical CVE response is 30-60 days. Target: <48 hours for high-confidence remediations.
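The first AI task in that 24-hour window, "are we affected, which assets, what is the exposure?", is largely an inventory join. A minimal sketch, assuming inventory rows of the shape shown (in practice they would come from osquery/FleetDM or the CMDB, and the product matching would be CPE-based rather than exact strings):

```python
def assess_exposure(cve: str, affected_products: set[str],
                    inventory: list[dict]) -> dict:
    """First-pass answer to 'are we affected, and where?'.

    Inventory rows are assumed to look like
    {"host": ..., "product": ..., "internet_facing": bool}.
    """
    hits = [a for a in inventory if a["product"] in affected_products]
    return {
        "cve": cve,
        "affected_hosts": [a["host"] for a in hits],
        # internet-facing hits drive the <48h remediation clock
        "internet_facing": sum(a["internet_facing"] for a in hits),
    }

inventory = [
    {"host": "web-01", "product": "nginx", "internet_facing": True},
    {"host": "db-01", "product": "postgres", "internet_facing": False},
]
report = assess_exposure("CVE-2024-XXXX", {"nginx"}, inventory)
```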
**Ongoing: Continuous Improvement**
- Weekly AI-generated TVM executive brief
- Monthly purple team exercise targeting open vulnerabilities
- Quarterly board report: mean time to remediate, AI prediction accuracy, adversarial simulation results
---
## The Board-Ready Demo Script
When the CTO walks into the boardroom with this program, they bring **evidence, not promises**.
### The 10-Minute Demo
**Minute 1-2: The Threat**
> *"Last month, an AI-powered scanning tool identified 12,000 vulnerabilities in our environment. Industry average time to patch a critical vulnerability: 60 days. Industry average time for an AI-powered adversary to weaponize a newly disclosed vulnerability: 5 days. The gap is fatal."*
**Minute 2-4: The Traditional Response**
> *"Our previous approach was to patch by CVSS score. The board has seen this plan before. It requires 20 additional engineers we cannot hire, 9 months we do not have, and produces a false sense of security because CVSS does not predict exploitability."*
**Minute 4-7: The AI-Assisted Alternative**
[Show the dashboard live]
> *"This is our AI-assisted TVM platform. It does not show us 12,000 vulnerabilities. It shows us the 47 vulnerabilities that an adversary is likely to exploit in our specific environment this month, ranked by probability."*
[Click on top vulnerability]
> *"This vulnerability—CVE-2024-XXXX—is on three of our internet-facing web servers. CVSS score: 7.5. But the AI has cross-referenced exploit availability, our network topology, and active threat intelligence. It predicts 85% probability of exploitation within 14 days. It has already generated the remediation script. We are deploying it tonight."*
[Show before/after]
> *"In 30 days, we reduced our exploitable attack surface by 40%. We did not hire 20 engineers. We used AI to prioritize, generate fixes, and validate. Our mean time to remediate a critical vulnerability dropped from 60 days to 4 days."*
**Minute 7-10: The Ask**
> *"We are not asking for a three-year transformation. We are asking for a 90-day sprint to operationalize AI-assisted vulnerability management. The investment is less than one senior engineer's annual salary. The return is closing the 55-day gap between adversary weaponization and our remediation."*
---
## Tool Stack Recommendations
### Microsoft-Centric (Most Common for Our Clients)
| Layer | Microsoft Tool | AI Enhancement |
|-------|---------------|---------------|
| Discovery | Defender Exposure Management + Defender for Cloud | AI prioritizes exposure recommendations by exploitability |
| Prioritization | Azure OpenAI / local LLM + CISA KEV feed + MISP | Predictive exploitability scoring |
| Remediation | Intune + Azure Policy + PowerShell + Azure Automation | AI-generated remediation scripts and policies |
| Validation | Defender for Endpoint + Sentinel | AI-driven drift detection and adversarial simulation validation |
| Reporting | Power BI + Azure OpenAI synthesis | Natural language executive briefs generated automatically |
### Open-Source and Hybrid
| Layer | Tool | Role |
|-------|------|------|
| Discovery | Wazuh + OpenVAS + osquery/FleetDM + Cloud-native scanners | Vulnerability, configuration, and real-time endpoint discovery |
| Prioritization | Local LLM (Llama 3, Mistral) + exploit prediction models | On-premise AI for sensitive environments |
| Remediation | Ansible + Puppet + custom scripts | Infrastructure-as-code remediation |
| Validation | VulnHub + Atomic Red Team + Caldera | Continuous adversarial validation |
| Reporting | Grafana + custom dashboards + LLM synthesis | Real-time metrics and executive summaries |
---
## The Honest Limitations
AI-assisted TVM is powerful but not magic. Be honest with the board:
| What AI TVM Does Well | What AI TVM Cannot Do |
|----------------------|----------------------|
| Prioritizes faster and smarter than humans | Cannot patch systems without human approval and testing |
| Generates remediation scripts and policies | Cannot fix architectural debt or design flaws |
| Predicts which vulnerabilities will be exploited | Cannot predict zero-days before disclosure |
| Validates fixes continuously | Cannot replace basic hygiene (CIS IG1 is still mandatory) |
| Reduces analyst workload by 70% | Cannot operate without skilled human oversight |
**The framing**:
> *"AI-assisted TVM does not replace our need to implement CIS IG1, harden our endpoints, and govern our identities. What it does is compress the vulnerability management cycle from months to days—giving us a fighting chance against adversaries who operate at machine speed. It is the accelerator. Basic hygiene is still the foundation."*
---
## Integration With Existing Frameworks
| Document | Integration Point |
|----------|-------------------|
| [Rapid Modernisation Plan](rapid-modernisation-plan.md) | AI TVM maps to Phase 1 (Hygiene: visibility), Phase 2 (Control: prioritized remediation), and Phase 4 (Antifragility: continuous learning) |
| [Modular Engagements](../core/modular-engagements.md) | AI TVM can be delivered as a standalone 90-day module or embedded in Module 3 (M365 Security Hardening) and Module 12 (Blue/Purple Team) |
| [Zero-Budget Hardening](zero-budget-hardening.md) | AI TVM leverages existing Microsoft tooling (Defender Exposure Management, Intune) before recommending new purchases |
| [Osquery: The Sovereign Discovery Platform](osquery-custom-platform.md) | osquery provides the owned, queryable data layer for AI prioritization; FleetDM enables continuous endpoint monitoring |
| [Azure OpenAI Sovereignty Bridge](../core/azure-openai-sovereignty-bridge.md) | Azure OpenAI can power the prioritization and synthesis layers; local AI can power air-gapped environments |
| [Antifragile Risk Register](../assessment-templates/antifragile-risk-register.md) | AI TVM directly addresses vulnerability-related risks with convex payoff: small AI investment prevents catastrophic exploitation |
---
## Metrics and KPIs
| Metric | Before | 30-Day Target | 90-Day Target |
|--------|--------|--------------|---------------|
| Mean time to prioritize critical vuln | 14 days | 24 hours | 4 hours |
| Mean time to remediate critical vuln | 60 days | 14 days | 4 days |
| Vulnerabilities with known exploits (open) | Unknown | Measured | <10 |
| % of estate with current scan coverage | 60% | 90% | 98% |
| AI prediction accuracy (exploited vs. not) | N/A | 70% | 85% |
| Time to generate remediation script | 2 days | 2 hours | 30 minutes |
| Executive brief generation time | 8 hours | 30 minutes | 5 minutes (automated) |
| Purple team detection rate (open vulns) | Unknown | 50% | 80% |
---
*For the AI operations inevitability argument, see [AI Operations Inevitability](../core/ai-operations-inevitability.md).*
*For the business case template, see [Business Case Template](business-case-template.md).*
*For board conversation guidance, see [C-Suite Conversation Guide](../core/c-suite-conversation-guide.md).*
# Business Case Template
> *"The board does not buy security. The board buys risk reduction, regulatory survival, and competitive advantage. Price it accordingly."*
This template provides a reusable structure for building financial justification for antifragile engagements. It is designed to be adapted per client, per vertical, and per regulatory context. The output should be a 4-6 page document that a CFO can evaluate in 15 minutes.
---
## Document Structure
### Page 1: Executive Summary
**Subtitle**: *Investment Proposal: Antifragile Enterprise Program*
| Element | Content |
|---------|---------|
| **Investment ask** | €[X] over 180 days, phase-gated with go/no-go decisions at days 30, 60, 90 |
| **Primary return** | Reduction of existential cyber risk; regulatory compliance evidence; competitive differentiation through AI sovereignty |
| **Break-even** | Day 90 (via avoided regulatory fine exposure, reduced insurance premiums, or operational resilience) |
| **Risk of inaction** | Quantified below; summary: [X]% probability of material incident within 24 months at estimated cost of €[Y] |
### Page 2: Cost of Inaction
**Frame**: The most expensive decision is the one not to act.
#### Direct Costs (Quantifiable)
| Risk Category | Probability (Client-Specific) | Average Industry Cost | Expected Value |
|--------------|------------------------------|----------------------|----------------|
| Ransomware incident (recovery + downtime) | [X]% | €4.5M | €[X * 4.5M] |
| Regulatory fine (DORA / NIS2 / national) | [X]% | 1-2% global turnover | €[X * % GT] |
| Data breach notification and remediation | [X]% | €3.8M (per IBM Cost of Data Breach Report) | €[X * 3.8M] |
| Cloud AI vendor price increase / lock-in | [X]% | 200-500% price shock | €[X * shock] |
| Competitive intelligence loss (cloud AI training) | [X]% | Unquantifiable but existential | High |
**Calculation**:
```
Expected Loss = Σ (Probability_i × Cost_i)
```
Present this as: *"Without intervention, the organization faces an expected loss of €[X] over 24 months. The proposed program costs €[Y], representing a [Z]:1 return on risk reduction."*
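A worked instance of the expected-loss sum, using the cost figures from the table above and purely illustrative probabilities (the regulatory figure assumes a 1% fine on a hypothetical €800M turnover):

```python
# (probability over 24 months, cost per event) — probabilities are
# illustrative placeholders, to be replaced with client-specific estimates
risks = {
    "ransomware":  (0.15, 4_500_000),
    "data_breach": (0.10, 3_800_000),
    "regulatory":  (0.05, 8_000_000),  # assumed 1% of €800M turnover
}

# Expected Loss = Σ (Probability_i × Cost_i)
expected_loss = sum(p * cost for p, cost in risks.values())
```

With these placeholders the expected loss comes to €1.455M, which is the number that anchors the "[Z]:1 return on risk reduction" framing.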
#### Indirect Costs (Narrative)
- **Reputational damage**: Customer churn, difficulty acquiring new business, talent attrition
- **Operational paralysis**: During an incident, leadership attention is diverted from growth to survival
- **Insurance premium increases**: Cyber insurers are tightening terms; resilience demonstrably reduces premiums
- **Regulatory scrutiny**: A single incident triggers multi-year regulatory attention and reporting obligations
---
### Page 3: Investment Structure
**Frame**: We spend your money as if it were our own. Configuration first. Purchase only if justified.
#### Phase-Gated Budget
| Phase | Timeline | Primary Activity | Estimated Cost | Go/No-Go Gate |
|-------|----------|-----------------|----------------|---------------|
| **1. Hygiene** | Days 0-30 | Configuration of existing tools; identity cleanse; visibility | €[X] (primarily labor) | Day 30: Demonstrate risk reduction or stop |
| **2. Control** | Days 30-60 | ASR, MFA enforcement, network segmentation, vendor lockdown | €[X] (labor + minimal tooling) | Day 60: Validate control effectiveness |
| **3. Sovereignty** | Days 60-90 | Local AI pilot; recovery drills; T0 asset protection | €[X] (labor + local inference hardware if needed) | Day 90: Prove local AI viability |
| **4. Antifragility** | Days 90-180 | Chaos engineering; red team; continuous improvement | €[X] (labor + external testing) | Day 180: Maturity assessment and next-phase planning |
| **Total** | 180 days | | **€[X]** | |
#### Cost Categories
| Category | Typical % of Budget | Description |
|----------|--------------------|-------------|
| Consulting / Labor | 60-70% | Configuration, process design, training, documentation |
| Existing Tool Activation | 0% | Included in current licensing; no new purchase |
| Local AI Infrastructure | 10-20% | Hardware or sovereign cloud for inference (only if pilot justifies) |
| External Testing | 10-15% | Red team, penetration testing, regulatory validation |
| Training / Change Management | 5-10% | Security awareness, champion programs, board briefings |
#### Compare to Alternatives
| Alternative Approach | Cost | Timeline | Risk |
|---------------------|------|----------|------|
| **Do nothing** | €0 | — | Expected loss €[X] over 24 months |
| **Traditional security audit** | €[X] | 90 days | Produces report; no structural change |
| **Full E5 licensing upgrade** | €[X]/user/year | 30 days | Solves some gaps; does not address architecture or AI sovereignty |
| **Managed security service (MSSP)** | €[X]/month | Ongoing | Outsources detection; does not reduce structural fragility |
| **Antifragile program (this proposal)** | €[X] | 180 days | Structural change, regulatory evidence, AI sovereignty, measurable resilience |
---
### Page 4: Return on Investment
**Frame**: The return is not revenue. It is **avoided cost + preserved optionality + regulatory license to operate**.
#### Quantifiable Returns
| Return Category | Calculation | 12-Month Value | 24-Month Value |
|----------------|-------------|---------------|----------------|
| Avoided ransomware recovery | Probability reduction × €4.5M | €[X] | €[Y] |
| Avoided regulatory fine | Probability reduction × % GT | €[X] | €[Y] |
| Insurance premium reduction | 10-20% reduction on cyber premium | €[X] | €[Y] |
| Cloud AI cost stabilization | Shift from variable API costs to fixed infra | €[X] | €[Y] |
| Reduced incident response cost | Faster detection and containment | €[X] | €[Y] |
| **Total Quantifiable Return** | | **€[X]** | **€[Y]** |
#### Strategic Returns (Narrative)
| Return Category | Description |
|----------------|-------------|
| **Competitive moat** | Proprietary data improves only your models; competitors cannot replicate your operational intelligence |
| **Regulatory agility** | Demonstrable resilience accelerates regulatory approvals, market entries, and partnership discussions |
| **Talent retention** | Engineers and security professionals prefer organizations that invest in durability over firefighting |
| **M&A readiness** | Clean identity architecture, tested recovery, and documented controls increase valuation and reduce due-diligence friction |
| **Vendor negotiation leverage** | Documented exit architectures improve negotiating position with all major suppliers |
#### ROI Summary
```
ROI = (Total Return - Total Investment) / Total Investment × 100%
```
Present as: *"This program delivers a [X]% return in year one, rising to [Y]% in year two, with strategic optionality that compounds beyond quantification."*
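The same formula as a one-liner, with illustrative figures (a €1.2M quantifiable return on a €400k program; both numbers are placeholders, not benchmarks):

```python
def roi_percent(total_return: float, investment: float) -> float:
    """ROI = (Total Return - Total Investment) / Total Investment × 100%."""
    return (total_return - investment) / investment * 100

roi = roi_percent(1_200_000, 400_000)  # 200% year-one ROI on these inputs
```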
---
### Page 5: Risk and Sensitivity Analysis
**Frame**: We are honest about what could go wrong. That honesty is why you should trust us.
#### Program Risks
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|-----------|
| Operational disruption during hygiene phase | Medium | Medium | Changes executed in maintenance windows; rollback procedures documented; "get out of jail free" executive authorization |
| Client team capacity constraints | High | Medium | Weekly sprints with clear priorities; we do the heavy lifting; client provides decisions, not labor |
| Scope creep | Medium | High | Ruthless phase gating; kill chain prioritization; deferred items tracked for future phases |
| Tool activation reveals deeper problems | High | Low | This is the point. Early discovery is cheaper than late discovery. |
| Executive sponsor departure | Low | High | Board-level endorsement; documented in steering committee minutes; knowledge transfer at each phase |
#### Sensitivity Analysis
| Scenario | Investment Adjustment | Outcome |
|----------|----------------------|---------|
| **Best case** | No additional tooling needed | Program completes under budget; all value from configuration |
| **Base case** | Local AI hardware required for pilot | Slight budget increase; sovereign intelligence proven |
| **Worst case** | Deeper technical debt than anticipated | Extend Phase 1 by 30 days; additional labor cost; still cheaper than incident |
---
### Page 6: Recommendation and Next Steps
**The Ask (Full Program)**:
> *"We recommend approval of a 180-day antifragile enterprise program, structured in four phases gated at days 30, 60, 90, and 180, each with a hard go/no-go decision. The initial 30-day investment is €[X] with a defined deliverable: identification and initial closure of the organizational kill chain. If measurable risk reduction is not demonstrated by Day 30, the program stops with no further obligation."*
**The Ask (Modular Alternative)**:
> *"Alternatively, we can start with a single, fixed-scope module chosen based on your highest-priority pain. Each module is 30-60 days, fixed price, with defined deliverables and a hard stop. If the value is proven, we proceed to the next module. If not, you have still received a complete, bounded solution. See [Modular Engagements](../core/modular-engagements.md) for the module menu."*
**Immediate Next Steps**:
| Step | Owner | Timeline |
|------|-------|----------|
| Executive sponsor designation | CEO / Board | Week 0 |
| Steering committee scheduling | COO / Chief of Staff | Week 0 |
| Data room access (AD, cloud IAM, network diagrams) | CISO / IT Director | Week 0 |
| SOW execution and kickoff | Procurement / Consultant | Week 1 |
| Week 1 stakeholder interviews | Consultant | Week 1 |
| Day 30 steering committee and go/no-go | Executive Sponsor | Day 30 |
---
## Vertical-Specific Financial Adjustments
### Banking
- **Regulatory fine exposure**: DORA fines up to 2% of global turnover; use client's actual global turnover
- **SWIFT CSP non-compliance**: Potential disconnection from SWIFT network; catastrophic for international payments
- **PSD2 SCA failure**: Transaction rejection rates, customer abandonment, regulator attention
- **Insurance context**: Many banks are self-insured for cyber; frame as direct balance-sheet protection
### Telco / Power (Critical Infrastructure)
- **NIS2 penalties**: Up to €10M or 2% of global turnover (whichever is higher)
- **Operational downtime**: Power outages measured in €/minute; telco downtime in subscriber churn
- **National security implications**: Some incidents trigger government intervention or nationalization risk
- **Supply chain**: Single vendor failure can disable critical infrastructure; optionality has direct monetary value
### Generic Enterprise
- **Ransomware**: Primary quantifiable risk; use industry averages if client-specific data unavailable
- **Business interruption**: Use revenue/day × estimated downtime
- **Reputation**: Use customer acquisition cost × estimated churn from breach notification
---
## The CFO Conversation: Key Metrics
When presenting to the CFO, lead with these metrics and no others:
1. **Expected loss without intervention** (24 months): €[X]
2. **Program cost**: €[Y]
3. **Risk reduction ROI**: [Z]%
4. **Cash payback period**: [X] days
5. **Probability of material incident**: [before]% → [after]%
Everything else is supporting detail.
---
## Template Appendix: Client-Specific Worksheets
### Worksheet 1: Revenue at Risk
```
Annual revenue: €_________
Revenue per day: €_________ (annual / 365)
Critical system downtime tolerance: _________ days
Revenue at risk from downtime: €_________ (revenue/day × tolerance)
```
### Worksheet 2: Regulatory Fine Exposure
```
Global turnover (if applicable): €_________
Applicable regulation: [DORA / NIS2 / National / None]
Maximum fine %: _________%
Maximum fine €: €_________
Probability of fine (current): _________%
Expected fine exposure: €_________
```
### Worksheet 3: Cloud AI Cost Trajectory
```
Current monthly cloud AI spend: €_________
Projected 24-month spend: €_________
Local AI infrastructure cost: €_________
Break-even month: _________
24-month savings: €_________
Data leakage risk (narrative): [Eliminated / Reduced / Unchanged]
```
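The worksheet arithmetic can be wrapped in two helpers so the blanks are filled consistently across clients. The example figures are placeholders only: €36.5M annual revenue with a 3-day downtime tolerance, and €20k/month cloud AI spend against a €180k one-off local build.

```python
import math

def revenue_at_risk(annual_revenue: float, downtime_days: float) -> float:
    """Worksheet 1: revenue/day × downtime tolerance."""
    return annual_revenue / 365 * downtime_days

def ai_break_even_month(monthly_cloud_spend: float,
                        local_infra_cost: float) -> int:
    """Worksheet 3: first month where cumulative cloud spend
    exceeds the one-off local infrastructure cost."""
    return math.ceil(local_infra_cost / monthly_cloud_spend)

rar = revenue_at_risk(36_500_000, 3)        # €300k at risk from 3 days down
month = ai_break_even_month(20_000, 180_000)  # local AI pays back in month 9
```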
---
*For the board conversation guide, see [C-Suite Conversation Guide](../core/c-suite-conversation-guide.md).*
*For the one-page executive summary, see [Executive Summary](../core/executive-summary.md).*
# Endpoint Management: The Antifragile Entry Vector
> *"Every client who asks you to manage their devices is actually asking you to see their blind spots. Endpoint management is the Trojan horse that gets you inside the perimeter—and from there, every other security conversation becomes natural."*
This playbook positions **endpoint management**—Microsoft Intune, Endpoint Manager, and modern device management—as the ideal entry vector for antifragile consulting engagements. It is designed for M365/Azure consultancies whose clients arrive with a specific, bounded request ("manage our devices" or "replace SCCM") and who need a structured path from that request to a comprehensive security transformation.
---
## Why Endpoint Management Is the Perfect Entry Vector
### 1. Clients Ask for It
Unlike abstract security frameworks, endpoint management solves **immediate, visible pain**:
| Client Pain | Why They Call |
|-------------|--------------|
| "We need to manage remote worker laptops" | COVID-era remote work became permanent; devices are invisible |
| "We are retiring SCCM and moving to the cloud" | On-premise management infrastructure is end-of-life or too expensive |
| "We need mobile device management for field staff" | Tablets and phones access email and customer data with no oversight |
| "Our auditor asked for proof of device compliance" | Regulatory gap; no evidence that devices meet security baselines |
| "We bought Intune licenses and never turned them on" | Common scenario: E3/E5 includes Intune but deployment stalled |
| "Users install whatever they want" | Shadow IT on endpoints; malware risk; unlicensed software |
**The insight**: Every one of these requests is a **symptom of deeper fragility**. The client sees the device problem. You see the identity, data, network, and governance problem that the device problem reveals.
### 2. It Creates Immediate Visibility
Once Intune is deployed, you can see:
- Every managed device: OS version, patch level, encryption status
- Every application installed: sanctioned and shadow
- Every configuration drift: firewall off, AV disabled, unknown admin accounts
- Every compliance failure: unencrypted disk, missing updates, jailbroken phone
This visibility is **the foundation of everything else**. You cannot harden what you cannot see. You cannot govern what you cannot inventory.
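The dashboard rollup behind that visibility is a simple aggregation over device records. A minimal sketch: the record fields mirror what Intune reports (compliance state, encryption, ownership) but the shape here is an assumption for illustration, not the Microsoft Graph schema.

```python
from collections import Counter

def compliance_summary(devices: list[dict]) -> dict:
    """Roll device records up into the headline dashboard numbers:
    how many devices are managed, how many fail compliance, and
    which specific machines are unencrypted."""
    states = Counter(d["compliance"] for d in devices)
    unencrypted = [d["name"] for d in devices if not d["encrypted"]]
    return {
        "managed": len(devices),
        "noncompliant": states["noncompliant"],
        "unencrypted": unencrypted,  # named, so they can be chased down
    }

devices = [
    {"name": "LT-001", "compliance": "compliant", "encrypted": True},
    {"name": "LT-002", "compliance": "noncompliant", "encrypted": False},
]
summary = compliance_summary(devices)
```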
### 3. It Touches Every Security Domain
Endpoint management is not an island. It is the **intersection point** of:
| Domain | Endpoint Management Connection |
|--------|-------------------------------|
| **Identity** | Device compliance becomes a conditional access signal; non-compliant devices cannot access data |
| **Network** | VPN profiles, certificate deployment, DNS settings, Wi-Fi security |
| **Data** | DLP enforcement at the endpoint; remote wipe; encryption policies |
| **Application** | App deployment, update management, software inventory, browser policies |
| **Threat detection** | EDR onboarding (Defender for Endpoint), ASR rule deployment, vulnerability visibility |
| **AI governance** | Devices are where shadow AI usage happens; endpoint visibility reveals unsanctioned AI tools |
### 4. It Produces Visible Results Fast
In 30 days, a client can see:
- A dashboard of all their devices
- Non-compliant devices highlighted in red
- Policies pushing encryption, updates, and security baselines
- Remote workers no longer "flying blind"
This builds **trust and political capital** for the harder conversations that follow.
---
## The Trojan Horse Strategy
### The Opening Request
> *"Can you help us deploy Intune? We need to manage our laptops and phones."*
### Your Response
> *"Absolutely. And while we are deploying Intune, we will see things that need attention—accounts that should not exist, devices that are not encrypted, applications that are leaking data. We will fix the device problem in 30 days. But we will also give you a map of what we found, because the device is usually where the bigger problems show up first."*
### The Natural Expansion Path
| Endpoint Management Phase | What We Discover | What We Propose Next |
|--------------------------|-----------------|---------------------|
| **Device enrollment and inventory** | Orphaned AD accounts, devices with no owner, unknown machines on the network | Identity hygiene blitz; CMDB seeding |
| **Compliance policy deployment** | No disk encryption, outdated OS, missing patches, legacy authentication | Endpoint hardening; patch management; ASR rules |
| **Application management** | Shadow IT, unlicensed software, consumer AI apps on corporate devices | Application governance; sanctioned AI alternative (Azure OpenAI bridge) |
| **Conditional access integration** | No device-based access control; same credentials work from any device anywhere | Identity security architecture; MFA enforcement; location policies |
| **Remote worker security** | Home networks, personal printers, USB devices, split tunneling | Zero-trust architecture; DNS security; data loss prevention |
---
## The 30-60-90 Day Endpoint Management Sprint
### Phase 1: Visibility (Days 0-30)
**Objective**: Know every device. Know its state. Know its owner.
| Week | Action | Deliverable | Natural Discovery |
|------|--------|-------------|-------------------|
| 1 | Tenant readiness review: Intune licensing, roles, connectors, update rings | Readiness report | Often finds unused E5 Security licenses; orphaned Intune configs from previous attempts |
| 1 | AD/AAD device inventory: What devices exist? Which are managed? Which are not? | Device census spreadsheet | Ghost devices; stale computer accounts; devices with no owner |
| 2 | Enrollment campaign: Auto-enrollment for AAD-joined devices; manual for BYOD/COPE | Enrollment metrics (% managed) | Users with multiple unmanaged devices; non-standard hardware |
| 2 | Compliance baseline: Encryption, OS version, password policy, firewall | Compliance dashboard | Massive non-compliance: unencrypted disks, outdated Windows, disabled firewalls |
| 3 | Application inventory: Installed apps via Intune inventory or WDAC/AppLocker audit | Application report | Shadow IT goldmine: unauthorized VPNs, consumer cloud storage, AI apps, games |
| 3 | Policy deployment (audit mode): Push basic policies without enforcement to measure impact | Policy readiness report | Devices that will break; apps that will be blocked; users who will be affected |
| 4 | Enforcement (gradual): Enable policies in waves; prioritize highest-risk users | Enforcement wave report | Executive devices that were never managed; admin machines with no PAW |
**The Phase 1 conversation**:
> *"We now manage 85% of your devices. Twenty-three devices are unencrypted. Fourteen are running Windows versions that no longer receive security updates. Seven users have installed consumer AI tools that send data to third-party clouds. We fixed the device management request. Here is what we found—and here is what we should fix next."*
---
### Phase 2: Control (Days 30-60)
**Objective**: Ensure every managed device meets the security baseline. Eliminate the highest-risk gaps.
| Week | Action | Deliverable |
|------|--------|-------------|
| 5 | Encryption enforcement: BitLocker (Windows), FileVault (macOS) | Encryption coverage: 100% of managed devices |
| 5 | Update rings: Deploy Windows Update for Business; test and production rings | Patch compliance report |
| 6 | Application control: Block known-bad categories; require approved app installation | Application control policy deployed |
| 6 | Browser hardening: Edge/Chrome policies, extension management, safe browsing | Browser security baseline |
| 7 | Conditional access integration: Device compliance as access signal | CA policies: compliant device required for M365 access |
| 7 | Admin device hardening: PAW enrollment, dedicated admin profiles, restricted browsing | Admin device compliance: 100% |
| 8 | Mobile device hardening: iOS/Android app protection policies, jailbreak detection | Mobile compliance report |
| 8 | DNS and network: Deploy secure DNS (DoH/DoT) via Intune profile | Network security baseline |
**The Phase 2 conversation**:
> *"Your devices are now encrypted, patched, and compliant. Only managed, healthy devices can access your email and documents. But we also discovered that your conditional access policies do not exist yet—so a stolen password from an unmanaged device still works. That is the next bridge to cross."*
---
### Phase 3: Sovereignty and Expansion (Days 60-90)
**Objective**: Use endpoint visibility to drive broader security transformation.
| Week | Action | Deliverable |
|------|--------|-------------|
| 9 | Shadow AI discovery: Review application inventory for AI/ML tools; proxy log correlation | Shadow AI report |
| 9 | Sanctioned AI deployment: Azure OpenAI bridge or local AI alternative for approved use | AI governance pilot |
| 10 | EDR deployment: Defender for Endpoint (if E5) or Wazuh/Sysmon augmentation (if E3) | EDR coverage report |
| 10 | Vulnerability management: Integrate Intune compliance data with vulnerability prioritization | Risk-based patch prioritization |
| 11 | Data loss prevention: Endpoint DLP policies (if Purview licensed) or manual controls | DLP baseline |
| 11 | Recovery validation: Test remote wipe, device replacement workflow, backup of device config | Recovery procedure tested |
| 12 | Governance handover: Client team trained on Intune operations; runbooks documented; monitoring automated | Operational handover complete |
**The Phase 3 conversation**:
> *"Your endpoint estate is now managed, hardened, and visible. From here, the natural next steps are identity hardening—because devices are only as strong as the accounts that access them—and AI sovereignty—because we found consumer AI tools on twelve corporate devices that are sending your data to third parties. We can fix both in the next 90 days."*
---
## Client Archetypes and Approach
### Archetype 1: The SCCM Retiree
**Profile**: Mature on-premises environment; SCCM managing thousands of devices; leadership wants cloud-native device management.
**Entry conversation**:
> *"SCCM has served you well, but it requires infrastructure, VPN connectivity, and on-premises presence. Intune manages devices wherever they are—home, hotel, airport—without VPN. We can run SCCM and Intune in parallel during migration, then retire SCCM once coverage is proven. During the migration, we will also modernize your security baselines because they have likely not been updated since the SCCM deployment began."*
**Key considerations**:
- Co-management (SCCM + Intune) as a transitional state
- Task sequence migration to Intune proactive remediations and PowerShell scripts
- Windows Update for Business replacing WSUS
- Driver and firmware update strategy (Intune is weaker here; plan for Windows Update for Business or third-party tools)
### Archetype 2: The Remote-First Convert
**Profile**: Post-COVID organization; devices scattered globally; no visibility into home office security.
**Entry conversation**:
> *"Your devices are in forty home offices, three countries, and an unknown number of coffee shops. You currently have no visibility into whether they are encrypted, patched, or compromised. Intune gives you that visibility in two weeks. From there, we can enforce compliance so that only healthy devices access company data—regardless of where the device is physically located."*
**Key considerations**:
- BYOD vs. corporate-owned: define the boundary clearly
- Privacy regulations: employee monitoring on personal devices requires legal review
- Network security: home Wi-Fi is untrusted; DNS security and VPN policies critical
- Licensing: Intune is included in E3; no additional purchase required for basic MDM
### Archetype 3: The Compliance-Driven Client
**Profile**: Regulated industry (banking, healthcare, critical infrastructure); auditor found device management gaps; needs evidence.
**Entry conversation**:
> *"Your auditor wants proof that every device accessing customer data is encrypted, patched, and compliant. Intune does not just achieve compliance—it generates the evidence automatically. Every device reports its state. Every policy violation is logged. Every remediation is tracked. When the auditor returns, you show them a dashboard, not a prayer."*
**Key considerations**:
- Evidence retention: compliance reports must be retained for auditor review
- Segregation: regulated devices may need separate compliance policies
- Documentation: every policy must have a business justification for auditor review
### Archetype 4: The Intune License Hoarder
**Profile**: Bought E3/E5 years ago; Intune was never deployed; licenses are "shelfware."
**Entry conversation**:
> *"You are already paying for Intune. It is included in your E3 licenses. Deploying it costs nothing beyond our time—and it will reveal whether you are getting value from the rest of your Microsoft investment. We often find that organizations with unused Intune also have unused MFA, unused conditional access, and unused Defender features. Intune is the first domino."*
**Key considerations**:
- Zero incremental licensing cost is a powerful argument
- Often reveals other underutilized E3/E5 capabilities
- Fastest path to visible ROI
---
## E3 vs. E5 Endpoint Management
| Capability | In E3 | In E5 | Practical Impact |
|-----------|-------|-------|------------------|
| Intune MDM/MAM | Yes | Yes | Full device and app management |
| Windows Update for Business | Yes | Yes | Cloud-native patching |
| BitLocker management | Yes | Yes | Encryption deployment and key escrow |
| Defender Antivirus | Yes | Yes | Basic AV configuration via Intune |
| **Defender for Endpoint (EDR)** | **No** | **Yes** | Behavioral detection, threat hunting, automated investigation |
| **Advanced compliance policies** | **Basic** | **Enhanced** | Risk-based conditional access integration |
| **Endpoint DLP** | **No** | **Yes** (Purview) | Data loss prevention at the endpoint |
| **Attack Surface Reduction (ASR)** | **No** | **Yes** | Exploit protection, controlled folder access |
**The E3 approach**:
- Intune for configuration, compliance, and application management
- Sysmon + Wazuh for EDR-like visibility
- Manual vulnerability prioritization
- LAPS for local admin password management
**The E5 approach**:
- Everything in E3, plus Defender for Endpoint full EDR
- ASR rules deployed via Intune
- Automated investigation and remediation
- Endpoint DLP for data governance
- Threat analytics and vulnerability management integration
---
## Converting Endpoint Management Into Antifragile Engagement
### The 30-Day Pivot
At the 30-day steering committee, present:
1. **Device management results**: enrollment %, compliance %, encryption %
2. **Discovery findings**: the top 5 security gaps revealed by device visibility
3. **The expansion proposal**: 60-90 day roadmap to address those gaps
**Example pivot**:
> *"We enrolled 340 devices and achieved 94% compliance. During enrollment, we discovered 12 devices with consumer AI tools sending data to third-party clouds, 8 accounts with standing global admin rights, and no conditional access policies at all. The device problem is solved. We now propose a 60-day identity and access hardening sprint to close the gaps we found."*
### The Natural Service Ladder
```
Month 1: Endpoint Management (Intune deployment)
↓ Discovery of identity, app, and data gaps
Month 2-3: Identity Hardening (MFA, conditional access, PIM)
↓ Discovery of shadow AI and data leakage
Month 4-6: AI Sovereignty (Azure OpenAI bridge, local AI pilot)
↓ Discovery of architectural fragility
Month 6-12: Antifragile Architecture (exit architectures, chaos engineering, red team)
```
---
## Talking Points for Executives
### For the CEO
> *"Your employees are working from home offices, airports, and coffee shops on devices you cannot see. Intune gives you visibility in two weeks and control in four. It is not surveillance—it is ensuring that the device accessing your strategy documents is encrypted, patched, and owned by your company, not a contractor with a personal laptop."*
### For the CFO
> *"You already own Intune. It is included in your E3 licenses. We are not selling you software. We are extracting value you have already paid for. The average organization with E3 uses less than 40% of included security capabilities. Intune is the fastest way to prove ROI on existing licensing."*
### For the CISO
> *"Intune is not just device management. It is the enforcement point for every other security control. Your conditional access policies are useless if they cannot evaluate device health. Your DLP policies are toothless if they do not apply to endpoints. Your identity security is theoretical if stolen credentials work from any unmanaged device. Intune makes the rest of your security stack actually work."*
### For the IT Director
> *"We know SCCM has been reliable. But it requires VPN, on-premises infrastructure, and manual touch. Intune automates what SCCM does and adds capabilities SCCM cannot: mobile device management, application protection on personal devices, and cloud-native patching without VPN. We run them in parallel, migrate gradually, and retire SCCM only when you are confident."*
---
## Integration With Existing Frameworks
| Framework Document | Integration Point |
|-------------------|-------------------|
| [M365 E3 Hardening](m365-e3-hardening.md) | Intune is the primary E3 endpoint management tool; this document extends it with entry-vector strategy |
| [M365 Antifragile Project](m365-antifragile-project.md) | Endpoint management is a core workstream in both greenfield and modernisation projects |
| [Rapid Modernisation Plan](rapid-modernisation-plan.md) | Phase 1 (Hygiene) device visibility maps directly to endpoint management deployment |
| [Zero-Budget Hardening](zero-budget-hardening.md) | Intune is free in E3; Sysmon/Wazuh augment E3 endpoint security without new purchases |
| [Azure OpenAI Sovereignty Bridge](../core/azure-openai-sovereignty-bridge.md) | Device application inventory reveals shadow AI; Intune becomes the enforcement point for sanctioned AI |
| [AI Operations Inevitability](../core/ai-operations-inevitability.md) | Endpoints are where defensive AI agents run; managed endpoints are prerequisite for AI-driven endpoint security |
---
*For the M365 E3 hardening specifics, see [M365 E3 Hardening](m365-e3-hardening.md).*
*For the rapid modernisation plan, see [Rapid Modernisation Plan](rapid-modernisation-plan.md).*
*For the M365 antifragile project playbook, see [M365 Antifragile Project](m365-antifragile-project.md).*

# Implementation Playbook
> *"This is not an upgrade. It is an insurance policy against the obsolescence of your own company."*
This playbook provides tactical, step-by-step guidance for delivering the [Rapid Modernisation Plan](rapid-modernisation-plan.md) in a client environment. It is organized by workstream and intended for hands-on consultants, security architects, and technical leads.
---
## Table of Contents
1. [Engagement Kickoff](#engagement-kickoff)
2. [Workstream: Identity and Access](#workstream-identity-and-access)
3. [Workstream: Perimeter and Visibility](#workstream-perimeter-and-visibility)
4. [Workstream: AI Sovereignty](#workstream-ai-sovereignty)
5. [Workstream: Resilience and Recovery](#workstream-resilience-and-recovery)
6. [Workstream: Culture and Governance](#workstream-culture-and-governance)
7. [Common Failure Modes](#common-failure-modes)
8. [Tools and Templates](#tools-and-templates)
---
## Engagement Kickoff
### Pre-Engagement Checklist
Before arriving on-site or starting the remote engagement:
- [ ] Client has signed SOW with explicit scope, authority, and escalation paths
- [ ] Key stakeholders identified: CISO, CIO, legal, business unit sponsors
- [ ] Initial data room access granted: AD exports, cloud IAM, network diagrams, CMDB if one exists
- [ ] Emergency contact list established with authority to disable accounts / block access
- [ ] Backup verification: confirm backups exist and have been tested within the last 90 days
- [ ] "Get out of jail free" letter: written executive authorization for disruptive security actions
### Day 0: Stakeholder Interviews
Interview each stakeholder for 30 minutes. Ask the same five questions:
1. What is the shortest path to a business-ending incident here?
2. What are you most worried about that you are not telling the board?
3. What is the one system whose failure would stop revenue for 24 hours?
4. Where is your proprietary data going that you cannot fully track?
5. If you had to replace your primary cloud vendor in 90 days, could you?
Document answers. Look for contradictions between stakeholders—these reveal hidden dependencies.
### Day 0: Establish the War Room
- Physical or virtual space for daily standups
- Shared dashboard: tasks, blockers, risks
- Direct escalation path to executive sponsor
- Decision log: every major decision recorded with rationale and owner
---
## Workstream: Identity and Access
### Objective
Eliminate unknown identities, reduce privilege, and establish just-in-time access before attackers exploit standing permissions.
### Week 1: Identity Census
**Step 1: Export all identities**
- Active Directory: all users, groups, computers, service accounts
- Cloud IAM: AWS IAM, Azure AD / Entra ID, GCP IAM
- SaaS platforms with local identity stores
- Non-human identities: API keys, service principals, OAuth apps, managed identities
**Step 2: Deduplicate and correlate**
- Match cloud identities to on-premises identities
- Identify orphaned accounts: no owner, no recent use, no documented purpose
- Identify over-privileged accounts: admin rights without justification
**Step 3: Categorize by risk**
| Category | Action | Timeline |
|----------|--------|----------|
| Orphaned, unused > 90 days | Disable immediately | Day 1-2 |
| Shared accounts | Target for elimination or vaulting | Week 1-2 |
| Admin / privileged | Force password rotation + MFA enforcement | Day 3-5 |
| Service accounts with interactive logon | Review and restrict | Week 1-2 |
| External / vendor access | Audit and time-bound | Week 1-2 |
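The "unused > 90 days" pass in the first row can be scripted against the identity export from Step 1. A sketch, assuming a CSV with a last-sign-in column (`users.csv` and `LastSignIn` are hypothetical names — adjust to your export format):

```powershell
# Flag accounts with no sign-in for 90+ days, or no recorded sign-in at all
# (hypothetical users.csv with a LastSignIn column -- adapt to your export)
$cutoff = (Get-Date).AddDays(-90)
Import-Csv users.csv |
    Where-Object { -not $_.LastSignIn -or [datetime]$_.LastSignIn -lt $cutoff } |
    Export-Csv orphan-candidates.csv -NoTypeInformation
```

Review each candidate with its owner before disabling: service accounts often show no interactive sign-in yet are load-bearing.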
### Week 2: Privilege Reduction
**Step 1: Implement Privileged Access Workstations (PAWs)**
- Dedicated machines for admin tasks
- No internet browsing, no email, no non-admin applications
- Physical or strongly virtualized separation
**Step 2: Deploy Just-in-Time (JIT) elevation where possible**
- Azure AD PIM, AWS IAM Identity Center, or third-party PAM
- Maximum elevation duration: 4 hours
- Require approval for standing admin roles
**Step 3: Password hygiene enforcement**
- Minimum 14 characters, no complexity requirements (NIST 800-63B)
- Audit against known-breached password lists
- Eliminate password rotation mandates unless compromise suspected
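The breached-password audit can use the public Have I Been Pwned range API, which works on k-anonymity: only the first five hex characters of the SHA-1 hash ever leave your network. A PowerShell 7 sketch for a single candidate password (the literal password is illustrative only):

```powershell
# Check one candidate password against the HIBP corpus (k-anonymity range API)
$pw     = 'Summer2024!'   # illustration only -- never hard-code real passwords
$sha1   = [BitConverter]::ToString(
              [System.Security.Cryptography.SHA1]::Create().ComputeHash(
                  [System.Text.Encoding]::UTF8.GetBytes($pw))).Replace('-','')
$prefix = $sha1.Substring(0,5); $suffix = $sha1.Substring(5)
$range  = Invoke-RestMethod "https://api.pwnedpasswords.com/range/$prefix"
if ($range -match $suffix) { "Found in breach corpus -- reject this password" }
```

For estate-wide audits, Entra ID Password Protection or offline hash-comparison tooling scales better than per-password API calls.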
### Week 3-4: MFA and Conditional Access
- Enforce MFA on all remote access: VPN, cloud admin, RDP gateways
- Implement risk-based conditional access:
- Unmanaged device → require MFA + compliant device
- Impossible travel → block or step-up
- Legacy authentication → block entirely
### Common Pitfalls
- **Over-scoping**: Do not attempt to fix every identity in 30 days. Focus on privileged and external accounts first.
- **Breaking automation**: Service account password rotations can break CI/CD. Coordinate with application owners. Test in non-production first.
- **Shadow IT identities**: SaaS platforms with standalone accounts (Slack, Zoom, etc.) are often missed. Use email domain scanning or CASB tools to find them.
---
## Workstream: Perimeter and Visibility
### Objective
Know exactly what the organization looks like from the outside, and monitor every path that crosses the trust boundary.
### Week 1-2: External Attack Surface Mapping
**Step 1: Passive reconnaissance**
- Enumerate subdomains: certificate transparency logs, DNS brute force, search engine dorks
- Identify exposed services: Shodan, Censys, custom port scanning from external vantage points
- Map cloud assets: public S3 buckets, open storage accounts, exposed databases
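Certificate transparency logs are often the fastest passive source. A sketch using the public crt.sh JSON endpoint (the domain is a placeholder; enumerate only assets you are authorized to assess):

```powershell
# Passive subdomain enumeration from certificate transparency logs
$domain = 'example.com'   # placeholder -- the client's domain, with written authorization
Invoke-RestMethod "https://crt.sh/?q=%25.$domain&output=json" |
    ForEach-Object { $_.name_value -split "`n" } |
    Where-Object { $_ -notmatch '^\*' } |   # drop wildcard certificate entries
    Sort-Object -Unique
```

The output seeds the ownership-confirmation step below; expect forgotten staging, marketing, and acquisition-era hosts.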
**Step 2: Active validation**
- Confirm ownership of discovered assets with client
- Test for default credentials on exposed management interfaces
- Document findings with risk ratings: P0 (immediate), P1 (urgent), P2 (planned)
### Week 2-3: Internal Visibility
**Step 1: Deploy endpoint detection**
- Microsoft Defender for Endpoint, CrowdStrike, SentinelOne, or equivalent
- Target: 100% of managed Windows, macOS, Linux endpoints
- Validate: can you see process execution, network connections, and file modifications?
**Step 2: Network monitoring**
- Deploy sensors at:
- Internet boundary
- Internal network segments (especially IT/OT boundaries)
- Critical server VLANs
- Enable DNS query logging and analysis
**Step 3: Log aggregation**
- Centralize logs from: identity systems, endpoints, firewalls, cloud control planes, critical applications
- Minimum retention: 90 days hot, 1 year cold
- Ensure tamper protection: attackers delete logs
### Week 3-4: CMDB Seeding
- Populate CMDB with T0 and T1 assets first
- For each asset: owner, criticality, dependencies, recovery requirements
- Accept imperfection. A partially correct CMDB is infinitely better than no CMDB.
### Common Pitfalls
- **Scanning without authorization**: Ensure written approval for active scanning. Some jurisdictions treat unauthorized scanning as criminal.
- **Alert fatigue**: Do not enable every detection rule on day one. Start with high-confidence, high-impact alerts. Tune before expanding.
- **Log storage costs**: Centralized logging is expensive. Prioritize critical systems. Use tiered storage.
---
## Workstream: AI Sovereignty
### Objective
Convert intelligence from a rented commodity into an owned, protected, T0-class asset.
### Week 1-2: AI Usage Discovery
**Step 1: Survey**
- Interview department heads: engineering, legal, marketing, operations, finance
- Ask: "What AI tools are you using? What data are you putting into them?"
- Expect 30-50% shadow usage. Employees use personal ChatGPT accounts, browser extensions, and mobile apps.
**Step 2: Technical discovery**
- Review proxy logs for AI API traffic: OpenAI, Anthropic, Google, Azure OpenAI
- Review SaaS billing for AI-enabled tools
- Review browser extensions and endpoint software inventories
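Proxy-log review for AI traffic can start as simple hostname matching. A sketch — the `proxy.log` format, first-field-is-client assumption, and the domain list are all placeholders to extend with vendors relevant to the client:

```powershell
# Count hits against known AI API endpoints in an exported proxy log
$aiHosts = 'api.openai.com','api.anthropic.com','generativelanguage.googleapis.com','openai.azure.com'
$pattern = ($aiHosts | ForEach-Object { [regex]::Escape($_) }) -join '|'
Select-String -Path proxy.log -Pattern $pattern |
    Group-Object { ($_.Line -split '\s+')[0] } |   # assumes the first field identifies the client
    Sort-Object Count -Descending |
    Select-Object Count, Name
```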
**Step 3: Data flow mapping**
For each discovered AI tool, document:
- Data types entering the tool
- Data residency and processing location
- Vendor terms: training use, retention, deletion, subprocessing
- Regulatory implications: GDPR, DORA, NIS2, industry-specific
### Week 3-4: Local AI Infrastructure
**Step 1: Select hardware or sovereign cloud**
| Option | When to Use |
|--------|-------------|
| On-premise GPU servers | High volume, strict air-gap, existing data centre capacity |
| Sovereign cloud (EU, national) | Regulatory requirements, no on-premises GPU expertise |
| Edge inference nodes | Distributed organization, OT environments, low-latency requirements |
**Step 2: Select initial model**
For most organizations, start with:
- **Base model**: Llama 3, Mistral, or Qwen (7B-13B parameters, quantized to 4-bit)
- **Deployment**: Ollama, vLLM, or llama.cpp for inference
- **Orchestration**: LangChain or custom RAG pipeline for proprietary data integration
- **Fine-tuning**: QLoRA for domain adaptation on proprietary datasets
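At the simplest end of that stack, a pilot can run on a single host with Ollama: pull a quantized model, then exercise its local HTTP API (the model tag and prompt are illustrative):

```powershell
# Fetch and serve a quantized base model locally
ollama pull llama3:8b
# One-off inference via the local API (Ollama listens on localhost:11434 by default)
$body = @{ model = 'llama3:8b'; prompt = 'Draft a phishing-awareness reminder.'; stream = $false } | ConvertTo-Json
(Invoke-RestMethod -Method Post -Uri 'http://localhost:11434/api/generate' -Body $body).response
```

No prompt or response leaves the host — the property the T0 controls in the next step are designed to preserve.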
**Step 3: Deploy with T0 controls**
- Network segmentation: inference hosts have no direct internet egress
- Access control: model weights encrypted at rest; access requires multi-party approval
- Audit: log all prompts, responses, and model access
- Backup: immutable backups of weights, configurations, and vector databases
### Week 5-8: Pilot and Measure
Select one high-value, low-risk workflow:
| Workflow | Why It Works |
|----------|-------------|
| Internal code review assistant | Proprietary code never leaves perimeter; measurable quality improvement |
| Security log analysis | Feeds defensive AI directly; reduces analyst workload |
| Policy / compliance document drafting | High volume, repetitive, proprietary domain knowledge |
| Customer support triage | Reduces response time; training data is historical tickets |
**Measurement criteria**:
- Accuracy vs. cloud baseline (human-evaluated on a sample)
- Cost per inference (compute + personnel)
- Data leakage incidents: zero
- User satisfaction: qualitative survey
### Common Pitfalls
- **Over-engineering the first deployment**: Do not build a full MLOps platform for the pilot. Start simple. Prove value. Then scale.
- **Ignoring GPU availability**: GPU procurement can take months. Have a cloud fallback for the pilot if on-premises hardware is delayed.
- **Neglecting prompt injection**: Local models are not immune to adversarial prompts. Implement input validation and output filtering.
- **Forgetting the human loop**: AI augments decisions; it does not replace accountability. Design workflows where humans retain final authority.
---
## Workstream: Resilience and Recovery
### Objective
Ensure that when—not if—a critical system fails, recovery is fast, tested, and deterministic.
### Week 1-4: Backup Validation
**Step 1: Inventory backup coverage**
- For every T0 and T1 asset: what is backed up, how often, where, by what mechanism
- Identify gaps: databases without point-in-time recovery, VMs without application-consistent snapshots
**Step 2: Test restoration**
- Select one critical system per week
- Perform full restoration to isolated environment
- Document: time to restore, data loss window, manual steps required, blockers encountered
**Step 3: Fix what breaks**
- If a backup cannot be restored, the backup does not exist
- Update procedures, fix tooling, re-test
### Month 2-3: Recovery Automation
- Automate the most common recovery scenarios: VM restore, database point-in-time recovery, Active Directory forest recovery
- Document runbooks for scenarios that cannot be fully automated
- Train multiple team members on each runbook
### Month 3-6: Chaos Engineering
**Step 1: Game days**
- Scheduled, announced simulations of failure scenarios
- Example: simulate domain controller failure during business hours
- Measure: detection time, escalation time, resolution time, communication quality
**Step 2: Chaos experiments**
- Unannounced, bounded experiments in non-production
- Example: terminate API service instances, block DNS resolution, fill disk space
- Validate: auto-scaling, alerting, runbook accuracy
**Step 3: Production chaos**
- Only after months of successful game days and non-production experiments
- Start with low-risk failures: single instance termination, network latency injection
- Always have automated rollback and a human kill switch
### Common Pitfalls
- **Assuming backups work**: Untested backups are prayers, not plans.
- **Recovery without validation**: A restored system that cannot authenticate users or connect to databases is not recovered.
- **Chaos without guardrails**: Never run chaos experiments when the organization is already under stress (active incident, change freeze, key personnel on leave).
---
## Workstream: Culture and Governance
### Objective
Embed antifragile principles into decision-making, hiring, and organizational habits.
### Tactics
**Blameless Post-Mortems**
- Within 48 hours of significant incident
- Focus: what about the system allowed this mistake? Not: who made the mistake?
- Mandatory output: at least one structural change (policy, architecture, or procedure)
- Publish internally: transparency builds trust and disseminates learning
**Security Champions Program**
- Identify one volunteer per team who acts as security liaison
- Monthly 1-hour meeting: new threats, policy changes, team-specific concerns
- Champions feed team context up and security guidance down
**Red Team as a Service**
- Monthly or quarterly adversarial simulations
- Report to CISO and board, not just IT
- Measure: time to detect, time to contain, time to evict
- Trend over time: the organization should get faster, not just more compliant
**Antifragile Metrics Review**
- Monthly steering committee reviews:
- Mean time to structural fix (from incident)
- Number of chaos experiments run and lessons learned
- % of vendor dependencies with documented exit plan
- AI sovereignty maturity score
### Common Pitfalls
- **Post-mortems without action**: If findings are not tracked to completion, they become theater.
- **Security champions without authority**: Champions need time allocation and executive backing, or they become scapegoats.
- **Metrics without narrative**: Numbers alone do not persuade boards. Pair metrics with stories: "Here is what we learned, here is what we changed, here is why we are safer."
---
## Common Failure Modes
| Failure Mode | Symptom | Remedy |
|-------------|---------|--------|
| **Scope creep** | 30-day phase stretches to 90 days | Time-box ruthlessly. Document deferred items for next phase. |
| **Tool obsession** | Team debates SIEM vendor for 3 weeks | Pick the good-enough tool. Implementation beats selection. |
| **Perfectionism** | CMDB project stalls waiting for completeness | Seed with critical assets. Expand iteratively. |
| **Vendor capture** | Recommendations always favor one provider | Disclose relationships. Maintain independence. Document alternatives. |
| **Executive fatigue** | Board stops attending updates | Lead with business risk, not technical detail. Show cost of inaction. |
| **Operational resistance** | IT refuses to disable legacy accounts | Use the "get out of jail free" letter. Escalate to executive sponsor. |
| **Pilot purgatory** | Local AI pilot runs forever without production migration | Define hard success criteria and production migration date before starting. |
---
## Tools and Templates
### Templates Included in This Repository
- [T0 Asset Classification Worksheet](../core/t0-asset-framework.md#t0-classification-worksheet)
- AI Usage Discovery Interview Guide (see Workstream: AI Sovereignty)
- Blameless Post-Mortem Template (to be added)
- Chaos Experiment Planning Template (to be added)
- Vendor Exit Architecture Template (to be added)
### Recommended External Tools
| Category | Options | Notes |
|----------|---------|-------|
| Endpoint Detection | Microsoft Defender, CrowdStrike, SentinelOne | Choose based on existing Microsoft footprint |
| SIEM / Log Analysis | Sentinel, Splunk, Elastic, Wazuh | Wazuh is open-source and sufficient for many environments |
| Identity Governance | Azure AD / Entra ID, Okta, Saviynt | Match to primary cloud identity provider |
| PAM / Vault | CyberArk, Delinea, HashiCorp Vault | Essential for service account and secret management |
| CMDB | ServiceNow, Device42, GLPI, or spreadsheet | Any CMDB is better than no CMDB |
| Local AI Inference | Ollama, vLLM, llama.cpp, TGI | Start simple; scale to TGI or vLLM for production load |
| Chaos Engineering | Gremlin, Chaos Mesh, custom scripts | Gremlin for enterprise; Chaos Mesh for Kubernetes |
---
*This playbook is a living document. Update it with lessons from every engagement.*
*Previous: [Rapid Modernisation Plan](rapid-modernisation-plan.md)*

# M365 Antifragile Project Playbook
> *"Most M365 deployments create fragile monocultures: one tenant, one identity provider, one way in, and no way out. We architect M365 as an antifragile platform: decoupled, observable, recoverable, and sovereign."*
This playbook applies antifragile principles to Microsoft 365 projects—both **greenfield deployments** (new tenant, new organization, or post-merger consolidation) and **modernisation** (existing tenant hardening, restructuring, or security transformation).
It is designed for M365/Azure consultancies who want to deliver resilient, governance-ready, and future-proof M365 environments—not just functional ones.
---
## The Antifragile M365 Philosophy
Traditional M365 projects optimize for:
- **User adoption**: How quickly can we get people using Teams?
- **Feature enablement**: Which M365 apps should we roll out?
- **License efficiency**: Are we using all our E3/E5 seats?
Antifragile M365 projects optimize for:
- **Structural decoupling**: Can we migrate, split, or exit this tenant without existential disruption?
- **Observability**: Do we know who has access to what, and what they are doing with it?
- **Recoverability**: Can we rebuild this tenant from zero in 48 hours?
- **Sovereignty**: Does our proprietary data improve our position, or Microsoft's?
---
## Part 1: Greenfield M365 Deployment
### Phase 0: Architecture and Sovereignty Design (Before Migration)
**Objective**: Design the tenant so it does not become a trap.
| Decision | Antifragile Default | Fragile Alternative |
|----------|--------------------|---------------------|
| **Tenant location** | Data center in client's primary jurisdiction (e.g., EU, Germany, Switzerland) | Default US tenant with data residency afterthought |
| **Domain strategy** | Custom domain owned by client; MX records client-controlled | Microsoft-managed domain; no exit path |
| **Identity architecture** | Cloud-only Entra ID with documented exit path, OR hybrid with phased cloud-native migration | Hybrid AD with indefinite synchronization; no cloud-only plan |
| **Email archiving** | Immutable third-party journal or customer-managed retention; not Exchange Online-only | Exchange Online retention only; vendor-dependent |
| **External sharing** | Default off; enabled per-site with justification | Default on; locked down reactively after incidents |
| **Guest access** | Disabled by default; enabled via governed workflow | Enabled by default; cleaned up never |
| **Third-party apps** | Admin consent required; app catalog governed | User consent allowed; shadow OAuth proliferation |
| **Backup strategy** | Third-party backup with immutable storage; tested quarterly | Native recycle bin only; no recovery testing |
**The conversation**:
> *"We are not just setting up email and Teams. We are designing the digital foundation of your organization for the next decade. Every decision we make in the first two weeks will either preserve your optionality or eliminate it. We choose optionality."*
---
### Phase 1: Tenant Foundation (Week 1-2)
**Identity and Access Architecture**
- **Custom domain verification**: Client retains DNS control; Microsoft is a service, not an owner
- **Break-glass accounts**: 2-3 global admins, excluded from conditional access, complex passwords managed offline
- **Initial admin roles**: No standing global admins for daily work; delegated admin roles (Exchange admin, SharePoint admin, User admin)
- **Security defaults or conditional access baseline**:
- E3: Per-user MFA for all admins; block legacy authentication
- E5: Conditional access requiring MFA for all users, compliant devices for admins, block legacy auth, risky sign-in policies
**Data Governance Foundation**
- **Retention policies**: Define retention from day one
- Email: 7 years for regulated industries; 3 years for general business
- Teams chat: 2 years minimum
- SharePoint: per-site classification
- **Microsoft Purview labels** (if licensed): Deploy default sensitivity labels (Public, Internal, Confidential, Highly Confidential)
- **Data loss prevention** (if licensed): Pilot DLP for PCI, PII, and client-defined crown jewels
**Baseline Security Configuration**
- **Audit logging**: Enable Unified Audit Log immediately; configure 10-year retention for regulated clients
- **Mailbox auditing**: Enable for all mailboxes via PowerShell
- **Alert policies**: Configure default alert policies for elevated privileges, malware, phishing
- **Secure Score**: Baseline and weekly tracking
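The audit-log and mailbox-auditing items can be enforced from Exchange Online PowerShell (`ExchangeOnlineManagement` module; run with an account holding the appropriate admin role):

```powershell
Connect-ExchangeOnline
# Ensure the Unified Audit Log is enabled tenant-wide
Set-OrganizationConfig -AuditDisabled:$false
# Enable mailbox auditing on every existing mailbox (new mailboxes follow the org default)
Get-Mailbox -ResultSize Unlimited | Set-Mailbox -AuditEnabled $true
# Spot-check the result
Get-Mailbox -ResultSize 5 | Select-Object DisplayName, AuditEnabled
```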
---
### Phase 2: Workload Deployment (Week 3-6)
**Deployment Order (Antifragile Priority)**
| Priority | Workload | Why First? |
|----------|----------|-----------|
| 1 | **Exchange Online** | Identity verified, email secured, archiving established |
| 2 | **SharePoint / OneDrive** | Document governance foundation before content accumulates |
| 3 | **Teams** | Collaboration with the governance guardrails already in place |
| 4 | **Intune / Endpoint Management** | Device compliance before conditional access enforcement; see [Endpoint Management Entry Vector](endpoint-management-entry-vector.md) |
| 5 | **Power Platform** | Low-code governance before citizen developers create shadow IT |
| 6 | **Copilot / AI features** | Only after data governance, access control, and sovereignty architecture are proven |
**The antifragile rule**: Governance before workload. Every Teams channel created without a retention policy is technical debt. Every Power App deployed without DLP is a future incident.
---
### Phase 3: Hardening and Governance (Week 7-10)
**Conditional Access (E5 or Entra ID P1/P2)**
- Require MFA for all users
- Require compliant or hybrid Azure AD joined device for sensitive apps
- Block legacy authentication
- Block downloads from unmanaged devices for confidential content
- Require password change on high user risk
- Enforce token binding where supported
**SharePoint and OneDrive Lockdown**
- External sharing: Only people in your organization (default)
- Anyone links: Disabled
- Guest access: Admin-controlled per site
- Site creation: Admin-only or governed workflow
- Access requests: Disabled or routed to site owner
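These tenant-wide defaults map directly to SharePoint Online PowerShell settings. A sketch — the admin URL and site URL are placeholders:

```powershell
# Requires the Microsoft.Online.SharePoint.PowerShell module
Connect-SPOService -Url 'https://contoso-admin.sharepoint.com'   # placeholder admin URL
# Tenant default: no external sharing, which also removes anyone-links
Set-SPOTenant -SharingCapability Disabled
# Governed exception: existing external users only, on one justified site
Set-SPOSite -Identity 'https://contoso.sharepoint.com/sites/partners' `
    -SharingCapability ExistingExternalUserSharingOnly
```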
**Teams Governance**
- Team creation: Governed workflow (not open to all)
- Guest access in Teams: Disabled by default; enabled per team with justification
- Private channel creation: Restricted
- Third-party apps in Teams: Admin-approved catalog only
- Meeting recordings: Retention policy applied; transcription governed
**Power Platform Governance**
- Environment strategy: Default environment restricted; production environments for approved use cases
- DLP policies: Block connectors that exfiltrate data (personal email, unauthorized cloud storage)
- Data policies: Prevent citizen developers from creating unmanaged databases of customer data
- ALM: Require solution packaging for production environments
---
### Phase 4: Validation and Handover (Week 11-12)
**Recovery Testing**
- Perform tenant recovery drill: restore a deleted mailbox, a deleted SharePoint site, a corrupted Teams channel
- Validate backup integrity if third-party backup is deployed
- Document recovery runbooks
**Governance Documentation**
- Acceptable use policy for M365
- Data classification and handling guide
- Guest access policy
- External sharing decision tree
- Incident response runbook for M365-specific threats (BEC, OAuth consent grants, data exfiltration)
**Knowledge Transfer**
- Admin training: Entra ID, Exchange admin center, SharePoint admin, Security & Compliance
- End-user training: Phishing awareness, data handling, external sharing procedures
- Champion program: Identify M365 champions per department
---
## Part 2: M365 Modernisation
### The Modernisation Audit
Before any changes, assess the current tenant against antifragile criteria:
| Category | Audit Question | Finding |
|----------|---------------|---------|
| **Identity** | How many global admins? How many unused accounts? Is PIM enabled? | |
| **Access** | Is conditional access deployed? Is legacy auth blocked? Is MFA enforced? | |
| **Data** | Are sensitivity labels deployed? Is DLP active? Who can share externally? | |
| **Applications** | How many enterprise apps? How many OAuth consents? Are they justified? | |
| **Devices** | What is EDR coverage? Is Intune managing devices? Are PAWs used for admin? | |
| **Recovery** | When was the last backup test? Is there a tenant recovery plan? | |
| **Governance** | Is there an acceptable use policy? Who owns site creation? | |
| **AI** | Is shadow AI in use? Is there a sanctioned alternative? | |
**The conversation**:
> *"Most M365 modernisations start with 'What new features should we enable?' We start with 'What would kill this organization if it failed?' Then we fix that first."*
---
### Phase 1: Kill Chain Closure (Week 1-4)
**Identity Blitz**
```powershell
# Export and analyze the full identity estate (Microsoft Graph PowerShell;
# SignInActivity requires the AuditLog.Read.All permission)
Get-MgUser -All -Property DisplayName,UserPrincipalName,AccountEnabled,SignInActivity |
  Select-Object DisplayName,UserPrincipalName,AccountEnabled,
    @{n='LastSignIn';e={$_.SignInActivity.LastSignInDateTime}} |
  Export-Csv users.csv -NoTypeInformation
Get-MgDirectoryRole | ForEach-Object { Get-MgDirectoryRoleMember -DirectoryRoleId $_.Id }
Get-MgOAuth2PermissionGrant -All | Export-Csv oauth-grants.csv -NoTypeInformation
```
- Disable unused accounts (> 90 days inactive)
- Remove excessive admin roles
- Revoke stale OAuth consents
- Enable PIM for all privileged roles (if licensed)
- Enforce MFA for all users (per-user MFA for E3; conditional access for E5)
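The exported CSV can be triaged with a short script rather than by eye. A minimal sketch of the inactive-account check, assuming hypothetical column names that match the Graph export above (adjust to your actual CSV headers):

```python
import csv
import io
from datetime import datetime, timedelta, timezone

# Illustrative stand-in for users.csv; real data comes from the Graph export
SAMPLE = """DisplayName,UserPrincipalName,AccountEnabled,LastSignIn
Alice Admin,alice@contoso.example,True,2025-01-10T08:00:00Z
Bob Builder,bob@contoso.example,True,2024-06-01T08:00:00Z
Carol Contractor,carol@contoso.example,True,
"""

def stale_accounts(csv_text, now, days=90):
    """Return UPNs that are enabled but have not signed in within `days`."""
    cutoff = now - timedelta(days=days)
    stale = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["AccountEnabled"] != "True":
            continue  # already disabled
        last = row["LastSignIn"].strip()
        if not last:
            stale.append(row["UserPrincipalName"])  # never signed in
            continue
        ts = datetime.fromisoformat(last.replace("Z", "+00:00"))
        if ts < cutoff:
            stale.append(row["UserPrincipalName"])
    return stale

if __name__ == "__main__":
    now = datetime(2025, 2, 1, tzinfo=timezone.utc)
    print(stale_accounts(SAMPLE, now))
```

The output is the disable list for the first bullet above; review it with the business before acting, since service accounts often look "inactive" by sign-in date.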
**External Access Lockdown**
- Audit all guest users: business justification per guest
- Audit all external shares: revoke stale links
- Audit all enterprise apps: remove unused, justify retained
- Disable user consent for apps (admin consent required)
**Email Security Tuning**
- E3: Maximize EOP (anti-phishing impersonation protection, anti-malware, anti-spam)
- E5: Enable Safe Links, Safe Attachments, advanced anti-phishing
- Mailbox auditing: enable for all mailboxes
---
### Phase 2: Structural Improvement (Week 5-8)
**Data Governance Deployment**
- Deploy sensitivity labels (if Purview available) or manual classification guidance
- Deploy retention policies for all workloads
- Deploy DLP policies for high-sensitivity data types
- Site provisioning governance: restrict site creation or implement approval workflow
**Device and Endpoint**
- Deploy Intune MDM for all corporate devices
- Deploy Windows Defender features available in E3
- Consider Sysmon + Wazuh for EDR-like visibility without E5
- Deploy LAPS for local admin password randomization
**Power Platform Cleanup**
- Inventory all environments, apps, and flows
- Apply DLP policies
- Migrate unmanaged production apps to governed environments
- Document and train citizen developers
---
### Phase 3: Sovereignty and AI Integration (Week 9-12)
**AI Sovereignty Bridge**
- Inventory shadow AI usage
- Deploy Azure OpenAI Service as sanctioned alternative (see [Azure OpenAI Sovereignty Bridge](../core/azure-openai-sovereignty-bridge.md))
- Configure private endpoints, CMK, and conditional access for AI endpoints
- Pilot Copilot for M365 with governance guardrails (if licensed)
**Tenant Recovery Validation**
- Third-party backup test: restore mailbox, SharePoint site, Teams data
- Document tenant rebuild runbook
- Validate domain recovery procedures (DNS, MX, SPF, DKIM, DMARC)
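Part of the domain recovery validation can be scripted: once DNS is restored, the published SPF and DMARC TXT records should be checked for weak policies. A minimal sketch over record strings (fetch the live records with your resolver of choice; the sample records below are illustrative):

```python
def check_email_auth(spf_record, dmarc_record):
    """Flag weak SPF/DMARC policies in published TXT record strings."""
    findings = []
    if not spf_record.startswith("v=spf1"):
        findings.append("SPF: record missing or malformed")
    elif "+all" in spf_record or spf_record.rstrip().endswith("?all"):
        findings.append("SPF: permissive 'all' mechanism")
    elif "~all" in spf_record:
        findings.append("SPF: softfail only; consider '-all'")
    # DMARC records are semicolon-separated tag=value pairs
    tags = dict(
        part.strip().split("=", 1)
        for part in dmarc_record.split(";")
        if "=" in part
    )
    if tags.get("v") != "DMARC1":
        findings.append("DMARC: record missing or malformed")
    elif tags.get("p") == "none":
        findings.append("DMARC: policy is monitor-only (p=none)")
    return findings

# Illustrative records
spf = "v=spf1 include:spf.protection.outlook.com ~all"
dmarc = "v=DMARC1; p=none; rua=mailto:dmarc@contoso.example"
print(check_email_auth(spf, dmarc))
```

Running this against every owned domain after a recovery drill confirms that mail authentication came back in its hardened form, not its defaults.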
**Operational Handover**
- Transfer admin knowledge to client team
- Establish recurring governance review cadence
- Deploy automated Secure Score monitoring
---
## Antifragile M365 Checklist
### Greenfield Deployment
- [ ] Tenant in correct geographic region
- [ ] Custom domain with client-controlled DNS
- [ ] Break-glass accounts created and secured
- [ ] Security defaults or conditional access baseline
- [ ] Unified Audit Log enabled
- [ ] Retention policies defined and deployed
- [ ] External sharing default: off
- [ ] Guest access default: disabled
- [ ] User consent for apps: disabled
- [ ] Intune MDM baseline configured
- [ ] Third-party backup deployed and tested
- [ ] Recovery runbook documented
- [ ] Admin and end-user training completed
- [ ] AI governance framework defined before Copilot deployment
### Modernisation
- [ ] Full identity census completed
- [ ] Unused accounts disabled
- [ ] Admin roles minimized and justified
- [ ] OAuth consents audited and cleaned
- [ ] MFA enforced for 100% of users
- [ ] Legacy authentication blocked
- [ ] External sharing audited and locked down
- [ ] Guest access audited and time-bounded
- [ ] Email security tuned (EOP or Defender for O365)
- [ ] Sensitivity labels or classification guidance deployed
- [ ] Retention policies applied to all workloads
- [ ] Power Platform governed with DLP
- [ ] Shadow AI inventoried and sanctioned alternative deployed
- [ ] Backup recovery tested
- [ ] Secure Score trending upward
---
## Integration With the Rapid Modernisation Plan
| Rapid Modernisation Phase | M365 Project Mapping |
|--------------------------|---------------------|
| **Hygiene (Days 0-30)** | Identity audit; external access lockdown; MFA enforcement; shadow AI inventory |
| **Control (Days 30-60)** | Conditional access; data governance; device management; email security tuning |
| **Sovereignty (Days 60-90)** | Azure OpenAI bridge deployment; backup recovery validation; tenant exit architecture |
| **Antifragility (Days 90-180)** | Automated governance monitoring; quarterly recovery drills; red team including M365 vectors; AI pilot expansion |
---
*For the M365 E3 hardening specifics, see [M365 E3 Hardening](m365-e3-hardening.md).*
*For the Azure OpenAI sovereignty bridge, see [Azure OpenAI Sovereignty Bridge](../core/azure-openai-sovereignty-bridge.md).*
*For the M365 project risk register, see [M365 Project Risk Register](../assessment-templates/m365-project-risk-register.md).*

# M365 E3 Hardening Playbook
> *"Most of your clients own E3, not E5. That is not a handicap. It is a constraint that forces precision."*
This playbook is designed for consulting engagements where the client's primary environment is **Microsoft 365 with E3 licensing**. It provides a pragmatic hardening roadmap that respects the E3 feature boundary while closing critical security gaps through configuration, process, and low-cost augmentation.
E3 provides the foundation. The gaps are real but manageable. This document shows you exactly what E3 gives you, what it does not, and how to close the gaps without immediately pushing an E5 upgrade.
---
## What E3 Actually Includes (Security-Relevant)
| Capability | E3 Inclusion | Notes |
|-----------|-------------|-------|
| Exchange Online Protection (EOP) | Yes | Anti-malware, anti-spam, basic anti-phishing |
| Azure AD Free / Entra ID Free | Yes | Basic identity; no conditional access or PIM (note: Microsoft 365 E3 bundles Entra ID P1 with conditional access, while Office 365 E3 does not — verify which SKU the client actually holds) |
| Microsoft Defender Antivirus | Yes | Client-side AV, no EDR, no ASR |
| Office 365 Audit Logging | Yes | Must be manually enabled |
| Basic Mobile Device Management (MDM) | Yes | Limited enrollment via Microsoft Intune |
| Self-Service Password Reset (SSPR) | Yes | Must be configured; not enabled by default |
| Teams, SharePoint, OneDrive | Yes | Data governance limited without Purview |
## What E3 Does NOT Include (The Gaps)
| Capability | Missing in E3 | Business Impact |
|-----------|---------------|-----------------|
| Microsoft Defender for Endpoint P2 | No | No EDR, no ASR rules, no threat analytics, no automated investigation |
| Entra ID P2 / P1 Conditional Access | No | No risk-based policies, no device compliance gating, no location-based rules |
| Entra ID PIM | No | No just-in-time admin elevation |
| Microsoft Defender for Office 365 P2 | No | No Safe Links, no Safe Attachments, no advanced anti-phishing |
| Microsoft Purview | No | No DLP, no sensitivity labels, no insider risk management |
| Microsoft Sentinel | No | No native SIEM; logs go to Log Analytics only with additional cost |
---
## The E3 Hardening Strategy
We operate in three layers:
1. **Maximize E3** — Every configuration, every policy, every log that E3 can produce
2. **Augment E3** — Open-source and low-cost tools that close the most dangerous gaps
3. **Justify E5 selectively** — Use E3 gaps as evidence for strategic E5 upgrades, not blanket licensing
---
## Phase 1: E3 Foundation (Week 1-2)
### Identity and Access
**Enable MFA for All Users**
E3 includes MFA via Azure AD Free/Entra ID Free, but it is **per-user MFA** (less flexible than conditional access). This is still mandatory.
- Navigate to Microsoft Entra admin center → Users → Per-user MFA
- Enable MFA for all administrative accounts first
- Roll out to all users within 14 days
- Enroll at least one backup method per user (authenticator app + phone)
**Document the Gap**: Per-user MFA cannot enforce risk-based step-up, device compliance, or location-based blocking. Document this as a risk for steering committee.
**Disable Legacy Authentication**
- Microsoft 365 admin center → Settings → Org settings → Modern authentication
- Verify legacy auth is disabled tenant-wide
- If specific protocols are required (e.g., IMAP for legacy devices), document exceptions with expiration dates
**Audit and Cleanse Identities**
- Export all users: `Get-MgUser -All -Property DisplayName,UserPrincipalName,AccountEnabled,UserType,SignInActivity | Export-Csv users.csv -NoTypeInformation` (use Graph PowerShell; the MSOnline module is retired)
- Export all guest users: `Get-MgUser -All -Filter "userType eq 'Guest'"` (guests are easy to miss in admin-center views)
- Export all service principals / enterprise apps: `Get-MgServicePrincipal -All`
- Disable unused accounts (> 90 days inactive)
- Review and revoke excessive OAuth consents
**Secure Break-Glass Accounts**
- Create 2-3 cloud-only Global Admin break-glass accounts excluded from MFA enforcement (emergency access only; FIDO2 keys stored offline are preferable to a blanket MFA exclusion)
- Use non-personal, complex passwords (20+ characters, managed offline)
- Log every use; review quarterly
### Email Security (EOP-Only)
**Harden Anti-Phishing in EOP**
EOP anti-phishing is basic but not useless. Configure it aggressively:
- Exchange admin center → Protection → Anti-phishing
- Enable impersonation protection for:
- Domain (your own domains)
- Users (CEO, CFO, board members)
- Enable mailbox intelligence (learns sender patterns)
- Set action for impersonated users: **Quarantine**
- Set action for impersonated domains: **Quarantine**
**Configure Anti-Malware**
- Exchange admin center → Protection → Anti-malware
- Enable common attachment filter (block executable content)
- Notify internal senders if malware detected
- Notify administrators with full message details
**Anti-Spam Tuning**
- Exchange admin center → Protection → Anti-spam
- Set bulk email threshold to 6 or 7 (aggressive)
- Enable SPF hard fail evaluation
- Configure outbound spam notifications
### Audit Logging
**Enable Unified Audit Log**
This is **not enabled by default** in many tenants and is the most underutilized E3 feature.
```powershell
# Verify status
Get-AdminAuditLogConfig | Select-Object UnifiedAuditLogIngestionEnabled
# Enable if false
Set-AdminAuditLogConfig -UnifiedAuditLogIngestionEnabled $true
```
- Retention: 90 days historically (180 days for Audit Standard in current tenants); document the gap vs. 1-year requirements in some regulations
- Export for analysis: `Search-UnifiedAuditLog` or use Microsoft Purview Audit (Standard) if available
**Enable Mailbox Auditing**
```powershell
# Enable for all mailboxes
Get-Mailbox -ResultSize Unlimited | Set-Mailbox -AuditEnabled $true
```
### SharePoint and OneDrive
**External Sharing Lockdown**
- SharePoint admin center → Policies → Sharing
- Default: **Only people in your organization**
- Override per site only with documented business justification
- Disable "Anyone" links (anonymous sharing)
**OneDrive Retention**
- OneDrive admin center → Storage
- Set retention for deleted users: 30 days minimum
- Document data ownership transfer process
---
## Phase 2: Augment E3 (Week 3-4)
### Close the EDR Gap (No Defender for Endpoint P2)
E3 includes Microsoft Defender Antivirus but **not** EDR. You need visibility.
| Option | Cost | Effort | When to Use |
|--------|------|--------|-------------|
| **Wazuh** (open-source) | Free | Medium | Need centralized EDR-like visibility without purchase |
| **Sysmon + free log forwarding** | Free | Medium | Need detailed Windows endpoint telemetry |
| **Upgrade select users to E5 Security** | ~$10/user/month | Low | Critical users only (admins, executives, finance) |
| **Microsoft Defender for Business** | ~$3/user/month | Low | Small business clients; includes EDR-lite |
**Recommended Hybrid Approach for E3 Clients**:
1. Deploy **Sysmon** (free) on all Windows endpoints with the SwiftOnSecurity config
2. Forward Sysmon logs to **Wazuh** (free) or existing syslog/SIEM
3. Upgrade **only privileged users** to Microsoft Defender for Endpoint P2 via add-on or E5 Security
4. This gives you EDR coverage where it matters most at ~10% of full E5 cost
### Close the Conditional Access Gap (No Entra ID P1/P2)
Without conditional access, you cannot enforce:
- Device compliance gating
- Location-based blocking
- Risk-based step-up
- Block legacy auth per-protocol
**Mitigations within E3**:
- **Per-user MFA**: Enforce for 100% of users (already covered above)
- **Block legacy auth tenant-wide**: Already covered above
- **Intune MDM enrollment**: E3 includes basic Intune; enroll all corporate devices
- **Third-party MFA with policy engine**: Duo, Okta (additional cost, but cheaper than full E5)
**The Strategic Conversation**:
> *"E3 gives us strong authentication but weak authorization. We can enforce MFA, but we cannot say 'only from a managed device in the Czech Republic.' If that is a requirement for your risk profile, the minimum viable upgrade is Entra ID P1 for conditional access, not a full E5 jump."*
### Close the Email Security Gap (No Defender for Office 365 P2)
EOP anti-phishing is reactive. Safe Links and Safe Attachments are proactive.
**Mitigations within E3**:
- **URL rewriting via transport rules**: Block known bad TLDs, force HTTPS where possible
- **Attachment filtering**: Block executable attachments at transport rule level (EOP already does this partially)
- **User education**: Phishing simulation via free or low-cost platforms (GoPhish is open-source)
- **Third-party email gateway**: Proofpoint, Mimecast, Avanan (~$3-5/user/month)
**The Strategic Conversation**:
> *"EOP catches spam and known malware. It does not rewrite URLs or sandbox attachments. For a bank/telco/power client, that gap is meaningful. The most cost-effective close is either Defender for Office 365 P1 add-on or a third-party gateway. Let us quantify the phishing risk first, then size the investment."*
### Close the PAM Gap (No PIM)
Without PIM, administrative roles are standing privileges.
**Mitigations within E3**:
- **Dedicated admin accounts**: Separate admin and user identity for every administrator
- **PAW (Privileged Access Workstation)**: Physical or virtual separation for admin tasks
- **Time-bounded access via process**: Manual approval workflow for admin elevation
- **Quarterly admin access review**: Document every admin; remove stale assignments
- **LAPS**: Free from Microsoft; randomizes local admin passwords
---
## Phase 3: M365-Specific Threat Scenarios
### Scenario 1: Business Email Compromise (BEC)
**The Attack**: Adversary compromises executive mailbox, sends fraudulent payment instructions.
**E3 Defenses**:
- Impersonation protection in EOP (configured above)
- Mailbox auditing (configured above)
- MFA on all accounts (prevents initial compromise)
- Outbound spam policy: flag unusual send patterns
**Gap**: No Safe Links to rewrite URLs in real-time; no automated investigation.
**Augmentation**: User education + third-party email gateway.
### Scenario 2: OAuth / Consent Grant Attack
**The Attack**: User grants permissions to malicious app; adversary gains persistent access.
**E3 Defenses**:
- Audit all enterprise apps: `Get-MgServicePrincipal -All`
- Review OAuth consents quarterly
- Disable user consent to apps (admin consent required)
- Microsoft 365 admin center → Settings → Org settings → User consent to apps → **Off**
**Gap**: No automated anomaly detection for consent grants.
**Augmentation**: Manual quarterly review + scripting.
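The quarterly review scales better as a diff of consecutive `oauth-grants.csv` exports than as a full re-read of the grant list. A minimal sketch, with hypothetical column names standing in for the real export schema:

```python
import csv
import io

def grant_keys(csv_text):
    """One key per (app, principal, scope) consent in an export."""
    return {
        (r["ClientId"], r["PrincipalId"], r["Scope"])
        for r in csv.DictReader(io.StringIO(csv_text))
    }

def new_grants(previous_csv, current_csv):
    """Consents present now that were absent in the last snapshot."""
    return sorted(grant_keys(current_csv) - grant_keys(previous_csv))

# Illustrative quarterly snapshots
Q1 = """ClientId,PrincipalId,Scope
app-mail,u1,Mail.Read
"""
Q2 = """ClientId,PrincipalId,Scope
app-mail,u1,Mail.Read
app-sync,u2,Files.ReadWrite.All
"""
print(new_grants(Q1, Q2))
```

Only the delta needs human attention each quarter; high-privilege scopes such as `Files.ReadWrite.All` in the delta warrant immediate investigation.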
### Scenario 3: Data Exfiltration via SharePoint / OneDrive
**The Attack**: Insider or compromised account bulk-downloads sensitive files.
**E3 Defenses**:
- External sharing locked down (configured above)
- Audit logging enabled (configured above)
- Basic retention policies
**Gap**: No DLP, no sensitivity labels, no insider risk analytics.
**Augmentation**:
- PowerShell scripts to detect bulk downloads
- Quarterly access reviews on sensitive sites
- Process: data classification by site owner (manual but effective)
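The bulk-download detection script can be as simple as bucketing `FileDownloaded` audit events per user per hour and alerting on a threshold. A minimal sketch over event tuples (the sample rows are illustrative; real events come from a `Search-UnifiedAuditLog` export):

```python
from collections import Counter
from datetime import datetime

# Illustrative (user, timestamp, operation) tuples from an audit export
EVENTS = [
    ("mallory@contoso.example", "2025-02-01T09:01:00", "FileDownloaded"),
    ("mallory@contoso.example", "2025-02-01T09:02:00", "FileDownloaded"),
    ("mallory@contoso.example", "2025-02-01T09:03:00", "FileDownloaded"),
    ("alice@contoso.example",   "2025-02-01T09:05:00", "FileDownloaded"),
]

def bulk_downloaders(events, threshold=3):
    """Users reaching `threshold` downloads within any single clock hour."""
    buckets = Counter()
    for user, ts, op in events:
        if op != "FileDownloaded":
            continue
        hour = datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H")
        buckets[(user, hour)] += 1
    return sorted({u for (u, _), n in buckets.items() if n >= threshold})

print(bulk_downloaders(EVENTS))
```

Tune the threshold per site sensitivity; a fixed hourly count is crude but catches the smash-and-grab pattern that DLP would otherwise flag.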
### Scenario 4: Lateral Movement via Compromised Credentials
**The Attack**: Phished credentials → mailbox compromise → password reset on other services → full identity takeover.
**E3 Defenses**:
- MFA (prevents password-only access)
- SSPR with MFA enforcement (prevents account lockout abuse)
**Gap**: No risk-based step-up; no impossible travel blocking.
**Augmentation**: Monitor for impossible travel in audit logs (manual or scripted).
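The impossible-travel check is arithmetic: great-circle distance between consecutive sign-in locations divided by elapsed time. A minimal sketch, assuming sign-in rows already carry coordinates (the sample rows are illustrative; real coordinates come from the sign-in log's location field):

```python
from math import radians, sin, cos, asin, sqrt
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def impossible_travel(signins, max_kmh=900):
    """Flag consecutive sign-ins per user faster than `max_kmh` (~airliner speed)."""
    flags = []
    last = {}
    for user, ts, lat, lon in sorted(signins, key=lambda r: (r[0], r[1])):
        t = datetime.fromisoformat(ts)
        if user in last:
            t0, lat0, lon0 = last[user]
            hours = (t - t0).total_seconds() / 3600
            if hours > 0 and haversine_km(lat0, lon0, lat, lon) / hours > max_kmh:
                flags.append((user, ts))
        last[user] = (t, lat, lon)
    return flags

# Illustrative: a Prague sign-in followed 30 minutes later by one from Singapore
SIGNINS = [
    ("bob@contoso.example", "2025-02-01T08:00:00", 50.08, 14.44),
    ("bob@contoso.example", "2025-02-01T08:30:00", 1.35, 103.82),
]
print(impossible_travel(SIGNINS))
```

Expect false positives from VPN egress points; the value is the triage list, not a verdict.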
---
## The E5 Upgrade Conversation
There will come a point where E3 augmentation is no longer cost-effective. Frame the E5 conversation around **specific capability gaps**, not feature lust.
| E5 Capability | What It Solves | When to Recommend |
|--------------|----------------|-------------------|
| Defender for Endpoint P2 | EDR, ASR, threat analytics | Client has had malware incident or is in regulated industry |
| Entra ID P2 | Conditional access, PIM, identity protection | Client has admin compromise or needs device/location gating |
| Defender for Office 365 P2 | Safe Links, Safe Attachments, automated investigation | Client has had phishing-driven incident |
| Purview | DLP, sensitivity labels, insider risk | Client handles customer PII, financial data, or trade secrets |
| Sentinel | SIEM, SOAR, threat hunting | Client has dedicated SOC or regulatory SIEM requirements |
**The Pitch**:
> *"We have extracted 80% of the security value from your E3 investment. The remaining 20% requires capabilities that only exist in E5 or specific add-ons. I am not recommending a blanket upgrade. I am recommending we selectively license the gaps that match your actual risk profile."*
---
## OT / Critical Infrastructure Overlay (Telco, Power)
For clients with operational technology (OT) or critical infrastructure obligations:
| E3 Consideration | OT Implication |
|-----------------|----------------|
| MFA enforcement | Admin accounts for OT-facing M365 tenants must have hardware tokens (no phone SMS in control rooms) |
| Audit logging | 90-day retention may be insufficient; plan export to long-term storage |
| External sharing | OneDrive/SharePoint must not become accidental conduit between IT and OT networks |
| Guest access | Strictly prohibit guest accounts in OT-connected tenants |
| Email security | EOP is baseline; NIS2 and critical infrastructure regulations may mandate advanced email filtering |
See [Vertical: Power Utilities](../reference/vertical-power-utilities.md) for full OT alignment.
---
## Banking Overlay
For financial services clients:
| E3 Consideration | Regulatory Implication |
|-----------------|----------------------|
| Audit logging | DORA Article 12 (ICT risk management) requires comprehensive logging and monitoring |
| MFA | PSD2 Strong Customer Authentication principles apply to internal systems |
| Data residency | M365 data must remain in EU/geographically appropriate datacenters |
| DLP gap | No native DLP in E3; manual data governance + eventual Purview upgrade likely required |
| Email archiving | Financial regulations often require immutable, long-term email retention |
See [Vertical: Banking](../reference/vertical-banking.md) for full regulatory alignment.
---
*Previous: [Zero-Budget Hardening](zero-budget-hardening.md)*
*Next: [AD and Endpoint Hardening](ad-endpoint-hardening.md)*
For how Intune deployment becomes the natural entry point for broader security transformation, see [Endpoint Management Entry Vector](endpoint-management-entry-vector.md).

# osquery: The Sovereign Discovery Platform
> *"Tenable sees what Tenable chooses to show you. osquery sees whatever you ask it to see. The difference is sovereignty."*
This document provides a complete blueprint for building a **custom vulnerability discovery, compliance, and asset inventory platform** on osquery—the open-source, cross-platform endpoint agent that exposes operating systems as SQL databases. It is designed for consultancies and clients who want **owned visibility** rather than rented scanner reports.
Osquery is the technical expression of the antifragile principle: **sovereign intelligence**. Your data. Your queries. Your infrastructure. No third-party black box.
---
## Why osquery Fits the Antifragile Posture
| Commercial Scanner | osquery |
|-------------------|---------|
| Proprietary detection logic | **Open-source SQL queries you can inspect, modify, and extend** |
| Data sent to vendor cloud | **Data stays on your infrastructure** |
| Vendor-defined scan scope | **You define what to query; if you can think it, you can ask it** |
| Per-asset licensing cost | **Free and open-source** |
| Quarterly or monthly scans | **Continuous or on-demand; you control the cadence** |
| Generic report templates | **Custom dashboards and reports built on your data** |
| Vendor lock-in | **Portable SQL queries; migrate to any platform** |
**The executive framing**:
> *"Tenable is a rented microscope. It shows you what the manufacturer decided you should see. osquery is a laboratory. You design the experiments, you collect the samples, and you interpret the results. It requires more expertise—but it produces intelligence that no competitor can replicate because it is built on your specific questions about your specific environment."*
---
## What osquery Actually Is
Osquery is an **endpoint agent** that runs on Windows, macOS, Linux, and FreeBSD. It exposes the operating system as a relational database with **hundreds of tables**:
| Table Category | Examples | What You Can Ask |
|----------------|----------|-----------------|
| **Processes** | `processes`, `process_memory_map`, `process_open_sockets` | "Show me processes listening on external ports" |
| **Network** | `listening_ports`, `interface_details`, `etc_hosts` | "Show me hosts with no firewall enabled" |
| **Users & Authentication** | `users`, `groups`, `shadow`, `logged_in_users` | "Show me accounts with password never expires" |
| **Software & Packages** | `programs`, `deb_packages`, `rpm_packages`, `chrome_extensions` | "Show me installed software with known vulnerable versions" |
| **System Configuration** | `os_version`, `system_info`, `registry` | "Show me all Windows Server 2012 machines" |
| **Security** | `startup_items`, `scheduled_tasks`, `authorizations` | "Show me persistence mechanisms" |
| **File Integrity** | `file_events`, `hash` | "Show me changes to /etc/passwd in the last hour" |
| **Hardware** | `usb_devices`, `system_info`, `cpu_info` | "Show me unmanaged USB devices" |
**The power**: You write SQL. osquery returns live system data. No proprietary query language. No vendor-defined limits.
---
## Deployment Architecture
### Model 1: Standalone / Ad-Hoc (Proof of Concept)
For a first sweep or targeted investigation:
```bash
# Install osquery on a single system
# Windows: choco install osquery
# macOS: brew install osquery
# Ubuntu: add the osquery apt repository first, then: apt install osquery
# Run a query interactively
osqueryi "SELECT name, version, install_date FROM programs WHERE name LIKE '%Adobe%'"
# Run a query from file
osqueryi --json < queries/windows-software-inventory.sql > results.json
```
**Use case**: Consultant's laptop runs osqueryi against a script-generated target list via SSH/WinRM. No infrastructure. No agents permanently deployed. Perfect for first sweeps.
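The JSON emitted by `osqueryi --json` is a flat array of row objects, which makes post-processing trivial. A minimal sketch that flags externally listening services from such output (the sample rows are illustrative):

```python
import json

# Illustrative stand-in for results.json from a listening-ports query
RESULTS = json.loads("""[
  {"port": "22",   "address": "0.0.0.0",   "name": "sshd"},
  {"port": "5432", "address": "10.0.0.5",  "name": "postgres"},
  {"port": "8080", "address": "127.0.0.1", "name": "dev-server"}
]""")

LOOPBACK = {"127.0.0.1", "::1"}

def external_listeners(rows):
    """Rows whose socket is bound beyond loopback (0.0.0.0 counts: all interfaces)."""
    return [
        f"{r['name']} on {r['address']}:{r['port']}"
        for r in rows
        if r["address"] not in LOOPBACK
    ]

print(external_listeners(RESULTS))
```

The same pattern applies to any query in this document: run it with `--json`, filter in a few lines, and append the findings to the engagement report.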
### Model 2: Scheduled Agent with Local Logging (Basic Monitoring)
Deploy osquery as a daemon with scheduled queries writing to local files or syslog:
```json
// /etc/osquery/osquery.conf
{
"schedule": {
"installed_software": {
"query": "SELECT name, version, install_date FROM programs;",
"interval": 86400,
"description": "Daily software inventory"
},
"listening_ports": {
"query": "SELECT lp.pid, lp.port, lp.protocol, p.name, p.path FROM listening_ports lp LEFT JOIN processes p ON lp.pid = p.pid WHERE lp.address != '127.0.0.1';",
"interval": 3600,
"description": "Hourly external listening ports"
},
"missing_patches": {
"query": "SELECT hotfix_id, installed_on FROM patches WHERE hotfix_id NOT IN (SELECT hotfix_id FROM patches WHERE installed_on > date('now', '-30 days'));",
"interval": 86400,
"description": "Daily patch compliance check"
}
},
"options": {
"logger_path": "/var/log/osquery",
"logger_plugin": "filesystem"
}
}
```
**Use case**: Small environments (50-500 endpoints) where centralized management is not yet justified. Logs are collected by existing SIEM or file forwarder.
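With the filesystem logger, scheduled queries write newline-delimited JSON to the logger path, one differential event per line (rows marked `added` or `removed` since the previous run). A minimal parser sketch, with sample lines illustrative of that differential format:

```python
import json

# Illustrative differential log lines from osqueryd.results.log
LOG_LINES = [
    '{"name": "installed_software", "action": "added", '
    '"columns": {"name": "7-Zip", "version": "23.01"}, "hostIdentifier": "ws-042"}',
    '{"name": "listening_ports", "action": "added", '
    '"columns": {"port": "4444", "name": "nc"}, "hostIdentifier": "ws-042"}',
    '{"name": "installed_software", "action": "removed", '
    '"columns": {"name": "Java 8", "version": "8u371"}, "hostIdentifier": "ws-042"}',
]

def changes_by_query(lines):
    """Group differential events as {query_name: [(action, columns), ...]}."""
    summary = {}
    for line in lines:
        event = json.loads(line)
        summary.setdefault(event["name"], []).append(
            (event["action"], event["columns"])
        )
    return summary

changes = changes_by_query(LOG_LINES)
print(sorted(changes))
```

A new `added` row under `listening_ports` (here, netcat on port 4444) is exactly the kind of change worth surfacing to the SIEM as an alert rather than a log line.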
### Model 3: FleetDM (The Recommended Control Plane)
FleetDM is an open-source management platform for osquery. It provides:
- Centralized query scheduling across thousands of endpoints
- Live query capability (ask a question; get answers in seconds)
- Policy enforcement (compliance checks with pass/fail reporting)
- Software inventory and vulnerability mapping
- Device health monitoring
- SSO integration
- API for automation and reporting
**Deployment**:
```
┌─────────────┐ ┌─────────────┐ ┌─────────────────┐
│ FleetDM │────▶│ MySQL │────▶│ Redis (cache) │
│ (Web/API) │ │ ( datastore)│ │ │
└──────┬──────┘ └─────────────┘ └─────────────────┘
│ HTTPS (TLS 1.2+)
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ osquery │ │ osquery │ │ osquery │
│ (Windows) │ │ (Linux) │ │ (macOS) │
└─────────────┘ └─────────────┘ └─────────────┘
```
**FleetDM pricing**:
- **Free tier**: Up to 1,000 hosts, full osquery management, basic vulnerability mapping
- **Premium**: ~$7/host/year for advanced features, SSO, API access, premium support
- **Self-hosted**: The software is open-source; you pay only for infrastructure
**The business case**:
| Solution | Cost for 1,000 hosts/year | Data Sovereignty |
|----------|--------------------------|------------------|
| Tenable.io | ~$50,000-$100,000 | Data in vendor cloud |
| Qualys VMDR | ~$40,000-$80,000 | Data in vendor cloud |
| FleetDM + osquery | ~$7,000 (premium) or $0 (free) + infrastructure | **Data in your infrastructure** |
---
## Query Packs for Vulnerability Discovery
### Windows Vulnerability Discovery Pack
```sql
-- windows-vuln-discovery.sql
-- Run via: osqueryi < windows-vuln-discovery.sql
-- 1. End-of-life operating systems
SELECT
si.computer_name,
os.name AS os_name,
os.version AS os_version,
os.build AS os_build,
CASE
WHEN os.version LIKE '6.1%' THEN 'Windows 7/Server 2008 R2 - END OF LIFE'
WHEN os.version LIKE '6.2%' THEN 'Windows 8/Server 2012 - END OF LIFE'
WHEN os.version LIKE '6.3%' THEN 'Windows 8.1/Server 2012 R2 - END OF LIFE'
WHEN os.version LIKE '10.0.1%' THEN 'Windows 10/Server 2016 - Check build'
WHEN CAST(os.build AS INTEGER) < 17763 THEN 'Windows 10/Server 2019 - Outdated build'
ELSE 'Current or check manually'
END AS eol_status
FROM os_version os
CROSS JOIN system_info si;
-- 2. Patch staleness (the patches table lists *installed* hotfixes only,
--    so report the newest install date and review anything stale)
SELECT
si.computer_name,
MAX(p.installed_on) AS last_patch_date
FROM patches p
CROSS JOIN system_info si
GROUP BY si.computer_name;
-- 3. Software with known vulnerable versions (customizable)
SELECT
si.computer_name,
p.name,
p.version,
CASE
WHEN p.name LIKE '%Adobe Reader%' AND CAST(REPLACE(p.version, '.', '') AS INTEGER) < 2023000 THEN 'POTENTIALLY VULNERABLE'
WHEN p.name LIKE '%Java%' AND p.version LIKE '8u%' AND CAST(SUBSTR(p.version, 3) AS INTEGER) < 381 THEN 'POTENTIALLY VULNERABLE'
WHEN p.name LIKE '%Chrome%' AND CAST(REPLACE(SUBSTR(p.version, 1, 2), '.', '') AS INTEGER) < 120 THEN 'POTENTIALLY VULNERABLE'
ELSE 'REVIEW MANUALLY'
END AS vuln_status
FROM programs p
CROSS JOIN system_info si
WHERE p.name IN ('Adobe Reader', 'Java', 'Google Chrome', 'Mozilla Firefox', 'Microsoft Edge')
OR p.name LIKE '%Adobe%'
OR p.name LIKE '%Java%'
OR p.name LIKE '%Chrome%';
-- 4. Local administrators (excessive count = risk)
SELECT
si.computer_name,
COUNT(*) AS admin_count,
GROUP_CONCAT(u.username, '; ') AS admin_users
FROM users u
JOIN user_groups ug ON u.uid = ug.uid
JOIN groups g ON ug.gid = g.gid
CROSS JOIN system_info si
WHERE g.groupname = 'Administrators'
GROUP BY si.computer_name
HAVING admin_count > 3;
-- 5. Services listening on external interfaces
SELECT
si.computer_name,
lp.port,
lp.protocol,
p.name AS process_name,
p.path AS process_path,
lp.address
FROM listening_ports lp
LEFT JOIN processes p ON lp.pid = p.pid
CROSS JOIN system_info si
WHERE lp.address NOT IN ('127.0.0.1', '::1', '0.0.0.0')
AND lp.address NOT LIKE '169.254.%'
AND lp.port > 0;
-- 6. Firewall health (windows_security_center reports overall firewall state;
--    windows_firewall_rules lists individual rules, not profile status)
SELECT
si.computer_name,
wsc.firewall AS firewall_state,
CASE WHEN wsc.firewall != 'Good' THEN 'CRITICAL: FIREWALL UNHEALTHY' ELSE 'OK' END AS status
FROM windows_security_center wsc
CROSS JOIN system_info si;
-- 7. BitLocker encryption status (Windows)
SELECT
si.computer_name,
d.drive_letter,
d.conversion_status,
d.protection_status,
CASE WHEN d.protection_status = 0 THEN 'UNENCRYPTED' ELSE 'ENCRYPTED' END AS encryption_status
FROM bitlocker_info d
CROSS JOIN system_info si;
```
### Linux Vulnerability Discovery Pack
```sql
-- linux-vuln-discovery.sql
-- 1. OS version and kernel (check for EOL)
SELECT
si.hostname,
os.name,
os.version,
os.platform,
os.platform_like,
k.version AS kernel_version
FROM os_version os
CROSS JOIN system_info si
LEFT JOIN kernel_info k ON 1=1;
-- 2. Packages with known CVEs (requires vulners or manual correlation)
SELECT
si.hostname,
dp.name,
dp.version,
dp.source,
dp.arch
FROM deb_packages dp
CROSS JOIN system_info si
WHERE dp.name IN ('openssl', 'openssh-server', 'nginx', 'apache2', 'mysql-server', 'postgresql')
UNION ALL
SELECT
si.hostname,
rp.name,
rp.version,
rp.source,
rp.arch
FROM rpm_packages rp
CROSS JOIN system_info si
WHERE rp.name IN ('openssl', 'openssh-server', 'nginx', 'httpd', 'mariadb-server', 'postgresql-server');
-- 3. SSH daemon hardening checks (the ssh_configs table parses per-user
--    *client* configs; sshd_config is read here via the augeas table)
SELECT
si.hostname,
a.label,
a.value,
CASE
WHEN a.label = 'PermitRootLogin' AND a.value = 'yes' THEN 'CRITICAL: Root login permitted'
WHEN a.label = 'PasswordAuthentication' AND a.value = 'yes' THEN 'HIGH: Password auth enabled'
WHEN a.label = 'Port' AND a.value != '22' THEN 'INFO: Non-standard port'
ELSE 'Review'
END AS risk
FROM augeas a
CROSS JOIN system_info si
WHERE a.path = '/etc/ssh/sshd_config'
AND a.label IN ('PermitRootLogin', 'PasswordAuthentication', 'Port', 'MaxAuthTries');
-- 4. Sudoers with NOPASSWD (privilege escalation risk; the table is named sudoers)
SELECT
si.hostname,
su.source,
su.header,
su.rule_details
FROM sudoers su
CROSS JOIN system_info si
WHERE su.rule_details LIKE '%NOPASSWD%'
OR su.rule_details LIKE '%ALL=(ALL:ALL) ALL%';
-- 5. Listening ports with process attribution
SELECT
si.hostname,
lp.port,
lp.protocol,
lp.address,
p.name AS process_name,
p.pid,
p.path
FROM listening_ports lp
LEFT JOIN processes p ON lp.pid = p.pid
CROSS JOIN system_info si
WHERE lp.address NOT IN ('127.0.0.1', '::1', '0.0.0.0');
-- 6. Setuid/setgid binaries (privilege escalation paths; the file table
--    needs a directory constraint, and mode strings look like '4755')
SELECT
si.hostname,
f.path,
f.uid,
f.gid,
f.mode,
datetime(f.atime, 'unixepoch') AS last_accessed
FROM file f
CROSS JOIN system_info si
WHERE f.directory IN ('/usr/bin', '/usr/sbin', '/bin', '/sbin')
AND (f.mode LIKE '4%' OR f.mode LIKE '2%' OR f.mode LIKE '6%');
-- 7. Container presence and image versions
SELECT
si.hostname,
dc.id,
dc.name,
dc.image,
dc.image_id,
dc.state,
dc.created
FROM docker_containers dc
CROSS JOIN system_info si
WHERE dc.state = 'running';
-- 8. Kubernetes pod security (kubernetes_pods is not a core osquery table;
--    it requires an extension such as kubequery)
SELECT
si.hostname,
kp.name,
kp.namespace,
kp.status,
kp.containers
FROM kubernetes_pods kp
CROSS JOIN system_info si;
```
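The package inventory above still needs CVE correlation. In production that means feeding name/version pairs to a vulnerability feed (FleetDM does this natively; OSV and NVD expose APIs). A minimal offline sketch against a hand-maintained advisory map — the advisory entries and fixed versions below are illustrative placeholders, not real CVE data:

```python
# Illustrative advisory map: package -> (fixed_version, advisory label)
ADVISORIES = {
    "openssl": ("3.0.13", "hypothetical-OPENSSL-ADVISORY"),
    "openssh-server": ("9.6", "hypothetical-OPENSSH-ADVISORY"),
}

def version_tuple(v):
    """Crude dotted-version parse; real comparisons need dpkg/rpm semantics."""
    return tuple(int(p) for p in v.split("-")[0].split(".") if p.isdigit())

def correlate(packages):
    """Flag packages older than the advisory's fixed version."""
    findings = []
    for name, version in packages:
        if name in ADVISORIES:
            fixed, label = ADVISORIES[name]
            if version_tuple(version) < version_tuple(fixed):
                findings.append((name, version, label))
    return findings

# Illustrative osquery output rows (name, version)
PACKAGES = [("openssl", "3.0.2-0ubuntu1"), ("nginx", "1.24.0")]
print(correlate(PACKAGES))
```

Treat this as triage, not truth: distribution vendors backport fixes without bumping upstream versions, so confirm findings against the distro's security tracker before reporting.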
### macOS Vulnerability Discovery Pack
```sql
-- macos-vuln-discovery.sql
-- 1. macOS version (check for EOL)
SELECT
si.computer_name,
os.name,
os.version,
os.platform,
os.build
FROM os_version os
CROSS JOIN system_info si;
-- 2. Installed applications (macOS apps)
SELECT
si.computer_name,
a.name,
a.bundle_short_version,
a.bundle_version,
a.path
FROM apps a
CROSS JOIN system_info si
WHERE a.name IN ('Safari', 'Google Chrome', 'Firefox', 'Microsoft Edge', 'Adobe Acrobat Reader', 'Zoom', 'Slack');
-- 3. Gatekeeper and SIP status
SELECT
si.computer_name,
'Gatekeeper' AS key,
CASE WHEN g.assessments_enabled = 1 THEN 'ENABLED' ELSE 'DISABLED' END AS value
FROM gatekeeper g
CROSS JOIN system_info si
UNION ALL
SELECT
si.computer_name,
'SIP' AS key,
CASE WHEN sip.enabled = 1 THEN 'ENABLED' ELSE 'DISABLED' END AS value
FROM sip_config sip
CROSS JOIN system_info si
WHERE sip.config_flag = 'sip';
-- 4. FileVault encryption status
SELECT
si.computer_name,
f.user_uuid,
f.status,
CASE WHEN f.status = 'Off' THEN 'UNENCRYPTED' ELSE 'ENCRYPTED' END AS encryption_status
FROM filevault_users f
CROSS JOIN system_info si;
```
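The packs above can also be exercised ad hoc, before any FleetDM deployment, by piping queries through the local `osqueryi` shell. A minimal sketch (the EOL version list is illustrative and must be kept current yourself):

```python
import json
import subprocess

# Illustrative set of end-of-life macOS majors; maintain this list yourself
EOL_MAJORS = {"10.15", "11", "12"}

def run_osquery(sql: str) -> list[dict]:
    """Run one query through the local osqueryi shell and parse its --json output."""
    out = subprocess.run(["osqueryi", "--json", sql],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def is_eol(version: str) -> bool:
    """True when a reported macOS version belongs to an end-of-life major release."""
    parts = version.split(".")
    # 10.x releases are distinguished by their first two components
    key = ".".join(parts[:2]) if parts[0] == "10" else parts[0]
    return key in EOL_MAJORS

# Example usage on an enrolled Mac:
#   rows = run_osquery("SELECT name, version FROM os_version;")
#   eol_hosts = [r for r in rows if is_eol(r["version"])]
```

The same wrapper works for any query in the packs above, which makes it a cheap way to validate query syntax before scheduling it in FleetDM.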
---
## Building the Custom TVM Platform on osquery + FleetDM
### Step 1: Deploy FleetDM (1 day)
```bash
# Option A: Docker Compose (fastest for proof of concept)
git clone https://github.com/fleetdm/fleet.git
cd fleet/tools/osquery
docker-compose up -d
# Option B: Binary deployment for production
curl -L https://github.com/fleetdm/fleet/releases/latest/download/fleet.zip -o fleet.zip
unzip fleet.zip
./fleet prepare db
./fleet serve
```
### Step 2: Enroll Endpoints (1 day)
Generate an enrollment secret in FleetDM, then deploy osquery with FleetDM configuration:
```bash
# Windows (via Intune, GPO, or script)
# Install osquery MSI with FleetDM flags
osqueryd.exe --enroll_secret_path=C:\ProgramData\osquery\secret.txt --tls_hostname=fleet.yourcompany.com:443
# Linux (via package manager + config)
apt install osquery
# Edit /etc/osquery/osquery.flags:
# --enroll_secret_path=/etc/osquery/secret.txt
# --tls_hostname=fleet.yourcompany.com:443
# (plus the enroll/config/logger TLS endpoint flags from Fleet's documentation)
systemctl enable osqueryd && systemctl start osqueryd
# macOS (via MDM or script)
brew install osquery
# Similar flag configuration
launchctl load /Library/LaunchDaemons/io.osquery.agent.plist  # older packages use com.facebook.osqueryd.plist
```
### Step 3: Define Policies (Compliance Checks)
FleetDM policies are scheduled queries that evaluate to PASS or FAIL:
```sql
-- Policy: Disk encryption enabled (Windows)
SELECT 1 FROM bitlocker_info WHERE drive_letter = 'C:' AND protection_status = 1;
-- Policy: macOS FileVault enabled
SELECT 1 FROM filevault_users WHERE status = 'On';
-- Policy: No password authentication on SSH (Linux)
SELECT 1 FROM ssh_configs WHERE key = 'PasswordAuthentication' AND value = 'no';
-- Policy: No root login via SSH (Linux)
SELECT 1 FROM ssh_configs WHERE key = 'PermitRootLogin' AND value = 'no';
-- Policy: Windows Firewall enabled on all profiles
SELECT 1 FROM windows_security_center WHERE firewall = 'Good';
-- Policy: Critical OS patches within 30 days
SELECT 1 FROM patches WHERE installed_on > date('now', '-30 days');
```
**Dashboard output**: FleetDM shows percentage compliance per policy across all enrolled hosts.
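The same compliance data is available over Fleet's REST API, so policy pass rates can feed reports outside the dashboard. A hedged sketch: the endpoint path and the `passing_host_count`/`failing_host_count` field names follow Fleet's documented v1 API, but verify them against your Fleet version, and `FLEET_URL` and the token are placeholders:

```python
import requests

FLEET_URL = "https://fleet.yourcompany.com"           # assumption: your Fleet host
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}  # API-only user token

def compliance_pct(passing: int, failing: int) -> float:
    """Percentage of reporting hosts that pass a policy; 0.0 before any host reports."""
    total = passing + failing
    return round(100.0 * passing / total, 1) if total else 0.0

def policy_report() -> list[tuple[str, float]]:
    """Pull global policies and their pass rates from the Fleet REST API."""
    resp = requests.get(f"{FLEET_URL}/api/v1/fleet/global/policies",
                        headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return [(p["name"],
             compliance_pct(p["passing_host_count"], p["failing_host_count"]))
            for p in resp.json()["policies"]]
```

This is the building block for the weekly executive brief: a sorted `policy_report()` is already a "worst compliance first" list.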
### Step 4: Vulnerability Correlation (The Custom Layer)
FleetDM's free tier includes basic CVE mapping for installed software. For advanced correlation, build a custom pipeline:
```python
# vuln-correlator.py
# Runs nightly: pulls FleetDM software inventory, correlates with CVE database
import requests
import sqlite3
from datetime import datetime
# 1. Pull software inventory from FleetDM API
FLEET_API = "https://fleet.yourcompany.com/api/v1"
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}
hosts = requests.get(f"{FLEET_API}/hosts", headers=HEADERS).json()["hosts"]
# 2. Connect to local CVE database (NVD dump or vulners)
conn = sqlite3.connect("cve-db.sqlite")
cursor = conn.cursor()
findings = []
for host in hosts:
host_id = host["id"]
host_name = host["hostname"]
# Get installed software
software = requests.get(f"{FLEET_API}/hosts/{host_id}/software", headers=HEADERS).json()["software"]
for app in software:
name = app["name"]
version = app["version"]
# Query CVE database for this software+version
cursor.execute("""
SELECT cve_id, severity, description
FROM cves
WHERE software_name = ? AND affected_versions LIKE ?
""", (name, f"%{version}%"))
vulns = cursor.fetchall()
for cve_id, severity, description in vulns:
findings.append({
"host": host_name,
"software": name,
"version": version,
"cve": cve_id,
"severity": severity,
"description": description[:200]
})
# 3. Generate report
import json

with open(f"vuln-report-{datetime.now().strftime('%Y%m%d')}.json", "w") as f:
    json.dump(findings, f, indent=2)
# 4. Push critical findings to SIEM or Slack
# (Integration code here)
```
### Step 5: AI-Assisted Prioritization
Feed osquery/FleetDM data into the AI TVM prioritization engine:
```
[FleetDM Software Inventory] ──▶ [CVE Correlator] ──▶ [AI Prioritization]
[FleetDM Policy Failures] ──▶ [Risk Scoring] ──▶ [AI Prioritization]
[osquery Listening Ports] ──▶ [Exposure Analysis] ──▶ [AI Prioritization]
[osquery OS Version] ──▶ [EOL Detection] ──▶ [AI Prioritization]
──▶ [Executive Brief]
```
The AI receives structured, queryable data from osquery—not proprietary scan reports. This means:
- You can ask the AI: "Which hosts have both Adobe Reader and an open RDP port?"
- You can ask the AI: "Show me all Linux servers running kernel versions with known CVEs"
- You can ask the AI: "What changed in our software inventory since last week?"
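The "what changed since last week" question, for instance, reduces to a set difference over two inventory snapshots. A minimal sketch, assuming snapshots have already been pulled from the FleetDM API as `(software, version)` pairs:

```python
def inventory_diff(last_week: set[tuple[str, str]],
                   this_week: set[tuple[str, str]]) -> dict[str, set]:
    """Set difference over two (software, version) snapshots of the estate."""
    return {"added": this_week - last_week,
            "removed": last_week - this_week}

# Example: an OpenSSL upgrade shows up as one removal plus one addition
old = {("OpenSSL", "3.0.2"), ("Chrome", "124.0")}
new = {("OpenSSL", "3.0.13"), ("Chrome", "124.0"), ("7-Zip", "23.01")}
delta = inventory_diff(old, new)
```

Feeding `delta` to the AI instead of the full inventory keeps the prompt small and the answer focused on what actually changed.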
---
## The Consultant's Delivery Model
### Engagement 1: Osquery Discovery Sprint (5 days)
| Day | Activity | Deliverable |
|-----|----------|-------------|
| 1 | Deploy FleetDM proof-of-concept | Operational FleetDM instance |
| 2 | Enroll 10-20 representative hosts | Live endpoint data flowing |
| 3 | Run vulnerability discovery query packs | Raw findings exported |
| 4 | Build custom queries for client's specific concerns | Client-specific query library |
| 5 | Present findings + propose scaled deployment | Board-ready report; deployment roadmap |
**Investment**: €3,500-€5,500 (labor only; software is free)
**Standalone value**: Complete asset and vulnerability inventory of representative estate
### Engagement 2: Custom Platform Build (30 days)
- Scale FleetDM to full estate
- Build custom query library for client's specific compliance and security needs
- Integrate CVE correlation pipeline
- Build executive dashboards
- Train internal team on query authoring
- Hand over operational control
### Engagement 3: Continuous Improvement Retainer
- Monthly: New CVE correlation rules, query tuning, policy updates
- Quarterly: Purple team exercise using osquery data for detection validation
- Annually: Platform architecture review, query library refresh
---
## When Osquery Is the Right Choice
| Scenario | Recommendation |
|----------|---------------|
| Client has 50-5,000 endpoints, no existing scanner | **osquery + FleetDM is ideal.** Cheaper, more flexible, and sovereign. |
| Client has 5,000-50,000 endpoints, heterogeneous | **osquery + FleetDM can scale.** Consider premium tier or multi-node deployment. |
| Client needs compliance audit trails (PCI, SOC 2) | **Supplement with commercial scanner.** Auditors prefer vendor-validated reports. osquery provides operational intelligence; commercial scanner provides audit evidence. |
| Client has heavy OT/ICS environment | **osquery for IT endpoints; specialized scanner for OT.** osquery does not speak Modbus or OPC-UA. |
| Client wants "set and forget" | **Commercial scanner may be better.** osquery requires ongoing query authoring and maintenance. |
---
## Talking Points for the CTO
**When they say**: *"We are considering Tenable but it is expensive."*
**You respond**:
> *"Tenable is excellent at what it does. But it is a rented microscope with a fixed lens. Osquery is a laboratory. For the cost of one Tenable subscription, you can build a sovereign vulnerability discovery platform that answers questions Tenable never thought to ask. Let us run a 5-day proof of concept. If osquery does not find actionable vulnerabilities in your environment, you have the evidence to justify Tenable. If it does, you have a cheaper, more flexible alternative that you own outright."*
**When they say**: *"We do not have the expertise to write SQL queries."*
**You respond**:
> *"You do not need to write them from scratch. The osquery community has published thousands of battle-tested queries. FleetDM includes hundreds of pre-built policies. We start with those, customize them for your environment, and train your team to extend them. The expertise grows with the platform."*
**When they say**: *"Our SIEM already collects endpoint data."*
**You respond**:
> *"Your SIEM collects logs. Logs are what the system chose to record. Osquery queries are what you choose to ask. A log might tell you a process started. An osquery query can tell you every process with a network connection, its parent process, its binary hash, and whether that hash matches a known good baseline. The difference is interrogation versus observation."*
---
## Integration With Existing Frameworks
| Document | Integration |
|----------|-------------|
| [Zero-Budget Vulnerability Discovery](zero-budget-vulnerability-discovery.md) | osquery is the most powerful zero-budget discovery method; it replaces or supplements PowerShell/SSH scripts |
| [AI-Assisted TVM Blueprint](ai-assisted-tvm.md) | osquery provides the structured data feed for AI prioritization; it is the discovery layer of the AI TVM architecture |
| [Perimeter Scanning Capability](perimeter-scanning-capability.md) | osquery covers internal endpoints; perimeter scanning covers external attack surface; together they provide complete visibility |
| [Modular Engagements](../core/modular-engagements.md) | osquery sprint can be delivered as a standalone 5-day module or as the foundation of a larger TVM engagement |
| [Business Case Template](business-case-template.md) | osquery + FleetDM costs vs. commercial scanner costs |
---
*For script-based discovery without agents, see [Zero-Budget Vulnerability Discovery](zero-budget-vulnerability-discovery.md).*
*For the AI prioritization layer, see [AI-Assisted TVM Blueprint](ai-assisted-tvm.md).*
*For external attack surface scanning, see [Perimeter Scanning Capability](perimeter-scanning-capability.md).*


@@ -0,0 +1,344 @@
# Perimeter Scanning Capability: Build, Partner, or Hybrid?
> *"You cannot prioritize what you cannot see. And your internal vulnerability scanner will never tell you what the internet sees."*
This document provides a strategic framework for building external attack surface visibility—the "outside-in" perspective that reveals what adversaries (and AI-powered scanners like Mythos) see when they look at your organization from the public internet.
It addresses the build-vs-partner decision for perimeter scanning and maps external findings into the AI-assisted TVM prioritization engine.
---
## Why External Scanning Is Non-Negotiable
### The Asymmetry Problem
An adversary attacking your organization starts from the outside. They see:
- Your public IP ranges
- Your exposed services and ports
- Your forgotten cloud storage buckets
- Your expired certificates
- Your development sites still publicly accessible
- Your subsidiary domains you forgot you owned
Your internal vulnerability scanner sees none of this. It scans from the inside, authenticated, with full knowledge of the network. The adversary scans from the outside, unauthenticated, with zero prior knowledge.
**The board framing**:
> *"Your internal scanner says your web servers are patched. But from the internet, we can see three development instances running outdated Apache versions on ports you did not know were exposed. The internal scanner is blind to what the adversary sees first. External scanning closes that blindness."*
### The Mythos-Specific Risk
AI-powered scanning agents do not sleep, do not get bored, and do not miss open ports because they were in a hurry. They:
- Scan entire IPv4 space continuously
- Correlate services with CVE databases in real time
- Chain findings: open port + service version + known exploit = instant target
- Discover forgotten assets faster than human reconnaissance teams
If you are not scanning your perimeter at least as aggressively as your adversaries, you are relying on luck.
---
## The Three Models
| Model | Description | Investment | Timeline | Best For |
|-------|-------------|-----------|----------|----------|
| **Build (Open-Source)** | Self-hosted Nuclei, OpenVAS, Amass on cheap VPS infrastructure | Low (€200-500/month infrastructure) | 1-2 weeks to operational | Tech-savvy teams; consultants who want independence; proof-of-concept |
| **Partner (Commercial)** | Shodan Enterprise, Censys, Tenable.asm, Cortex Xpanse, Mandiant ASM | Medium to high (€10K-€100K/year) | Immediate (SaaS) | Organizations needing continuous monitoring, compliance evidence, or limited internal expertise |
| **Hybrid** | Open-source stack for active scanning + commercial platform for passive discovery and trends | Medium (€15K-€30K/year) | 2-4 weeks | Most organizations; balances cost, capability, and coverage |
---
## Model 1: Build (Open-Source Stack)
### The Consultant's Scanning Infrastructure
For a consulting practice, owning your own scanning capability provides independence, speed, and a differentiator.
**Infrastructure**: 2-3 cheap VPS instances (Hetzner, DigitalOcean, Vultr)
- €5-10/month per instance
- Distributed across geographies (EU, US, Asia) to simulate global adversary perspective
- Containerized scanning workloads
**Core Stack**:
| Tool | Purpose | Why It Matters |
|------|---------|---------------|
| **Amass** | DNS enumeration, subdomain discovery, asset mapping | Finds forgotten domains, dev sites, acquisitions |
| **Naabu** | Fast port scanning | Identifies exposed services beyond standard ports |
| **httpx** | Web service fingerprinting | Identifies technologies, versions, and potential vulnerabilities |
| **Nuclei** | Vulnerability detection (10,000+ templates) | Specific CVE detection, misconfiguration checks, exposed panels |
| **Subfinder** | Passive subdomain discovery | Leverages certificate transparency, search engines, archives |
| **Katana** | Web crawler | Discovers hidden endpoints, API paths, exposed files |
| **Gau** (GetAllUrls) | URL enumeration from archives | Finds old URLs that might still resolve to live services |
| **OpenVAS / Greenbone** | Full vulnerability scanning | Deep inspection of discovered services |
| **Nmap + NSE scripts** | Service detection and vulnerability checks | Reliable, comprehensive, scriptable |
### Deployment Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ SCANNING CONTROLLER │
│ (Scheduling, results aggregation, report generation) │
│ - Cron jobs or Jenkins/GitHub Actions │
│ - SQLite/PostgreSQL for results storage │
│ - Python/PowerShell for report generation │
└────────────────────┬────────────────────────────────────────┘
┌───────────────┼───────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ EU VPS │ │ US VPS │ │ ASIA VPS│
│ Amass │ │ Amass │ │ Amass │
│ Nuclei │ │ Nuclei │ │ Nuclei │
│ Naabu │ │ Naabu │ │ Naabu │
└─────────┘ └─────────┘ └─────────┘
```
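The controller's results-storage step can be as simple as parsing Nuclei's JSONL export into SQLite. A sketch, assuming Nuclei's documented JSON export fields (`template-id`, `info.severity`, `host`); verify the field names against your Nuclei version:

```python
import json
import sqlite3

def load_nuclei_jsonl(lines):
    """Parse nuclei JSONL export lines into (template, severity, host) rows.

    Field names follow nuclei's JSON export format; verify against your version."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        finding = json.loads(line)
        rows.append((finding.get("template-id"),
                     finding.get("info", {}).get("severity"),
                     finding.get("host")))
    return rows

def store(rows, db_path="scan-results.sqlite"):
    """Append parsed findings to the controller's SQLite results store."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS findings"
                " (template TEXT, severity TEXT, host TEXT)")
    con.executemany("INSERT INTO findings VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()
```

Because every VPS writes into the same schema, trend reports ("is our exposed attack surface growing?") become a single SQL query over the `findings` table.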
### The First External Scan Protocol
```bash
# 1. PASSIVE RECONNAISSANCE (no packets sent to target)
amass enum -d client-domain.com -o amass-results.txt
subfinder -d client-domain.com -o subfinder-results.txt
# Merge and deduplicate
cat amass-results.txt subfinder-results.txt | sort -u > all-domains.txt
# 2. DISCOVER LIVE SERVICES
cat all-domains.txt | httpx -o live-web-services.txt -tech-detect -status-code
cat all-domains.txt | naabu -p - -o live-ports.txt
# 3. VULNERABILITY DETECTION
nuclei -list live-web-services.txt -severity critical,high -o nuclei-findings.txt
# 4. DEEP INSPECTION (for high-value targets)
nmap -sV -sC -O --script vuln -iL live-ports-targets.txt -oA deep-scan
# 5. REPORT GENERATION
# Aggregate, deduplicate, prioritize
```
**What this produces in 4 hours**:
- Complete subdomain map
- All live web services with technology fingerprinting
- Critical and high-severity vulnerability findings
- Exposed development sites, admin panels, default credentials
- Certificate expiration warnings
- Geographic distribution of exposed services
### Limitations of the Build Model
| Limitation | Impact | Mitigation |
|-----------|--------|------------|
| IP blocking by CDNs / WAFs | Incomplete scan results | Rotate source IPs; use multiple VPS locations; respect rate limits |
| Legal exposure | Scanning without explicit authorization is illegal in many jurisdictions | Always have written authorization; define scope strictly; exclude third-party infrastructure |
| Maintenance burden | Tools require updates; templates require refresh | Automated CI/CD pipeline for tool updates; weekly template sync for Nuclei |
| No historical trending | Point-in-time snapshots only | Store results in database; generate trend reports quarterly |
| Limited cloud asset discovery | Cannot see inside AWS/Azure/GCP without API access | Supplement with cloud-native discovery (see zero-budget discovery) |
---
## Model 2: Partner (Commercial Platforms)
### Shodan / Censys (Passive Discovery)
**What they do**: Continuously scan the entire internet and index services, certificates, devices, and vulnerabilities. You query their database instead of scanning yourself.
**Best for**:
- Discovering forgotten assets ("We did not know we had a server in that IP range")
- Historical tracking ("When did this service first appear on the internet?")
- IoT/OT device discovery (Shodan specializes in industrial control systems)
- Certificate transparency monitoring (detects unauthorized certificates)
**Pricing**:
- Shodan API: ~$60/month for developer; Enterprise starts at ~$10K/year
- Censys: Similar pricing tiers
**The consultant use case**: Even without enterprise licensing, API credits allow you to query client IP ranges during assessments. The output is professional-grade and defensible.
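In practice that query is a single REST call against Shodan's host endpoint, followed by a reduction to the fields an assessment report needs. A sketch, assuming a developer-tier key; the `ports`/`vulns`/`data` fields appear in Shodan's host JSON, but treat their presence as optional:

```python
import requests

SHODAN_KEY = "YOUR_API_KEY"   # assumption: a developer-tier Shodan key

def shodan_host(ip: str) -> dict:
    """Look up one IP in Shodan's index; no packets ever reach the target."""
    r = requests.get(f"https://api.shodan.io/shodan/host/{ip}",
                     params={"key": SHODAN_KEY}, timeout=30)
    r.raise_for_status()
    return r.json()

def summarize(host: dict) -> dict:
    """Reduce a Shodan host record to the fields an assessment report needs."""
    return {"open_ports": sorted(host.get("ports", [])),
            "known_vulns": sorted(host.get("vulns", [])),
            "banners": len(host.get("data", []))}
```

Because this is purely passive, it can be run before the authorization letter for active scanning is signed, which makes it a useful first deliverable.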
### Tenable Attack Surface Management (Tenable.asm)
**What it does**: Continuous attack surface monitoring combining external scanning, cloud API integration, and business context.
**Best for**:
- Clients who need compliance-ready external scanning
- Continuous monitoring (not point-in-time)
- Integration with Tenable.io / Tenable.sc for unified internal + external view
**Pricing**: ~€15K-€50K/year depending on asset count
### Cortex Xpanse (Palo Alto Networks)
**What it does**: Enterprise-grade attack surface management with threat intelligence integration.
**Best for**:
- Large enterprises with complex M&A history
- Organizations needing integration with Palo Alto firewalls and Prisma Cloud
- High-frequency M&A environments where attack surface changes constantly
### Mandiant Attack Surface Management (Google Cloud)
**What it does**: Combines attack surface monitoring with Mandiant threat intelligence.
**Best for**:
- Organizations facing advanced persistent threats
- Clients who want attack surface data correlated with APT TTPs
---
## Model 3: Hybrid (Recommended for Most Clients)
The hybrid model combines the strengths of both approaches:
### The Consultant's Hybrid Stack
| Function | Tool | Model |
|----------|------|-------|
| **Continuous passive discovery** | Shodan API + Censys API | Partner (€500-€1K/month) |
| **Active vulnerability scanning** | Nuclei + OpenVAS on consultant VPS | Build (€200/month) |
| **Deep penetration testing** | Nmap + custom scripts + manual validation | Build (labor) |
| **Cloud asset correlation** | AWS/Azure/GCP APIs + native security tools | Build (free APIs) |
| **Historical trending and reporting** | Self-hosted database + Grafana + AI synthesis | Build (€50/month) |
| **Compliance validation** | Tenable.asm or Qualys WAS (optional) | Partner (if required) |
### Why Hybrid Wins
1. **Cost efficiency**: Passive discovery via APIs is cheap. Active scanning is cheaper self-hosted than commercial.
2. **Coverage**: APIs find things your scanners miss (historical data, third-party mentions). Active scanning validates exploitability.
3. **Independence**: You are not locked into a single vendor. If Shodan raises prices, you can shift to Censys or increase active scanning.
4. **Credibility**: Having your own infrastructure demonstrates technical competence. Clients trust consultants who own their tools.
---
## The Perimeter-to-TVM Integration
External scanning findings must feed into the vulnerability prioritization engine. Here is how:
### The Outside-In Risk Multiplier
A vulnerability on an **internet-facing** system is exponentially more dangerous than the same vulnerability on an internal workstation. The AI-assisted TVM engine weights findings accordingly:
| Finding Location | Risk Multiplier | Why |
|-----------------|-----------------|-----|
| Internet-facing, no WAF | 10x | Direct adversary access; no defense in depth |
| Internet-facing, behind WAF | 5x | WAF bypass possible; still directly reachable |
| DMZ, reachable from internet | 4x | Compromise enables lateral movement |
| Internal, privileged access | 3x | High impact if compromised; requires initial access first |
| Internal, standard user | 1x | Baseline risk |
| Air-gapped / OT | 2x-5x | Isolation is protection, but compromise is catastrophic |
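In scoring terms, the multiplier table above turns one base severity into very different remediation priorities depending on exposure. A minimal sketch, where the cap at 100 and the air-gapped midpoint value are illustrative choices, not part of any standard:

```python
# Multipliers from the table above; air-gapped/OT shown at its range midpoint
MULTIPLIER = {
    "internet_no_waf": 10.0,
    "internet_waf": 5.0,
    "dmz": 4.0,
    "internal_privileged": 3.0,
    "internal_standard": 1.0,
    "airgapped_ot": 3.5,
}

def weighted_risk(cvss: float, location: str) -> float:
    """Scale a base CVSS score by exposure; capped at 100 for ranking purposes."""
    return min(cvss * MULTIPLIER[location], 100.0)

# The same CVSS 7.5 finding ranks an order of magnitude apart:
# internet-facing without a WAF versus an internal workstation.
```

The point is not the exact constants but the ordering they produce: an internet-facing medium outranks an internal critical, which is exactly the inversion most internal-only scanners miss.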
### The Integration Pipeline
```
[External Scan Results]
├─ Nuclei findings (CVEs, misconfigs)
├─ Shodan/Censys exposed services
├─ Certificate issues
└─ Cloud-exposed storage/assets
[Correlation Engine]
├─ Map external finding to internal asset (IP → hostname → owner)
├─ Cross-reference with internal vulnerability scan
├─ Check compensating controls (WAF, CDN, rate limiting)
└─ Apply outside-in risk multiplier
[AI Prioritization]
├─ Exploitability prediction
├─ Threat intelligence correlation
├─ Business impact assessment
└─ Generate ranked remediation list
[Executive Dashboard]
├─ "Top 10 internet-facing risks"
├─ "Attack surface trend: growing or shrinking?"
├─ "Mean time to remediate externally exposed vulnerability"
└─ Board-ready brief
```
### The Weekly Cadence
| Day | Activity | Source |
|-----|----------|--------|
| Monday | Review Shodan/Censys alerts for new exposed services | Passive APIs |
| Tuesday | Run targeted Nuclei scan on new/changed assets | Active scanning |
| Wednesday | Correlate external findings with internal vulnerability data | Integration engine |
| Thursday | Generate AI-prioritized action list | AI TVM engine |
| Friday | Executive brief: "What changed on our perimeter this week?" | Automated synthesis |
---
## The Board Conversation
**When the CTO asks**: *"Do we need to pay for external scanning? Can't we just use our internal scanner?"*
**You respond**:
> *"Your internal scanner sees what is inside your walls. Mythos—and every criminal scanner on the internet—sees what is outside your walls. In our first external scan of your perimeter, we found [X] services exposed to the internet that your internal team did not know existed. [Y] of them have known vulnerabilities. [Z] of them are running end-of-life software. The internal scanner will never find these because they are outside its scope. External scanning is not optional. It is the perspective your adversary already has."*
**When the CFO asks**: *"How much does this cost?"*
**You respond**:
> *"A hybrid approach—combining API-based passive monitoring with active open-source scanning—costs approximately €1,000-€2,000 per month. That is less than the cost of one incident response retainer day. And it provides continuous visibility, not a quarterly snapshot. If you need compliance-ready reports, we can add Tenable or Qualys later. But the baseline visibility is achievable now, at low cost."*
---
## Legal and Ethical Considerations
### Authorization Is Mandatory
Never scan a client's infrastructure—or any infrastructure—without explicit, written authorization.
**The authorization letter must specify**:
- Exact IP ranges, domains, and cloud accounts in scope
- Excluded systems (production payment gateways, safety-critical OT, third-party services)
- Scanning intensity and timing restrictions
- Emergency contact for scan-related incidents
- Data handling: where scan results are stored, who has access, retention period
### Rate Limiting and Resilience
- Respect `robots.txt` on web services (though adversaries do not)
- Limit concurrent connections to avoid service disruption
- Scan during maintenance windows for critical services
- Have an immediate stop mechanism if unexpected impact occurs
### Data Sovereignty
- Scan results may contain sensitive data (service versions, internal hostnames, certificate details)
- Store results in the client's jurisdiction
- Encrypt at rest and in transit
- Delete results after the engagement unless contract specifies retention
---
## The Consultant's Advantage
Owning perimeter scanning capability provides three competitive advantages:
1. **Speed**: You can deliver external attack surface findings in 24 hours, not the 2-week procurement cycle of commercial platforms.
2. **Differentiation**: Most M365 consultants do not offer attack surface management. You do.
3. **Entry vector**: External scanning often reveals the most compelling findings—exposed admin panels, outdated services, forgotten acquisitions. These findings naturally lead to broader engagement.
**The pitch**:
> *"Before we discuss your M365 security or endpoint management, let us scan your perimeter. In 24 hours, we will show you what the internet sees. I suspect we will find something that changes your prioritization. If we do not, the scan was free. If we do, we have the evidence to justify the security investments you have been considering."*
---
## Integration With Existing Frameworks
| Document | Integration |
|----------|-------------|
| [AI-Assisted TVM Blueprint](ai-assisted-tvm.md) | Perimeter findings feed the AI prioritization engine with outside-in risk weighting |
| [Zero-Budget Vulnerability Discovery](zero-budget-vulnerability-discovery.md) | Internal discovery (scripts + osquery) + external scanning = complete visibility |
| [Business Case Template](business-case-template.md) | Perimeter scanning costs (€1K-€2K/month hybrid) vs. incident response costs |
| [Osquery: The Sovereign Discovery Platform](osquery-custom-platform.md) | osquery covers internal endpoint visibility; perimeter scanning covers external attack surface; together they provide complete visibility |
| [Modular Engagements](../core/modular-engagements.md) | Perimeter scan can be delivered as a standalone 2-3 day module or included in AI TVM |
---
*For internal vulnerability discovery without commercial tools, see [Zero-Budget Vulnerability Discovery](zero-budget-vulnerability-discovery.md).*
*For the sovereign endpoint discovery platform, see [Osquery: The Sovereign Discovery Platform](osquery-custom-platform.md).*
*For the AI-assisted prioritization layer, see [AI-Assisted TVM Blueprint](ai-assisted-tvm.md).*


@@ -0,0 +1,322 @@
# Rapid Modernisation Plan
> *"We must change our strategy from 'detect the attacker in time' to 'become the target that is not worth attacking.' Reactive mode is unsustainable. We must ensure the game is played on our field."*
## For the Executive Reader
This is not a three-year digital transformation. It is a **180-day strategic reset** with measurable business outcomes at each phase gate.
| Phase | Timeline | What the Board Sees |
|-------|----------|---------------------|
| **Hygiene** | Days 0-30 | Visibility. For the first time, we know every identity, asset, and gap that could end the company. |
| **Control** | Days 30-60 | Containment. The highest-risk exposures are closed using tools already owned. |
| **Sovereignty** | Days 60-90 | Ownership. Proprietary intelligence is reclaimed. Recovery from disaster is proven, not assumed. |
| **Antifragility** | Days 90-180 | Advantage. The organization learns faster from disruption than competitors do. |
**Investment principle**: Configuration first. Procurement only if justified. Most value is extracted from existing tools before any new purchase is discussed.
**Governance**: Weekly steering committee. Monthly board update. Quarterly antifragility assessment. Hard go/no-go gates at days 30, 60, and 90.
**Modularity**: While this document presents the full 180-day program, every phase can be delivered as an independent, fixed-scope module. See [Modular Engagements](../core/modular-engagements.md) for the menu of standalone engagements.
*For the business case and financial justification, see [Business Case Template](business-case-template.md).*
*For board conversation guidance, see [C-Suite Conversation Guide](../core/c-suite-conversation-guide.md).*
---
## For the Practitioner
This playbook provides a **time-boxed, phase-gated roadmap** for transforming a fragile enterprise into an antifragile one. It is designed for immediate deployment in consulting engagements and can be adapted to organizational size, industry, and regulatory context.
The plan is structured in **four phases**: Hygiene (30 days), Control (60 days), Sovereignty (90 days), and Antifragility (180 days). Each phase builds on the previous. Skipping phases creates the illusion of progress while leaving structural fragility intact.
> **Core tenet**: Before any new purchase is discussed, exhaust the capabilities of existing tooling. See the [Zero-Budget Hardening Playbook](zero-budget-hardening.md) for the tactical expression of this principle.
---
## Phase 1: Hygiene (Days 0-30)
**Theme**: *You cannot defend what you cannot see.*
The first 30 days are aggressive, disruptive, and non-negotiable. The goal is not perfection; it is **visibility**. Every unknown identity, unmapped dependency, and unmonitored access path is a latent failure waiting to happen.
### Week 1-2: Identity and Access Blitz
**Tool strategy**: Use existing AD / Entra ID / IAM. No new purchases.
| Action | Owner | Deliverable | Existing Tool Leverage |
|--------|-------|-------------|------------------------|
| Aggressive identity audit | IAM / Security | Complete inventory of all human and non-human identities | ADUC, Entra ID portal, AWS IAM console |
| Disable all unknown / unused accounts | IAM | List of disabled accounts with business justification for exceptions | Existing IAM + PowerShell / CLI scripts |
| Rotate all critical passwords and shared secrets | Security Ops | Rotation log with verification | Existing IAM + LAPS (free from Microsoft) |
| Target: admin accounts, service accounts, krbtgt equivalents | AD / Cloud IAM | Documentation of every privileged account | Existing directory services |
| Implement password hygiene (minimum: audit) | IAM | Baseline report on password policy compliance | Native password policies + audit logs |
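The "disable unknown/unused accounts" action typically starts from a last-sign-in export out of ADUC or Entra ID. A hedged sketch of the triage step; the CSV column names (`account`, `last_logon`) are assumptions about your export, not a standard:

```python
import csv
from datetime import datetime, timedelta

def stale_accounts(rows, today, max_idle_days=90):
    """Flag accounts whose last sign-in predates the idle threshold.

    Expects dicts with 'account' and ISO-formatted 'last_logon' keys;
    the column names are assumptions about your directory export."""
    cutoff = today - timedelta(days=max_idle_days)
    return [r["account"] for r in rows
            if datetime.fromisoformat(r["last_logon"]) < cutoff]

# Example usage against a directory export:
#   rows = list(csv.DictReader(open("identity-export.csv")))
#   to_disable = stale_accounts(rows, datetime.now())
```

The output is the "disable first, justify second" list: every flagged account is disabled, and exceptions are re-enabled only with a written business justification.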
### Week 2-3: Perimeter and Communication Mapping
**Tool strategy**: Use native firewall management, open-source scanners, and manual audit before purchasing new NDR/VM platforms.
| Action | Owner | Deliverable | Existing Tool Leverage |
|--------|-------|-------------|------------------------|
| Audit all vendor / supplier access paths | Security / Procurement | Inventory of VPN, RDP, Citrix, SSH, FTP, SCP, API keys | Existing IAM, VPN logs, firewall logs |
| Review and document firewall rules | Network Team | Rule set with business justification for each | Native firewall management interfaces |
| Map public-facing assets from external perspective | Security | Attack surface report with P0 classification | Free/open-source: Shodan, certificate transparency logs, nmap |
| Implement aggressive vulnerability scanning | Security | Weekly scan results with trending | Existing scanner, Microsoft Defender Vulnerability Management, or OpenVAS |
### Week 3-4: Visibility and Monitoring Baseline
**Tool strategy**: Maximize existing EDR/SIEM before considering new platforms. A spreadsheet CMDB is infinitely better than no CMDB.
| Action | Owner | Deliverable | Existing Tool Leverage |
|--------|-------|-------------|------------------------|
| Deploy endpoint detection on all managed devices | SOC / MDE | Coverage report: % of estate monitored | Existing EDR (Defender, CrowdStrike, SentinelOne) |
| Establish log aggregation for critical systems | Security | Centralized logging for T0 and T1 assets | Existing SIEM, syslog server, or cloud native logging (Sentinel, CloudWatch, Cloud Logging) |
| Create initial CMDB seed for critical systems | IT / Security | CMDB populated with crown jewels | Existing ITAM, ServiceNow, or spreadsheet |
| Document "kill chain": shortest path to organizational failure | Security Architect | Threat model and mitigation map | Manual analysis + stakeholder interviews |
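The "spreadsheet CMDB" can be seeded in minutes. A minimal sketch, assuming a hand-curated asset list (hostnames, roles, and tiers below are illustrative placeholders):

```bash
#!/bin/sh
# Seed a CSV CMDB from a plain asset list. Real input would come from
# AD exports, hypervisor inventories, or a quick network sweep.
cat > assets.txt <<'EOF'
dc01,Domain Controller,T0
erp01,ERP Application,T1
print01,Print Server,T2
EOF

echo "hostname,role,tier,owner,last_verified" > cmdb-seed.csv
while IFS=, read -r host role tier; do
  # Owner is left blank deliberately: filling it in forces a real conversation
  echo "$host,$role,$tier,," >> cmdb-seed.csv
done < assets.txt

grep -c ',T0,' cmdb-seed.csv   # quick check: how many crown jewels are seeded
```

Even this crude file satisfies the Phase 1 exit criterion: T0/T1 assets named, tiered, and awaiting owners.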
### Phase 1 Exit Criteria
- [ ] 100% of identities known and validated
- [ ] 100% of privileged access reviewed
- [ ] All public-facing assets identified and scanned
- [ ] Centralized logging operational for critical systems
- [ ] CMDB seeded with T0/T1 assets
- [ ] Initial "kill chain" documented
### Phase 1 Mantra
> *"Do not be afraid to break things temporarily. Disable first, justify second. Visibility before permission."*
---
## Phase 2: Control (Days 30-60)
**Theme**: *What we have seen, we must now contain.*
With visibility established, the next 30 days focus on **closing the highest-risk gaps** without introducing operational paralysis. This is the phase of quick wins and surface reduction.
### Week 5-6: Attack Surface Reduction (ASR)
**Tool strategy**: ASR rules and PAWs are native Microsoft capabilities. For non-Microsoft environments, use existing endpoint management.
| Action | Owner | Deliverable | Existing Tool Leverage |
|--------|-------|-------------|------------------------|
| Eliminate shared accounts where possible | IAM | Reduction metric: % of shared accounts decommissioned | Existing IAM + access review process |
| Implement Attack Surface Reduction rules on endpoints | Endpoint Security | ASR policy deployed and compliance measured | Microsoft Defender ASR (already owned in E3/E5) |
| Harden admin access: dedicated PAWs, no browsing, no email | Security | PAW architecture documented and deployed | Existing Windows / Intune / GPO |
| Review and minimize permissions across all platforms | IAM / App Owners | Permission matrix with least-privilege gaps identified | Native IAM interfaces + scripts |
### Week 6-7: Network and DNS Security
**Tool strategy**: Use existing DNS infrastructure, firewall segmentation, and open-source sensors (Zeek/Suricata) before buying NDR.
| Action | Owner | Deliverable | Existing Tool Leverage |
|--------|-------|-------------|------------------------|
| Deploy DNS security (filtering, logging, anomaly detection) | Network | DNS security coverage report | Existing DNS infrastructure, Quad9/Cloudflare free tiers, Microsoft DNS security |
| Segment IT/OT networks where they intersect | Network / OT | Network segmentation diagram and policy | Existing firewalls and VLANs |
| Deploy network sensors at critical boundaries | SOC | Sensor coverage map with alerting validated | Zeek or Suricata (open-source) or existing IDS/IPS |
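As a concrete starting point for the sensor deployment, a couple of illustrative Suricata local rules (sids, variables, and thresholds are placeholders to tune before deploying):

```
# local.rules -- illustrative examples only; adjust $HOME_NET and sids to your environment
# DNS over TCP from client networks is rare and worth a look (possible tunnelling)
alert tcp $HOME_NET any -> any 53 (msg:"LOCAL DNS over TCP from internal host"; flow:to_server; sid:1000001; rev:1;)
# Outbound SSH from server VLANs to the internet is a classic exfiltration path
alert tcp $HOME_NET any -> $EXTERNAL_NET 22 (msg:"LOCAL outbound SSH from internal host"; flow:to_server; sid:1000002; rev:1;)
```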
### Week 7-8: Multi-Factor Authentication and Conditional Access
**Tool strategy**: MFA and conditional access are native capabilities of Entra ID, Okta, and cloud IAM. No additional purchase required.
| Action | Owner | Deliverable | Existing Tool Leverage |
|--------|-------|-------------|------------------------|
| Enforce MFA on all remote access paths | IAM | MFA coverage: 100% of remote access | Entra ID, Okta, Duo, or native cloud IAM MFA |
| Implement conditional access policies | IAM / Cloud | Policy set: device compliance, location, risk score | Entra ID Conditional Access, AWS IAM, GCP IAM |
| Review and harden M365 / Google Workspace security | Cloud Team | Cloud security posture report | Microsoft Secure Score, Google Security Health Analytics |
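The "MFA coverage: 100% of remote access" deliverable reduces to a one-liner once identities are exported. A sketch, assuming a hypothetical CSV export format (not a vendor schema):

```bash
#!/bin/sh
# Compute MFA coverage for remote-access users from an identity export.
# Column layout below is illustrative; map it to your IAM tool's export.
cat > mfa-export.csv <<'EOF'
user,remote_access,mfa_enrolled
alice,yes,yes
bob,yes,no
carol,no,no
EOF

awk -F, 'NR>1 && $2=="yes" { total++; if ($3=="yes") covered++ }
         END { printf "remote-access users: %d, MFA covered: %d (%.0f%%)\n",
               total, covered, 100*covered/total }' mfa-export.csv
```

The uncovered names, not the percentage, are the deliverable: each one is a conversation with an owner.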
### Phase 2 Exit Criteria
- [ ] Shared accounts reduced by minimum 50%
- [ ] ASR rules active on all managed endpoints
- [ ] MFA enforced on 100% of remote and privileged access
- [ ] DNS security operational
- [ ] Network segmentation policy defined and initial segments implemented
- [ ] Conditional access policies active for cloud workloads
### Phase 2 Mantra
> *"The goal is not to block everything. It is to ensure that every allowed path is known, justified, and monitored."*
---
## Phase 3: Sovereignty (Days 60-90)
**Theme**: *Reclaim what should never have been rented.*
This is where the antifragile approach diverges sharply from conventional hardening. The focus shifts from defending the perimeter to **owning the intelligence** that drives the organization.
### Week 9-10: AI Sovereignty Assessment
**Tool strategy**: Discovery requires interviews and proxy log analysis. No purchase needed for assessment.
| Action | Owner | Deliverable | Existing Tool Leverage |
|--------|-------|-------------|------------------------|
| Inventory all AI usage: approved and shadow | Security / AI Lead | AI usage map with data classification | Proxy logs, SaaS billing review, employee interviews |
| Classify AI workloads by sovereignty requirement | Security Architect | T0/T1/T2 AI asset classification | Existing data classification framework |
| Identify highest-value local AI pilot candidate | AI Lead / Business | Pilot scope document with success criteria | Business stakeholder interviews |
| Assess vendor AI terms: data usage, training, termination | Legal / Security | Risk register for each AI provider | Legal review of existing contracts |
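A first pass at shadow-AI discovery can run against existing proxy logs. A sketch, with a hypothetical log format and a deliberately partial starter list of AI endpoints:

```bash
#!/bin/sh
# Sweep proxy logs for traffic to known AI services. The log layout and
# domain list below are illustrative; grow the list from your own traffic.
cat > proxy-sample.log <<'EOF'
2025-01-10 09:12:01 alice api.openai.com 443
2025-01-10 09:13:44 bob internal-wiki.corp.local 443
2025-01-10 10:02:17 carol claude.ai 443
2025-01-10 10:05:59 alice chat.openai.com 443
EOF

AI_DOMAINS='openai\.com|claude\.ai|gemini\.google\.com|api\.mistral\.ai'

# Who is talking to which AI service, and how often
grep -E "$AI_DOMAINS" proxy-sample.log | awk '{ print $3, $4 }' | sort | uniq -c
```

The output feeds the AI usage map directly: every (user, service) pair gets a data-classification question attached.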
### Week 10-11: Local AI Infrastructure Deployment
**Tool strategy**: Start with existing hardware or low-cost sovereign cloud. Use open-source inference servers (Ollama, vLLM, llama.cpp).
| Action | Owner | Deliverable | Existing / Low-Cost Tool Leverage |
|--------|-------|-------------|----------------------------------|
| Deploy local inference infrastructure (on-prem or sovereign cloud) | Infrastructure | Operational inference cluster | Underutilized servers, retired workstations, or sovereign cloud VM |
| Establish model versioning and artifact management | MLOps / Security | Model registry with provenance tracking | Git + DVC or simple artifact storage |
| Implement access controls for model weights and training data | Security | T0-class protection for AI assets | Existing file servers, encryption, IAM |
| Deploy initial pilot: RAG or fine-tuned model on proprietary data | AI Team | Working pilot with performance baseline | Ollama, llama.cpp, or vLLM (open-source) + quantized open models |
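Model provenance tracking does not require an MLOps platform on day one. A minimal sketch of an append-only manifest with hash verification (file names are placeholders; real weights are multi-GB GGUF/safetensors files):

```bash
#!/bin/sh
# Register a model artifact with its hash, then verify it later.
mkdir -p models
printf 'fake-model-weights' > models/pilot-8b-q4.gguf

HASH=$(sha256sum models/pilot-8b-q4.gguf | awk '{ print $1 }')
DATE=$(date -u +%Y-%m-%d)

# Append-only manifest: file, sha256, date registered, source
echo "models/pilot-8b-q4.gguf,$HASH,$DATE,downloaded-from-approved-mirror" >> model-manifest.csv

# Verification is a re-hash against the most recent manifest entry
CURRENT=$(sha256sum models/pilot-8b-q4.gguf | awk '{ print $1 }')
RECORDED=$(awk -F, '$1=="models/pilot-8b-q4.gguf" { print $2 }' model-manifest.csv | tail -1)
[ "$CURRENT" = "$RECORDED" ] && echo "model weights verified" || echo "TAMPERING OR CORRUPTION"
```

Git + DVC formalizes exactly this pattern; the manifest proves the need before any tooling decision.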
### Week 11-12: Backup, Recovery, and Validation
**Tool strategy**: Use existing backup and DR infrastructure. The goal is to test and document, not to buy.
| Action | Owner | Deliverable | Existing Tool Leverage |
|--------|-------|-------------|------------------------|
| Perform full recovery drill of one critical system from backup | IT / Security | Recovery time documented, gaps identified | Existing backup solution |
| Validate backup integrity for all T0 assets | Backup Admin | Integrity report with sample restorations | Existing backup solution + integrity scripts |
| Test local AI pilot under degraded network conditions | AI / Infrastructure | Resilience validation report | Existing network infrastructure + manual testing |
| Document and exercise incident response for AI-specific threats | SOC / Security | Runbook: model poisoning, data exfiltration, adversarial input | Existing IR framework + internal knowledge |
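The backup-integrity step needs no new tooling: any backup product that can restore to a scratch location can be verified with stock hashing utilities. A sketch with illustrative paths standing in for the real backup job:

```bash
#!/bin/sh
# Hash-based backup integrity check: hash at backup time, verify after restore.
mkdir -p source backup
printf 'critical ERP data\n' > source/erp.db

# The copy stands in for your real backup/restore cycle
cp source/erp.db backup/erp.db

# Record hashes at backup time...
( cd source && sha256sum erp.db ) > backup-hashes.txt

# ...and verify them against the restored copy
( cd backup && sha256sum -c ../backup-hashes.txt )
```

A failed check here, before an incident, is exactly the kind of cheap stress the antifragile approach wants to surface.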
### Phase 3 Exit Criteria
- [ ] All AI usage inventoried and classified
- [ ] Local inference infrastructure operational
- [ ] One high-value AI pilot deployed and measured
- [ ] T0 protection applied to model weights and training data
- [ ] Critical system recovery drill completed successfully
- [ ] AI-specific incident response runbook created
### Phase 3 Mantra
> *"We are moving from being consumers of intelligence to manufacturers of our own. The vault is built; now we fill it."*
---
## Phase 4: Antifragility (Days 90-180)
**Theme**: *Build systems that grow stronger from disruption.*
The final phase converts the hardened foundation into an adaptive, learning organization. This is where antifragility becomes operational reality.
### Month 4: Structural Decoupling and Optionality
**Tool strategy**: Documentation, architecture, and open-source chaos tools (Chaos Mesh, Gremlin free tier, custom scripts). Work, not purchases.
| Action | Owner | Deliverable | Existing / Free Tool Leverage |
|--------|-------|-------------|------------------------------|
| Document exit architecture for all major platform dependencies | Enterprise Architecture | 90-day exit plan per critical vendor | Architecture documentation, existing runbooks |
| Implement abstraction layers for proprietary integrations | Engineering | Interface documentation and migration test | Existing development tools and frameworks |
| Establish dual-vendor readiness for one critical category | Procurement / Engineering | Technical proof of capability | Existing engineering capacity, open standards |
| Deploy chaos engineering: simulate critical dependency failure | Resilience Team | Chaos experiment report with findings | Chaos Mesh (open-source), custom scripts, Gremlin free tier |
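The "custom scripts" entry above can start absurdly small. A sketch of the minimal chaos experiment: inject a dependency outage and confirm the fallback path actually works (function and file names are illustrative):

```bash
#!/bin/sh
# Smallest viable chaos experiment: fail a dependency on purpose,
# verify the degraded-mode behavior instead of assuming it.
CHAOS_FAIL_DEPENDENCY=1   # toggle the injected failure

call_pricing_service() {
  if [ "$CHAOS_FAIL_DEPENDENCY" = "1" ]; then
    return 1   # injected outage
  fi
  echo "live-price"
}

get_price() {
  # Fall back to the last known good value when the dependency is down
  call_pricing_service || cat last-known-price.txt
}

echo "cached-price" > last-known-price.txt
get_price   # with the flag set, this exercises the fallback path
```

The same pattern scales up: replace the flag with a firewall rule or DNS blackhole, and the assertion with a production health check.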
### Month 5: Stress-to-Signal Conversion
**Tool strategy**: Process and culture changes require no licensing. Use existing EDR/SIEM for detection validation.
| Action | Owner | Deliverable | Existing Tool Leverage |
|--------|-------|-------------|------------------------|
| Implement blameless post-mortem process with structural mandates | Culture / Security | Post-mortem template and governance | Existing collaboration tools (Confluence, SharePoint, Notion) |
| Deploy production chaos engineering with automated rollback | Resilience Team | Monthly chaos experiment schedule | Existing orchestration + open-source chaos tools |
| Create feedback loop: incident findings → architecture changes | Security Architect | Closed-loop metrics: mean time to structural fix | Existing ticketing system (Jira, ServiceNow) |
| Launch "red team as a service": continuous adversarial testing | Security | Monthly red team report | Internal team + existing EDR/SIEM for detection validation |
### Month 6: Defensive AI and Continuous Modernisation
**Tool strategy**: Defensive AI runs on the local inference infrastructure already deployed. Posture measurement uses existing APIs and open-source dashboards.
| Action | Owner | Deliverable | Existing / Low-Cost Tool Leverage |
|--------|-------|-------------|----------------------------------|
| Expand local AI to defensive use cases: anomaly detection, code review, vulnerability prioritization | AI / Security | Defensive AI capability map | Local AI cluster deployed in Phase 3 |
| Implement automated security posture measurement | Security | Continuous compliance dashboard | Existing APIs (Microsoft Graph, AWS APIs) + Grafana or open-source dashboard |
| Evaluate and migrate additional AI workloads to local infrastructure | AI Lead | Migration roadmap with quarterly targets | Local AI infrastructure + business case templates |
| Conduct first antifragility maturity assessment | Consultant / Security | Baseline maturity score with gap analysis | Spreadsheet or existing GRC tool |
| Pilot organizational integration: embed security in one product team | Consultant / Engineering | Shift-left pilot metrics | Existing team structure + collaboration tools |
| **Deploy AI-assisted TVM operationalization** | AI / Security | AI TVM dashboard; <48h critical CVE response | Defender Exposure Management + Azure OpenAI or local LLM; see [AI-Assisted TVM Blueprint](ai-assisted-tvm.md) |
### Phase 4 Exit Criteria
- [ ] Exit architectures documented for top 5 vendor dependencies
- [ ] Chaos engineering operational in production
- [ ] Mean time to structural fix < 14 days from incident
- [ ] Defensive AI pilot operational
- [ ] First antifragility maturity assessment completed
- [ ] Quarterly antifragility review calendar established
### Phase 4 Mantra
> *"We do not want fewer incidents. We want incidents that teach us something we could not have learned any other way."*
---
## Governance and Cadence
### Weekly Steering Committee
- Review blockers and escalations
- Validate phase exit criteria
- Adjust scope based on organizational readiness
### Monthly Board Update
- Risk reduction metrics
- Antifragility maturity trend
- Investment vs. risk-exposure reduction
- Strategic narrative: "This is not a cost centre; it is optionality insurance"
### Quarterly Retrospective
- What failed that taught us something?
- What assumptions have been invalidated?
- What new dependencies have emerged?
- What can be simplified or removed?
---
## Success Metrics
| Dimension | Metric | Target |
|-----------|--------|--------|
| **Visibility** | % of assets in CMDB | 100% of T0/T1 within 30 days |
| **Control** | Mean time to contain a compromised identity | < 1 hour |
| **Sovereignty** | % of proprietary AI workloads local | 100% of T0-class within 90 days |
| **Resilience** | Recovery time for critical system | < 4 hours |
| **Learning** | Structural fixes per incident | ≥ 1 |
| **Optionality** | Vendor dependencies without exit plan | 0 |
---
## Adaptation Guide
### Small Organizations (< 100 employees)
- Compress Phases 1-2 into 30 days
- Use managed sovereign cloud for local AI instead of on-premises hardware
- Focus on identity, backup, and one high-value AI pilot
- Leverage Microsoft Business Premium or Google Workspace security features fully before any additional purchase
### Regulated Industries (Finance, Healthcare, Critical Infrastructure)
- Extend Phase 1 to 45 days for compliance mapping
- Integrate regulatory requirements into T0 classification
- Add compliance validation gates at each phase exit
### Highly Distributed Organizations
- Prioritize network segmentation and DNS security in Phase 1
- Deploy edge inference nodes in Phase 3 instead of central cluster
- Emphasize operational resilience and disconnected operations
### Organizations with Heavy Technical Debt
- Accept that 20 years of debt cannot be cleared in 180 days
- Use defensive AI in Phase 4 to accelerate debt identification and prioritization
- Focus on "kill chain" protection rather than comprehensive cleanup
- Map every action to CIS IG1 to show standards alignment without additional framework investment
---
*Next: [Implementation Playbook](implementation-playbook.md)*
*Previous: [T0 Asset Framework](../core/t0-asset-framework.md)*

# Zero-Budget Hardening Playbook
> *"The most expensive security tool is the one you already bought and never turned on."*
This playbook provides tactical guidance for hardening an enterprise's security posture using **existing tools, native platform capabilities, and open-source alternatives**. It is designed for consultants whose clients need to reduce technical debt and improve resilience without additional software procurement.
The philosophy is simple: **maximize current investment before discussing new investment**. This builds trust, demonstrates competence, and preserves optionality for strategic purchases later.
---
## The Underutilization Audit
Before proposing any new tool, conduct this audit. It typically reveals that the client already owns 60-80% of the capabilities they need.
### Microsoft-Centric Environments (Most Common)
> **Critical distinction**: Most of our clients own **E3**, not E5. The table below shows the E5 ideal; see [M365 E3 Hardening](m365-e3-hardening.md) for the pragmatic E3 reality.
| Capability | What E5 Includes | What E3 Includes | What Is Often Unused | Activation Effort |
|-----------|------------------|------------------|---------------------|-------------------|
| Endpoint Detection | Defender for Endpoint P2 (EDR, ASR) | Defender Antivirus only (no EDR) | Real-time protection, network protection | Low |
| SIEM / Log Analytics | Microsoft Sentinel | Log Analytics only (no Sentinel) | Basic KQL queries, log forwarding | Medium |
| Identity Protection | Entra ID P2 (PIM, conditional access, risk) | Entra ID Free (per-user MFA only) | Per-user MFA, basic audit | Low |
| Email Security | Defender for Office 365 P2 (Safe Links, Safe Attachments) | EOP only (basic anti-phishing) | Anti-malware, anti-spam tuning | Low |
| Data Protection | Microsoft Purview (DLP, labels) | None | N/A | N/A |
| Cloud Security | Microsoft Defender for Cloud | Basic Defender for Cloud (limited) | Secure score review | Low |
| PAM (Basic) | Entra ID PIM + LAPS | LAPS only (no PIM) | LAPS deployment | Low |
**E3 Strategy**: Maximize native E3 capabilities, augment with open-source tools (Wazuh, Sysmon), and selectively license add-ons for critical users rather than blanket E5 upgrades.
**The Pitch (E3 Clients)**:
> *"You own E3, not E5. That means we do not have EDR, conditional access, or advanced email filtering out of the box. But we do have solid foundations: antivirus, basic MFA, audit logging, and EOP. Our first job is to turn every E3 knob to maximum, then close the most dangerous gaps with free tools like Sysmon and Wazuh. If gaps remain that threaten your specific risk profile, we will size a selective upgrade—not a blanket one."*
### Multi-Cloud / Heterogeneous Environments
| Capability | Native Free/Cheap Options |
|-----------|--------------------------|
| Vulnerability scanning | AWS Inspector (basic), Azure Update Manager, Google OS Config |
| Configuration compliance | AWS Config (basic), Azure Policy, Google Organization Policy |
| Log aggregation | CloudWatch Logs, Azure Monitor Logs, Cloud Logging |
| Identity security | AWS IAM Access Analyzer, Azure AD Identity Protection, Google Cloud IAM Recommender |
| Network monitoring | VPC Flow Logs, Azure NSG Flow Logs, Google Cloud VPC Flow Logs |
| Cost anomaly detection | AWS Cost Anomaly Detection, Azure Cost Management, Google Cloud Billing Alerts |
### Open-Source Force Multipliers
When native capabilities are insufficient, these open-source tools can close gaps without license costs:
| Category | Tool | When to Use |
|----------|------|-------------|
| EDR / XDR | Wazuh | Need centralized endpoint visibility but no EDR budget |
| SIEM | Wazuh (again), Graylog, Grafana Loki | Need log analysis without commercial SIEM |
| Vulnerability Management | OpenVAS | Need scanning without commercial VM platform |
| Network Monitoring | Zeek, Suricata | Need IDS/IPS without commercial NDR |
| Asset Discovery | OpenLDAP scripts, Nmap, Masscan | Need network asset discovery |
| Threat Intelligence | MISP (open-source), AlienVault OTX | Need IOC sharing and correlation |
| Password Auditing | Hashcat, John the Ripper | Need to audit password strength internally |
| Backup Verification | Custom scripts (rsync, hash verification) | Need to validate backup integrity |
| Local AI Inference | Ollama, llama.cpp, vLLM | Need sovereign AI without API costs |
---
## The 30-Day Zero-Budget Sprint
This sprint assumes the client has a typical Microsoft-centric environment with E3 or E5 licensing. Adapt for other environments.
### Week 1: Turn On What You Own
> **Note for E3 clients**: Skip the ASR and advanced EDR steps below. E3 includes Defender Antivirus only. See [M365 E3 Hardening](m365-e3-hardening.md) for the E3-specific week 1 plan. The steps below assume E5 or Defender for Endpoint P2.
**Day 1-2: Microsoft Defender for Endpoint (E5 Only)**
- Verify onboarding coverage: what % of endpoints are reporting?
- Enable ASR rules in **Audit** mode (not block) to measure impact:
- Block executable content from email client and webmail
- Block JavaScript or VBScript from launching downloaded executable content
- Block Office applications from creating child processes
- Block Office applications from injecting code into other processes
- Block Adobe Reader from creating child processes
- Block persistence through WMI event subscription
- Enable exploit protection with default settings
- Enable network protection in **Audit** mode
**Day 3-4: Entra ID (Azure AD) Hardening**
- **E5 clients**: Enable security defaults **or** configure conditional access:
- Require MFA for all users, all cloud apps
- Block legacy authentication
- Require compliant or hybrid Azure AD joined device for admin roles
- Enable PIM for Global Administrator and other privileged roles
- **E3 clients**: Enable per-user MFA for all users (no conditional access available)
- Block legacy authentication tenant-wide
- Review and reduce standing admin assignments manually
- Document conditional access as a gap for steering committee
**Day 5: Email Security**
- **E5 clients**: Enable Safe Links and Safe Attachments for all recipients; configure anti-phishing policies with impersonation protection
- **E3 clients**: Tune EOP anti-phishing, anti-malware, and anti-spam to maximum aggression; configure impersonation protection in EOP; document Safe Links/Safe Attachments gap
- Enable mailbox auditing for all users (works in E3)
### Week 2: Visibility and Hygiene
**Day 6-7: Log Aggregation**
- Enable diagnostic settings for all Azure resources to Log Analytics
- Enable Microsoft 365 auditing
- If no Sentinel, use Log Analytics + KQL for basic querying
**Day 8-9: Identity Hygiene**
- Export all users, groups, and service principals
- Disable unused accounts (> 90 days inactive, no owner)
- Identify shared mailboxes with login capability and restrict
- Review enterprise applications (OAuth consents) and revoke suspicious grants
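The dormant-account sweep is a filter over a directory export. A sketch, assuming a hypothetical CSV layout (adapt the columns to your IAM tool's export):

```bash
#!/bin/sh
# Flag accounts with no sign-in since the cutoff; ownerless ones loudest.
cat > users-export.csv <<'EOF'
account,last_sign_in,owner
svc-legacy-app,2023-02-11,
j.smith,2025-01-08,j.smith
old-intern,2024-03-01,
EOF

CUTOFF="2024-12-31"   # in practice: today minus 90 days

# ISO dates compare correctly as strings
awk -F, -v cutoff="$CUTOFF" \
  'NR>1 && $2 < cutoff { print $1 (length($3) ? "" : " (no owner)") }' users-export.csv
```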
**Day 10: Secure Score Review**
- Review Microsoft Secure Score (Defender for Cloud + M365)
- Pick 5 improvements that require **no purchase**
- Execute them
### Week 3: Configuration and Control
**Day 11-12: Windows Defender Firewall**
- Enforce firewall on all profiles (domain, private, public)
- Enable logging for dropped packets
- Review and document any exceptions
**Day 13-14: LAPS (Local Administrator Password Solution)**
- Deploy LAPS via GPO or Intune
- Set unique random passwords for all local admin accounts
- Configure password expiration (30-60 days)
**Day 15: DNS Security**
- Enable DNS over HTTPS (DoH) on Windows 11 endpoints via Intune/GPO
- Configure DNS filtering (Quad9, Cloudflare for Teams free tier, or native Microsoft DNS security)
- Enable DNS query logging if infrastructure supports it
### Week 4: Validation and Documentation
**Day 16-17: Backup Verification**
- Inventory all backup jobs
- Select one non-critical system and perform test restore
- Document gaps in coverage or recovery time
**Day 18-19: External Perspective**
- Run basic external scan using free tools (Shodan search for your IP ranges, SSL Labs for public websites)
- Document exposed services and missing TLS configurations
**Day 20: Metrics and Reporting**
- Calculate "before and after" metrics:
- EDR coverage %
- MFA enrollment %
- Secure Score change
- Number of disabled unused accounts
- Number of ASR audit-mode triggers
- Present to stakeholders with cost: **$0 in new licensing**
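The before/after story almost writes itself once the counts are in one file. A sketch with placeholder numbers (all figures below are illustrative, not benchmarks):

```bash
#!/bin/sh
# Turn raw sprint counts into the deltas the stakeholder deck needs.
cat > sprint-metrics.csv <<'EOF'
metric,before,after
edr_coverage_pct,61,94
mfa_enrollment_pct,48,100
secure_score,42,58
disabled_stale_accounts,0,137
EOF

awk -F, 'NR>1 { printf "%-28s %6s -> %-6s (delta %+d)\n", $1, $2, $3, $3-$2 }' sprint-metrics.csv
```

The closing line of the presentation stays the same: every delta cost $0 in new licensing.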
---
## The 60-90 Day Extension: Configuration as Control
Once the initial sprint proves value, extend into structural improvements that require work but not purchase.
### Conditional Access Refinement
| Policy | Target | Risk Addressed |
|--------|--------|----------------|
| Require MFA from untrusted locations | All users | Credential stuffing, brute force from abroad |
| Require compliant device for sensitive apps | Finance, HR, Engineering | Data exfiltration from unmanaged devices |
| Block download from unmanaged devices | SharePoint, OneDrive | Shadow IT data leakage |
| Require password change on high user risk | All users | Compromised credential remediation |
### ASR Rules: From Audit to Block
After 30 days of audit-mode data:
- Review ASR rule hits
- Identify false positives and create exclusions
- Switch high-confidence rules to **Block** mode
- Monitor for 2 weeks, then iterate
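Triage of the audit-mode data can be done offline against an event export. A sketch, with a hypothetical export format (rule name, device, triggering process):

```bash
#!/bin/sh
# Rank ASR rules by audit-mode hit volume: low-volume rules with known-benign
# processes are Block candidates; noisy rules need exclusions first.
cat > asr-audit-events.csv <<'EOF'
rule,device,process
BlockOfficeChildProcess,PC-014,excel.exe
BlockOfficeChildProcess,PC-201,winword.exe
BlockWMIPersistence,SRV-02,wmiprvse.exe
BlockOfficeChildProcess,PC-014,excel.exe
EOF

awk -F, 'NR>1 { count[$1]++ } END { for (r in count) print count[r], r }' asr-audit-events.csv | sort -rn
```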
### Automated Response (No SOAR Required)
Use native platform automation:
| Platform | Native Automation | Use Case |
|----------|-------------------|----------|
| Microsoft | Logic Apps + Sentinel / Defender APIs | Auto-isolate high-risk device, auto-disable compromised account |
| AWS | EventBridge + Lambda | Auto-snapshot compromised EC2, auto-revoke suspicious IAM key |
| Azure | Logic Apps + Azure Monitor | Auto-scale compromised resource, auto-trigger runbook |
| Google Cloud | Cloud Functions + Cloud Monitoring | Auto-suspend suspicious service account |
These require no additional licensing—only development time.
---
## AI Sovereignty on Existing Hardware
Local AI does not require a $50,000 GPU cluster to start. Many organizations have underutilized servers or workstations that can run quantized models.
### Minimum Viable Local AI
| Component | Specification | Typical Source |
|-----------|--------------|----------------|
| CPU inference host | 8+ cores, 32GB+ RAM | Underutilized server, retired workstation |
| Storage | 100GB SSD for models and data | Existing SAN or local SSD |
| GPU (optional) | NVIDIA with 8GB+ VRAM for faster inference | Existing CAD/ML workstation |
| Software | Ollama or llama.cpp | Free, open-source |
| Model | Llama 3.1 8B or Mistral 7B (4-bit quantized) | Free download |
**Pilot Workflow**: Internal code review assistant or security log summarizer. These are low-risk, high-signal use cases that prove local AI viability without disrupting operations.
---
## Common Objections and Responses
| Objection | Response |
|-----------|----------|
| "We need a proper EDR, not Defender." | Defender for Endpoint is a Leader in Gartner Magic Quadrant. Most organizations have not enabled its advanced features. Let us turn those on first and measure. |
| "Open source is not enterprise-grade." | Zeek, Suricata, Wazuh, and Ollama are used by Fortune 500 companies and government agencies. The issue is not the tool; it is the expertise to run it. |
| "We don't have time to configure this." | Configuration is a one-time investment with perpetual returns. Buying a new tool also requires configuration—plus negotiation, procurement, and onboarding. |
| "Our auditor wants to see vendor support." | For audit evidence, native platform capabilities (Microsoft, AWS, Google) come with vendor backing. Open-source can be supplemented with commercial support if needed. |
| "The board wants us to buy something." | The board wants risk reduction. Show them risk reduction at zero incremental cost, and they will trust you when you later recommend strategic purchases. |
---
## The Consultant's Value Proposition
When you deliver zero-budget hardening, you demonstrate:
1. **Independence**: You are not here to sell software. You are here to solve problems.
2. **Competence**: You know how to extract value from complex platforms.
3. **Speed**: Visible improvement in 30 days builds momentum and political capital.
4. **Trust**: When you later recommend a purchase, it will be because the gap genuinely requires it—not because you have a quota.
### The Opening Pitch
> *"Before we talk about what to buy, let us talk about what you already own. In our experience, most organizations are utilizing less than 40% of their existing security capabilities. Our 30-day sprint will turn on, tune, and operationalize what you have already paid for. If there is still a gap after that, we will recommend the minimum viable purchase to close it."*
---
## Integration With Rapid Modernisation
The Zero-Budget Hardening Playbook maps directly onto the [Rapid Modernisation Plan](rapid-modernisation-plan.md):
| Rapid Modernisation Phase | Zero-Budget Focus |
|--------------------------|-------------------|
| Hygiene (Days 0-30) | Turn on existing EDR, enable MFA, configure conditional access, inventory identities |
| Control (Days 30-60) | ASR rules, LAPS, DNS security, log aggregation with existing tools |
| Sovereignty (Days 60-90) | Local AI on existing hardware, backup verification with existing solution |
| Antifragility (Days 90-180) | Open-source network monitoring, native automation, chaos engineering with free tools |
---
*Previous: [Rapid Modernisation Plan](rapid-modernisation-plan.md)*
*Next: [Implementation Playbook](implementation-playbook.md)*

# Zero-Budget Vulnerability Discovery
> *"Most organizations do not know what vulnerabilities they have because they have never looked. Not because Tenable is too expensive. Because nobody wrote a PowerShell script and ran it."*
This playbook provides practical, script-based methods for discovering vulnerabilities across Windows servers, Linux servers, containers, and network devices **without purchasing commercial vulnerability scanners** like Tenable, Qualys, or Rapid7. It is designed for the first sweep—the baseline discovery that proves value before any procurement discussion.
The approach is **agentless and authentication-based** where possible: we use existing administrative access (SSH, WinRM, RDP, Azure/AWS APIs) to collect inventory and correlate it with vulnerability data. No agents. No new licenses. Just scripts, open-source tools, and expertise.
---
## The Philosophy: Discovery Before Procurement
Before recommending Tenable, Qualys, or any commercial scanner, we prove that:
1. The client does not know their inventory
2. There are critical vulnerabilities that can be found with free tools
3. The commercial scanner will be worth the money—once we know what gaps it needs to fill
**The rule**: If a script run from a laptop finds 50 critical missing patches in 2 hours, the business case for a commercial scanner becomes trivial. The scanner is no longer a gamble. It is an operationalization of proven need.
---
## Method 1: Windows Server Enumeration (PowerShell)
Most Windows environments have at least partial administrative access. A PowerShell script run with domain admin or local admin credentials can enumerate the entire estate in hours.
### The Basic Script: What to Collect
```powershell
# Save as Get-ServerVulnBaseline.ps1
# Run from a management workstation with domain admin or appropriate privileges
$Computers = Get-ADComputer -Filter {OperatingSystem -like "*Server*"} | Select-Object -ExpandProperty Name
$Results = @()
foreach ($Computer in $Computers) {
try {
$Session = New-CimSession -ComputerName $Computer -OperationTimeoutSec 30
# OS Version and Build
$OS = Get-CimInstance -CimSession $Session -ClassName Win32_OperatingSystem
# Installed Hotfixes
$Hotfixes = Get-CimInstance -CimSession $Session -ClassName Win32_QuickFixEngineering |
Select-Object -ExpandProperty HotFixID
# Installed Software (Add/Remove Programs)
# Note: Win32_Product is slow and triggers MSI consistency checks on each query;
# acceptable for a one-off baseline, but prefer registry Uninstall keys for recurring scans
$Software = Get-CimInstance -CimSession $Session -ClassName Win32_Product |
Select-Object Name, Version, Vendor
# Windows Features / Roles
$Features = Get-WindowsFeature -ComputerName $Computer | Where-Object {$_.Installed} |
Select-Object -ExpandProperty Name
# Antivirus Status (the root\SecurityCenter2 namespace exists only on client SKUs;
# on Server SKUs, query Defender directly and verify third-party AV manually)
$AV = Get-CimInstance -CimSession $Session -Namespace "root\Microsoft\Windows\Defender" -ClassName MSFT_MpComputerStatus -ErrorAction SilentlyContinue
# Firewall Status
$Firewall = Get-NetFirewallProfile -CimSession $Session | Select-Object Name, Enabled
# Local Administrators (query the remote host, not this workstation)
$Admins = Invoke-Command -ComputerName $Computer -ScriptBlock { Get-LocalGroupMember -Group "Administrators" } -ErrorAction SilentlyContinue
$Results += [PSCustomObject]@{
ComputerName = $Computer
OSVersion = $OS.Caption
OSBuild = $OS.BuildNumber
LastBoot = $OS.LastBootUpTime
Hotfixes = ($Hotfixes -join ";")
SoftwareCount = $Software.Count
KeySoftware = (($Software | Where-Object {$_.Name -match "SQL|IIS|Exchange|SharePoint|Remote Desktop|Citrix"} | ForEach-Object {"$($_.Name)=$($_.Version)"}) -join ";")
Features = ($Features -join ";")
AVProduct = if ($AV -and $AV.AntivirusEnabled) { "Defender ($($AV.AMRunningMode))" } else { "None detected / third-party (verify manually)" }
FirewallEnabled = ($Firewall | Where-Object {$_.Enabled -eq $true}).Count
LocalAdmins = ($Admins | Measure-Object).Count
Reachable = $true
}
Remove-CimSession -CimSession $Session
}
catch {
$Results += [PSCustomObject]@{
ComputerName = $Computer
OSVersion = "Unreachable"
Reachable = $false
Error = $_.Exception.Message
}
}
}
$Results | Export-Csv -Path "ServerBaseline.csv" -NoTypeInformation
```
**What this produces in 30 minutes**:
- A CSV of every Windows Server with OS build, patches, software, roles, AV status, firewall status
- Immediate red flags: servers with no AV, no firewall, ancient OS builds, excessive local admins
- A hotfix list you can correlate against the Microsoft Security Update Guide (the successor to MSRC bulletins)
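That correlation does not need a tool for the first pass. A minimal sketch, assuming the column order from the PowerShell sweep above (`ComputerName,OSVersion,OSBuild,LastBoot,Hotfixes,...`) and a hypothetical `required-kbs.txt` containing one KB number per line:

```shell
#!/bin/bash
# check-kbs.sh — report servers whose Hotfixes column lacks any required KB.
# Assumes field 1 = ComputerName and field 5 = Hotfixes (semicolon-joined),
# as produced by the sweep script above.
check_kbs() {
  local csv="$1" required="$2"
  awk -F',' 'NR > 1 { gsub(/"/, ""); print $1 "\t" $5 }' "$csv" |
  while IFS=$'\t' read -r host hotfixes; do
    while read -r kb; do
      case ";$hotfixes;" in
        *";$kb;"*) ;;                        # KB present on this host
        *)         echo "$host missing $kb" ;;
      esac
    done < "$required"
  done
}
# Usage: check_kbs ServerBaseline.csv required-kbs.txt
if [ -f "${1:-}" ] && [ -f "${2:-}" ]; then
  check_kbs "$1" "$2"
fi
```

The output is one line per missing update, which pastes directly into the remediation tracker.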
### The OS Build Risk Filter
Once you have the CSV, filter for end-of-life or near-end-of-life OS builds:
| OS / Build | Status | Risk |
|-----------|--------|------|
| Windows Server 2008 R2 / 2012 R2 | End of life | Critical |
| Windows Server 2016 (Build 14393) | Extended support | High |
| Windows Server 2019 (Build 17763) | Active, but check patch level | Medium |
| Windows Server 2022 (Build 20348) | Current | Low |
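The table above can be applied mechanically to the baseline CSV. A minimal sketch, assuming ComputerName is field 1 and OSBuild is field 3 as exported in Method 1:

```shell
#!/bin/bash
# flag-eol-builds.sh — classify each server by OS build number per the risk table.
flag_eol() {
  awk -F',' 'NR > 1 {
    gsub(/"/, "")                            # strip CSV quoting
    build = $3 + 0
    if      (build > 0 && build <= 9600) risk = "CRITICAL-EOL"          # 2012 R2 or older
    else if (build == 14393)             risk = "HIGH-extended-support" # Server 2016
    else if (build == 17763)             risk = "MEDIUM-check-patches"  # Server 2019
    else if (build >= 20348)             risk = "LOW-current"           # Server 2022+
    else                                 risk = "REVIEW"
    print $1 "," build "," risk
  }' "$1"
}
if [ -f "${1:-ServerBaseline.csv}" ]; then
  flag_eol "${1:-ServerBaseline.csv}"
fi
```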
**The conversation**:
> *"We ran a script for 30 minutes and found 12 servers running operating systems that no longer receive security patches. Three of them are internet-facing. We do not need a €50,000 scanner to tell us that is a kill chain. We need it to track the remediation. But first, we fix these 12."*
---
## Method 2: Linux Server Enumeration (Bash / SSH)
For Linux estates, SSH-based enumeration is fast and requires no agents.
### The Basic Script
```bash
#!/bin/bash
# Save as linux-vuln-baseline.sh
# Run from a jump host with SSH key access to target servers
SERVERS=$(cat server-list.txt)
OUTPUT_DIR="./linux-baseline-$(date +%Y%m%d)"
mkdir -p $OUTPUT_DIR
for SERVER in $SERVERS; do
echo "Scanning $SERVER..."
ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no $SERVER "
echo '=== OS ==='
cat /etc/os-release
echo '=== KERNEL ==='
uname -r
echo '=== PACKAGES ==='
if command -v rpm >/dev/null; then rpm -qa --last; fi
if command -v dpkg >/dev/null; then dpkg -l; fi
if command -v apt >/dev/null; then apt list --installed 2>/dev/null; fi
echo '=== SERVICES ==='
systemctl list-units --type=service --state=running
echo '=== LISTENING PORTS ==='
ss -tlnp
echo '=== USERS WITH SHELL ==='
grep -E 'bash|sh|zsh' /etc/passwd
echo '=== SUDOERS ==='
cat /etc/sudoers 2>/dev/null | grep -v '^#' | grep -v '^$'
echo '=== SSH CONFIG ==='
grep -E 'PermitRootLogin|PasswordAuthentication|Port' /etc/ssh/sshd_config
" > "$OUTPUT_DIR/$SERVER.txt" 2>&1
done
echo "Results in $OUTPUT_DIR"
```
**What this produces**:
- Per-server files with OS, kernel, all installed packages, running services, listening ports, user accounts, SSH hardening
- Immediate red flags: password authentication enabled, root login permitted, ancient kernels, unnecessary services exposed
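Triage over those per-server files is itself scriptable. A minimal sketch that greps the two highest-signal SSH findings, assuming the per-server file layout written by the script above:

```shell
#!/bin/bash
# triage-linux-baseline.sh — flag the worst SSH findings in each per-server file.
triage() {
  local dir="$1"
  for f in "$dir"/*.txt; do
    [ -f "$f" ] || continue
    host=$(basename "$f" .txt)
    grep -qiE '^[[:space:]]*PermitRootLogin[[:space:]]+yes' "$f" && \
      echo "$host: SSH root login permitted"
    grep -qiE '^[[:space:]]*PasswordAuthentication[[:space:]]+yes' "$f" && \
      echo "$host: SSH password authentication enabled"
  done
  return 0
}
triage "${1:-./linux-baseline-$(date +%Y%m%d)}"
```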
### The Package-to-CVE Correlation
For the first sweep, you do not need a commercial correlator. Use open-source tools:
**Option A: Grype (recommended)**
```bash
# Install grype (single binary, no dependencies)
curl -sSfL https://raw.githubusercontent.com/anchore/grype/main/install.sh | sh -s -- -b /usr/local/bin
# On each server, generate SBOM and scan
syft packages dir:/ -o json > /tmp/sbom.json
grype sbom:/tmp/sbom.json -o table > /tmp/vulns.txt
```
**Option B: Distribution-native security tooling**
```bash
# For Debian: debsecan reports known CVEs against installed packages
apt-get install -y debsecan
debsecan --suite $(lsb_release -cs) --format summary
# For RHEL/CentOS: security metadata ships with yum/dnf
yum updateinfo list security
yum --security check-update
```
**What this produces**:
- A list of installed packages with known CVEs
- Severity ratings
- Whether fixes are available in the distribution repositories
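For a quick cross-host rollup, grype's JSON output (`.matches[].vulnerability.severity`) pipes straight into jq. A sketch, assuming grype was run with `-o json` per host into a hypothetical `vulns/` directory:

```shell
#!/bin/bash
# severity-rollup.sh — count findings per severity across grype JSON reports.
rollup() {
  jq -r '.matches[]?.vulnerability.severity' "$@" | sort | uniq -c | sort -rn
}
# Usage: rollup vulns/*.json
if ls vulns/*.json >/dev/null 2>&1; then
  rollup vulns/*.json
fi
```

The counts give the steering committee a one-line answer to "how bad is it" before anyone reads a single CVE.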
---
## Method 3: Container and Application SBOM
Modern environments run containers. Containers bundle vulnerabilities. SBOM + CVE correlation is the fastest way to find them.
### SBOM Generation
```bash
# Install Syft
curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
# Generate SBOMs from the images behind running containers
# (Syft scans image references, not container names)
mkdir -p sboms
docker ps --format "{{.Image}}" | sort -u | while read -r image; do
syft "$image" -o spdx-json > "sboms/$(echo "$image" | tr '/:' '_').json"
done
# Generate SBOM from container images in registry
# (Requires registry access; adapt for ACR, ECR, GCR, Harbor, etc.)
```
### CVE Scanning the SBOMs
```bash
# Install Grype
curl -sSfL https://raw.githubusercontent.com/anchore/grype/main/install.sh | sh -s -- -b /usr/local/bin
# Scan all SBOMs
for sbom in sboms/*.json; do
grype sbom:$sbom -o json > "vulns/$(basename $sbom .json)-vulns.json"
done
# Aggregate critical findings
jq -r '.matches[] | select(.vulnerability.severity == "Critical") | [.artifact.name, .artifact.version, .vulnerability.id, .vulnerability.severity] | @tsv' vulns/*.json | sort | uniq -c | sort -rn > critical-vulns.txt
```
**What this produces in 1 hour**:
- A complete inventory of every container's software components
- Every known CVE in those components
- Critical vulnerabilities ranked by frequency (if 15 containers have the same vulnerable log4j version, that is your top fix)
**The conversation**:
> *"We generated software bills of materials for your 40 running containers and found 340 known vulnerabilities. 12 are critical. Five of those critical vulnerabilities are in your customer-facing API container. We have the updated base image ready. No scanner purchase required."*
---
## Method 4: Network-Based Unauthenticated Scanning
When you cannot authenticate to every system, network scanning fills gaps.
### OpenVAS / Greenbone (Free)
Greenbone Community Edition is a full vulnerability scanner that requires only network access:
```bash
# Greenbone ships as a set of cooperating containers orchestrated by a
# docker-compose file published in the Community Containers documentation
# (docs.greenbone.net); download it, then:
docker compose -f docker-compose.yml up -d
# Web UI defaults to http://127.0.0.1:9392 — log in, create a target list, run a scan
# Produces: full vulnerability report with CVSS, CVE references, and remediation guidance
```
**Limitations**: The Community Feed contains fewer vulnerability tests than the Enterprise Feed and receives updates later. For client engagements that require full coverage, use Greenbone Cloud Service (pay-per-scan) or an Enterprise Feed subscription.
### Nmap Vulnerability Scripts
```bash
# Fast service discovery
nmap -sV -sC -O --top-ports 1000 -oA network-sweep $TARGET_NETWORK
# Vulnerability detection with NSE scripts
nmap --script vuln -p 21,22,23,25,53,80,110,135,139,143,443,445,993,995,1723,3306,3389,5900,8080 $TARGET_IP
# SMB vulnerability check (ETERNALBLUE, etc.)
nmap --script smb-vuln* -p 445 $TARGET_IP
# SSL/TLS weakness check
nmap --script ssl-enum-ciphers,ssl-heartbleed,ssl-poodle -p 443 $TARGET_IP
```
**What this produces**:
- Unauthenticated vulnerability findings
- Service versions that can be correlated with CVEs
- Network topology and unexpected exposed services
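The `-oA` prefix above also writes grepable output (`.gnmap`), which flattens into a host,port,service inventory with a few lines of awk. A sketch, assuming the standard gnmap `Ports:` field format:

```shell
#!/bin/bash
# gnmap-to-csv.sh — flatten nmap grepable output into host,port,service rows.
parse_gnmap() {
  awk -F'\t' '/Ports:/ {
    host = $1
    sub(/^Host: /, "", host); sub(/ .*/, "", host)
    for (i = 2; i <= NF; i++) {
      if ($i !~ /^Ports: /) continue
      n = split(substr($i, 8), p, ", ")
      for (j = 1; j <= n; j++) {
        split(p[j], f, "/")          # port/state/proto/owner/service/...
        if (f[2] == "open") print host "," f[1] "," f[5]
      }
    }
  }' "$1"
}
if [ -f "${1:-network-sweep.gnmap}" ]; then
  parse_gnmap "${1:-network-sweep.gnmap}"
fi
```

The resulting CSV joins cleanly with the server baseline from Method 1 to spot services running where no service should be.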
### ProjectDiscovery Stack (Modern, Fast, Free)
```bash
# Install (requires a recent Go toolchain; binaries land in $(go env GOPATH)/bin)
go install -v github.com/projectdiscovery/naabu/v2/cmd/naabu@latest
go install -v github.com/projectdiscovery/httpx/cmd/httpx@latest
go install -v github.com/projectdiscovery/nuclei/v3/cmd/nuclei@latest
go install -v github.com/owasp-amass/amass/v4/...@master
# Reconnaissance pipeline
# 1. Find live hosts
naabu -list targets.txt -o live-hosts.txt
# 2. Identify web services
httpx -list live-hosts.txt -o web-services.txt
# 3. Run vulnerability templates (10,000+ community templates)
nuclei -list web-services.txt -severity critical,high -o findings.txt
# 4. DNS enumeration
amass enum -d example.com -o dns-findings.txt
```
**What Nuclei produces**:
- Specific CVE detections (CVE-2024-XXXX)
- Misconfiguration findings (exposed .git, default credentials, open redirects)
- Technology fingerprinting
- All findings mapped to specific CVEs with remediation links
---
## Method 5: Osquery Cross-Platform Discovery (The Sovereign Method)
> *"Tenable is a rented microscope. osquery is a laboratory."*
For clients who want **owned visibility** rather than rented scanner reports, osquery is the most powerful zero-budget discovery method available. It is an open-source agent that exposes the operating system as a SQL database—Windows, Linux, and macOS.
### Why osquery Belongs Here
| Script-Based Discovery | osquery-Based Discovery |
|----------------------|------------------------|
| Point-in-time (run once, get a snapshot) | Continuous or scheduled (run every hour, every day) |
| Per-platform scripts (PowerShell for Windows, bash for Linux) | Single SQL query language across all platforms |
| Static output (CSV, text files) | Structured, queryable data you can ask follow-up questions of |
| Requires admin access every time | Agent enrolls once; queries run remotely via FleetDM |
| Hard to scale past 100 systems | Scales to 10,000+ endpoints with FleetDM control plane |
| Cannot detect runtime state (running processes, open ports in real time) | Real-time process, network, and configuration visibility |
### The 2-Hour osquery Proof of Concept
```bash
# Install osquery on a management workstation
# Windows: choco install osquery
# macOS: brew install osquery
# Ubuntu: apt install osquery (after adding the osquery apt repository)
# Run interactive discovery queries against the local system
# (For remote systems, copy the binary or use FleetDM enrollment)
# 1. Windows software inventory with vulnerability flagging
osqueryi "SELECT si.computer_name, p.name, p.version,
CASE WHEN p.name LIKE '%Adobe%' AND CAST(REPLACE(p.version, '.', '') AS INTEGER) < 2023000 THEN 'POTENTIALLY VULNERABLE' ELSE 'REVIEW' END AS status
FROM programs p CROSS JOIN system_info si;"
# 2. Linux listening ports with process attribution
osqueryi "SELECT si.hostname, lp.port, lp.protocol, p.name, p.path
FROM listening_ports lp LEFT JOIN processes p ON lp.pid = p.pid
CROSS JOIN system_info si WHERE lp.address NOT IN ('127.0.0.1', '::1');"
# 3. SSH daemon hardening check (Linux; the augeas table parses sshd_config —
#    osquery's ssh_configs table only covers per-user client configs)
osqueryi "SELECT label, value,
CASE WHEN label = 'PermitRootLogin' AND value = 'yes' THEN 'CRITICAL' ELSE 'OK' END AS risk
FROM augeas WHERE path = '/etc/ssh/sshd_config'
AND label IN ('PermitRootLogin', 'PasswordAuthentication', 'Port');"
# 4. End-of-life OS detection (all platforms)
osqueryi "SELECT si.hostname, os.name, os.version, os.build,
CASE
WHEN os.platform = 'windows' AND os.version LIKE '6.1%' THEN 'Windows 7/2008 R2 - EOL'
WHEN os.platform = 'windows' AND os.version LIKE '6.2%' THEN 'Windows 8/2012 - EOL'
WHEN os.platform = 'centos' AND os.version LIKE '7%' THEN 'CentOS 7 - EOL June 2024'
WHEN os.platform = 'ubuntu' AND os.version LIKE '18.04%' THEN 'Ubuntu 18.04 - EOL April 2023'
ELSE 'Check manually'
END AS eol_status
FROM os_version os CROSS JOIN system_info si;"
```
**What this produces in 2 hours**:
- Software inventory across all enrolled endpoints with version-based vulnerability flagging
- Real-time network exposure map (every listening port, every process)
- Configuration drift detection (firewall status, SSH hardening, encryption state)
- End-of-life operating system inventory
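For machine-readable results, `osqueryi --json` emits each result set as a JSON array of row objects, which feeds the same jq tooling used elsewhere in this playbook. A sketch (the live query runs only where osquery is installed; the jq stage is the same either way):

```shell
#!/bin/bash
# ports-to-csv.sh — flatten osqueryi --json output into CSV rows for triage.
ports_to_csv() {
  jq -r '.[] | [.port, .protocol, .address // ""] | @csv' "$1"
}
if command -v osqueryi >/dev/null; then
  osqueryi --json "SELECT port, protocol, address FROM listening_ports
                   WHERE address NOT IN ('127.0.0.1', '::1');" > /tmp/ports.json
  ports_to_csv /tmp/ports.json
fi
```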
### Scaling to the Estate: FleetDM (Free Tier)
FleetDM is the open-source management platform for osquery. The open-source tier is free to self-host:
```bash
# Deploy FleetDM in Docker for evaluation (15 minutes; see the Fleet docs for production deployment)
git clone https://github.com/fleetdm/fleet.git
cd fleet/tools/osquery
docker-compose up -d
# Enroll endpoints with a single command per host
# FleetDM provides live query capability: ask a question, get answers in seconds
```
**For the complete osquery blueprint**—including query packs for Windows, Linux, and macOS vulnerability discovery, compliance policies, CVE correlation pipeline, and the consultant's 5-day delivery model—see **[Osquery: The Sovereign Discovery Platform](osquery-custom-platform.md)**.
### When to Use osquery vs. Scripts
| Scenario | Use Scripts | Use osquery |
|----------|------------|-------------|
| One-time sweep of 20-50 servers | ✅ Fast, no installation | Overkill |
| Continuous monitoring of 200+ endpoints | ❌ Unsustainable | ✅ Designed for this |
| Client needs compliance dashboards | ❌ Ad-hoc reports | ✅ Built-in policy engine |
| Cross-platform environment (Windows + Linux + macOS) | ❌ Separate scripts | ✅ Single query language |
| Client wants to own the data and queries | ❌ Vendor-dependent | ✅ Full sovereignty |
---
## Method 6: Cloud-Native Discovery (No Agents)
For Azure / AWS / GCP environments, the cloud provider already has the data. You just need to query it.
### Azure
```powershell
# Azure VM inventory with OS info
Get-AzVM | Select-Object Name, ResourceGroupName, Location,
@{Name="OS";Expression={$_.StorageProfile.OsDisk.OsType}},
@{Name="ImagePublisher";Expression={$_.StorageProfile.ImageReference.Publisher}},
@{Name="ImageOffer";Expression={$_.StorageProfile.ImageReference.Offer}},
@{Name="ImageSKU";Expression={$_.StorageProfile.ImageReference.Sku}},
@{Name="ImageVersion";Expression={$_.StorageProfile.ImageReference.Version}}
# Azure Update Manager: which VMs are missing critical updates?
# (Assessment data surfaces in Azure Resource Graph; requires the Az.ResourceGraph module)
Search-AzGraph -Query @"
patchassessmentresources
| where type =~ 'microsoft.compute/virtualmachines/patchassessmentresults'
| project id, properties.availablePatchCountByClassification
"@
# Microsoft Defender for Cloud secure score (free tier; Az.Security module)
Get-AzSecuritySecureScore
```
### AWS
```bash
# EC2 instance inventory
aws ec2 describe-instances --query 'Reservations[].Instances[].[InstanceId,ImageId,PlatformDetails,InstanceType,LaunchTime,State.Name]' --output table
# Amazon Inspector v2 findings (if Inspector is enabled; usage-based pricing with a free trial)
aws inspector2 list-findings \
--filter-criteria '{"severity":[{"comparison":"EQUALS","value":"CRITICAL"}]}'
# Systems Manager patch compliance (if the SSM agent is installed)
aws ssm list-resource-compliance-summaries \
--filters Key=ComplianceType,Values=Patch,Type=EQUAL Key=Status,Values=NON_COMPLIANT,Type=EQUAL
```
### GCP
```bash
# VM inventory
gcloud compute instances list --format="table(name,zone,status,machineType,disks[0].licenses[0])"
# OS Config vulnerability reports (requires the OS Config agent and API; queried per zone)
gcloud compute os-config vulnerability-reports list --location=us-central1-a
```
**The conversation**:
> *"You already own Azure Update Manager and AWS Inspector Basic. They are free. You are not using them. Before we discuss Tenable, let us turn on the vulnerability discovery tools you already pay for as part of your cloud subscription."*
---
## Method 7: The SBOM-to-CVE Pipeline (Your Brainstorm, Implemented)
You mentioned SBOM collection and CVE validation. Here is a lightweight, zero-cost pipeline:
### Architecture
```
[Target System] → [Syft SBOM Generator] → [Grype CVE Scanner] → [Local AI Prioritizer] → [Executive Brief]
```
### Step-by-Step
```bash
#!/bin/bash
# zero-budget-tvm-pipeline.sh
# Run this from a management host with SSH/WinRM access to the estate
mkdir -p sboms vulns reports
# 1. COLLECT: Generate SBOMs from accessible systems
# Windows (via PowerShell remoting)
pwsh -c "
\$Servers = Get-ADComputer -Filter {OperatingSystem -like '*Server*'}
foreach (\$s in \$Servers) {
# Use Syft's Windows binary if available, or fall back to registry enumeration.
# Return objects over the remoting channel so results land on the management
# host instead of being stranded in C:\tmp on each server.
Invoke-Command -ComputerName \$s.Name -ScriptBlock {
Get-ItemProperty 'HKLM:\\Software\\Microsoft\\Windows\\CurrentVersion\\Uninstall\\*' |
Select-Object DisplayName, DisplayVersion, Publisher
} -ErrorAction SilentlyContinue | Export-Csv \"sboms/\$(\$s.Name)-software.csv\" -NoTypeInformation
}
"
# Linux (via SSH)
for server in $(cat linux-servers.txt); do
ssh $server "
if command -v syft >/dev/null; then
syft dir:/ -o spdx-json
else
# Fallback: package manager output
if command -v rpm >/dev/null; then rpm -qa; fi
if command -v dpkg >/dev/null; then dpkg -l; fi
fi
" > "sboms/${server}.json" 2>/dev/null &
done
wait
# 2. SCAN: Correlate with CVE database
for sbom in sboms/*.json; do
if command -v grype >/dev/null; then
grype sbom:$sbom -o json > "vulns/$(basename $sbom .json)-vulns.json"
fi
done
# 3. PRIORITIZE: Extract critical/high, aggregate
jq -s '
[.[] | .matches[]? | select(.vulnerability.severity == "Critical" or .vulnerability.severity == "High") |
{cve: .vulnerability.id, severity: .vulnerability.severity, package: .artifact.name, version: .artifact.version}]
| group_by(.cve)
| map({cve: .[0].cve, severity: .[0].severity, package: .[0].package, affected_systems: length})
| sort_by(.affected_systems)
| reverse
' vulns/*.json > reports/aggregated-vulns.json
# 4. REPORT: Generate human-readable summary (unquoted heredoc so $(date) expands)
cat > reports/executive-summary.md << EOF
# Vulnerability Discovery Report
## Generated: $(date)
### Top Findings
EOF
jq -r '.[:20] | .[] | "- **\(.cve)** (\(.severity)): \(.package) — \(.affected_systems) systems affected"' reports/aggregated-vulns.json >> reports/executive-summary.md
echo "Report complete: reports/executive-summary.md"
```
**What this produces in 2-4 hours**:
- SBOMs for all accessible systems
- CVE correlation for every software component
- Aggregation: "CVE-2024-XXXX affects 23 of your servers"
- Executive summary: top 20 findings in Markdown
---
## The First Sweep Protocol
When you walk into a client with no vulnerability management program, run this sequence:
### Day 1: Discovery
| Hour | Activity | Tools |
|------|----------|-------|
| 0-1 | Identify scan targets from AD, Azure, AWS, or network range | Active Directory, cloud consoles |
| 1-3 | Run Windows PowerShell enumeration script | PowerShell, CIM sessions |
| 3-5 | Run Linux SSH enumeration script | Bash, SSH |
| 5-6 | Run network scan (Nmap + Nuclei) on external perimeter | Nmap, Nuclei |
| 6-8 | Generate container SBOMs and scan with Grype | Syft, Grype, Docker |
### Day 2: Correlation
| Hour | Activity | Tools |
|------|----------|-------|
| 0-2 | Correlate OS builds with Microsoft end-of-life list | Manual / spreadsheet |
| 2-4 | Correlate Linux packages with CVE database | Grype, vulners |
| 4-6 | Aggregate findings: top 20 vulnerabilities by frequency and severity | jq, Excel |
| 6-8 | Validate top 5 findings manually (exploitability check) | Nuclei, manual research |
### Day 3: Presentation
| Hour | Activity | Output |
|------|----------|--------|
| 0-2 | Create one-page executive summary | Markdown / PowerPoint |
| 2-4 | Present to steering committee: "Here is what we found in 48 hours with scripts" | Meeting |
| 4-6 | Discuss: what is the remediation path? What tools do we need to sustain this? | Roadmap |
**The conversation at Day 3**:
> *"In 48 hours, using only scripts and free tools, we found 340 known vulnerabilities across your estate. 23 are critical. Five of those are on internet-facing systems. Three are on end-of-life operating systems that cannot be patched. We can fix the patchable ones in two weeks. The unpatchable ones require architecture decisions. Here is the evidence. Now we can have an honest conversation about whether Tenable is worth the investment—or whether we build this capability with open-source tooling first."*
---
## When to Recommend Commercial Scanners
After the first sweep, you will know whether the client needs a commercial scanner:
| Scenario | Recommendation |
|----------|---------------|
| First sweep found <100 vulns, mostly patchable | **Do not buy Tenable yet.** Use scripts + cloud-native scanning + Intune/WSUS/SCCM for 6 months. Reassess. |
| First sweep found 100-500 vulns, client wants continuous visibility | **Deploy osquery + FleetDM first.** Provides owned, continuous monitoring for a fraction of scanner cost. Reassess in 6 months. |
| First sweep found 500+ vulns, heterogeneous estate | **Consider Tenable or Qualys** for continuous scanning and compliance reporting. Scripts cannot sustain at this scale. osquery can supplement for real-time data. |
| Client needs compliance evidence (PCI, ISO 27001, SOC 2) | **Commercial scanner required.** Auditors want vendor-validated scan reports, not scripts. |
| Client has OT/IoT/embedded devices | **Specialized scanner required.** Traditional tools do not speak Modbus, BACnet, or proprietary protocols. |
| Client wants continuous attack surface monitoring | **Consider Tenable.asm, Cortex Xpanse, or Mandiant ASM.** Script-based discovery is point-in-time. |
---
## Honest Limitations
| What Script-Based Discovery Does Well | What It Cannot Do |
|--------------------------------------|-------------------|
| Finds missing patches and known CVEs | Cannot find zero-days or configuration logic flaws |
| Maps software inventory accurately | Cannot assess business impact without human context |
| Identifies end-of-life systems | Cannot provide the compliance audit trail auditors demand |
| Generates SBOMs for containers | Cannot scan air-gapped or offline systems without physical access |
| Costs zero in licensing | Requires administrative access (SSH/WinRM/domain admin) |
| Produces evidence fast | Requires technical expertise to interpret and act on findings |
**Note**: osquery addresses several script-based limitations: it enables continuous monitoring, scales to thousands of endpoints via FleetDM, and provides real-time process/network visibility. The trade-off is agent deployment and query maintenance. See [Osquery: The Sovereign Discovery Platform](osquery-custom-platform.md).
---
## Integration With AI-Assisted TVM
The output of zero-budget discovery feeds directly into the AI-assisted TVM prioritization engine:
```
[zero-budget discovery] → Raw vulnerability data + SBOMs + OS inventories
[AI Prioritization] → Exploitability prediction + asset criticality + threat intel correlation
[Remediation Pipeline] → AI-generated scripts → human validation → deployment → validation
[Continuous Monitoring] → Re-scan → drift detection → quarterly purple team exercise
```
**The retained value**: Even if the client later buys Tenable, the SBOM pipeline and container scanning remain valuable. Tenable does not generate SBOMs as cleanly as Syft. Tenable does not scan containers as natively as Grype. The open-source stack **complements** the commercial scanner, it does not replace it.
---
*For the sovereign discovery platform built on osquery, see [Osquery: The Sovereign Discovery Platform](osquery-custom-platform.md).*
*For the AI-assisted TVM prioritization layer, see [AI-Assisted TVM Blueprint](ai-assisted-tvm.md).*
*For the perimeter scanning strategy, see [Perimeter Scanning Capability](perimeter-scanning-capability.md).*
*For the business case including tool costs, see [Business Case Template](business-case-template.md).*
# CIS Controls v8 Mapping
> *"CIS IG1 is 56 safeguards that every organization must implement. It is not aspirational. It is the floor."*
This document maps the [Rapid Modernisation Plan](../playbooks/rapid-modernisation-plan.md) and the antifragile workstreams to CIS Controls v8 Implementation Groups. The goal is to show clients that antifragile hardening is not an alternative to standards—it is the fastest path to meeting them while building real resilience.
---
## Implementation Group 1 (IG1): The Minimum Viable Posture
IG1 is the **safeguards that every organization should implement to protect against common, known threats**. We treat IG1 as a non-negotiable 90-day target. Most organizations can achieve IG1 primarily through **configuration of existing tools** rather than new procurement.
### Control 1: Inventory and Control of Enterprise Assets
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | Active Directory / cloud IAM census | Existing identity provider |
| Hygiene (Days 0-30) | CMDB seeding with T0/T1 assets | Existing ITAM or spreadsheet |
| Control (Days 30-60) | Automated discovery of new assets | Existing EDR or NAC |
**Antifragile Angle**: You cannot defend what you cannot see. But inventory without ownership is just a list. Every asset in the CMDB must have an owner, a criticality rating, and a dependency map.
### Control 2: Inventory and Control of Software Assets
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | Software inventory via EDR or SCCM | Existing endpoint management |
| Hygiene (Days 0-30) | Unauthorized software detection | Existing EDR |
| Sovereignty (Days 60-90) | AI tool inventory and shadow AI discovery | Proxy logs + interviews |
**Antifragile Angle**: Software inventory is not about license compliance. It is about understanding your **attack surface**. Every unauthorized application is a potential path for an adversary.
### Control 3: Data Protection
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | Data classification by criticality | Manual + existing DLP if available |
| Sovereignty (Days 60-90) | Ensure proprietary AI data never leaves perimeter | Local AI infrastructure |
| Antifragility (Days 90-180) | Automated data loss prevention | Existing CASB or DLP |
**Antifragile Angle**: Data protection is not encryption at rest. It is **ensuring your proprietary signal does not train your competitor's model**. Local AI is a data protection control.
### Control 4: Secure Configuration of Enterprise Assets and Software
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Control (Days 30-60) | ASR rule deployment on endpoints | Microsoft Defender (often already owned) |
| Control (Days 30-60) | Secure baseline for cloud resources | Azure Policy / AWS Config / GCP Org Policy |
| Antifragility (Days 90-180) | Automated drift detection and remediation | Existing configuration management |
**Antifragile Angle**: Secure configuration is not a project. It is a **continuous state**. Every deviation from baseline is a fragility. Automate the detection and remediation of drift.
### Control 5: Account Management
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | Identity census and orphan elimination | Existing AD / IAM |
| Hygiene (Days 0-30) | Privileged account inventory and rotation | Existing AD / IAM + PAM if owned |
| Control (Days 30-60) | JIT elevation and PAW deployment | Existing PAM or native tools (PIM, AWS IAM Identity Center) |
**Antifragile Angle**: Account management is not about password complexity. It is about **reducing the number of keys that can unlock the kingdom**. Every account is a latent failure mode.
### Control 6: Access Control Management
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Control (Days 30-60) | Least-privilege review across platforms | Existing IAM + manual review |
| Control (Days 30-60) | Conditional access policies | Entra ID / Okta / native cloud IAM |
| Antifragility (Days 90-180) | Automated access reviews and revocation | Existing IAM or GRC tool |
**Antifragile Angle**: Access control is not about denying access. It is about **ensuring every allowed access is known, justified, and temporary**.
### Control 7: Continuous Vulnerability Management
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | External vulnerability scanning | Open-source or existing scanner |
| Control (Days 30-60) | Internal vulnerability scanning | Existing scanner or EDR-integrated |
| Antifragility (Days 90-180) | Risk-based prioritization and SLA | Existing vulnerability management platform |
**Antifragile Angle**: Vulnerability management is not about scanning everything. It is about **finding the shortest path to compromise and closing it first**.
### Control 8: Audit Log Management
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | Centralized log aggregation for critical systems | Existing SIEM or syslog server |
| Control (Days 30-60) | Log integrity protection | Existing SIEM or file integrity monitoring |
| Antifragility (Days 90-180) | Automated log analysis and anomaly detection | Existing SIEM or local AI pilot |
**Antifragile Angle**: Logs are not compliance artifacts. They are **the raw material of organizational memory**. If an attacker deletes your logs, they delete your ability to learn.
### Control 9: Email and Web Browser Protections
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Control (Days 30-60) | Anti-phishing and safe links | Microsoft Defender for O365 (often already owned) |
| Control (Days 30-60) | Browser isolation or hardening | Existing endpoint management |
**Antifragile Angle**: Email is the primary initial access vector for most adversaries. Hardening it is not optional. Fortunately, most organizations already own the tools to do so.
### Control 10: Malware Defenses
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | EDR deployment and coverage validation | Existing EDR |
| Control (Days 30-60) | ASR rules and exploit protection | Microsoft Defender (often already owned) |
| Antifragility (Days 90-180) | Behavioral detection tuning | Existing EDR |
**Antifragile Angle**: Malware defence is not about signature updates. It is about **behavioural visibility**: can you see anomalous process execution, lateral movement, and data staging?
### Control 11: Data Recovery
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | Backup coverage inventory | Existing backup solution |
| Sovereignty (Days 60-90) | Recovery drill: one critical system | Existing backup solution |
| Antifragility (Days 90-180) | Automated backup verification and recovery testing | Existing backup solution + scripting |
**Antifragile Angle**: Backups that have not been restored are **theological constructs**. They require faith, not evidence. We test.
### Control 12: Network Infrastructure Management
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | Network diagram and firewall rule audit | Existing firewall management |
| Control (Days 30-60) | DNS security and network segmentation | Existing DNS and firewall infrastructure |
| Antifragility (Days 90-180) | Automated network policy validation | Existing configuration management |
**Antifragile Angle**: Network infrastructure is not about speed. It is about **containment**: when one segment fails, how many others can you save?
### Control 13: Network Monitoring and Defense
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Control (Days 30-60) | Network sensor deployment at critical boundaries | Existing IDS/IPS or open-source Zeek/Suricata |
| Antifragility (Days 90-180) | Automated threat detection and response | Existing SIEM + SOAR or scripted response |
**Antifragile Angle**: Network monitoring is not about catching everything. It is about **detecting the anomaly that matters before it becomes the incident that kills you**.
### Control 14: Security Awareness and Skills Training
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Control (Days 30-60) | Phishing simulation and targeted training | Existing security awareness platform |
| Antifragility (Days 90-180) | Security champions program | No tool required—organizational design |
**Antifragile Angle**: Awareness is not about compliance videos. It is about **building a human sensor network** that reports anomalies faster than any technology.
### Control 15: Service Provider Management
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | Vendor access audit and inventory | Manual + existing IAM |
| Control (Days 30-60) | Supplier access lockdown and time-bounding | Existing PAM or IAM |
| Sovereignty (Days 60-90) | AI vendor risk assessment and exit planning | Manual + legal review |
**Antifragile Angle**: Supplier management is not about contracts. It is about **ensuring your suppliers cannot become your single point of failure**.
### Control 16: Application Software Security
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Sovereignty (Days 60-90) | AI-assisted code review pilot | Local AI on existing hardware |
| Antifragility (Days 90-180) | SAST/DAST integration into CI/CD | Existing DevOps tooling |
**Antifragile Angle**: Application security is not about finding every bug. It is about **making the development pipeline inhospitable to entire classes of vulnerabilities**.
### Control 17: Incident Response Management
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | IR contact list and escalation paths | Manual + existing ticketing |
| Sovereignty (Days 60-90) | AI-specific incident response runbook | Manual + existing IR framework |
| Antifragility (Days 90-180) | Automated containment playbooks | Existing SOAR or scripted response |
**Antifragile Angle**: Incident response is not about playbooks. It is about **the speed at which you convert an incident into a structural improvement**.
### Control 18: Penetration Testing
| Rapid Modernisation Phase | Action | Typical Tool Investment |
|--------------------------|--------|------------------------|
| Antifragility (Days 90-180) | Red team engagement or adversarial simulation | External provider or internal team |
| Antifragility (Days 90-180) | Continuous purple team exercises | Existing EDR + internal team |
**Antifragile Angle**: Penetration testing is not a compliance checkbox. It is **controlled failure that teaches you where your kill chain lives**.
---
## IG2 and IG3: The Antifragile Extension
We do not stop at IG1. IG2 and IG3 are implemented selectively based on the organization's kill chain and risk profile:
| IG | When We Pursue It | How We Fund It |
|----|-------------------|----------------|
| IG1 | Always. Non-negotiable 90-day target. | Primarily existing tool configuration |
| IG2 | When the organization processes sensitive data or faces targeted threats. | Reallocated savings from IG1 efficiency |
| IG3 | When the organization is critical infrastructure or faces advanced persistent threats. | Strategic security investment, justified by kill chain analysis |
---
## The IG1-as-Foundation Pitch
> *"CIS IG1 is 56 safeguards. Most organizations we assess have implemented fewer than 20. We are not suggesting you buy 36 new products. We are suggesting you configure what you already own to meet the minimum viable security posture. This is not a procurement project. It is a configuration project. And we can prove value in the first 30 days."*
---
*Next: [NIST CSF Mapping](nist-csf-mapping.md)*
*Previous: [Move Fast and Fix Things](../core/move-fast-and-fix-things.md)*

# NIST Cybersecurity Framework 2.0 Mapping
> *"The CSF is not a checklist. It is a language for talking about risk. We speak it fluently, but we never let it slow us down."*
This document maps the antifragile rapid modernisation approach to the NIST Cybersecurity Framework (CSF) 2.0 functions. It is designed for consultants who must bridge the gap between operational speed and regulatory or stakeholder expectations.
---
## The Six Functions
NIST CSF 2.0 organizes cybersecurity outcomes into six functions: **GOVERN, IDENTIFY, PROTECT, DETECT, RESPOND, RECOVER**. The antifragile approach treats GOVERN as the missing keystone in most organizations and emphasizes continuous learning across all functions.
### GOVERN
**NIST Definition**: Establish and monitor the organization's cybersecurity risk management strategy, expectations, and policy.
**The Gap**: Most organizations have policies. Few have governance that is **alive**—updated by incidents, informed by stress, and capable of adaptation.
**Antifragile Expression**:
| Rapid Modernisation Phase | Action | Existing Tool Leverage |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | Establish kill chain risk register | Spreadsheet or existing GRC tool |
| Hygiene (Days 0-30) | Define T0 asset classification policy | Manual + existing asset management |
| Control (Days 30-60) | Integrate security into change management | Existing ITSM (ServiceNow, Jira, etc.) |
| Antifragility (Days 90-180) | Quarterly governance review tied to incident learning | Existing meeting cadence + decision log |
**Key Principle**: Governance is not a document. It is a **feedback loop** between risk, decision, action, and learning.
### IDENTIFY
**NIST Definition**: Understand the organization's current cybersecurity risks.
**The Gap**: Organizations often know their assets but not their **dependencies**. They know their vulnerabilities but not their **kill chain**.
**Antifragile Expression**:
| Rapid Modernisation Phase | Action | Existing Tool Leverage |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | Asset inventory with dependency mapping | Existing AD, EDR, cloud IAM |
| Hygiene (Days 0-30) | External attack surface enumeration | Open-source tools + existing vulnerability scanner |
| Control (Days 30-60) | Vendor and supplier dependency mapping | Existing procurement + IAM data |
| Sovereignty (Days 60-90) | AI usage and data flow discovery | Proxy logs + interviews |
**Key Principle**: Identification is not about completeness. It is about **finding the shortest path to failure and illuminating it**.
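The "shortest path to failure" can be made concrete with a simple graph search over asset dependencies. The sketch below is illustrative: the node names and edges are hypothetical, and a real engagement would build the graph from AD, EDR, and cloud IAM data as described above.

```python
from collections import deque

# Hypothetical dependency graph: an edge A -> B means "compromise of A enables B".
# Node names are illustrative, not from a real environment.
edges = {
    "internet": ["vpn-gateway", "mail-gateway"],
    "vpn-gateway": ["jump-host"],
    "mail-gateway": ["workstation"],
    "workstation": ["file-server", "jump-host"],
    "jump-host": ["domain-controller"],
    "file-server": [],
    "domain-controller": ["core-erp"],  # the T0 asset
    "core-erp": [],
}

def shortest_kill_chain(graph, start, target):
    """Breadth-first search for the shortest attacker path from entry point to T0 asset."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for nxt in graph.get(path[-1], []):
            if nxt == target:
                return path + [nxt]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path: the T0 asset is unreachable from this entry point

print(shortest_kill_chain(edges, "internet", "core-erp"))
```

The returned path is the kill chain to illuminate first; everything not on a short path to a T0 asset is, by this logic, a lower priority.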
### PROTECT
**NIST Definition**: Use safeguards to prevent or reduce cybersecurity risk.
**The Gap**: Protection is often equated with purchasing. We equate it with **configuration, reduction, and ownership**.
**Antifragile Expression**:
| Rapid Modernisation Phase | Action | Existing Tool Leverage |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | Identity hardening: disable, rotate, enforce hygiene | Existing AD / IAM |
| Control (Days 30-60) | ASR, MFA, conditional access, PAWs | Microsoft Defender / Entra ID (often already owned) |
| Control (Days 30-60) | Network segmentation and DNS security | Existing firewall and DNS infrastructure |
| Sovereignty (Days 60-90) | Local AI deployment with T0 controls | Existing server hardware or sovereign cloud |
| Antifragility (Days 90-180) | Chaos engineering and graceful degradation | Existing infrastructure + open-source tools |
**Key Principle**: The best protection is not a thicker wall. It is **reducing the attack surface that the wall must defend**.
### DETECT
**NIST Definition**: Find and analyze possible cybersecurity attacks and compromises.
**The Gap**: Detection is often about alert volume. We focus on **signal quality** and the speed of conversion from anomaly to understanding.
**Antifragile Expression**:
| Rapid Modernisation Phase | Action | Existing Tool Leverage |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | Centralized logging for critical systems | Existing SIEM or syslog infrastructure |
| Control (Days 30-60) | EDR behavioural detection tuning | Existing EDR |
| Control (Days 30-60) | Network anomaly detection at boundaries | Existing IDS/IPS or Zeek/Suricata |
| Antifragility (Days 90-180) | AI-assisted log analysis and threat hunting | Local AI pilot on proprietary data |
**Key Principle**: Detection is not about seeing everything. It is about **seeing the thing that matters before it becomes the thing that kills you**.
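One minimal expression of "signal quality over alert volume" is rarity scoring: rank event types so the scarce anomaly surfaces above routine noise. The event names and counts below are hypothetical, and this is a sketch of the idea rather than a production detection pipeline.

```python
from collections import Counter
import math

# Illustrative event stream: the rare event is the one worth an analyst's time.
events = (
    ["logon-success"] * 900
    + ["logon-failure"] * 80
    + ["service-install"] * 2  # rare: likely the signal that matters
)

counts = Counter(events)
total = len(events)

def rarity_score(event):
    """Negative log frequency: higher score = rarer event = higher priority."""
    return -math.log(counts[event] / total)

ranked = sorted(counts, key=rarity_score, reverse=True)
print(ranked[0])  # the rarest event type
```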
### RESPOND
**NIST Definition**: Take action regarding a detected cybersecurity incident.
**The Gap**: Response is often reactive and manual. We build **pre-positioned capability** that activates faster than human coordination.
**Antifragile Expression**:
| Rapid Modernisation Phase | Action | Existing Tool Leverage |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | IR contact matrix and escalation paths | Existing communication tools |
| Control (Days 30-60) | Automated containment for high-confidence alerts | Existing SOAR or scripted playbooks |
| Sovereignty (Days 60-90) | AI-specific incident response runbooks | Existing IR framework + local knowledge |
| Antifragility (Days 90-180) | Red team validation of response speed | Internal or external red team |
**Key Principle**: Response is not about heroics. It is about **the mean time between detection and containment approaching zero**.
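The detection-to-containment interval this principle targets is directly measurable. A minimal sketch, using illustrative incident records:

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident log: detection and containment timestamps per incident.
incidents = [
    {"detected": datetime(2025, 1, 3, 9, 0),   "contained": datetime(2025, 1, 3, 9, 42)},
    {"detected": datetime(2025, 2, 7, 14, 5),  "contained": datetime(2025, 2, 7, 14, 25)},
    {"detected": datetime(2025, 3, 1, 22, 30), "contained": datetime(2025, 3, 1, 23, 10)},
]

def mean_time_to_contain(records):
    """Mean detection-to-containment interval across incidents."""
    gaps = [(r["contained"] - r["detected"]).total_seconds() for r in records]
    return timedelta(seconds=mean(gaps))

print(mean_time_to_contain(incidents))  # 0:34:00 for this sample
```

Tracked per quarter, this single number shows whether automated containment is actually driving the gap toward zero.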
### RECOVER
**NIST Definition**: Restore assets and operations affected by cybersecurity incidents.
**The Gap**: Recovery is often theoretical. Backups exist but have never been tested. Runbooks exist but have never been executed.
**Antifragile Expression**:
| Rapid Modernisation Phase | Action | Existing Tool Leverage |
|--------------------------|--------|------------------------|
| Hygiene (Days 0-30) | Backup coverage inventory and gap analysis | Existing backup solution |
| Sovereignty (Days 60-90) | Live recovery drill: one critical system | Existing backup solution |
| Antifragility (Days 90-180) | Quarterly recovery drills with automation | Existing backup + orchestration scripts |
| Antifragility (Days 90-180) | Chaos engineering: simulate infrastructure failure | Existing infrastructure + open-source tools |
**Key Principle**: Recovery is not about having backups. It is about **knowing—provably—that you can rebuild faster than your adversary can destroy**.
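"Knowing provably" means the recovery drill ends with verification, not a screenshot. A minimal sketch of post-restore integrity checking, with illustrative file names and contents standing in for a real backup set:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large backups don't need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "ledger.db"
    original.write_bytes(b"critical transaction records")
    recorded_hash = sha256_of(original)          # captured at backup time

    restored = Path(tmp) / "ledger_restored.db"  # produced by the restore drill
    restored.write_bytes(b"critical transaction records")

    assert sha256_of(restored) == recorded_hash, "restore drill failed verification"
    print("restore verified")
```

Recording the hash at backup time and comparing after restore turns "we have backups" into evidence that the drill actually reproduced the system.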
---
## The Antifragile CSF Profile
A CSF Profile describes the organization's current and target state. The antifragile profile is distinctive:
| Function | Typical Organization | Antifragile Organization |
|----------|---------------------|-------------------------|
| **GOVERN** | Annual policy review | Continuous governance updated by every incident |
| **IDENTIFY** | Asset inventory updated quarterly | Real-time dependency mapping with kill chain focus |
| **PROTECT** | Layered defenses purchased annually | Reduced attack surface through ownership and decoupling |
| **DETECT** | SIEM with thousands of daily alerts | High-signal detection with AI-assisted analysis |
| **RESPOND** | Incident response plan in a binder | Automated containment with human oversight |
| **RECOVER** | Backups with annual test | Quarterly validated recovery with chaos engineering |
---
## Communicating to Auditors and Regulators
When auditors ask how the antifragile approach maps to "accepted frameworks":
> *"Our approach is fully aligned with NIST CSF 2.0. We emphasize GOVERN as the enabling function and integrate continuous learning across IDENTIFY, PROTECT, DETECT, RESPOND, and RECOVER. Our 180-day roadmap delivers measurable maturity improvement against every CSF function, with evidence produced at each phase gate."*
**Evidence Package per Phase**:
| Phase | CSF Functions Addressed | Evidence Produced |
|-------|------------------------|-------------------|
| Hygiene (0-30 days) | GOVERN, IDENTIFY | Asset inventory, risk register, kill chain analysis |
| Control (30-60 days) | PROTECT, DETECT | Configuration baselines, detection rule effectiveness, MFA coverage |
| Sovereignty (60-90 days) | PROTECT, GOVERN | Local AI deployment evidence, vendor risk assessments, recovery drill results |
| Antifragility (90-180 days) | All six | Chaos experiment reports, structural fix metrics, maturity assessment |
---
## Crosswalk: NIST CSF ↔ CIS Controls ↔ Antifragile Actions
| NIST CSF Function | CIS Controls v8 | Antifragile Action |
|-------------------|-----------------|-------------------|
| GOVERN | Control 1, 2 (governance integration) | Kill chain risk register, T0 classification |
| IDENTIFY | Control 1, 2, 7 | Asset census, dependency mapping, shadow AI discovery |
| PROTECT | Control 4, 5, 6, 9, 10, 11, 12, 15 | ASR, MFA, PAWs, local AI, backup validation |
| DETECT | Control 8, 13 | Centralized logging, EDR tuning, network sensors |
| RESPOND | Control 17 | Automated containment, IR runbooks, red team validation |
| RECOVER | Control 11, 18 | Recovery drills, chaos engineering, structural improvement |
---
*Previous: [CIS Controls Mapping](cis-controls-mapping.md)*

# Vertical Reference: Banking and Financial Services
> *"A bank's trust is its only real asset. Technical debt in security is a withdrawal from that account."*
This document adapts the antifragile rapid modernisation approach for banking and financial services—one of the most regulated, most targeted, and most technologically heterogeneous sectors. Banks face adversaries ranging from criminal syndicates to nation-states, while navigating DORA, PSD2, GDPR, NIS2, and national banking regulations.
---
## The Banking Security Context
### What Makes Banking Different
| Factor | Enterprise Default | Banking Reality |
|--------|-------------------|-----------------|
| Regulatory density | Moderate | Extreme (DORA, PSD2, GDPR, NIS2, Basel, national banking laws) |
| Adversary motivation | Financial (ransomware, fraud) | Financial + espionage + destabilization |
| Transaction speed | Batch, daily | Real-time, 24/7, instant payments |
| Legacy systems | 5-10 years old | 20-40 years old (mainframes, COBOL) |
| Third-party reliance | Moderate | High (fintech APIs, payment processors, SWIFT) |
| Data sensitivity | Personal data | Personal + financial + transaction patterns + behavioural biometrics |
### The Legacy Problem
Many banks run core banking systems on mainframes or mid-range systems that predate modern security architecture. These systems:
- Use legacy authentication (no MFA natively)
- Log minimally or opaquely
- Have no API layer; integration occurs via file transfer or terminal emulation
- Run on operating systems with limited patch support
Our approach does not demand legacy replacement. It demands **compensating controls** and **isolation architecture**.
---
## Regulatory Landscape
### DORA (Digital Operational Resilience Act) — EU
Effective January 2025, DORA imposes comprehensive ICT risk management requirements on EU financial entities.
| DORA Requirement | Antifragile Application |
|-----------------|------------------------|
| ICT risk management framework (Article 6) | Kill chain analysis as primary risk methodology; T0 asset classification for critical banking systems |
| ICT-related incident management (Article 17) | Sub-hour detection and containment targets; automated major-incident reporting to the competent authority |
| Digital operational resilience testing (Articles 24-26) | Quarterly recovery drills for core banking; annual red team; threat-led penetration testing (TLPT) |
| ICT third-party risk (Article 28) | Vendor exit architectures for all critical ICT providers; contract clawbacks for security failures |
| Information sharing (Article 45) | Anonymized incident signals shared via sector ISACs; defensive AI trained on collective threat data |
### PSD2 (Revised Payment Services Directive)
| PSD2 Requirement | Security Implication |
|-----------------|---------------------|
| Strong Customer Authentication (SCA) | MFA for payment initiation and account access |
| Dynamic linking | Authentication code must be specific to transaction amount and payee |
| Secure communication | TLS 1.2+, mutual authentication for TPP APIs |
| Access for TPPs (Third Party Providers) | New API attack surface; strict OAuth scope control |
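PSD2's dynamic linking requirement can be illustrated with an HMAC bound to the transaction details: because the authentication code covers the exact amount and payee, tampering with either invalidates it. This is a sketch of the principle only; the session key, truncation length, and message format below are illustrative assumptions, not a certified SCA implementation.

```python
import hmac
import hashlib

# Illustrative per-session key, provisioned to the customer's device out of band.
SESSION_KEY = b"per-session-key-provisioned-to-device"

def auth_code(amount: str, payee_iban: str) -> str:
    """Authentication code cryptographically bound to amount and payee (dynamic linking)."""
    msg = f"{amount}|{payee_iban}".encode()
    return hmac.new(SESSION_KEY, msg, hashlib.sha256).hexdigest()[:8]

code = auth_code("250.00", "DE89370400440532013000")
# Any change to amount or payee yields a different code:
assert auth_code("2500.00", "DE89370400440532013000") != code
assert auth_code("250.00", "DE89370400440532013001") != code
```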
### NIS2 for Systemic Banks
Systemic banks fall under NIS2 as "essential entities" with:
- 24-hour incident reporting to CSIRT
- Supply chain security obligations
- Board-level accountability for cybersecurity
### National Regulations
| Jurisdiction | Key Regulation |
|-------------|---------------|
| Germany | BAIT (Bankaufsichtliche Anforderungen an die IT), MaRisk |
| UK | CBEST, STAR, SYSC |
| US | FFIEC guidelines, SOX, GLBA |
| Switzerland | FINMA Circular 2023/1 |
---
## The Antifragile Posture for Banking
### Pillar 1: Structural Decoupling — Core Banking Isolation
**Principle**: The core banking system must be structurally isolated from internet-facing channels, third-party APIs, and general corporate IT.
**Antifragile Moves**:
| Layer | Isolation Requirement |
|-------|----------------------|
| **Channel layer** | Internet banking, mobile apps, open banking APIs → DMZ, WAF, API gateway |
| **Integration layer** | API gateway, middleware, ESB → validates, transforms, rate-limits all traffic |
| **Core layer** | Core banking, payments engine, general ledger → no direct internet; access only via integration layer |
| **Data layer** | Customer databases, transaction history → encrypted at rest; access via service accounts only |
| **Reporting layer** | Data warehouse, BI, regulatory reporting → read-only from core; no write-back |
**The Conversation**:
> *"Your core banking system is a Tier 0 asset. It should not know the internet exists. Every request must pass through an integration layer that validates, logs, and rate-limits. If a mobile app vulnerability is exploited, the adversary should hit the API gateway—not the general ledger."*
### Pillar 2: Optionality Preservation — Fintech and TPP Independence
**Principle**: Open banking and fintech integration create dependencies. The bank must retain the option to disconnect, replace, or limit any third party without operational paralysis.
**Antifragile Moves**:
- **API abstraction layer**: All TPP connections via bank-controlled API gateway; no direct TPP-to-core connections
- **Scope-limited OAuth**: TPP tokens granted only for specific accounts, specific data sets, specific time windows
- **Circuit breakers**: Automatic disconnection of TPPs exhibiting anomalous behaviour (high request rates, unusual data access patterns)
- **TPP risk register**: Every connected TPP rated for security maturity with quarterly re-assessment
- **Exit architecture**: Technical and contractual ability to revoke TPP access within 1 hour
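The circuit-breaker move above can be sketched as a sliding-window rate check at the API gateway: once a TPP exceeds its agreed request ceiling, the breaker trips and all further traffic is rejected until a human review. The threshold, window, and trip behaviour below are illustrative assumptions.

```python
from collections import deque

class TppCircuitBreaker:
    """Minimal sliding-window circuit breaker for a third-party provider connection."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()
        self.open = False  # open breaker = TPP disconnected

    def allow(self, now: float) -> bool:
        if self.open:
            return False
        self.timestamps.append(now)
        # Drop requests that fell out of the sliding window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) > self.max_requests:
            self.open = True  # trip: revoke access, alert, start review
            return False
        return True

breaker = TppCircuitBreaker(max_requests=3, window_seconds=1.0)
results = [breaker.allow(t) for t in (0.0, 0.1, 0.2, 0.3, 0.4)]
print(results)  # [True, True, True, False, False]
```

The key design choice is that the breaker fails closed: once tripped, it stays open until explicitly reset, matching the "disconnect first, investigate second" posture.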
### Pillar 3: Stress-to-Signal Conversion — Fraud as Intelligence
**Principle**: Every fraud attempt, successful or not, is free threat intelligence. The bank must learn faster than the adversary adapts.
**Antifragile Moves**:
- **Real-time fraud detection**: Local AI models trained on proprietary transaction data to detect anomalies without cloud exfiltration
- **Fraud-to-structure pipeline**: Every confirmed fraud case must produce at least one control improvement
- **Behavioral biometrics**: Device fingerprinting, typing cadence, mouse movement patterns—signals that improve with volume
- **Mule account detection**: Graph analysis on account opening and transaction patterns to identify money laundering networks
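The graph-analysis move for mule networks reduces, in its simplest form, to clustering accounts linked by transfers and flagging unusually large clusters. A minimal union-find sketch, with illustrative account IDs and an arbitrary size threshold; real detection would also weigh account age, velocity, and geography.

```python
def clusters(transfers):
    """Union-find over (sender, receiver) pairs; returns account clusters."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    for sender, receiver in transfers:
        union(sender, receiver)

    groups = {}
    for acct in parent:
        groups.setdefault(find(acct), set()).add(acct)
    return list(groups.values())

# Illustrative transfer pairs: A1..A4 form a chain, B1-B2 are isolated.
transfers = [("A1", "A2"), ("A2", "A3"), ("A3", "A4"), ("B1", "B2")]
suspicious = [c for c in clusters(transfers) if len(c) >= 4]
print(suspicious)
```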
### Pillar 4: Sovereign Intelligence — Payments Data Never Leaves
**Principle**: Payment transaction data reveals economic behaviour, business relationships, and operational patterns. It must never train a third-party AI.
**Antifragile Moves**:
- **Local fraud models**: Train on transaction history, merchant categories, geolocation, and temporal patterns locally
- **On-premise transaction monitoring**: AML/sanctions screening engines run on bank-controlled hardware
- **Closed-loop analytics**: Customer segmentation, product recommendation, and risk scoring using local models
- **Data residency by design**: Primary data storage in national or EU jurisdiction; encryption keys in HSM under bank control
**The Conversation**:
> *"Your payments data is not just customer data. It is a map of your economy. Sending it to a cloud AI for 'fraud optimization' is not a technology partnership. It is an intelligence transfer. Local models. Local hardware. Local keys."*
### Pillar 5: Asymmetric Payoff — Resilience Over Perfection
**Principle**: Banks cannot prevent all fraud or all attacks. The antifragile bank designs systems where small security investments yield disproportionate reductions in catastrophic risk.
**Antifragile Moves**:
- **Segmented transaction limits**: Real-time limits by channel, geography, time, and customer segment; limits the blast radius of compromised credentials
- **Synthetic account testing**: Maintain honeypot accounts that alert on any access attempt
- **Rapid account freezing**: Sub-60-second ability to freeze accounts, revoke tokens, and block cards
- **Distributed ledger backup**: Critical transaction records replicated to immutable, geographically distributed storage
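The segmented-limit move is a small amount of code with a disproportionate payoff: a compromised credential can only move what its channel-and-segment limit allows. The limit values, channel names, and fail-closed default below are illustrative assumptions.

```python
# Illustrative per-(channel, segment) transaction ceilings, in EUR.
LIMITS = {
    ("mobile", "retail"):    2_000,
    ("mobile", "business"): 10_000,
    ("branch", "retail"):   25_000,
}
DEFAULT_LIMIT = 500  # unknown combination: fail closed to a small limit

def authorize(channel: str, segment: str, amount: float) -> bool:
    """Approve only within the per-channel, per-segment limit."""
    return amount <= LIMITS.get((channel, segment), DEFAULT_LIMIT)

assert authorize("mobile", "retail", 1_500)
assert not authorize("mobile", "retail", 5_000)  # blast radius of stolen credentials capped
assert not authorize("api", "retail", 1_000)     # unrecognized channel fails closed
```

Failing closed on unknown combinations is the asymmetric part: the cost is an occasional legitimate decline, the benefit is that novel channels never start with an unbounded limit.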
---
## The Rapid Modernisation Plan: Banking Variant
### Phase 1: Hygiene (Days 0-30) — Banking-Specific Additions
In addition to standard hygiene:
| Action | Owner | Deliverable | Regulatory Link |
|--------|-------|-------------|----------------|
| Inventory all systems processing payment data | Security / Architecture | PCI-DSS / payment system asset inventory | PSD2, PCI-DSS |
| Map all open banking / TPP connections | API Team | TPP connection matrix with data flows | PSD2 |
| Audit SWIFT infrastructure access and messaging | Security / Treasury | SWIFT CSP compliance gap analysis | SWIFT CSP |
| Verify data residency for customer and transaction data | Legal / Cloud | Data residency attestation | GDPR, DORA |
| Inventory cryptographic key material and HSMs | Security | Key management inventory | DORA, national crypto regs |
### Phase 2: Control (Days 30-60) — Banking-Specific Additions
| Action | Owner | Deliverable | Regulatory Link |
|--------|-------|-------------|----------------|
| Implement API gateway security: rate limiting, OAuth scope enforcement, input validation | API / Security | API security configuration audit | PSD2, DORA |
| Harden SWIFT infrastructure: dedicated network, restricted access, CSP controls | Security / Treasury | SWIFT CSP self-assessment | SWIFT CSP |
| Deploy tokenization for card data where not already present | Security / Payments | Tokenization coverage report | PCI-DSS |
| Implement privileged access vaulting for core banking admins | Security | PAM coverage for core banking | DORA, internal audit |
| Encrypt all backup and archive data with HSM-managed keys | Backup / Security | Encryption coverage report | GDPR, DORA |
### Phase 3: Sovereignty (Days 60-90) — Banking-Specific Additions
| Action | Owner | Deliverable | Regulatory Link |
|--------|-------|-------------|----------------|
| Deploy local AI for fraud detection pilot | AI / Fraud | Fraud detection model with false positive/negative rates | DORA (resilience testing) |
| Conduct core banking recovery drill | Operations / Security | Recovery time objective (RTO) validation | DORA Article 24 |
| Test TPP disconnection procedure | API / Security | TPP revocation time measurement | PSD2, DORA |
| Validate incident reporting automation to regulator | Security / Legal | Automated reporting pipeline test | DORA Article 19 |
### Phase 4: Antifragility (Days 90-180) — Banking-Specific Additions
| Action | Owner | Deliverable | Regulatory Link |
|--------|-------|-------------|----------------|
| Threat-led penetration testing (TLPT) | External / Security | TLPT report with remediation | DORA Article 26 |
| Chaos engineering on channel layer (non-production) | Resilience | Chaos experiment findings | DORA resilience testing |
| Red team exercise including TPP exploitation | Security | Red team report with kill chain | DORA, internal audit |
| Board-level cybersecurity briefing with antifragile metrics | CISO / Board | Quarterly board report | DORA governance, NIS2 |
---
## SWIFT Customer Security Programme (CSP)
For banks using SWIFT messaging:
| CSP Control | Antifragile Implementation |
|------------|---------------------------|
| 1.1: Restrict Internet Access | SWIFT infrastructure on dedicated VLAN with no internet; jump host access only |
| 1.2: Secure the Operating System | Hardened OS baseline, automated patching, application whitelisting |
| 1.3: Restrict Logical Access | Vaulted credentials, MFA, session recording for all SWIFT access |
| 1.4: Malware Protection | EDR on SWIFT workstations, network segmentation, email security |
| 1.5: Software Integrity | Signed software only, integrity monitoring, change control |
| 2.1: Internal Data Flow Security | Encryption for all SWIFT data in transit within the bank |
| 2.2: Security Event Monitoring | Dedicated logging for SWIFT infrastructure; alerting on anomalous access |
| 2.3: Transaction Business Controls | Dual authorization for high-value messages; anomaly detection on message patterns |
| 2.4: Connection Integrity | Mutual TLS, certificate pinning, connection anomaly detection |
| 2.5: Service Providers | Due diligence on SWIFT service bureaus; exit clauses; audit rights |
| 2.6: Customer Environment Security | Annual self-assessment with independent validation |
| 2.7: Penetration Testing | Annual penetration testing of SWIFT infrastructure |
| 2.8: Cyber Incident Information Sharing | Participation in sector ISACs; anonymized threat sharing |
| 2.9: Transaction Controls for Funds Transfers | Additional validation for high-risk corridors and counterparties |
| 2.10: Operational Risk Management | Integration of SWIFT risk into enterprise operational risk framework |
| 2.11: Security Awareness Training | Role-specific training for SWIFT operators and administrators |
---
## M365 in Banking
Banks often use M365 for corporate functions while maintaining strict separation from payment systems.
| Consideration | Banking Requirement |
|--------------|---------------------|
| **License tier** | E3 is common; E5 for security and compliance officers. Defender for Office 365 Plan 2 strongly recommended for email security. |
| **Data loss prevention** | E3 includes only basic DLP for email and files; no endpoint DLP. A critical gap for banks. Recommend Purview add-ons or third-party DLP. |
| **Email archiving** | 7+ year immutable retention for regulatory inquiries. Requires Exchange Online Plan 2 or add-on. |
| **eDiscovery** | Legal hold and eDiscovery required for litigation and regulatory requests. Purview required for advanced features. |
| **Customer data in M365** | Strictly prohibit customer PII in Teams/SharePoint unless DLP and encryption are active |
| **Third-party apps** | Disable user consent; require admin approval for all enterprise apps |
| **Mobile access** | Intune-managed devices only; block unmanaged device access to email and SharePoint |
See [M365 E3 Hardening](../playbooks/m365-e3-hardening.md) for tactical guidance, and apply these banking overlays.
---
## Core Banking and Legacy System Security
### Compensating Controls for Legacy
When core banking systems cannot be modernized directly:
| Legacy Limitation | Compensating Control |
|------------------|---------------------|
| No native MFA | Place terminal access behind PAM vault with MFA gate; no direct user login |
| Minimal logging | Deploy screen/session recording for all access; instrument file transfers |
| No encryption in transit | Force all connectivity through TLS-terminating proxy or VPN |
| Weak password policies | Vault all service account passwords; rotate automatically; no human knowledge |
| No patch support | Isolate on dedicated network segment; application whitelisting; intrusion detection |
| File-based integration | Scan all files at transfer points; validate checksums; log all movements |
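The file-transfer compensating control above amounts to a validation gate: a file is admitted to the core side only if it matches the checksum declared by the sender. A minimal sketch, with an illustrative batch payload and manifest format:

```python
import hashlib
import hmac

def admit_file(payload: bytes, declared_sha256: str) -> bool:
    """Accept a transferred file only if it matches its declared checksum.

    hmac.compare_digest gives a constant-time comparison, avoiding
    timing side channels on the checksum check.
    """
    actual = hashlib.sha256(payload).hexdigest()
    return hmac.compare_digest(actual, declared_sha256)

# Illustrative batch file and its manifest entry, recorded at the sending side.
batch = b"ACH-BATCH-20250101|record1|record2"
manifest_hash = hashlib.sha256(batch).hexdigest()

assert admit_file(batch, manifest_hash)
assert not admit_file(batch + b"tampered", manifest_hash)  # rejected and logged
```

Combined with logging every admission decision, this gives the legacy core exactly the enriched audit trail it cannot produce itself.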
### The Integration Layer as Security Boundary
For banks with legacy core systems, the integration layer (API gateway, ESB, middleware) becomes the **security control point**:
- All authentication modernized at the integration layer
- All logging enriched at the integration layer
- All rate limiting and circuit breaking enforced at the integration layer
- All input validation performed at the integration layer
The core banking system sees only validated, logged, controlled traffic.
---
## Cryptography and Key Management
Banking regulators are increasingly specific about cryptographic controls.
| Control | Implementation |
|---------|---------------|
| **Key generation** | HSM-generated for all production keys; dual control for key ceremonies |
| **Key storage** | HSM or hardware-backed key stores only; no software-only keys for signing or encryption |
| **Key rotation** | Automated rotation for TLS keys; annual rotation for long-term signing keys |
| **Quantum readiness** | Inventory all cryptographic implementations; begin crypto-agility planning |
| **Key escrow** | Split knowledge for backup keys; geographic separation of escrow components |
---
## Evidence Package for Regulators and Auditors
| Regulatory Request | Evidence from Antifragile Program |
|-------------------|----------------------------------|
| DORA ICT risk framework | Kill chain analysis, T0 asset register, risk-based vulnerability prioritization |
| DORA resilience testing | Quarterly recovery drill reports, annual TLPT/penetration test, chaos engineering results |
| DORA incident reporting | Mean-time-to-detect, mean-time-to-contain, automated reporting pipeline test results |
| DORA third-party risk | Vendor risk register, exit architectures, contract security clauses |
| PSD2 SCA compliance | MFA coverage report, dynamic linking validation, TPP access audit |
| SWIFT CSP | Self-assessment with independent validation, penetration test report |
| GDPR data protection | Data residency attestation, encryption coverage, DLP policy, breach notification test |
| Internal audit | Antifragile maturity assessment, control effectiveness metrics, remediation tracking |
---
*Previous: [Vertical: Power Utilities](vertical-power-utilities.md)*

# Vertical Reference: Power and Utilities
> *"The grid does not care about your quarterly targets. It cares whether you understood the boundary between IT and operations before the adversary did."*
This document adapts the antifragile rapid modernisation approach for power generation, transmission, distribution, and water utilities. These organizations operate industrial control systems (ICS/SCADA) where safety and availability are paramount, regulatory oversight is intense, and the convergence of IT and OT creates existential attack surfaces.
---
## The Power and Utility Context
### What Makes This Sector Different
| Factor | Enterprise Default | Power/Utility Reality |
|--------|-------------------|----------------------|
| Downtime tolerance | Hours | Seconds to minutes (protection systems); hours for generation |
| Safety impact | Data loss, financial harm | Physical harm, loss of life, environmental catastrophe |
| System lifetime | 3-5 years | 20-40 years (generation, transmission, protection relays) |
| Regulatory driver | GDPR, industry standards | NIS2, CER, IEC 62351, NERC CIP (North America), national energy regulators |
| OT/IT boundary | Often porous or nonexistent | Legally and physically mandated; convergence is the primary risk |
| Supply chain | Moderate depth | Extreme (multi-vendor, multi-national, obsolete equipment) |
| Remote access | Common, convenient | Heavily restricted; often requires physical presence or dedicated lines |
### The IT/OT Convergence Problem
Power utilities historically operated OT networks (SCADA, EMS, DMS, protection relays) as **air-gapped systems**. Over the past two decades, convergence has introduced:
- Remote diagnostics over internet-connected VPNs
- Centralized patch management through IT SCCM/WSUS
- Business intelligence systems reading OT historian data
- Vendor remote support terminals in control centers
- Smart grid and Advanced Metering Infrastructure (AMI) connecting customer-facing IT to grid operations
Every convergence point is a **potential bridge for adversaries** from IT to OT.
**The executive framing**:
> *"Your control room does not need email. Your protection relays do not need internet access. Every connection between your IT network and your operational technology is a connection an adversary can cross. We are not adding bureaucracy. We are re-establishing the boundary that keeps the lights on."*
---
## Regulatory Landscape
### EU NIS2 Directive (2023)
Power utilities and water suppliers are classified as **essential entities** under NIS2.
| NIS2 Requirement | Power/Utility Application |
|-----------------|--------------------------|
| Risk management measures | Kill chain analysis for IT→OT bridges; physical security assessment |
| Supply chain security | Vendor access inventory for all OT equipment; firmware provenance tracking |
| Incident reporting (24h early warning; 72h incident notification) | Automated detection and reporting to national CSIRT and energy regulator |
| Business continuity | Black start capability; grid islanding procedures; manual override validation |
| Cryptography | Encrypted communications for all IT/OT integration points |
| MFA | Hardware tokens for all remote access to OT or critical IT systems |
| Vulnerability handling | Risk-based prioritization with **safety impact assessment** |
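The incident-reporting row above carries hard statutory clocks (early warning within 24 hours of awareness, incident notification within 72 hours). A minimal sketch of tracking those deadlines mechanically, with illustrative timestamps:

```python
from datetime import datetime, timedelta

# NIS2 Art. 23 timelines: early warning within 24 h of awareness,
# full incident notification within 72 h (final report follows later).
EARLY_WARNING = timedelta(hours=24)
FULL_NOTIFICATION = timedelta(hours=72)

def reporting_deadlines(detected_at: datetime) -> dict:
    """Return the regulator-facing deadlines for a significant incident."""
    return {
        "early_warning_due": detected_at + EARLY_WARNING,
        "notification_due": detected_at + FULL_NOTIFICATION,
    }

d = reporting_deadlines(datetime(2025, 3, 1, 9, 0))
# early warning due 2025-03-02 09:00, notification due 2025-03-04 09:00
```

Wiring this into the incident-response tooling ensures the clock starts at detection, not at the first status meeting.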
### CER Directive (Critical Entities Resilience)
Requires power utilities to demonstrate resilience against:
- Natural disasters
- Cyberattacks
- Supply chain disruptions
- Pandemics and workforce unavailability
**Antifragile application**: Chaos engineering for non-safety systems; cross-training for manual procedures; distributed spare parts inventory.
### Sector-Specific Standards
| Standard | Scope |
|----------|-------|
| **IEC 62351** | Power systems cybersecurity: communications protocols, authentication, encryption |
| **IEC 61850** | Substation communication (GOOSE, SV); security extensions for IEC 61850-90-20 |
| **NERC CIP** | North American electric reliability; mandatory standards with heavy penalties |
| **ENTSO-E Cybersecurity Guidance** | European transmission system operator requirements |
| **BDEW Whitepaper** | German energy sector cybersecurity best practices |
---
## The Antifragile Posture for Power and Utilities
### Pillar 1: Structural Decoupling — The IT/OT Firewall
**Principle**: IT and OT must be decoupled to the maximum extent compatible with operational requirements. The air gap is the default. Any bridge must be justified, documented, and monitored.
**Antifragile Moves**:
| Action | Implementation | Priority |
|--------|---------------|----------|
| **Network segmentation** | Physically separate IT and OT; unidirectional gateway or data diode for IT→OT data flows | P0 |
| **No AD trust to OT** | OT AD (if any) must be a separate forest with one-way trust or no trust | P0 |
| **Jump host architecture** | All IT-to-OT access via hardened, monitored jump hosts with session recording | P1 |
| **Vendor access airlock** | Vendor VPNs terminate in dedicated DMZ; no direct OT access; remote hands or on-site escort for OT | P1 |
| **Remove internet from OT** | OT VLANs have no direct internet egress; updates via offline media or controlled proxy | P0 |
| **AMI / Smart Grid isolation** | Advanced Metering Infrastructure on dedicated network; no direct path to SCADA or EMS | P1 |
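The connection-matrix discipline behind these moves can be enforced programmatically. A minimal sketch (zone and mechanism labels are illustrative, not a product schema) that flags bridges violating the decoupling rules:

```python
def audit_connections(matrix):
    """Flag IT/OT bridges that violate the decoupling rules above.

    Each entry: {"src": zone, "dst": zone, "via": mechanism, "justification": str}
    """
    violations = []
    for c in matrix:
        if c["src"] == "OT" and c["dst"] == "internet":
            violations.append(("P0: OT internet egress", c))
        if c["src"] == "IT" and c["dst"] == "OT" and c["via"] not in ("data_diode", "jump_host"):
            violations.append(("P0: unmediated IT->OT path", c))
        if not c.get("justification"):
            violations.append(("undocumented connection", c))
    return violations

matrix = [
    {"src": "IT", "dst": "OT", "via": "data_diode", "justification": "historian replication"},
    {"src": "IT", "dst": "OT", "via": "smb_share", "justification": ""},
]
# the second entry is flagged twice: unmediated path and missing justification
print(audit_connections(matrix))
```

Run against the live connection matrix, this turns the P0/P1 table into a repeatable audit rather than a one-off assessment.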
### Pillar 2: Optionality Preservation — Vendor and Technology Independence
**Principle**: Power utilities depend on vendors for SCADA, protection relays, turbine control, and substation automation. This dependency must not become a single point of failure.
**Antifragile Moves**:
- **Multi-vendor strategy for critical systems**: No single vendor should control >50% of protection, control, or monitoring functions
- **Spare parts inventory**: Maintain critical spares for legacy OT equipment that vendors no longer support
- **Firmware escrow and provenance**: Require vendors to deposit firmware; verify cryptographic signatures before deployment
- **Local competence**: Train internal staff to operate and maintain systems without vendor support for 30 days
- **Protocol independence**: Where possible, support multiple communication protocols to avoid single-vendor lock-in
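The >50% concentration rule above is easy to check continuously. A minimal sketch, assuming a simple function-to-vendor mapping (names hypothetical):

```python
from collections import Counter

def over_concentrated(functions, limit=0.5):
    """Flag vendors controlling more than `limit` of the critical functions.

    `functions` maps each protection/control/monitoring function to its vendor.
    """
    counts = Counter(functions.values())
    total = len(functions)
    return [vendor for vendor, n in counts.items() if n / total > limit]

estate = {
    "distance_protection": "VendorA",
    "differential_protection": "VendorA",
    "busbar_protection": "VendorA",
    "scada_frontend": "VendorB",
}
print(over_concentrated(estate))  # → ['VendorA'] (3/4 = 75% > 50%)
```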
### Pillar 3: Stress-to-Signal Conversion — OT Incident Learning
**Principle**: OT incidents are rare but high-impact. The organization must learn from every anomaly, near-miss, and exercise.
**Antifragile Moves**:
- **OT security operations centre (SOC) integration**: Feed OT alarms into the SOC with analysts trained on industrial protocols
- **Monthly tabletop exercises**: Simulate OT-specific scenarios (compromised EMS, rogue protection relay settings, ransomware on engineering workstations)
- **Post-incident structural mandate**: Every OT incident or near-miss must produce at least one architectural or procedural change
- **Red team with bounded OT scope**: Annual exercise including OT reconnaissance, constrained by safety requirements
### Pillar 4: Sovereign Intelligence — Local AI for the Grid
**Principle**: Grid data is among the most sensitive an organization possesses. It reveals generation capacity, topology, switching patterns, load profiles, and operational routines.
**Antifragile Moves**:
- **Local AI for OT anomaly detection**: Analyze historian data, DCS logs, and protection relay events without cloud exfiltration
- **Closed-loop digital twin**: Train models on local OT data to predict equipment failures; never export raw telemetry
- **Air-gapped AI inference**: Deploy inference nodes in OT DMZ with no return path to IT or internet
- **Load forecasting sovereignty**: Local models for demand prediction using proprietary grid data
**The executive framing**:
> *"Your grid data tells an adversary exactly when and where to strike. It tells a competitor your capacity constraints. Sending it to a cloud AI for 'optimization' is not a technology decision. It is a national security and competitive intelligence decision. Local models on local hardware. Full stop."*
### Pillar 5: Asymmetric Payoff — Resilience Over Prevention
**Principle**: In power utilities, perfect prevention is impossible. The goal is to survive and recover faster than the adversary can exploit.
**Antifragile Moves**:
- **Black start capability**: Maintain the ability to restart the grid from shutdown without external power
- **Grid islanding**: Design systems so that sections can disconnect and operate independently during disturbances
- **Manual override procedures**: Every automated system must have a documented, tested manual procedure
- **Redundant communication paths**: Power line carrier, microwave, satellite backup for SCADA and protection communications
- **Protection relay independence**: Electromechanical or static relays as backup for digital relays in critical paths
---
## The Rapid Modernisation Plan: Power/Utility Variant
### Phase 1: Hygiene (Days 0-30)
In addition to standard hygiene:
| Action | Owner | Deliverable |
|--------|-------|-------------|
| Inventory all OT assets: DCS, SCADA, EMS, protection relays, RTUs, AMI | OT Security / Engineering | OT asset inventory with vendor and firmware versions |
| Map all IT-to-OT network connections | Network / OT | Connection matrix with business justification per connection |
| Audit vendor remote access: who, how, when, for how long | OT Security / Procurement | Vendor access log and hardened policy |
| Identify OT systems with internet connectivity | Network | List with immediate remediation plan |
| Document manual override procedures for critical systems | OT Engineering | Procedure manual, signed off by operations and safety |
| Validate backup of EMS / DMS configurations | OT Engineering | Backup integrity test report |
### Phase 2: Control (Days 30-60)
| Action | Owner | Deliverable |
|--------|-------|-------------|
| Implement network segmentation: IT/OT DMZ with unidirectional gateway | Network / OT | Segmentation architecture and validated firewall rules |
| Harden vendor access: time-bounded, session-recorded, MFA with hardware tokens | OT Security | Vendor access gateway operational |
| Enable OT logging: historian, DCS, firewall, protection relay events | OT Security | Centralized OT log aggregation (air-gapped SIEM or historian) |
| Patch OT systems: test in lab, deploy in maintenance windows | OT Engineering | Patch management procedure with safety gates |
| Secure engineering workstations (EWS): application whitelisting, no internet | OT Security | EWS hardening standard deployed |
### Phase 3: Sovereignty (Days 60-90)
| Action | Owner | Deliverable |
|--------|-------|-------------|
| Deploy local AI for OT anomaly detection pilot | AI / OT Security | OT anomaly detection with false positive tuning |
| Validate black start / islanding procedures | Operations | Test report with time-to-recovery metrics |
| Conduct OT-specific tabletop exercise | Security / Operations | Exercise report with structural improvements |
| Implement firmware integrity monitoring | OT Security | Baseline hashes for critical OT firmware |
| Test protection relay fail-over to electromechanical backup | Engineering | Fail-over test report |
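The firmware integrity monitoring deliverable amounts to baseline hashes plus periodic comparison. A minimal sketch using SHA-256 over firmware images (device names and payloads illustrative):

```python
import hashlib

def baseline(images):
    """Record SHA-256 hashes of known-good firmware images (bytes)."""
    return {name: hashlib.sha256(blob).hexdigest() for name, blob in images.items()}

def drift(baseline_hashes, observed_images):
    """Return devices whose current firmware no longer matches the baseline."""
    return [
        name for name, blob in observed_images.items()
        if hashlib.sha256(blob).hexdigest() != baseline_hashes.get(name)
    ]

good = baseline({"relay-A": b"fw-v2.1", "rtu-7": b"fw-v5.0"})
print(drift(good, {"relay-A": b"fw-v2.1", "rtu-7": b"fw-v5.0-tampered"}))  # → ['rtu-7']
```

In practice the observed images come from vendor maintenance tools or offline extraction during maintenance windows; the comparison itself is this simple.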
### Phase 4: Antifragility (Days 90-180)
| Action | Owner | Deliverable |
|--------|-------|-------------|
| Annual red team with bounded OT scope | Security | Red team report with kill chain analysis |
| Chaos engineering on non-safety IT systems | Resilience | Monthly experiment schedule and findings |
| Vendor exit architecture for critical OT platforms | Procurement / Engineering | 90-day vendor transition plan per critical system |
| Cross-training: operations staff on manual procedures | Operations | Training completion metrics |
| Participate in sector ISAC information sharing | Security | Threat intelligence integration report |
---
## Substation and Protection Specifics
### IEC 61850 Security
IEC 61850 (substation communication) uses GOOSE and Sampled Values (SV) that were not designed with security in mind.
**Hardening priorities**:
- **IEC 61850-90-20**: Implement cybersecurity recommendations for IEC 61850 networks
- **Authentication**: Digitally sign GOOSE messages where IEDs support it
- **Network segmentation**: GOOSE/SV traffic on dedicated VLAN; no routing to IT networks
- **IED hardening**: Disable unused services; change default passwords; enable logging
- **Configuration management**: Version control for SCL files; change detection for IED settings
### Protection Relay Security
Protection relays are the **safety-critical edge** of the grid. Compromise can cause physical damage.
| Control | Implementation |
|---------|---------------|
| Access control | Vaulted credentials; multi-person approval for settings changes |
| Logging | All settings changes logged with before/after values |
| Integrity | Cryptographic checksums for firmware and settings files |
| Redundancy | Independent protection schemes (e.g., distance + differential) |
| Manual backup | Electromechanical or static relay backup for critical digital protections |
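The logging control above, every settings change recorded with before/after values, reduces to a structured diff of the settings file. A minimal sketch (setting names illustrative):

```python
def settings_diff(before, after):
    """Return every changed relay setting with its before/after values."""
    keys = set(before) | set(after)
    return {
        k: (before.get(k), after.get(k))
        for k in sorted(keys)
        if before.get(k) != after.get(k)
    }

before = {"zone1_reach_ohm": 12.5, "trip_delay_ms": 150}
after = {"zone1_reach_ohm": 19.0, "trip_delay_ms": 150}
print(settings_diff(before, after))  # → {'zone1_reach_ohm': (12.5, 19.0)}
```

Emitting this diff into the change log (with the approving engineers' identities) satisfies both the logging and the multi-person-approval controls in one step.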
---
## Generation-Specific Considerations
### Thermal / Nuclear / Hydro
| Generation Type | Specific Risk | Control |
|----------------|--------------|---------|
| **Thermal** | Turbine control system compromise | Dedicated turbine control network; no IT connectivity |
| **Nuclear** | Safety system interference | Air-gapped safety systems; regulatory compliance with national nuclear authority |
| **Hydro** | Dam control / spillway gate manipulation | Physical controls for critical water management; redundant level sensors |
| **Renewables** | Inverter-based resource (IBR) vulnerability | Secure firmware updates; anti-islanding protection; grid support function validation |
### Distributed Energy Resources (DER)
Solar, wind, and battery inverters connect to the distribution grid with varying security maturity.
- **Action**: DER interconnection standards must include cybersecurity requirements
- **Action**: Monitor DER communications for anomalous commands or settings changes
- **Action**: Aggregate DER visibility in DMS/ADMS without direct control paths
---
## Water and Wastewater Utilities
Water utilities share many characteristics with power but have additional concerns:
| Concern | Application |
|---------|-------------|
| **Safety** | Contamination prevention, pressure management, chemical dosing control |
| **SCADA/OT** | Treatment plant automation, distribution pump control, reservoir level management |
| **Criticality** | Water is life-sustaining; outages have immediate public health impact |
| **Regulation** | EPA (US), Drinking Water Inspectorate (UK), national health authorities |
**Additional controls for water utilities**:
- **Physical security** for treatment chemicals (chlorine, fluoride) to prevent intentional contamination
- **Redundant water quality sensors** with cross-validation
- **Manual override capability** for all automated chemical dosing systems
- **Isolation of IT from operational water quality monitoring**
---
## M365 in Power and Utilities
Corporate IT in power utilities uses M365 but must be strictly separated from OT.
| Consideration | Power/Utility Requirement |
|--------------|--------------------------|
| **Data residency** | M365 data in EU/national datacenters; verify tenant location |
| **Conditional access** | Block M365 access from non-corporate devices for privileged users; geo-restrict admin access |
| **Guest access** | Strictly prohibit in OT-connected tenants; heavily vet in corporate tenant |
| **Teams / SharePoint** | Never used for OT document sharing or control room communication |
| **Mobile device management** | Field engineer tablets Intune-managed; restricted app installation |
| **Email security** | Exchange Online Protection (EOP) baseline at minimum; Defender for Office 365 P2 recommended for critical infrastructure |
See [M365 E3 Hardening](../playbooks/m365-e3-hardening.md) for tactical hardening, and apply these overlays.
---
## Evidence Package for Regulators
| Requirement | Evidence from Antifragile Program |
|------------|----------------------------------|
| NIS2 risk management | Kill chain analysis, T0 asset classification, IT/OT connection matrix |
| NIS2 incident handling | IR runbooks, OT-specific response procedures, quarterly drill reports |
| NIS2 business continuity | Black start test reports, islanding validation, manual procedure verification |
| NIS2 supply chain security | Vendor risk register, firmware provenance, vendor exit architectures |
| NIS2 encryption | Data classification with encryption mapping, TLS configuration audits |
| NIS2 vulnerability handling | Vulnerability scan reports with safety-impact prioritization |
| CER resilience | Chaos engineering results, cross-training metrics, spare parts inventory |
---
*Previous: [NIST CSF Mapping](nist-csf-mapping.md)*
*Next: [Vertical: Telco](vertical-telco.md)*
# Vertical Reference: Telecommunications
> *"A telco's network is its nervous system. Compromise it, and you do not just steal data—you control the medium through which a nation communicates."*
This document adapts the antifragile rapid modernisation approach for telecommunications providers—mobile network operators, fixed-line operators, internet service providers, and converged operators. These organizations manage national infrastructure, process massive volumes of subscriber data, and face adversaries ranging from criminal fraudsters to nation-state actors seeking communications intelligence.
---
## The Telecommunications Context
### What Makes Telco Different
| Factor | Enterprise Default | Telco Reality |
|--------|-------------------|---------------|
| Scale | Thousands of endpoints | Millions of subscribers, hundreds of thousands of network elements |
| Real-time requirement | Batch acceptable | Call setup, SMS, data sessions are real-time; latency matters |
| Regulatory driver | GDPR, industry standards | GDPR + NIS2 + telecom-specific security frameworks + national licensing conditions |
| Adversary motivation | Financial (ransomware, fraud) | Financial + espionage + surveillance + network disruption |
| Signaling exposure | Minimal | SS7, Diameter, GTP, SIP are exposed to hundreds of partner networks globally |
| Supply chain | Moderate | Extreme (equipment vendors from multiple geopolitical blocs, legacy switches, proprietary protocols) |
| Customer data depth | Personal data | Personal + location + communication patterns + device identity + lawful intercept capability |
### The Convergence Challenge
Telcos are converging previously separate networks:
- **Fixed and mobile** (FMC — Fixed Mobile Convergence)
- **IT and network** (cloud-native 5G core, NFV, SDN)
- **Consumer and enterprise** (unified platforms, shared infrastructure)
- **Communications and content** (streaming, advertising, IoT platforms)
Every convergence multiplies the attack surface and blurs accountability.
---
## Regulatory Landscape
### EU NIS2 Directive (2022/2555)
Telcos are classified as **essential entities** under NIS2 with stringent obligations.
| NIS2 Requirement | Telco Application |
|-----------------|------------------|
| Risk management measures | Network-wide kill chain analysis; signaling security assessment |
| Supply chain security | Equipment vendor risk (especially high-risk vendors); firmware provenance |
| Incident reporting (24h → 72h) | Automated detection and reporting to national regulator and ENISA |
| Business continuity | Network resilience testing; disaster recovery for core network functions |
| Cryptography | Encryption for signaling, management, and subscriber data |
| MFA | Hardware tokens for all core network and network management access |
| Vulnerability handling | Rapid patching of network elements with service continuity planning |
### Telecom-Specific Security Frameworks
| Framework | Scope |
|-----------|-------|
| **ETSI EN 303 645** | Cybersecurity for consumer IoT devices (relevant for telco IoT offerings) |
| **GSMA FS.38** | Fraud and security framework for mobile operators |
| **GSMA Network Equipment Security Assurance Scheme (NESAS)** | Vendor security assessment for 5G equipment |
| **3GPP SA3** | Security architecture and procedures for mobile systems |
### National Telecom Security Frameworks
Many EU member states have additional national requirements:
- **Germany**: Telekommunikations-Sicherheitsverordnung (TSI)
- **UK**: Telecommunications (Security) Act 2021
- **France**: ANSSI guides for operators of vital importance
---
## The Antifragile Posture for Telecommunications
### Pillar 1: Structural Decoupling — Network Segmentation
**Principle**: The core network must be structurally isolated from internet-facing services, enterprise IT, and third-party APIs.
**Antifragile Moves**:
| Layer | Isolation Requirement |
|-------|----------------------|
| **Core network** | Signaling (MME, AMF, HSS/UDM, PCRF/PCF) on dedicated network; no direct internet access |
| **Radio access network (RAN)** | gNodeB / eNodeB management plane separated from user plane; no direct core access from RAN management |
| **Customer-facing services** | BSS (billing, CRM), OSS (operations), customer portals in DMZ with strict core access controls |
| **Enterprise services** | MPLS, SD-WAN, dedicated APNs on isolated infrastructure segments |
| **IoT platforms** | Dedicated network slice or APN; no direct subscriber data access without API gateway |
| **Interconnect** | SS7, Diameter, SIP, GTP signaling firewalls at every partner boundary |
### Pillar 2: Optionality Preservation — Vendor and Protocol Independence
**Principle**: Telcos depend on a small number of equipment vendors for core network functions. This concentration is a strategic vulnerability.
**Antifragile Moves**:
- **Multi-vendor RAN**: Open RAN architectures reduce dependency on single radio vendors
- **Cloud-native core portability**: 5G core deployed on container platforms portable across cloud providers
- **Protocol abstraction**: API gateways abstract subscriber-facing services from core network protocols
- **Vendor exit architecture**: Technical ability to replace core network vendor within defined timeframe
- **Firmware diversity**: Avoid identical firmware versions across all instances of a network element
### Pillar 3: Stress-to-Signal Conversion — Fraud and Attack Intelligence
**Principle**: Telcos process billions of transactions. Every fraud attempt, signaling anomaly, and attack probe is intelligence that should improve defences.
**Antifragile Moves**:
- **Real-time fraud detection**: Local AI models on call detail records, signaling data, and subscriber behaviour
- **Signaling anomaly detection**: SS7/Diameter/GTP firewalls with behavioural analysis
- **SIM swap detection**: Correlate SIM changes with account access, device fingerprint, and location
- **Wangiri / IRSF detection**: Identify missed-call fraud and international revenue share fraud patterns
- **Fraud-to-structure pipeline**: Every confirmed fraud case produces a control improvement
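The SIM swap detection move correlates a SIM change with subsequent sensitive account activity inside a short window. A minimal sketch, with event names illustrative (real feeds would come from HLR/HSS change logs and the identity platform):

```python
from datetime import datetime, timedelta

def sim_swap_risk(swap_time, account_events, window=timedelta(hours=24)):
    """Flag sensitive account events that follow a SIM change too closely."""
    sensitive = {"password_reset", "payee_added", "otp_login"}
    return [
        (name, at) for name, at in account_events
        if name in sensitive and timedelta(0) <= at - swap_time <= window
    ]

swap = datetime(2025, 6, 1, 10, 0)
events = [
    ("otp_login", datetime(2025, 6, 1, 10, 40)),
    ("payee_added", datetime(2025, 6, 3, 9, 0)),   # outside the window
]
print(sim_swap_risk(swap, events))
# → [('otp_login', datetime(2025, 6, 1, 10, 40))]
```

Production systems would add device fingerprint and location correlation, but the core signal, SIM change followed quickly by a sensitive action, is exactly this join.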
### Pillar 4: Sovereign Intelligence — Subscriber Data Never Leaves
**Principle**: Subscriber data (location, communication patterns, device identity, web browsing) is among the most sensitive data a state or criminal actor can access.
**Antifragile Moves**:
- **Local AI for network optimization**: Traffic prediction, energy saving, capacity planning on local infrastructure
- **Closed-loop fraud models**: Train on proprietary CDR and signaling data without cloud exfiltration
- **On-premise lawful intercept management**: Strict control over intercept capabilities; no third-party access
- **Data minimization for analytics**: Aggregate where possible; pseudonymize where individual analysis required
**The executive framing**:
> *"Your subscribers' location history, communication patterns, and digital behaviour are a map of your society. Sending that data to a cloud AI for 'network optimization' is not a technology partnership. It is an intelligence transfer. Local models. Local hardware. Local accountability."*
### Pillar 5: Asymmetric Payoff — Resilience at Scale
**Principle**: Telco failures affect millions instantly. Small investments in redundancy and rapid recovery yield massive reductions in societal and financial impact.
**Antifragile Moves**:
- **Distributed core architecture**: 5G core functions geographically distributed; failure of one data centre does not disable a region
- **Automated failover**: Base station controllers, DNS, and authentication functions with sub-minute failover
- **Synthetic monitoring**: Continuous health checks from subscriber perspective (call setup, data throughput, SMS delivery)
- **Chaos engineering on non-real-time systems**: Test resilience of billing, provisioning, and analytics without impacting calls
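The synthetic monitoring move aggregates probe results from the subscriber's perspective. A minimal sketch computing per-service success rates against an SLO (service names and SLO value illustrative):

```python
def service_health(probes, slo=0.98):
    """Aggregate subscriber-perspective probe results per service.

    Each probe is (service, succeeded). Returns services below the SLO.
    """
    totals, ok = {}, {}
    for service, succeeded in probes:
        totals[service] = totals.get(service, 0) + 1
        ok[service] = ok.get(service, 0) + (1 if succeeded else 0)
    return {
        s: ok[s] / totals[s]
        for s in totals
        if ok[s] / totals[s] < slo
    }

probes = [("call_setup", True)] * 97 + [("call_setup", False)] * 3 + [("sms", True)] * 50
print(service_health(probes))  # → {'call_setup': 0.97}
```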
---
## Signaling Security
### SS7 and SIGTRAN
SS7 is the legacy signaling protocol connecting mobile networks globally. It was designed without security and remains vulnerable:
| Vulnerability | Risk | Control |
|--------------|------|---------|
| Location tracking | Subscriber location exposed to any SS7 peer | SS7 firewall with location query filtering; home routing for SMS |
| Call/SMS interception | Forwarding rules modified remotely | SS7 firewall with message screening; MAP operation filtering |
| Fraud (CLID spoofing) | Caller ID manipulated for fraud | SS7 firewall with consistency checks; whitelist trusted partners |
| Denial of service | Flood of signaling messages | Rate limiting; anomaly detection; SS7 firewall with DDoS mitigation |
**Action**: Deploy SS7 signaling firewalls at STP/interconnect boundaries (commercial offerings exist from signaling vendors such as Oracle and Mavenir) with strict filtering rules. Monitor for anomalous signaling patterns.
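The filtering logic is category-based screening of MAP operations per peer, in the spirit of GSMA FS.11. A minimal sketch; the opcode-to-category assignments here are illustrative examples, not an authoritative mapping:

```python
# Illustrative MAP-operation screening. Category membership is an
# assumption for the example, not a normative FS.11 mapping.
NEVER_FROM_INTERCONNECT = {"provideSubscriberLocation", "anyTimeInterrogation"}
HOME_NETWORK_ONLY = {"insertSubscriberData"}

def screen(msg, trusted_roaming_partner: bool) -> str:
    """Return 'allow' or 'drop' for an inbound MAP message."""
    op = msg["operation"]
    if op in NEVER_FROM_INTERCONNECT:
        return "drop"
    if op in HOME_NETWORK_ONLY and not msg.get("from_home_network"):
        return "drop"
    if not trusted_roaming_partner:
        return "drop"  # default-deny for unknown peers
    return "allow"

print(screen({"operation": "provideSubscriberLocation"}, trusted_roaming_partner=True))  # → drop
print(screen({"operation": "sendRoutingInfoForSM"}, trusted_roaming_partner=True))       # → allow
```

Real deployments layer rate limiting and velocity checks (e.g., implausible location jumps) on top of this static screening.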
### Diameter and GTP
Diameter (LTE) and GTP (GPRS Tunneling Protocol) have replaced some SS7 functions but introduce their own vulnerabilities:
| Vulnerability | Risk | Control |
|--------------|------|---------|
| Diameter impersonation | Fake HSS/PCRF responses | Diameter edge agent with mutual authentication |
| GTP tunnel hijacking | Subscriber session takeover | GTP firewall; tunnel endpoint validation |
| Interconnect bypass | Roaming fraud via fake partner | Roaming hub validation; partner security assessment |
### SIP Security (VoLTE/VoNR / IMS)
The IP Multimedia Subsystem (IMS) enables voice over LTE/5G using SIP.
- **SIP firewall**: Filter malformed messages, prevent enumeration, block unauthorized registration
- **Toll fraud prevention**: Restrict international calling routes; detect anomalous call patterns
- **SPIT prevention**: Voice spam detection and filtering
---
## 5G Security Specifics
### 5G Core (5GC) Architecture
5G introduces a cloud-native, service-based architecture (SBA) with new security considerations:
| Element | Security Consideration |
|---------|----------------------|
| **AMF (Access and Mobility Management Function)** | Authentication gateway; compromise enables subscriber impersonation |
| **SMF (Session Management Function)** | Controls data sessions; compromise enables traffic redirection |
| **UPF (User Plane Function)** | Data forwarding; must be distributed and physically secured |
| **AUSF (Authentication Server Function)** | 5G-AKA authentication; keys must be HSM-protected |
| **UDM (Unified Data Management)** | Subscriber database; encryption at rest and strict access control |
| **PCF (Policy Control Function)** | QoS and charging policies; integrity critical for revenue assurance |
| **NRF (NF Repository Function)** | Service discovery; compromise enables man-in-the-middle between network functions |
**Security controls**:
- **TLS 1.3** for all service-based interfaces (SBI)
- **OAuth 2.0** for NF-to-NF authentication
- **Network slice isolation**: Strict separation between enterprise, consumer, and IoT slices
- **Edge security**: MEC (Multi-Access Edge Computing) nodes are physically distributed and harder to secure
### Network Slicing
Network slicing creates logical separation on shared physical infrastructure.
- **Slice isolation is logical, not physical**: A hypervisor compromise can bridge slices
- **Action**: Micro-segmentation between slices; independent encryption keys per slice
- **Action**: Slice-specific monitoring and anomaly detection
- **Action**: Independent security policies per slice (enterprise slice stricter than consumer)
---
## The Rapid Modernisation Plan: Telco Variant
### Phase 1: Hygiene (Days 0-30)
In addition to standard hygiene:
| Action | Owner | Deliverable |
|--------|-------|-------------|
| Inventory all network elements: RAN, core, transport, OSS, BSS | Network Engineering | Network asset inventory with vendor and firmware versions |
| Map all signaling interconnects: SS7, Diameter, GTP, SIP | Network Security | Interconnect matrix with partner security assessment |
| Audit roaming partner access and security posture | Roaming / Security | Partner risk register |
| Inventory subscriber data flows and storage locations | Data Protection / Security | Data flow map with residency verification |
| Identify all network management interfaces with internet exposure | Network Security | Exposure list with remediation plan |
### Phase 2: Control (Days 30-60)
| Action | Owner | Deliverable |
|--------|-------|-------------|
| Deploy signaling firewalls (SS7, Diameter, GTP, SIP) | Network Security | Firewall ruleset with anomaly detection |
| Implement network slice security policies | 5G Core Team | Slice isolation validation report |
| Harden network management: dedicated NOC access, MFA, session recording | Operations / Security | NOC access control operational |
| Encrypt management traffic across all network layers | Network Engineering | Encryption coverage report |
| Patch critical network elements with service continuity planning | Network Engineering | Patch schedule with rollback procedures |
### Phase 3: Sovereignty (Days 60-90)
| Action | Owner | Deliverable |
|--------|-------|-------------|
| Deploy local AI for fraud detection and network anomaly detection | AI / Security | Fraud detection pilot with false positive tuning |
| Validate core network disaster recovery and failover | Operations | Failover test report with recovery times |
| Conduct signaling security tabletop exercise | Security / Network | Exercise report with structural improvements |
| Implement firmware integrity monitoring for network elements | Network Security | Baseline hashes for critical firmware |
| Test lawful intercept process security and audit | Legal / Security | LI audit report |
### Phase 4: Antifragility (Days 90-180)
| Action | Owner | Deliverable |
|--------|-------|-------------|
| Red team exercise including signaling and core network reconnaissance | Security | Red team report with kill chain |
| Chaos engineering on OSS/BSS systems | Resilience | Experiment findings |
| Vendor exit architecture for critical network platforms | Procurement / Engineering | 90-day transition plan per critical vendor |
| Cross-training: NOC staff on manual procedures | Operations | Training completion metrics |
| Participate in sector ISAC and GSMA intelligence sharing | Security | Threat intelligence integration report |
---
## Subscriber Data and Privacy
Telcos hold massive PII datasets with unique sensitivity:
| Data Type | Sensitivity | Control |
|-----------|------------|---------|
| **Location data** | Extreme: real-time and historical location | Strict access control; pseudonymization for analytics; retain only as legally required |
| **Call detail records (CDR)** | High: communication patterns | Encryption at rest; audit all access; data minimization |
| **Internet browsing (DNS, DPI)** | High: digital behavior | Aggregate where possible; DPI for security only with legal review |
| **Device identity (IMEI, IMSI)** | Moderate: device tracking | Secure storage; restrict access to fraud and network operations |
| **Lawful intercept data** | Extreme: legal and ethical | Strict chain of custody; independent audit; minimal retention |
**GDPR implications**:
- Subscriber data processing must have clear legal basis
- Data retention periods must be justified and enforced
- Subject access requests must be fulfillable across all systems
- Data breach notification: 72 hours to regulator
---
## M365 in Telecommunications
Corporate telco functions use M365 but must be separated from network operations.
| Consideration | Telco Requirement |
|--------------|------------------|
| **Data residency** | Subscriber data must remain in national/EU boundaries; verify M365 tenant location |
| **Conditional access** | Block admin access from non-corporate devices; geo-restrict privileged accounts |
| **Guest access** | Strictly vet all guests; prohibit in tenant with network engineering data |
| **Teams / SharePoint** | Never used for network topology, subscriber data, or security incident details |
| **Mobile device management** | Sales and field engineer devices Intune-managed; restricted app installation |
| **Email security** | EOP baseline; Defender for Office 365 P2 strongly recommended due to phishing targeting |
See [M365 E3 Hardening](../playbooks/m365-e3-hardening.md) for tactical hardening, and apply these overlays.
---
## Evidence Package for Regulators
| Requirement | Evidence from Antifragile Program |
|------------|----------------------------------|
| NIS2 risk management | Kill chain analysis, T0 asset classification, signaling security assessment |
| NIS2 incident handling | IR runbooks, signaling-specific response procedures, quarterly drill reports |
| NIS2 business continuity | Core network failover test reports, disaster recovery validation |
| NIS2 supply chain security | Vendor risk register (especially high-risk vendors), firmware provenance |
| NIS2 encryption | Encryption coverage for signaling, management, and subscriber data |
| NIS2 vulnerability handling | Vulnerability scan reports with network-impact prioritization |
| Telecom licensing | Lawful intercept audit, subscriber data protection evidence, network resilience metrics |
---
*Previous: [Vertical: Power and Utilities](vertical-power-utilities.md)*
*Next: [Vertical: Banking](vertical-banking.md)*