Implementation Playbook

"This is not an upgrade. It is an insurance policy against the obsolescence of your own company."

This playbook provides tactical, step-by-step guidance for delivering the Rapid Modernisation Plan in a client environment. It is organized by workstream and intended for hands-on consultants, security architects, and technical leads.


Table of Contents

  1. Engagement Kickoff
  2. Workstream: Identity and Access
  3. Workstream: Perimeter and Visibility
  4. Workstream: AI Sovereignty
  5. Workstream: Resilience and Recovery
  6. Workstream: Culture and Governance
  7. Common Failure Modes
  8. Tools and Templates

Engagement Kickoff

Pre-Engagement Checklist

Before arriving on-site or starting the remote engagement:

  • Client has signed SOW with explicit scope, authority, and escalation paths
  • Key stakeholders identified: CISO, CIO, legal, business unit sponsors
  • Initial data room access granted: AD exports, cloud IAM, network diagrams, CMDB if one exists
  • Emergency contact list established with authority to disable accounts / block access
  • Backup verification: confirm backups exist and have been tested within last 90 days
  • "Get out of jail free" letter: written executive authorization for disruptive security actions

Day 0: Stakeholder Interviews

Interview each stakeholder for 30 minutes. Ask the same five questions:

  1. What is the shortest path to a business-ending incident here?
  2. What are you most worried about that you are not telling the board?
  3. What is the one system whose failure would stop revenue for 24 hours?
  4. Where is your proprietary data going that you cannot fully track?
  5. If you had to replace your primary cloud vendor in 90 days, could you?

Document answers. Look for contradictions between stakeholders—these reveal hidden dependencies.

Day 0: Establish the War Room

  • Physical or virtual space for daily standups
  • Shared dashboard: tasks, blockers, risks
  • Direct escalation path to executive sponsor
  • Decision log: every major decision recorded with rationale and owner

Workstream: Identity and Access

Objective

Eliminate unknown identities, reduce privilege, and establish just-in-time access before attackers exploit standing permissions.

Week 1: Identity Census

Step 1: Export all identities

  • Active Directory: all users, groups, computers, service accounts
  • Cloud IAM: AWS IAM, Azure AD / Entra ID, GCP IAM
  • SaaS platforms with local identity stores
  • Non-human identities: API keys, service principals, OAuth apps, managed identities

Step 2: Deduplicate and correlate

  • Match cloud identities to on-premises identities
  • Identify orphaned accounts: no owner, no recent use, no documented purpose
  • Identify over-privileged accounts: admin rights without justification
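The correlation step above can be sketched in a few lines. This is a minimal illustration, assuming identities have already been exported as UPN/email strings; the function names are ours, not from any specific tool:

```python
def normalize(identity: str) -> str:
    """Canonicalize a UPN/email so on-premises and cloud exports compare equal."""
    return identity.strip().lower()

def correlate(ad_identities, cloud_identities):
    """Split cloud identities into those matched to AD and potential orphans."""
    ad_index = {normalize(i) for i in ad_identities}
    matched, orphans = [], []
    for identity in cloud_identities:
        (matched if normalize(identity) in ad_index else orphans).append(identity)
    return matched, orphans
```

Anything landing in the orphans list is a candidate for the "no owner, no recent use, no documented purpose" review, not for automatic deletion.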

Step 3: Categorize by risk

| Category | Action | Timeline |
|---|---|---|
| Orphaned, unused > 90 days | Disable immediately | Day 1-2 |
| Shared accounts | Target for elimination or vaulting | Week 1-2 |
| Admin / privileged | Force password rotation + MFA enforcement | Day 3-5 |
| Service accounts with interactive logon | Review and restrict | Week 1-2 |
| External / vendor access | Audit and time-bound | Week 1-2 |
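The "orphaned, unused > 90 days" test is worth automating against the identity export. A minimal sketch, assuming last-logon timestamps have already been parsed into timezone-aware datetimes:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=90)

def is_stale(last_logon, now=None):
    """An account with no recorded logon, or none within 90 days,
    is a disable candidate per the risk table above."""
    now = now or datetime.now(timezone.utc)
    return last_logon is None or (now - last_logon) > STALE_AFTER
```

Accounts that have never logged on (last_logon is None) deserve the same treatment as stale ones: they are often forgotten service or test identities.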

Week 2: Privilege Reduction

Step 1: Implement Privileged Access Workstations (PAWs)

  • Dedicated machines for admin tasks
  • No internet browsing, no email, no non-admin applications
  • Physical or strongly virtualized separation

Step 2: Deploy Just-in-Time (JIT) elevation where possible

  • Azure AD PIM, AWS IAM Identity Center, or third-party PAM
  • Maximum elevation duration: 4 hours
  • Require approval for standing admin roles

Step 3: Password hygiene enforcement

  • Minimum 14 characters, no complexity requirements (NIST 800-63B)
  • Audit against known-breached password lists
  • Eliminate password rotation mandates unless compromise suspected
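One common way to audit against breached lists without exposing passwords is the k-anonymity scheme used by Have I Been Pwned's Pwned Passwords range API: only the first five hex characters of the SHA-1 leave the network, and the suffix is matched locally. A sketch, assuming that service is the data source (the network call itself is left to the operator):

```python
import hashlib

def hibp_parts(password: str):
    """Split the SHA-1 of a password into the 5-char prefix sent to the API
    and the suffix matched locally. The password itself never leaves."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

def breach_count(suffix: str, range_response: str) -> int:
    """Parse the 'SUFFIX:COUNT' lines returned by
    https://api.pwnedpasswords.com/range/<prefix>; 0 means not found."""
    for line in range_response.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count)
    return 0
```

Run the audit offline against the downloadable corpus if policy forbids even prefix queries to an external service.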

Week 3-4: MFA and Conditional Access

  • Enforce MFA on all remote access: VPN, cloud admin, RDP gateways
  • Implement risk-based conditional access:
    • Unmanaged device → require MFA + compliant device
    • Impossible travel → block or step-up
    • Legacy authentication → block entirely
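The impossible-travel rule reduces to a speed check between consecutive logons. A minimal sketch of the heuristic, assuming logon events already carry geolocated coordinates (thresholds are illustrative, not from any vendor's implementation):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def impossible_travel(loc_a, loc_b, hours_between, max_kmh=900.0):
    """Flag two logons whose implied ground speed exceeds an airliner's."""
    distance = haversine_km(*loc_a, *loc_b)
    if hours_between <= 0:
        return distance > 0
    return distance / hours_between > max_kmh
```

Tune max_kmh and add an allowlist for known VPN egress points before enforcing a block, or the rule will flag every user behind a geo-shifting proxy.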

Common Pitfalls

  • Over-scoping: Do not attempt to fix every identity in 30 days. Focus on privileged and external first.
  • Breaking automation: Service account password rotations can break CI/CD. Coordinate with application owners. Test in non-production first.
  • Shadow IT identities: SaaS platforms with standalone accounts (Slack, Zoom, etc.) are often missed. Use email domain scanning or CASB tools.

Workstream: Perimeter and Visibility

Objective

Know exactly what the organization looks like from the outside, and monitor every path that crosses the trust boundary.

Week 1-2: External Attack Surface Mapping

Step 1: Passive reconnaissance

  • Enumerate subdomains: certificate transparency logs, DNS brute force, search engine dorks
  • Identify exposed services: Shodan, Censys, custom port scanning from external vantage points
  • Map cloud assets: public S3 buckets, open storage accounts, exposed databases
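Certificate transparency is usually the highest-yield starting point. A parsing sketch, assuming the crt.sh JSON output format (query `https://crt.sh/?q=%25.<domain>&output=json`; the HTTP fetch is left to the operator, and the field name `name_value` is crt.sh's):

```python
import json

def subdomains_from_crtsh(body: str) -> set[str]:
    """Extract unique host names from a crt.sh JSON response body."""
    names = set()
    for entry in json.loads(body):
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower().removeprefix("*.")
            if name:
                names.add(name)
    return names
```

De-duplicate the result against DNS brute-force and search-engine findings before handing the combined list to the client for ownership confirmation.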

Step 2: Active validation

  • Confirm ownership of discovered assets with client
  • Test for default credentials on exposed management interfaces
  • Document findings with risk ratings: P0 (immediate), P1 (urgent), P2 (planned)

Week 2-3: Internal Visibility

Step 1: Deploy endpoint detection

  • Microsoft Defender for Endpoint, CrowdStrike, SentinelOne, or equivalent
  • Target: 100% of managed Windows, macOS, Linux endpoints
  • Validate: can you see process execution, network connections, and file modifications?

Step 2: Network monitoring

  • Deploy sensors at:
    • Internet boundary
    • Internal network segments (especially IT/OT boundaries)
    • Critical server VLANs
  • Enable DNS query logging and analysis

Step 3: Log aggregation

  • Centralize logs from: identity systems, endpoints, firewalls, cloud control planes, critical applications
  • Minimum retention: 90 days hot, 1 year cold
  • Ensure tamper protection: attackers delete logs

Week 3-4: CMDB Seeding

  • Populate CMDB with T0 and T1 assets first
  • For each asset: owner, criticality, dependencies, recovery requirements
  • Accept imperfection. A partially correct CMDB is infinitely better than no CMDB.
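A seed record does not need a full CMDB schema. A minimal sketch of the per-asset fields listed above (field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """Minimal CMDB seed record; extend fields as the register matures."""
    name: str
    tier: str                         # "T0" or "T1" for the initial pass
    owner: str
    criticality: str                  # e.g. "revenue-stopping"
    dependencies: list[str] = field(default_factory=list)
    rto_hours: int = 24               # recovery time objective

def seed_order(assets):
    """T0 before T1, so the most critical records are populated first."""
    return sorted(assets, key=lambda a: a.tier)
```

Even a spreadsheet with these six columns satisfies the "partially correct CMDB" bar.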

Common Pitfalls

  • Scanning without authorization: Ensure written approval for active scanning. Some jurisdictions treat unauthorized scanning as criminal.
  • Alert fatigue: Do not enable every detection rule on day one. Start with high-confidence, high-impact alerts. Tune before expanding.
  • Log storage costs: Centralized logging is expensive. Prioritize critical systems. Use tiered storage.

Workstream: AI Sovereignty

Objective

Convert intelligence from a rented commodity into an owned, protected, T0-class asset.

Week 1-2: AI Usage Discovery

Step 1: Survey

  • Interview department heads: engineering, legal, marketing, operations, finance
  • Ask: "What AI tools are you using? What data are you putting into them?"
  • Expect 30-50% shadow usage. Employees use personal ChatGPT accounts, browser extensions, and mobile apps.

Step 2: Technical discovery

  • Review proxy logs for AI API traffic: OpenAI, Anthropic, Google, Azure OpenAI
  • Review SaaS billing for AI-enabled tools
  • Review browser extensions and endpoint software inventories
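The proxy-log review can be a simple domain match. A sketch, assuming plain-text log lines and a starter list of well-known AI API hosts (extend the tuple with whatever your proxy actually observes):

```python
# Assumption: a starter list of well-known AI API hosts; not exhaustive.
AI_API_DOMAINS = (
    "api.openai.com",
    "api.anthropic.com",
    "generativelanguage.googleapis.com",
    "openai.azure.com",
)

def ai_traffic_summary(proxy_log_lines):
    """Count proxy log lines that touch a known AI API endpoint, per domain."""
    hits = {}
    for line in proxy_log_lines:
        for domain in AI_API_DOMAINS:
            if domain in line:
                hits[domain] = hits.get(domain, 0) + 1
    return hits
```

Correlate hits with the survey answers: a domain with heavy traffic that no department admitted to using is the shadow usage you were told to expect.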

Step 3: Data flow mapping

For each discovered AI tool, document:

  • Data types entering the tool
  • Data residency and processing location
  • Vendor terms: training use, retention, deletion, subprocessing
  • Regulatory implications: GDPR, DORA, NIS2, industry-specific

Week 3-4: Local AI Infrastructure

Step 1: Select hardware or sovereign cloud

| Option | When to Use |
|---|---|
| On-premises GPU servers | High volume, strict air-gap, existing data centre capacity |
| Sovereign cloud (EU, national) | Regulatory requirements, no on-premises GPU expertise |
| Edge inference nodes | Distributed organization, OT environments, low-latency requirements |

Step 2: Select initial model

For most organizations, start with:

  • Base model: Llama 3, Mistral, or Qwen (7B-13B parameters, quantized to 4-bit)
  • Deployment: Ollama, vLLM, or llama.cpp for inference
  • Orchestration: LangChain or custom RAG pipeline for proprietary data integration
  • Fine-tuning: QLoRA for domain adaptation on proprietary datasets

Step 3: Deploy with T0 controls

  • Network segmentation: inference hosts have no direct internet egress
  • Access control: model weights encrypted at rest; access requires multi-party approval
  • Audit: log all prompts, responses, and model access
  • Backup: immutable backups of weights, configurations, and vector databases
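The prompt/response audit control can sit directly in the client code path. A sketch against Ollama's default local endpoint (`/api/generate` on port 11434); storing hashes rather than raw text in the general log stream is our design choice here, not a requirement of the tool:

```python
import hashlib
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def audit_record(user: str, model: str, prompt: str, response: str) -> dict:
    """Audit entry with content hashes: access is logged and correlatable
    without copying sensitive text into the log stream."""
    return {
        "ts": time.time(),
        "user": user,
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_sha256": hashlib.sha256(response.encode()).hexdigest(),
    }

def generate(model: str, prompt: str) -> str:
    """Single non-streaming completion from the local inference host."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Ship the audit records to the same tamper-protected log pipeline as the rest of the T0 estate; the full prompt text, if retained at all, belongs in a separately access-controlled store.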

Week 5-8: Pilot and Measure

Select one high-value, low-risk workflow:

| Workflow | Why It Works |
|---|---|
| Internal code review assistant | Proprietary code never leaves the perimeter; measurable quality improvement |
| Security log analysis | Feeds defensive AI directly; reduces analyst workload |
| Policy / compliance document drafting | High volume, repetitive, proprietary domain knowledge |
| Customer support triage | Reduces response time; training data is historical tickets |

Measurement criteria:

  • Accuracy vs. cloud baseline (human-evaluated on a sample)
  • Cost per inference (compute + personnel)
  • Data leakage incidents: zero
  • User satisfaction: qualitative survey

Common Pitfalls

  • Over-engineering the first deployment: Do not build a full MLOps platform for the pilot. Start simple. Prove value. Then scale.
  • Ignoring GPU availability: GPU procurement can take months. Have a cloud fallback for the pilot if on-premises hardware is delayed.
  • Neglecting prompt injection: Local models are not immune to adversarial prompts. Implement input validation and output filtering.
  • Forgetting the human loop: AI augments decisions; it does not replace accountability. Design workflows where humans retain final authority.

Workstream: Resilience and Recovery

Objective

Ensure that when—not if—a critical system fails, recovery is fast, tested, and deterministic.

Week 1-4: Backup Validation

Step 1: Inventory backup coverage

  • For every T0 and T1 asset: what is backed up, how often, where, by what mechanism
  • Identify gaps: databases without point-in-time recovery, VMs without application-consistent snapshots

Step 2: Test restoration

  • Select one critical system per week
  • Perform full restoration to isolated environment
  • Document: time to restore, data loss window, manual steps required, blockers encountered
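Capture each weekly drill in a uniform record so restore times can be trended across the engagement. A minimal sketch (field names are ours):

```python
from datetime import datetime

def drill_report(system, started, finished, data_loss_minutes, manual_steps, blockers):
    """One record per restoration test; trend restore_minutes over time."""
    return {
        "system": system,
        "restore_minutes": round((finished - started).total_seconds() / 60, 1),
        "data_loss_minutes": data_loss_minutes,
        "manual_steps": manual_steps,
        "blockers": blockers,
    }
```

A growing manual_steps list is itself a finding: every manual step is a place recovery stalls when the one person who knows it is unavailable.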

Step 3: Fix what breaks

  • If a backup cannot be restored, the backup does not exist
  • Update procedures, fix tooling, re-test

Month 2-3: Recovery Automation

  • Automate the most common recovery scenarios: VM restore, database point-in-time recovery, Active Directory forest recovery
  • Document runbooks for scenarios that cannot be fully automated
  • Train multiple team members on each runbook

Month 3-6: Chaos Engineering

Step 1: Game days

  • Scheduled, announced simulations of failure scenarios
  • Example: simulate domain controller failure during business hours
  • Measure: detection time, escalation time, resolution time, communication quality

Step 2: Chaos experiments

  • Unannounced, bounded experiments in non-production
  • Example: terminate API service instances, block DNS resolution, fill disk space
  • Validate: auto-scaling, alerting, runbook accuracy

Step 3: Production chaos

  • Only after months of successful game days and non-production experiments
  • Start with low-risk failures: single instance termination, network latency injection
  • Always have automated rollback and a human kill switch
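The kill-switch-plus-rollback pattern can be enforced in the experiment harness itself rather than left to operator discipline. A minimal sketch, assuming injection and rollback are supplied as callables (the kill-switch path is illustrative):

```python
import os

def run_chaos(inject, rollback, kill_switch="/tmp/chaos-kill"):
    """Run one bounded experiment: skip entirely if the human kill switch
    file exists, and always roll back, even if injection raises."""
    if os.path.exists(kill_switch):
        return "aborted"
    try:
        inject()
        return "completed"
    finally:
        rollback()
```

Because rollback runs in a finally block, a failed injection still restores the system; creating the kill-switch file lets any on-call human stop the next experiment without touching the harness.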

Common Pitfalls

  • Assuming backups work: Untested backups are prayers, not plans.
  • Recovery without validation: A restored system that cannot authenticate users or connect to databases is not recovered.
  • Chaos without guardrails: Never run chaos experiments when the organization is already under stress (active incident, change freeze, key personnel on leave).

Workstream: Culture and Governance

Objective

Embed antifragile principles into decision-making, hiring, and organizational habits.

Tactics

Blameless Post-Mortems

  • Within 48 hours of significant incident
  • Focus: what about the system allowed this mistake? Not: who made the mistake?
  • Mandatory output: at least one structural change (policy, architecture, or procedure)
  • Publish internally: transparency builds trust and disseminates learning

Security Champions Program

  • Identify one volunteer per team who acts as security liaison
  • Monthly 1-hour meeting: new threats, policy changes, team-specific concerns
  • Champions feed team context up and security guidance down

Red Team as a Service

  • Monthly or quarterly adversarial simulations
  • Report to CISO and board, not just IT
  • Measure: time to detect, time to contain, time to evict
  • Trend over time: the organization should get faster, not just more compliant

Antifragile Metrics Review

  • Monthly steering committee reviews:
    • Mean time to structural fix (from incident)
    • Number of chaos experiments run and lessons learned
    • % of vendor dependencies with documented exit plan
    • AI sovereignty maturity score

Common Pitfalls

  • Post-mortems without action: If findings are not tracked to completion, they become theater.
  • Security champions without authority: Champions need time allocation and executive backing, or they become scapegoats.
  • Metrics without narrative: Numbers alone do not persuade boards. Pair metrics with stories: "Here is what we learned, here is what we changed, here is why we are safer."

Common Failure Modes

| Failure Mode | Symptom | Remedy |
|---|---|---|
| Scope creep | 30-day phase stretches to 90 days | Time-box ruthlessly. Document deferred items for next phase. |
| Tool obsession | Team debates SIEM vendor for 3 weeks | Pick the good-enough tool. Implementation beats selection. |
| Perfectionism | CMDB project stalls waiting for completeness | Seed with critical assets. Expand iteratively. |
| Vendor capture | Recommendations always favor one provider | Disclose relationships. Maintain independence. Document alternatives. |
| Executive fatigue | Board stops attending updates | Lead with business risk, not technical detail. Show cost of inaction. |
| Operational resistance | IT refuses to disable legacy accounts | Use the "get out of jail free" letter. Escalate to executive sponsor. |
| Pilot purgatory | Local AI pilot runs forever without production migration | Define hard success criteria and a production migration date before starting. |

Tools and Templates

Templates Included in This Repository

  • T0 Asset Classification Worksheet
  • AI Usage Discovery Interview Guide (see Workstream: AI Sovereignty)
  • Blameless Post-Mortem Template (to be added)
  • Chaos Experiment Planning Template (to be added)
  • Vendor Exit Architecture Template (to be added)

Recommended Tools

| Category | Options | Notes |
|---|---|---|
| Endpoint Detection | Microsoft Defender, CrowdStrike, SentinelOne | Choose based on existing Microsoft footprint |
| SIEM / Log Analysis | Sentinel, Splunk, Elastic, Wazuh | Wazuh is open-source and sufficient for many environments |
| Identity Governance | Azure AD / Entra ID, Okta, Saviynt | Match to primary cloud identity provider |
| PAM / Vault | CyberArk, Delinea, HashiCorp Vault | Essential for service account and secret management |
| CMDB | ServiceNow, Device42, GLPI, or spreadsheet | Any CMDB is better than no CMDB |
| Local AI Inference | Ollama, vLLM, llama.cpp, TGI | Start simple; scale to TGI or vLLM for production load |
| Chaos Engineering | Gremlin, Chaos Mesh, custom scripts | Gremlin for enterprise; Chaos Mesh for Kubernetes |

This playbook is a living document. Update it with lessons from every engagement.

Previous: Rapid Modernisation Plan