
# Implementation Playbook

> *"This is not an upgrade. It is an insurance policy against the obsolescence of your own company."*

This playbook provides tactical, step-by-step guidance for delivering the [Rapid Modernisation Plan](rapid-modernisation-plan.md) in a client environment. It is organized by workstream and intended for hands-on consultants, security architects, and technical leads.

---

## Table of Contents

1. [Engagement Kickoff](#engagement-kickoff)
2. [Workstream: Identity and Access](#workstream-identity-and-access)
3. [Workstream: Perimeter and Visibility](#workstream-perimeter-and-visibility)
4. [Workstream: AI Sovereignty](#workstream-ai-sovereignty)
5. [Workstream: Resilience and Recovery](#workstream-resilience-and-recovery)
6. [Workstream: Culture and Governance](#workstream-culture-and-governance)
7. [Common Failure Modes](#common-failure-modes)
8. [Tools and Templates](#tools-and-templates)

---

## Engagement Kickoff

### Pre-Engagement Checklist

Before arriving on-site or starting the remote engagement:

- [ ] Client has signed SOW with explicit scope, authority, and escalation paths
- [ ] Key stakeholders identified: CISO, CIO, legal, business unit sponsors
- [ ] Initial data room access granted: AD exports, cloud IAM, network diagrams, CMDB if one exists
- [ ] Emergency contact list established with authority to disable accounts / block access
- [ ] Backup verification: confirm backups exist and have been tested within the last 90 days
- [ ] "Get out of jail free" letter: written executive authorization for disruptive security actions

### Day 0: Stakeholder Interviews

Interview each stakeholder for 30 minutes. Ask the same five questions:

1. What is the shortest path to a business-ending incident here?
2. What are you most worried about that you are not telling the board?
3. What is the one system whose failure would stop revenue for 24 hours?
4. Where is your proprietary data going that you cannot fully track?
5. If you had to replace your primary cloud vendor in 90 days, could you?

Document answers. Look for contradictions between stakeholders—these reveal hidden dependencies.

### Day 0: Establish the War Room

- Physical or virtual space for daily standups
- Shared dashboard: tasks, blockers, risks
- Direct escalation path to executive sponsor
- Decision log: every major decision recorded with rationale and owner

---

## Workstream: Identity and Access

### Objective

Eliminate unknown identities, reduce privilege, and establish just-in-time access before attackers exploit standing permissions.

### Week 1: Identity Census

**Step 1: Export all identities**

- Active Directory: all users, groups, computers, service accounts
- Cloud IAM: AWS IAM, Azure AD / Entra ID, GCP IAM
- SaaS platforms with local identity stores
- Non-human identities: API keys, service principals, OAuth apps, managed identities

**Step 2: Deduplicate and correlate**

- Match cloud identities to on-premises identities
- Identify orphaned accounts: no owner, no recent use, no documented purpose
- Identify over-privileged accounts: admin rights without justification
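
As a rough illustration of the correlation step, the sketch below matches exported identities by a normalized account name and flags orphans. The record fields (`name`, `owner`, `purpose`, `last_used`) are hypothetical and depend on the export format. Treating "orphaned" as all three conditions together, with a 90-day idle threshold, is an assumption.

```python
from datetime import datetime, timedelta, timezone


def normalize(name: str) -> str:
    """Lower-case the part before '@' so 'J.Doe@corp.example' matches 'j.doe'."""
    return name.split("@")[0].strip().lower()


def correlate(on_prem: list[dict], cloud: list[dict]) -> dict:
    """Split identities into matched, cloud-only, and on-premises-only sets."""
    on_prem_keys = {normalize(u["name"]) for u in on_prem}
    cloud_keys = {normalize(u["name"]) for u in cloud}
    return {
        "matched": sorted(on_prem_keys & cloud_keys),
        "cloud_only": sorted(cloud_keys - on_prem_keys),
        "on_prem_only": sorted(on_prem_keys - cloud_keys),
    }


def is_orphaned(identity: dict, now: datetime, max_idle_days: int = 90) -> bool:
    """Assumed definition: no owner, no documented purpose, and no recent use."""
    last_used = identity.get("last_used")
    idle = last_used is None or (now - last_used) > timedelta(days=max_idle_days)
    return not identity.get("owner") and not identity.get("purpose") and idle
```

Name-based matching is deliberately naive. In practice a stable attribute such as an employee ID or immutable object ID is a more reliable join key where one is available.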

**Step 3: Categorize by risk**

| Category | Action | Timeline |
|----------|--------|----------|
| Orphaned, unused > 90 days | Disable immediately | Day 1-2 |
| Shared accounts | Target for elimination or vaulting | Week 1-2 |
| Admin / privileged | Force password rotation + MFA enforcement | Day 3-5 |
| Service accounts with interactive logon | Review and restrict | Week 1-2 |
| External / vendor access | Audit and time-bound | Week 1-2 |
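
The table can be encoded as a small rule list so that every identity in the census gets a consistent action. This is a minimal sketch: the boolean flags on each identity record (`orphaned`, `privileged`, and so on) are hypothetical names, and returning every matching action rather than only the first is a design assumption.

```python
# Each rule: (predicate, action, timeline), taken from the table above.
RULES = [
    (lambda i: i.get("orphaned") and i.get("idle_days", 0) > 90,
     "Disable immediately", "Day 1-2"),
    (lambda i: i.get("privileged"),
     "Force password rotation + MFA enforcement", "Day 3-5"),
    (lambda i: i.get("shared"),
     "Target for elimination or vaulting", "Week 1-2"),
    (lambda i: i.get("service_account") and i.get("interactive_logon"),
     "Review and restrict", "Week 1-2"),
    (lambda i: i.get("external"),
     "Audit and time-bound", "Week 1-2"),
]


def triage(identity: dict) -> list[tuple[str, str]]:
    """Return every (action, timeline) pair whose rule matches the identity."""
    return [(action, timeline) for matches, action, timeline in RULES
            if matches(identity)]
```

An identity can match more than one row, for example a privileged vendor account, in which case both actions apply.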

### Week 2: Privilege Reduction

**Step 1: Implement Privileged Access Workstations (PAWs)**

- Dedicated machines for admin tasks
- No internet browsing, no email, no non-admin applications
- Physical or strongly virtualized separation

**Step 2: Deploy Just-in-Time (JIT) elevation where possible**

- Microsoft Entra PIM (formerly Azure AD PIM), AWS IAM Identity Center, or third-party PAM
- Maximum elevation duration: 4 hours
- Require approval for each elevation; treat any remaining standing admin role as a documented, approved exception

**Step 3: Password hygiene enforcement**

- Minimum 14 characters, no composition (complexity) rules; NIST SP 800-63B advises against composition rules and favors length
- Audit against known-breached password lists
- Eliminate password rotation mandates unless compromise is suspected
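
A minimal sketch of the length and breached-list checks follows. It assumes the breached list is available offline as a set of upper-case SHA-1 hex digests, which is how several public breached-password corpora are distributed. The function names are illustrative.

```python
import hashlib

MIN_LENGTH = 14  # playbook minimum; no composition rules are checked


def sha1_hex(password: str) -> str:
    """Upper-case SHA-1 hex digest, the format assumed for the breached list."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def check_password(password: str, breached_sha1: set[str]) -> list[str]:
    """Return a list of policy problems; an empty list means the password passes."""
    problems = []
    if len(password) < MIN_LENGTH:
        problems.append(f"shorter than {MIN_LENGTH} characters")
    if sha1_hex(password) in breached_sha1:
        problems.append("appears in breached-password list")
    return problems
```

SHA-1 is used here only because it matches the assumed list format. It is not a recommendation for how to store passwords.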

### Week 3-4: MFA and Conditional Access

- Enforce MFA on all remote access: VPN, cloud admin, RDP gateways
- Implement risk-based conditional access:
  - Unmanaged device → require MFA and limit access until the device is enrolled and compliant
  - Impossible travel → block or step-up
  - Legacy authentication → block entirely
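
To make the decision logic concrete, here is a hedged sketch of a risk-based evaluation, including one common way to flag impossible travel: great-circle distance divided by elapsed time. The 900 km/h threshold, the sign-in field names, and the rule ordering are all assumptions; commercial identity platforms implement their own detection.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two coordinates."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def is_impossible_travel(distance_km: float, hours_apart: float,
                         max_kmh: float = 900.0) -> bool:
    """True if the implied speed exceeds max_kmh (assumed airliner speed)."""
    if hours_apart <= 0:
        return distance_km > 0
    return distance_km / hours_apart > max_kmh


def evaluate(signin: dict) -> str:
    """Apply the three rules in an assumed priority order."""
    if signin.get("legacy_auth"):
        return "block"
    if signin.get("impossible_travel"):
        return "step_up"
    if not signin.get("device_managed", False):
        return "require_mfa_and_device_compliance"
    return "allow"
```

Geo-IP data is imprecise and VPN exit nodes move users around, so a real deployment would likely add a minimum-distance tolerance and an allow-list for known egress points.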

### Common Pitfalls

- **Over-scoping**: Do not attempt to fix every identity in 30 days. Focus on privileged and external first.
- **Breaking automation**: Service account password rotations can break CI/CD. Coordinate with application owners. Test in non-production first.
- **Shadow IT identities**: SaaS platforms with standalone accounts (Slack, Zoom, etc.) are often missed. Use email domain scanning or CASB tools.

---

## Workstream: Perimeter and Visibility

### Objective

Know exactly what the organization looks like from the outside, and monitor every path that crosses the trust boundary.

### Week 1-2: External Attack Surface Mapping

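**Step 1: Reconnaissance (passive first)**

- Enumerate subdomains: certificate transparency logs and search engine dorks (passive); DNS brute force (active, generates traffic to client name servers)
- Identify exposed services: Shodan and Censys (passive); custom port scanning from external vantage points (active, requires written authorization)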
- Map cloud assets: public S3 buckets, open storage accounts, exposed databases
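
Certificate transparency exports tend to contain wildcards, duplicates, mixed case, and look-alike domains. The sketch below cleans such a list down to in-scope hostnames. The input format (strings that may hold several newline-separated names) is an assumption about the export, and `in_scope_hosts` is an illustrative name.

```python
def in_scope_hosts(raw_names: list[str], root_domain: str) -> list[str]:
    """Normalize names and keep only the root domain and its subdomains."""
    root = root_domain.lower().rstrip(".")
    hosts = set()
    for entry in raw_names:
        for name in entry.splitlines():
            name = name.strip().lower().rstrip(".")
            if name.startswith("*."):
                name = name[2:]  # wildcard certs still reveal the parent name
            # Require an exact match or a dot boundary, so 'evil-example.com'
            # is not mistaken for a subdomain of 'example.com'.
            if name == root or name.endswith("." + root):
                hosts.add(name)
    return sorted(hosts)
```

The output is a candidate list only. Ownership still has to be confirmed with the client before any active testing.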

**Step 2: Active validation**

- Confirm ownership of discovered assets with client
- Test for default credentials on exposed management interfaces
- Document findings with risk ratings: P0 (immediate), P1 (urgent), P2 (planned)

### Week 2-3: Internal Visibility

**Step 1: Deploy endpoint detection**

- Microsoft Defender for Endpoint, CrowdStrike, SentinelOne, or equivalent
- Target: 100% of managed Windows, macOS, Linux endpoints
- Validate: can you see process execution, network connections, and file modifications?
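
Progress against the 100% target can be tracked by comparing the asset inventory with the hosts actually reporting to the EDR console. This is a minimal sketch that assumes both sources can be exported as hostname lists; the function and key names are illustrative.

```python
def edr_coverage(inventory: set[str], reporting: set[str]) -> dict:
    """Compare managed-endpoint inventory against hosts reporting to EDR."""
    inv = {h.lower() for h in inventory}
    rep = {h.lower() for h in reporting}
    covered = inv & rep
    pct = 100.0 * len(covered) / len(inv) if inv else 0.0
    return {
        "coverage_pct": round(pct, 1),
        "missing_agent": sorted(inv - rep),   # in inventory, no agent seen
        "unknown_hosts": sorted(rep - inv),   # agent seen, not in inventory
    }
```

The `unknown_hosts` list is often as useful as the coverage number, since it surfaces machines the inventory does not know about.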

**Step 2: Network monitoring**

- Deploy sensors at:
  - Internet boundary
  - Internal network segments (especially IT/OT boundaries)
  - Critical server VLANs
- Enable DNS query logging and analysis

**Step 3: Log aggregation**

- Centralize logs from: identity systems, endpoints, firewalls, cloud control planes, critical applications
- Minimum retention: 90 days hot, 1 year cold
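- Ensure tamper protection (for example append-only or write-once storage, with admin credentials separate from the monitored systems): attackers delete or alter logs to cover their tracks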
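
One way to make tampering detectable is a hash chain, where each stored record carries a digest of its content plus the previous record's digest. The sketch below illustrates the idea under simple assumptions (JSON-serializable records, SHA-256). It is not how any particular SIEM implements integrity protection.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash: str, record: dict) -> str:
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def chain(records: list[dict]) -> list[dict]:
    """Wrap each record with the previous hash and its own chained hash."""
    prev, out = GENESIS, []
    for rec in records:
        digest = _digest(prev, rec)
        out.append({"record": rec, "prev": prev, "hash": digest})
        prev = digest
    return out


def verify(chained: list[dict]):
    """Return the index of the first inconsistent entry, or None if intact."""
    prev = GENESIS
    for i, entry in enumerate(chained):
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["record"]):
            return i
        prev = entry["hash"]
    return None
```

A chain gives tamper evidence, not tamper prevention, and it cannot detect truncation of the newest entries unless the latest hash is also stored somewhere the attacker cannot reach.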

### Week 3-4: CMDB Seeding

- Populate CMDB with T0 and T1 assets first
- For each asset: owner, criticality, dependencies, recovery requirements
- Accept imperfection. A partially correct CMDB is infinitely better than no CMDB.

### Common Pitfalls

- **Scanning without authorization**: Ensure written approval for active scanning. Some jurisdictions treat unauthorized scanning as criminal.
- **Alert fatigue**: Do not enable every detection rule on day one. Start with high-confidence, high-impact alerts. Tune before expanding.
- **Log storage costs**: Centralized logging is expensive. Prioritize critical systems. Use tiered storage.

---

## Workstream: AI Sovereignty

### Objective

Convert intelligence from a rented commodity into an owned, protected, T0-class asset.

### Week 1-2: AI Usage Discovery

**Step 1: Survey**

- Interview department heads: engineering, legal, marketing, operations, finance
- Ask: "What AI tools are you using? What data are you putting into them?"
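- Expect substantial shadow usage; as a planning assumption rather than a measured figure, budget for 30-50% of AI use being unsanctioned. Employees use personal ChatGPT accounts, browser extensions, and mobile apps.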

**Step 2: Technical discovery**

- Review proxy logs for AI API traffic: OpenAI, Anthropic, Google, Azure OpenAI
- Review SaaS billing for AI-enabled tools
- Review browser extensions and endpoint software inventories
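
The proxy log review can be started with a simple hostname match. The sketch below assumes the logs have already been parsed into `(user, host)` pairs, which is a hypothetical format. The suffix list is illustrative and incomplete, so extend it with whatever the survey turns up.

```python
from collections import Counter

# Illustrative, incomplete list of AI service hostnames and suffixes.
AI_HOST_SUFFIXES = (
    "api.openai.com",
    "chatgpt.com",
    "api.anthropic.com",
    "claude.ai",
    "generativelanguage.googleapis.com",
    "openai.azure.com",  # Azure OpenAI endpoints are subdomains of this
)


def is_ai_host(host: str) -> bool:
    """Match on exact name or dot-bounded suffix to avoid look-alike domains."""
    host = host.lower().rstrip(".")
    return any(host == s or host.endswith("." + s) for s in AI_HOST_SUFFIXES)


def ai_usage(events: list[tuple[str, str]]) -> Counter:
    """Count requests per (user, host) for hosts that look like AI services."""
    counts = Counter()
    for user, host in events:
        if is_ai_host(host):
            counts[(user, host.lower())] += 1
    return counts
```

Hostname matching only sees traffic that crosses the proxy. Personal devices and mobile apps on other networks stay invisible, which is why the interviews in Step 1 still matter.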

**Step 3: Data flow mapping**

For each discovered AI tool, document:

- Data types entering the tool
- Data residency and processing location
- Vendor terms: training use, retention, deletion, subprocessing
- Regulatory implications: GDPR, DORA, NIS2, industry-specific

### Week 3-4: Local AI Infrastructure

**Step 1: Select hardware or sovereign cloud**

| Option | When to Use |
|--------|-------------|
| On-premise GPU servers | High volume, strict air-gap, existing data centre capacity |
| Sovereign cloud (EU, national) | Regulatory requirements, no on-premises GPU expertise |
| Edge inference nodes | Distributed organization, OT environments, low-latency requirements |

**Step 2: Select initial model**

For most organizations, start with:

- **Base model**: Llama 3, Mistral, or Qwen (7B-13B parameters, quantized to 4-bit)
- **Deployment**: Ollama, vLLM, or llama.cpp for inference
- **Orchestration**: LangChain or custom RAG pipeline for proprietary data integration
- **Fine-tuning**: QLoRA for domain adaptation on proprietary datasets
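
For hardware sizing, a back-of-the-envelope estimate of weight memory is parameters times bits per weight, divided by eight. The sketch below applies that arithmetic. The 20% overhead for KV cache and runtime buffers is an assumption, not a measured value, and real quantization formats store a little more than the nominal bits per weight.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): params * bits / 8."""
    return params_billions * bits_per_weight / 8


def estimated_total_gb(params_billions: float, bits_per_weight: float,
                       overhead_fraction: float = 0.2) -> float:
    """Weights plus an assumed fractional overhead for cache and buffers."""
    return weight_memory_gb(params_billions, bits_per_weight) * (1 + overhead_fraction)
```

By this estimate a 7B model at 4-bit needs about 3.5 GB for weights and a 13B model about 6.5 GB, against roughly 14 GB for a 7B model at 16-bit. Long context windows and concurrent requests can push actual usage well above the assumed overhead.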

**Step 3: Deploy with T0 controls**

- Network segmentation: inference hosts have no direct internet egress
- Access control: model weights encrypted at rest; access requires multi-party approval
- Audit: log all prompts, responses, and model access
- Backup: immutable backups of weights, configurations, and vector databases

### Week 5-8: Pilot and Measure

Select one high-value, low-risk workflow:

| Workflow | Why It Works |
|----------|-------------|
| Internal code review assistant | Proprietary code never leaves perimeter; measurable quality improvement |
| Security log analysis | Feeds defensive AI directly; reduces analyst workload |
| Policy / compliance document drafting | High volume, repetitive, proprietary domain knowledge |
| Customer support triage | Reduces response time; training data is historical tickets |

**Measurement criteria**:

- Accuracy vs. cloud baseline (human-evaluated on a sample)
- Cost per inference (compute + personnel)
- Data leakage incidents: zero
- User satisfaction: qualitative survey
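
The first two criteria reduce to simple arithmetic, sketched below. Inputs are hypothetical: costs in any single currency for the pilot period, and human ratings as a list of pass/fail judgments on the evaluation sample.

```python
def cost_per_inference(compute_cost: float, personnel_cost: float,
                       inference_count: int) -> float:
    """(compute + personnel) / number of inferences over the same period."""
    if inference_count <= 0:
        raise ValueError("inference_count must be positive")
    return (compute_cost + personnel_cost) / inference_count


def accuracy(ratings: list[bool]) -> float:
    """Share of sampled outputs that human evaluators rated acceptable."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return sum(ratings) / len(ratings)


def accuracy_gap(local_ratings: list[bool], cloud_ratings: list[bool]) -> float:
    """Local accuracy minus cloud baseline; negative means local is worse."""
    return accuracy(local_ratings) - accuracy(cloud_ratings)
```

For the comparison to mean anything, both models should be rated on the same sample, ideally with evaluators who do not know which output came from which model.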

### Common Pitfalls

- **Over-engineering the first deployment**: Do not build a full MLOps platform for the pilot. Start simple. Prove value. Then scale.
- **Ignoring GPU availability**: GPU procurement can take months. Have a cloud fallback for the pilot if on-premises hardware is delayed.
- **Neglecting prompt injection**: Local models are not immune to adversarial prompts. Implement input validation and output filtering.
- **Forgetting the human loop**: AI augments decisions; it does not replace accountability. Design workflows where humans retain final authority.

---

## Workstream: Resilience and Recovery

### Objective

Ensure that when—not if—a critical system fails, recovery is fast, tested, and deterministic.

### Week 1-4: Backup Validation

**Step 1: Inventory backup coverage**

- For every T0 and T1 asset: what is backed up, how often, where, by what mechanism
- Identify gaps: databases without point-in-time recovery, VMs without application-consistent snapshots

**Step 2: Test restoration**

- Select one critical system per week
- Perform full restoration to isolated environment
- Document: time to restore, data loss window, manual steps required, blockers encountered
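
A structured record keeps restore tests comparable from week to week. The following is a sketch with hypothetical field names. It derives time to restore and the data loss window from four timestamps captured during the test.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class RestoreTest:
    system: str
    last_good_backup: datetime    # timestamp of the backup that was restored
    simulated_failure: datetime   # moment the system is assumed to have failed
    restore_started: datetime
    restore_verified: datetime    # restored system validated as working
    manual_steps: list[str] = field(default_factory=list)
    blockers: list[str] = field(default_factory=list)

    @property
    def time_to_restore(self) -> timedelta:
        return self.restore_verified - self.restore_started

    @property
    def data_loss_window(self) -> timedelta:
        """Data written between the last good backup and the failure is lost."""
        return self.simulated_failure - self.last_good_backup
```

Measuring to `restore_verified` rather than to the end of the restore job is a deliberate choice here: the clock stops when the system is confirmed usable, not when the tool reports success.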

**Step 3: Fix what breaks**

- If a backup cannot be restored, the backup does not exist
- Update procedures, fix tooling, re-test

### Month 2-3: Recovery Automation

- Automate the most common recovery scenarios: VM restore, database point-in-time recovery, Active Directory forest recovery
- Document runbooks for scenarios that cannot be fully automated
- Train multiple team members on each runbook

### Month 3-6: Chaos Engineering

**Step 1: Game days**

- Scheduled, announced simulations of failure scenarios
- Example: simulate domain controller failure during business hours
- Measure: detection time, escalation time, resolution time, communication quality

**Step 2: Chaos experiments**

- Unannounced, bounded experiments in non-production
- Example: terminate API service instances, block DNS resolution, fill disk space
- Validate: auto-scaling, alerting, runbook accuracy

**Step 3: Production chaos**

- Only after months of successful game days and non-production experiments
- Start with low-risk failures: single instance termination, network latency injection
- Always have automated rollback and a human kill switch
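
Guardrails are easier to enforce when they are checked in code before an experiment starts. The pre-flight sketch below refuses to proceed when the organization is under stress or when rollback and kill-switch ownership are not in place. The state keys are hypothetical and would need to be wired to real incident, change, and staffing data.

```python
def preflight(state: dict) -> list[str]:
    """Return reasons NOT to run the experiment; an empty list means go."""
    reasons = []
    if state.get("active_incident"):
        reasons.append("active incident")
    if state.get("change_freeze"):
        reasons.append("change freeze in effect")
    if state.get("key_personnel_on_leave"):
        reasons.append("key personnel on leave")
    if not state.get("rollback_tested"):
        reasons.append("automated rollback not verified")
    if not state.get("kill_switch_owner"):
        reasons.append("no named human kill-switch owner")
    return reasons
```

The check defaults to "no go": an empty or incomplete state produces blocking reasons rather than silently passing.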

### Common Pitfalls

- **Assuming backups work**: Untested backups are prayers, not plans.
- **Recovery without validation**: A restored system that cannot authenticate users or connect to databases is not recovered.
- **Chaos without guardrails**: Never run chaos experiments when the organization is already under stress (active incident, change freeze, key personnel on leave).

---

## Workstream: Culture and Governance

### Objective

Embed antifragile principles into decision-making, hiring, and organizational habits.

### Tactics

**Blameless Post-Mortems**

- Within 48 hours of a significant incident
- Focus: what about the system allowed this mistake? Not: who made the mistake?
- Mandatory output: at least one structural change (policy, architecture, or procedure)
- Publish internally: transparency builds trust and disseminates learning

**Security Champions Program**

- Identify one volunteer per team who acts as security liaison
- Monthly 1-hour meeting: new threats, policy changes, team-specific concerns
- Champions feed team context up and security guidance down

**Red Team as a Service**

- Monthly or quarterly adversarial simulations
- Report to CISO and board, not just IT
- Measure: time to detect, time to contain, time to evict
- Trend over time: the organization should get faster, not just more compliant

**Antifragile Metrics Review**

- Monthly steering committee reviews:
  - Mean time to structural fix (from incident)
  - Number of chaos experiments run and lessons learned
  - % of vendor dependencies with documented exit plan
  - AI sovereignty maturity score
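
Two of these metrics can be computed directly from records most teams already keep. The sketch below assumes incidents are tracked as `(occurred, structural_fix_completed)` date pairs and vendors as a mapping of name to whether an exit plan is documented. Both shapes are hypothetical.

```python
from datetime import date
from statistics import mean


def mean_days_to_structural_fix(incidents: list[tuple]):
    """Average days from incident to completed structural fix.

    Incidents whose fix is still open (None) are excluded, so also report
    the open count alongside this number.
    """
    durations = [(fixed - occurred).days
                 for occurred, fixed in incidents if fixed is not None]
    return mean(durations) if durations else None


def pct_with_exit_plan(vendors: dict[str, bool]) -> float:
    """Percentage of vendor dependencies with a documented exit plan."""
    if not vendors:
        return 0.0
    return 100.0 * sum(vendors.values()) / len(vendors)
```

Excluding open fixes makes the mean look better than reality when many findings are stalled, which is why the open count belongs in the same review.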

### Common Pitfalls

- **Post-mortems without action**: If findings are not tracked to completion, they become theater.
- **Security champions without authority**: Champions need time allocation and executive backing, or they become scapegoats.
- **Metrics without narrative**: Numbers alone do not persuade boards. Pair metrics with stories: "Here is what we learned, here is what we changed, here is why we are safer."

---

## Common Failure Modes

| Failure Mode | Symptom | Remedy |
|-------------|---------|--------|
| **Scope creep** | 30-day phase stretches to 90 days | Time-box ruthlessly. Document deferred items for next phase. |
| **Tool obsession** | Team debates SIEM vendor for 3 weeks | Pick the good-enough tool. Implementation beats selection. |
| **Perfectionism** | CMDB project stalls waiting for completeness | Seed with critical assets. Expand iteratively. |
| **Vendor capture** | Recommendations always favor one provider | Disclose relationships. Maintain independence. Document alternatives. |
| **Executive fatigue** | Board stops attending updates | Lead with business risk, not technical detail. Show cost of inaction. |
| **Operational resistance** | IT refuses to disable legacy accounts | Use the "get out of jail free" letter. Escalate to executive sponsor. |
| **Pilot purgatory** | Local AI pilot runs forever without production migration | Define hard success criteria and production migration date before starting. |

---

## Tools and Templates

### Templates Included in This Repository

- [T0 Asset Classification Worksheet](../core/t0-asset-framework.md#t0-classification-worksheet)
- AI Usage Discovery Interview Guide (see Workstream: AI Sovereignty)
- Blameless Post-Mortem Template (to be added)
- Chaos Experiment Planning Template (to be added)
- Vendor Exit Architecture Template (to be added)

### Recommended External Tools

| Category | Options | Notes |
|----------|---------|-------|
| Endpoint Detection | Microsoft Defender, CrowdStrike, SentinelOne | Choose based on existing Microsoft footprint |
| SIEM / Log Analysis | Sentinel, Splunk, Elastic, Wazuh | Wazuh is open-source and sufficient for many environments |
| Identity Governance | Azure AD / Entra ID, Okta, Saviynt | Match to primary cloud identity provider |
| PAM / Vault | CyberArk, Delinea, HashiCorp Vault | Essential for service account and secret management |
| CMDB | ServiceNow, Device42, GLPI, or spreadsheet | Any CMDB is better than no CMDB |
| Local AI Inference | Ollama, vLLM, llama.cpp, TGI | Start simple; scale to TGI or vLLM for production load |
| Chaos Engineering | Gremlin, Chaos Mesh, custom scripts | Gremlin for enterprise; Chaos Mesh for Kubernetes |

---

*This playbook is a living document. Update it with lessons from every engagement.*

*Previous: [Rapid Modernisation Plan](rapid-modernisation-plan.md)*