# Implementation Playbook

> *"This is not an upgrade. It is an insurance policy against the obsolescence of your own company."*

This playbook provides tactical, step-by-step guidance for delivering the [Rapid Modernisation Plan](rapid-modernisation-plan.md) in a client environment. It is organized by workstream and intended for hands-on consultants, security architects, and technical leads.

---

## Table of Contents

1. [Engagement Kickoff](#engagement-kickoff)
2. [Workstream: Identity and Access](#workstream-identity-and-access)
3. [Workstream: Perimeter and Visibility](#workstream-perimeter-and-visibility)
4. [Workstream: AI Sovereignty](#workstream-ai-sovereignty)
5. [Workstream: Resilience and Recovery](#workstream-resilience-and-recovery)
6. [Workstream: Culture and Governance](#workstream-culture-and-governance)
7. [Common Failure Modes](#common-failure-modes)
8. [Tools and Templates](#tools-and-templates)

---

## Engagement Kickoff

### Pre-Engagement Checklist

Before arriving on-site or starting the remote engagement:

- [ ] Client has signed the SOW with explicit scope, authority, and escalation paths
- [ ] Key stakeholders identified: CISO, CIO, legal, business unit sponsors
- [ ] Initial data room access granted: AD exports, cloud IAM, network diagrams, CMDB if one exists
- [ ] Emergency contact list established, naming people with authority to disable accounts or block access
- [ ] Backup verification: confirm backups exist and have been tested within the last 90 days
- [ ] "Get out of jail free" letter: written executive authorization for disruptive security actions

### Day 0: Stakeholder Interviews

Interview each stakeholder for 30 minutes. Ask the same five questions:

1. What is the shortest path to a business-ending incident here?
2. What are you most worried about that you are not telling the board?
3. What is the one system whose failure would stop revenue for 24 hours?
4. Where is your proprietary data going that you cannot fully track?
5. If you had to replace your primary cloud vendor in 90 days, could you?

Document the answers. Look for contradictions between stakeholders: they reveal hidden dependencies.

### Day 0: Establish the War Room

- Physical or virtual space for daily standups
- Shared dashboard: tasks, blockers, risks
- Direct escalation path to executive sponsor
- Decision log: every major decision recorded with rationale and owner

---

## Workstream: Identity and Access

### Objective

Eliminate unknown identities, reduce privilege, and establish just-in-time access before attackers exploit standing permissions.

### Week 1: Identity Census

**Step 1: Export all identities**

- Active Directory: all users, groups, computers, service accounts
- Cloud IAM: AWS IAM, Microsoft Entra ID (formerly Azure AD), GCP IAM
- SaaS platforms with local identity stores
- Non-human identities: API keys, service principals, OAuth apps, managed identities

**Step 2: Deduplicate and correlate**

- Match cloud identities to on-premises identities
- Identify orphaned accounts: no owner, no recent use, no documented purpose
- Identify over-privileged accounts: admin rights without justification

**Step 3: Categorize by risk**

| Category | Action | Timeline |
|----------|--------|----------|
| Orphaned, unused > 90 days | Disable immediately | Day 1-2 |
| Shared accounts | Target for elimination or vaulting | Week 1-2 |
| Admin / privileged | Force password rotation + MFA enforcement | Day 3-5 |
| Service accounts with interactive logon | Review and restrict | Week 1-2 |
| External / vendor access | Audit and time-bound | Week 1-2 |

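The risk table above can be turned into a first-pass triage rule. The following is a minimal sketch, not a prescribed implementation: the `Identity` fields and the precedence order (orphaned first, then shared, privileged, service, external) are assumptions made for illustration, and a real census would populate these fields from the AD and cloud IAM exports.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Identity:
    """Hypothetical record shape for one row of the identity census."""
    name: str
    owner: Optional[str]        # None means no accountable owner on record
    last_used: Optional[date]   # None means the account was never used
    is_shared: bool = False
    is_privileged: bool = False
    is_service: bool = False
    interactive_logon: bool = False
    is_external: bool = False


def triage(identity: Identity, today: date) -> str:
    """Return the first matching category from the risk table (assumed precedence)."""
    stale = identity.last_used is None or (today - identity.last_used) > timedelta(days=90)
    if identity.owner is None and stale:
        return "orphaned: disable immediately"
    if identity.is_shared:
        return "shared: eliminate or vault"
    if identity.is_privileged:
        return "privileged: rotate password and enforce MFA"
    if identity.is_service and identity.interactive_logon:
        return "service with interactive logon: review and restrict"
    if identity.is_external:
        return "external: audit and time-bound"
    return "no immediate action"
```

An identity that matches several categories is reported under the first one only; emitting every matching category is an equally reasonable design if the client wants a full worklist per account.
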
### Week 2: Privilege Reduction

**Step 1: Implement Privileged Access Workstations (PAWs)**

- Dedicated machines for admin tasks
- No internet browsing, no email, no non-admin applications
- Physical or strongly virtualized separation

**Step 2: Deploy Just-in-Time (JIT) elevation where possible**

- Microsoft Entra PIM (formerly Azure AD PIM), AWS IAM Identity Center, or third-party PAM
- Maximum elevation duration: 4 hours
- Require approval for elevation into admin roles; treat any remaining standing admin role as an approved, documented exception

**Step 3: Password hygiene enforcement**

- Minimum 14 characters with no composition rules. NIST SP 800-63B advises against composition rules; the 14-character floor is this playbook's choice and is stricter than the 8-character minimum in Revision 3 of that guideline
- Audit against known-breached password lists
- Eliminate password rotation mandates unless compromise is suspected

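The length and breached-list checks in Step 3 can be sketched as below. This assumes the breached corpus is available locally as a set of uppercase SHA-1 hex digests; `check_password` and `sha1_hex` are illustrative names, not part of any product. SHA-1 is used here only to match the corpus format, never to store passwords.

```python
import hashlib

MIN_LENGTH = 14  # the playbook's minimum length


def sha1_hex(password: str) -> str:
    """Uppercase SHA-1 hex digest, used only to look up the breached corpus."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def check_password(password: str, breached_hashes: set) -> list:
    """Return a list of policy problems; an empty list means the password passes."""
    problems = []
    if len(password) < MIN_LENGTH:
        problems.append(f"shorter than {MIN_LENGTH} characters")
    if sha1_hex(password) in breached_hashes:
        problems.append("appears in breached-password list")
    return problems
```

Note that no character-class rule appears: in line with the step above, length and breach exposure are the only criteria.
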
### Week 3-4: MFA and Conditional Access

- Enforce MFA on all remote access: VPN, cloud admin, RDP gateways
- Implement risk-based conditional access:
  - Unmanaged device → require MFA + compliant device
  - Impossible travel → block or step-up
  - Legacy authentication → block entirely

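The three conditional access rules above amount to a small decision function. The sketch below is illustrative only: the `SignIn` fields and the returned action names are assumptions, and a real deployment would express these rules as policies in the identity provider rather than in code. The evaluation order (legacy authentication first) is also an assumption, chosen so that the most restrictive outcome wins.

```python
from dataclasses import dataclass


@dataclass
class SignIn:
    """Hypothetical sign-in signals as an identity provider might report them."""
    managed_device: bool
    impossible_travel: bool
    legacy_auth: bool


def decide(sign_in: SignIn) -> str:
    """Map sign-in signals to an action, most restrictive rule first."""
    if sign_in.legacy_auth:
        return "block"  # legacy protocols are blocked entirely
    if sign_in.impossible_travel:
        return "step_up"  # the playbook allows either block or step-up here
    if not sign_in.managed_device:
        return "require_mfa_and_compliant_device"
    return "allow"  # baseline MFA on remote access still applies
```
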
### Common Pitfalls

- **Over-scoping**: Do not attempt to fix every identity in 30 days. Focus on privileged and external first.
- **Breaking automation**: Service account password rotations can break CI/CD. Coordinate with application owners. Test in non-production first.
- **Shadow IT identities**: SaaS platforms with standalone accounts (Slack, Zoom, etc.) are often missed. Use email domain scanning or CASB tools.

---

## Workstream: Perimeter and Visibility

### Objective

Know exactly what the organization looks like from the outside, and monitor every path that crosses the trust boundary.

### Week 1-2: External Attack Surface Mapping

**Step 1: Reconnaissance, passive first**

- Enumerate subdomains passively from certificate transparency logs and search engine dorks; DNS brute force is an active technique and falls under the written scanning authorization
- Identify exposed services: Shodan and Censys lookups (passive), then custom port scanning from external vantage points (active)
- Map cloud assets: public S3 buckets, open storage accounts, exposed databases

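Certificate transparency results usually contain duplicates, wildcards, and names outside the client's domain. A minimal sketch of the clean-up step follows, assuming each entry is a dict whose `name_value` field holds one or more newline-separated names (the shape of crt.sh JSON output, as far as this example assumes); the function name is illustrative.

```python
def subdomains_from_ct(entries: list, root: str) -> list:
    """Extract unique in-scope hostnames from certificate transparency entries."""
    root = root.lower().rstrip(".")
    found = set()
    for entry in entries:
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower().rstrip(".")
            if name.startswith("*."):
                name = name[2:]  # keep the parent of a wildcard certificate
            # Require an exact match or a dot boundary so that
            # "evil-example.com" is not mistaken for a subdomain.
            if name == root or name.endswith("." + root):
                found.add(name)
    return sorted(found)
```

The resulting list is an input to ownership confirmation with the client, not proof that a host is live or in scope.
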
**Step 2: Active validation**

- Confirm ownership of discovered assets with client
- Test for default credentials on exposed management interfaces
- Document findings with risk ratings: P0 (immediate), P1 (urgent), P2 (planned)

### Week 2-3: Internal Visibility

**Step 1: Deploy endpoint detection**

- Microsoft Defender for Endpoint, CrowdStrike, SentinelOne, or equivalent
- Target: 100% of managed Windows, macOS, Linux endpoints
- Validate: can you see process execution, network connections, and file modifications?

**Step 2: Network monitoring**

- Deploy sensors at:
  - Internet boundary
  - Internal network segments (especially IT/OT boundaries)
  - Critical server VLANs
- Enable DNS query logging and analysis

**Step 3: Log aggregation**

- Centralize logs from: identity systems, endpoints, firewalls, cloud control planes, critical applications
- Minimum retention: 90 days hot, 1 year cold
- Ensure tamper protection: attackers routinely delete or alter logs to cover their tracks

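One way to make log tampering detectable is a hash chain, where each record's digest also covers the previous digest. The sketch below illustrates that idea only; it is an assumption about how tamper evidence could be added, it complements rather than replaces write-once storage, and the function names are illustrative.

```python
import hashlib

GENESIS = "0" * 64  # fixed starting value for the chain


def chain_logs(lines: list) -> list:
    """Pair each log line with a SHA-256 digest covering the line and the previous digest."""
    chained, prev = [], GENESIS
    for line in lines:
        prev = hashlib.sha256((prev + line).encode("utf-8")).hexdigest()
        chained.append((line, prev))
    return chained


def verify_chain(chained: list) -> bool:
    """Recompute the chain; any edited, removed, or reordered record breaks it."""
    prev = GENESIS
    for line, digest in chained:
        prev = hashlib.sha256((prev + line).encode("utf-8")).hexdigest()
        if prev != digest:
            return False
    return True
```

Truncating records from the end of the chain is not detected by this check alone, so the latest digest would need to be stored somewhere the attacker cannot reach.
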
### Week 3-4: CMDB Seeding

- Populate CMDB with T0 and T1 assets first
- For each asset: owner, criticality, dependencies, recovery requirements
- Accept imperfection. A partially correct CMDB is infinitely better than no CMDB.

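A seed record needs only the four attributes listed above. The following sketch assumes a flat record per asset and tier labels `"T0"` and `"T1"`; the `Asset` class and its helper names are hypothetical, and the same shape works equally well as spreadsheet columns.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Hypothetical CMDB seed record for one asset."""
    name: str
    tier: str  # "T0" or "T1" for the initial seed
    owner: str = ""
    criticality: str = ""
    dependencies: list = field(default_factory=list)
    recovery_requirements: str = ""

    def missing_fields(self) -> list:
        """List attributes still to be filled in; incomplete records are kept, not rejected."""
        required = {
            "owner": self.owner,
            "criticality": self.criticality,
            "recovery_requirements": self.recovery_requirements,
        }
        return [key for key, value in required.items() if not value]


def seed_order(assets: list) -> list:
    """Sort T0 assets ahead of T1; order within a tier is preserved."""
    return sorted(assets, key=lambda asset: asset.tier)
```

`dependencies` is deliberately excluded from `missing_fields`, since an empty list can be a correct answer.
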
### Common Pitfalls

- **Scanning without authorization**: Ensure written approval for active scanning. Some jurisdictions treat unauthorized scanning as criminal.
- **Alert fatigue**: Do not enable every detection rule on day one. Start with high-confidence, high-impact alerts. Tune before expanding.
- **Log storage costs**: Centralized logging is expensive. Prioritize critical systems. Use tiered storage.

---

## Workstream: AI Sovereignty

### Objective

Convert intelligence from a rented commodity into an owned, protected, T0-class asset.

### Week 1-2: AI Usage Discovery

**Step 1: Survey**

- Interview department heads: engineering, legal, marketing, operations, finance
- Ask: "What AI tools are you using? What data are you putting into them?"
- As a planning assumption, expect 30-50% of AI usage to be unsanctioned. Employees use personal ChatGPT accounts, browser extensions, and mobile apps.

**Step 2: Technical discovery**

- Review proxy logs for AI API traffic: OpenAI, Anthropic, Google, Azure OpenAI
- Review SaaS billing for AI-enabled tools
- Review browser extensions and endpoint software inventories

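The proxy log review can start as a simple count of requests per user and vendor. The sketch below assumes the proxy export has already been reduced to `(user, destination_host)` pairs; the hostname list covers only the four vendors named above, is not exhaustive, and should be treated as a starting point to extend.

```python
from collections import Counter
from typing import Optional

# Known API hostnames (or hostname suffixes) for the vendors named in Step 2.
AI_HOST_SUFFIXES = {
    "api.openai.com": "OpenAI",
    "api.anthropic.com": "Anthropic",
    "generativelanguage.googleapis.com": "Google",
    "openai.azure.com": "Azure OpenAI",  # resources appear as <name>.openai.azure.com
}


def classify_host(host: str) -> Optional[str]:
    """Return the vendor for a destination host, or None if it is not a listed AI API."""
    host = host.lower().rstrip(".")
    for suffix, vendor in AI_HOST_SUFFIXES.items():
        if host == suffix or host.endswith("." + suffix):
            return vendor
    return None


def ai_usage(records: list) -> Counter:
    """Count AI API requests per (user, vendor) from (user, destination_host) pairs."""
    counts = Counter()
    for user, host in records:
        vendor = classify_host(host)
        if vendor:
            counts[(user, vendor)] += 1
    return counts
```

This only sees API traffic that crosses the proxy; browser-based chat interfaces and mobile apps need the survey and the endpoint inventory to surface.
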
**Step 3: Data flow mapping**

For each discovered AI tool, document:

- Data types entering the tool
- Data residency and processing location
- Vendor terms: training use, retention, deletion, subprocessing
- Regulatory implications: GDPR, DORA, NIS2, industry-specific

### Week 3-4: Local AI Infrastructure

**Step 1: Select hardware or sovereign cloud**

| Option | When to Use |
|--------|-------------|
| On-premises GPU servers | High volume, strict air-gap, existing data centre capacity |
| Sovereign cloud (EU, national) | Regulatory requirements, no on-premises GPU expertise |
| Edge inference nodes | Distributed organization, OT environments, low-latency requirements |

**Step 2: Select initial model**

For most organizations, start with:

- **Base model**: Llama 3, Mistral, or Qwen (7B-13B parameters, quantized to 4-bit)
- **Deployment**: Ollama, vLLM, or llama.cpp for inference
- **Orchestration**: LangChain or custom RAG pipeline for proprietary data integration
- **Fine-tuning**: QLoRA for domain adaptation on proprietary datasets

**Step 3: Deploy with T0 controls**

- Network segmentation: inference hosts have no direct internet egress
- Access control: model weights encrypted at rest; access requires multi-party approval
- Audit: log all prompts, responses, and model access
- Backup: immutable backups of weights, configurations, and vector databases

### Week 5-8: Pilot and Measure

Select one high-value, low-risk workflow:

| Workflow | Why It Works |
|----------|-------------|
| Internal code review assistant | Proprietary code never leaves perimeter; measurable quality improvement |
| Security log analysis | Feeds defensive AI directly; reduces analyst workload |
| Policy / compliance document drafting | High volume, repetitive, proprietary domain knowledge |
| Customer support triage | Reduces response time; training data is historical tickets |

**Measurement criteria**:

- Accuracy vs. cloud baseline (human-evaluated on a sample)
- Cost per inference (compute + personnel)
- Data leakage incidents: zero
- User satisfaction: qualitative survey

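The quantitative criteria above reduce to simple arithmetic. The sketch below shows one way to compute them; the function names and the input shape are assumptions, and user satisfaction is left out because it is qualitative.

```python
def cost_per_inference(compute_cost: float, personnel_cost: float, inferences: int) -> float:
    """Total pilot cost (compute + personnel) divided by the number of inferences served."""
    if inferences <= 0:
        raise ValueError("inferences must be positive")
    return (compute_cost + personnel_cost) / inferences


def pilot_summary(local_correct: int, cloud_correct: int, sample_size: int,
                  compute_cost: float, personnel_cost: float,
                  inferences: int, leakage_incidents: int) -> dict:
    """Summarize a pilot against the measurement criteria."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return {
        # Negative delta means the local model scored below the cloud baseline
        # on the same human-evaluated sample.
        "accuracy_delta": (local_correct - cloud_correct) / sample_size,
        "cost_per_inference": cost_per_inference(compute_cost, personnel_cost, inferences),
        "leakage_ok": leakage_incidents == 0,
    }
```

How large a negative accuracy delta is acceptable is a business decision to agree before the pilot starts; this sketch deliberately sets no threshold.
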
### Common Pitfalls

- **Over-engineering the first deployment**: Do not build a full MLOps platform for the pilot. Start simple. Prove value. Then scale.
- **Ignoring GPU availability**: GPU procurement can take months. Have a cloud fallback for the pilot if on-premises hardware is delayed.
- **Neglecting prompt injection**: Local models are not immune to adversarial prompts. Implement input validation and output filtering.
- **Forgetting the human loop**: AI augments decisions; it does not replace accountability. Design workflows where humans retain final authority.

---

## Workstream: Resilience and Recovery

### Objective

Ensure that when, not if, a critical system fails, recovery is fast, tested, and deterministic.

### Week 1-4: Backup Validation

**Step 1: Inventory backup coverage**

- For every T0 and T1 asset: what is backed up, how often, where, by what mechanism
- Identify gaps: databases without point-in-time recovery, VMs without application-consistent snapshots

**Step 2: Test restoration**

- Select one critical system per week
- Perform full restoration to an isolated environment
- Document: time to restore, data loss window, manual steps required, blockers encountered

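Recording each restoration test in a consistent shape makes results comparable week to week. The sketch below captures the four items listed above; the `RestoreTest` class is hypothetical, and the RTO and RPO targets passed to `meets` are an assumption (the playbook does not set them; they would come from the recovery requirements agreed with the client).

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class RestoreTest:
    """Hypothetical record of one restoration test."""
    system: str
    started: datetime
    finished: datetime
    last_good_backup: datetime  # timestamp of the backup that was restored
    failure_time: datetime      # simulated moment of failure
    manual_steps: list = field(default_factory=list)
    blockers: list = field(default_factory=list)

    @property
    def time_to_restore(self) -> timedelta:
        return self.finished - self.started

    @property
    def data_loss_window(self) -> timedelta:
        return self.failure_time - self.last_good_backup

    def meets(self, rto: timedelta, rpo: timedelta) -> bool:
        """Pass only if there were no blockers and both targets were met."""
        return (not self.blockers
                and self.time_to_restore <= rto
                and self.data_loss_window <= rpo)
```

Manual steps do not fail a test on their own, but a growing list of them is a signal for where automation effort should go.
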
**Step 3: Fix what breaks**

- If a backup cannot be restored, the backup does not exist
- Update procedures, fix tooling, re-test

### Month 2-3: Recovery Automation

- Automate the most common recovery scenarios: VM restore, database point-in-time recovery, Active Directory forest recovery
- Document runbooks for scenarios that cannot be fully automated
- Train multiple team members on each runbook

### Month 3-6: Chaos Engineering

**Step 1: Game days**

- Scheduled, announced simulations of failure scenarios
- Example: simulate domain controller failure during business hours
- Measure: detection time, escalation time, resolution time, communication quality

**Step 2: Chaos experiments**

- Unannounced, bounded experiments in non-production
- Example: terminate API service instances, block DNS resolution, fill disk space
- Validate: auto-scaling, alerting, runbook accuracy

**Step 3: Production chaos**

- Only after months of successful game days and non-production experiments
- Start with low-risk failures: single instance termination, network latency injection
- Always have automated rollback and a human kill switch

### Common Pitfalls

- **Assuming backups work**: Untested backups are prayers, not plans.
- **Recovery without validation**: A restored system that cannot authenticate users or connect to databases is not recovered.
- **Chaos without guardrails**: Never run chaos experiments when the organization is already under stress (active incident, change freeze, key personnel on leave).

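The guardrails in this workstream (no experiments under organizational stress, and always an automated rollback plus a human kill switch) can be enforced as a pre-flight check. This is a minimal sketch with hypothetical names; how the stress signals are actually sourced (incident tracker, change calendar, rota) is left open.

```python
from dataclasses import dataclass


@dataclass
class OrgState:
    """Hypothetical snapshot of the stress conditions named in the pitfalls above."""
    active_incident: bool = False
    change_freeze: bool = False
    key_personnel_on_leave: bool = False


def chaos_blockers(state: OrgState, has_rollback: bool, has_kill_switch: bool) -> list:
    """Return the reasons an experiment must not run; an empty list means clear to proceed."""
    blockers = []
    if state.active_incident:
        blockers.append("active incident")
    if state.change_freeze:
        blockers.append("change freeze")
    if state.key_personnel_on_leave:
        blockers.append("key personnel on leave")
    if not has_rollback:
        blockers.append("no automated rollback")
    if not has_kill_switch:
        blockers.append("no human kill switch")
    return blockers
```
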
---

## Workstream: Culture and Governance

### Objective

Embed antifragile principles into decision-making, hiring, and organizational habits.

### Tactics

**Blameless Post-Mortems**

- Within 48 hours of a significant incident
- Focus: what about the system allowed this mistake? Not: who made the mistake?
- Mandatory output: at least one structural change (policy, architecture, or procedure)
- Publish internally: transparency builds trust and disseminates learning

**Security Champions Program**

- Identify one volunteer per team who acts as security liaison
- Monthly 1-hour meeting: new threats, policy changes, team-specific concerns
- Champions feed team context up and security guidance down

**Red Team as a Service**

- Monthly or quarterly adversarial simulations
- Report to CISO and board, not just IT
- Measure: time to detect, time to contain, time to evict
- Trend over time: the organization should get faster, not just more compliant

**Antifragile Metrics Review**

- Monthly steering committee reviews:
  - Mean time to structural fix (from incident)
  - Number of chaos experiments run and lessons learned
  - % of vendor dependencies with documented exit plan
  - AI sovereignty maturity score

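Two of the steering committee metrics are straightforward to compute from data the engagement already collects. The sketch below assumes incidents are recorded as `(occurred, structural_fix_deployed)` date pairs and vendors as a name-to-boolean mapping; both shapes and the function names are illustrative.

```python
from datetime import date
from statistics import mean


def mean_days_to_structural_fix(incidents: list) -> float:
    """Average days from incident to deployed structural fix.

    Each item is an (occurred, fixed) pair of dates. Raises StatisticsError
    when the list is empty, since the metric is undefined without incidents.
    """
    return mean((fixed - occurred).days for occurred, fixed in incidents)


def exit_plan_coverage(vendors: dict) -> float:
    """Percentage of vendor dependencies that have a documented exit plan."""
    if not vendors:
        return 0.0
    return 100 * sum(1 for has_plan in vendors.values() if has_plan) / len(vendors)
```

Incidents that have no structural fix yet are excluded by construction, so the count of such open items is worth reporting alongside the mean.
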
### Common Pitfalls

- **Post-mortems without action**: If findings are not tracked to completion, they become theater.
- **Security champions without authority**: Champions need time allocation and executive backing, or they become scapegoats.
- **Metrics without narrative**: Numbers alone do not persuade boards. Pair metrics with stories: "Here is what we learned, here is what we changed, here is why we are safer."

---

## Common Failure Modes

| Failure Mode | Symptom | Remedy |
|-------------|---------|--------|
| **Scope creep** | 30-day phase stretches to 90 days | Time-box ruthlessly. Document deferred items for next phase. |
| **Tool obsession** | Team debates SIEM vendor for 3 weeks | Pick the good-enough tool. Implementation beats selection. |
| **Perfectionism** | CMDB project stalls waiting for completeness | Seed with critical assets. Expand iteratively. |
| **Vendor capture** | Recommendations always favor one provider | Disclose relationships. Maintain independence. Document alternatives. |
| **Executive fatigue** | Board stops attending updates | Lead with business risk, not technical detail. Show cost of inaction. |
| **Operational resistance** | IT refuses to disable legacy accounts | Use the "get out of jail free" letter. Escalate to executive sponsor. |
| **Pilot purgatory** | Local AI pilot runs forever without production migration | Define hard success criteria and production migration date before starting. |

---

## Tools and Templates

### Templates Included in This Repository

- [T0 Asset Classification Worksheet](../core/t0-asset-framework.md#t0-classification-worksheet)
- AI Usage Discovery Interview Guide (see [Workstream: AI Sovereignty](#workstream-ai-sovereignty))
- Blameless Post-Mortem Template (to be added)
- Chaos Experiment Planning Template (to be added)
- Vendor Exit Architecture Template (to be added)

### Recommended External Tools

| Category | Options | Notes |
|----------|---------|-------|
| Endpoint Detection | Microsoft Defender for Endpoint, CrowdStrike, SentinelOne | Choose based on existing Microsoft footprint |
| SIEM / Log Analysis | Microsoft Sentinel, Splunk, Elastic, Wazuh | Wazuh is open-source and sufficient for many environments |
| Identity Governance | Microsoft Entra ID (formerly Azure AD), Okta, Saviynt | Match to primary cloud identity provider |
| PAM / Vault | CyberArk, Delinea, HashiCorp Vault | Essential for service account and secret management |
| CMDB | ServiceNow, Device42, GLPI, or spreadsheet | Any CMDB is better than no CMDB |
| Local AI Inference | Ollama, vLLM, llama.cpp, TGI | Start simple; scale to TGI or vLLM for production load |
| Chaos Engineering | Gremlin, Chaos Mesh, custom scripts | Gremlin for enterprise; Chaos Mesh for Kubernetes |

---

*This playbook is a living document. Update it with lessons from every engagement.*

*Previous: [Rapid Modernisation Plan](rapid-modernisation-plan.md)*