Claude Sonnet 4.6 bcebf8ebb3 feat: Add critical infrastructure adaptation for Rule 5 (greenfield)

move-fast-and-fix-things.md: 'The Critical Infrastructure Adaptation'
section in Rule 5. OT/NT environments where full greenfield is impossible.
Five-layer adapted stack: IT greenfield protects OT, OT config as code,
manual operation as fallback, compartmentalisation as partial burn,
long-cycle planned refresh. OT greenfield test with 4h/48h/2w targets.

vertical-power-utilities.md: New 'The Controlled Burn Adaptation' section.
Full treatment of when greenfield is not an option. Five-layer OT-adapted
stack. Explicit acceptance statement framework for genuinely irreplaceable
OT components (name, isolate, monitor, plan replacement). The OT greenfield
test. Reference back to Rule 5.

Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>

2026-06-05 06:58:07 +00:00

Vertical Reference: Power and Utilities

"The grid does not care about your quarterly targets. It cares whether you understood the boundary between IT and operations before the adversary did."

This document adapts the antifragile rapid modernisation approach for power generation, transmission, distribution, and water utilities. These organizations operate industrial control systems (ICS/SCADA) where safety and availability are paramount, regulatory oversight is intense, and the convergence of IT and OT creates existential attack surfaces.

The Power and Utility Context

What Makes This Sector Different

Factor	Enterprise Default	Power/Utility Reality
Downtime tolerance	Hours	Seconds to minutes (protection systems); hours for generation
Safety impact	Data loss, financial harm	Physical harm, loss of life, environmental catastrophe
System lifetime	3-5 years	20-40 years (generation, transmission, protection relays)
Regulatory driver	GDPR, industry standards	NIS2, CER, IEC 62351, NERC CIP (North America), national energy regulators
OT/IT boundary	Often porous or nonexistent	Legally and physically mandated; convergence is the primary risk
Supply chain	Moderate depth	Extreme (multi-vendor, multi-national, obsolete equipment)
Remote access	Common, convenient	Heavily restricted; often requires physical presence or dedicated lines

The IT/OT Convergence Problem

Power utilities historically operated OT networks (SCADA, EMS, DMS, protection relays) as air-gapped systems. Over the past two decades, convergence has introduced:

Remote diagnostics over internet-connected VPNs
Centralized patch management through IT SCCM/WSUS
Business intelligence systems reading OT historian data
Vendor remote support terminals in control centers
Smart grid and Advanced Metering Infrastructure (AMI) connecting customer-facing IT to grid operations

Every convergence point is a potential bridge for adversaries from IT to OT.

The executive framing:

"Your control room does not need email. Your protection relays do not need internet access. Every connection between your IT network and your operational technology is a connection an adversary can cross. We are not adding bureaucracy. We are re-establishing the boundary that keeps the lights on."

Regulatory Landscape

EU NIS2 Directive (2023)

Power utilities and water suppliers are classified as essential entities under NIS2.

NIS2 Requirement	Power/Utility Application
Risk management measures	Kill chain analysis for IT→OT bridges; physical security assessment
Supply chain security	Vendor access inventory for all OT equipment; firmware provenance tracking
Incident reporting (24h early warning, 72h notification)	Automated detection and reporting to national CSIRT and energy regulator
Business continuity	Black start capability; grid islanding procedures; manual override validation
Cryptography	Encrypted communications for all IT/OT integration points
MFA	Hardware tokens for all remote access to OT or critical IT systems
Vulnerability handling	Risk-based prioritization with safety impact assessment
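
The two reporting deadlines in the table above (24 hours and 72 hours) can be turned into concrete timestamps for an incident runbook. The following is a minimal sketch, assuming the clock starts when the organisation becomes aware of a significant incident. The function name and dictionary keys are illustrative and do not come from the directive or from any tool.

```python
from datetime import datetime, timedelta, timezone


def nis2_reporting_deadlines(aware_at):
    """Return the two reporting deadlines for an incident.

    Assumption: both windows are counted from the moment of awareness.
    """
    return {
        "early_warning": aware_at + timedelta(hours=24),
        "incident_notification": aware_at + timedelta(hours=72),
    }


# Hypothetical incident timestamp, in UTC to avoid daylight-saving surprises.
aware = datetime(2026, 6, 5, 6, 58, tzinfo=timezone.utc)
deadlines = nis2_reporting_deadlines(aware)
for stage, due in deadlines.items():
    print(f"{stage}: due {due.isoformat()}")
```

Keeping the timestamps timezone-aware matters in practice: a control centre and a national CSIRT may not share a local time, and the deadline should not depend on which one reads the clock.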

CER Directive (Critical Entities Resilience)

Requires power utilities to demonstrate resilience against:

Natural disasters
Cyberattacks
Supply chain disruptions
Pandemics and workforce unavailability

Antifragile application: Chaos engineering for non-safety systems; cross-training for manual procedures; distributed spare parts inventory.

Sector-Specific Standards

Standard	Scope
IEC 62351	Power systems cybersecurity: communications protocols, authentication, encryption
IEC 61850	Substation communication (GOOSE, SV); security extensions for IEC 61850-90-20
NERC CIP	North American electric reliability; mandatory standards with heavy penalties
ENTSO-E Cybersecurity Guidance	European transmission system operator requirements
BDEW Whitepaper	German energy sector cybersecurity best practices

The Antifragile Posture for Power and Utilities

Pillar 1: Structural Decoupling — The IT/OT Firewall

Principle: IT and OT must be decoupled to the maximum extent compatible with operational requirements. The air gap is the default. Any bridge must be justified, documented, and monitored.

Antifragile Moves:

Action	Implementation	Priority
Network segmentation	Physically separate IT and OT; unidirectional gateway or data diode for OT→IT data flows	P0
No AD trust to OT	OT AD (if any) must be a separate forest with one-way trust or no trust	P0
Jump host architecture	All IT-to-OT access via hardened, monitored jump hosts with session recording	P1
Vendor access airlock	Vendor VPNs terminate in dedicated DMZ; no direct OT access; remote hands or on-site escort for OT	P1
Remove internet from OT	OT VLANs have no direct internet egress; updates via offline media or controlled proxy	P0
AMI / Smart Grid isolation	Advanced Metering Infrastructure on dedicated network; no direct path to SCADA or EMS	P1
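
The segmentation rules in the table amount to a reachability property: no chain of permitted network flows may lead from the internet or the AMI network into SCADA. A rough sketch of how that property could be checked follows. The zone names and the flow map are hypothetical; a real model would be generated from firewall rules and the IT/OT connection matrix.

```python
from collections import deque

# Hypothetical zones and permitted one-way flows (source -> destinations).
ALLOWED_FLOWS = {
    "internet": {"corporate_it"},
    "corporate_it": {"it_ot_dmz"},
    "scada": {"it_ot_dmz"},        # OT data leaves through the DMZ only
    "ami": {"ami_head_end"},
    "ami_head_end": {"corporate_it"},
    "vendor_dmz": set(),           # vendor VPNs terminate here
    "it_ot_dmz": set(),
}


def reachable(flows, start):
    """Return every zone reachable from start by following permitted flows."""
    seen, queue = {start}, deque([start])
    while queue:
        zone = queue.popleft()
        for nxt in flows.get(zone, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


def forbidden_paths(flows, sources, protected):
    """List (source, protected_zone) pairs that violate the segmentation rules."""
    return sorted(
        (src, dst) for src in sources for dst in reachable(flows, src) & protected
    )


violations = forbidden_paths(ALLOWED_FLOWS, ["internet", "ami"], {"scada"})
print(violations or "no path from internet or AMI into SCADA")
```

The check is transitive on purpose. A single "harmless" rule from corporate IT into SCADA would make SCADA reachable from both the internet and AMI, even though neither has a direct rule of its own.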

Pillar 2: Optionality Preservation — Vendor and Technology Independence

Principle: Power utilities depend on vendors for SCADA, protection relays, turbine control, and substation automation. This dependency must not become a single point of failure.

Antifragile Moves:

Multi-vendor strategy for critical systems: No single vendor should control >50% of protection, control, or monitoring functions
Spare parts inventory: Maintain critical spares for legacy OT equipment that vendors no longer support
Firmware escrow and provenance: Require vendors to deposit firmware; verify cryptographic signatures before deployment
Local competence: Train internal staff to operate and maintain systems without vendor support for 30 days
Protocol independence: Where possible, support multiple communication protocols to avoid single-vendor lock-in
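
The 50% threshold in the multi-vendor rule can be checked mechanically against an asset inventory. The sketch below is illustrative only: the vendors and counts are invented, and counting deployed systems is an assumed proxy for how much of a function a vendor "controls".

```python
from collections import Counter

# Hypothetical inventory: one (function, vendor) pair per deployed system.
INVENTORY = [
    ("protection", "VendorA"), ("protection", "VendorA"),
    ("protection", "VendorB"), ("protection", "VendorC"),
    ("control", "VendorA"), ("control", "VendorA"),
    ("control", "VendorA"), ("control", "VendorB"),
    ("monitoring", "VendorB"), ("monitoring", "VendorC"),
]


def concentration_breaches(inventory, threshold=0.5):
    """Return {function: (vendor, share)} where one vendor exceeds the threshold."""
    per_function = {}
    for function, vendor in inventory:
        per_function.setdefault(function, Counter())[vendor] += 1

    breaches = {}
    for function, counts in per_function.items():
        vendor, count = counts.most_common(1)[0]
        share = count / sum(counts.values())
        if share > threshold:  # strictly greater, matching ">50%"
            breaches[function] = (vendor, share)
    return breaches


breaches = concentration_breaches(INVENTORY)
print(breaches)  # → {'control': ('VendorA', 0.75)}
```

In this invented inventory VendorA holds exactly half of protection, which does not breach a strict ">50%" rule, and three quarters of control, which does.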

Pillar 3: Stress-to-Signal Conversion — OT Incident Learning

Principle: OT incidents are rare but high-impact. The organization must learn from every anomaly, near-miss, and exercise.

Antifragile Moves:

OT security operations centre (SOC) integration: Feed OT alarms into the SOC with analysts trained on industrial protocols
Monthly tabletop exercises: Simulate OT-specific scenarios (compromised EMS, rogue protection relay settings, ransomware on engineering workstations)
Post-incident structural mandate: Every OT incident or near-miss must produce at least one architectural or procedural change
Red team with bounded OT scope: Annual exercise including OT reconnaissance, constrained by safety requirements

Pillar 4: Sovereign Intelligence — Local AI for the Grid

Principle: Grid data is among the most sensitive an organization possesses. It reveals generation capacity, topology, switching patterns, load profiles, and operational routines.

Antifragile Moves:

Local AI for OT anomaly detection: Analyze historian data, DCS logs, and protection relay events without cloud exfiltration
Closed-loop digital twin: Train models on local OT data to predict equipment failures; never export raw telemetry
Air-gapped AI inference: Deploy inference nodes in OT DMZ with no return path to IT or internet
Load forecasting sovereignty: Local models for demand prediction using proprietary grid data
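
As a point of reference for what "local" can mean at its simplest, the sketch below flags anomalies in a historian series using a rolling z-score. This is a plain statistical baseline swapped in for illustration, not a trained model and not the method of any particular product. The readings, window size, and threshold are assumptions. It uses only the Python standard library, so nothing leaves the machine it runs on.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=12, threshold=3.0):
    """Return indices whose value deviates more than `threshold` standard
    deviations from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical bus voltage readings in kV, with one sharp dip at index 13.
readings = [
    110.0, 110.2, 109.9, 110.1, 110.0, 109.8, 110.1,
    110.2, 109.9, 110.0, 110.1, 109.9, 110.0, 104.5, 110.1,
]
anomalies = rolling_zscore_anomalies(readings)
print(anomalies)  # → [13]
```

A baseline like this is also useful for false positive tuning: any richer model deployed in the OT DMZ should at least outperform it on the same local data.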

The executive framing:

"Your grid data tells an adversary exactly when and where to strike. It tells a competitor your capacity constraints. Sending it to a cloud AI for 'optimization' is not a technology decision. It is a national security and competitive intelligence decision. Local models on local hardware. Full stop."

Pillar 5: Asymmetric Payoff — Resilience Over Prevention

Principle: In power utilities, perfect prevention is impossible. The goal is to survive an attack and recover faster than the adversary can exploit the disruption.

Antifragile Moves:

Black start capability: Maintain the ability to restart the grid from shutdown without external power
Grid islanding: Design systems so that sections can disconnect and operate independently during disturbances
Manual override procedures: Every automated system must have a documented, tested manual procedure
Redundant communication paths: Power line carrier, microwave, satellite backup for SCADA and protection communications
Protection relay independence: Electromechanical or static relays as backup for digital relays in critical paths

The Rapid Modernisation Plan: Power/Utility Variant

Phase 1: Hygiene (Days 0-30)

In addition to standard hygiene:

Action	Owner	Deliverable
Inventory all OT assets: DCS, SCADA, EMS, protection relays, RTUs, AMI	OT Security / Engineering	OT asset inventory with vendor and firmware versions
Map all IT-to-OT network connections	Network / OT	Connection matrix with business justification per connection
Audit vendor remote access: who, how, when, for how long	OT Security / Procurement	Vendor access log and hardened policy
Identify OT systems with internet connectivity	Network	List with immediate remediation plan
Document manual override procedures for critical systems	OT Engineering	Procedure manual, signed off by operations and safety
Validate backup of EMS / DMS configurations	OT Engineering	Backup integrity test report

Phase 2: Control (Days 30-60)

Action	Owner	Deliverable
Implement network segmentation: IT/OT DMZ with unidirectional gateway	Network / OT	Segmentation architecture and validated firewall rules
Harden vendor access: time-bounded, session-recorded, MFA with hardware tokens	OT Security	Vendor access gateway operational
Enable OT logging: historian, DCS, firewall, protection relay events	OT Security	Centralized OT log aggregation (air-gapped SIEM or historian)
Patch OT systems: test in lab, deploy in maintenance windows	OT Engineering	Patch management procedure with safety gates
Secure engineering workstations (EWS): application whitelisting, no internet	OT Security	EWS hardening standard deployed
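
The vendor access requirements in the table (time-bounded, session-recorded, hardware-token MFA) combine into a single allow/deny decision. A minimal sketch of that decision follows. The `VendorGrant` structure and its field names are hypothetical and do not correspond to any specific access gateway.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class VendorGrant:
    vendor: str
    target: str
    starts_at: datetime
    duration: timedelta
    hardware_mfa_verified: bool
    session_recording_on: bool


def access_permitted(grant, now):
    """Allow access only inside the approved window, with MFA and recording."""
    in_window = grant.starts_at <= now < grant.starts_at + grant.duration
    return in_window and grant.hardware_mfa_verified and grant.session_recording_on


# Hypothetical two-hour maintenance window.
grant = VendorGrant(
    vendor="VendorA",
    target="ems-maintenance",
    starts_at=datetime(2026, 6, 5, 8, 0, tzinfo=timezone.utc),
    duration=timedelta(hours=2),
    hardware_mfa_verified=True,
    session_recording_on=True,
)

inside = access_permitted(grant, datetime(2026, 6, 5, 9, 0, tzinfo=timezone.utc))
expired = access_permitted(grant, datetime(2026, 6, 5, 10, 30, tzinfo=timezone.utc))
unrecorded = access_permitted(
    replace(grant, session_recording_on=False),
    datetime(2026, 6, 5, 9, 0, tzinfo=timezone.utc),
)
print(inside, expired, unrecorded)  # → True False False
```

The design choice worth noting is that every condition is required at once. A valid window without recording, or recording without a verified token, is a denial rather than a warning.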

Phase 3: Sovereignty (Days 60-90)

Action	Owner	Deliverable
Deploy local AI for OT anomaly detection pilot	AI / OT Security	OT anomaly detection with false positive tuning
Validate black start / islanding procedures	Operations	Test report with time-to-recovery metrics
Conduct OT-specific tabletop exercise	Security / Operations	Exercise report with structural improvements
Implement firmware integrity monitoring	OT Security	Baseline hashes for critical OT firmware
Test protection relay fail-over to electromechanical backup	Engineering	Fail-over test report
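
The "baseline hashes for critical OT firmware" deliverable can be sketched as two steps: record a SHA-256 digest per firmware file, then compare later states against that record. The example below is a sketch under stated assumptions: the file names and contents are invented, and it runs against a temporary directory so that it is self-contained. Hashes detect change but say nothing about provenance, so vendor signature verification remains a separate control.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Stream the file so large firmware images do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_baseline(firmware_dir):
    """Return {filename: sha256} for every file in the directory."""
    return {p.name: sha256_of(p) for p in sorted(firmware_dir.iterdir()) if p.is_file()}


def verify(firmware_dir, baseline):
    """Return {filename: status} for modified, missing, or unexpected files."""
    current = build_baseline(firmware_dir)
    findings = {}
    for name, expected in baseline.items():
        if name not in current:
            findings[name] = "missing"
        elif current[name] != expected:
            findings[name] = "modified"
    for name in current.keys() - baseline.keys():
        findings[name] = "unexpected"
    return findings


with tempfile.TemporaryDirectory() as tmp:
    fw = Path(tmp)
    (fw / "relay_a.bin").write_bytes(b"firmware image A")  # hypothetical images
    (fw / "rtu_b.bin").write_bytes(b"firmware image B")
    baseline = build_baseline(fw)
    clean = verify(fw, baseline)
    (fw / "relay_a.bin").write_bytes(b"tampered image")
    tampered = verify(fw, baseline)

print(clean, tampered)
```

The baseline itself becomes a sensitive artefact. If it is stored where an adversary can rewrite it, the comparison proves nothing, so it belongs in version control or offline storage alongside the OT configuration archives.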

Phase 4: Antifragility (Days 90-180)

Action	Owner	Deliverable
Annual red team with bounded OT scope	Security	Red team report with kill chain analysis
Chaos engineering on non-safety IT systems	Resilience	Monthly experiment schedule and findings
Vendor exit architecture for critical OT platforms	Procurement / Engineering	90-day vendor transition plan per critical system
Cross-training: operations staff on manual procedures	Operations	Training completion metrics
Participate in sector ISAC information sharing	Security	Threat intelligence integration report

Substation and Protection Specifics

IEC 61850 Security

IEC 61850 (substation communication) uses GOOSE and Sampled Values (SV) that were not designed with security in mind.

Hardening priorities:

IEC 61850-90-20: Implement cybersecurity recommendations for IEC 61850 networks
Authentication: Digitally sign GOOSE messages where IEDs support it
Network segmentation: GOOSE/SV traffic on dedicated VLAN; no routing to IT networks
IED hardening: Disable unused services; change default passwords; enable logging
Configuration management: Version control for SCL files; change detection for IED settings

Protection Relay Security

Protection relays are the safety-critical edge of the grid. Compromise can cause physical damage.

Control	Implementation
Access control	Vaulted credentials; multi-person approval for settings changes
Logging	All settings changes logged with before/after values
Integrity	Cryptographic checksums for firmware and settings files
Redundancy	Independent protection schemes (e.g., distance + differential)
Manual backup	Electromechanical or static relay backup for critical digital protections
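
Two of the controls in the table, logging changes with before/after values and multi-person approval, are straightforward to express in code. The sketch below is illustrative: the setting names are hypothetical, and the minimum of two independent approvers is an assumption, since the table only says "multi-person".

```python
def settings_changes(before, after):
    """Return one record per changed, added, or removed setting.

    A setting absent on one side is reported with None for that side.
    """
    records = []
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old != new:
            records.append({"setting": key, "before": old, "after": new})
    return records


def change_approved(approvers, requester, minimum=2):
    """Require at least `minimum` approvers, none of whom is the requester."""
    return len(set(approvers) - {requester}) >= minimum


# Hypothetical overcurrent relay settings.
before = {"pickup_current_a": 600, "time_dial": 0.5, "trip_enabled": True}
after = {"pickup_current_a": 900, "time_dial": 0.5, "trip_enabled": True}

changes = settings_changes(before, after)
approved = change_approved({"engineer_1", "engineer_2"}, requester="engineer_1")
print(changes)   # → [{'setting': 'pickup_current_a', 'before': 600, 'after': 900}]
print(approved)  # → False
```

The requester is excluded from the approver count deliberately. Without that rule, a single compromised engineering account could both propose and approve a change to a protection setting.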

Generation-Specific Considerations

Thermal / Nuclear / Hydro / Renewables

Generation Type	Specific Risk	Control
Thermal	Turbine control system compromise	Dedicated turbine control network; no IT connectivity
Nuclear	Safety system interference	Air-gapped safety systems; regulatory compliance with national nuclear authority
Hydro	Dam control / spillway gate manipulation	Physical controls for critical water management; redundant level sensors
Renewables	Inverter-based resource (IBR) vulnerability	Secure firmware updates; anti-islanding protection; grid support function validation

Distributed Energy Resources (DER)

Solar, wind, and battery inverters connect to the distribution grid with varying security maturity.

Action: DER interconnection standards must include cybersecurity requirements
Action: Monitor DER communications for anomalous commands or settings changes
Action: Aggregate DER visibility in DMS/ADMS without direct control paths

Water and Wastewater Utilities

Water utilities share many characteristics with power utilities but have additional concerns:

Concern	Application
Safety	Contamination prevention, pressure management, chemical dosing control
SCADA/OT	Treatment plant automation, distribution pump control, reservoir level management
Criticality	Water is life-sustaining; outages have immediate public health impact
Regulation	EPA (US), Drinking Water Inspectorate (UK), national health authorities

Additional controls for water utilities:

Physical security for treatment chemicals (chlorine, fluoride) to prevent intentional contamination
Redundant water quality sensors with cross-validation
Manual override capability for all automated chemical dosing systems
Isolation of IT from operational water quality monitoring
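
Cross-validation of redundant water quality sensors can be as simple as comparing each reading with the median of the group. The following is a minimal sketch of that idea, assuming three sensors measuring the same quantity. The sensor names, readings, and tolerance are invented for illustration.

```python
from statistics import median


def cross_validate(readings, tolerance):
    """Return (reference_value, suspects) using the median as the reference.

    A sensor is suspect when it differs from the median by more than
    `tolerance`. At least three sensors are needed for the median to
    outvote a single faulty or manipulated reading.
    """
    if len(readings) < 3:
        raise ValueError("cross-validation needs at least three sensors")
    reference = median(readings.values())
    suspects = sorted(
        name for name, value in readings.items() if abs(value - reference) > tolerance
    )
    return reference, suspects


# Hypothetical free-chlorine readings in mg/L from three redundant sensors.
value, suspects = cross_validate(
    {"cl2_a": 0.52, "cl2_b": 0.50, "cl2_c": 1.90}, tolerance=0.10
)
print(value, suspects)  # → 0.52 ['cl2_c']
```

The median is used rather than the mean because one wildly wrong reading drags a mean towards itself, while a median ignores it as long as the majority of sensors agree.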

M365 in Power and Utilities

Corporate IT in power utilities commonly runs on M365, which must be strictly separated from OT.

Consideration	Power/Utility Requirement
Data residency	M365 data in EU/national datacenters; verify tenant location
Conditional access	Block M365 access from non-corporate devices for privileged users; geo-restrict admin access
Guest access	Strictly prohibit in OT-connected tenants; heavily vet in corporate tenant
Teams / SharePoint	Never used for OT document sharing or control room communication
Mobile device management	Field engineer tablets Intune-managed; restricted app installation
Email security	EOP baseline minimum; Defender for Office 365 P2 recommended for critical infrastructure

See M365 E3 Hardening for the tactical baseline, then apply the overlays above.

The Controlled Burn Adaptation: When Greenfield Is Not an Option

The antifragile framework holds that organisations should build toward the ability to deploy greenfield — rebuild from scratch, on clean infrastructure, from version-controlled configuration. This is the ultimate expression of structural decoupling: if you can rebuild the environment, no adversary and no vendor holds you hostage.

Power utilities, water suppliers, and telecom network operators frequently view this principle as inapplicable. The grid does not go dark for a rebuild exercise. Protection relays cannot be factory-reset during a fault. OT systems operate under safety cases that require regulatory approval for any configuration change. The controlled burn, taken literally, cannot happen.

This is correct. It is also not the end of the conversation.

The goal of greenfield capability is to eliminate inherited compromise and return to a known-good operational state. For IT environments, the method is rebuild. For OT/NT environments, the method is different — but the goal is identical, and it is achievable. The absence of a literal rebuild path does not justify the absence of a recovery plan.

The OT-Adapted Greenfield Stack

Layer 1: IT greenfield protects OT. The corporate IT environment, M365 tenant, SCADA servers, historian, engineering workstations, and HMI layer can almost always be made greenfield-capable even when OT hardware cannot. An adversary who compromises the IT layer and finds a clean rebuild path loses their persistence and pivot path without a single OT device being touched. IT greenfield is the outer perimeter of an OT environment that cannot be rebuilt itself. This is the first investment.

Layer 2: OT configuration as code. PLC logic, IED settings files, protection relay configuration archives, SCADA database snapshots, DCS export files — all of these belong in version-controlled backups with integrity verification. The ability to restore a known-good configuration to existing hardware is the OT equivalent of greenfield: the hardware remains, but the software state is wiped and rebuilt from a verified baseline. This is not a backup exercise. It is a discipline — with the same rigour that ASTRAL applies to M365 configuration, applied to OT configuration archives. Every piece of OT configuration that exists only in the device and nowhere else is a single point of failure.

Layer 3: Manual operation as the fallback layer. The ability to operate critical systems without the automation layer is, in practice, the ability to drop the compromised layer and continue service. A power utility that can maintain 70–80% of service from manual procedures during a SCADA compromise has a fundamentally different risk profile than one that cannot. Manual override procedures must be:

Documented in detail, not just referenced in an emergency plan
Tested under realistic conditions, not just reviewed in a tabletop
Known by currently assigned operations staff, not just veterans who may have left
Validated at least annually — capability that is not practised does not exist when it is needed

Layer 4: Compartmentalisation as partial burn. OT environments are typically sectionable. Grid islanding, substation isolation, plant-level control separation, and control centre failover allow the operator to sacrifice and rebuild one section while maintaining critical service in others. This is the OT equivalent of the controlled burn: localised rather than total, sequential rather than simultaneous, but governed by the same principle — designed-in ability to contain, recover, and restore without waiting for a complete environment to be clean.

Layer 5: Planned long-cycle refresh. OT systems have 20–40 year operational lifetimes, but those lifetimes should be a programme, not an accident. Organisations without a documented OT refresh schedule — with component-by-component replacement milestones, firmware escrow requirements, spare parts inventory targets, and vendor succession planning — are not avoiding greenfield. They are deferring it until a crisis forces it under the worst possible conditions: compromised hardware, unavailable vendors, missing documentation, and no tested procedures.

The Acceptance Statement

Some OT components in critical infrastructure genuinely cannot be replaced on any timescale that security planning can influence. Legacy protection relays on operational transmission lines. Nuclear instrumentation systems under active safety cases. Water treatment chemical dosing controllers that predate the organisation's current IT function.

For these systems, the correct position is explicit acceptance, not avoidance:

Name them. Identify specifically which systems are outside the rebuild envelope and why.
Isolate them. The isolation must be proportional to the acknowledged unrepairability. A system that cannot be patched, cannot be replaced, and cannot be rebuilt must be surrounded by compensating controls so thorough that its compromise cannot propagate.
Monitor them obsessively. Configuration integrity monitoring, network traffic baselining, and anomaly detection for these specific systems — because when you cannot fix the asset, detection and containment are the only remaining defences.
Plan their eventual replacement. "This system cannot be replaced in the current operational context" is acceptable. "This system will never be replaced" is not a security posture — it is a deferred decision that will be made under worse conditions later.

The acceptance statement is not a sign of weakness. It is the honest foundation of a credible security programme. Regulators, insurers, and incident responders all prefer an organisation that knows exactly where its limits are and has compensating controls in place over one that claims no limits and has no plan.
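
The four steps above (name, isolate, monitor, plan replacement) can be kept as a structured record so that an incomplete acceptance statement is visible rather than implied. This is one possible shape, offered as a sketch. The class, its field names, and the example system are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AcceptanceStatement:
    system: str
    why_outside_rebuild_envelope: str = ""
    isolation_controls: list = field(default_factory=list)
    monitoring_controls: list = field(default_factory=list)
    replacement_plan: str = ""

    def gaps(self):
        """Return which of the four steps are still missing."""
        checks = {
            "name": bool(self.system and self.why_outside_rebuild_envelope),
            "isolate": bool(self.isolation_controls),
            "monitor": bool(self.monitoring_controls),
            "plan replacement": bool(self.replacement_plan),
        }
        return [step for step, done in checks.items() if not done]


# Hypothetical entry with no replacement plan recorded yet.
relay = AcceptanceStatement(
    system="Legacy distance relay, line 12",
    why_outside_rebuild_envelope="No outage window; vendor support ended",
    isolation_controls=["dedicated VLAN", "no routed path to IT"],
    monitoring_controls=["settings checksum comparison"],
)
print(relay.gaps())  # → ['plan replacement']
```

A record like this only checks that each step has been filled in. Whether the isolation is actually proportional to the risk remains an engineering judgement.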

The OT Greenfield Test

"If our IT and SCADA layers were fully compromised tonight: could we maintain critical service from manual procedures within 4 hours? Rebuild the IT layer from clean baselines within 48 hours? Restore full automated operation from verified OT configuration backups within two weeks? And have we actually tested each of these in the past 12 months?"

If any answer is no, the gap is in manual procedures, IT rebuild capability, OT configuration management, or test cadence — not in the impossibility of the OT environment itself.
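
The test question contains three time targets (4 hours, 48 hours, two weeks) and one cadence requirement (tested within the past 12 months). A minimal sketch of evaluating drill results against them follows. The `DrillResult` structure, the capability keys, and the sample results are assumptions made for illustration, and 12 months is approximated as 365 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

TARGETS = {
    "manual_operation": timedelta(hours=4),
    "it_rebuild": timedelta(hours=48),
    "ot_restore": timedelta(weeks=2),
}


@dataclass
class DrillResult:
    measured: Optional[timedelta]  # None means never measured
    last_tested: Optional[date]    # None means never tested


def greenfield_gaps(results, today, max_age=timedelta(days=365)):
    """Return one finding per capability that misses its target or test cadence."""
    gaps = []
    for name, target in TARGETS.items():
        result = results.get(name)
        if result is None or result.measured is None or result.last_tested is None:
            gaps.append(f"{name}: never tested")
        elif result.measured > target:
            gaps.append(f"{name}: measured time exceeds target")
        elif today - result.last_tested > max_age:
            gaps.append(f"{name}: last test older than 12 months")
    return gaps


# Hypothetical drill results.
results = {
    "manual_operation": DrillResult(timedelta(hours=3), date(2026, 2, 10)),
    "it_rebuild": DrillResult(timedelta(hours=60), date(2026, 1, 15)),
    "ot_restore": DrillResult(timedelta(days=10), date(2024, 11, 1)),
}
gaps = greenfield_gaps(results, today=date(2026, 6, 5))
for gap in gaps:
    print(gap)
```

An untested capability is reported as a gap in its own right. A recovery time that has never been measured is treated the same as one that missed its target.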

Evidence Package for Regulators

Requirement	Evidence from Antifragile Program
NIS2 risk management	Kill chain analysis, T0 asset classification, IT/OT connection matrix
NIS2 incident handling	IR runbooks, OT-specific response procedures, quarterly drill reports
NIS2 business continuity	Black start test reports, islanding validation, manual procedure verification
NIS2 supply chain security	Vendor risk register, firmware provenance, vendor exit architectures
NIS2 encryption	Data classification with encryption mapping, TLS configuration audits
NIS2 vulnerability handling	Vulnerability scan reports with safety-impact prioritization
CER resilience	Chaos engineering results, cross-training metrics, spare parts inventory

Previous: NIST CSF Mapping Next: Vertical: Telco
