antifragile/antifragile-consulting/reference/vertical-power-utilities.md
Claude Sonnet 4.6 bcebf8ebb3 feat: Add critical infrastructure adaptation for Rule 5 (greenfield)
move-fast-and-fix-things.md: 'The Critical Infrastructure Adaptation'
section in Rule 5. OT/NT environments where full greenfield is impossible.
Five-layer adapted stack: IT greenfield protects OT, OT config as code,
manual operation as fallback, compartmentalisation as partial burn,
long-cycle planned refresh. OT greenfield test with 4h/48h/2w targets.

vertical-power-utilities.md: New 'The Controlled Burn Adaptation' section.
Full treatment of when greenfield is not an option. Five-layer OT-adapted
stack. Explicit acceptance statement framework for genuinely irreplaceable
OT components (name, isolate, monitor, plan replacement). The OT greenfield
test. Reference back to Rule 5.

Co-Authored-By: Tom Kracmar <tom+claude@cat6.cz>
2026-06-05 06:58:07 +00:00


Vertical Reference: Power and Utilities

"The grid does not care about your quarterly targets. It cares whether you understood the boundary between IT and operations before the adversary did."

This document adapts the antifragile rapid modernisation approach for power generation, transmission, distribution, and water utilities. These organizations operate industrial control systems (ICS/SCADA) where safety and availability are paramount, regulatory oversight is intense, and the convergence of IT and OT creates existential attack surfaces.


The Power and Utility Context

What Makes This Sector Different

| Factor | Enterprise Default | Power/Utility Reality |
| --- | --- | --- |
| Downtime tolerance | Hours | Seconds to minutes (protection systems); hours for generation |
| Safety impact | Data loss, financial harm | Physical harm, loss of life, environmental catastrophe |
| System lifetime | 3-5 years | 20-40 years (generation, transmission, protection relays) |
| Regulatory driver | GDPR, industry standards | NIS2, CER, IEC 62351, NERC CIP (North America), national energy regulators |
| OT/IT boundary | Often porous or nonexistent | Legally and physically mandated; convergence is the primary risk |
| Supply chain | Moderate depth | Extreme (multi-vendor, multi-national, obsolete equipment) |
| Remote access | Common, convenient | Heavily restricted; often requires physical presence or dedicated lines |

The IT/OT Convergence Problem

Power utilities historically operated OT networks (SCADA, EMS, DMS, protection relays) as air-gapped systems. Over the past two decades, convergence has introduced:

  • Remote diagnostics over internet-connected VPNs
  • Centralized patch management through IT SCCM/WSUS
  • Business intelligence systems reading OT historian data
  • Vendor remote support terminals in control centers
  • Smart grid and Advanced Metering Infrastructure (AMI) connecting customer-facing IT to grid operations

Every convergence point is a potential bridge for adversaries from IT to OT.

The executive framing:

"Your control room does not need email. Your protection relays do not need internet access. Every connection between your IT network and your operational technology is a connection an adversary can cross. We are not adding bureaucracy. We are re-establishing the boundary that keeps the lights on."


Regulatory Landscape

EU NIS2 Directive (2023)

Power utilities and water suppliers are classified as essential entities under NIS2.

| NIS2 Requirement | Power/Utility Application |
| --- | --- |
| Risk management measures | Kill chain analysis for IT→OT bridges; physical security assessment |
| Supply chain security | Vendor access inventory for all OT equipment; firmware provenance tracking |
| Incident reporting (24h → 72h) | Automated detection and reporting to national CSIRT and energy regulator |
| Business continuity | Black start capability; grid islanding procedures; manual override validation |
| Cryptography | Encrypted communications for all IT/OT integration points |
| MFA | Hardware tokens for all remote access to OT or critical IT systems |
| Vulnerability handling | Risk-based prioritization with safety impact assessment |

CER Directive (Critical Entities Resilience)

Requires power utilities to demonstrate resilience against:

  • Natural disasters
  • Cyberattacks
  • Supply chain disruptions
  • Pandemics and workforce unavailability

Antifragile application: Chaos engineering for non-safety systems; cross-training for manual procedures; distributed spare parts inventory.

Sector-Specific Standards

| Standard | Scope |
| --- | --- |
| IEC 62351 | Power systems cybersecurity: communications protocols, authentication, encryption |
| IEC 61850 | Substation communication (GOOSE, SV); security extensions for IEC 61850-90-20 |
| NERC CIP | North American electric reliability; mandatory standards with heavy penalties |
| ENTSO-E Cybersecurity Guidance | European transmission system operator requirements |
| BDEW Whitepaper | German energy sector cybersecurity best practices |

The Antifragile Posture for Power and Utilities

Pillar 1: Structural Decoupling — The IT/OT Firewall

Principle: IT and OT must be decoupled to the maximum extent compatible with operational requirements. The air gap is the default. Any bridge must be justified, documented, and monitored.

Antifragile Moves:

| Action | Implementation | Priority |
| --- | --- | --- |
| Network segmentation | Physically separate IT and OT; unidirectional gateway or data diode for OT→IT data flows | P0 |
| No AD trust to OT | OT AD (if any) must be a separate forest with one-way trust or no trust | P0 |
| Jump host architecture | All IT-to-OT access via hardened, monitored jump hosts with session recording | P1 |
| Vendor access airlock | Vendor VPNs terminate in dedicated DMZ; no direct OT access; remote hands or on-site escort for OT | P1 |
| Remove internet from OT | OT VLANs have no direct internet egress; updates via offline media or controlled proxy | P0 |
| AMI / Smart Grid isolation | Advanced Metering Infrastructure on dedicated network; no direct path to SCADA or EMS | P1 |
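
The rule that every bridge must be justified, documented, and monitored can be checked mechanically once the connections are listed. The following is a minimal sketch, assuming a hypothetical `Connection` record; the field names and the sample assets are illustrative and do not come from any real inventory tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    source: str          # IT-side asset (hypothetical name)
    destination: str     # OT-side or DMZ asset (hypothetical name)
    justification: str   # business reason; empty means undocumented
    monitored: bool      # True if traffic on this path is logged and reviewed


def unjustified_bridges(connections):
    """Return connections that lack a justification or are not monitored."""
    return [
        c for c in connections
        if not c.justification.strip() or not c.monitored
    ]


matrix = [
    Connection("bi-reporting", "historian-replica", "Regulatory load reporting", True),
    Connection("it-wsus", "ews-01", "", True),
    Connection("vendor-vpn", "dmz-jump-host", "Turbine vendor diagnostics", False),
]

for c in unjustified_bridges(matrix):
    print(f"REVIEW: {c.source} -> {c.destination}")
```

Anything the function returns is a candidate for removal rather than for retroactive paperwork, since the air gap is the default.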

Pillar 2: Optionality Preservation — Vendor and Technology Independence

Principle: Power utilities depend on vendors for SCADA, protection relays, turbine control, and substation automation. This dependency must not become a single point of failure.

Antifragile Moves:

  • Multi-vendor strategy for critical systems: No single vendor should control >50% of protection, control, or monitoring functions
  • Spare parts inventory: Maintain critical spares for legacy OT equipment that vendors no longer support
  • Firmware escrow and provenance: Require vendors to deposit firmware; verify cryptographic signatures before deployment
  • Local competence: Train internal staff to operate and maintain systems without vendor support for 30 days
  • Protocol independence: Where possible, support multiple communication protocols to avoid single-vendor lock-in
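
The 50% ceiling in the multi-vendor move can be tracked from an asset inventory. The sketch below assumes that share is measured by device count per function, which is only one possible reading of "control"; the function name `concentration_findings` and the sample inventory are hypothetical.

```python
from collections import Counter, defaultdict


def concentration_findings(assets, limit=0.5):
    """Return {function: {vendor: share}} for vendors whose share exceeds limit.

    `assets` is an iterable of (function, vendor) pairs, one per device.
    Counting devices is an assumption; weighting by criticality may fit better.
    """
    by_function = defaultdict(Counter)
    for function, vendor in assets:
        by_function[function][vendor] += 1

    findings = {}
    for function, counts in by_function.items():
        total = sum(counts.values())
        over = {v: n / total for v, n in counts.items() if n / total > limit}
        if over:
            findings[function] = over
    return findings


inventory = [
    ("protection", "VendorA"), ("protection", "VendorA"), ("protection", "VendorB"),
    ("control", "VendorA"), ("control", "VendorC"),
]

print(concentration_findings(inventory))
```

In this sample only the protection function is flagged: VendorA holds two of three devices there, while the control function is split evenly and stays at the limit without exceeding it.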

Pillar 3: Stress-to-Signal Conversion — OT Incident Learning

Principle: OT incidents are rare but high-impact. The organization must learn from every anomaly, near-miss, and exercise.

Antifragile Moves:

  • OT security operations centre (SOC) integration: Feed OT alarms into the SOC with analysts trained on industrial protocols
  • Monthly tabletop exercises: Simulate OT-specific scenarios (compromised EMS, rogue protection relay settings, ransomware on engineering workstations)
  • Post-incident structural mandate: Every OT incident or near-miss must produce at least one architectural or procedural change
  • Red team with bounded OT scope: Annual exercise including OT reconnaissance, constrained by safety requirements

Pillar 4: Sovereign Intelligence — Local AI for the Grid

Principle: Grid data is among the most sensitive an organization possesses. It reveals generation capacity, topology, switching patterns, load profiles, and operational routines.

Antifragile Moves:

  • Local AI for OT anomaly detection: Analyze historian data, DCS logs, and protection relay events without cloud exfiltration
  • Closed-loop digital twin: Train models on local OT data to predict equipment failures; never export raw telemetry
  • Air-gapped AI inference: Deploy inference nodes in OT DMZ with no return path to IT or internet
  • Load forecasting sovereignty: Local models for demand prediction using proprietary grid data
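
Local anomaly detection does not have to start with a trained model. As a deliberately simple statistical baseline (not the AI system described above), the sketch below flags historian readings that sit far from the mean of the preceding window. It uses only the Python standard library, so nothing leaves the host; the window size, threshold, and sample readings are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def zscore_anomalies(values, window=5, threshold=3.0):
    """Yield (index, value) for points far from the mean of the preceding window."""
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip perfectly flat windows to avoid dividing by zero.
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            yield i, values[i]


# Hypothetical frequency-like readings with one injected spike.
readings = [50.0, 50.1, 49.9, 50.0, 50.2, 50.1, 57.5, 50.0]
print(list(zscore_anomalies(readings)))
```

A baseline like this is mainly useful for tuning false positives before a heavier model is introduced; note that a spike inflates the variance of the windows that follow it, which can mask a second anomaly close behind.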

The executive framing:

"Your grid data tells an adversary exactly when and where to strike. It tells a competitor your capacity constraints. Sending it to a cloud AI for 'optimization' is not a technology decision. It is a national security and competitive intelligence decision. Local models on local hardware. Full stop."

Pillar 5: Asymmetric Payoff — Resilience Over Prevention

Principle: In power utilities, perfect prevention is impossible. The goal is to survive and recover faster than the adversary can exploit.

Antifragile Moves:

  • Black start capability: Maintain the ability to restart the grid from shutdown without external power
  • Grid islanding: Design systems so that sections can disconnect and operate independently during disturbances
  • Manual override procedures: Every automated system must have a documented, tested manual procedure
  • Redundant communication paths: Power line carrier, microwave, satellite backup for SCADA and protection communications
  • Protection relay independence: Electromechanical or static relays as backup for digital relays in critical paths

The Rapid Modernisation Plan: Power/Utility Variant

Phase 1: Hygiene (Days 0-30)

In addition to standard hygiene:

| Action | Owner | Deliverable |
| --- | --- | --- |
| Inventory all OT assets: DCS, SCADA, EMS, protection relays, RTUs, AMI | OT Security / Engineering | OT asset inventory with vendor and firmware versions |
| Map all IT-to-OT network connections | Network / OT | Connection matrix with business justification per connection |
| Audit vendor remote access: who, how, when, for how long | OT Security / Procurement | Vendor access log and hardened policy |
| Identify OT systems with internet connectivity | Network | List with immediate remediation plan |
| Document manual override procedures for critical systems | OT Engineering | Procedure manual, signed off by operations and safety |
| Validate backup of EMS / DMS configurations | OT Engineering | Backup integrity test report |

Phase 2: Control (Days 30-60)

| Action | Owner | Deliverable |
| --- | --- | --- |
| Implement network segmentation: IT/OT DMZ with unidirectional gateway | Network / OT | Segmentation architecture and validated firewall rules |
| Harden vendor access: time-bounded, session-recorded, MFA with hardware tokens | OT Security | Vendor access gateway operational |
| Enable OT logging: historian, DCS, firewall, protection relay events | OT Security | Centralized OT log aggregation (air-gapped SIEM or historian) |
| Patch OT systems: test in lab, deploy in maintenance windows | OT Engineering | Patch management procedure with safety gates |
| Secure engineering workstations (EWS): application whitelisting, no internet | OT Security | EWS hardening standard deployed |

Phase 3: Sovereignty (Days 60-90)

| Action | Owner | Deliverable |
| --- | --- | --- |
| Deploy local AI for OT anomaly detection pilot | AI / OT Security | OT anomaly detection with false positive tuning |
| Validate black start / islanding procedures | Operations | Test report with time-to-recovery metrics |
| Conduct OT-specific tabletop exercise | Security / Operations | Exercise report with structural improvements |
| Implement firmware integrity monitoring | OT Security | Baseline hashes for critical OT firmware |
| Test protection relay fail-over to electromechanical backup | Engineering | Fail-over test report |
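
The "baseline hashes" deliverable can be illustrated with a short standard-library sketch. It assumes firmware images are held as files in a single directory on an offline workstation; the function names and the directory layout are hypothetical, and a real deployment would also need to protect the baseline itself from tampering.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Stream a file through SHA-256 so large images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_baseline(firmware_dir):
    """Map each file name in firmware_dir to its SHA-256 digest."""
    return {
        p.name: sha256_of(p)
        for p in sorted(Path(firmware_dir).iterdir())
        if p.is_file()
    }


def drifted(baseline, firmware_dir):
    """Names that were added, removed, or changed relative to the baseline."""
    current = build_baseline(firmware_dir)
    return sorted(
        name for name in baseline.keys() | current.keys()
        if baseline.get(name) != current.get(name)
    )
```

Calling `build_baseline` once after a verified vendor delivery and `drifted` on a schedule gives a simple change signal: an empty list means the images still match the recorded state.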

Phase 4: Antifragility (Days 90-180)

| Action | Owner | Deliverable |
| --- | --- | --- |
| Annual red team with bounded OT scope | Security | Red team report with kill chain analysis |
| Chaos engineering on non-safety IT systems | Resilience | Monthly experiment schedule and findings |
| Vendor exit architecture for critical OT platforms | Procurement / Engineering | 90-day vendor transition plan per critical system |
| Cross-training: operations staff on manual procedures | Operations | Training completion metrics |
| Participate in sector ISAC information sharing | Security | Threat intelligence integration report |

Substation and Protection Specifics

IEC 61850 Security

IEC 61850 (substation communication) uses GOOSE and Sampled Values (SV) that were not designed with security in mind.

Hardening priorities:

  • IEC 61850-90-20: Implement cybersecurity recommendations for IEC 61850 networks
  • Authentication: Digitally sign GOOSE messages where IEDs support it
  • Network segmentation: GOOSE/SV traffic on dedicated VLAN; no routing to IT networks
  • IED hardening: Disable unused services; change default passwords; enable logging
  • Configuration management: Version control for SCL files; change detection for IED settings

Protection Relay Security

Protection relays are the safety-critical edge of the grid. Compromise can cause physical damage.

| Control | Implementation |
| --- | --- |
| Access control | Vaulted credentials; multi-person approval for settings changes |
| Logging | All settings changes logged with before/after values |
| Integrity | Cryptographic checksums for firmware and settings files |
| Redundancy | Independent protection schemes (e.g., distance + differential) |
| Manual backup | Electromechanical or static relay backup for critical digital protections |
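
Logging settings changes with before/after values comes down to a dictionary diff once relay settings are exported in a structured form. The sketch below assumes settings are already parsed into flat key/value pairs; the parameter names are invented for illustration and do not correspond to any vendor's settings format.

```python
def diff_settings(before, after):
    """Return {parameter: (old, new)} for every added, removed, or changed setting.

    A parameter missing on one side is reported with None for that side.
    """
    changes = {}
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical overcurrent relay settings before and after a change.
before = {"pickup_current_A": 400, "time_dial": 0.5, "reclose_enabled": True}
after = {"pickup_current_A": 900, "time_dial": 0.5, "reclose_enabled": True}

for name, (old, new) in diff_settings(before, after).items():
    print(f"{name}: {old} -> {new}")
```

Recording the diff alongside the approvers' identities gives reviewers the exact values that moved, which is what matters when a raised pickup threshold silently desensitises a protection scheme.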

Generation-Specific Considerations

Thermal / Nuclear / Hydro

| Generation Type | Specific Risk | Control |
| --- | --- | --- |
| Thermal | Turbine control system compromise | Dedicated turbine control network; no IT connectivity |
| Nuclear | Safety system interference | Air-gapped safety systems; regulatory compliance with national nuclear authority |
| Hydro | Dam control / spillway gate manipulation | Physical controls for critical water management; redundant level sensors |
| Renewables | Inverter-based resource (IBR) vulnerability | Secure firmware updates; anti-islanding protection; grid support function validation |

Distributed Energy Resources (DER)

Solar, wind, and battery inverters connect to the distribution grid with varying security maturity.

  • Action: DER interconnection standards must include cybersecurity requirements
  • Action: Monitor DER communications for anomalous commands or settings changes
  • Action: Aggregate DER visibility in DMS/ADMS without direct control paths

Water and Wastewater Utilities

Water utilities share many characteristics with power but have additional concerns:

| Concern | Application |
| --- | --- |
| Safety | Contamination prevention, pressure management, chemical dosing control |
| SCADA/OT | Treatment plant automation, distribution pump control, reservoir level management |
| Criticality | Water is life-sustaining; outages have immediate public health impact |
| Regulation | EPA (US), Drinking Water Inspectorate (UK), national health authorities |

Additional controls for water utilities:

  • Physical security for treatment chemicals (chlorine, fluoride) to prevent intentional contamination
  • Redundant water quality sensors with cross-validation
  • Manual override capability for all automated chemical dosing systems
  • Isolation of IT from operational water quality monitoring
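
Cross-validation of redundant sensors can be as simple as comparing each reading with the median of the group. The sketch below is one possible approach, assuming at least three sensors measure the same quantity; the sensor names, the chlorine values, and the tolerance are illustrative only and are not operational limits.

```python
from statistics import median


def cross_validate(readings, tolerance):
    """Return (consensus, suspects) for a group of redundant sensors.

    consensus is the median reading; suspects are the sensors whose value
    deviates from it by more than `tolerance`.
    """
    consensus = median(readings.values())
    suspects = sorted(
        name for name, value in readings.items()
        if abs(value - consensus) > tolerance
    )
    return consensus, suspects


# Hypothetical free-chlorine readings in mg/L from three redundant sensors.
chlorine_mg_l = {"sensor_a": 0.52, "sensor_b": 0.50, "sensor_c": 1.90}
print(cross_validate(chlorine_mg_l, tolerance=0.1))
```

The median is used rather than the mean so that a single faulty or manipulated sensor cannot drag the consensus toward its own value.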

M365 in Power and Utilities

Corporate IT in power utilities uses M365 but must be strictly separated from OT.

| Consideration | Power/Utility Requirement |
| --- | --- |
| Data residency | M365 data in EU/national datacenters; verify tenant location |
| Conditional access | Block M365 access from non-corporate devices for privileged users; geo-restrict admin access |
| Guest access | Strictly prohibit in OT-connected tenants; heavily vet in corporate tenant |
| Teams / SharePoint | Never used for OT document sharing or control room communication |
| Mobile device management | Field engineer tablets Intune-managed; restricted app installation |
| Email security | EOP baseline minimum; Defender for Office 365 P2 recommended for critical infrastructure |

See M365 E3 Hardening for tactical hardening, and apply these overlays.


The Controlled Burn Adaptation: When Greenfield Is Not an Option

The antifragile framework holds that organisations should build toward the ability to deploy greenfield — rebuild from scratch, on clean infrastructure, from version-controlled configuration. This is the ultimate expression of structural decoupling: if you can rebuild the environment, no adversary and no vendor holds you hostage.

Power utilities, water suppliers, and telecom network operators frequently view this principle as inapplicable. The grid does not go dark for a rebuild exercise. Protection relays cannot be factory-reset during a fault. OT systems operate under safety cases that require regulatory approval for any configuration change. The controlled burn, taken literally, cannot happen.

This is correct. It is also not the end of the conversation.

The goal of greenfield capability is to eliminate inherited compromise and return to a known-good operational state. For IT environments, the method is rebuild. For OT/NT environments, the method is different — but the goal is identical, and it is achievable. The absence of a literal rebuild path does not justify the absence of a recovery plan.

The OT-Adapted Greenfield Stack

Layer 1: IT greenfield protects OT. The corporate IT environment, M365 tenant, SCADA servers, historian, engineering workstations, and HMI layer can almost always be made greenfield-capable even when OT hardware cannot. An adversary who compromises the IT layer and finds a clean rebuild path loses their persistence and pivot path without a single OT device being touched. IT greenfield is the outer perimeter of an OT environment that cannot be rebuilt itself. This is the first investment.

Layer 2: OT configuration as code. PLC logic, IED settings files, protection relay configuration archives, SCADA database snapshots, DCS export files — all of these belong in version-controlled backups with integrity verification. The ability to restore a known-good configuration to existing hardware is the OT equivalent of greenfield: the hardware remains, but the software state is wiped and rebuilt from a verified baseline. This is not a backup exercise. It is a discipline — with the same rigour that ASTRAL applies to M365 configuration, applied to OT configuration archives. Every piece of OT configuration that exists only in the device and nowhere else is a single point of failure.
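
The last point, configuration that exists only in the device, can be found by comparing the asset inventory with the configuration archive. The sketch below assumes both are available as simple collections keyed by device ID; the IDs and revision strings are hypothetical.

```python
def unarchived_devices(inventory, archive):
    """Device IDs present in the asset inventory but absent from the config archive."""
    return sorted(set(inventory) - set(archive))


# Hypothetical inventory and archive (device ID -> archived revision).
inventory = ["plc-pump-01", "ied-feeder-07", "relay-line-12"]
archive = {"plc-pump-01": "rev-0042", "ied-feeder-07": "rev-0017"}

print(unarchived_devices(inventory, archive))
```

Each ID returned is a device whose only copy of its configuration is the one it is running, which is the single point of failure described above.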

Layer 3: Manual operation as the fallback layer. The ability to operate critical systems without the automation layer is, in practice, the ability to drop the compromised layer and continue service. A power utility that can maintain 70-80% of service from manual procedures during a SCADA compromise has a fundamentally different risk profile than one that cannot. Manual override procedures must be:

  • Documented in detail, not just referenced in an emergency plan
  • Tested under realistic conditions, not just reviewed in a tabletop
  • Known by currently assigned operations staff, not just veterans who may have left
  • Validated at least annually — capability that is not practised does not exist when it is needed

Layer 4: Compartmentalisation as partial burn. OT environments are typically sectionable. Grid islanding, substation isolation, plant-level control separation, and control centre failover allow the operator to sacrifice and rebuild one section while maintaining critical service in others. This is the OT equivalent of the controlled burn: localised rather than total, sequential rather than simultaneous, but governed by the same principle — designed-in ability to contain, recover, and restore without waiting for a complete environment to be clean.

Layer 5: Planned long-cycle refresh. OT systems have 20-40 year operational lifetimes, but those lifetimes should be a programme, not an accident. Organisations without a documented OT refresh schedule (with component-by-component replacement milestones, firmware escrow requirements, spare parts inventory targets, and vendor succession planning) are not avoiding greenfield. They are deferring it until a crisis forces it under the worst possible conditions: compromised hardware, unavailable vendors, missing documentation, and no tested procedures.

The Acceptance Statement

Some OT components in critical infrastructure genuinely cannot be replaced on any timescale that security planning can influence. Legacy protection relays on operational transmission lines. Nuclear instrumentation systems under active safety cases. Water treatment chemical dosing controllers that predate the organisation's current IT function.

For these systems, the correct position is explicit acceptance, not avoidance:

  1. Name them. Identify specifically which systems are outside the rebuild envelope and why.
  2. Isolate them. The isolation must be proportional to the acknowledged unrepairability. A system that cannot be patched, cannot be replaced, and cannot be rebuilt must be surrounded by compensating controls so thorough that its compromise cannot propagate.
  3. Monitor them obsessively. Configuration integrity monitoring, network traffic baselining, and anomaly detection for these specific systems — because when you cannot fix the asset, detection and containment are the only remaining defences.
  4. Plan their eventual replacement. "This system cannot be replaced in the current operational context" is acceptable. "This system will never be replaced" is not a security posture — it is a deferred decision that will be made under worse conditions later.
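
The four steps can be captured as a structured record so that an incomplete acceptance statement is visible at a glance. This is a minimal sketch, assuming a hypothetical `AcceptanceStatement` class; the field names are invented for illustration and are not part of any regulatory template.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class AcceptanceStatement:
    system: str                                                 # 1. Name: which system
    reason_unreplaceable: str                                   # 1. Name: and why
    compensating_controls: list = field(default_factory=list)   # 2. Isolate
    monitoring: list = field(default_factory=list)              # 3. Monitor
    replacement_review: Optional[date] = None                   # 4. Plan replacement

    def gaps(self):
        """List which of the four steps are still unfilled."""
        checks = {
            "name": bool(self.system.strip() and self.reason_unreplaceable.strip()),
            "isolate": bool(self.compensating_controls),
            "monitor": bool(self.monitoring),
            "plan replacement": self.replacement_review is not None,
        }
        return [step for step, ok in checks.items() if not ok]


stmt = AcceptanceStatement(
    system="relay-line-12",
    reason_unreplaceable="In-service transmission protection; outage not available",
    compensating_controls=["dedicated VLAN", "no remote access"],
)
print(stmt.gaps())
```

A statement with a non-empty `gaps()` result is a named risk without the controls that make acceptance credible, so it should not yet be treated as accepted.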

The acceptance statement is not a sign of weakness. It is the honest foundation of a credible security programme. Regulators, insurers, and incident responders all prefer an organisation that knows exactly where its limits are and has compensating controls in place over one that claims no limits and has no plan.

The OT Greenfield Test

"If our IT and SCADA layers were fully compromised tonight: could we maintain critical service from manual procedures within 4 hours? Rebuild the IT layer from clean baselines within 48 hours? Restore full automated operation from verified OT configuration backups within two weeks? And have we actually tested each of these in the past 12 months?"

If any answer is no, the gap is in manual procedures, IT rebuild capability, OT configuration management, or test cadence — not in the impossibility of the OT environment itself.
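
The test can be scored against exercise records. The sketch below encodes the three targets from the question (4 hours, 48 hours, two weeks) and treats "the past 12 months" as 365 days, which is an approximation; the capability keys, the record format, and the sample results are hypothetical.

```python
from datetime import date, timedelta

# Recovery targets from the OT greenfield test, expressed in hours.
TARGETS_HOURS = {
    "manual_operation": 4,      # critical service from manual procedures
    "it_rebuild": 48,           # IT layer rebuilt from clean baselines
    "ot_restore": 14 * 24,      # full automated operation restored (two weeks)
}


def greenfield_gaps(results, today, max_age_days=365):
    """Return a list of gaps; `results` maps capability -> (hours, last_test_date)."""
    gaps = []
    for capability, target in TARGETS_HOURS.items():
        if capability not in results:
            gaps.append(f"{capability}: never tested")
            continue
        hours, tested = results[capability]
        if hours > target:
            gaps.append(f"{capability}: {hours}h exceeds {target}h target")
        if today - tested > timedelta(days=max_age_days):
            gaps.append(f"{capability}: last tested {tested.isoformat()}")
    return gaps


# Hypothetical exercise results.
results = {
    "manual_operation": (3, date(2026, 3, 1)),
    "it_rebuild": (60, date(2026, 1, 15)),
}
for gap in greenfield_gaps(results, today=date(2026, 6, 5)):
    print(gap)
```

In this sample the manual procedures pass, the IT rebuild misses its 48-hour target, and the OT restore has never been exercised, so two of the four questions are answered "no".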


Evidence Package for Regulators

| Requirement | Evidence from Antifragile Program |
| --- | --- |
| NIS2 risk management | Kill chain analysis, T0 asset classification, IT/OT connection matrix |
| NIS2 incident handling | IR runbooks, OT-specific response procedures, quarterly drill reports |
| NIS2 business continuity | Black start test reports, islanding validation, manual procedure verification |
| NIS2 supply chain security | Vendor risk register, firmware provenance, vendor exit architectures |
| NIS2 encryption | Data classification with encryption mapping, TLS configuration audits |
| NIS2 vulnerability handling | Vulnerability scan reports with safety-impact prioritization |
| CER resilience | Chaos engineering results, cross-training metrics, spare parts inventory |

Previous: NIST CSF Mapping Next: Vertical: Telco