Tomas Kracmar 763da003d3 Initial commit: antifragile cybersecurity consulting blueprint
2026-05-09 16:53:22 +02:00

Vertical Reference: Power and Utilities

"The grid does not care about your quarterly targets. It cares whether you understood the boundary between IT and operations before the adversary did."

This document adapts the antifragile rapid modernisation approach for power generation, transmission, distribution, and water utilities. These organizations operate industrial control systems (ICS/SCADA) where safety and availability are paramount, regulatory oversight is intense, and the convergence of IT and OT creates existential attack surfaces.


The Power and Utility Context

What Makes This Sector Different

| Factor | Enterprise Default | Power/Utility Reality |
|---|---|---|
| Downtime tolerance | Hours | Seconds to minutes (protection systems); hours for generation |
| Safety impact | Data loss, financial harm | Physical harm, loss of life, environmental catastrophe |
| System lifetime | 3-5 years | 20-40 years (generation, transmission, protection relays) |
| Regulatory driver | GDPR, industry standards | NIS2, CER, IEC 62351, NERC CIP (North America), national energy regulators |
| OT/IT boundary | Often porous or nonexistent | Legally and physically mandated; convergence is the primary risk |
| Supply chain | Moderate depth | Extreme (multi-vendor, multi-national, obsolete equipment) |
| Remote access | Common, convenient | Heavily restricted; often requires physical presence or dedicated lines |

The IT/OT Convergence Problem

Power utilities historically operated OT networks (SCADA, EMS, DMS, protection relays) as air-gapped systems. Over the past two decades, convergence has introduced:

  • Remote diagnostics over internet-connected VPNs
  • Centralized patch management through IT SCCM/WSUS
  • Business intelligence systems reading OT historian data
  • Vendor remote support terminals in control centers
  • Smart grid and Advanced Metering Infrastructure (AMI) connecting customer-facing IT to grid operations

Every convergence point is a potential bridge for adversaries from IT to OT.

The executive framing:

"Your control room does not need email. Your protection relays do not need internet access. Every connection between your IT network and your operational technology is a connection an adversary can cross. We are not adding bureaucracy. We are re-establishing the boundary that keeps the lights on."


Regulatory Landscape

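EU NIS2 Directive (Directive (EU) 2022/2555, in force since January 2023)

Energy and drinking water are sectors of high criticality under NIS2 (Annex I). Power utilities and water suppliers above the size thresholds are classified as essential entities.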

| NIS2 Requirement | Power/Utility Application |
|---|---|
| Risk management measures | Kill chain analysis for IT→OT bridges; physical security assessment |
| Supply chain security | Vendor access inventory for all OT equipment; firmware provenance tracking |
| Incident reporting (24h → 72h) | Automated detection and reporting to national CSIRT and energy regulator |
| Business continuity | Black start capability; grid islanding procedures; manual override validation |
| Cryptography | Encrypted communications for all IT/OT integration points |
| MFA | Hardware tokens for all remote access to OT or critical IT systems |
| Vulnerability handling | Risk-based prioritization with safety impact assessment |
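
The incident reporting row can be made concrete with a small deadline calculator. This is a minimal sketch, assuming the two windows named in the table (24 hours and 72 hours) are counted from the moment the organization becomes aware of the incident. The function and stage names are illustrative, and the exact obligations should be confirmed against the national transposition.

```python
from datetime import datetime, timedelta, timezone

# Windows taken from the "24h → 72h" row above. Assumption: both are counted
# from the time the organization became aware of the incident.
EARLY_WARNING = timedelta(hours=24)
INCIDENT_NOTIFICATION = timedelta(hours=72)


def reporting_deadlines(became_aware_at: datetime) -> dict[str, datetime]:
    """Return the latest submission time for each report stage."""
    if became_aware_at.tzinfo is None:
        raise ValueError("use a timezone-aware timestamp")
    return {
        "early_warning": became_aware_at + EARLY_WARNING,
        "incident_notification": became_aware_at + INCIDENT_NOTIFICATION,
    }


def overdue(became_aware_at: datetime, now: datetime, submitted: set[str]) -> list[str]:
    """List report stages whose deadline has passed without a submission."""
    deadlines = reporting_deadlines(became_aware_at)
    return sorted(
        stage for stage, due in deadlines.items()
        if stage not in submitted and now > due
    )


if __name__ == "__main__":
    aware = datetime(2026, 1, 10, 8, 0, tzinfo=timezone.utc)
    for stage, due in reporting_deadlines(aware).items():
        print(stage, due.isoformat())
```

Timezone-aware timestamps are required on purpose: control centres, the CSIRT, and the regulator may not share a clock convention.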

CER Directive (Critical Entities Resilience)

Requires power utilities to demonstrate resilience against:

  • Natural disasters
  • Cyberattacks
  • Supply chain disruptions
  • Pandemics and workforce unavailability

Antifragile application: Chaos engineering for non-safety systems; cross-training for manual procedures; distributed spare parts inventory.

Sector-Specific Standards

| Standard | Scope |
|---|---|
| IEC 62351 | Power systems cybersecurity: communications protocols, authentication, encryption |
| IEC 61850 | Substation communication (GOOSE, SV); security for these messages is specified in IEC 62351-6 |
| NERC CIP | North American electric reliability; mandatory standards with heavy penalties |
| ENTSO-E Cybersecurity Guidance | European transmission system operator requirements |
| BDEW Whitepaper | German energy sector cybersecurity best practices |

The Antifragile Posture for Power and Utilities

Pillar 1: Structural Decoupling — The IT/OT Firewall

Principle: IT and OT must be decoupled to the maximum extent compatible with operational requirements. The air gap is the default. Any bridge must be justified, documented, and monitored.

Antifragile Moves:

| Action | Implementation | Priority |
|---|---|---|
| Network segmentation | Physically separate IT and OT; unidirectional gateway or data diode so that data flows only from OT to IT | P0 |
| No AD trust to OT | OT AD (if any) must be a separate forest with one-way trust or no trust | P0 |
| Jump host architecture | All IT-to-OT access via hardened, monitored jump hosts with session recording | P1 |
| Vendor access airlock | Vendor VPNs terminate in dedicated DMZ; no direct OT access; remote hands or on-site escort for OT | P1 |
| Remove internet from OT | OT VLANs have no direct internet egress; updates via offline media or controlled proxy | P0 |
| AMI / Smart Grid isolation | Advanced Metering Infrastructure on dedicated network; no direct path to SCADA or EMS | P1 |
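
The principle that every bridge must be justified, documented, and monitored lends itself to an automated check over a connection matrix. The sketch below is one possible shape for that check. The `Bridge` record and its field names are hypothetical and not taken from any particular inventory tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    """One row of a hypothetical IT/OT connection matrix."""
    source: str
    destination: str
    justification: str = ""
    documented: bool = False
    monitored: bool = False


def audit_bridge(bridge: Bridge) -> list[str]:
    """Return the reasons a bridge fails the justified/documented/monitored rule."""
    findings = []
    if not bridge.justification.strip():
        findings.append("no business justification")
    if not bridge.documented:
        findings.append("not documented")
    if not bridge.monitored:
        findings.append("not monitored")
    return findings


def audit_matrix(bridges: list[Bridge]) -> dict[str, list[str]]:
    """Map 'source -> destination' to its findings, omitting compliant bridges."""
    report = {}
    for bridge in bridges:
        findings = audit_bridge(bridge)
        if findings:
            report[f"{bridge.source} -> {bridge.destination}"] = findings
    return report


if __name__ == "__main__":
    matrix = [
        Bridge("historian", "bi-platform", "regulatory reporting", True, True),
        Bridge("vendor-laptop", "ems"),
    ]
    for name, findings in audit_matrix(matrix).items():
        print(name, findings)
```

A bridge that produces any finding is a candidate for removal by default, which matches the air-gap-first stance of this pillar.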

Pillar 2: Optionality Preservation — Vendor and Technology Independence

Principle: Power utilities depend on vendors for SCADA, protection relays, turbine control, and substation automation. This dependency must not become a single point of failure.

Antifragile Moves:

  • Multi-vendor strategy for critical systems: No single vendor should control >50% of protection, control, or monitoring functions
  • Spare parts inventory: Maintain critical spares for legacy OT equipment that vendors no longer support
  • Firmware escrow and provenance: Require vendors to deposit firmware; verify cryptographic signatures before deployment
  • Local competence: Train internal staff to operate and maintain systems without vendor support for 30 days
  • Protocol independence: Where possible, support multiple communication protocols to avoid single-vendor lock-in
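
The 50% ceiling in the multi-vendor bullet can be checked against an asset inventory. The following is a rough sketch that assumes vendor share is measured by asset count per function; weighting by criticality or by protected load would be an equally valid choice. The inventory format is hypothetical.

```python
from collections import Counter

MAX_VENDOR_SHARE = 0.50  # threshold from the multi-vendor bullet above


def vendor_shares(assets: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Given (function, vendor) pairs, return each vendor's share per function."""
    by_function: dict[str, Counter] = {}
    for function, vendor in assets:
        by_function.setdefault(function, Counter())[vendor] += 1
    return {
        function: {vendor: n / sum(counts.values()) for vendor, n in counts.items()}
        for function, counts in by_function.items()
    }


def concentration_findings(assets: list[tuple[str, str]]) -> list[tuple[str, str, float]]:
    """Return (function, vendor, share) wherever a vendor exceeds the ceiling."""
    return sorted(
        (function, vendor, share)
        for function, shares in vendor_shares(assets).items()
        for vendor, share in shares.items()
        if share > MAX_VENDOR_SHARE
    )


if __name__ == "__main__":
    inventory = [
        ("protection", "VendorA"), ("protection", "VendorA"), ("protection", "VendorB"),
        ("control", "VendorA"), ("control", "VendorB"),
    ]
    for function, vendor, share in concentration_findings(inventory):
        print(f"{function}: {vendor} holds {share:.0%}")
```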

Pillar 3: Stress-to-Signal Conversion — OT Incident Learning

Principle: OT incidents are rare but high-impact. The organization must learn from every anomaly, near-miss, and exercise.

Antifragile Moves:

  • OT security operations centre (SOC) integration: Feed OT alarms into the SOC with analysts trained on industrial protocols
  • Monthly tabletop exercises: Simulate OT-specific scenarios (compromised EMS, rogue protection relay settings, ransomware on engineering workstations)
  • Post-incident structural mandate: Every OT incident or near-miss must produce at least one architectural or procedural change
  • Red team with bounded OT scope: Annual exercise including OT reconnaissance, constrained by safety requirements

Pillar 4: Sovereign Intelligence — Local AI for the Grid

Principle: Grid data is among the most sensitive an organization possesses. It reveals generation capacity, topology, switching patterns, load profiles, and operational routines.

Antifragile Moves:

  • Local AI for OT anomaly detection: Analyze historian data, DCS logs, and protection relay events without cloud exfiltration
  • Closed-loop digital twin: Train models on local OT data to predict equipment failures; never export raw telemetry
  • Air-gapped AI inference: Deploy inference nodes in OT DMZ with no return path to IT or internet
  • Load forecasting sovereignty: Local models for demand prediction using proprietary grid data

The executive framing:

"Your grid data tells an adversary exactly when and where to strike. It tells a competitor your capacity constraints. Sending it to a cloud AI for 'optimization' is not a technology decision. It is a national security and competitive intelligence decision. Local models on local hardware. Full stop."
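
Local anomaly detection does not have to start with a large model. As a deliberately simple stand-in for the local AI described above, the sketch below flags historian readings that deviate strongly from a trailing window, using only the Python standard library so it can run on an isolated node. The window size and threshold are illustrative assumptions to be tuned per signal.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=20, threshold=3.0):
    """Yield (index, value, z) for points far from the trailing window's mean.

    A plain statistical baseline, not a trained model: each point is compared
    with the mean and sample standard deviation of the previous `window` points.
    """
    history = deque(maxlen=window)
    for i, x in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0:
                z = (x - mu) / sigma
                if abs(z) > threshold:
                    yield i, x, z
        history.append(x)


if __name__ == "__main__":
    # Hypothetical frequency-like signal with one injected excursion.
    signal = [50.0 + 0.1 * (i % 5) for i in range(30)] + [65.0] + [50.0] * 5
    for index, value, z in rolling_zscore_anomalies(signal):
        print(f"sample {index}: value={value} z={z:.1f}")
```

Nothing in this loop needs a network connection, which is the property that matters for the sovereignty argument.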

Pillar 5: Asymmetric Payoff — Resilience Over Prevention

Principle: In power utilities, perfect prevention is impossible. The goal is to survive and recover faster than the adversary can exploit.

Antifragile Moves:

  • Black start capability: Maintain the ability to restart the grid from shutdown without external power
  • Grid islanding: Design systems so that sections can disconnect and operate independently during disturbances
  • Manual override procedures: Every automated system must have a documented, tested manual procedure
  • Redundant communication paths: Power line carrier, microwave, satellite backup for SCADA and protection communications
  • Protection relay independence: Electromechanical or static relays as backup for digital relays in critical paths

The Rapid Modernisation Plan: Power/Utility Variant

Phase 1: Hygiene (Days 0-30)

In addition to standard hygiene:

| Action | Owner | Deliverable |
|---|---|---|
| Inventory all OT assets: DCS, SCADA, EMS, protection relays, RTUs, AMI | OT Security / Engineering | OT asset inventory with vendor and firmware versions |
| Map all IT-to-OT network connections | Network / OT | Connection matrix with business justification per connection |
| Audit vendor remote access: who, how, when, for how long | OT Security / Procurement | Vendor access log and hardened policy |
| Identify OT systems with internet connectivity | Network | List with immediate remediation plan |
| Document manual override procedures for critical systems | OT Engineering | Procedure manual, signed off by operations and safety |
| Validate backup of EMS / DMS configurations | OT Engineering | Backup integrity test report |

Phase 2: Control (Days 30-60)

| Action | Owner | Deliverable |
|---|---|---|
| Implement network segmentation: IT/OT DMZ with unidirectional gateway | Network / OT | Segmentation architecture and validated firewall rules |
| Harden vendor access: time-bounded, session-recorded, MFA with hardware tokens | OT Security | Vendor access gateway operational |
| Enable OT logging: historian, DCS, firewall, protection relay events | OT Security | Centralized OT log aggregation (air-gapped SIEM or historian) |
| Patch OT systems: test in lab, deploy in maintenance windows | OT Engineering | Patch management procedure with safety gates |
| Secure engineering workstations (EWS): application whitelisting, no internet | OT Security | EWS hardening standard deployed |
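
The vendor access hardening row combines three conditions (time-bounded, session-recorded, hardware-token MFA) with the DMZ termination rule from Pillar 1. A gateway could enforce them with a check along these lines. This is a sketch only: the request fields, the zone label `vendor-dmz`, and the eight-hour ceiling are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_SESSION = timedelta(hours=8)  # hypothetical ceiling; set per site policy


@dataclass(frozen=True)
class VendorAccessRequest:
    vendor: str
    target_zone: str  # hypothetical zone label, e.g. "vendor-dmz" or "ot"
    start: datetime
    end: datetime
    hardware_token_mfa: bool
    session_recording: bool


def policy_violations(req: VendorAccessRequest) -> list[str]:
    """Return every policy rule the request breaks; an empty list means allow."""
    violations = []
    if req.end <= req.start:
        violations.append("end must be after start")
    elif req.end - req.start > MAX_SESSION:
        violations.append("session exceeds maximum duration")
    if not req.hardware_token_mfa:
        violations.append("hardware-token MFA required")
    if not req.session_recording:
        violations.append("session recording required")
    if req.target_zone != "vendor-dmz":
        violations.append("vendor sessions must terminate in the vendor DMZ")
    return violations


if __name__ == "__main__":
    start = datetime(2026, 3, 1, 9, 0)
    request = VendorAccessRequest("acme", "ot", start, start + timedelta(hours=12), False, True)
    print(policy_violations(request))
```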

Phase 3: Sovereignty (Days 60-90)

| Action | Owner | Deliverable |
|---|---|---|
| Deploy local AI for OT anomaly detection pilot | AI / OT Security | OT anomaly detection with false positive tuning |
| Validate black start / islanding procedures | Operations | Test report with time-to-recovery metrics |
| Conduct OT-specific tabletop exercise | Security / Operations | Exercise report with structural improvements |
| Implement firmware integrity monitoring | OT Security | Baseline hashes for critical OT firmware |
| Test protection relay fail-over to electromechanical backup | Engineering | Fail-over test report |
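
Baseline hashes for firmware can be produced with nothing more than the standard library. The sketch below assumes firmware images are staged as files in a directory on an offline workstation; the directory layout is hypothetical. A hash baseline detects change, but it does not replace vendor signature verification.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large firmware images do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_baseline(firmware_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to firmware_dir) to its SHA-256 digest."""
    return {
        str(p.relative_to(firmware_dir)): sha256_file(p)
        for p in sorted(firmware_dir.rglob("*"))
        if p.is_file()
    }


def compare_to_baseline(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Classify differences between a stored baseline and a fresh scan."""
    return {
        "modified": sorted(k for k in baseline.keys() & current.keys() if baseline[k] != current[k]),
        "missing": sorted(baseline.keys() - current.keys()),
        "unexpected": sorted(current.keys() - baseline.keys()),
    }
```

In use, `build_baseline` would run once after a verified installation, with the result stored offline, and `compare_to_baseline` would run on each later scan.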

Phase 4: Antifragility (Days 90-180)

| Action | Owner | Deliverable |
|---|---|---|
| Annual red team with bounded OT scope | Security | Red team report with kill chain analysis |
| Chaos engineering on non-safety IT systems | Resilience | Monthly experiment schedule and findings |
| Vendor exit architecture for critical OT platforms | Procurement / Engineering | 90-day vendor transition plan per critical system |
| Cross-training: operations staff on manual procedures | Operations | Training completion metrics |
| Participate in sector ISAC information sharing | Security | Threat intelligence integration report |

Substation and Protection Specifics

IEC 61850 Security

IEC 61850 (substation communication) uses GOOSE and Sampled Values (SV) that were not designed with security in mind.

Hardening priorities:

  • IEC 62351-6: Apply the security measures it specifies for IEC 61850 traffic (GOOSE, SV, MMS)
  • Authentication: Digitally sign GOOSE messages where IEDs support it
  • Network segmentation: GOOSE/SV traffic on dedicated VLAN; no routing to IT networks
  • IED hardening: Disable unused services; change default passwords; enable logging
  • Configuration management: Version control for SCL files; change detection for IED settings
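
Change detection for SCL files can be approximated by fingerprinting each IED element and comparing fingerprints between two versions. The sketch below is one way to do that with the Python standard library. It matches elements by local name so that it works with or without the SCL namespace, and it canonicalises the XML so that attribute order and indentation do not trigger false alarms. It is an illustration, not a substitute for a vendor configuration tool.

```python
import hashlib
import xml.etree.ElementTree as ET


def ied_fingerprints(scl_xml: str) -> dict[str, str]:
    """Map each IED name to a SHA-256 digest of its canonicalised XML subtree."""
    root = ET.fromstring(scl_xml)
    fingerprints = {}
    for elem in root.iter():
        # Compare the local name only, ignoring any '{namespace}' prefix.
        if elem.tag.rsplit("}", 1)[-1] == "IED":
            canonical = ET.canonicalize(
                ET.tostring(elem, encoding="unicode"), strip_text=True
            )
            name = elem.get("name", "<unnamed>")
            fingerprints[name] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return fingerprints


def changed_ieds(old_scl: str, new_scl: str) -> list[str]:
    """Return names of IEDs that were added, removed, or modified."""
    old, new = ied_fingerprints(old_scl), ied_fingerprints(new_scl)
    return sorted(name for name in old.keys() | new.keys() if old.get(name) != new.get(name))
```

Running `changed_ieds` on each commit of the SCL repository gives a short list of devices whose settings need review before deployment.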

Protection Relay Security

Protection relays are the safety-critical edge of the grid. Compromise can cause physical damage.

| Control | Implementation |
|---|---|
| Access control | Vaulted credentials; multi-person approval for settings changes |
| Logging | All settings changes logged with before/after values |
| Integrity | Cryptographic checksums for firmware and settings files |
| Redundancy | Independent protection schemes (e.g., distance + differential) |
| Manual backup | Electromechanical or static relay backup for critical digital protections |
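
The logging and access control rows can be combined into a single change gate: compute the before/after difference, then require approval from people other than the requester. The sketch below assumes relay settings are available as flat key/value pairs and that two independent approvers are required. Both are illustrative assumptions, as are the setting names.

```python
def diff_settings(before: dict, after: dict) -> list[dict]:
    """Return one log entry per changed setting, with before and after values."""
    changes = []
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes.append({"setting": key, "before": old, "after": new})
    return changes


def approve_change(changes: list[dict], requester: str, approvers: list[str], required: int = 2) -> bool:
    """Allow a non-empty change only with enough approvers other than the requester."""
    independent = set(approvers) - {requester}
    return bool(changes) and len(independent) >= required


if __name__ == "__main__":
    # Hypothetical overcurrent relay settings.
    before = {"pickup_current_a": 400, "time_dial": 0.5}
    after = {"pickup_current_a": 650, "time_dial": 0.5}
    changes = diff_settings(before, after)
    print(changes)
    print(approve_change(changes, "alice", ["bob", "carol"]))
```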

Generation-Specific Considerations

Thermal / Nuclear / Hydro / Renewables

| Generation Type | Specific Risk | Control |
|---|---|---|
| Thermal | Turbine control system compromise | Dedicated turbine control network; no IT connectivity |
| Nuclear | Safety system interference | Air-gapped safety systems; regulatory compliance with national nuclear authority |
| Hydro | Dam control / spillway gate manipulation | Physical controls for critical water management; redundant level sensors |
| Renewables | Inverter-based resource (IBR) vulnerability | Secure firmware updates; anti-islanding protection; grid support function validation |

Distributed Energy Resources (DER)

Solar, wind, and battery inverters connect to the distribution grid with varying security maturity.

  • Action: DER interconnection standards must include cybersecurity requirements
  • Action: Monitor DER communications for anomalous commands or settings changes
  • Action: Aggregate DER visibility in DMS/ADMS without direct control paths

Water and Wastewater Utilities

Water utilities share many characteristics with power but have additional concerns:

| Concern | Application |
|---|---|
| Safety | Contamination prevention, pressure management, chemical dosing control |
| SCADA/OT | Treatment plant automation, distribution pump control, reservoir level management |
| Criticality | Water is life-sustaining; outages have immediate public health impact |
| Regulation | EPA (US), Drinking Water Inspectorate (UK), national health authorities |

Additional controls for water utilities:

  • Physical security for treatment chemicals (chlorine, fluoride) to prevent intentional contamination
  • Redundant water quality sensors with cross-validation
  • Manual override capability for all automated chemical dosing systems
  • Isolation of IT from operational water quality monitoring
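
Cross-validation of redundant sensors can be as simple as median voting: take the median of the readings as the consensus and flag any sensor that strays too far from it. The sketch below assumes at least three sensors measuring the same quantity; the sensor names and the tolerance are hypothetical.

```python
from statistics import median


def cross_validate(readings: dict[str, float], tolerance: float) -> tuple[float, list[str]]:
    """Return (consensus, outliers) for redundant sensors measuring one quantity.

    The consensus is the median reading; an outlier is any sensor whose value
    differs from the consensus by more than `tolerance`.
    """
    if len(readings) < 3:
        raise ValueError("need at least three sensors to outvote a faulty one")
    consensus = median(readings.values())
    outliers = sorted(
        name for name, value in readings.items()
        if abs(value - consensus) > tolerance
    )
    return consensus, outliers


if __name__ == "__main__":
    # Hypothetical free-chlorine readings in mg/L from three analysers.
    consensus, outliers = cross_validate(
        {"cl-a": 0.52, "cl-b": 0.50, "cl-c": 1.90}, tolerance=0.1
    )
    print(consensus, outliers)
```

A flagged sensor may be faulty or tampered with. Either way the dosing decision should rest on the consensus or fall back to manual control, not on the outlier.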

M365 in Power and Utilities

Corporate IT in power utilities uses M365 but must be strictly separated from OT.

| Consideration | Power/Utility Requirement |
|---|---|
| Data residency | M365 data in EU/national datacenters; verify tenant location |
| Conditional access | Block M365 access from non-corporate devices for privileged users; geo-restrict admin access |
| Guest access | Strictly prohibit in OT-connected tenants; heavily vet in corporate tenant |
| Teams / SharePoint | Never used for OT document sharing or control room communication |
| Mobile device management | Field engineer tablets Intune-managed; restricted app installation |
| Email security | EOP baseline minimum; Defender for Office 365 P2 recommended for critical infrastructure |

See M365 E3 Hardening for tactical hardening, and apply these overlays.


Evidence Package for Regulators

| Requirement | Evidence from Antifragile Program |
|---|---|
| NIS2 risk management | Kill chain analysis, T0 asset classification, IT/OT connection matrix |
| NIS2 incident handling | IR runbooks, OT-specific response procedures, quarterly drill reports |
| NIS2 business continuity | Black start test reports, islanding validation, manual procedure verification |
| NIS2 supply chain security | Vendor risk register, firmware provenance, vendor exit architectures |
| NIS2 encryption | Data classification with encryption mapping, TLS configuration audits |
| NIS2 vulnerability handling | Vulnerability scan reports with safety-impact prioritization |
| CER resilience | Chaos engineering results, cross-training metrics, spare parts inventory |

Previous: NIST CSF Mapping | Next: Vertical: Telco