# Vertical Reference: Power and Utilities

> *"The grid does not care about your quarterly targets. It cares whether you understood the boundary between IT and operations before the adversary did."*

This document adapts the antifragile rapid modernisation approach for power generation, transmission, distribution, and water utilities. These organizations operate industrial control systems (ICS/SCADA) where safety and availability are paramount, regulatory oversight is intense, and the convergence of IT and OT creates existential attack surfaces.

---

## The Power and Utility Context

### What Makes This Sector Different

| Factor | Enterprise Default | Power/Utility Reality |
|--------|-------------------|----------------------|
| Downtime tolerance | Hours | Seconds to minutes (protection systems); hours for generation |
| Safety impact | Data loss, financial harm | Physical harm, loss of life, environmental catastrophe |
| System lifetime | 3-5 years | 20-40 years (generation, transmission, protection relays) |
| Regulatory driver | GDPR, industry standards | NIS2, CER, IEC 62351, NERC CIP (North America), national energy regulators |
| OT/IT boundary | Often porous or nonexistent | Legally and physically mandated; convergence is the primary risk |
| Supply chain | Moderate depth | Extreme (multi-vendor, multi-national, obsolete equipment) |
| Remote access | Common, convenient | Heavily restricted; often requires physical presence or dedicated lines |

### The IT/OT Convergence Problem

Power utilities historically operated OT networks (SCADA, EMS, DMS, protection relays) as **air-gapped systems**. Over the past two decades, convergence has introduced:

- Remote diagnostics over internet-connected VPNs
- Centralized patch management through IT SCCM/WSUS
- Business intelligence systems reading OT historian data
- Vendor remote support terminals in control centers
- Smart grid and Advanced Metering Infrastructure (AMI) connecting customer-facing IT to grid operations

Every convergence point is a **potential bridge for adversaries** from IT to OT.

**The executive framing**:

> *"Your control room does not need email. Your protection relays do not need internet access. Every connection between your IT network and your operational technology is a connection an adversary can cross. We are not adding bureaucracy. We are re-establishing the boundary that keeps the lights on."*

---

## Regulatory Landscape

### EU NIS2 Directive (Directive (EU) 2022/2555, in force since 2023)

Power utilities and water suppliers fall within NIS2's sectors of high criticality (Annex I); operators above the directive's size thresholds are classified as **essential entities**.

| NIS2 Requirement | Power/Utility Application |
|-----------------|--------------------------|
| Risk management measures | Kill chain analysis for IT→OT bridges; physical security assessment |
| Supply chain security | Vendor access inventory for all OT equipment; firmware provenance tracking |
| Incident reporting (24h early warning, 72h incident notification) | Automated detection and reporting to national CSIRT and energy regulator |
| Business continuity | Black start capability; grid islanding procedures; manual override validation |
| Cryptography | Encrypted communications for all IT/OT integration points |
| MFA | Hardware tokens for all remote access to OT or critical IT systems |
| Vulnerability handling | Risk-based prioritization with **safety impact assessment** |
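The reporting windows can be turned into concrete deadlines the moment an incident is recognised. The sketch below reads the 24h and 72h figures in the table above as the early-warning and incident-notification windows, both measured from the time the organisation becomes aware of a significant incident. The function name and the example timestamp are illustrative assumptions, and national transposition may add further requirements.

```python
from datetime import datetime, timedelta, timezone

# Windows taken from the 24h / 72h figures in the NIS2 table (assumed to run
# from the moment the organisation becomes aware of a significant incident).
EARLY_WARNING = timedelta(hours=24)
INCIDENT_NOTIFICATION = timedelta(hours=72)


def reporting_deadlines(aware_at: datetime) -> dict[str, datetime]:
    """Return the latest submission time for each report."""
    if aware_at.tzinfo is None:
        # Naive timestamps are ambiguous across control centres and regulators.
        raise ValueError("aware_at must be timezone-aware")
    return {
        "early_warning": aware_at + EARLY_WARNING,
        "incident_notification": aware_at + INCIDENT_NOTIFICATION,
    }


# Hypothetical incident recognised late on a Saturday evening (UTC).
aware = datetime(2025, 3, 1, 22, 15, tzinfo=timezone.utc)
deadlines = reporting_deadlines(aware)
for name, due in deadlines.items():
    print(f"{name}: {due.isoformat()}")
```

Putting the deadlines into the incident ticket at creation time removes one decision from the first hours of a response.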

### CER Directive (Critical Entities Resilience)

Requires power utilities to demonstrate resilience against:

- Natural disasters
- Cyberattacks
- Supply chain disruptions
- Pandemics and workforce unavailability

**Antifragile application**: Chaos engineering for non-safety systems; cross-training for manual procedures; distributed spare parts inventory.

### Sector-Specific Standards

| Standard | Scope |
|----------|-------|
| **IEC 62351** | Power systems cybersecurity: communications protocols, authentication, encryption |
| **IEC 61850** | Substation communication (GOOSE, SV); security guidance in IEC 61850-90-20, with message security for IEC 61850 profiles defined in IEC 62351-6 |
| **NERC CIP** | North American electric reliability; mandatory standards with heavy penalties |
| **ENTSO-E Cybersecurity Guidance** | European transmission system operator requirements |
| **BDEW Whitepaper** | German energy sector cybersecurity best practices |

---

## The Antifragile Posture for Power and Utilities

### Pillar 1: Structural Decoupling — The IT/OT Firewall

**Principle**: IT and OT must be decoupled to the maximum extent compatible with operational requirements. The air gap is the default. Any bridge must be justified, documented, and monitored.

**Antifragile Moves**:

| Action | Implementation | Priority |
|--------|---------------|----------|
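| **Network segmentation** | Physically separate IT and OT; unidirectional gateway or data diode for OT→IT data flows (e.g., historian replication), so nothing can flow back into OT | P0 |
| **No AD trust to OT** | OT AD (if any) must be a separate forest with no trust by default; any unavoidable trust must be one-way and must not let IT accounts authenticate into OT | P0 |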
| **Jump host architecture** | All IT-to-OT access via hardened, monitored jump hosts with session recording | P1 |
| **Vendor access airlock** | Vendor VPNs terminate in dedicated DMZ; no direct OT access; remote hands or on-site escort for OT | P1 |
| **Remove internet from OT** | OT VLANs have no direct internet egress; updates via offline media or controlled proxy | P0 |
| **AMI / Smart Grid isolation** | Advanced Metering Infrastructure on dedicated network; no direct path to SCADA or EMS | P1 |
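The rule that every bridge must be justified, documented, and monitored can be checked mechanically once the IT/OT connections are recorded. The following is a minimal sketch under assumptions: the `Bridge` record, its field names, and the example entries are hypothetical and would be replaced by whatever the organisation's connection matrix actually contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    """One IT/OT connection from the connection matrix (hypothetical schema)."""
    source: str
    destination: str
    justification: str  # business reason; empty if none was recorded
    documented: bool
    monitored: bool


def violations(bridge: Bridge) -> list[str]:
    """List which of the three Pillar 1 conditions the bridge fails."""
    problems = []
    if not bridge.justification.strip():
        problems.append("no justification")
    if not bridge.documented:
        problems.append("not documented")
    if not bridge.monitored:
        problems.append("not monitored")
    return problems


# Illustrative entries only.
bridges = [
    Bridge("IT:bi-reporting", "OT:historian", "regulatory load reporting", True, True),
    Bridge("IT:wsus", "OT:engineering-workstations", "", True, False),
]

# Keep only the bridges that fail at least one condition.
report = {
    f"{b.source} -> {b.destination}": violations(b)
    for b in bridges
    if violations(b)
}
for path, problems in report.items():
    print(path, "=>", ", ".join(problems))
```

Any bridge that appears in the report is a candidate for removal rather than remediation, since the air gap is the default.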

### Pillar 2: Optionality Preservation — Vendor and Technology Independence

**Principle**: Power utilities depend on vendors for SCADA, protection relays, turbine control, and substation automation. This dependency must not become a single point of failure.

**Antifragile Moves**:

- **Multi-vendor strategy for critical systems**: No single vendor should control >50% of protection, control, or monitoring functions
- **Spare parts inventory**: Maintain critical spares for legacy OT equipment that vendors no longer support
- **Firmware escrow and provenance**: Require vendors to deposit firmware; verify cryptographic signatures before deployment
- **Local competence**: Train internal staff to operate and maintain systems without vendor support for 30 days
- **Protocol independence**: Where possible, support multiple communication protocols to avoid single-vendor lock-in
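The 50% ceiling in the multi-vendor strategy is straightforward to measure from an asset inventory. The sketch below assumes concentration is measured by simple asset count per function; weighting by criticality or by substations covered would be an equally defensible choice. The function name, vendor names, and inventory rows are hypothetical.

```python
from collections import Counter, defaultdict

MAX_SHARE = 0.50  # ceiling from the multi-vendor strategy above


def concentrated_vendors(
    assets: list[tuple[str, str]], max_share: float = MAX_SHARE
) -> dict[str, dict[str, float]]:
    """Return {function: {vendor: share}} for every vendor above max_share.

    `assets` is a list of (function, vendor) pairs, one per asset.
    Share is computed by asset count, which is an assumption of this sketch.
    """
    by_function: dict[str, Counter] = defaultdict(Counter)
    for function, vendor in assets:
        by_function[function][vendor] += 1

    flagged: dict[str, dict[str, float]] = {}
    for function, counts in by_function.items():
        total = sum(counts.values())
        over = {v: n / total for v, n in counts.items() if n / total > max_share}
        if over:
            flagged[function] = over
    return flagged


# Illustrative inventory: three of four protection assets come from one vendor.
assets = [
    ("protection", "VendorA"),
    ("protection", "VendorA"),
    ("protection", "VendorA"),
    ("protection", "VendorB"),
    ("control", "VendorA"),
    ("control", "VendorB"),
]
flagged = concentrated_vendors(assets)
print(flagged)
```

A vendor at exactly 50% is not flagged here, because the stated limit is "greater than 50%".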

### Pillar 3: Stress-to-Signal Conversion — OT Incident Learning

**Principle**: OT incidents are rare but high-impact. The organization must learn from every anomaly, near-miss, and exercise.

**Antifragile Moves**:

- **OT security operations centre (SOC) integration**: Feed OT alarms into the SOC with analysts trained on industrial protocols
- **Monthly tabletop exercises**: Simulate OT-specific scenarios (compromised EMS, rogue protection relay settings, ransomware on engineering workstations)
- **Post-incident structural mandate**: Every OT incident or near-miss must produce at least one architectural or procedural change
- **Red team with bounded OT scope**: Annual exercise including OT reconnaissance, constrained by safety requirements

### Pillar 4: Sovereign Intelligence — Local AI for the Grid

**Principle**: Grid data is among the most sensitive an organization possesses. It reveals generation capacity, topology, switching patterns, load profiles, and operational routines.

**Antifragile Moves**:

- **Local AI for OT anomaly detection**: Analyze historian data, DCS logs, and protection relay events without cloud exfiltration
- **Closed-loop digital twin**: Train models on local OT data to predict equipment failures; never export raw telemetry
- **Air-gapped AI inference**: Deploy inference nodes in OT DMZ with no return path to IT or internet
- **Load forecasting sovereignty**: Local models for demand prediction using proprietary grid data
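Local anomaly detection does not have to start with a trained model. As a stand-in for the idea, the sketch below uses a rolling z-score, a simple statistical baseline rather than AI, that runs entirely on local hardware with the Python standard library. The window size, threshold, and sample readings are illustrative assumptions, not tuned values.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_alerts(
    values: list[float], window: int = 5, threshold: float = 3.0
) -> list[int]:
    """Return indices whose value deviates from the preceding window.

    Each value is compared against the mean and sample standard deviation
    of the `window` readings immediately before it.
    """
    history: deque[float] = deque(maxlen=window)
    alerts = []
    for index, value in enumerate(values):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            # Skip flat windows to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                alerts.append(index)
        history.append(value)
    return alerts


# Hypothetical historian readings (e.g., a frequency tag in Hz) with one excursion.
readings = [50.0, 50.1, 49.9, 50.0, 50.1, 50.0, 49.9, 53.0, 50.0]
alerts = rolling_zscore_alerts(readings)
print(alerts)
```

Note that the excursion itself enters the window and widens the baseline for the next few readings. A production detector would need to handle that, along with the false-positive tuning any OT deployment requires.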

**The executive framing**:

> *"Your grid data tells an adversary exactly when and where to strike. It tells a competitor your capacity constraints. Sending it to a cloud AI for 'optimization' is not a technology decision. It is a national security and competitive intelligence decision. Local models on local hardware. Full stop."*

### Pillar 5: Asymmetric Payoff — Resilience Over Prevention

**Principle**: In power utilities, perfect prevention is impossible. The goal is to survive an attack and recover faster than the adversary can exploit the disruption.

**Antifragile Moves**:

- **Black start capability**: Maintain the ability to restart the grid from shutdown without external power
- **Grid islanding**: Design systems so that sections can disconnect and operate independently during disturbances
- **Manual override procedures**: Every automated system must have a documented, tested manual procedure
- **Redundant communication paths**: Power line carrier, microwave, satellite backup for SCADA and protection communications
- **Protection relay independence**: Electromechanical or static relays as backup for digital relays in critical paths

---

## The Rapid Modernisation Plan: Power/Utility Variant

### Phase 1: Hygiene (Days 0-30)

In addition to standard hygiene:

| Action | Owner | Deliverable |
|--------|-------|-------------|
| Inventory all OT assets: DCS, SCADA, EMS, protection relays, RTUs, AMI | OT Security / Engineering | OT asset inventory with vendor and firmware versions |
| Map all IT-to-OT network connections | Network / OT | Connection matrix with business justification per connection |
| Audit vendor remote access: who, how, when, for how long | OT Security / Procurement | Vendor access log and hardened policy |
| Identify OT systems with internet connectivity | Network | List with immediate remediation plan |
| Document manual override procedures for critical systems | OT Engineering | Procedure manual, signed off by operations and safety |
| Validate backup of EMS / DMS configurations | OT Engineering | Backup integrity test report |

### Phase 2: Control (Days 30-60)

| Action | Owner | Deliverable |
|--------|-------|-------------|
| Implement network segmentation: IT/OT DMZ with unidirectional gateway | Network / OT | Segmentation architecture and validated firewall rules |
| Harden vendor access: time-bounded, session-recorded, MFA with hardware tokens | OT Security | Vendor access gateway operational |
| Enable OT logging: historian, DCS, firewall, protection relay events | OT Security | Centralized OT log aggregation (air-gapped SIEM or historian) |
| Patch OT systems: test in lab, deploy in maintenance windows | OT Engineering | Patch management procedure with safety gates |
| Secure engineering workstations (EWS): application whitelisting, no internet | OT Security | EWS hardening standard deployed |

### Phase 3: Sovereignty (Days 60-90)

| Action | Owner | Deliverable |
|--------|-------|-------------|
| Deploy local AI for OT anomaly detection pilot | AI / OT Security | OT anomaly detection with false positive tuning |
| Validate black start / islanding procedures | Operations | Test report with time-to-recovery metrics |
| Conduct OT-specific tabletop exercise | Security / Operations | Exercise report with structural improvements |
| Implement firmware integrity monitoring | OT Security | Baseline hashes for critical OT firmware |
| Test protection relay fail-over to electromechanical backup | Engineering | Fail-over test report |

### Phase 4: Antifragility (Days 90-180)

| Action | Owner | Deliverable |
|--------|-------|-------------|
| Annual red team with bounded OT scope | Security | Red team report with kill chain analysis |
| Chaos engineering on non-safety IT systems | Resilience | Monthly experiment schedule and findings |
| Vendor exit architecture for critical OT platforms | Procurement / Engineering | 90-day vendor transition plan per critical system |
| Cross-training: operations staff on manual procedures | Operations | Training completion metrics |
| Participate in sector ISAC information sharing | Security | Threat intelligence integration report |

---

## Substation and Protection Specifics

### IEC 61850 Security

IEC 61850 (substation communication) uses GOOSE and Sampled Values (SV) that were not designed with security in mind.

**Hardening priorities**:

- **IEC 61850-90-20**: Implement cybersecurity recommendations for IEC 61850 networks
- **Authentication**: Digitally sign GOOSE messages where IEDs support it
- **Network segmentation**: GOOSE/SV traffic on dedicated VLAN; no routing to IT networks
- **IED hardening**: Disable unused services; change default passwords; enable logging
- **Configuration management**: Version control for SCL files; change detection for IED settings
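Change detection for IED settings amounts to comparing a device's current export against the version-controlled baseline and recording both values. A minimal sketch, assuming the settings have already been parsed into flat key/value pairs; the setting names below are hypothetical and do not follow any vendor's schema.

```python
from typing import Optional


def settings_diff(
    baseline: dict[str, str], current: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return {setting: (before, after)} for every added, removed, or changed key.

    A value of None means the setting is absent on that side.
    """
    changes = {}
    for key in sorted(baseline.keys() | current.keys()):
        before, after = baseline.get(key), current.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Hypothetical relay settings: baseline from version control, current from the device.
baseline = {"51P.pickup_A": "600", "51P.time_dial": "0.30", "21.zone1_reach_pct": "80"}
current = {"51P.pickup_A": "1200", "51P.time_dial": "0.30", "79.reclose_enabled": "true"}

diff = settings_diff(baseline, current)
for setting, (before, after) in diff.items():
    print(f"{setting}: {before} -> {after}")
```

An empty diff is the expected result outside an approved change window; anything else should raise an alert.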

### Protection Relay Security

Protection relays are the **safety-critical edge** of the grid. Compromise can cause physical damage.

| Control | Implementation |
|---------|---------------|
| Access control | Vaulted credentials; multi-person approval for settings changes |
| Logging | All settings changes logged with before/after values |
| Integrity | Cryptographic checksums for firmware and settings files |
| Redundancy | Independent protection schemes (e.g., distance + differential) |
| Manual backup | Electromechanical or static relay backup for critical digital protections |
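The integrity control can be illustrated with a SHA-256 baseline comparison. This is a sketch of checksum verification only: it detects change against a stored baseline and does not replace vendor signature verification. The file name and contents are placeholders, and in practice the baseline hash must be stored somewhere the device and its engineering workstation cannot modify.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large firmware images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_baseline(path: Path, expected_hex: str) -> bool:
    """Compare the file's hash with the recorded baseline (constant-time compare)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


# Demonstration with a placeholder file standing in for a firmware image.
with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "relay_fw.bin"
    image.write_bytes(b"example firmware image")
    baseline = sha256_of(image)  # recorded when the image was approved

    unchanged = matches_baseline(image, baseline)

    image.write_bytes(b"example firmware image (tampered)")
    tamper_detected = not matches_baseline(image, baseline)

print(unchanged, tamper_detected)
```

The same two functions apply unchanged to settings files, which is why firmware and settings are listed together in the integrity row.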

---

## Generation-Specific Considerations

### Thermal / Nuclear / Hydro / Renewables

| Generation Type | Specific Risk | Control |
|----------------|--------------|---------|
| **Thermal** | Turbine control system compromise | Dedicated turbine control network; no IT connectivity |
| **Nuclear** | Safety system interference | Air-gapped safety systems; regulatory compliance with national nuclear authority |
| **Hydro** | Dam control / spillway gate manipulation | Physical controls for critical water management; redundant level sensors |
| **Renewables** | Inverter-based resource (IBR) vulnerability | Secure firmware updates; anti-islanding protection; grid support function validation |

### Distributed Energy Resources (DER)

Solar, wind, and battery inverters connect to the distribution grid with varying security maturity.

- **Action**: DER interconnection standards must include cybersecurity requirements
- **Action**: Monitor DER communications for anomalous commands or settings changes
- **Action**: Aggregate DER visibility in DMS/ADMS without direct control paths

---

## Water and Wastewater Utilities

Water utilities share many characteristics with power but have additional concerns:

| Concern | Application |
|---------|-------------|
| **Safety** | Contamination prevention, pressure management, chemical dosing control |
| **SCADA/OT** | Treatment plant automation, distribution pump control, reservoir level management |
| **Criticality** | Water is life-sustaining; outages have immediate public health impact |
| **Regulation** | EPA (US), Drinking Water Inspectorate (UK), national health authorities |

**Additional controls for water utilities**:

- **Physical security** for treatment chemicals (chlorine, fluoride) to prevent intentional contamination
- **Redundant water quality sensors** with cross-validation
- **Manual override capability** for all automated chemical dosing systems
- **Isolation of IT from operational water quality monitoring**
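Cross-validation of redundant sensors can be as simple as median voting: take the median of the redundant readings as the consensus and flag any sensor that strays too far from it. The sketch below assumes at least three sensors measuring the same quantity. The sensor names, readings, and tolerance are illustrative, not recommended values.

```python
from statistics import median


def cross_validate(
    readings: dict[str, float], tolerance: float
) -> tuple[float, list[str]]:
    """Return (consensus value, sensors disagreeing with it by more than tolerance)."""
    if len(readings) < 3:
        # With fewer than three sensors a single bad reading cannot be outvoted.
        raise ValueError("median voting needs at least three redundant sensors")
    consensus = median(readings.values())
    outliers = sorted(
        name for name, value in readings.items()
        if abs(value - consensus) > tolerance
    )
    return consensus, outliers


# Hypothetical chlorine residual readings in mg/L from three redundant sensors.
readings = {"cl2_a": 0.52, "cl2_b": 0.50, "cl2_c": 1.40}
consensus, outliers = cross_validate(readings, tolerance=0.10)
print(consensus, outliers)
```

A flagged sensor may be faulty or may have been manipulated. Either way, the dosing loop should act on the consensus, not on the outlier, until an operator has checked it.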

---

## M365 in Power and Utilities

Corporate IT in power utilities uses M365 but must be strictly separated from OT.

| Consideration | Power/Utility Requirement |
|--------------|--------------------------|
| **Data residency** | M365 data in EU/national datacenters; verify tenant location |
| **Conditional access** | Block M365 access from non-corporate devices for privileged users; geo-restrict admin access |
| **Guest access** | Strictly prohibit in OT-connected tenants; heavily vet in corporate tenant |
| **Teams / SharePoint** | Never used for OT document sharing or control room communication |
| **Mobile device management** | Field engineer tablets Intune-managed; restricted app installation |
| **Email security** | EOP baseline minimum; Defender for Office 365 P2 recommended for critical infrastructure |

See [M365 E3 Hardening](../playbooks/m365-e3-hardening.md) for tactical hardening, then apply the overlays in the table above.

---

## The Controlled Burn Adaptation: When Greenfield Is Not an Option

The antifragile framework holds that organisations should build toward the ability to deploy greenfield — rebuild from scratch, on clean infrastructure, from version-controlled configuration. This is the ultimate expression of structural decoupling: if you can rebuild the environment, no adversary and no vendor holds you hostage.

Power utilities, water suppliers, and telecom network operators frequently view this principle as inapplicable. The grid does not go dark for a rebuild exercise. Protection relays cannot be factory-reset during a fault. OT systems operate under safety cases that require regulatory approval for any configuration change. The controlled burn, taken literally, cannot happen.

This is correct. It is also not the end of the conversation.

**The goal of greenfield capability is to eliminate inherited compromise and return to a known-good operational state.** For IT environments, the method is rebuild. For OT/NT environments, the method is different — but the goal is identical, and it is achievable. The absence of a literal rebuild path does not justify the absence of a recovery plan.

### The OT-Adapted Greenfield Stack

**Layer 1: IT greenfield protects OT.** The corporate IT environment, M365 tenant, SCADA servers, historian, engineering workstations, and HMI layer can almost always be made greenfield-capable even when OT hardware cannot. An adversary who compromises the IT layer and finds a clean rebuild path loses their persistence and pivot path without a single OT device being touched. IT greenfield is the outer perimeter of an OT environment that cannot be rebuilt itself. This is the first investment.

**Layer 2: OT configuration as code.** PLC logic, IED settings files, protection relay configuration archives, SCADA database snapshots, DCS export files — all of these belong in version-controlled backups with integrity verification. The ability to restore a known-good configuration to existing hardware is the OT equivalent of greenfield: the hardware remains, but the software state is wiped and rebuilt from a verified baseline. This is not a backup exercise. It is a discipline — with the same rigour that ASTRAL applies to M365 configuration, applied to OT configuration archives. Every piece of OT configuration that exists only in the device and nowhere else is a single point of failure.

**Layer 3: Manual operation as the fallback layer.** The ability to operate critical systems without the automation layer is, in practice, the ability to drop the compromised layer and continue service. A power utility that can maintain 70–80% of service from manual procedures during a SCADA compromise has a fundamentally different risk profile than one that cannot. Manual override procedures must be:
- Documented in detail, not just referenced in an emergency plan
- Tested under realistic conditions, not just reviewed in a tabletop
- Known by currently assigned operations staff, not just veterans who may have left
- Validated at least annually — capability that is not practised does not exist when it is needed

**Layer 4: Compartmentalisation as partial burn.** OT environments are typically sectionable. Grid islanding, substation isolation, plant-level control separation, and control centre failover allow the operator to sacrifice and rebuild one section while maintaining critical service in others. This is the OT equivalent of the controlled burn: localised rather than total, sequential rather than simultaneous, but governed by the same principle — designed-in ability to contain, recover, and restore without waiting for a complete environment to be clean.

**Layer 5: Planned long-cycle refresh.** OT systems have 20–40 year operational lifetimes, but those lifetimes should be a programme, not an accident. Organisations without a documented OT refresh schedule — with component-by-component replacement milestones, firmware escrow requirements, spare parts inventory targets, and vendor succession planning — are not avoiding greenfield. They are deferring it until a crisis forces it under the worst possible conditions: compromised hardware, unavailable vendors, missing documentation, and no tested procedures.

### The Acceptance Statement

Some OT components in critical infrastructure genuinely cannot be replaced on any timescale that security planning can influence. Legacy protection relays on operational transmission lines. Nuclear instrumentation systems under active safety cases. Water treatment chemical dosing controllers that predate the organisation's current IT function.

For these systems, the correct position is explicit acceptance, not avoidance:

1. **Name them.** Identify specifically which systems are outside the rebuild envelope and why.
2. **Isolate them.** The isolation must be proportional to the acknowledged unrepairability. A system that cannot be patched, cannot be replaced, and cannot be rebuilt must be surrounded by compensating controls so thorough that its compromise cannot propagate.
3. **Monitor them obsessively.** Configuration integrity monitoring, network traffic baselining, and anomaly detection for these specific systems — because when you cannot fix the asset, detection and containment are the only remaining defences.
4. **Plan their eventual replacement.** "This system cannot be replaced in the current operational context" is acceptable. "This system will never be replaced" is not a security posture — it is a deferred decision that will be made under worse conditions later.

The acceptance statement is not a sign of weakness. It is the honest foundation of a credible security programme. Regulators, insurers, and incident responders all prefer an organisation that knows exactly where its limits are and has compensating controls in place over one that claims no limits and has no plan.

### The OT Greenfield Test

*"If our IT and SCADA layers were fully compromised tonight: could we maintain critical service from manual procedures within 4 hours? Rebuild the IT layer from clean baselines within 48 hours? Restore full automated operation from verified OT configuration backups within two weeks? And have we actually tested each of these in the past 12 months?"*

If any answer is no, the gap is in manual procedures, IT rebuild capability, OT configuration management, or test cadence — not in the impossibility of the OT environment itself.
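The test can be tracked as data, not just answered from memory. The sketch below encodes the three recovery targets from the question (4 hours, 48 hours, two weeks) and the 12-month test requirement. The `Capability` record, the measured times, and the test dates are hypothetical examples.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Capability:
    """One line of the OT greenfield test (hypothetical record layout)."""
    name: str
    target_hours: float
    measured_hours: Optional[float]  # None if recovery time was never measured
    last_tested: Optional[date]      # None if the capability was never tested


def gaps(
    capabilities: list[Capability],
    today: date,
    max_age: timedelta = timedelta(days=365),
) -> dict[str, list[str]]:
    """Return {capability name: reasons it fails the test}; passing ones are omitted."""
    findings: dict[str, list[str]] = {}
    for cap in capabilities:
        problems = []
        if cap.last_tested is None or today - cap.last_tested > max_age:
            problems.append("not tested in the past 12 months")
        if cap.measured_hours is None:
            problems.append("no measured recovery time")
        elif cap.measured_hours > cap.target_hours:
            problems.append(
                f"measured {cap.measured_hours}h exceeds target {cap.target_hours}h"
            )
        if problems:
            findings[cap.name] = problems
    return findings


# Illustrative status as of a hypothetical review date.
today = date(2025, 6, 1)
capabilities = [
    Capability("manual operation of critical service", 4, 3.5, date(2025, 2, 10)),
    Capability("IT layer rebuild from clean baselines", 48, 60, date(2024, 11, 5)),
    Capability("OT restore from verified configuration backups", 14 * 24, None, None),
]
findings = gaps(capabilities, today)
for name, problems in findings.items():
    print(f"{name}: {'; '.join(problems)}")
```

Each finding maps directly to one of the four gap areas named above: manual procedures, IT rebuild capability, OT configuration management, or test cadence.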

---

## Evidence Package for Regulators

| Requirement | Evidence from Antifragile Program |
|------------|----------------------------------|
| NIS2 risk management | Kill chain analysis, T0 asset classification, IT/OT connection matrix |
| NIS2 incident handling | IR runbooks, OT-specific response procedures, quarterly drill reports |
| NIS2 business continuity | Black start test reports, islanding validation, manual procedure verification |
| NIS2 supply chain security | Vendor risk register, firmware provenance, vendor exit architectures |
| NIS2 encryption | Data classification with encryption mapping, TLS configuration audits |
| NIS2 vulnerability handling | Vulnerability scan reports with safety-impact prioritization |
| CER resilience | Chaos engineering results, cross-training metrics, spare parts inventory |

---

*Previous: [NIST CSF Mapping](nist-csf-mapping.md)*
*Next: [Vertical: Telco](vertical-telco.md)*