# The Definitive Guide to AI Data Centers: Siting, Design, Build, Commissioning & Operations (2026 and Forward)

This is the master Table of Contents for the guide: 17 Parts (numbered 0–16) plus Appendices A–G. The guide is global and international in scope — it does not assume a single-region default; design-basis forks for region (50/60 Hz, IEC vs NEC/ANSI, EU/UK/Nordic/Middle East/APAC regimes) are woven throughout Parts 3, 4, and 6 and consolidated in Appendix G. The spine follows the project lifecycle — from strategy and project delivery through siting, power, electrical, cooling, the building, compute, networking, storage, software, security, reliability, commissioning, operations, sustainability, and the future. It is current to 2026 and deliberately forward-looking.

Cross-reference bullets ("→ see Chapter X.Y") mark where a concept's canonical derivation lives elsewhere; it is defined once and referenced everywhere else.

Status: Phase 1 — structure (pending sign-off).

---

## PART 0 — Foundations & How to Use This Guide

### Chapter 0.1 — Orientation: The AI Data Center as a Single Co-Designed Machine
- The thesis: compute + fabric + power + cooling + software are one system, optimized for TCO and goodput, not siloed subsystems
- Why 2024–2026 is an inflection: chip-bound → power-bound; racks from ~10–15 kW to 120–600 kW; inference overtaking training
- The lifecycle spine of this guide: Strategy → Delivery → Siting → Power → Electrical → Cooling → Building → Compute → Networking → Storage → Software → Security/Reliability → Commissioning → Operations → Sustainability → Future
- Two parallel tracks that must interlock: facility (power/cooling/building) and IT/cluster (compute/fabric/storage/scheduler)
- The "grid-to-chip" power spine and "chip-to-atmosphere" heat spine as recurring narrative threads

### Chapter 0.2 — How to Read This Guide: Decisions, Consequences & Reference Data
- Decision-and-consequence framing: every fork names the tradeoff and the downstream cost
- The "what happens if you choose X vs Y" pattern across cost, power, reach, latency, reliability, serviceability, schedule
- Navigating by lifecycle stage vs by engineering discipline vs by workload archetype
- Reference artifacts: design-basis sheets, scalable-unit (SU) budgets, decision matrices, scorecards
- Numbers provenance and vintage: how to read volatile forecasts and date-stamped claims
- Altitude rule enforced throughout: a roadmap forward-pointer is a bullet, never a chapter — per-part roadmaps consolidate into Chapter 16.2

### Chapter 0.3 — Vocabulary, Mental Models & the Metric Stack
- Power vocabulary: MW vs MVA vs kVA, critical IT power vs facility power vs grid draw, EDPp vs TDP
- Capacity units: the rack, the NVLink domain, the scalable unit (SU), the pod, the gigawatt campus
- The metric stack at a glance: PUE/WUE/ERF/REF/CUE, MFU/MBU, goodput/ETTR, tokens-per-joule, $/GPU-hr, $/M-tokens → canonical definitions in Chapter 15.1; here we only orient the reader
- Availability algebra: MTBF/MTTR, serial vs parallel composition, the nines and their downtime budgets
- The three-tier network hierarchy: scale-up vs scale-out vs scale-across

### Chapter 0.4 — The Standards & Specifications Landscape (Living Index)
- Facility resilience: Uptime Tier I–IV, TIA-942-C Rated 1–4, EN 50600/ISO 22237 Availability Classes
- Thermal & efficiency: ASHRAE TC 9.9 (air A1–A4, liquid W/H classes), ISO/IEC 30134 KPI family
- Open hardware: OCP (Open Rack v3, Open Rack Wide, Mt. Diablo/Diablo 400, ORV3, S.A.F.E., Caliptra)
- Security & compliance: SOC 2, ISO 27001/42001, FedRAMP/FedRAMP 20x, CMMC, NIST families
- Reference architectures: NVIDIA DGX SuperPOD/MGX/DSX, hyperscaler designs; how to map them vendor-neutrally

### Chapter 0.5 — Reliability, Redundancy & Availability: The Design-Basis Primer
- The redundancy vocabulary defined up front for use throughout Parts 1–6: N, N+1, N+2, 2N, 2(N+1), distributed-redundant, block-redundant, and "catcher" topologies
- Fault domains, blast radius, and single-points-of-failure as a recurring design lens
- Concurrent maintainability vs fault tolerance; the Uptime Tier I–IV and TIA-942 Rated 1–4 ladder at a glance — full treatment in Chapter 12.1
- Availability vs goodput: why AI factories optimize a different target than traditional IT — full treatment in Chapter 12.2
- The primer feeds the quantitative machinery: RBD/Markov/Monte-Carlo availability math is built out in Chapter 12.5
- Reading a redundancy spec and mapping it to cost / schedule / serviceability; forward pointer to the redundancy-topology selector in Appendix C

---

## PART 1 — Strategy, Workload Archetypes & Economics

### Chapter 1.1 — The Archetype Decision Framework: Workload Is the Master Variable
- The five workload archetypes: pre-training, post-training (SFT/RLHF/RL), online inference, batch inference, edge inference
- The four procurement archetypes: greenfield self-build, retrofit, colocation (wholesale & retail), neocloud/GPU rental
- The requirements cascade: how archetype propagates into density, cooling, fabric, storage, redundancy, siting, TCO
- Reversible vs irreversible decisions; the cost of re-deciding at each lifecycle stage
- Scoping artifacts: workload profile sheet, capacity ramp curve, design-basis document
- Anti-patterns: training fabric for an inference business; retrofitting past the air-cooling cliff; over-provisioned redundancy for checkpointable jobs

### Chapter 1.2 — Training Data Centers: Synchronous, Dense, Checkpointable
- Pre-training as one tightly-coupled supercomputer: all-reduce/all-gather/reduce-scatter regime
- Scale-up domain design and scale-out fabric: NVLink domain sizing, rail-optimized fat-tree, oversubscription choices
- Power density and cooling mandates (120–140 kW GB200 NVL72, DLC, warm-water loops)
- Reliability philosophy: checkpoint-and-resume tolerance, MTBF realities, straggler economics
- Checkpointing as a training constraint → Young/Daly optimal-interval math canonical in Chapter 9.4; here we cover only its bearing on synchronous training
- Storage for training: checkpoint bandwidth, dataset streaming, LOSF
- Multi-datacenter and gigawatt-campus training: hierarchical/async SGD, inter-campus fiber → cross-region fabric engineering in Chapter 8.8

### Chapter 1.3 — Inference Data Centers: Bursty, Distributed, Always-On
- Inference taxonomy: interactive/online vs batch/offline vs agentic/long-context
- The economics shift: inference as 60–90% of compute draw and the revenue workload
- Reasoning and test-time compute as a demand multiplier: post-o1/R1 long-decode reasoning shifts the prefill:decode ratio, inflates decode share and KV-cache pressure, and drives fleet sizing → serving engineering in Chapter 10.11
- Latency-driven siting and regional/Tier-2 distribution
- Disaggregated inference: prefill (compute-bound) vs decode (memory-bound); KV-cache as a first-class resource
- Reliability and uptime expectations (2N/Tier IV vs N+1 + geo-redundancy)
- GPU vs inference ASIC selection; tokens-per-watt and tokens-per-dollar

### Chapter 1.4 — Post-Training, Fine-Tuning & RL: The Hybrid Middle
- The post-training landscape: SFT, RLHF/RLAIF, GRPO/PPO, large-scale RL for reasoning
- Why RL is inference-heavy training: rollout generation vs policy update
- Asynchronous and disaggregated RL infrastructure; staleness-constrained rollouts
- LoRA/QLoRA vs full fine-tune sizing; resource heterogeneity
- Capacity planning for spiky, project-based demand

### Chapter 1.5 — Edge Inference & Distributed Micro-Datacenters
- Defining the edge: on-prem appliance, telco/MEC, Tier-2 metro colo, CDN-adjacent grids
- Latency budgets and the 30/50/100 ms perceptibility thresholds
- Power/thermal constraints at the edge; containerized/modular form factors
- Lights-out fleet operations and zero-touch provisioning
- Edge economics vs centralized regional inference; when edge is wrong

### Chapter 1.6 — Procurement Archetypes: Build vs Buy vs Rent
- Greenfield self-build, retrofit limits, wholesale/retail colocation, neocloud/GPU rental
- Hybrid procurement: burst-to-neocloud, colo-anchor-plus-cloud-overflow, build-core-rent-edge
- Decision drivers: time-to-power, capital intensity, control, scale, workload duration, risk appetite
- Build-vs-own-vs-lease as a strategic fork: powered-shell vs build-to-suit vs colo-lease optionality under demand uncertainty → quantitative NPV treatment in Chapter 1.8, structuring/accounting in Chapter 2.5
- Behind-the-meter generation vs grid interconnection as a procurement gate

### Chapter 1.7 — The Requirements-and-Consequences Matrix
- Density-to-cooling-cliff map; cooling modality selection by archetype
- Network fabric sizing by workload; GPU:CPU, GPU:memory, GPU:storage ratios
- Storage architecture and redundancy tier mapped to workload tolerance
- Siting: power-first vs latency-first; reference design-basis sheets per archetype

### Chapter 1.8 — Business Models, Economics & ROI
- The AI data center as a capital asset: cost stack, TCO, capital recovery factor, costing denominators
- GPU economics, depreciation & obsolescence (canonical home): the 2–3 yr economic vs 5–6 yr book life debate, the training-to-inference cascade model, residual values — the contested load-bearing figures bind to the dated forecast register in Appendix D; Part 7 covers procurement execution, Part 14 covers refresh execution, Part 16 covers macro/financing
- Pricing, utilization & revenue management: $/GPU-hr ladder, the ~70% breakeven, $/M-tokens, revenue-per-MW tiering
- Inference revenue & unit economics (canonical home): cost/revenue-per-Mtoken build-up (tokens/GPU-sec → $/Mtoken), the application-layer gross-margin waterfall, and the ~10x/yr token-price deflation (LLMflation) as a first-class inference-ROI risk → serving-engineering drivers in Chapter 10.11, billing in Chapter 10.9, $/Mtoken calculator in Appendix C
- Build-vs-own-vs-lease economics (quantitative home): $/kW-month colo benchmarks → $/MW-yr TCO, build-vs-own-vs-lease NPV, leasing optionality value under demand uncertainty → strategy framing in Chapter 1.6, structuring in Chapter 2.5
- Financing STRATEGY (the why/whether): GPU-collateralized debt, DDTLs, SPVs, securitization, circular financing → the deal mechanics live in Chapter 2.5
- The four operating archetypes: hyperscaler vs neocloud vs colo/build-to-suit vs self-build
- Protecting ROI: stranded-asset risk, depreciation policy, power-cost lever, design-for-flexibility, the ROI scorecard
- Stress testing the downside: utilization collapse, residual shock, rate spike, contract non-renewal, the secondary-GPU-market-depth (Burry) thesis
- Altitude marker: this is the firm-level economic objective function; sector-macro in Chapter 16.4, structural scenarios in Chapter 16.5

---

## PART 2 — Project Delivery, Schedule, Procurement, Contracts & Risk

### Chapter 2.1 — Program & Project Management: The Integrated Master Schedule & Critical Path
- The project lifecycle and phase-gate model; the "time-to-first-train" race as a program-management problem
- Building the Integrated Master Schedule (IMS); the critical path and float management across power, building, and IT tracks
- Schedule risk analysis: Monte Carlo, P50/P90 dates, the long poles (interconnection, transformers, air permits)
- Owner's project controls: earned value, milestone deposits, change-order and claims management
- The facility-vs-cluster two-track schedule and the integration milestones that bind them
- Stage-gate governance, board approvals, and the assumptions/decisions register

### Chapter 2.2 — Delivery Models & the Owner's Organization
- EPC vs design-build vs design-bid-build vs progressive/IPD: risk allocation, speed, and price certainty tradeoffs
- Owner's representative, owner's engineer, and the commissioning agent as independent roles
- Owner-furnished equipment (OFE) vs contractor-furnished: GPUs, transformers, gensets, chillers
- Multi-prime vs single-GC strategy; design-assist and early contractor involvement
- Splitting the base building from the IT fit-out; the powered-shell handoff
- Decision drivers: schedule, cost certainty, scope maturity, in-house capability, repeatability across a portfolio

### Chapter 2.3 — Long-Lead Procurement & the End-to-End Equipment Supply Chain
- The consolidated lead-time register: transformers (distribution 2–3 yr, HV/GSU 4–5 yr), switchgear, gensets, gas/aeroderivative turbines, chillers/CDUs, GPUs/CoWoS/HBM, gas-pipeline lateral capacity
- Long-lead procurement as a schedule discipline: reservation deposits, slot-reservation agreements as a distinct contracting instrument, OFE staging, buffer inventory
- Allocation dynamics and vendor financing; the GPU and HBM allocation game → upstream silicon detail in Chapter 7.6, advanced-packaging engineering in Chapter 7.7
- Geographic concentration & single-source risk: transformers, the copper crunch, grid-scale switchgear
- Tariffs, trade policy & export controls on imported equipment; nearshoring and dual-sourcing
- The supply-chain control tower: expediting, factory acceptance, logistics of oversized/heavy loads

### Chapter 2.4 — The Contract Stack & Commercial/Legal Framework
- The end-to-end contract map: ISA/LGIA, PPA, EPC, equipment supply agreements, colo/MSA, O&M, GPU-cloud SLAs
- Risk allocation across the stack: who carries schedule, performance, and force-majeure risk
- Liquidated damages, performance guarantees, warranties, and bonding
- Equipment supply agreements, slot-reservation contracts, and the OFE interface warranties
- Interconnection and grid-side agreements as commercial instruments (cost allocation, milestones, security)
- Tenant-facing commercial terms → consumption/SLA productization detail in Chapter 10.9

### Chapter 2.5 — Project Finance & Capital Formation (Mechanics)
- Project finance vs corporate finance; SPV structuring and security packages
- The underwriting / project financial model: the multi-year pro-forma (capacity ramp → revenue → opex → debt service → levered/unlevered cash flow), WACC/hurdle rate, DCF → IRR/NPV/DSCR/equity-multiple, contracted-vs-merchant split driving debt capacity, the sensitivity tornado → companion levered-IRR model in Appendix C, chained to the Chapter 1.8 ROI scorecard
- Anchor-tenant/offtake/contracted-revenue underwriting; the lender's view of GPU-cloud demand
- Counterparty, concentration & residual-liquidity risk: single-anchor-tenant concentration, takeout/refinancing risk, and secondary-GPU-market depth as the load-bearing assumption under every ABS/residual structure → downside stress in Chapter 1.8
- The PPA as the second underwriting pillar: firm-power contract as a financing precondition, power-tenor-vs-debt-tenor matching, merchant/curtailment in the lender's downside → energy-supply strategy in Chapter 3.4
- Lease vs own: sale-leaseback of the shell and of the GPUs; operating vs finance lease treatment → build-vs-lease economics in Chapter 1.8
- Tax equity / ITC for on-site renewables and storage; PTC monetization
- Construction-loan-to-perm-takeout mechanics; DDTLs and draw schedules
- GPU-backed ABS and ratings-agency treatment; the "circular financing" debate examined
- Developer-vs-operator-vs-capital-provider deal structures: JV, build-to-suit, forward-funding
- High-level financing strategy → see Chapter 1.8; here we cover the deal mechanics

### Chapter 2.6 — Insurance & Risk Transfer
- Builder's risk / construction all-risk; delay-in-startup (DSU) and the schedule-risk overlay
- Property & business-interruption insurance for GPU-dense, liquid-cooled facilities
- The liquid-cooling and lithium/BESS insurability problem; insurers repricing the new risk pool
- How insurability drives design: FM Global approvals gating cooling and battery choices → fire-design in Chapter 6.5
- GPU-collateral lender insurance requirements; valuation and total-loss treatment
- Parametric/weather, environmental liability, and cyber insurance for the AI campus

### Chapter 2.7 — Simulation-Driven Design & the Digital Twin as a Design-Validation Tool
- CFD thermal modeling of data halls, CDUs, and liquid loops before steel is cut
- Power-system simulation: load-flow, short-circuit, protection coordination, EMT for inverter-rich plants
- Fabric/network simulation: topology, congestion, and collective-performance modeling pre-build
- Discrete-event and queueing simulation for capacity, scheduling, and goodput projection
- AI-assisted design (generative layout, automated single-line/P&ID drafting, design copilots) as a delivery accelerant
- The design-validation digital twin vs the operational DCIM twin (Chapter 14.2) and the quantitative availability model (Chapter 12.5); model handoff and calibration
- Co-simulation and the "validate-before-commission" discipline that de-risks IST

---

## PART 3 — Site Selection, Power Procurement & Permitting

### Chapter 3.1 — Site Selection Strategy & the Reordered Criteria Hierarchy
- The 2024–2026 reordering: from latency/land-first to power/speed-first
- Speed-to-power as the gating screen; workload-driven siting
- Single gigawatt campus vs distributed multi-region fabric
- Build-to-suit vs powered-shell vs colocation vs greenfield self-develop
- The site selection funnel and time-value of speed

### Chapter 3.2 — Grid Interconnection, Queues & Speed-to-Power
- Anatomy of the interconnection process (feasibility → SIS → facilities study → ISA/LGIA)
- Queue reform: FERC Order 2023, cluster studies, first-ready-first-served; queue attrition and "phantom load"
- Large-load interconnection regimes: ERCOT (SB6, NPRR1234), PJM co-location, MISO, SPP, CAISO, vertically integrated utilities
- Non-US large-load interconnection regimes: EU/UK (Ofgem, National Grid ESO queues), Ireland (EirGrid constraints/moratorium), the Nordics, the Middle East, India, and APAC (Japan; Singapore & Malaysia moratoria) — how the process differs outside FERC/ISO-land
- Federalization collision: DOE Section 403, FERC large-load NOPR (RM26-4), state pushback
- Curtailable/flexible/non-firm load as the accelerant; bridge and phased energization
- Transmission buildout, 765 kV backbones, and long-lead equipment as a co-equal gate
- Grid-impact process mechanics owned here; the macro load-growth narrative → see Chapter 16.1; grid-services-as-revenue → see Chapter 15.8

### Chapter 3.3 — Power Availability & Power-Cost Structure
- Transmission proximity, voltage class, and substation headroom analysis
- Nodal/LMP pricing, congestion, capacity/demand charges, curtailment exposure
- Network upgrade cost allocation and the "who pays" question
- Power as 25–60% of opex and the dominant TCO lever

### Chapter 3.4 — Energy Supply Strategy: Grid PPA, BYOP & Co-Location
- Grid-only, grid + bridge, behind-the-meter island, hybrid/co-located portfolios
- PPA structures: physical vs virtual (VPPA), fixed vs indexed, 24/7 CFE matching, additionality
- Power-market risk management: nodal basis risk, VPPA shape/settlement risk (including negative settlement), CRRs/FTRs, heat-rate hedging — framed for ERCOT-merchant exposure → downside stress tests in Chapter 1.8, lender downside in Chapter 2.5
- Long-duration firm supply: nuclear restarts/uprates, SMRs, renewables + storage firming
- Co-location regulatory treatment (FERC PJM order, CIRs, cost-causation, ratepayer shielding)
- Energy cost modeling and the power-tenor-vs-GPU-life matching problem

### Chapter 3.5 — On-Site & Bring-Your-Own-Power Generation (Energy-Supply Strategy)
- The speed-to-power crisis and the bridge-power thesis (3-phase deployment sequence)
- Natural gas reciprocating engines (RICE) and gas turbines (aeroderivative/industrial/heavy-frame): the why/whether, efficiency, turndown, the supply-chain bottleneck and refurbished units
- Combined cycle, CHP & thermodynamic optimization as an energy-strategy choice
- Fuel cells (solid-oxide & beyond): the permitting advantage
- Nuclear: existing-plant restarts, uprates & SMRs (the 2030s solution, not a speed fix)
- Renewables-anchored & hybrid off-grid generation (overbuild, firming, 24/7 CFE)
- Gas-supply interconnection as a co-equal critical path: firm vs interruptible pipeline transport, lateral/midstream buildout lead time, LNG/CNG bridge fuel, and supply-interruption-as-firmness-risk → lead-time register in Chapter 2.3, fuel-process engineering in Chapter 4.9
- Fuel logistics, supply security & dual-fuel strategy
- Emissions, air permitting, water & regulatory compliance for on-site generation
- Microgrid/islanding regulatory strategy (FERC colocation, ERCOT vs PJM vs regulated)
- Electrical integration — sizing, paralleling, step-loading, protection → covered in Chapter 4.8; this chapter is deliberately strategy-only

### Chapter 3.6 — Fiber, Latency & Network Connectivity (Secondary Screen)
- Latency physics and budgets by workload; distributed training over WAN limits
- Inference proximity: edge/metro siting for sub-50 ms
- Fiber route diligence: carrier-neutral, route diversity, lit vs dark, conduit ownership
- Proximity to IXPs, cable landing stations, backbones; long-haul DWDM/OTN for inter-campus links
- Building the fiber path: middle-mile gaps, make-ready, new-build permitting

### Chapter 3.7 — Water Availability, Sourcing & Climate-Driven Cooling Strategy (Siting Gate)
- Why water is a top-3 screen; the energy-water nexus as a siting gate (operational stewardship → see Chapter 15.4)
- Water sourcing: potable vs reclaimed/non-potable vs groundwater vs surface vs produced water
- Water rights, withdrawal and discharge permits as siting gates; water-allocation and curtailment-priority vs agricultural/municipal claims → water-curtailment FMEA in Appendix F
- Climate as cooling-strategy determinant; free-cooling hours; the water-vs-energy trilemma
- Heat reuse and district heating as a siting advantage; drought and supply-security diligence

### Chapter 3.8 — Land, Geotechnical, Seismic & Flood Diligence (Secondary Screen)
- Land sizing, buildable-area constraints, developable ratio
- Geotechnical investigation, foundations for heavy/dense liquid-cooled racks
- Seismic hazard and flood risk; topography, grading, brownfield vs greenfield
- Environmental due diligence (Phase I/II ESA, species, cultural/archaeological)
- Title, easements, mineral rights, assemblage, and land-control instruments

### Chapter 3.9 — Permitting, Regulatory, Environmental & the Critical Path
- The four-layer jurisdictional matrix (federal/state/grid operator/local) and the 50+ approval permit register
- Air permitting & backup-power emissions: NSR/PSD, Title V, attainment vs nonattainment, the "temporary turbine" question
- Water, wastewater & cooling permits: NPDES, 404/401 wetlands, WUE disclosure laws, PFAS frontier
- NEPA & environmental review: federal nexus, categorical exclusions (EO 14318), CEQ reform, ESA/NHPA, state "little NEPA"
- Regional permitting regimes beyond the US: EU environmental directives & EIA, UK planning consent, Nordic, Middle East, and APAC approval matrices
- Noise, acoustics & local land-use as a permitting matter (LFN/tonal hum, ordinances, setbacks, by-right vs special-use zoning) → the engineering response in Chapter 6.8
- Identifying the long poles: interconnection and air permits as the critical path

### Chapter 3.10 — Tax Incentives, Fiscal Structuring & Economic Development
- The incentive stack: sales/use exemptions, property abatements, negotiated energy rates, grants
- Qualification thresholds, PILOTs, clawbacks, and the incentive "arms race"
- Negotiated large-load tariffs (take-or-pay, minimum-demand, cost-causation)
- Global incentive & fiscal landscape: EU state-aid limits, Middle East sovereign incentives, APAC special economic zones
- Incentive durability risk and NPV modeling into TCO

### Chapter 3.11 — Community Relations, Opposition & Social License
- The opposition landscape: ratepayer cost-shift, water, noise, property values, grid strain
- Moratoria, the rezoning battlefield, and the bipartisan opposition coalition
- Community benefits agreements, host fees, PILOT structures, decommissioning bonds
- Early engagement vs stealth assembly; litigation-hardening of the record

### Chapter 3.12 — Geopolitics, Sovereignty, Export Controls & Data Residency
- Sovereign AI as a siting driver; US chip export controls and country tiering
- Data residency vs technical sovereignty; CLOUD Act vs GDPR; geopatriation
- Energy-geopolitics, supply-chain dependence, allied vs non-allied siting

### Chapter 3.13 — Market Clusters & the Site-Scoring Playbook
- Comparative deep dives across first-class clusters: the US markets (Northern Virginia/PJM, Texas/ERCOT, US secondary markets); EU (Ireland, the Nordics, Iberia); the Middle East (UAE, Saudi Arabia); and APAC (Japan, Singapore, Malaysia, India); plus the Gulf and declining markets
- Weighted scoring matrices, hard pass/fail gates, and risk-adjusted scoring
- Stage-gated diligence (desktop → field → binding); backup-site and portfolio strategy
- Decision artifacts: site-selection memo, board package, assumptions register; sequencing land vs power

---

## PART 4 — Electrical & Energy Infrastructure

### Chapter 4.1 — Power Topology Foundations & Voltage Selection
- The end-to-end power chain: HV transmission → on-site substation → MV → LV/transformer → UPS → PDU/busway → rack → DC bus → VRM → chip
- Voltage taxonomy (HV/MV/LV and DC rails: 12V/48V/±400V/800V) and conversion-stage accounting
- Global electrical standards as a design-basis fork: 50 Hz vs 60 Hz, regional voltage classes (400/230 V IEC vs 480/277 V ANSI), the system-earthing regime (TN-S/TT/IT → engineered in Chapter 4.11), and IEC vs NEC/ANSI codes — how region reshapes topology, equipment selection, and protection
- Designing to EDPp (peak) not TDP; per-rack density tiers and the roadmap
- Procurement-as-design-constraint: lead times reshaping topology → lead-time register in Chapter 2.3

### Chapter 4.2 — Utility Interconnect, On-Site Substation & MV Distribution
- HV-to-MV step-down, GIS vs AIS switchgear (SF6/SF6-free)
- MV distribution architectures: radial vs primary/secondary-selective vs ring
- IEC vs ANSI switchgear and protection standards, and regional grid-code compliance (ENTSO-E, GB Grid Code) alongside the US/NERC view
- Pod/block transformer sizing; protective relaying (IEC 61850), fault current, arc-flash
- The protection & coordination study as a deliverable: short-circuit/fault-duty analysis, time-current-curve coordination (IEEE 242), arc-flash incident-energy (IEEE 1584/NFPA 70E), and the inverter-dominated-fault-current problem (current-limited ~1.1–1.5x breaking conventional overcurrent coordination)
- Campus-scale single-line patterns for 100 MW–1 GW factories

### Chapter 4.3 — Substation & Transmission Ownership, Operations & NERC Compliance
- Utility-owned vs customer-owned substation: capex, schedule, control, and risk tradeoffs
- Who operates and maintains the HV yard; switching coordination across the point of interconnection
- Relay setting and protection coordination spanning the utility–customer boundary
- NERC CIP for large transmission-connected loads: the CIP-002 → CIP-015 cluster, applicability, registration, and the compliance program (canonical home; cross-referenced from Part 11 security)
- Reliability obligations of a transmission-connected entity, with the standards named: TPL-001/PRC (ride-through and protection), MOD-032/026/027 (model and steady-state/dynamic data submittals to the TP/PC), EOP-004/PRC-004 (event and disturbance reporting), and the large-load interconnection study deliverables
- The grid-interactive engineering (ride-through, reactive support, frequency response) → see Chapter 4.10
- The HV-yard EHS and qualified-worker interface → see Chapter 6.9

### Chapter 4.4 — Transformers, Harmonics & the AI Non-Linear-Load Problem
- Transformer families (liquid-filled, dry-type, cast-resin) and selection, led by their harmonics/active-front-end relevance
- K-factor and harmonic derating; why AI halls are 100% non-linear
- IEEE 519 compliance, harmonic mitigation, active front-end rectifiers
- Solid-state transformers (SST) as the disruptor (canonical home): MV→800 VDC single-stage, ~99% efficiency, collapsing the conventional 4.2/4.4/4.6 conversion chain → integration into DC-disaggregated designs in Chapter 4.7

### Chapter 4.5 — UPS & Energy Storage: From Ride-Through to Transient Absorption
- UPS topologies (double-conversion vs eco-mode), efficiency curves, when bypass is acceptable
- Central UPS vs rack-level BBU (OCP ORV3) vs distributed/block-redundant/"catcher" topologies
- Battery chemistry shift (VRLA → LFP), runtime sizing, supercapacitor ride-through
- The power-transient problem (canonical home): synchronized GPU load steps quantified; the chip→BBU→BESS mitigation spine (GPU/on-package capacitance → rack BBU → facility BESS) with the on-die origin in Chapter 7.12 and the cooling-side twin in Chapter 5.12 — metering, acceptance, and ops cross-reference here
- Energy-storage placement in DC-disaggregated designs

### Chapter 4.6 — LV Distribution: Busway, PDUs, RPPs & Rack Power
- Busway vs cable+PDU; tap-off units, rack whips, IEC 60309 connectors
- 415V 3-phase rack PDUs, A/B dual feeds, branch-circuit and outlet metering
- OCP power shelves and the 48V busbar; copper mass and the I²R wall
- Liquid-cooled busbars and thermal-electrical co-design

### Chapter 4.7 — The DC Power Revolution: 48V → ±400V → 800V & Disaggregated Sidecar Power
- The physics case for DC; ±400VDC (Mt. Diablo/Diablo 400) vs 800VDC (NVIDIA reference)
- The SST path into DC distribution: MV→800VDC single-stage feeding the rail directly → canonical SST treatment in Chapter 4.4
- Disaggregated "sidecar" power racks and the collapsed conversion chain (>92% e2e)
- Building blocks: SiC/GaN, LLC resonant DC-DC, PCB midplanes
- DC protection and safety: SSCBs, arc-flash, touch-safety; DC-bus grounding and isolation/ground-fault monitoring → canonical home in Chapter 4.11
- Standards/interoperability divergence; migration, coexistence, and when AC still wins

### Chapter 4.8 — On-Site Generation: Electrical Integration
- Electrical integration of behind-the-meter prime/island power (energy-supply strategy → see Chapter 3.5; the fuel/gas-process side → see Chapter 4.9)
- Generator sizing, block/step loading, and BESS as the bridge for transient acceptance
- Generator paralleling, synchronization, and bus-tie schemes
- Microgrid architectures and protection; grid-forming vs grid-following inverters
- Inertia and stability support: synchronous condensers, flywheels, grid-forming BESS
- Islanding and grid-interconnect electrical transitions (commissioning → see Chapter 13.4)

### Chapter 4.9 — Fuel-Supply & Gas-Process Engineering
- The molecule-side companion to Chapter 4.8's electron-side: the engineering of getting fuel to the prime movers
- Pipeline tie-in, custody-transfer metering, and pressure regulation/let-down stations
- Fuel-gas conditioning: filtration, scrubbing, dew-point and Wobbe-index control, heating for turbine inlet specs
- On-site fuel-gas compression and the boost-compressor train for aeroderivative turbines
- LNG/CNG storage, vaporization, and the bridge-fuel logistics chain
- Dual-fuel changeover (gas ↔ distillate) and on-site diesel/distillate storage for firmness
- Gas detection, venting, relief, and process-safety interlocks → energy strategy in Chapter 3.5, air permitting in Chapter 3.9, EHS/PSM in Chapter 6.9, commissioning in Chapter 13.4

### Chapter 4.10 — Grid-Interactive Behavior: Ride-Through, Reactive/Voltage Support & Frequency Response Toward the POI
- The engineering of fault-ride-through at the point of interconnection: SEL/relay, UPS, and undervoltage-load-retention settings that keep a gigawatt load online through a grid disturbance
- Reactive power and voltage support as a grid-code and tariff obligation: Mvar/power-factor requirements, dynamic-VAR and STATCOM provision, capacitor/reactor banks
- Primary frequency response and the load-side response to grid frequency excursions
- The 2025–26 NERC Level-3 alert as the motivating case: ~1,500 MW of data-center load lost on a single 230 kV fault and the ride-through mandate that followed
- The facility as a grid asset vs a grid risk; coordination with the utility's dynamic model submittals (Chapter 4.3)
- Cross-references, not demotions, to the canonical homes: transient physics in Chapter 4.5, NERC obligation framing in Chapter 4.3, islanded inertia in Chapter 4.8, facility-as-grid-risk in Chapter 12.2, grid-services revenue in Chapter 15.8

### Chapter 4.11 — Grounding, Bonding, Earthing, Lightning Protection, SPD & EMC
- AC system-earthing regimes (TN-S/TT/IT) and the design fork: IEC 62305, NEC and IEEE 142 (Green Book), and IEEE 80 (substation step-and-touch potential)
- The substation ground grid and ground-potential-rise (GPR); soil resistivity, grid design, and touch/step-voltage safety
- Equipotential bonding and the Signal Reference Grid for high-density halls (TIA-607/IEEE 1100); bonding of conductive liquid-cooled racks, manifolds, and CDUs
- DC-bus grounding and isolation/ground-fault monitoring on ungrounded ±400/800 VDC systems (canonical home; the novel DC-grounding problem)
- Multi-stage surge-protective-device (SPD) coordination: Type 1/2/3 staging, IEC 61643 and UL 1449
- Lightning protection systems (LPS) on the building envelope and structure → physical LPS coordination in Chapter 6.3
- EMC and shielding for multi-Tb/s SerDes and the management plane → STP/fiber-shield bonding in Chapter 8.9
- Cross-references: fault current and relaying in Chapter 4.2; the earthing-regime fork is surfaced in Chapter 4.1 and consolidated regionally in Appendix G

### Chapter 4.12 — Metering, Power Quality, Monitoring & Electrical Operations
- Metering hierarchy and power-quality monitoring (THD, sags, transients, IEEE 519)
- Closed-loop power smoothing (GPU SoC/SMI/Redfish power caps) and grid-impact telemetry
- Power-transient telemetry → physics and mitigation canonical in Chapter 4.5; here we cover only metering/observability
- DCIM/EPMS integration, SoC management across rack BBUs
- Electrical commissioning, step-load validation, selective coordination, arc-flash & DC touch-safety ops

---

## PART 5 — Cooling & Thermal Management

### Chapter 5.1 — Thermal Fundamentals & the Density Wall
- Heat-flux first principles, thermal-resistance stack-up, Tcase/Tjunction limits
- Why air hit a wall; the density curve 2020–2027; the cooling hierarchy and where each technology saturates
- Thermal metrics (approach temperature, NTU/effectiveness) used in this part; PUE/WUE/ITUE/TUE definitions → canonical in Chapter 15.1

### Chapter 5.2 — Air Cooling at the Limit
- Raised floor vs slab, hot/cold aisle containment, CRAC vs CRAH, in-row coolers
- Airflow management, ASHRAE A1–A4 envelopes, warmer supply air
- Pushing air to 40–50 kW; when air still wins (storage, networking, modest-density inference, edge)

### Chapter 5.3 — Rear-Door Heat Exchangers & Air-Assisted Liquid Cooling (The Bridge)
- Passive vs active RDHx; AALC/liquid-to-air for facilities without facility water
- RDHx as the brownfield bridge; condensation/dew-point management
- Hybrid racks: DLC for GPUs + RDHx for residual air load

### Chapter 5.4 — Direct-to-Chip Liquid Cooling (DLC) — The 2026 Default
- Cold-plate architectures; single-phase vs two-phase and why single-phase dominates
- In-rack plumbing: manifolds, blind-mate/floating-tray, dripless quick disconnects
- Per-chip/per-rack thermal design (flow per kW, delta-T, pressure-drop budgets); coolant selection (PG25)
- What stays on air (NICs, DIMMs, PSUs, optics) and how it's handled
- Microfluidics forward pointer: in-chip/direct-to-silicon microchannels and the thermal-resistance rationale → consolidated roadmap in Chapter 16.2

### Chapter 5.5 — Immersion Cooling (Single-Phase & Two-Phase)
- Single-phase and two-phase architectures; the PFAS reckoning and the two-phase stall
- Fluid logistics, serviceability, floor loading; heat-reuse synergy
- Why immersion remains niche despite best-in-class PUE; FM Global/insurability gating → see Chapter 2.6 and Chapter 6.5

### Chapter 5.6 — CDUs & the Secondary Loop
- CDU function and isolation of TCS from facility water; L2L vs L2A vs in-rack vs row-level
- Sizing, redundancy (N+1 pumps), filtration, fluid chemistry management
- Controls, dew-point margin, and leak-detection integration

### Chapter 5.7 — Facility Water Loops & Warm-Water Cooling
- ASHRAE liquid classes (W17–W45); loop topology and flow balancing
- Warm-water cooling enabling year-round free cooling and high-grade heat capture
- Water chemistry/treatment; piping design, fill/flush/commissioning; coupling to heat rejection

### Chapter 5.8 — Heat Rejection: Chillers, Dry Coolers, Towers, Adiabatic & Economizers
- Heat-rejection options mapped to loop temperature; free/economizer cooling
- Adiabatic/evaporative assist; hybrid-plant mode sequencing
- Cooling-tower water management: biocide/Legionella control (ASHRAE 188), blowdown, and zero-liquid-discharge (ZLD) as an arid-siting discharge enabler
- Climate-driven plant selection and the water-vs-energy frontier
- Plant redundancy, concurrent maintainability, and thermal ride-through (no chilled-water inertia in DLC)

### Chapter 5.9 — Heat Reuse & Waste-Heat Recovery (Engineering)
- Grades of waste heat and heat-pump upgrade to district-heating temperatures (canonical engineering home)
- Loop integration and the on-site/district interface; why warm-water DLC is the enabler
- The offtake chicken-and-egg as an engineering constraint
- Sustainability/regulatory/economics framing of heat reuse → see Chapter 15.5

### Chapter 5.10 — Retrofitting Air-Cooled Facilities for Liquid
- Brownfield assessment: floor loading, plenum, electrical headroom, available water
- Retrofit paths ranked by disruption (L2A → RDHx/AALC → L2L direct-to-chip)
- Phased migration, mixed air+liquid operation, live-facility commissioning and leak risk
- Stranded-capacity math: power vs cooling vs floor area after retrofit

### Chapter 5.11 — Thermal Design, Reliability, Leak Detection & Commissioning
- End-to-end thermal budget worked example; flow/delta-T/pressure-drop across the chain
- Redundancy architecture, UPS-backed pumps, ride-through; leak detection & containment
- Quick disconnects, blind-mate, serviceability/MTTR; commissioning sequence and ongoing fluid analysis
- Coolant leak cascade and thermal-runaway failure modes → consolidated catalog in Appendix F

### Chapter 5.12 — Cooling-Controls Transient Dynamics & Setpoint Stability
- The thermal twin of Chapter 4.5: thermal time constants vs the power-step rate, and why DLC has no chilled-water inertia to ride through a slam
- CDU control-valve and pump-VFD slew limits; the lag between a synchronized GPU load step and the loop's thermal response
- Anti-hunting and setpoint stability under synchronized load slams; control-loop tuning to avoid oscillation
- Transient dew-point excursion on a rapid load drop and the condensation-risk window
- Coordination with GPU power-capping and the chip→BBU→BESS electrical spine (Chapter 4.5)
- The matching cooling-controls failure mode → Appendix F; cross-references to Chapters 4.9 (metering), 12.2, 13.5, 13.6, and 14.7

### Chapter 5.13 — Facility Piping & Pressure-System Mechanical Engineering
- The code basis for charged piping: ASME B31.x and the EU PED / EN 13480 fork
- Water-hammer and pump-trip surge analysis; surge suppression and valve-closure sequencing
- NPSH and cavitation; pump-curve/system-curve matching
- Galvanic and erosion-corrosion control; material selection and dissimilar-metal isolation
- Expansion, anchoring, guides, and seismic restraint of charged pipe → structural/weight basis in Chapter 6.2
- Weld QA, NDE, and hydrostatic/pneumatic acceptance testing → flush and pressure acceptance in Chapter 13.5

---

## PART 6 — The Building: Civil, Structural, Fire/Life-Safety & Construction Execution

### Chapter 6.1 — Building Typologies & Data-Hall Layout
- Greenfield vs retrofit; the powered-shell model and the base-building/fit-out split
- Single-story vs multi-story data centers; the density and stacking drivers
- Data-hall layout: white space vs gray space, adjacencies, plant rooms, yard and laydown
- Spatial planning for liquid: CDU galleries, pipe racks, leak-zoning, and service access
- Module/zone repeatability and the campus master plan
- Future-flex: bay sizing, knockouts, and reservation for density growth

### Chapter 6.2 — Structural & Civil Engineering for Dense Liquid-Cooled Halls
- Floor loading for 3,000+ lb racks: slab-on-grade vs elevated structural slabs (rack mass, point loads, and floor-loading limits owned here as the civil basis)
- Long-span steel and column grids for crane access and unobstructed halls
- Multi-story structural design for liquid: live loads, vibration, deflection, and pipe/coolant weight
- Crane loads, rigging paths, and equipment-move structural provisions
- Seismic design, base isolation, and rack anchoring for heavy, plumbed, energized racks
- Foundations, soils interface, and tie-in to the geotechnical basis (Chapter 3.8)

### Chapter 6.3 — Building Envelope, Architecture & Site Civil Works
- Envelope design: thermal, moisture, and air-barrier strategy for high-internal-gain buildings
- Architecture for security zoning, blast/ballistic hardening, and operational flow
- Lightning protection (air terminals, down-conductors, the structural earthing interface) → electrical bonding/earthing canonical in Chapter 4.11
- Site civil: grading, stormwater, utilities, roads, and heavy-haul access
- Yard layout: substation, generators, fuel, chillers/towers, BESS, and setbacks
- Climate-driven envelope and site hardening → regional deltas in Appendix G

### Chapter 6.4 — Modular & Prefabricated Construction
- Skids, PODs, and factory-built power & cooling modules; the all-in-one vs component-module spectrum
- The prefab supply chain: factory capacity, transport limits, and lead-time interplay (Chapter 2.3)
- Modular vs stick-built tradeoffs: speed, quality, cost, repeatability, and site labor reduction
- Interface management: getting modules to mate on site; tolerances and commissioning-in-factory
- Standardization and the product-ization of the data-center building
- When stick-built still wins: site constraints, customization, and very-high-density halls

### Chapter 6.5 — Fire Detection, Suppression & Life-Safety
- Aspirating/very-early smoke detection (VESDA); air-sampling in high-airflow halls
- Suppression options: clean-agent vs water mist vs hypoxic/oxygen-reduction
- The liquid-cooling and Li-battery/BESS fire problem; thermal-runaway containment
- Codes and approvals: NFPA 75/76, and FM Global approvals gating cooling and battery choices (links to Chapter 2.6 insurability)
- International fire & life-safety codes: EN 50600 fire provisions and IEC/EN standards vs NFPA 75/76; regional insurer approvals (FM Global vs VdS/LPCB) and how they vary by jurisdiction
- Compartmentation, egress, smoke control, and the firefighter-access problem in dense halls
- Detection/suppression interlocks with cooling, power, and the BMS

### Chapter 6.6 — Construction Execution, Sequencing & Phased Turnover
- GC selection and the build-execution plan; mobilization and site logistics
- Construction sequencing: civil → shell → MEP rough-in → equipment set → fit-out
- Phased turnover and partial energization; commissioning windows woven into the schedule
- The skilled-trades shortage (electricians, pipefitters) as a schedule killer and mitigation strategies → workforce/talent program in Chapter 14.11
- Quality, inspections, and the path from substantial completion to Cx readiness
- Construction safety interface → EHS program in Chapter 6.9

### Chapter 6.7 — Rack Civil Integration: Mass, Floor-Loading & Seismic Anchoring
- The rack as a structural-civil load: 3,000+ lb wet racks, point loads, and slab/floor-loading verification (the civil scope; the rack as an IT integration unit is owned in Chapter 7.13)
- Seismic anchoring, base isolation, and tip-over restraint for heavy plumbed racks
- Rigging paths, move routes, and floor-protection during deployment
- Coordination with the structural basis (Chapter 6.2) and the geotechnical/seismic basis (Chapter 3.8)
- Raised-floor vs slab decisions driven by rack mass and under-floor service

### Chapter 6.8 — Acoustic & Emissions Engineering Design
- Acoustic enclosures for gensets, chillers, and dry-coolers; the low-frequency-noise problem
- Low-noise fan walls, attenuators, and sound modeling/prediction
- Emissions-control engineering: SCR and oxidation catalysts on turbines/RICE (the engineering response, distinct from Chapter 3.9 permitting)
- Stack design, dispersion, and tie-in to air-permit limits
- Vibration isolation and structure-borne noise control
- Designing to meet community/permit thresholds set in Chapter 3.11

### Chapter 6.9 — Environment, Health & Safety (EHS) Across Build & Operate
- Arc-flash and DC-shock as a managed EHS program (not just a design margin)
- Confined space, working at height, and heavy-rigging safety during construction and ops
- Coolant/glycol/dielectric/PFAS handling, exposure control, and spill response
- High-voltage qualified-worker programs and the HV-yard interface (Chapter 4.3)
- Lockout/tagout (LOTO) in concurrently-maintainable, live facilities
- OSHA and process-safety management for on-site gas generation and fuel storage (Chapter 4.9)

---

## PART 7 — Compute, Silicon & System Integration

### Chapter 7.1 — Accelerator Landscape & Taxonomy
- GPUs vs systolic-array TPUs vs hyperscaler custom ASICs vs inference-specialized silicon
- Merchant vs captive business models; Broadcom/Marvell design-partner duopoly; TSMC as universal fab
- Reading a datasheet: dense vs sparse FLOPS, peak vs sustained, the marketing-number trap

### Chapter 7.2 — NVIDIA Accelerators: Hopper → Blackwell → Vera Rubin → Rubin Ultra → Feynman
- Per-GPU specs across generations; NVL72/144/576 as the unit of purchase
- Per-GPU NVLink bandwidth as a datasheet attribute; the NVLink/NVSwitch fabric and NVLink-SHARP → canonical home in Chapter 8.2
- Disaggregated inference (Rubin CPX) and the annual cadence as a strategic weapon

### Chapter 7.3 — AMD Instinct & the Open Challenger
- MI300X/MI325X/MI355X and the rack-scale MI400 (Helios, UALink-over-Ethernet)
- Where AMD wins today; the ROCm-maturity TCO tax

### Chapter 7.4 — Hyperscaler XPUs: TPU, Trainium/Inferentia, Maia, MTIA
- Google TPU (Ironwood/TPU v7 for inference and agentic serving, OCS pods, XLA/JAX lock-in)
- AWS Trainium/Inferentia, the Neuron SDK, anchor-tenant economics (Trainium3/Trn3 UltraServers, Project Rainier — ~500k Trainium2 for Anthropic)
- Microsoft Maia, Meta MTIA, OpenAI/Broadcom; the model-owner-builds-silicon trend

### Chapter 7.5 — Custom ASICs & the Merchant-Silicon Disruption
- The economics that justify custom silicon; NRE, lead time, minimum-volume thresholds
- Fixed-function risk vs reprogrammable flexibility; the hybrid fleet

### Chapter 7.6 — HBM: The Binding Constraint on AI Compute
- HBM generations (HBM3E → HBM4 → HBM4E), the three-supplier oligopoly, the 2026 supply crisis
- HBM as a top-3 BOM line; capacity vs bandwidth optimization; DRAM spillover
- HBM/CoWoS as the upstream allocation gate → procurement strategy in Chapter 2.3; the packaging engineering that sets stack-count-per-package → see Chapter 7.7

### Chapter 7.7 — Advanced Packaging & the Integration Substrate
- The 2.5D/3D packaging taxonomy and why packaging is the most-cited binding constraint on AI compute through 2030
- CoWoS-S/R/L families and SoIC/hybrid-bonding: per-package-area and reticle-stitching ceilings
- Interposer options (silicon vs RDL vs glass) and their reach/cost/yield tradeoffs
- HBM-stack-count-per-package as a function of interposer area — the engineering driver behind the Chapter 7.6 allocation view
- Chiplet disaggregation and UCIe; die-to-die interconnect economics
- Thermal, warpage, and yield consequences of large packages → large-package hotspot/heat-flux cooling in Part 5; procurement/allocation stays in Chapters 7.6/2.3 (point here for the engineering)

### Chapter 7.8 — Host CPUs, GPU:CPU Ratios & System Composition
- Grace/Vera (Arm) vs EPYC/Xeon (x86); coherent NVLink-C2C host-attach vs PCIe host
- The agentic-inference rebalance of GPU:CPU ratios; memory (LPDDR5X vs DDR5/MRDIMM)

### Chapter 7.9 — Software Ecosystems & Lock-In
- CUDA vs ROCm vs XLA vs Neuron; the realized-MFU gap vs paper FLOPS
- Portability layers (Triton, MLIR, PyTorch compile); switching-cost quantification

### Chapter 7.10 — Precision, Quantization & the Compute-Memory Tradeoff
- The precision ladder (FP32 → BF16 → FP8 → FP6 → FP4); microscaling (NVFP4 vs MXFP4)
- FP4 training viability; inference quantization quality/speed; KV-cache quantization

### Chapter 7.11 — Accelerator Selection, TCO & Procurement Strategy
- Cost-per-token / cost-per-M-tokens as the governing metric; building the TCO model
- Supply and allocation strategy; buy vs rent vs build; heterogeneous fleet composition
- Power-limited vs capex-limited regime; RFP construction for cross-vendor comparison
- GPU depreciation/economic-vs-book-life → canonical argument in Chapter 1.8; here we cover procurement execution
- Per-generation perf/watt and cost-per-token trajectory in brief → consolidated compute/memory roadmap in Chapter 16.2

### Chapter 7.12 — On-Package Power Delivery & Power Integrity
- Board-level 48V → core conversion: multiphase VRMs and smart power stages
- Vertical and backside power delivery (BSPDN) and the move to power-from-below
- On-die power-delivery-network design: decoupling, on-package capacitance budgets, and the PDN impedance target
- Load-line, Vdroop, and the di/dt event — the physical origin of the synchronized load step
- The chip→BBU→BESS spine read end-to-end: the on-die di/dt event forward-refs to facility transient mitigation in Chapter 4.5 (which back-refs here)

### Chapter 7.13 — The Rack as Integration Unit
- Defining the rack as the IT integration unit and the canonical home that points outward (the civil rack-mass/anchoring scope lives in Chapter 6.7)
- Rack anatomy and form factors: 19" EIA-310 vs 21" OCP ORV3 vs Open Rack Wide; depth/width growth
- In-rack power delivery and busbars → canonical in Chapter 4.6 (LV) and Chapter 4.7 (DC/sidecar); BBU shelves → Chapter 4.5
- Liquid distribution at the rack: manifolds, ~150–200 QDs/rack, leak detection, negative-pressure designs → canonical in Chapter 5.4
- In-rack network cabling: copper NVLink spine vs structured fiber; the copper-vs-optics rack decision → canonical in Chapters 8.2/8.9
- The physical DCIM layer: 3D digital twin, asset/port mapping, physical-layer telemetry → operational DCIM in Chapter 14.2

### Chapter 7.14 — Server & System Integration
- The L1–L12 manufacturing-level model; ODM vs OEM vs systems integrator
- Build-vs-buy: DGX vs HGX-OEM vs ODM-direct vs OCP self-design
- OCP ecosystem & Open Rack standards; reference systems deep dive (HGX, MGX, GB200/GB300 NVL72, AMD Helios)
- Factory vs field integration: where the rack gets built; shipping wet vs dry; logistics of 1.5–3 t racks
- Burn-in, validation & acceptance testing; goodput-oriented acceptance gates
- Supply chain, lead times & allocation (CoWoS/HBM as the upstream gate)
- Deployment logistics, commissioning & operations handoff; spares/RMA and serviceability

### Chapter 7.15 — Deployment Velocity & Cabling at Scale
- Install-rate engineering: racks and links per day as the velocity metric that gates time-to-goodput
- Pre-terminated trunk/MPO cabling as the primary velocity lever; structured-cabling staging
- Optics burn-in before install; screening transceivers off the critical path
- Mis-cabling as the #1 acceptance failure and how to engineer it out (labeling, polarity discipline, automated verification)
- The xAI Colossus 122-day buildout as a worked exemplar of install-velocity engineering
- The install act is owned here; acceptance/validation lives in Part 13

---

## PART 8 — Networking, Fabrics & Optics

### Chapter 8.1 — Network Fundamentals & AI Traffic Characterization
- The three-network model and the moving scale-up/scale-out/scale-across boundaries
- Collective primitives and parallelism mapping (TP/EP inside scale-up; PP/DP across scale-out)
- Tail-latency tyranny, BSP barrier semantics, goodput vs raw link rate
- Training vs inference traffic profiles; JCT as the north-star

### Chapter 8.2 — Scale-Up Fabric (Intra-Node / Intra-Rack)
- NVLink/NVSwitch architecture, generations, the NVL domain (NVL72/144/576, IMEX, MNNVL), and NVLink-SHARP (canonical fabric home)
- The scale-up-over-Ethernet three-way: NVLink/NVLink Fusion vs UALink/UALoE vs Broadcom SUE/ESUN — per-standard reach, lock-in vs commoditization, and the memory-semantic landscape (CXL, TPU ICI + OCS)
- How domain size shapes training (TP/EP ceilings, MFU) and inference (prefill/decode, EP degree)
- Copper vs optical inside the domain; the CPO transition; reliability and blast radius
- Scale-up roadmap (per-GPU bandwidth, multi-rack optical domains, 800 VDC/600 kW racks) → consolidated in Chapter 16.2

### Chapter 8.3 — Network Silicon: Switch ASICs, NICs & DPUs
- Switch ASIC families: Broadcom Tomahawk/Jericho, NVIDIA Quantum (InfiniBand)/Spectrum (Ethernet); radix, bandwidth, and the SerDes generation as the gating spec
- NICs and SuperNICs: ConnectX, and the RoCE/IB offload path
- DPUs and IPUs: BlueField, Pensando, and the offload of storage, security, and virtualization
- Shallow-shared vs deep-buffer VOQ switch architectures and the buffering tradeoff
- Merchant vs captive switch silicon → business-model framing in Chapter 7.1; DPU-enforced security → Chapter 11.7; DPU-offloaded storage → Chapter 9.3

### Chapter 8.4 — Scale-Out Fabric: Protocols, Standards & Transport
- The protocol war: InfiniBand vs Ethernet/RoCEv2 vs Spectrum-X vs Ultra Ethernet (UEC)
- The transport semantics: lossless vs lossy, ordered vs out-of-order, and the RoCE in-order penalty
- Standards trajectory and the UEC roadmap → consolidated in Chapter 16.2
- Multi-tenant fabrics and cloud GPU isolation (PKeys, VXLAN, DPU-enforced VPC) as a fabric concern → tenant security boundary in Chapter 11.6, zero-trust segmentation in Chapter 11.7

### Chapter 8.5 — Scale-Out Topology, Sizing & Oversubscription
- Topology design: fat-tree/Clos, rail-optimized, rail-only, dragonfly, torus/OCS
- Network sizing and BOM math per GPU count (scalable units, switch/cable/optic counting)
- Oversubscription strategy: 1:1 non-blocking for training vs 2:1–3:1 for inference
- Mapping parallelism strategy onto the topology; the cost/reach/blocking tradeoff

### Chapter 8.6 — Congestion Control, Load Balancing & In-Network Compute
- PFC head-of-line blocking, deadlock, and the watchdog; lossless-Ethernet pitfalls
- DCQCN/ECN tuning and the congestion-control parameter space
- Adaptive routing, packet spray, and the UEC answer to RoCE's in-order penalty
- In-network compute: SHARP and collective offload
- Congestion telemetry and observability → operations correlation in Chapters 10.6 and 14.2

### Chapter 8.7 — Management, Out-of-Band Fabric & PTP/IEEE-1588 Timing
- The OOB/BMC/Redfish management fabric and in-band-vs-OOB segmentation as a network-design decision
- PTP/IEEE-1588 time synchronization: boundary vs transparent clocks, the PTP hardware clock (PHC), GPS/PPS distribution, and holdover budgets
- Why precise time underpins telemetry correlation, RoCE diagnostics, and multi-DC training
- Time-sync accuracy as a commissioning acceptance gate → see Chapter 13.7
- Management-plane isolation and security → see Chapter 11.7

### Chapter 8.8 — Scale-Across: Multi-Campus & Cross-Region Fabric (DCI for Distributed Training)
- The defining 2025–26 frontier-lab architecture: the latency/bandwidth hierarchy across scale-up → scale-out → scale-across
- Coherent DCI optics: ZR/ZR+, 800G → 1.6T, OTN transport → physical-layer primitives in Chapter 8.9
- Algorithms over the WAN: asynchronous/hierarchical/local-SGD and gradient compression for inter-campus training
- Cross-site fault domains and checkpointing across regions → checkpoint math in Chapter 9.4
- The power-bound rationale for going multi-DC → macro narrative in Chapter 16.1; the scattered 1.2 mentions are redirected here

### Chapter 8.9 — Physical-Layer & Interconnect Taxonomy
- Physical-layer foundations: link budgets, PAM4 modulation, the SerDes ladder, FEC/TDECQ
- The interconnect taxonomy: DAC, ACC/AEC, AOC, pluggable modules ("copper inside, optics outside")
- Pluggable form factors: OSFP, QSFP-DD, OSFP-XD; thermal/mechanical design
- The DSP spectrum: retimed pluggables, LPO, LRO, half-retimed optics
- Optics as a failure and cost driver: reliability engineering, commissioning, operations
- STP/fiber-shield bonding and EMC at the connector → grounding/EMC canonical in Chapter 4.11

### Chapter 8.10 — CPO, Fiber Plant & Structured Cabling
- Co-packaged optics (CPO): architecture, economics, the serviceability tradeoff → CPO roadmap consolidated in Chapter 16.2
- Fiber plant: single-mode vs multimode, reach classes, channel architectures
- Structured cabling at scale: trunking, MPO systems, polarity, loss budgets
- Deployment and install-velocity interface → cabling-at-scale engineering in Chapter 7.15

---

## PART 9 — Storage & Data

### Chapter 9.1 — Storage in the AI Lifecycle: Why It Determines GPU Efficiency
- The economic argument and the four I/O personalities (ingestion, checkpointing, many-small-files, KV-cache)
- Where storage lives; throughput vs IOPS vs latency vs metadata; scratch vs durable vs archive

### Chapter 9.2 — Parallel & Distributed File Systems
- Lustre/EXAScaler, GPFS/Storage Scale, WekaFS, VAST DASE, Ceph/DAOS/JuiceFS
- Centralized vs distributed metadata and the LOSF problem; multiprotocol access
- POSIX vs object-native primary store; erasure coding vs replication; certification

### Chapter 9.3 — NVMe Tiers, GPUDirect Storage & the CPU-Bypass Data Path
- The media hierarchy; node-local NVMe scratch; GPUDirect Storage and NVMe-oF
- SCADA (GPU-initiated I/O), BlueField DPU-offloaded storage; PCIe 5.0/6.0, E1.S/E3.S; TLC vs QLC
- Storage-fabric placement: dedicated storage rail vs converged-onto-backend vs frontend/management network; NVMe-oF transport tradeoffs (RoCE vs TCP) → fabric design in Chapter 8.5

### Chapter 9.4 — Checkpointing for Large-Scale Training
- Checkpoint anatomy (~14 bytes/param), failure model, the optimal-interval derivation (Young/Daly) — canonical home; Chapters 1.2, 10.7, and 14.4 cross-reference this math
- Synchronous vs asynchronous vs in-memory; tiered checkpointing; sharded/distributed formats
- Compression/sparsification; restart, recovery, and bandwidth sizing

### Chapter 9.5 — Data Ingestion, Preprocessing & the Data-Loader Path
- The runtime pipeline; formats (WebDataset, Parquet, MDS); sharding and shuffling
- Data loaders and CPU bottlenecks (DALI/FFCV); streaming vs staging; caching tiers
- Multimodal pipelines; data versioning, lineage, and governance → legal/compliance regime in Chapter 10.10; the offline prep supercomputer → see Chapter 9.9

### Chapter 9.6 — Object Storage, Data Lakes & the Capacity Tier
- On-prem vs cloud object; S3-over-flash; lifecycle/tiering; lakehouse formats
- Durability and protection; object as the inference cold tier and model-distribution backbone

### Chapter 9.7 — Inference & KV-Cache Storage: The New Memory Hierarchy
- Why inference created a new tier; the HBM → local DRAM → CXL-expanded DRAM → NVMe → Ethernet-flash → object hierarchy
- CXL DRAM as a present-tense tier: memory expansion vs pooling, and the rule for when CXL-expanded DRAM beats NVMe KV-offload
- KV-cache offload and reuse; Dynamo/NIXL/CMX; multi-tier KV management; model/weight serving and cold-start

### Chapter 9.8 — Sizing, Data Gravity & Resilience
- Storage:compute ratios, bandwidth budgets, capacity sizing; network co-design
- Isolating checkpoint incast from training collectives as a fabric-and-storage co-design problem → see Chapter 8.5
- Data gravity, egress economics, move-compute-to-data vs move-data-to-compute, multi-site strategy
- Reliability, multi-tenancy/QoS, security, TCO; benchmarking (MLPerf Storage) and acceptance
- 2026 trends forward pointer (GPU/DPU-initiated I/O, all-flash everywhere, file/object convergence, deeper CXL tiering) → see Chapter 16.2

### Chapter 9.9 — The Data-Prep Supercomputer: Offline Data Processing
- Offline data preparation as a first-class workload distinct from the runtime loader (Chapter 9.5) and the legal regime (Chapter 10.10)
- Deduplication: exact-match plus near-duplicate MinHash/LSH at corpus scale
- Quality and safety filtering; classifier-based curation pipelines
- Decontamination against evaluation/benchmark leakage
- Tokenization throughput and the CPU-bound preprocessing wall
- Sizing the CPU/storage cluster for the prep pipeline; its distinct power, network, and storage profile vs the GPU training fleet

---

## PART 10 — Software, Orchestration & Service Delivery

### Chapter 10.1 — Orchestration Architecture & the Scheduling Plane
- The three planes: workload scheduler, node runtime, fleet/control plane
- Slurm vs Kubernetes; batch schedulers on K8s (Volcano, KAI, Run:ai); Slinky hybrids; DRA
- Reference control-plane topology; build vs buy the platform

### Chapter 10.2 — Topology-Aware & Rack-Scale Scheduling
- Why topology matters (NVLink vs PCIe vs inter-node bandwidth cliffs); discovery
- Slurm block scheduling; GB200 NVL72 IMEX/ComputeDomains; K8s topology-aware placement
- Mapping parallelism strategy to the fabric; fragmentation and rack-as-scheduling-unit

### Chapter 10.3 — Multi-Tenancy, Isolation & Resource Sharing
- The sharing spectrum: whole-GPU, MIG, MPS, time-slicing, fractional GPU
- Quota and fairness; tenant isolation models; security isolation and confidential computing → TEE/attestation canonical in Chapter 11.5
- Noisy-neighbor management and QoS guarantees

### Chapter 10.4 — Node Software Stack: Drivers, CUDA/ROCm, NCCL & Firmware
- The vertical stack and driver lifecycle; GPU Operator; container runtime layer
- AMD ROCm path; NCCL/RCCL tuning; fleet-wide version pinning (the golden stack)

### Chapter 10.5 — Provisioning, Bring-Up & Infrastructure as Code
- Bare-metal lifecycle (BMC/Redfish, PXE, image, join); MAAS/Ironic + Terraform/Ansible
- Golden images vs config management; burn-in/acceptance gates; cluster management platforms; day-2 lifecycle

### Chapter 10.6 — Observability, Telemetry & GPU Health
- DCGM/NVML, XID/SXID taxonomy, fabric telemetry; liquid-cooling and node correlation
- Congestion and fabric-health telemetry (PFC/ECN counters, queue depth) → congestion engineering in Chapter 8.6
- Silent Data Corruption detection; job-level observability; goodput as the headline metric
- Self-hosted vs managed observability; cardinality, sampling, and cost

### Chapter 10.7 — Fleet Reliability, Fault Tolerance & Autonomous Recovery
- The reliability problem at scale; checkpoint/restore strategy → Young/Daly math canonical in Chapter 9.4
- Autonomous hardware recovery; hot spares; elastic/redundant training and algorithmic fault tolerance
- Detection-to-recovery loop; failure taxonomy → canonical in Chapter 14.3

### Chapter 10.8 — MLOps & Training Frameworks
- Distributed training frameworks (FSDP/FSDP2, DeepSpeed/ZeRO, Megatron-Core) and 3D parallelism
- Pipeline orchestration (Kubeflow, Ray, SkyPilot); experiment tracking and the training control plane
- Training-job lifecycle, hyperparameter sweeps, and reproducibility (inference serving is owned in Chapter 10.11)

### Chapter 10.9 — Customer Onboarding, Delivery & Productization
- Consumption models: bare-metal vs GPUaaS vs managed vs serverless; the value-stack ladder
- Multi-tenancy and isolation architecture; control plane, API & provisioning
- SLAs, reliability commitments & support; billing, metering & capacity reservation (on-demand/spot/reserved/take-or-pay)
- Onboarding, time-to-first-job & the productization lifecycle; offboarding and portability
- The unit-economics tie-back: metered revenue per token/GPU-hr → see Chapter 1.8

### Chapter 10.10 — Data Governance, Privacy & the Training-Data Legal Regime
- Training-data provenance, licensing and the copyright/fair-use frontier; dataset documentation
- PII handling, minimization, and the privacy program for training and inference data
- Data-residency-by-jurisdiction for customer data; cross-border transfer mechanisms (links to Chapter 3.12 sovereignty)
- Retention, deletion, and the right-to-erasure plumbing; data lineage as a compliance artifact
- Customer data processing agreements (DPAs), sub-processor management, and tenant data isolation
- Distinct from weight/model security → see Chapter 11.8; this chapter governs the data, not the model

### Chapter 10.11 — Inference Serving Engineering: SLOs, Batching, Disaggregation & Goodput-Optimal Scheduling
- The latency taxonomy that governs the inference SLO: TTFT (time-to-first-token), TPOT (time-per-output-token), and ITL (inter-token latency)
- Continuous/in-flight batching and chunked prefill as the throughput levers
- Prefill/decode (P/D) disaggregation tradeoffs, with KV-cache transfer as the "disaggregation tax" → KV-cache hierarchy in Chapter 9.7
- SLO-attainment-constrained goodput: maximizing served tokens subject to latency targets
- Queueing-theory fleet sizing for bursty, always-on inference demand
- Speculative and parallel decoding economics; when the extra compute pays for itself
- Prefix-cache-aware routing and request scheduling across the fleet
- The serving-engine landscape: vLLM, TensorRT-LLM, SGLang, and disaggregated serving (Dynamo, llm-d, KServe); autoscaling against SLOs
- The economic tie-back: serving efficiency as the governor of $/token → see Chapters 1.3, 1.8, 7.11; SLA framing → see Chapter 12.4

---

## PART 11 — Security

### Chapter 11.1 — Threat Model, Assets & Security Levels for AI Infrastructure
- Why AI DCs are a special case (asset value density, IP concentration, strategic targeting)
- Asset taxonomy and adversary tiers; the RAND Weights Security Level framework
- Defense-in-depth reference architecture; setting a target level and deriving controls

### Chapter 11.2 — Physical Security: Siting, Zones & Kinetic/Drone Threats
- Concentric zone model; perimeter, access control, biometrics, surveillance
- Aerial threats and counter-UAS; kinetic/sabotage resilience; cage/rack controls
- Physical-cyber convergence (securing access control, BMS/EPMS, cameras as OT)

### Chapter 11.3 — Supply-Chain Security & Hardware Provenance
- Counterfeiting, implants, tampering-in-transit; supplier vetting and country-of-origin risk
- Tamper-evident logistics; device provenance (NIST), HBOM/SBOM, RIM, OCP S.A.F.E.
- Secure decommissioning, media sanitization, and data remanence

### Chapter 11.4 — Hardware Root of Trust, Firmware & BMC Security
- Why firmware is the high-value target; silicon root of trust (Caliptra, DICE), DC-SCM
- Secure/measured boot, firmware update integrity, platform attestation
- BMC threat model and management-plane segmentation; continuous firmware integrity monitoring

### Chapter 11.5 — GPU Confidential Computing & Trusted Execution
- CPU TEE foundations (SEV-SNP, TDX); NVIDIA GPU TEE (CPR, BAR0 decoupler, encrypted transfers) — canonical TEE/attestation home; Chapters 10.3 and 11.6 cross-reference
- Attestation flow (NRAS/RIM); multi-GPU confidential computing (TEE-I/O); performance characteristics
- Confidential containers, key brokers, and residual attack surface

### Chapter 11.6 — Multi-Tenant & Workload Isolation Security
- The multi-tenancy spectrum and documented isolation failures (CVEs, uncore side-channels)
- MIG vs vGPU vs time-slicing as security boundaries; memory hygiene on reassignment
- Confidential multi-tenancy → attestation flows canonical in Chapter 11.5; the isolation-vs-utilization economics

### Chapter 11.7 — Network Segmentation, Microsegmentation & Zero Trust
- AI-cluster network as attack surface; macro- and micro-segmentation
- DPU-enforced security; securing the high-performance fabric; egress control as the anti-exfiltration linchpin
- API/inference-endpoint security; management-plane isolation; monitoring east-west

### Chapter 11.8 — Model & Weight Protection (At-Rest, In-Transit, In-Use)
- Weights as crown jewels; at-rest, in-transit, and attestation-gated in-use protection
- Egress as the choke point; insider exfiltration paths; model integrity and provenance
- Key management and PKI as a discipline: HSM/KMS and the key hierarchy, fleet-scale rotation/revocation, attestation-gated key release (→ Chapter 11.5), control-plane secrets (→ Chapters 10.6/11.7), and the crypto-agility/PQC forward-pointer for long-lived weights (→ Chapter 16.x)
- Reaching higher Weights Security Levels
- Distinct from data-governance/privacy → see Chapter 10.10

### Chapter 11.9 — Insider Threat & Human-Layer Security
- Why insiders are the dominant unaddressed vector; personnel security, least privilege, two-person rules
- PAM, separation of duties, behavioral monitoring; offboarding; vendor/contractor risk; culture and deterrence

### Chapter 11.10 — Cyber-Physical & Destructive Attacks on OT/Facility Systems
- The OT/ICS threat model for BMS/EPMS/SCADA, CDU controllers, BESS management, and the power-cap/firmware planes
- The power-cap and firmware as a weapon: a forced synchronized load step → grid trip; CDU disablement → thermal runaway; BESS-induced runaway; GPU-bricking via malicious firmware
- OT/IT segmentation and the Purdue model / IEC 62443
- Safety-instrumented-system (SIS) independence and hardwired interlocks that a compromised control plane cannot override
- Each Appendix F failure mode treated as explicitly dual-use (random AND attacker-induced)
- Cross-references: transient physics in Chapter 4.5, cooling-controls in Chapter 5.12, firmware integrity in Chapter 11.4, network segmentation in Chapter 11.7

### Chapter 11.11 — Compliance, Certification & Governance
- The framework landscape (SOC 2, ISO 27001/42001, FedRAMP/20x, CMMC) and how they differ
- NERC CIP for transmission-connected loads → canonical treatment in Chapter 4.3; here we place it in the compliance portfolio
- Continuous monitoring and compliance-as-code (OSCAL/RFC-0024); control crosswalks
- Audit logging, data residency/sovereignty, and sector regimes

### Chapter 11.12 — Security Operations, Detection & Incident Response
- SOC for an AI campus (converged vs isolated); detection coverage and telemetry at scale
- IR playbooks (weight theft, firmware implant, kinetic strike, isolation breach, cyber-physical/OT attack); forensics on confidential systems
- The unified incident-command interface and the converged cyber-physical escalation trigger → canonical incident-command model in Chapter 14.11
- Resilience/continuity, red-teaming, threat intelligence, and metrics

---

## PART 12 — Reliability, Resilience & Standards

### Chapter 12.1 — Resilience Standards, Redundancy Topologies & Fault-Domain Engineering
- The standards landscape: Uptime Tiers, TIA-942, ISO/EN 50600, and their limits for AI factories
- Redundancy topologies: N/N+1/N+2/2N/2(N+1), block-redundant, and distributed-redundant architectures
- Fault-domain and blast-radius engineering as the recurring design lens
- Concurrent maintainability vs fault tolerance as a topology property
- Availability vs goodput defined here in one line, then forwarded → the full rethink in Chapter 12.2
- The redundancy primer (Chapter 0.5) extended into topology selection; the quantitative math → see Chapter 12.5

### Chapter 12.2 — The AI-Cluster Reliability Rethink: Goodput vs Facility Availability
- Why availability (the facility "nines") is the wrong target for AI factories; goodput as the metric that actually governs return
- Redundancy moving from the facility into the silicon and the software stack — and what that does to the design-basis
- Reliability of the thermal/mechanical path: liquid-cooling continuity, CDU/pump ride-through, and the disappearance of chilled-water inertia
- Grid-coupling and facility-as-grid-risk: synchronized load swings as a reliability and stability problem → grid-interactive engineering in Chapter 4.10
- Mapping the rethink onto Uptime/TIA-942/EN 50600 (Chapter 12.1) and onto the redundancy primer in Chapter 0.5
- The goodput-availability tradeoff curve and where to spend the next dollar of redundancy — quantified by the model in Chapter 12.5

### Chapter 12.3 — Disaster Recovery, Business Continuity & Geographic Failover
- Multi-region failover for inference SLAs; active-active vs active-passive across regions
- RTO/RPO targets and the cost of each tier of continuity
- Capacity reservation for failover; the "spare region" economics and warm/cold standby
- Pandemic/black-swan continuity, key-personnel and supply continuity
- Contractual continuity obligations and tenant-facing DR commitments
- Failover drills, runbooks, and the link to FMEA scenarios in Appendix F

### Chapter 12.4 — SLAs, Goodput Contracts & Availability Commitments
- Availability-vs-goodput SLAs: what is actually being promised, and why the two diverge for AI workloads
- Penalty/credit structures, service-credit ladders, and how goodput shortfalls are measured and attributed
- Mapping commitments to customer delivery and productization → see Chapter 10.9; serving-engineering SLOs → see Chapter 10.11
- Tying SLA terms back to acceptance/commissioning gates and the goodput baseline captured at go-live
- Goodput accounting and badput attribution as the contractual measurement basis
- Negotiating realistic commitments against the failure environment and the redundancy design-basis (Chapters 0.5 and 12.1), backed by the availability model in Chapter 12.5

### Chapter 12.5 — Quantitative Reliability & Availability Modeling (RBD / FTA / Monte-Carlo)
- Reliability block diagrams and series/parallel/k-of-n availability algebra — extending the Chapter 0.5 primer into an actual method
- Markov and state-space models for repairable systems: MTTR, repair-crew constraints, and degraded operating states
- Fault trees and minimal cut sets; finding the dominant failure paths
- Common-cause failure and beta-factor modeling for shared loops, buses, and firmware
- Monte-Carlo simulation of availability and goodput under a realistic failure environment
- The roll-up: taking Chapter 14.3 AFRs and the Appendix F FMEA scenarios up to cluster-level availability AND goodput
- Sensitivity analysis — where the next dollar of redundancy buys the most nines/goodput — as the engine behind the Chapter 12.2 tradeoff curve
- Inputs from Chapters 0.5/14.3/12.1; outputs to Chapters 12.2/12.4; distinct from the design-validation digital twin (Chapter 2.7)

---

## PART 13 — Commissioning & Go-Live

### Chapter 13.1 — Commissioning Fundamentals, Levels & Program Governance
- The L1–L5 (Cx Level 1–5) ladder; concurrent maintainability vs fault tolerance
- The two parallel tracks (facility and IT/cluster) and how they interlock
- Roles, governing documents (OPR/BOD/SOO), and standards (ASHRAE Gd 0/1.1, Uptime, BICSI 002)

### Chapter 13.2 — Documentation, Scripts & Acceptance Test Plans
- Anatomy of a commissioning script and quantitative pass/fail gates
- Facility vs cluster ATP/SAT; instrumentation and data acquisition
- Deficiency/punch-list management; baseline "fingerprint" capture; digital Cx platforms

### Chapter 13.3 — Electrical Power Acceptance (L3/L4)
- Utility energization, switchgear, generator, UPS/BESS commissioning
- Resistive vs reactive vs AI-load-emulating load banks; redundancy-topology validation
- Arc-flash and protection-coordination verification against the study (Chapter 4.2)
- Validating NVL72 power-smoothing and dynamic-load-swing tolerance → transient physics canonical in Chapter 4.5

### Chapter 13.4 — Commissioning On-Site Generation & Microgrid Controls
- Prime/island power acceptance: generator paralleling and load-sharing validation
- Fuel-supply and gas-process acceptance: fuel-gas conditioning, compression, and dual-fuel changeover (Chapter 4.9)
- Islanding and grid-forming/grid-following transition testing; seamless vs break-before-make
- Microgrid controller tuning: droop, frequency/voltage regulation, dispatch logic
- Black-start capability demonstration and restoration sequencing
- BESS/synchronous-condenser inertia and transient-support validation
- The seam between facility Cx and microgrid Cx; sequenced between electrical acceptance (13.3) and IST (13.6)

### Chapter 13.5 — Cooling Acceptance: Air, Liquid-to-Chip & CDU Commissioning
- Airside acceptance; secondary-loop flushing, fluid quality, fill/purge; hydrostatic and pressure acceptance (Chapter 5.13)
- CDU acceptance, flow/thermal verification, worst-case-branch testing
- The load-realism limit: facility load banks reject to air, not into cold plates/CDUs, so the liquid loop, CDU controls, and worst-case-branch thermal-hydraulics cannot be exercised at realistic transient heat-flux without real GPUs
- Mechanical-Cx ↔ GPU-burn-in as an explicit overlapping, sequenced gate
- Leak integrity/detection; cooling failover; interlock with cluster burn-in

### Chapter 13.6 — Level 5 Integrated Systems Testing (IST) & Failure-Mode Demonstration
- IST scope and the master sequence; black-building/pull-the-plug test
- Cascading/concurrent-fault scenarios; thermal ride-through; BMS/DCIM/SCADA integration
- Failure-mode demonstration drawing on the FMEA catalog in Appendix F
- The dynamic-load realism gap (canonical home): resistive vs reactive vs AI-emulating load banks and what each can and cannot reproduce across BOTH the electrical and thermal dimensions; the BBU → BESS → GPU-capacitance mitigation stack under realistic dynamics; the proxy training run (Chapter 13.9) as the only true emulator; acceptance criteria bridging load-bank IST → first-real-workload

### Chapter 13.7 — Network Fabric Commissioning & Validation
- Physical-layer acceptance; BER and link-flap screening; topology validation
- Point-to-point bandwidth/latency; scale-up (NVLink) validation; congestion/QoS verification
- PTP/IEEE-1588 time-sync accuracy as an acceptance gate (Chapter 8.7)
- Fabric baseline capture and link-health handoff

### Chapter 13.8 — GPU Node Burn-In, Diagnostics & Stress Validation
- Node bring-up and inventory; DCGM diagnostics; burn-in/soak (72–168 hr)
- HBM/ECC validation; thermal screening; SDC hunting; straggler detection; power-behavior validation

### Chapter 13.9 — Cluster-Scale Benchmarking, Reference Training & Storage/Scheduler Validation
- NCCL collective benchmarking and bandwidth acceptance gates; MLPerf/OSU benchmarks
- A reference/proxy training run; storage and scheduler/orchestration acceptance
- Goodput/badput accounting; defining the contractual SLA

### Chapter 13.10 — Staged Power/Load Ramp, Go-Live & Handover to Operations
- Staged energization preserving live-block redundancy; managing synchronized load swings
- Soft-launch/canary; the Operational Readiness gate; the handover package
- Monitoring/observability handoff; seeding the day-2 reliability program; warranty and close

---

## PART 14 — Day-2 Operations, Upgrades & Lifecycle

### Chapter 14.1 — Operational KPIs, Goodput & the Reliability Economics of AI Factories
- Why day-2 economics dominate; the goodput stack (ETTR, ML goodput, MFU/MBU)
- The failure environment vs traditional IT; the blast-radius problem; the operations scorecard

### Chapter 14.2 — DCIM, Telemetry & Observability for GPU-Dense, Liquid-Cooled Facilities
- The AI telemetry stack and facility-layer telemetry; liquid-cooling observability
- Protocols and data architecture; IT/facility correlation; digital twins/autonomous DCIM
- Alarm management distinct from alerting: alarm rationalization, shelving, and flood-handling on the ISA-18.2 / EEMUA-191 lineage
- The operational twin vs the design-validation twin (Chapter 2.7); closing the loop with as-built models

### Chapter 14.3 — Component Failure Modes, Failure Rates & Fleet Reliability Data
- Failure taxonomy (hard, transient, silent) — canonical home; Chapters 12.1, 13.6, and 10.7 cross-reference
- GPU/network/infrastructure failure modes; the consolidated FMEA catalog → Appendix F
- SDC mechanisms and detection programs; empirical fleet failure-rate data; AFR modeling and burn-in → these AFRs feed the availability model in Chapter 12.5

### Chapter 14.4 — Reliability Engineering for Training (Operational)
- Checkpoint/restart in production → Young/Daly optimal-interval math canonical in Chapter 9.4; here we cover only operational tuning
- Multi-tier and asynchronous checkpointing in operation
- Fault detection/isolation, lemon-node ejection, automated remediation
- Elastic/redundant training and MTTR decomposition

### Chapter 14.5 — Predictive & Preventive Maintenance of Power and Cooling Plant
- Run-to-failure vs time-based vs condition-based/predictive maintenance
- Predictive maintenance of power chain, cooling, and liquid-cooling-specific systems
- Maintenance windows in a 24/7 synchronous world; concurrent maintainability; spares forecasting

### Chapter 14.6 — Spares Strategy, RMA Logistics & Repair Operations
- Sparing models and failure-rate-driven forecasting; the RMA lifecycle
- Field-replaceable units and design for serviceability; logistics at GW scale; repair-vs-replace-vs-harvest

### Chapter 14.7 — Capacity, Power & Thermal Management in Operation
- Live power management and oversubscription (training ~3% vs inference ~21% headroom)
- Power transients and grid impact → physics canonical in Chapter 4.5; cooling-controls transients → Chapter 5.12; here we cover operational management
- Stranded capacity; thermal operations
- Capacity planning/forecasting; workload-aware operations and demand response; energy/efficiency operations

### Chapter 14.8 — Firmware & Software Lifecycle Management at Fleet Scale
- The firmware estate and update mechanics (Redfish/PLDM, OCP GPU FW Update)
- Fleet orchestration (rolling/canary, drift detection); the dependency matrix; firmware supply-chain security
- Downtime minimization, driver/software-defined operations; firmware change-management under the procedures framework → see Chapter 14.12; rollback discipline

### Chapter 14.9 — Hardware Refresh, Depreciation Strategy, Decommissioning & ITAD
- Refresh-cycle drivers; the training-to-inference cascade as refresh execution → economic-vs-book-life argument canonical in Chapter 1.8
- IT decommissioning workflow; data sanitization and security; ITAD/resale and the circular economy
- Migration and transition operations (capacity continuity during refresh)

### Chapter 14.10 — Facility Decommissioning, Repowering & Site Remediation
- End-of-life of the facility itself as an explicit lifecycle stage (distinct from IT/ITAD in Chapter 14.9)
- Generator, fuel-tank, and BESS removal; decommissioning-bond drawdown (links to Chapter 3.11)
- Coolant, glycol, dielectric-fluid, and PFAS disposal and environmental closeout
- Repowering vs demolition vs adaptive reuse; reusing power and shell for the next-generation build
- Brownfield legacy, contamination remediation, and the Phase I/II ESA exit
- Restoration obligations under leases, host agreements, and permits

### Chapter 14.11 — Operations Organization, Workforce, Talent & Incident Command
- Org structure across the lifecycle: the development-phase team (developers, EPC PMs, commissioning agents) vs the steady-state ops org
- The skilled-labor shortage (electricians, pipefitters, HV technicians) as a siting and schedule constraint → construction-phase view in Chapter 6.6
- Training pipelines, apprenticeships, and certification ladders; growing the operator bench
- The facility-ops vs ML-platform-ops org boundary — who owns goodput; the integrated operating model
- Shift models, on-call, and the unified incident-command model (canonical home; the converged cyber-physical escalation trigger referenced from Chapter 11.12); runbooks and the automation tradeoff
- Human error as the dominant outage cause → the procedures/error-trap discipline in Chapter 14.12; vendor/colo/operator boundaries; continuous improvement and postmortems

### Chapter 14.12 — Operational Procedures, Change Management & Human-Error Control
- The MOP/SOP/EOP framework: method, standard, and emergency operating procedures as the canonical home
- Change classification and the CAB/MOC process spanning BOTH the facility and the cluster
- Switching orders and the disciplined-operation regime for live, concurrently-maintainable plant → LOTO in Chapter 6.9, concurrent maintainability in Chapters 12.1/14.5
- EOPs tied directly to the Appendix F failure-mode catalog
- Human-factors and error-trap analysis; why human error is the plurality outage cause and how procedure design controls it
- The canonical home pointed to by Chapter 14.8 (firmware change-mgmt) and Chapter 14.11 (human error/runbooks)

### Chapter 14.13 — Agentic Ops, RL Control & the Autonomy Ladder
- The autonomy ladder: observe → recommend → act → fully autonomous facility operation
- Reinforcement-learning joint control of workload + cooling + power within explicit safety envelopes
- The human-oversight and liability boundary; when an agent may act unsupervised
- Consolidating (not duplicating) the threads in Chapters 14.2, 10.7, 15.2, and 4.5 via cross-reference; AI-assisted design lives in Chapter 2.7
- Differentiated from the org/workforce-automation lens of Chapter 14.11

### Chapter 14.14 — Continuous & Re-Commissioning on a Live Campus
- Periodic re-testing on a live campus: black-building, failover, and thermal-ride-through re-tests without taking the factory down
- Re-commissioning triggers, ranked by risk: a density step-up as the highest-risk event, equipment replacement, and topology change
- The drift-detection (Chapters 14.2/14.8) → re-validation loop
- Tie-in to DR drills (Chapter 12.3) and the FMEA catalog (Appendix F)

---

## PART 15 — Sustainability & Efficiency

### Chapter 15.1 — Efficiency Metrics: PUE, WUE, ERF, REF & the Post-PUE Metric Stack
- PUE fundamentals and why it breaks for AI/liquid cooling; WUE site vs source (canonical metric-stack home; Chapters 0.3 and 5.1 reference this chapter)
- ERF/ERE, REF, CUE; beyond facility efficiency (TUE/ITUE, DCeP); work-based metrics
- Building an honest measurement plan and avoiding vanity numbers

### Chapter 15.2 — Energy Efficiency: Cooling, Free Cooling, Setpoints & Power-Chain Losses
- The efficiency budget and where the leverage is; cooling architecture and PUE bands
- Free cooling/economization; warm-water/high-temp cooling; setpoint strategy
- Power-chain efficiency; ML-driven cooling optimization; part-load and utilization

### Chapter 15.3 — Carbon, Clean Power Procurement & 24/7 Carbon-Free Energy
- Scope 1/2/3 for data centers; location-based vs market-based; the GHG Protocol Scope 2 revision
- Annual matching vs 24/7 CFE; clean-firm procurement (nuclear/SMR, geothermal, gas+CCS)
- On-site/behind-the-meter carbon tradeoffs; carbon-aware compute scheduling; offsets as last resort

### Chapter 15.4 — Water Stewardship
- The water problem and the energy-water nexus as operational stewardship (siting gate → see Chapter 3.7)
- WUE drivers and water-source strategy; closed-loop/zero-evaporation designs
- Water quality, blowdown, Legionella control (ASHRAE 188), and ZLD/discharge compliance → cooling-tower engineering in Chapter 5.8
- Water-positive accounting; basin-level risk and social license; source/total footprint disclosure

### Chapter 15.5 — Heat Reuse & District Heating (Sustainability & Economics)
- The waste-heat opportunity and heat-grade economics (engineering home → see Chapter 5.9)
- District-heating integration (Stockholm model, temperature-indexed tradable contracts) as a sustainability/regulatory matter
- Economics, payback, and EU regulatory drivers (EnEfG, DDADUE, EED ERF reporting)
- Designing for heat reuse as a constraint; reference projects and heat-as-a-product models

### Chapter 15.6 — Embodied Carbon & Circularity Across the Lifecycle
- Why embodied carbon now dominates lifecycle emissions; LCA for data centers
- Building-shell embodied carbon and low-carbon construction levers; IT/infrastructure embodied carbon
- The refresh-cycle dilemma; circularity, end-of-life, and embodied carbon in procurement/disclosure

### Chapter 15.7 — Regulation, Reporting & Disclosure Frameworks
- EU EED Art. 12 and the data-center rating scheme; national implementations
- Carbon/ESG disclosure (CSRD/ESRS, ISSB, SEC); the standards backbone (ISO 30134, EN 50600, ASHRAE)
- US and other-region landscape; building a defensible, audit-ready reporting program

### Chapter 15.8 — Grid Impact, Energy-Systems Integration & Grid Services
- Co-location and behind-the-meter as a grid-integration strategy (the macro load-growth narrative → see Chapter 16.1; process mechanics → see Chapter 3.2)
- Grid services and flexibility as both siting accelerant AND ongoing revenue/cost lever (canonical home for flexibility): ancillary-service stacking, capacity payments, demand-response/interruptible credits, and UPS/BESS-as-grid-resource
- The flexibility-unlocks-headroom thesis: ~100 GW of interconnection headroom via ~0.5% curtailment (EPRI DCFlex, Duke 98 GW) → grid-interactive engineering in Chapter 4.10, financing-downside framing in Chapters 1.8/3.2
- Powering strategies and their grid/carbon footprint; community relations and social license

---

## PART 16 — Trends, Roadmaps & the Future

### Chapter 16.1 — The Power-Bound Era: Why the Bottleneck Moved to the Substation
- From chip-bound to power-bound; the gigawatt campus as the unit of compute
- The global demand curve and interconnection wall (macro narrative home; mechanics → Chapter 3.2, flexibility → Chapter 15.8); long-lead equipment as the real gate
- "Announced vs under construction"; time-to-megawatt as the master variable

### Chapter 16.2 — Subsystem Roadmaps 2026 → 2030 (Consolidated)
- Power: 415 VAC → 800 VDC, solid-state transformers, behind-the-meter generation
- Cooling: single-phase DLC as table stakes, two-phase and microfluidics on deck
- Compute & memory: Vera Rubin → Rubin Ultra/Kyber → Feynman, HBM4, the advanced-packaging gate
- Network & optics: scale-up wars, Ultra Ethernet, co-packaged optics (CPO), 1.6T → 3.2T, the standards bet (NVLink/NVLink Fusion vs UALink/UALoE vs Broadcom SUE/ESUN)
- Storage: GPU/DPU-initiated I/O, all-flash everywhere, file/object convergence, CXL tiering
- Rack/facility/integration: from the 19" rack to the 600 kW power-and-cooling chassis (consolidates the per-part forward pointers from Parts 4–9)

### Chapter 16.3 — Software, Orchestration & Efficiency at the Frontier
- The efficiency curve vs demand (Jevons paradox); algorithmic efficiency
- Reasoning and test-time compute as a structural demand multiplier reshaping the decode-heavy future → demand framing in Chapter 1.3
- Utilization as the hidden ROI lever; inference-dominant futures; power-aware orchestration
- Software portability, the CUDA moat, and energy/carbon FinOps

### Chapter 16.4 — The Economics of the Build-Out
- The capex wave and build cost per MW; the depreciation problem and GPU ROI decay → economic-vs-book-life argument canonical in Chapter 1.8; here we cover the macro/financing lens (sector-level, distinct from the firm-level model in Chapter 1.8)
- Financing and the debt wave; power oversubscription as a yield strategy
- The revenue-vs-capex gap; what actually protects returns

### Chapter 16.5 — Scenarios for 2030
- Centralized mega-campus vs distributed inference; supercycle vs digestion vs bubble (the structural-scenario lens, distinct from the firm-level economics of Chapter 1.8)
- Industry structure (neocloud vs hyperscaler vs sovereign/enterprise); the geographic re-map
- The power-supply endgame (CFE, SMR, fusion optionism); demand-side wildcards
- Stranded-asset/obsolescence risk; systemic and societal constraints

---

## APPENDICES & REFERENCE DATA

### Appendix A — Standards & Specifications Cross-Reference Matrix
- Uptime/TIA-942/EN 50600 crosswalk; ASHRAE classes; ISO 30134 KPIs; OCP spec library (edition/date/AI-relevance)

### Appendix B — Reference Designs & Worked Examples
- Per-archetype design-basis sheets; scalable-unit (SU) power/cooling/water/network budgets
- A 50 MW campus sized from SU up; 100k-GPU cluster reference BOM

### Appendix C — Decision Tables & Calculators
- Open GPU TCO / $-per-GPU-hr and $-per-M-token calculators; PUE/WUE reference; networking taxonomy
- The levered-IRR / project-finance model: pro-forma cash flow, DCF→IRR/NPV/DSCR, contracted-vs-merchant debt capacity, sensitivity tornado (companion to Chapter 2.5)
- Density-to-cooling-cliff map; redundancy-topology selector; site-scoring template

### Appendix D — Numbers Provenance & Forecast Register
- Date-stamped market/forecast table (capex, GW, % of grid, TWh) with source, vintage, scenario, caveats
- The contested depreciation figures bound from Chapter 1.8: 2–3 yr economic vs 5–6 yr book life, residual curves, and hyperscaler life-extension entries

### Appendix E — Glossary, Phase-Gate Timeline & Learning/Community Map
- Full glossary; the project phase-gate timeline with realistic durations and the critical path
- Certification ladder, conference calendar, and feeds-to-follow

### Appendix F — Failure-Mode / FMEA Catalog
- Consolidated FMEA reference cross-referenced from Chapters 12.1, 12.5, 14.3, and 13.6, and used by the availability model in Chapter 12.5
- Coolant leak cascade; simultaneous-GPU-load-step grid trip; HBM thermal runaway
- Cooling-controls transient excursion (setpoint hunt / dew-point excursion on load drop)
- Fiber cut; fuel-supply interruption; water-curtailment event
- BESS thermal runaway; utility ride-through / voltage-disturbance events
- Each failure mode treated as dual-use (random AND attacker-induced) → cyber-physical analysis in Chapter 11.10
- For each: trigger, propagation path, detection, blast radius, mitigation, and recovery

### Appendix G — Regional & International Design Deltas: Consolidated Quick-Reference Crosswalk
- Consolidated quick-reference crosswalk: this appendix now summarizes and points back to the regional/international content woven into Parts 3, 4, 6, and the sustainability part (it is no longer the sole home of global content)
- Electrical standards: 50Hz vs 60Hz, voltage levels, system-earthing regime (TN-S/TT/IT), plug/connector and distribution-norm differences
- Code deltas: EU vs US electrical and fire codes; CE/IEC vs NEC/NFPA
- Regional specifics: APAC, Middle East, India, China — regulatory, grid, and water context
- Climate hardening: cold-climate vs hot-arid vs tropical; monsoon/typhoon/extreme-cold facility design
- How the deltas reshape siting (Part 3), electrical (Part 4), cooling (Part 5), and building (Part 6) decisions
