<!-- Provenance -->
> **Compiled 2026-06-29** from a 31-agent research sweep (439 raw sources gathered across 4 recon angles + 24 technical domains, deduplicated and credibility-ranked here). Prefer **High**-rated primary sources; cross-check **Medium**-rated specifics before quoting a number. Raw per-domain research lives in [`research/`](research/). When citing a volatile figure in the guide, date-stamp it and link back to the entry here.

# AI Data Center Bibliography — Deduplicated, Credibility-Ranked, by Topic

Credibility key: **High** = primary/authoritative (standards bodies, vendor primary docs, peer-reviewed, top analysts, government/IEA/EPRI); **Medium** = useful secondary synthesis, vendor-marketing-inflected, or advocacy-adjacent (cross-check specifics); **Low** = dropped. Duplicate entries appearing under multiple research streams have been merged.

---

## 1. Existing End-to-End Guides, Books & Standards

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| BICSI 002-2024: Data Center Design & Implementation Best Practices | BICSI | https://www.bicsi.org/standards/available-standards-store/single-purchase/ansi-bicsi-002-the-standard-for-data-center-design | High | The most comprehensive lifecycle design+implementation standard (~575 pp); 2024 ed. expanded liquid/immersion + edge/modular. Use as the chapter-structure spine. |
| Uptime Institute Tier Standard (Topology + Operational Sustainability) | Uptime Institute | https://uptimeinstitute.com/tiers | High | De facto Tier I-IV resilience classification + operations standard; ATD/ATS credential ladder. |
| ANSI/TIA-942-C (May 2024) | TIA TR-42.1 | https://tiaonline.org/products-and-services/tia942certification/ansi-tia-942-standard/ | High | Full-facility Rated-1..4 system + telecom/cabling; C revision adds AI-growth & sustainability accommodations. |
| EN 50600 / ISO/IEC 22237 series | CEN / ISO/IEC JTC 1 | https://www.iso.org/standard/72687.html | High | International/European modular standard family; Availability/Protection Classes + ISO/IEC 30134 KPIs (PUE/WUE/REF). |
| Data Center Handbook, 2nd ed. | Wiley / Hwaiyu Geng (ed.) | https://www.wiley.com/en-us/Data+Center+Handbook...-p-9781119597506 | High | Closest single-volume end-to-end DC guide; 36 chapters across plan→design→build→operate→DR. |
| The Datacenter as a Computer, 3rd ed. | Barroso, Hölzle, Ranganathan (Google) | https://link.springer.com/book/10.1007/978-3-031-01761-2 | High | Canonical WSC academic text; deep on power/energy efficiency, TCO modeling, failure at scale. Free PDF. |
| AI Data Center Network Design and Technologies | Subramaniam, Styszynski, Tambakuwala (Pearson/O'Reilly) | https://www.oreilly.com/library/view/ai-data-center/9780135436370/ | High | Rare book-length, vendor-agnostic treatment of AI cluster fabric design (rail-optimized, RoCE/IB, congestion control). |
| Comprehensive Overview of DC Design Standards (2025) | GBC Engineers | https://gbc-engineers.com/news/data-center-design-standards | Medium | Cross-walk of Uptime/BICSI/TIA-942/EN 50600 + compliance frameworks. |
| DC construction & lifecycle web guides | Dgtl Infra; Global Data Center Hub; Mastt; Broadstaff | https://dgtlinfra.com/building-data-center-construction/ | Medium | Practitioner build-lifecycle (siting→permitting→procurement→construction→commissioning→decommission) with timelines. |
| 7x24 Exchange International | 7x24 Exchange | https://www.7x24exchange.org/ | Medium | "End-to-end reliability" practitioner community; conference talks/white papers, not a written guide. |

## 2. Analysts, Newsletters & Independent Deep-Dives

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| Datacenter Anatomy series (Pt 1 Electrical, Pt 2 Cooling) | SemiAnalysis (Dylan Patel et al.) | https://newsletter.semianalysis.com/p/datacenter-anatomy-part-1-electrical | High | Best technical-journalism walkthrough of DC electrical + cooling; "Four Delta Ts," hyperscaler cooling-philosophy comparison. Closest thing to a lifecycle guide; steal the TOC. |
| 100,000 H100 Clusters: Power, Topology, Ethernet vs IB, Reliability | SemiAnalysis | https://newsletter.semianalysis.com/p/100000-h100-clusters-power-network | High | The single best full-cluster end-to-end teardown; quantified (power, BoM, failures, checkpointing). |
| AI Neocloud Playbook and Anatomy | SemiAnalysis | https://newsletter.semianalysis.com/p/ai-neocloud-playbook-and-anatomy | High | 1024-GPU reference arch, 8-rail fat-tree, oversubscription, GPU:CPU ratios, per-server cost ($283-318k), full stack. |
| How AI Labs Are Solving the Power Crisis: Onsite Gas Deep Dive | SemiAnalysis | https://newsletter.semianalysis.com/p/how-ai-labs-are-solving-the-power | High | Definitive BTM-gas teardown: turbine/engine/fuel-cell models, real fleets (xAI, Crusoe Abilene), redundancy math, supply-chain bottleneck. |
| Multi-Datacenter Training (OpenAI vs Google) | SemiAnalysis | https://newsletter.semianalysis.com/p/multi-datacenter-training-openais | High | Why GPUs co-locate, synchronous gradient sync, gigawatt-campus topology, inter-site fiber, async/hierarchical SGD. |
| Inside the 800VDC Revolution (Part 1) | SemiAnalysis | https://newsletter.semianalysis.com/p/inside-the-800vdc-revolution-part | High | Authoritative rack-power roadmap: 48V→±400/800VDC physics, SST efficiency, market sizing, UL/NEC timeline. |
| GB200 Hardware Architecture & Component Supply Chain/BOM | SemiAnalysis | https://newsletter.semianalysis.com/p/gb200-hardware-architecture-and-component | High | Definitive NVL72 teardown: tray layout, 5,184 copper NVLink cables, copper-vs-optics economics, NVL72 vs NVL36×2. |
| AI Capacity Constraints — CoWoS and HBM Supply Chain | SemiAnalysis | https://newsletter.semianalysis.com/p/ai-capacity-constraints-cowos-and | High | The upstream gate: CoWoS/HBM allocation as the real lead-time driver above assembly. |
| The New AI Networks: Ultra Ethernet vs UALink vs SUE | SemiAnalysis | https://newsletter.semianalysis.com/p/the-new-ai-networks-ultra-ethernet-uec-ualink-vs-broadcom-scale-up-ethernet-sue | High | Fabric-standards landscape and the copper/optics reach implications of each. |
| The GPU Cloud ClusterMAX Rating System / ClusterMAX 2.0 | SemiAnalysis | https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard | High | De facto GPU-cloud quality/SLA benchmark across 10 dimensions; acceptance, health-checks, isolation, goodput. |
| How Much Do GPU Clusters Really Cost? / H100 Rental Index | SemiAnalysis | https://newsletter.semianalysis.com/p/how-much-do-gpu-clusters-really-cost | High | $/GPU-hr build-up, goodput/MFU, market structure (on-demand/reserved/take-or-pay). |
| Models & Research: Datacenter Industry Model, AI Cloud TCO, AI Networking | SemiAnalysis | https://semianalysis.com/datacenter-industry-model/ | High | 5,000+ facility bottom-up capacity model (permits/FOIA/satellite); GPU TCO and networking models. Paid. |
| The Next Platform (HPC/AI systems analysis) | Timothy Prickett Morgan, Nicole Hemsoth | https://www.nextplatform.com/author/tpmn | High | Free deep systems/architecture journalism; accelerator µarch, interconnects, roadmaps, market sizing. |
| ServeTheHome (teardowns, facility tours, deploy speed) | Patrick Kennedy | https://www.servethehome.com/category/ai-deep-learning/ | High | Hands-on board-level hardware detail, PCIe-GPU guide, facility tours, rack-as-a-unit deployment. |
| Construction Physics (power, grid, construction econ) | Brian Potter | https://www.construction-physics.com/ | High | "The Grid" series, AI power-demand pieces, Virginia grid-trip event, gas-turbine backlog. |
| Asianometry (visual explainers) | Jon Y | https://www.asianometry.com/ | High | Best YouTube/Substack on semis, cooling physics, "Big Data Center Water Problem," supply-chain history. |
| Fabricated Knowledge (semi + financial framing) | Doug O'Laughlin | https://www.fabricatedknowledge.com/ | High | "DC is the New Compute Unit," HBM/memory economics, who's well-positioned. |
| The Register — systems/DC news of record | Tobias Mann | https://www.theregister.com/author/tobias-mann | High | Skeptical, technically literate coverage (xAI Colossus networking, liquid cooling, optical scale-up). |
| Data Center Frontier & Data Center Knowledge | Endeavor Business Media (DCF); Informa (DCK) | https://www.datacenterfrontier.com/home | High | Leading trade press; best secondary syntheses of bank/RE reports; strong daily news + explainers. |
| DataCenterHawk (market GW tracking) | David Liggitt | https://datacenterhawk.com/ | High | Metro-level GW supply/absorption; popular market-fundamentals podcast. |
| Tom's Hardware (DC/AI + GPU lab testing) | Tom's Hardware Labs | https://www.tomshardware.com/tech-industry/data-centers | Medium | Lab-tested GPU benchmarks + investigative DC reporting (labor, water stress, transformer imports). |
| Long-form analyst interviews | Dwarkesh; Latent Space; Stratechery | https://www.dwarkesh.com/p/dylan-jon | High | Top analysts' mental models unpaywalled (scaling bottlenecks, semi industry mechanics). |

## 3. Economics, Market & Forecasts

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| TCO of a 1 GW AI data center | Epoch AI | https://epoch.ai/data-insights/ai-datacenter-cost-breakdown | High | Rigorous bottom-up TCO (~$38B capex, ~$8.5B/yr); the canonical cost-structure reference. |
| The cost of compute: $7T race to scale | McKinsey | https://www.mckinsey.com/industries/technology-media-and-telecommunications/our-insights/the-cost-of-compute-a-7-trillion-dollar-race-to-scale-data-centers | High | ~$6.7T DC / $5.2T AI capex by 2030; 3 scenarios; companion infra & financing pieces. |
| Bridging the Data Center Financing Gap | Morgan Stanley Research | https://www.morganstanley.com/content/dam/msdotcom/en/assets/pdfs/Research_Bridging-Data-Center-Gap.pdf | High | ~$2.9T build 2025-28; the unique ~$1.5T financing-gap framing (private credit/ABS/SPVs). |
| Data Center Power Demand: The 6 Ps | Goldman Sachs Research | https://www.goldmansachs.com/insights/goldman-sachs-research/data-center-power-demand-the-6-ps-driving-growth-and-constraints | High | +165-175% power by 2030, ~$720B grid spend as bottleneck. |
| DC Capex 21% CAGR through 2029 / +57% 2025 | Dell'Oro Group | https://www.delloro.com/news/data-center-capex-to-grow-at-21-percent-cagr-through-2029/ | High | Hard market-tracking: 2026 capex >$1T, GPUs ~1/3 of capex, quarterly hyperscaler tracking. |
| 2026 Global Data Center Outlook | JLL Research | https://www.jll.com/en-us/insights/market-outlook/data-center-outlook | High | RE-lens: ~14% CAGR, ~1% vacancy, 4-yr grid queues, build vs lease lead times, inference overtaking training. |
| 2026 North America DC Trends / Global Outlook | CBRE Research | https://www.cbre.com/insights/books/north-america-data-center-trends-h2-2025 | High | "Demand surges, delivery is the constraint"; power as #1 siting criterion; primary-market data. |
| Powering Intelligence 2026 (US electricity scenarios) | EPRI | https://powering-intelligence.epri.com/executive-summary.html | High | Utility-grade US scenarios: DCs = 9-17% of US power by 2030; state concentration; flexible-load strategies. |
| Tracking Trillions: assumptions shaping the AI build-out | Goldman Sachs | https://www.goldmansachs.com/insights/articles/tracking-trillions-the-assumptions-shaping-scale-of-the-ai-build-out | High | Capex magnitudes, GPU useful-life/depreciation, ROI-vs-capex gap; sizing & ROI framing. |
| The evolution of neoclouds and their next moves | McKinsey | https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/the-evolution-of-neoclouds-and-their-next-moves | High | Consulting-grade neocloud strategy: differentiation, margin pressure, capital intensity. |
| Neocloud Business Model and Unit Economics | AM Compute | https://www.amcompute.com/blog/neocloud-business-model | Medium | Concrete unit economics (per-provider rates, depreciation schedules, breakeven $/GPU-hr, CoreWeave financials). |
| Build or Lease? / Power Not Space: Colocation Battleground | Global Data Center Hub; DCK; datacenters.com | https://www.globaldatacenterhub.com/p/build-or-lease-inside-the-billion | Medium | Build-vs-lease framework, wholesale colo $/kW-mo, BTS/powered-shell/CTL structures, optionality thesis. |
| Nvidia, CoreWeave, Nebius: Circular Financing of the GPU Boom | I/O Fund | https://io-fund.com/ai-stocks/nvidia-coreweave-nebius-circular-financing-gpu-boom | Medium | Leverage, DDTLs, customer concentration, residual backstop, circular-financing loop. |
| GPU-collateralized debt / Litigation Risks in AI DC Financing | Quartz; Quinn Emanuel; Bird & Bird | https://www.quinnemanuel.com/the-firm/publications/client-alert-emerging-litigation-risks-in-financing-ai-data-centers-boom/ | High | GPU-backed lending, SPVs, securitization, under-collateralization and failure modes. |
| How long before a GPU depreciates? + useful-life analyses | CNBC; Stanley Laman; DeepQuarry | https://www.stanleylaman.com/signals-and-noise/gpus-how-long-do-they-really-last | Medium | Three useful lives, hyperscaler depreciation extensions, secondary-market residuals, Burry thesis. |
| Inference Unit Economics / AI gross margins / LLMflation | Introl; a16z; ICONIQ; Spheron | https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide | Medium | Cost-per-Mtoken build-up, tokens/GPU-s, app gross-margin benchmarks, ~10x/yr inference cost decline. |
| AI Capex 2026 (~$600-725B) & demand outlook | CreditSights; Futurum; Goldman; DCD | https://www.goldmansachs.com/insights/articles/tracking-trillions-the-assumptions-shaping-scale-of-the-ai-build-out | High | Top-down capex magnitudes, sovereign/enterprise demand broadening, Stargate, ROI-vs-capex gap. |
| Jevons paradox in AI: cost decline vs demand growth | SSRN/arXiv; CloudZero; IDC | https://aiproem.substack.com/p/the-jevons-paradox-in-ai-infrastructure | Medium | ~1000x token-cost decline vs rising aggregate spend; DeepSeek shock; two-tier inference market. |
| AI DC Forecast: From Scramble to Strategy (2030 scenarios) | Bain & Company | https://www.bain.com/insights/ai-data-center-forecast-from-scramble-to-strategy-snap-chart/ | High | Centralized vs distributed, inference share by 2030, neocloud emergence, ~$1T infra spend. |

## 4. Power & Energy — Grid, Interconnection, On-Site Generation

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| Datacenter Anatomy Pt 1: Electrical Systems | SemiAnalysis | https://newsletter.semianalysis.com/p/datacenter-anatomy-part-1-electrical | High | Utility-to-rack chain: HV/MV substation, switchgear, transformer MVA sizing, redundancy topologies, UPS, busbar. |
| How AI Labs Are Solving the Power Crisis: Onsite Gas | SemiAnalysis | https://newsletter.semianalysis.com/p/how-ai-labs-are-solving-the-power | High | The definitive on-site-gas technical/economic teardown (also relevant to Economics & Siting). |
| ERCOT Large Load Integration (NPRR1234 / PGRR115) | ERCOT | https://www.ercot.com/services/rq/large-load-integration | High | Primary ISO source: 75 MW definition, interconnection study, telemetry/ride-through, 138 GW forecast. |
| ERCOT Large Load Interconnection Process (PUC filing) | PUC of Texas | https://interchange.puc.texas.gov/Documents/55999_168_1550025.PDF | High | Primary regulatory document on the Texas large-load procedure. |
| FERC PJM Co-Located Load & BTM Generation Order (Dec 2025) | FERC | https://www.ferc.gov/news-events/news/fact-sheet-ferc-directs-nations-largest-grid-operator-create-new-rules-embrace | High | Primary colocation order: four transmission-service options, BTMG cost-causation, 60-day tariff window. |
| Interconnection of Large Loads (Docket RM26-4) + DOE Sec. 403 | FERC; White & Case; Gibson Dunn | https://www.ferc.gov/rm26-4 | High | DOE directive federalizing >20 MW interconnection; Apr 30 2026 deadline; jurisdictional collision. |
| NERC Level 3 Alert on Data Center Load Losses (2026) | NERC / Utility Dive | https://www.utilitydive.com/news/nerc-issues-rare-level-3-alert-over-data-center-load-losses/819295/ | High | The ride-through reliability problem: ~1,500 MW trip on a 230 kV fault; mandated actions. |
| Why NERC Now Sees AI DCs as Grid Actors | NERC via Data Center Frontier | https://www.datacenterfrontier.com/energy/article/55376679/why-nerc-now-sees-ai-data-centers-as-grid-actors | High | Synchronized multi-MW load drops, fault-ride-through and large-load study requirements. |
| Can US Interconnection Queues Survive Load Growth? | Ascend Analytics | https://www.ascendanalytics.com/blog/large-load-interconnection-queues-data-center-grid-access | High | RTO large-load pathway comparison (SPP/PJM/ERCOT), queue-scale data, BTM-vs-grid economics. |
| Practical Guidance for Large Load Interconnections (interim) | GridLab | https://gridlab.org/wp-content/uploads/2025/03/GridLab-Report-Large-Loads-Interim-Report.pdf | High | Independent technical guidance on studies, flexibility, cost allocation across ISOs. |
| ~100 GW grid headroom via flexible/curtailable load | Duke Nicholas Institute / Latitude Media | https://www.latitudemedia.com/news/the-us-grid-may-have-over-100-gw-of-load-to-spare/ | High | Empirical basis for flexible interconnection: 98 GW at 0.5% annual curtailment. |
| EPRI DCFlex Initiative (flexibility hubs) | EPRI / Utility Dive | https://www.utilitydive.com/news/data-centers-flexibility-utilities-speed-to-power/822588/ | High | 40+ org consortium testing demand response, workload shifting, UPS-as-grid-resource. |
| DOE Directs FERC to Accelerate Interconnection (Sec. 403) | White & Case / DOE | https://www.whitecase.com/insight-alert/doe-directs-ferc-accelerate-interconnection-data-centers | High | The Section 403 directive and large-load NOPR (final rule due Apr 30, 2026). |
| Who pays for the buildout? 23 states' large-load tariffs | Environment+Energy Leader | https://environmentenergyleader.com/stories/who-pays-for-the-data-center-buildout-23-states-have-already-decided,129803 | Medium | State large-load tariff landscape (Oregon POWER Act, take-or-pay, cost-allocation). |
| Bypassing the Grid: Behind-the-Meter Data Centers | Cleanview | https://cleanview.co/reports/behind-the-meter-data-centers | Medium | BTM market data (~82 GW announced) + project tracker. |
| Three Mile Island restart / nuclear & SMR PPAs | DataCenterDynamics | https://www.datacenterdynamics.com/en/news/three-mile-island-nuclear-power-plant-to-return-as-microsoft-signs-20-year-835mw-ai-data-center-ppa/ | High | Firm clean-supply procurement: TMI/Crane 835 MW, Amazon-Talen ~2 GW, Google-Kairos SMR. |
| Grid-Enhancing Technologies (DLR, reconductoring) | US DOE | https://www.energy.gov/oe/grid-enhancing-technologies-improve-existing-power-lines | High | GETs capacity unlock (DLR +30-50% at ~10% cost) to avoid multi-year line builds. |
| Powering Data Centers (PPA/nuclear/colocation) | Orrick; EIA; LevelTen; Pillsbury | https://www.orrick.com/en/Insights/2025/11/Powering-Data-Centers | High | PPA structures, take-or-pay, FERC colocation rules, nuclear/SMR megadeal terms. |
| Gas Engines vs Turbines vs CCGT: Self-Generation Guide | Grid Capacity Intelligence | https://gridcapacityintelligence.substack.com/p/gas-engines-vs-gas-turbines-vs-ccgt | High | Side-by-side RICE/aeroderivative/frame/CCGT: capacity, efficiency, capex/kW, lead times; phased framework. |
| Combustion Engine vs Aeroderivative Turbine (comparison) | Wärtsilä Energy | https://www.wartsila.com/energy/learn-more/technology-comparison-engines-vs-aeros | High | Data-rich part-load efficiency, ramp rates, min load, derating (read with vendor-bias caveat). |
| Power crunch lifts engines & aeroderivatives | Power Engineering | https://www.power-eng.com/gas/data-center-power-crunch-lifts-engines-aeroderivatives-into-larger-role/ | High | Specific 2025 orders/capacities (Wärtsilä, ProEnergy, INNIO/Rehlko) and lead times. |
| Aeroderivative Turbines Move to Center of AI Power | Data Center Frontier | https://www.datacenterfrontier.com/energy/article/55358731/aeroderivative-turbines-move-to-the-center-of-ai-data-center-power-strategy | High | Aeroderivative vs industrial vs RICE, 5-min start, island capability, lead-time crisis. |
| Engine Power Plants Surge as Data Centers Drive Demand | POWER Magazine | https://www.powermag.com/engine-power-plants-surge-as-data-centers-drive-unprecedented-demand/ | High | RICE sizing/modularity/redundancy and market trajectory. |
| Bloom Energy–Oracle (2.8 GW) & AEP/Brookfield SOFC | Bloom Energy IR / DCD | https://investor.bloomenergy.com/press-releases/press-release-details/2026/Bloom-Energy-and-Oracle-Expand-Strategic-Partnership.../default.aspx | High | Primary GW-scale solid-oxide fuel-cell deployment data, manufacturing targets, hydrogen-readiness. |
| LNG/CNG for "Five Nines" + INGAA pipeline outlook | Natural Gas Intelligence / INGAA | https://naturalgasintel.com/news/lng-cng-gain-footing-as-data-centers-chase-five-nines-natural-gas-reliability/ | High | Fuel logistics: firm vs interruptible transport, LNG/CNG energy density, +39% pipeline buildout need. |
| Nuclear DC deals tracker + SMR analysis | SMR Intel / Deloitte / Utility Dive | https://smrintel.com/nuclear-data-center-deals/ | Medium | Hyperscaler nuclear-deal catalog, SMR capex/LCOE, NRC Part 53, realistic 2030-35 timelines. |
| SMRs and Data Centers: Emerging Regulatory Landscape | Davis Graham | https://davisgraham.com/news-events/small-modular-reactors-and-data-centers... | High | SMR specs/footprint, licensing risk, co-location/BTM nuclear deals. |
| Renewables-powered DCs feasible with ~7x overbuild | pv magazine / Canary Media | https://www.pv-magazine.com/2026/06/12/renewables-powered-data-centers-feasible-with-sevenfold-solar-and-wind-overbuild-study-finds/ | Medium | Renewable-firm hybrid feasibility, firmed-solar $/MWh trajectory, storage cost declines. |
| Surging Gas Turbine Demand / US Power Outlook | Mitsubishi Power / Turbomachinery Mag | https://power.mhi.com/regions/amer/insights/us-power-outlook-and-long-term-trends | High | OEM view on frame-turbine backlog, lead times, capacity expansion. |
| EPA Clean Air Act resources for data centers + NSPS review | US EPA; Trinity; Gibson Dunn | https://www.epa.gov/stationary-sources-air-pollution/clean-air-act-resources-data-centers | High | NSR/PSD thresholds, engine tiering, "temporary turbine" subcategory, BACT/LAER, GHG reporting. |

## 5. Electrical Distribution, 800 VDC & Backup

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| Inside the 800VDC Revolution (Part 1) | SemiAnalysis | https://newsletter.semianalysis.com/p/inside-the-800vdc-revolution-part | High | SST efficiency (ETH 98%/400kW), market sizing ($13B SST/$11B power-rack by 2030), Diablo 400 vs NVIDIA. |
| Enabling 1 MW IT Racks and Liquid Cooling (Mt Diablo / Deschutes) | Google Cloud | https://cloud.google.com/blog/topics/systems/enabling-1-mw-it-racks-and-liquid-cooling-at-ocp-emea-summit | High | Primary: ±400VDC disaggregated sidecar power, Mt Diablo standardization, EV supply-chain leverage. |
| Mt Diablo — Disaggregated Power for Next-Gen AI | Microsoft Azure | https://techcommunity.microsoft.com/blog/azureinfrastructureblog/mt-diablo---disaggregated-power.../4268799 | High | Microsoft's ±400VDC Diablo 400 account; sidecar power for >500kW Kyber-class racks. |
| 800 VDC Architecture for AI Data Centers | NVIDIA | https://www.nvidia.com/en-us/data-center/technologies/800-vdc-architecture/ | High | NVIDIA's 800VDC reference design, 1MW-rack vision, partner ecosystem, Kyber/Rubin Ultra roadmap. |
| How New GB300 NVL72 Features Provide Steady Power | NVIDIA Developer | https://developer.nvidia.com/blog/how-new-gb300-nvl72-features-provide-steady-power-for-ai/ | High | Transient mitigation: 65 J/GPU capacitance, 30% peak-grid reduction, ramp smoothing, Redfish controls. |
| NVIDIA Prepares Industry for 1MW Racks and 800 VDC | DataCenterDynamics | https://www.datacenterdynamics.com/en/news/nvidia-prepares-data-center-industry-for-1mw-racks-and-800-volt-dc-power-architectures/ | High | 800VDC ecosystem/roadmap, NVL144/Kyber, midplane vs cables, partner roles. |
| OCP Open Rack V3 specs (base, 48V BBU, power connector) | Open Compute Project | https://www.opencompute.org/wiki/Open_Rack/SpecsAndDesigns | High | Authoritative ORv3: 48V busbar, BBU trigger behavior, power-shelf ratings — the baseline Diablo 400 extends. |
| Diablo 400 Project: Rack and Power Base Spec (v0.5.2) | OCP (Google/Meta/Microsoft) | https://www.opencompute.org/documents/ocp-specification-diablo-400-v0p5p2-2025-05-30-pdf | High | Primary spec for disaggregated sidecar power and the 400/800VDC transition for >150kW/600kW racks. |
| Preparing for 800 VDC: ABB, Eaton support NVIDIA | Data Center Frontier | https://www.datacenterfrontier.com/energy/article/55323139/preparing-for-800-vdc-data-centers... | High | Vendor roles: ABB DC breakers + MegaFlex UPS, Eaton supercapacitor ride-through + busbar/sidecar. |
| Eaton 800 VDC Reference Architecture (with NVIDIA/ABB) | Eaton | https://www.eaton.com/us/en-us/company/news-insights/news-releases/2025/eaton-unveils-next-generation-architecture.html | High | Megawatt-rack power RA: 800VDC distribution, supercaps, ORv3 busbar, integrated storage. |
| Mitigating DC Harmonics & K-factor transformer guides | Consulting-Specifying Engineer; CalcPanel | https://www.csemag.com/articles/mitigating-data-center-harmonics/ | High | Harmonics/power-quality: THD, K-factor selection/derating, IEEE 519, mitigation hierarchy for 100% non-linear AI loads. |
| Designing Production-Ready BESS for AI Factories | NVIDIA | https://developer.nvidia.com/blog/designing-production-ready-battery-energy-storage-systems-for-ai-factories | High | Facility BESS roles (transient/ride-through/DR), Vera Rubin power smoothing (~400 J/GPU), closed-loop SoC. |
| The 1 MW AI IT rack needs 800 VDC power | Schneider Electric | https://blog.se.com/datacenter/2025/10/16/the-1-mw-ai-it-rack-is-coming-and-it-needs-800-vdc-power/ | High | Vendor-engineering view of 1MW-rack/800VDC, power-block modularization, MV-to-DC conversion. |
| Vertiv: Impact of Emerging Power Architectures + 800VDC line | Vertiv | https://www.vertiv.com/4aaa91/globalassets/products/critical-power/.../data-center-transformation...pdf | Medium | UPS topologies, eco-mode, modular power blocks, HVDC/800VDC migration (marketing-inflected). |
| Battery systems / grid integration / power stabilization | ScienceDirect; arXiv | https://www.sciencedirect.com/science/article/pii/S2352152X26000502 | High | Peer-reviewed: LFP chemistry, UPS-vs-BESS-vs-hybrid, SST-driven 800VDC simulation, multi-timescale control. |
| 800 VDC + solid-state transformers for AI DCs | IEEE Spectrum; Power Electronics News; Eaton; Vertiv | https://spectrum.ieee.org/data-center-dc | High | Power-architecture transition, SST (~99% efficiency), OCP Mt Diablo alignment, DC UPS/rack storage. |

## 6. Cooling & Thermal Management

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| ASHRAE TC 9.9 Thermal Guidelines (5th ed.) + 2024 Liquid Cooling Resiliency | ASHRAE TC 9.9 | https://tpc.ashrae.org/Documents?cmtKey=fd4a4ee6-96a3-4f61-8b85-43418dfa988d | High | Authoritative thermal/coolant envelope: air A1-A4, liquid H/W classes, FWS/TCS separation, coolant chemistry limits. |
| Major Changes to ASHRAE 5th Ed — Liquid Cooling / W-classes | ASHRAE / Upsite Technologies | https://www.upsite.com/blog/major-changes-to-ashraes-fifth-edition-of-thermal-guidelines-part-3-liquid-cooling-chapter-updates/ | High | W17-W45+ classes keyed to supply temperature; warm-water design rationale. |
| 30°C Coolant — A Durable Roadmap | ASHRAE TC 9.9 | https://ashrae.org.vn/wp-content/uploads/2024/12/30°C-Coolant-A-Durable-Roadmap...pdf | High | Case for standardizing higher facility-water temps to maximize free cooling and heat reuse. |
| Datacenter Anatomy Pt 2: Cooling Systems | SemiAnalysis | https://newsletter.semianalysis.com/p/datacenter-anatomy-part-2-cooling-systems | High | PUE/WUE, "Four Delta Ts," air/water/economizer comparison, hyperscaler cooling philosophies. |
| NVIDIA Contributes GB200 NVL72 Designs to OCP | NVIDIA | https://developer.nvidia.com/blog/nvidia-contributes-nvidia-gb200-nvl72-designs-to-open-compute-project/ | High | Primary rack cooling: ~120kW capacity, blind-mate liquid manifold, MGX rack/tray; also a racks/integration source. |
| NVIDIA, Partners Drive Gigawatt AI Factories (Vera Rubin/OCP) | NVIDIA | https://blogs.nvidia.com/blog/gigawatt-ai-factories-ocp-vera-rubin/ | High | Forward roadmap: Vera Rubin 100% liquid-cooled at 45°C, liquid-cooled busbar, Kyber 576-GPU. |
| Zero-water cooling next-gen datacenters | Microsoft | https://www.microsoft.com/en-us/microsoft-cloud/blog/2024/12/09/sustainable-by-design-next-generation-datacenters-consume-zero-water-for-cooling/ | High | Closed-loop chip-level cooling saving >125M L/yr; the WUE-vs-PUE tradeoff. |
| Microfluidic in-chip cooling (Microsoft Research) | Tom's Hardware / GeekWire / InfoQ | https://www.tomshardware.com/pc-components/liquid-cooling/microsoft-develops-breakthrough-chip-cooling-method-microfluidic-channels... | Medium | AI-designed bio-inspired channels, up to 3x cold-plate performance; next-gen roadmap signal. |
| OCP Door Heat Exchanger & OAI Liquid Cooling Guidelines | Open Compute Project | https://www.opencompute.org/projects/door-heat-exchanger | High | RDHx classes, door-HX limits, OAI/OCP liquid-cooling interface guidelines. |
| Understanding CDUs (+ Eaton/nVent specs) | Vertiv / Eaton / nVent | https://www.vertiv.com/en-us/about/news-and-insights/articles/educational-articles/understanding-coolant-distribution-units-cdus-for-liquid-cooling/ | High | CDU architecture (TCS/FWS isolation, HX, pumps), L2L vs L2A, filtration, dew-point control, capacity data. |
| PG 25 Coolant + flow-rate/delta-T design guidance | Dober | https://www.dober.com/performance-fluids/resources/pg-25-coolant-data-centers | Medium | Secondary-loop coolant selection, 7.5-12°C delta-T, 1.25-2.0 LPM/kW, biocide/material compatibility. |
| Two-phase immersion / PFAS crisis coverage | ServeTheHome / DCD / C&EN | https://www.servethehome.com/2-phase-immersion-cooling-halted-over-multi-billion-dollar-health-hazard-lawsuits/ | High | 3M Novec/PFAS exit, regulation, liability — why two-phase stalled vs single-phase DLC dominance. |
| Cold-plate vs immersion economics (2026) | Gottog / Energy Solutions / Introl | https://www.gottogpower.com/liquid-cooling-in-data-centers-explodes-in-2026.../ | Medium | D2C vs immersion CAPEX ($300-500 vs $1,000+/kW), density bands, 2026 adoption outlook. |
| Waste-Heat Recovery / Stockholm district heating | Stockholm Exergi; EU Covenant of Mayors; Energy Solutions | https://energy-solutions.co/articles/sub/data-center-waste-heat-district-heating | High | Heat-reuse economics, temperature-indexed contracts, heat-pump lift, quantified Stockholm cases. |
| Data center waste heat for district heating: a review | Renewable & Sustainable Energy Reviews (Elsevier) | https://www.sciencedirect.com/science/article/pii/S1364032125005362 | High | Peer-reviewed review: temperature grades, heat-pump integration, feasibility and barriers. |
| WUE benchmarks & water-vs-energy | Vertiv / EESI / dgtlinfra / Introl | https://www.vertiv.com/en-us/insights/articles/educational-articles/optimizing-water-usage-effectiveness-for-data-centers/ | Medium | WUE definition/benchmarks, hyperscaler water data, evaporative-vs-closed-loop tradeoffs. |
| Retrofitting legacy DCs for AI liquid cooling | Introl / Tetra Tech / Schneider | https://introl.com/blog/retrofitting-legacy-data-centers-ai-liquid-cooling-integration | Medium | Brownfield paths (L2A/RDHx/L2L), floor-loading, CDU placement, phased migration, stranded capacity. |
| Cold-plate liquid-cooling technical reviews | Elsevier (Applied Thermal Engineering) | https://www.sciencedirect.com/science/article/abs/pii/S1359431123021518 | High | Experimental thermal resistance, microchannel/jet-impingement, heat-flux/pressure-drop design data. |
| CoolIT 15kW cold plate / next-gen DLC roadmap | Data Center Dynamics / CoolIT | https://www.datacenterdynamics.com/en/news/coolit-designs-15kw-coldplate-to-future-proof-liquid-cooling-for-gpus/ | High | Cold-plate roadmap for 1.8-2.3kW GPUs; flow/pressure future-proofing. |
| Commissioning Liquid-Cooled Data Centers (step-by-step) | Peer-reviewed/industry (ResearchGate) | https://www.researchgate.net/publication/384915780_Maximizing_Cooling_Potential... | High | Multi-stage flushing, leak checks, sensor calibration, worst-case-branch full-load test. |
| Liquid-cooling roadmap: single/two-phase D2C, immersion | DCD / IDTechEx / Signal Integrity Journal | https://www.datacenterdynamics.com/en/opinions/two-phase-vs-single-phase-direct-to-chip-liquid-cooling-which-is-right-for-ai-data-centers-in-2026/ | High | Cooling-tech trajectory and market shares; single-phase D2C as 2026 default (~55%). |
| DC cooling state of play (2025) | Tom's Hardware | https://www.tomshardware.com/pc-components/cooling/the-data-center-cooling-state-of-play-2025... | Medium | DTC vs immersion, rising thermal densities, warm-water, direct-to-silicon trends. |

## 7. Silicon & Compute (Accelerators, CPUs, Memory, Numerics)

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| Inside the NVIDIA Vera Rubin Platform: Six New Chips | NVIDIA Developer | https://developer.nvidia.com/blog/inside-the-nvidia-rubin-platform-six-new-chips-one-ai-supercomputer/ | High | Primary Rubin/Vera/NVLink6/ConnectX-9/BlueField-4/Spectrum-6 + NVL72 specs. |
| Introducing NVFP4 for Low-Precision Inference | NVIDIA Developer | https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/ | High | NVFP4 microscaling vs MXFP4, quantization-error reduction, inference accuracy. |
| NVIDIA Rubin CPX for 1M+ Token Context | NVIDIA Developer | https://developer.nvidia.com/blog/nvidia-rubin-cpx-accelerates-inference-performance-and-efficiency-for-1m-token-context-workloads/ | High | Disaggregated inference (context vs generation), Rubin CPX specs, attention acceleration. |
| Ironwood: First Google TPU for the Age of Inference (TPU v7) | Google Cloud | https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/ | High | Primary TPU v7 specs: FP8 TFLOPS, 192GB HBM3E, 9,216-chip pods, perf/watt. |
| Amazon EC2 Trn3 UltraServers / Trainium3 | AWS | https://aws.amazon.com/ec2/instance-types/trn3/ | High | Primary Trainium3 specs: PFLOPS, HBM3E, NeuronLink/EFAv3, perf/watt vs Trn2. |
| AWS Activates Project Rainier (~500k Trainium2) | Amazon/AWS | https://www.aboutamazon.com/news/aws/aws-project-rainier-ai-trainium-chips-compute-cluster | High | Primary account of the Anthropic anchor-tenant cluster, scale-out networking. |
| AMD Instinct MI355X product page | AMD | https://www.amd.com/en/products/accelerators/instinct/mi350/mi355x.html | High | Primary CDNA4 specs: HBM3E, FP4/FP6, TDP, positioning vs Blackwell. |
| AMD Advancing AI: MI350X / MI400 UALoE72 / MI500 UAL256 | SemiAnalysis | https://newsletter.semianalysis.com/p/amd-advancing-ai-mi350x-and-mi400-ualoe72-mi500-ual256 | High | AMD rack-scale roadmap, UALink-over-Ethernet reality, scale-up world sizes. |
| AWS Trainium3 Deep Dive | SemiAnalysis | https://newsletter.semianalysis.com/p/aws-trainium3-deep-dive-a-potential | High | Independent Trainium3 architecture + Neuron software maturity vs NVIDIA. |
| Google TPUv7: The 900lb Gorilla | SemiAnalysis | https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-swing-at-the | High | Independent Ironwood analysis, OCS interconnect, gap-closing to NVIDIA. |
| AMD vs NVIDIA Inference Benchmark (cost/Mtoken) | SemiAnalysis | https://newsletter.semianalysis.com/p/amd-vs-nvidia-inference-benchmark-who-wins-performance-cost-per-million-tokens | High | Independent inference benchmarking; realized ROCm-vs-CUDA MFU gaps. |
| MI300X vs H100 vs H200 Benchmark Pt 1: Training | SemiAnalysis | https://newsletter.semianalysis.com/p/mi300x-vs-h100-vs-h200-benchmark-part-1-training | High | Cross-vendor benchmarking gotchas; why measured diverges from spec (acceptance-bar realism). |
| The Custom AI ASIC State of Play (May 2026) | Tom's Hardware | https://www.tomshardware.com/tech-industry/semiconductors/custom-ai-asics-examined-from-broadcom-to-mtia | Medium | Survey of Maia 200, MTIA 300-500, Trainium4, OpenAI Jalapeño; Broadcom/Marvell roles. |
| Hyperscaler AI ASIC Market Report | Hashrate Index | https://hashrateindex.com/blog/hyperscaler-ai-asic-market-report-part-1/ | Medium | ASIC market sizing, design-partner shares, per-program analysis. |
| HBM Three-Way War (SK hynix/Samsung/Micron) | Momoview | https://momoview.com/blog/en/posts/hbm-industry-analysis...2026-ai-memory-supercycle... | Medium | HBM supplier shares, HBM3E/HBM4 pricing/capacity, hybrid-bonding constraints. |
| Samsung, SK hynix Tapped as Rubin HBM4 Suppliers | TrendForce | https://www.trendforce.com/news/2026/03/09/news-samsung-sk-hynix-reportedly-tapped-as-nvidia-rubin-hbm4-suppliers... | High | HBM4 supplier qualification for Rubin, shipment timing, pricing outlook. |
| HBM4 roadmap + TSMC CoWoS capacity/bottleneck | Tom's Hardware; Siemens; Micron; SK hynix; TSMC reporting | https://oplexa.com/ai-chip-packaging-bottleneck-2026/ | High | HBM3E→HBM4 specs, per-package memory, CoWoS wafer scaling as the true supply gate. |
| Rethinking AI TCO: Cost per Token | NVIDIA Blog | https://blogs.nvidia.com/blog/lowest-token-cost-ai-factories/ | Medium | Vendor framing of $/token, Blackwell-vs-Hopper throughput (treat as vendor-optimistic). |
| CPUs for AI infrastructure: EPYC, Xeon, Grace | Introl | https://introl.com/blog/cpus-for-ai-infrastructure-epyc-xeon-grace-server-processors-2025 | Medium | Host-CPU specs (Grace/EPYC Turin/Xeon Granite Rapids), GPU:CPU ratio considerations. |
| The Great Rebalance: Agentic AI and CPU:GPU Ratio | TrendForce Insights | https://insights.trendforce.com/p/agentic-ai-cpu-gpu | Medium | How agentic workloads shift the CPU:GPU ratio away from training-era ~1:8. |
| ROCm vs CUDA: GPU Computing Comparison | Thunder Compute | https://www.thundercompute.com/blog/rocm-vs-cuda-gpu-computing | Medium | State of the CUDA moat vs ROCm maturity, library gaps, multi-vendor engineering cost. |
| Quartet: Native FP4 Training Can Be Optimal | arXiv | https://arxiv.org/html/2505.14669v4 | High | Research preprint: FP4 training Pareto-optimal at fixed compute; low-precision tradeoffs. |

## 8. Reference Architectures & System Integration (Rack→Pod→Cluster→Facility)

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| DGX SuperPOD RA (GB200 NVL72) | NVIDIA | https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-gb200/latest/index.html | High | Canonical SU reference: 8× DGX GB200/SU → 128+ racks/9,216 GPUs; fabric, storage, software stack. |
| NVIDIA Contributes GB200 NVL72 Designs to OCP | NVIDIA | https://developer.nvidia.com/blog/nvidia-contributes-nvidia-gb200-nvl72-designs-to-open-compute-project/ | High | Rack/tray mechanical/electrical/thermal: 1,400A busbar, blind-mate manifold, 120kW DLC, NVLink bandwidth. |
| Vera Rubin DSX AI Factory + Omniverse DSX Digital Twin | NVIDIA | https://nvidianews.nvidia.com/news/nvidia-releases-vera-rubin-dsx-ai-factory-reference-design-and-omniverse-dsx-digital-twin-blueprint | High | Forward gigawatt-scale "AI factory" RA integrated with OT layers; digital-twin-before-build. |
| NVIDIA Enterprise Reference Architectures (AI Factory) | NVIDIA | https://www.nvidia.com/en-us/technologies/enterprise-reference-architecture/ | High | Validated prescriptive RAs: node patterns, cabling, power/cooling envelopes, fabric provisioning. |
| NVIDIA Vera Rubin POD: Seven Chips, Five Systems | NVIDIA Developer | https://developer.nvidia.com/blog/nvidia-vera-rubin-pod-seven-chips-five-rack-scale-systems-one-ai-supercomputer/ | High | GB200 NVL72 + Vera Rubin NVL144/NVL576 Kyber roadmap, rack-scale system architecture. |
| Meta's Open AI Hardware Vision (Catalina/Grand Teton/DSF) | Meta | https://engineering.fb.com/2024/10/15/data-infrastructure/metas-open-ai-hardware-vision/ | High | Primary hyperscaler design: Catalina NVL72 pod on ORv3 (140kW), DSF fabric, FBNIC, 24k→100k+ clusters. |
| AMD Helios: AI Rack on Meta's OCP Open Rack Wide | AMD | https://www.amd.com/en/blogs/2025/amd-helios-ai-rack-built-on-metas-2025-ocp-design.html | High | Open rack-scale alternative: 72-GPU double-wide ORW, UALink scale-up, MI355X/MI455X, 2026. |
| Google: Enabling 1 MW Racks (Deschutes/Mt Diablo) | Google Cloud | https://cloud.google.com/blog/topics/systems/enabling-1-mw-it-racks-and-liquid-cooling-at-ocp-emea-summit | High | Google's OCP contribution path: 48→±400VDC, Deschutes Gen5 CDU. |
| Google TPU Architecture & OCS Pods (7 generations) | Introl (synthesis of Google primaries) | https://introl.com/blog/google-tpu-architecture-complete-guide-7-generations | Medium | TPU pod topology evolution, OCS wiring, physical-vs-logical topology decoupling. |
| Microsoft Azure Maia 100 + Sidekick | Microsoft Azure | https://azure.microsoft.com/en-us/blog/azure-maia-for-the-era-of-ai-from-silicon-to-software-to-systems/ | High | Custom-silicon-to-system: Maia 100, wider rack, Sidekick closed-loop liquid cooling, retrofit philosophy. |
| AWS Trainium2 UltraServer/UltraCluster + Project Rainier | AWS | https://aws.amazon.com/blogs/aws/amazon-ec2-trn2-instances-and-trn2-ultraservers...is-now-available/ | High | Scale-up/out alternative: 64-chip UltraServer, NeuronLink, 10P10U fabric, ~500k-chip Rainier. |
| Vertiv 360AI Reference Designs for GB200 NVL72 (7 MW) | Vertiv (with NVIDIA) | https://www.vertiv.com/493cf5/globalassets/campaigns/ai-hub/vertiv-reference-design-020-ds-en-na-2024-gr-00109-web.pdf | High | Facility-level power+cooling RA: 7 MW, 132kW/rack, 45/65°C coolant, MegaMod prefab, retrofit+greenfield. |
| Schneider EcoStruxure Pod + NVIDIA-aligned RAs | Schneider Electric (with NVIDIA) | https://www.se.com/us/en/work/solutions/data-centers-and-networks/ai-data-centers/ | High | Prefab modular pods to 1MW+, liquid+air portfolio, jointly published NVIDIA reference designs. |
| Supermicro/Dell/HPE NVL72 productized systems | Supermicro / Dell / HPE | https://www.supermicro.com/datasheet/datasheet_SuperCluster_GB200_NVL72.pdf | High | OEM implementations of NVL72 OCP/MGX: ~120-132kW liquid-cooled rack, L2L CDU, SuperCluster networking. |
| Server Manufacturing Levels Defined (L1-L12) | AMAX | https://www.amax.com/server-manufacturing-levels-defined/ | High | Canonical vocabulary: every manufacturing level and who performs each (ODM/integrator/OEM). |
| Rack integration 101: what L11 really means | Data Center Dynamics | https://www.datacenterdynamics.com/en/marketwatch/rack-integration-101-what-l11-really-means-for-ai-data-centers/ | High | L10/L11/L12 in AI context: factory cabling/optics/coolant integration, serviceability tradeoffs. |
| GB200 NVL72 Deployment: 72-GPU Liquid-Cooled | Introl | https://introl.com/blog/gb200-nvl72-deployment-72-gpu-liquid-cooled | Medium | Best single quantitative NVL72 logistics source: rack weights, ~120kW, flow, cabling, lead times. |
| H100 vs GB200 NVL72 Training Benchmarks (bring-up/reliability) | SemiAnalysis | https://newsletter.semianalysis.com/p/h100-vs-gb200-nvl72-training-benchmarks | High | The bring-up reality check: NVLink copper backplane reliability, diagnostics immaturity, software ramp. |
| HGX, DGX, MGX: NVIDIA's Server Platforms | AMC (explainer) | https://www.amcompute.com/blog/hgx-dgx-mgx | Medium | Clear HGX-vs-MGX-vs-DGX delineation and buyer/use-case mapping. |
| Why secure server rack logistics are critical | Nefab | https://www.nefab.com/news-insights/2026/why-secure-server-rack-logistics-are-now-critical-to-ai-data-centers/ | Medium | Deployment logistics: integrated-rack shipping, shock/tilt damage, JIT congestion, project-slip risk. |
| Hyperscalers deploy servers in under 3 seconds | ServeTheHome | https://www.servethehome.com/how-fast-do-you-deploy-hyperscalers-deploy-servers-in-under-3-seconds-inspur/ | High | Why factory rack integration compresses floor-install time. |
| Nvidia Draws GPU System Roadmap Out To 2028 | The Next Platform | https://www.nextplatform.com/2025/03/19/nvidia-draws-gpu-system-roadmap-out-to-2028/ | High | Rubin/Rubin Ultra/Feynman cadence, NVL144/576, 600kW Kyber, 800VDC, FP4 inference. |
| Vera Rubin: 600kW Racks by 2027 / B200-vs-GB200 deploy guide | Introl | https://introl.com/blog/nvidia-vera-rubin-gpu-600kw-racks-2027 | Medium | Practical deployment specs, retrofit-vs-new build cost ($5-10M/MW), 5-yr TCO models. |
| Racks/physical-infra guides (cabling, floor loading, DCIM) | datacenterss; AccessFloorStore; TechTarget; Wikipedia | https://datacenterss.com/data-center-cabling-standards-guide/ | Medium | Cabling standards, raised-floor load ratings, 19" RU vs 21" OU geometry, DCIM 2026 digital-twin state. |
| Containment strategies for high-density (hybrid air+DLC) | Vertiv | https://www.vertiv.com/en-asia/about/news-and-events/articles/educational-articles/data-center-containment-strategies-for-high-density-environments/ | High | Hot/cold-aisle containment for hybrid halls, RDHx for 30-60kW rows, leak detection. |
| In-rack manifolds, UQDs, CDU sizing | JetCool / Amphenol / Boyd / QCT / ToneCooling | https://jetcool.com/post/what-are-the-design-considerations-for-liquid-manifolds-and-quick-disconnects... | Medium | Manifold design, dry-break QDs, OCP-compliant couplings, NVL72 ~130 LPM/rack CDU sizing. |
| OCP Liquid Cooling Integration & Logistics + UQD | Open Compute Project | https://www.opencompute.org/documents/ocp-liquid-cooling-integration-and-logistics-white-paper-revision-1-0-1-pdf | High | TCS/secondary-loop definitions, UQD/UQDB standardization, dripless couplings, leak-detection practice. |

## 9. Networking & Optics

### 9a. Scale-Up Fabric (intra-node / intra-rack)

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| NVLink & NVLink Switch product page | NVIDIA | https://www.nvidia.com/en-us/data-center/nvlink/ | High | Primary NVLink5/NVSwitch4 specs (1.8 TB/s/GPU, 14.4 TB/s switch, NVL72 130 TB/s, NVLink-SHARP). |
| Scaling Large MoE with Wide Expert Parallelism on NVL72 | NVIDIA Developer | https://developer.nvidia.com/blog/scaling-large-moe-models-with-wide-expert-parallelism-on-nvl72-rack-scale-systems/ | High | How scale-up domain enables wide EP; EP32-vs-EP8 throughput, all-to-all bottlenecks. |
| Scaling AI Inference with NVLink & NVLink Fusion | NVIDIA Developer | https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion/ | High | NVLink Fusion (opening IP to 3rd-party CPUs/XPUs), in-network reduction. |
| GB200 NVL72 + Dynamo Boost MoE Inference | NVIDIA Developer | https://developer.nvidia.com/blog/how-nvidia-gb200-nvl72-and-nvidia-dynamo-boost-inference-performance-for-moe-models/ | High | Prefill/decode disaggregation, KV-cache over NVLink, domain-size impact on token economics. |
| UALink 200G 1.0 Spec (1,024 accelerators) | Phoronix / UALink Consortium | https://www.phoronix.com/news/UALink-200G-1.0-Released | High | UALink 1.0 final spec, switch fabric, comparison to NVLink; primary white paper at ualinkconsortium.org. |
| UALink and CXL 4.0 Interconnect/Memory-Pooling Guide | Introl | https://introl.com/blog/ualink-cxl-4-gpu-interconnect-memory-pooling-guide-2025 | Medium | UALink protocol stack, CXL 4.0, role split UALink vs CXL, vendor timeline. |
| Nvidia's Optical Boogeyman (copper vs optical in scale-up) | SemiAnalysis | https://newsletter.semianalysis.com/p/nvidias-optical-boogeyman-nvl72-infiniband | High | Copper vs optical economics, passive DAC power savings (~20kW/rack), CPO transition rationale. |
| Nvidia embraces optical scale-up as copper hits limits | The Register | https://www.theregister.com/on-prem/2026/04/05/nvidia-embraces-optical-scale-up-as-copper-reaches-limits/ | High | Copper reach wall, CPO for NVLink, Rubin Ultra NVL576 rack-to-rack optics. |
| Nonuniform-Tensor-Parallelism (GPU-failure mitigation) | arXiv | https://arxiv.org/abs/2504.06095 | High | Quantified failure amplification vs TP size; nonuniform/elastic TP as mitigation. |
| Enabling Fast Inference & Resilient Training with NCCL 2.27 | NVIDIA Developer | https://developer.nvidia.com/blog/enabling-fast-inference-and-resilient-training-with-nccl-2-27/ | High | NCCL collective behavior over NVLink, fault tolerance, NVLink-SHARP. |
| Google TPU OCS / ICI scale-up | FiberMall / NextBigFuture / Google (TPU v4 paper) | https://www.fibermall.com/blog/unveiling-google-tpu-architecture.htm | Medium | ICI 3D-torus scale-up, OCS (13,824 ports), v5p 8,960-chip and Ironwood pods. Primary source (High): TPU v4 paper, https://arxiv.org/pdf/2304.01433 |
| Cache-Coherent Heterogeneous Systems (CXL/NVLink-C2C/IF) | arXiv | https://arxiv.org/pdf/2411.02814 | High | Coherent memory-semantic fabrics compared; load/store latency tiers and programming model. |
| Multi-Node NVLink (MNNVL) on Kubernetes | NVIDIA Developer | https://developer.nvidia.com/blog/enabling-multi-node-nvlink-on-kubernetes-for-gb200-and-beyond/ | High | Operating/partitioning the NVLink domain, domain-as-resource scheduling, health considerations. |
| NVLink Architecture Explained (gen-by-gen) | Leviathan Systems | https://www.leviathansystems.co/blog/nvlink-architecture-gpu-interconnect | Medium | NVLink bandwidth per generation, NVSwitch evolution, switch-tray SPOF. |

### 9b. Scale-Out Cluster Fabric

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| DGX SuperPOD RA — Network Fabrics | NVIDIA | https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-gb200/latest/network-fabrics.html | High | Authoritative compute/storage/mgmt fabric: switch-count tables, rail-optimized fat-tree, blocking factors. |
| RoCE Networks for Distributed AI Training at Scale | Engineering at Meta | https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/ | High | Production RoCE: two-stage Clos, PFC-only at 400G, path-pinning failure vs ECMP+QP-scaling. |
| Ultra Ethernet Consortium Spec 1.0 | UEC | https://ultraethernet.org/ultra-ethernet-consortium-uec-launches-specification-1-0.../ | High | Primary UEC 1.0: UET, packet spray + reorder, UCCM congestion control, packet trimming, native RDMA. |
| Broadcom Tomahawk 6 (102.4 Tbps, Cognitive Routing 2.0, CPO) | Broadcom | https://investors.broadcom.com/news-releases/news-release-details/broadcom-ships-tomahawk-6-worlds-first-1024-tbps-switch | High | 102.4 Tbps, 224G SerDes, global load balancing, topology support, native CPO, 100k-1M XPU. |
| NVIDIA SHARP In-Network Computing | NVIDIA | https://developer.nvidia.com/blog/advancing-performance-with-nvidia-sharp-in-network-computing/ | High | SHARPv4 (Quantum-X800), in-network reduction offload, NCCL 2.27 integration, SM-count reduction. |
| GPU Cluster Network Topology: Fat-Tree/Dragonfly/Rail | Introl | https://introl.com/blog/gpu-cluster-network-topology-fat-tree-dragonfly-rail-optimized-2025 | Medium | Topology tradeoffs with concrete numbers; Meta 10.7% job failures from net config; 40k+ miles fiber. |
| Spectrum-X Ethernet Accelerates xAI Colossus | NVIDIA | https://nvidianews.nvidia.com/news/spectrum-x-ethernet-networking-xai-colossus | High | Largest production Ethernet AI fabric: 3-tier L3 Clos, ~95% throughput, zero flow-collision loss. |
| Designing Data Centers for AI Clusters | Juniper Networks | https://www.juniper.net/documentation/us/en/software/nce/ai-clusters-data-center-design/ai-clusters-data-center-design.pdf | High | Clos/leaf-spine design, oversubscription guidance (1:1 training vs 2:1/3:1 inference), lossless underlay. |
| GPU Networking: InfiniBand vs RoCE vs Spectrum-X (2026) | Spheron | https://www.spheron.network/blog/gpu-networking-infiniband-roce-spectrum-x-guide/ | Medium | Practitioner protocol-decision framework with latency figures and when-to-choose guidance. |
| DCQCN and lossless RoCE | Juniper / Broadcom | https://www.juniper.net/documentation/us/en/software/junos/traffic-mgmt-qfx/topics/topic-map/cos-qfx-series-DCQCN.html | High | Congestion-control mechanics: PFC pathologies, ECN+CNP, DCQCN tuning, deadlock/victim-flow risk. |
| Google TPU OCS / reconfigurable topology | Google (TPU v4) / Tom's Hardware / Global Semi Research | https://arxiv.org/pdf/2304.01433 | High | Optically reconfigurable supercomputer, OCS-reconfigured 3D-torus pods, topology-on-demand. |
| ClusterMAX Rating System (fabric/tenant isolation) | SemiAnalysis | https://semianalysis.com/2025/03/26/the-gpu-cloud-clustermax-rating-system-how-to-rent-gpus/ | High | Operator-maturity benchmark incl. fabric isolation (DPU-VPC, PKeys vs VLAN/VXLAN), observability. |
| How To Test AI Data Center Networks | Keysight | https://www.keysight.com/us/en/use-cases/test-ai-data-center-networks.html | High | T&M view on validating RoCE/IB fabrics, congestion/collective behavior, pre-production acceptance. |

### 9c. Optics & Cabling

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| Scaling AI Factories with Co-Packaged Optics | NVIDIA Developer | https://developer.nvidia.com/blog/scaling-ai-factories-with-co-packaged-optics-for-better-power-efficiency/ | High | Primary CPO numbers: 9W vs 30W/interface, signal-integrity/resiliency/power gains, Quantum-X/Spectrum-X. |
| NVIDIA Announces Spectrum-X Photonics (CPO switches) | NVIDIA Newsroom | https://nvidianews.nvidia.com/news/nvidia-spectrum-x-co-packaged-optics-networking-switches-ai-factories | High | Official CPO switch specs, port counts, 2026 availability. |
| NVIDIA Silicon Photonics product page | NVIDIA | https://www.nvidia.com/en-us/networking/products/silicon-photonics/ | High | Quantum-X/Spectrum-X Photonics, 200G SerDes, External Laser Source architecture. |
| Co-Packaged Optics — Scaling with Light | SemiAnalysis | https://newsletter.semianalysis.com/p/co-packaged-optics-cpo-book-scaling | High | Deep CPO-vs-pluggable economics, LPO/CPO power, serviceability tradeoffs, adoption curve. |
| Nvidia's Optical Ascent: >$1B Revenue / Missing 800G Ramp | SemiAnalysis | https://newsletter.semianalysis.com/p/nvidias-optical-ascent-1b-revenue | High | Optics revenue/volume ramp, 800G→1.6T transition economics, transceiver pricing. |
| The Llama 3 Herd of Models (optics-reliability case) | Meta | https://ai.meta.com/research/publications/the-llama-3-herd-of-models/ | High | Canonical training-interruption data; network/cable failure share. |
| Meta Llama 3 interruption breakdown | Data Center Dynamics | https://www.datacenterdynamics.com/en/news/meta-report-details-hundreds-of-gpu-and-hbm3-related-interruptions-to-llama-3-training-run/ | High | Independent breakdown incl. ~8.4% network switch/cable share. |
| LRO, LPO, and Silicon Photonics | Credo | https://credosemi.com/blogs/lro-lpo-silicon-photonics/ | High | LPO vs LRO architectures, DSP placement, power savings, host co-design. |
| Evolving pluggable optics to reduce power | Nokia | https://www.nokia.com/blog/evolving-pluggable-optics-to-reduce-power-consumption/ | High | DSP power burden, linear-optics rationale, power-reduction roadmap. |
| 800G Optics & Cables Guide — LPO/LRO | Juniper Networks | https://www.juniper.net/documentation/us/en/hardware/800g-optics-cables-guide/optics/topics/concept/800g-optic-types-lpolro.html | High | Vendor reference for 800G optic types, reach classes, LPO/LRO definitions. |
| DACs, ACCs, AOCs & Transceiver Interconnects | NVIDIA Networking Docs | https://docs.nvidia.com/networking/display/CABLEOVpub/DACs,+ACCs,+AOCs,+and+Transceiver+Interconnects | High | Authoritative interconnect definitions, reach, selection in NVIDIA fabrics. |
| NVIDIA AI Structured Cabling Reference Architecture | Panduit / NVIDIA | https://www.panduit.com/.../nvidia-ai-web-fbag15-sa-eng.pdf | High | MPO trunking, fiber counts, polarity, loss budgets, rack-scale cabling design. |
| TDECQ Explained (PAM4 100G-800G) | Vitex | https://www.vitextech.com/blogs/blog/tdecq-explained... | Medium | TDECQ + temperature drift relevant to AI-rack link-margin reliability. |
| Ciena 448G innovations / path to 3.2T optics | Ciena | https://www.ciena.com/insights/blog/2025/ciena-update-on-448g-innovations-and-the-path-to-3.2t-data-center-optics | High | 448G/lane SerDes, PAM modulation, path to 3.2T. |
| Global AI Optical Transceiver Market to $26B (2026) | TrendForce | https://www.trendforce.com/presscenter/news/20260420-13017.html | High | Market sizing, 800G/1.6T trajectory, laser/DSP shortage as capacity bottleneck. |
| CPO Race: NVIDIA vs Broadcom | IDTechEx | https://www.idtechex.com/en/research-article/co-packaged-optics-race-strategic-approaches-from-nvidia-and-broadcom/34467 | High | Comparative CPO strategy, adoption sequencing, ecosystem outlook. |
| Co-Packaged Optics & 800G→1.6T→3.2T roadmap | IDTechEx; Siemens EDA; MapYourTech; Ayar Labs | https://mapyourtech.com/co-packaged-optics-architecture-status-and-the-path-to-1-6t-switches/ | High | CPO architecture/status, link-power savings, 400G/lane DSPs at OFC 2026, market sizing to 2036. |
| Nvidia Networking Roadmap (Ethernet/IB/CPO) | Network World | https://www.networkworld.com/article/4050881/nvidia-networking-roadmap-ethernet-infiniband-co-packaged-optics.../ | High | Forward roadmap spanning Ethernet/IB/CPO and optics implications. |
| OSFP vs QSFP-DD / flat-top vs finned-top | AscentOptics | https://ascentoptics.com/blog/osfp-flat-top-vs-finned-top/ | Medium | Practical form-factor guidance; liquid-cooled flat-top vs air-cooled finned-top cage compatibility. |
| How GB200 Utilizes 800G/1.6T DAC/ACC | FiberMall | https://www.fibermall.com/blog/how-nvidia-gb200-utilizes-800g-1600g-dac-acc.htm | Medium | Concrete GB200 NVL72 interconnect breakdown: in-rack DAC counts, 1.6T OSFP-XD. |

## 10. Storage & Data Infrastructure

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| DGX SuperPOD (B300) Storage Architecture | NVIDIA | https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300/latest/storage-architecture.html | High | Canonical per-SU read/write bandwidth tiers, per-GPU targets, write≥½ read rule, caching guidance. |
| DGX SuperPOD (H100) Storage Architecture | NVIDIA | https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-h100/latest/storage-architecture.html | High | H100-gen storage tiers; baseline storage:compute ratios. |
| Optimizing Checkpoint Bandwidth for LLM Training | VAST Data | https://www.vastdata.com/blog/optimizing-checkpoint-bandwidth-for-llm-training | High | 14 bytes/param rule, checkpoint sizing, async drain rates, <10% overlap target; 85k+ checkpoint survey. |
| BlueField-4 Context Memory Storage Platform (CMX) | NVIDIA | https://developer.nvidia.com/blog/introducing-nvidia-bluefield-4-powered-inference-context-memory-storage-platform.../ | High | Inference memory hierarchy + Ethernet-attached flash KV tier, DOCA/Dynamo/NIXL; the 2026 inference-storage tier. |
| BlueField-4 STX Storage Architecture | NVIDIA Newsroom | https://nvidianews.nvidia.com/news/nvidia-launches-bluefield-4-stx-storage-architecture-with-broad-industry-adoption | High | 800 Gb/s DPU, STX modular RA, WEKA/VAST/DDN adoption, H2 2026. |
| GPUDirect Storage Design Guide | NVIDIA | https://docs.nvidia.com/gpudirect-storage/design-guide/index.html | High | Definitive GDS architecture, DMA path, NVMe/NVMe-oF requirements, supported NICs/fabrics. |
| Enhancing Distributed Inference with NIXL | NVIDIA | https://developer.nvidia.com/blog/enhancing-distributed-inference-performance-with-the-nvidia-inference-transfer-library/ | High | NIXL unified transfer across RDMA/GDS/NVMe/object; KV-cache transport, ~10x prefill. |
| AI-Optimized Storage: NVMe-oF, GPUDirect, Parallel FS (2025) | Introl | https://introl.com/blog/ai-optimized-storage-nvme-gpudirect-parallel-file-systems-2025 | Medium | Comprehensive specs table: parallel-FS shares/architecture, GDS rates, vendor reference numbers, tiered checkpointing. |
| NVMe KV Cache Offloading for LLM Inference (2026) | Spheron | https://www.spheron.network/blog/nvme-kv-cache-offloading-llm-inference/ | Medium | Three-tier KV hierarchy, serve ~10x more users, prefix-cache economics. |
| NVIDIA pushes inference context to NVMe / KV extenders | Blocks & Files | https://www.blocksandfiles.com/ai-ml/2026/03/30/nvidia-and-its-partners-kv-cache-extenders/5209284 | Medium | Independent reporting on ICMSP/CMX, NVMe KV-cache offload standardization, partner landscape. |
| NVIDIA SCADA / Wiwynn PCIe 6.0 storage server (2.9 PB) | Tom's Hardware | https://www.tomshardware.com/pc-components/ssds/nvidias-high-speed-ai-data-center-storage-servers-break-cover... | Medium | GPU-initiated storage, 96 liquid-cooled E3.S SSDs on PCIe 6.0; dense-flash roadmap. |
| Data Loading Best Practices with Amazon S3 | AWS | https://aws.amazon.com/blogs/machine-learning/applying-data-loading-best-practices-for-ml-training-with-amazon-s3-clients/ | High | S3 data-loader parallelism, connector choices, sharding/prefetch to keep GPUs fed. |
| Architecting Scalable Checkpoint Storage on AWS | AWS | https://aws.amazon.com/blogs/storage/architecting-scalable-checkpoint-storage-for-large-scale-ml-training-on-aws/ | High | Tiered checkpoint architecture, async checkpointing, cadence/overhead tradeoffs. |
| Design Storage for AI/ML on Google Cloud | Google Cloud | https://docs.cloud.google.com/architecture/ai-ml/storage-for-ai-ml | High | Hyperscaler storage decision framework, throughput-vs-latency tier mapping, managed Lustre/GCS. |
| High Performance File Systems for AI/ML | WWT | https://www.wwt.com/article/high-performance-file-systems-for-aiml | Medium | Vendor-neutral Lustre/GPFS/WEKA/VAST/Panasas comparison, metadata/small-file tradeoffs. |
| Parallel File Systems Explained | Blocks & Files | https://blocksandfiles.com/2025/11/26/parallel-filesystem-definitions-and-powerscale/ | Medium | Primer on centralized-vs-distributed metadata, striping, small-file bottleneck. |
| The Economics of Data Gravity | Pure Storage | https://blog.purestorage.com/purely-technical/the-economics-of-data-gravity/ | Medium | Data-gravity economics, replication efficiency, move-compute-to-data argument. |
| Optimizing Storage for Petabyte-Scale AI Pipelines | Introl | https://introl.com/blog/ai-data-pipeline-architecture-petabyte-scale-training-2025 | Medium | Petabyte ingestion/preprocessing architecture, Meta RSC 46 PB cache, loader-to-storage co-design. |
| OCI + Magnum IO GDS + IBM Storage Scale | Oracle | https://blogs.oracle.com/cloud-infrastructure/accelerate-ai-ml-workloads-oci-nvidia-ibm | High | Hyperscaler-validated GDS + GPFS integration, measured throughput. |

## 11. Software Stack & Orchestration

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| Running Large-Scale GPU Workloads on K8s with Slurm (Slinky) | NVIDIA Developer | https://developer.nvidia.com/blog/running-large-scale-gpu-workloads-on-kubernetes-with-slurm/ | High | Primary source for Slurm/K8s convergence: slurm-bridge/operator, topology-aware scheduling, 8,000+ GPU scaling. |
| Peak Efficiency on GB200 NVL72 with Slurm Block Scheduling | NVIDIA Developer | https://developer.nvidia.com/blog/achieving-peak-system-and-workload-efficiency-on-nvidia-gb200-nvl72-with-slurm-block-scheduling/ | High | NVLink-domain block allocation, topology.yaml, rack-scale scheduling for coherent memory domains. |
| NVIDIA Mission Control — Autonomous Hardware Recovery | NVIDIA Docs | https://docs.nvidia.com/mission-control/docs/systems-administration-guide/2.3.0/autonomous-hardware-recovery.html | High | Fleet control-plane: break-fix workflows, health checks, BCM/observability, UFM/NetQ integration. |
| DCGM-Exporter & GPU Telemetry | NVIDIA Docs | https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html | High | Authoritative GPU observability/health telemetry pipeline (Prometheus, diagnostic levels, policy). |
| XID Errors Reference (r590) + driver/CUDA matrix | NVIDIA Docs | https://docs.nvidia.com/deploy/pdf/XID_Errors.pdf | High | XID taxonomy/triage + data-center driver/CUDA compatibility matrix; node software versioning. |
| KAI Scheduler (open-source, Apache-2.0) | NVIDIA / GitHub | https://github.com/kai-scheduler/KAI-Scheduler | High | K8s-native AI scheduler: gang scheduling, fair-share, bin-packing, DRA, topology-aware. |
| Run:ai Docs — Multi-Tenant & Advanced Cluster Config | NVIDIA | https://run-ai-docs.nvidia.com/multi-tenant/infrastructure-setup/advanced-setup/cluster-config | High | Fractional GPU, quota/policy, multi-tenancy, custom scheduler registration. |
| Multi-Instance GPU (MIG) | NVIDIA | https://www.nvidia.com/en-us/technologies/multi-instance-gpu/ | High | Hardware-enforced GPU partitioning, isolation, confidential-computing integration. |
| Silent Data Corruption in AI (OCP Whitepaper) | OCP / NVIDIA | https://www.opencompute.org/documents/sdc-in-ai-ocp-whitepaper-final-pdf | High | SDC failure mechanisms, detection guidance, training impact; + 2025-26 arXiv SDC studies. |
| Distributed Training: DeepSpeed vs Megatron vs FSDP (2026) | Independent tech blog | https://pdpspectra.com/blog/distributed-training-deepspeed-megatron-fsdp/ | Medium | FSDP2/ZeRO/Megatron-Core comparison, 3D parallelism, framework-selection guidance. |
| Why vLLM is the best choice for AI inference today | Red Hat Developer | https://developers.redhat.com/articles/2025/10/30/why-vllm-best-choice-ai-inference-today | High | vLLM internals (PagedAttention, continuous batching), KServe CRD, llm-d disaggregated prefill/decode. |
| The bare metal problem in AI Factories (MAAS) | Canonical | https://maas.io/blog/the-bare-metal-problem-in-ai-factories | Medium | Bare-metal provisioning: Redfish/IPMI, PXE, Terraform; bring-up automation as economic lever. |
| Slurm vs Kubernetes in the Age of AI | HPCwire | https://www.hpcwire.com/2026/05/15/slurm-vs-kubernetes-in-the-age-of-ai/ | High | Neutral scheduler-landscape analysis, gang-scheduling semantics, ~70% Slurm/~20% K8s, convergence. |
| ROCm Compatibility Matrix | AMD | https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html | High | ROCm 7.x supported GPUs, RCCL, framework/library versions; AMD node-software path. |
| Slurm Multifactor Priority / Fair Tree | SchedMD | https://slurm.schedmd.com/priority_multifactor.html | High | Authoritative fair-share, QoS, preemption, priority math for multi-tenant scheduling. |
| Disaggregated Inference / KV-cache research | Red Hat / arXiv / GenAI System Design | https://developers.redhat.com/articles/2026/06/24/optimizing-distributed-ai-inference-advanced-deployment-patterns | High | Prefill/decode disaggregation, KV transfer over PCIe/NVLink/RDMA/CXL, KV-aware scheduling. |
| RL/RLHF Infrastructure | Introl / Nathan Lambert (RLHF Book) / arXiv | https://introl.com/blog/reinforcement-learning-infrastructure-rlhf-robotics-gpu-clusters-2025 | Medium | RL as inference-heavy training, rollout bottleneck, sync-vs-async, collocated-vs-disaggregated. |

## 12. Commissioning, Testing & Go-Live

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| ClusterMAX 2.0 (burn-in, acceptance, health-check) | SemiAnalysis | https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard | High | De facto standard for what a commissioned GPU cloud must demonstrate; anchors commissioning + SLA chapters. |
| ClusterMAX Criteria & Health-Checks pages | SemiAnalysis | https://www.clustermax.ai/criteria | High | 10 evaluation dimensions + health-check cadence across H100/H200/B200/GB200/MI300X. |
| A Practitioner's Guide to Testing & Running Large GPU Clusters | Together AI | https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models | High | Seven-phase validation with concrete tools/numbers (DCGM, GPU-Burn, nvbandwidth, NCCL, FIO). Backbone of burn-in chapter. |
| DGX BasePOD Deployment Guide — NCCL Validation | NVIDIA | https://docs.nvidia.com/dgx-basepod/deployment-guide-dgx-basepod/latest/nccl.html | High | Canonical NCCL acceptance procedure: exact commands, busbw tables, correctness checks. |
| ibdiagnet InfiniBand Fabric Diagnostic Tool Manual + BER | NVIDIA (Mellanox) | https://docs.nvidia.com/ibdiagnet-infiniband-fabric-diagnostic-tool-user-manual-v2-13-0.pdf | High | Authoritative fabric-commissioning tooling: BER test, default 1e-12 threshold, per-port checks. |
| Guard: Scalable Straggler Detection & Node Health | arXiv | https://arxiv.org/abs/2605.17879 | High | Offline node-sweep qualification (commissioning) + online straggler detection (day-2). |
| Goodput Metric as a Measure of ML Productivity | Google Cloud | https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity | High | Formal goodput definition, badput accounting; the right acceptance/SLA target. |
| Zettascale in Practice: OSU & NCCL Benchmark on H100 | Oracle Cloud (OCI) | https://blogs.oracle.com/cloud-infrastructure/zettascale-osu-nccl-benchmark-h100-ai-workloads | High | Hyperscaler-published NCCL/OSU benchmarking methodology + results; cluster-acceptance reference. |
| Level 5 Data Center Cx — Integrated Testing (IST) Guide | Construct & Commission | https://constructandcommission.com/level-5-data-center-commissioning-step-by-step-guide/ | Medium | IST scope, OPR/BOD traceability, L4→L5→handover sequencing; Cx-fundamentals backbone. |
| Understanding L1-L5 Commissioning | BMP MEP Contractor | https://bmp-mepcontractor.com/understanding-l1-l5-commissioning-in-data-centre-projects.../ | Medium | Clear L1-L5 definitions + tag/color taxonomy (corroborate vs ASHRAE Guideline 0/Uptime). |
| DC Testing: UPS and Generator Testing | CxPlanner | https://cxplanner.com/data-centers/resources/data-centers-test-ups-generator | Medium | UPS discharge/transfer tests, generator start/load-bank/black-start, failover testing. |
| Data Center Commissioning & Load Bank Testing | Aggreko | https://www.aggreko.com/en-us/sectors/data-centres/data-centre-commissioning | Medium | Load-bank commissioning (resistive vs reactive), temporary-power logistics, IST load realism. |
| Testing AI Infrastructure: Validation Frameworks | Introl | https://introl.com/blog/testing-ai-infrastructure-validation-frameworks-gpu-clusters-production | Medium | Consolidated burn-in (72-168 hr), thermal thresholds, DCGM levels (verify standout numbers vs primaries). |
| ASHRAE 2024 Liquid Cooling Commissioning (secondary) | Secondary summary of ASHRAE | https://electronics.alibaba.com/buyingguides/ashrae-2024-liquid-cooling-guide-for-data-centers | Medium | Flow-rate floors, CDU loop separation, flushing, fluid quality (cross-check vs primary ASHRAE Datacom). |

## 13. Operations, Reliability, Redundancy & Standards

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| Revisiting Reliability in Large-Scale ML Clusters | Meta AI (arXiv 2410.21680) | https://arxiv.org/html/2410.21680v1 | High | The single best primary source on AI cluster reliability: failure rates per node-day, ETTR, optimal checkpoint intervals, lemon-node detection. |
| The Llama 3 Herd of Models (reliability section) | Meta AI (arXiv 2407.21783) | https://arxiv.org/pdf/2407.21783 | High | Primary at-scale failure data: 466 interruptions in 54 days (419 unexpected, 47 planned), full root-cause table, >90% effective training time. |
| Meta Llama 3 interruptions report | DataCenterDynamics / Tom's Hardware / Meta | https://www.datacenterdynamics.com/en/news/meta-report-details-hundreds-of-gpu-and-hbm3-related-interruptions-to-llama-3-training-run/ | High | The canonical dataset of 419 unexpected interruptions in 54 days (466 total including planned), with category-by-percentage breakdown. |
| How Meta keeps its AI hardware reliable | Meta | https://engineering.fb.com/2025/07/22/data-infrastructure/how-meta-keeps-its-ai-hardware-reliable/ | High | Authoritative SDC detection (Fleetscanner, Ripple, Hardware Sentinel), fault taxonomy, hyper-checkpointing, PVF. |
| Uptime Institute Global Data Center Survey 2025 / Outage Analysis | Uptime Institute | https://intelligence.uptimeinstitute.com/resource/uptime-institute-global-data-center-survey-2025 | High | Authoritative outage causes (power 45%, human error 70-80%), staffing crisis, cost of downtime. |
| Uptime Institute AI Training & Infrastructure Survey 2025 | Uptime Institute | https://datacenter.uptimeinstitute.com/rs/711-RIA-145/images/2025.AITraining.3Pager.pdf | High | Best data on how operators actually provision resilience for AI training (n=519). |
| Uptime Institute Tier Classification System | Uptime Institute | https://uptimeinstitute.com/tiers | High | Canonical Tier I-IV definitions, concurrent maintainability vs fault tolerance, certification program. |
| OCP GPU Firmware Update Spec v1.0 + SDC initiative | Open Compute Project | https://www.opencompute.org/documents/external-ocp-gpu-fw-update-specification-v1-0-1-pdf | High | Primary fleet firmware management (Redfish, PLDM-over-MCTP, secure OOB) + cross-vendor SDC standardization. |
| Multi-tier checkpointing / Checkpointless training | Google Cloud / AWS | https://cloud.google.com/blog/products/ai-machine-learning/using-multi-tier-checkpointing-for-large-ai-training-jobs | High | Checkpoint architecture, MTTR reduction (15-30min→<2min), goodput gains, fast-recovery mechanics. |
| Elastic training & optimized/multi-tier checkpointing | Google Cloud | https://cloud.google.com/blog/products/ai-machine-learning/elastic-training-and-optimized-checkpointing-improve-ml-goodput | High | Goodput definition, Young/Daly optimal interval, async/multi-tier, elastic training. |
| AI power fluctuations strain budgets & hardware | Uptime Institute Journal / Spheron / arXiv | https://journal.uptimeinstitute.com/ai-power-fluctuations-strain-both-budgets-and-hardware/ | High | Power oversubscription headroom (3% training vs 21% inference), transients, capping, power-as-binding-constraint. |
| AI Training Load Fluctuations at GW-scale — Blackout Risk? | SemiAnalysis | https://newsletter.semianalysis.com/p/ai-training-load-fluctuations-at-gigawatt-scale-risk-of-power-grid-blackout | High | Synchronized GPU load swings, grid-coupling risk, BBU/supercap/software mitigation. |
| Power Stabilization / Wide-Area Oscillations from AI | arXiv (2508.14318, 2508.16457) | https://arxiv.org/html/2508.14318v1 | High | Load-smoothing, layered storage buffering, grid-forming control, multi-site oscillation risk. |
| OPT-175B logbook / Unicron resilient training | Meta AI & Alibaba (arXiv) | https://arxiv.org/pdf/2205.01068 | High | Earlier-scale failure data (OPT-175B 105+ restarts) + Unicron job-level failure statistics; reliability scaling. |
| Block vs Distributed Redundancy | STACK Infrastructure | https://www.stackinfra.com/resources/thought-leadership/weighing-in-on-block-vs-distributed-redundancy/ | Medium | 2N vs distributed-redundant (3N/2, 4N/3), catcher topologies, utilization/complexity tradeoffs. |
| TIA-942 vs Uptime Tier — scope & certification | EPI / score-grp / datacenterss | https://www.epi-ap.com/content/32/900/TIA-942_vs_Uptime... | Medium | Best concise standards-comparison; scope differences, Rated vs Tier, EN 50600 context. |
| Data Center Redundancy: N, N+1, 2N, 2N+1 Explained | dgtl Infra | https://dgtlinfra.com/data-center-redundancy/ | Medium | Practitioner reference for the redundancy ladder, component vs path, cost framing. |
| CDU redundancy & DTC liquid-cooling reliability | Vertiv, Equinix, LiquidStack, Chilldyne | https://blog.equinix.com/blog/2026/05/07/the-anatomy-of-a-direct-to-chip-liquid-cooling-system/ | Medium | CDU internal redundancy, DTC loop reliability, leak/negative-pressure, concurrent maintainability. |
| GPU observability & health monitoring | Chronosphere; Last9; NVIDIA DCGM/NVSentinel; Rafay; Introl | https://docs.nvidia.com/dsx/ncp/inference-ra/home | Medium-High | Production GPU observability stack (DCGM/NVML, XID/SXID, cordon/drain, SLO-burn alerting). |
| DCIM for AI & predictive maintenance | Modius; Compass; Maintech; NVIDIA Omniverse; Schneider; Switch | https://modius.com/blog/dcim-for-ai-designing-power-cooling-and-observability-for-gpu-heavy-data-centers/ | Medium | DCIM evolution for GPU/liquid-cooled, predictive-maintenance lead times, digital-twin/agentic-ops direction. |
| GPU depreciation / useful-life debate | CNBC; theCUBE; DeepQuarry; Aravolta | https://www.cnbc.com/2025/11/14/ai-gpu-depreciation-coreweave-nvidia-michael-burry.html | Medium | Depreciation-policy debate (5-6 yr vs 2-3 yr), AWS/Meta divergence, understated-depreciation thesis. |
| Data center decommissioning / ITAD | Securis; STS; Human-I-T; SimsLifecycle; ERI | https://securis.com/blog/top-data-center-decommissioning-companies/ | Medium | Decommissioning workflow, NIST 800-88 sanitization, R2v3 certs, resale recovery (cross-check market figures). |

## 14. Site Selection, Land & Permitting

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| Data Center Energy Infrastructure: Federal Permit Requirements (CRS R48762) | Congressional Research Service | https://www.everycrsreport.com/reports/R48762.html | High | The single best primary federal-permit map: CAA/CWA/SDWA/FERC/NRC/CZMA thresholds, agency roles, federal-nexus. |
| Accelerating Federal Permitting of DC Infrastructure (EO 14318) | The White House | https://www.whitehouse.gov/presidential-actions/2025/07/accelerating-federal-permitting-of-data-center-infrastructure/ | High | Primary text: qualifying thresholds ($500M/100MW), categorical exclusions, federal-land siting. |
| Water, air, and backup power: Permitting pinch points | Nixon Peabody | https://www.nixonpeabody.com/insights/alerts/2026/02/18/water-air-and-backup-power-permitting-pinch-points-for-ai-facilities | High | Concise current synthesis of the three permitting pinch points with concrete conditions. |
| Permitting the AI Boom: A New NEPA Landscape | POWER Magazine; Squire Patton Boggs; K&L Gates; Williams Mullen | https://www.powermag.com/permitting-the-ai-boom-a-new-nepa-landscape.../ | High | NEPA reform mechanics (FRA/OBBBA, Seven County, CEQ guidance), federal-land siting, litigation exposure. |
| Senate Bill 6 Implementation (Texas) | Perkins Coie; Baker Botts; Mayer Brown | https://perkinscoie.com/insights/update/sb-6-implementation-shaping-data-center-future-texas | High | TX SB6: 75 MW threshold, mandatory curtailment/kill-switch, LLIS, ERCOT 765kV backbone. |
| State Data Center Legislation 2026 | MultiState; ArentFox Schiff | https://www.multistate.us/insider/2026/2/20/state-data-center-legislation-in-2026-tackles-energy-and-tax-issues | High | State-by-state regulatory/tax trends; best for state-divergence and regional-playbook chapters. |
| Water use in US data centers: Legal & regulatory risks | Nixon Peabody; EESI; Control Associates | https://www.nixonpeabody.com/insights/articles/2025/09/05/water-use-in-us-data-centers-legal-and-regulatory-risks | High | Withdrawal/discharge regimes, WUE benchmarks, Loudoun/Arizona cases, reclaimed-water mandates. |
| $64B of DC projects blocked or delayed | Data Center Watch | https://www.datacenterwatch.org/report | Medium | Best single source for community-opposition metrics (advocacy-adjacent; cross-check). |
| Noise pollution concerns / revising ordinances | EESI; Ramboll; Larson Davis | https://www.eesi.org/articles/view/communities-are-raising-noise-pollution-concernsabout-data-centers | Medium | Acoustics: low-freq hum vs dB(A), ordinance bands, mitigation effectiveness, 1/3-octave assessment. |
| Power Availability: The New #1 in Site Selection | Hanwha Data Centers | https://www.hanwhadatacenters.com/blog/power-availability-the-new-1-in-data-center-site-selection/ | Medium | Developer-side view of the 2024-26 reordering of siting criteria. |
| Data Center & Large Load Siting Guide | Enverus | https://www.enverus.com/data-center-site-selection-criteria/ | High | Three-pillar (power/price/land) framework, nodal-pricing analytics, hyperscale thresholds. |
| Texas & ERCOT: Structural Advantage | Davis Graham | https://davisgraham.com/news-events/texas-and-ercot-the-structural-advantage-for-data-center-power/ | High | ERCOT advantages, SB6 large-load framework, Texas regulatory regime. |
| Loudoun County DC Standards & Locations (Phase 2) | Loudoun County, VA | https://www.loudoun.gov/6222/Phase-2-Data-Center-Standards-Locations | High | Primary source on NoVA zoning reform: end of by-right, SPEX/conditional use, substation policy. |
| Virginia Faces New Headwinds in DC Growth | Data Center Knowledge | https://www.datacenterknowledge.com/data-center-site-selection/virginia-faces-new-headwinds-in-data-center-growth | High | NoVA power/zoning constraints, Dominion/PJM shortfall, reliability outlook. |
| 2026 Data Center Power Report — When Power Defines Growth | Bloom Energy | https://www.bloomenergy.com/wp-content/uploads/2026-power-report.pdf | Medium | BTM/on-site power adoption, fuel-cell timelines (vendor report, useful quantitative data). |
| Navigating the US DC Power Crunch | S&P Global | https://www.spglobal.com/en/research-insights/special-reports/look-forward/data-center-frontiers/navigating-us-data-center-energy-demand | High | On-site/BTM as faster path to power, transmission constraints, demand modeling. |
| 1Q 2026 Data Center Market Recap | datacenterHawk | https://datacenterhawk.com/resources/market-insights/1q-2026-data-center-market-recap | High | Secondary-market capital flows, pipeline-vs-deliverable gap. |
| 2025 AI Diffusion Export Controls — Impacts Quantified | SemiAnalysis | https://newsletter.semianalysis.com/p/2025-ai-diffusion-export-controls... | High | Country tiering, chip/weight export controls, sovereign-AI siting implications. |
| Framework for AI Diffusion (Federal Register) | US BIS | https://www.federalregister.gov/documents/2025/01/15/2025-00636/framework-for-artificial-intelligence-diffusion | High | Primary regulatory text of the (later rescinded) tiered export-control framework. |
| How Sovereign Is Sovereign Compute? (775 non-US DCs) | arXiv | https://arxiv.org/pdf/2508.00932 | High | Empirical study of sovereignty, control-of-stack vs residency, geopolitical dependencies. |
| The Middle East's Trillion-Dollar Bet on AI | Introl | https://introl.com/blog/middle-east-uae-saudi-arabia-ai-data-center-boom-2025 | Medium | Gulf buildout: G42/Stargate, HUMAIN, regional GW trajectory, US-partnership structure. |
| Crude to Compute: Building the GCC AI Stack | Middle East Institute | https://www.mei.edu/publications/crude-compute-building-gcc-ai-stack | High | Geopolitics of Gulf AI infra, sovereign strategies, export-control context. |
| Concerned with Sustainability and Power? Look North | Data Center Knowledge | https://www.datacenterknowledge.com/data-center-site-selection/concerned-with-data-center-sustainability-and-power-look-north | High | Nordic siting: free cooling, firm renewables, heat reuse, subsea cable landings. |
| Water Usage Efficiency (WUE) Cooling Guide 2025 | Introl | https://introl.com/blog/water-usage-efficiency-wue-ai-data-center-cooling-guide-2025 | Medium | WUE benchmarks, evaporative vs closed-loop vs dry, design-out-water strategy. |
| When AI Meets Water Scarcity | MSCI | https://www.msci.com/research-and-insights/blog-post/when-ai-meets-water-scarcity-data-centers-in-a-thirsty-world | High | Water-stress mapping, ESG/social-license risk, consumption scale and reporting. |
| Data Centers & Mission Critical (Geotech/Seismic/Flood) | Langan Engineering | https://www.langan.com/data-centers | High | Geotechnical investigation, seismic/base-isolation, flood-proofing, environmental due diligence. |
| FEMA National Flood Hazard Layer | FEMA | https://fpm-fema.hub.arcgis.com/ | High | Authoritative flood-hazard mapping (NFHL) for site flood-risk diligence. |
| Texas Sales Tax Exemption for Qualified DCs | Texas Comptroller | https://comptroller.texas.gov/taxes/data-centers/ | High | Primary source on TX sales-tax exemption thresholds. |
| Texas Losing a Billion a Year on DC Tax Break | The Texas Tribune | https://www.texastribune.org/2026/04/08/texas-data-centers-sales-tax-break-billion-dollars/ | High | Incentive-durability and backlash risk; fiscal cost and rollback dynamics. |
| Data Centers: Site Selection 101 | Site Selection Magazine | https://siteselection.com/data-centers-site-selection-101/ | High | Weighted evaluation-matrix methodology, phased diligence process. |
| AI Deployments Reshaping Intra-DC Fiber | Data Center Frontier | https://www.datacenterfrontier.com/machine-learning/article/55300534/ai-deployments-are-reshaping-intra-data-center-fiber-and-communications | High | Fiber-strand counts, 400G/800G/1.6T optics, latency-budget physics. |
| Why AI DC Projects Face Years of Delays After Approval | Data Center Knowledge | https://www.datacenterknowledge.com/energy-power-supply/why-ai-data-center-projects-face-years-of-delays-after-approval | High | Construction-vs-interconnection timeline gap, bridge power, delay drivers. |
| Clean Air Act Resources for Data Centers | US EPA | https://www.epa.gov/stationary-sources-air-pollution/clean-air-act-resources-data-centers | High | EPA's Dec 2025 air-permit resource; NSR/PSD, engine tiering, BACT/LAER. |

## 15. Sustainability & Efficiency

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| Data centre electricity use surged in 2025 / Energy and AI | IEA | https://www.iea.org/news/data-centre-electricity-use-surged-in-2025... | High | Authoritative macro figures (+17% 2025, 2030 doubling/AI-tripling); best primary sector-energy source. |
| EU Energy Efficiency Directive — DC energy performance | European Commission (DG ENER) | https://energy.ec.europa.eu/topics/energy-efficiency/.../energy-performance-data-centres_en | High | Primary: EED Art. 12 reporting, 500 kW threshold, KPIs, rating scheme (Delegated Reg 2024/1364). |
| EU-wide DC sustainability rating scheme adopted | European Commission | https://energy.ec.europa.eu/news/commission-adopts-eu-wide-scheme-rating-sustainability-data-centres-2024-03-15_en | High | Official detail on Delegated Reg 2024/1364, phased path to minimum performance standards. |
| EUDCA — EED knowledge resource | European Data Centre Association | https://www.eudca.org/energy-efficiency-directive | High | Industry interpretation: 31 data points, deadlines, ICT phase-in, waste-heat/EnMS provisions. |
| The EU's EED and Its Impact on Data Centers (legal) | Covington & Burling | https://www.cov.com/-/media/files/corporate/publications/2025/04/the-eus-energy-efficiency-directive-and-its-impact-on-datacenters.pdf | High | Legal analysis of EED obligations, national transposition (German EnEfG), enforcement. |
| GHG Protocol Scope 2 Standard Advances | GHG Protocol (WRI/WBCSD) | https://ghgprotocol.org/blog/scope-2-standard-advances-isb-approves-consultation-market-and-location-based-revisions | High | Primary on Scope 2 revision toward hourly + geographic matching, ~2027 final standard. |
| Moving toward 24x7 Carbon-Free Energy | Google | https://sustainability.google/reports/24x7-carbon-free-energy-data-centers/ | High | CFE Score methodology, 100%-by-2030 goal, 24/7 hourly vs annual REC matching. |
| Operating sustainably / Responsible water use | Google Data Centers | https://datacenters.google/water/ | High | Primary water stewardship data, reclaimed sourcing, 2030 water-positive commitment. |
| Advancing Water Stewardship in Meta's Communities | Meta | https://about.fb.com/news/2025/12/advancing-water-stewardship-in-metas-data-center-communities/ | High | Current hyperscaler water strategy, watershed replenishment, water-positive accounting. |
| Uptime Institute Global DC Survey 2025 (efficiency) | Uptime Institute | https://uptimeinstitute.com/resources/research-and-reports/uptime-institute-global-data-center-survey-results-2025 | High | Industry-weighted PUE (~1.54-1.56, flat), sustainability-metric reporting rates, operator sentiment. |
| Is PUE Dead? Better Ways to Measure Efficiency | Equinix | https://blog.equinix.com/blog/2025/11/12/is-pue-dead.../ | Medium | Critique of PUE's limits for liquid-cooled AI; case for TUE/CUE/water/work-based metrics. |
| Embodied Carbon in Data Centres | Opna / Vertiv | https://opna.earth/blog_embodied-carbon-data-centres-the-hidden-emissions | Medium | Embodied-carbon breakdown, low-carbon material levers, modular-construction reduction (pair with LCA). |
| SMR / BTM & gas-to-power (2026) | Enki AI (corroborated IEA & WWT) | https://enkiai.com/data-center/gas-to-power-boom-ai-drives-2026-on-site-energy-shift/ | Medium | On-site generation landscape, gas speed-to-power, SMR offtake, carbon-vs-speed tradeoff (cross-check). |

## 16. Security — Physical & Cyber

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| Securing AI Model Weights (RAND RRA2849-1) | RAND Corporation | https://www.rand.org/pubs/research_reports/RRA2849-1.html | High | The definitive framework: 5 Weights Security Levels, 5 attacker tiers, 38 attack vectors. Foundational. |
| NVIDIA Secure AI with Blackwell and Hopper GPUs (WP-12554) | NVIDIA | https://docs.nvidia.com/nvidia-secure-ai-with-blackwell-and-hopper-gpus-whitepaper.pdf | High | Primary GPU confidential computing: TEE architecture, encrypted HBM, attestation, TEE-I/O over NVLink. |
| NVIDIA GPU Confidential Computing Demystified | arXiv 2507.02770 | https://arxiv.org/html/2507.02770v1 | High | Best independent CC teardown: CPR, BAR0 decoupler, key derivation, attestation chain, residual side-channels. |
| CC on Hopper GPUs: Performance Benchmark Study | arXiv 2409.03992 | https://arxiv.org/pdf/2409.03992 | High | Quantitative CC overhead across workloads; the "what does CC cost" decision input. |
| Caliptra: Datacenter SoC Root of Trust | OCP / Microsoft / CHIPS Alliance | https://www.opencompute.org/documents/caliptra-silicon-rot-services-09012022-pdf | High | Open silicon RoT spec: DICE identity, measured boot, BMC separation; hardware-RoT/firmware chapter. |
| OCP S.A.F.E. Program | Open Compute Project | https://www.opencompute.org/projects/ocp-safe-program | High | Independent firmware security audit framework; SRPs, CVSS findings, RIM/SBOM/Caliptra integration. |
| NIST SP 1800-34 / IR 8320 / SP 800-193 | NIST / NCCoE | https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1800-34.pdf | High | Standards backbone: supply-chain provenance, platform/firmware integrity, secure/measured boot, C-SCRM. |
| NVIDIA Security Bulletins (vGPU CVE-2025-23290/23285) | NVIDIA | https://www.nvidia.com/en-us/security/ | High | Primary disclosures of real multi-tenant isolation failures; the "is MIG/vGPU a security boundary" input. |
| Veiled Pathways / "Spy in the GPU-box" + uncore side-channels | arXiv 2203.15981 | https://arxiv.org/pdf/2203.15981 | High | Demonstrates covert/side channels bypassing MPS/MIG; evidence partitioning isn't a strong confidentiality boundary. |
| SoK: A cloudy view on trust relationships of CVMs | arXiv 2503.08256 | https://arxiv.org/pdf/2503.08256 | High | Where confidential VMs/attestation fall short; needed for honest "CC is not a panacea" treatment. |
| FedRAMP 20x + RFC-0024 + OSCAL | GSA / FedRAMP PMO | https://www.fedramp.gov/ | High | Modernized federal authorization: KSIs, OSCAL machine-readable, continuous monitoring/compliance-as-code. |
| Check Point AI Factory Security Blueprint + BlueField zero-trust | Check Point / NVIDIA | https://blog.checkpoint.com/security/you-built-the-brain-now-protect-it/ | Medium | DPU-enforced microsegmentation, inline L4 firewalling, east-west containment (vendor-biased; corroborate). |
| 2026 drone strikes on AWS & data-center warfare | DefenseScoop / DCK / MWI (West Point) | https://defensescoop.com/2026/03/03/commercial-data-centers-drone-warfare-amazon-aws/ | Medium | March 2026 kinetic attacks; reframing DCs as war infrastructure; counter-UAS/dispersal basis. |
| Physical security controls & convergence (2026 guides) | Uptime Institute / Data Center Knowledge | https://www.datacenterknowledge.com/security-and-risk-management/data-centers-integrate-cyber-and-physical-security-in-2025 | Medium | Concentric zones, layered access, biometrics+AI analytics, physical-cyber convergence, spend trends. |

## 17. Courses, Certifications, Conferences & Community

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| NVIDIA GTC 2026 (+ DLI training) | NVIDIA | https://www.nvidia.com/gtc/ | High | Flagship AI-compute conference; annual hardware/roadmap agenda; DLI labs; free on-demand session library. |
| OCP Global Summit 2026 + Open Data Center for AI | Open Compute Project | https://www.opencompute.org/summit/global-summit | High | Open-hardware standards conference; launched the "Open Data Center for AI" initiative and rack/power/telemetry specs. |
| DCD>Connect / DCD Events + Zero Downtime podcast | DatacenterDynamics | https://www.datacenterdynamics.com/en/dcdconnect-live/new-york/2026/ | High | Operator/colo community calendar; DCD Academy training; strong daily news coverage that digests primary reports. |
| Uptime Institute (Symposium, Tier & ATD/ATS, Global Survey) | Uptime Institute | https://uptimeinstitute.com/events | High | Resiliency authority; Tier certification + ATD/ATS credentials; annual Global Survey primary data. |
| CNet Training (CDCDP, CDCTP, mission-critical degrees) | CNet Training | https://cnet-training.com/programs/certified-data-centre-design-professional-cdcdp/ | High | Most established vendor-neutral DC education; BTEC-accredited credential ladder. |
| EPI — Certified Data Centre Professional (CDCP) ladder | EPI | https://www.epi-ap.com/services/1/3/4/Certified_Data_Centre_Professional_(CDCP) | Medium | Widely-taken EXIN-accredited cert family (CDCP/CDCS/CDCE), on-demand + global partners. |
| ASHRAE TC 9.9 Datacom Series | ASHRAE TC 9.9 | https://tpc.ashrae.org/?cmtKey=fd4a4ee6-96a3-4f61-8b85-43418dfa988d | High | 8-volume Datacom Series (thermal, liquid cooling, energy efficiency, UPS, fire); recently extended for AI density. |

## 18. Customer Onboarding, Delivery & Productization (GPU Cloud)

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| ClusterMAX 2.0 / Rating System | SemiAnalysis | https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard | High | Definitive GPU-cloud evaluation framework across 10 dimensions with concrete thresholds. |
| Confidential Computing on H100 GPUs (+ HCC WP-11459) | NVIDIA | https://developer.nvidia.com/blog/confidential-computing-on-h100-gpus-for-secure-and-trustworthy-ai/ | High | H100 CC internals, attestation, CPU TEE integration, tenant-isolation modes. |
| Slurm on Crusoe Managed Kubernetes | Crusoe | https://www.crusoe.ai/resources/blog/slurm-on-crusoe-managed-kubernetes-how-we-built-managed-gpu-training-infrastructure | High | Real managed-orchestration product architecture; API-driven lifecycle, single-tenant clusters. |
| Fault-tolerant training: building reliable clusters | Nebius | https://nebius.com/blog/posts/how-we-build-reliable-clusters | High | Neocloud-operator view of validation, burn-in, failure modes, reliability engineering at scale. |
| Self-Service Slurm Clusters / GPU PaaS | Rafay | https://rafay.co/ai-and-cloud-native-blog/self-service-slurm-clusters-on-kubernetes-with-rafay-gpu-paas | Medium | Buy-side control-plane: per-tenant namespace isolation, self-service provisioning, quota/RBAC. |
| Multi-tenant GPU Security: Isolation Strategies (2025) | Introl | https://introl.com/blog/multi-tenant-gpu-security-isolation-strategies-shared-infrastructure-2025 | Medium | Hard/soft/hybrid isolation taxonomy, MIG vs time-slicing, container-escape CVEs, DPU isolation. |
| What Multi-Tenancy Means in a GPU Neocloud | Aarna.ml | https://www.aarna.ml/post/what-does-multi-tenancy-mean-in-a-gpu-neocloud-context | Medium | Neocloud multi-tenancy: bare-metal vs VM, north-south/east-west isolation, per-tenant quota/RBAC. |
| GPU Cloud Pricing Comparison 2026 | Spheron / CloudZero / GMI Cloud | https://www.spheron.network/blog/gpu-cloud-pricing-comparison-2026/ | Medium | Per-GPU-hour pricing across 15+ providers/tiers, metering granularity, hidden-fee/egress analysis. |
| Reserved vs On-Demand GPU in 2026 | Compute Exchange | https://compute.exchange/blogs/reserved-vs.-on-demand-gpu-in-2026 | Medium | Capacity-reservation contract structures, take-or-pay, secondary markets for reserved blocks. |
| Serverless GPU Platforms Compared | Introl / Blaxel / RunPod | https://introl.com/blog/serverless-gpu-platforms-runpod-modal-beam-comparison-guide-2025 | Medium | Productized inference layer: scale-to-zero, cold-start mitigation, per-token billing, packaging. |
| VM/GPU Pricing & Reservations vs Commitments | Google Cloud | https://cloud.google.com/compute/gpus-pricing | High | Authoritative reference distinguishing capacity reservations from committed-use discounts, per-second metering. |

## 19. Archetypes, Strategic Scoping & Future (2026→2030)

| Source | Org/Author | URL | Cred. | Good for |
|---|---|---|---|---|
| AI Datacenter Energy Dilemma & power/networking deep dives | SemiAnalysis | https://newsletter.semianalysis.com/p/ai-datacenter-energy-dilemma-race | High | Power-as-bottleneck, GW-campus economics, transformer/turbine lead times, scale-up/out networking. |
| AI Cloud TCO Model and Datacenter Industry Model | SemiAnalysis | https://semianalysis.com/ai-cloud-tco-model/ | High | Quantitative TCO + capacity forecasting, $/GPU-hr economics, depreciation/utilization sensitivity. |
| AI Data Centers: Inference vs Training Design Guide | ArchiLabs | https://archilabs.ai/posts/ai-data-centers-inference-vs-training-design-guide | Medium | Side-by-side training-vs-inference facility design: density, redundancy philosophy, transient behavior. |
| Retrofitting Legacy DCs for AI / Liquid vs Air 2025 | Introl / DCD / Schneider / Tom's Hardware | https://introl.com/blog/retrofitting-legacy-data-centers-ai-liquid-cooling-integration | Medium | Retrofit feasibility/limits: air-cooling cliff (~41kW), 68% of pre-2015 DCs unsuitable, hybrid strategies. |
| AI DC Grid Strain / Power Bottlenecks | Belfer Center (Harvard); Ropes & Gray; Tech Fund | https://www.belfercenter.org/research-analysis/ai-data-centers-us-electric-grid | High | Power-first siting: 1,500+ GW queue, BTM gas, energy-first vs latency-first selection, financing trends. |
| Data Center Tiers / N+1/2N redundancy references | Ingenious.build; CoreSite; Socomec | https://www.ingenious.build/blog-posts/data-center-tiers-explained | Medium | Reliability tiers and redundancy topologies mapped to training (checkpointable) vs inference (always-on). |
| NVIDIA Vera Rubin / Rubin Ultra (Kyber) roadmap | NVIDIA / Tom's Hardware / The Register (CES 2026, GTC) | https://www.tomshardware.com/pc-components/gpus/nvidia-shows-off-rubin-ultra-with-600-000-watt-kyber-racks-and-infrastructure-coming-in-2027 | High | Primary accelerator/rack roadmap: NVL144 (2H2026), NVL576 Kyber (600kW, 800VDC, 2H2027), NVLink 6, HBM4. |
| Hyperscaler capex, GPU depreciation, ROI/bubble debate | Goldman Sachs / Morgan Stanley / Bain / Fortune | https://www.goldmansachs.com/insights/articles/tracking-trillions-the-assumptions-shaping-scale-of-the-ai-build-out | High | Capex scale, $15-20M/MW build cost, 2-3 vs 5-6yr depreciation, ROI-decay, revenue-vs-capex gap. |
| Scale-up/out standards & forecasts (NVLink/UALink/SUE/UEC) | SemiAnalysis / 650 Group / Chipstrat / Arista / Marvell / Broadcom | https://newsletter.semianalysis.com/p/the-new-ai-networks-ultra-ethernet-uec-ualink-vs-broadcom-scale-up-ethernet-sue | High | Standards war and timelines (NVLink 6 shipping; UALink samples H2 2026; SUE/ESUN 2027), 2030 sizing. |
| Behind-the-meter, gas-to-power, SMRs, interconnect timelines | SemiAnalysis / DCD / RAND (Pilz/Heim) | https://newsletter.semianalysis.com/p/how-ai-labs-are-solving-the-power | High | Power-supply options/timelines: BTM gas (18-36 mo), ~6 Bcf/d AI gas by 2030, PJM queue growth, 24/7 CFE. |

---

### Notes on dedup & exclusions
- The SemiAnalysis "100,000 H100 Clusters," "Datacenter Anatomy," "AI Neocloud Playbook," "Inside the 800VDC Revolution," "GB200 Hardware/BOM," "ClusterMAX," and "Onsite Gas" pieces each appeared 2-6 times across research streams; merged to single canonical entries (listed under their primary topic, cross-referenced where multi-domain).
- NVIDIA DGX SuperPOD RA (GB200), GB200-NVL72-to-OCP contribution, OCP Open Data Center for AI, ASHRAE TC 9.9, Uptime Tiers, JLL 2026 Outlook, Llama 3 Herd, "Revisiting Reliability," Goldman "Tracking Trillions," FERC PJM colocation order, Ascend Analytics queue analysis, and "Enabling 1 MW Racks" each appeared multiple times; deduplicated to one entry, multi-domain ones noted.
- **Dropped:** the placeholder/junk record `{"title":"src1", ... "from":"test"}` (no real content).
- A handful of cited sources carried verification caveats from the original researchers: the McKinsey neocloud fetch timed out; Data Center Watch and ITAD/decommissioning vendor figures are advocacy- or vendor-sourced; and several Introl/Spheron synthesis pieces are marked Medium pending cross-check. These are retained as Medium with the caveat preserved in the "Good for" column.