Timeline
Team
Coaches
Ryan Scherbarth
Coach
Sr. Software Engineer at Tesla (ML & HPC Infra). Led UNM's team at SC23 and SC24, and multiple other HPC competitions.
Alex Knigge
Coach
Software Engineer at Sandia National Laboratories (HPC monitoring & perf). Led UNM's team at SC25 and multiple other HPC competitions.
Dr. Matthew Fricke
Faculty Advisor
Research Associate Professor at the University of New Mexico, and faculty sponsor of UNM's HPC team since its founding in 2023.
Cluster Configuration (Tentative)
3× Nodes
Network Architecture
With only 3 nodes, every pair of nodes connects directly via 4× 400Gb/s links, one port from each of a node's four dual-port ConnectX-7 NICs, giving 1.6Tb/s of bidirectional bandwidth per pair with zero switch hops. A 400G InfiniBand switch draws an estimated 2–4kW and adds a hop of latency; eliminating it returns that power directly to the GPU budget.
We also configure a rail-optimized network topology: each NIC, by slot index (CX7-0 through CX7-3), forms an independent IB rail connecting GPU N of every node. NCCL AllReduce traffic stays within a single rail, which is the condition that makes rail optimization meaningful. Note that this only works here because each 2-port NIC can reach both other nodes with one port each, completing the rail without a switch. A sketch of the NCCL settings we'd use to keep traffic on-rail follows.
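A minimal sketch of that launcher environment, assuming the NICs enumerate as mlx5_0 through mlx5_3 (hypothetical names; the real ones come from ibstat). NCCL_IB_HCA and NCCL_CROSS_NIC are standard NCCL variables, but the exact values would still be validated against competition workloads.

import os

# Restrict NCCL to the four rail NICs, one per GPU/NIC slot index.
# Device names are placeholders; substitute whatever ibstat reports.
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1,mlx5_2,mlx5_3"

# Keep each ring/channel on the same NIC index at both ends, i.e. on one rail.
os.environ["NCCL_CROSS_NIC"] = "0"

# These must be in the environment before NCCL initializes, e.g. exported by
# the mpirun/srun wrapper rather than set after the job has started.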
The switchless mesh doesn't generalize: a 4th node would exhaust all NIC ports and require a switch. But within the fixed 3-node competition constraint, it's a straightforward perf-per-watt gain worth taking.
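The port math behind that constraint, as a quick sanity check; the node, NIC, and port counts are taken from the configuration above.

NODES = 3            # set to 4 to see the assert fire
NICS_PER_NODE = 4    # CX7-0 through CX7-3
PORTS_PER_NIC = 2
LINK_GBPS = 400

# In the switchless mesh, each NIC spends one port per peer node on its rail.
ports_needed_per_nic = NODES - 1
assert ports_needed_per_nic <= PORTS_PER_NIC, "not enough ports per NIC: a switch is required"

# Every node pair is joined by one link per rail.
print(f"links per node pair: {NICS_PER_NODE}")
print(f"bandwidth per node pair: {NICS_PER_NODE * LINK_GBPS} Gb/s")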
In further pursuit of performance per watt, we plan to run IP over InfiniBand rather than maintain a dedicated front-end network. A separate front end isn't strictly necessary, and dropping it also saves the power an additional switch would draw.
Power Budget
Running uncapped exceeds the 10kW limit by ~2kW. The plan: throttle each H200 NVL to ~535W (76% TDP), holding all other components at full draw.
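The arithmetic behind that cap, as a rough sketch; the ~2kW overage is the only input beyond the published 700W TDP, so the result is an estimate rather than a measurement.

GPUS = 12
TDP_W = 700
CAP_W = 535
OVERAGE_W = 2_000    # uncapped draw exceeds the 10kW limit by roughly this much

savings_w = GPUS * (TDP_W - CAP_W)   # 12 x 165W = 1,980W
print(f"cap = {CAP_W}W ({CAP_W / TDP_W:.0%} of TDP)")
print(f"GPU savings = {savings_w}W vs ~{OVERAGE_W}W overage")
# The savings roughly match the overage, so the capped system lands at about
# the 10kW limit with every other component still at full draw.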
H200 NVL performance (and GPU performance in general) doesn't scale linearly with power: HBM3e bandwidth and tensor throughput retain roughly 85–90% at 76% TDP, which works out to roughly 1.1–1.2× the FLOPS/W of an uncapped card. Running uncapped recovers ~10–15% raw compute but immediately blows the budget. The throttled configuration achieves better FLOPS/W overall, which is what matters under competition scoring. The no-switch topology partially offsets the throttling by returning the 2–4kW a switch would have drawn to the GPU headroom, giving us a wider throttle margin than a conventional switched design would allow.
The H200 NVL was chosen over the H200 SXM for this competition specifically because of its efficiency characteristics at reduced TDP. Both are rated at 700W with 4.8TB/s of HBM3e bandwidth at 100% TDP, but the SXM form factor requires a dedicated server tray/baseboard that draws an additional ~50–100W of overhead per GPU regardless of throttle level, power that doesn't scale down when you cap the GPU. At our 535W target, an H200 SXM node effectively draws ~585–635W per GPU once baseboard overhead is included, compared to 535W flat for the NVL. Across 12 GPUs that difference is roughly 600–1,200W, close to two full GPUs' worth of headroom. In SC24's H100 NVL benchmarking, efficiency continued improving linearly all the way down to 50% TDP; we expect similar behavior here, meaning the 535W cap is conservative and may be dropped further once we have the hardware to benchmark against competition workloads.
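A minimal sketch of how we'd apply whatever cap the benchmarking settles on, assuming nvidia-smi is on PATH and the chosen value sits inside the card's supported power-limit range (both commands need root).

import subprocess

CAP_W = 535  # placeholder; replaced by whatever value benchmarking selects

# Enable persistence mode so driver state, including the cap, is retained
# while no process holds the GPUs; then apply the power limit. With no -i
# index given, nvidia-smi applies the setting to every GPU in the node.
subprocess.run(["nvidia-smi", "-pm", "1"], check=True)
subprocess.run(["nvidia-smi", "-pl", str(CAP_W)], check=True)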
We also evaluated the H100 NVL, B200, and B300. The H100 NVL fits the power budget comfortably but gives up 2.4× memory bandwidth for no meaningful cost benefit at competition scale. The B200 and B300 have a number of performance improvements, but their TDP would require throttling to a sub-optimal threshold to meet the power budget.
After its parameters are tuned, the HPL benchmark quickly becomes a bandwidth benchmark, which is where we see a meaningful performance improvement from the rail-optimized network topology and the removal of the network switch.
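For a sense of scale, a back-of-the-envelope HPL problem-size estimate; the 85% memory fill factor and the NB block size are placeholder tuning parameters rather than measured choices, and 141GB is the published HBM3e capacity per H200.

import math

GPUS = 12
MEM_PER_GPU_GB = 141   # published H200 HBM3e capacity
FILL = 0.85            # placeholder: fraction of GPU memory given to the HPL matrix
NB = 1024              # placeholder block size; tuned empirically per GPU generation

# Size the N x N double-precision matrix to fill most of aggregate GPU memory,
# then round N down to a multiple of NB.
total_bytes = GPUS * MEM_PER_GPU_GB * 1e9 * FILL
n = int(math.sqrt(total_bytes / 8) // NB) * NB
print(f"target problem size N ≈ {n:,}")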
HPCG, despite being an entirely different workload, still performs best in a throughput-maximized configuration: in this case, 4.8TB/s of HBM3e bandwidth per GPU.
GROMACS multi-node runs use GPU-accelerated PME, where peer-to-peer latency is the primary limiter; they also benefit from the zero-hop network architecture.
WRF domains can be particularly large, as we've seen in past competitions, which further justifies provisioning just over 1TB of DDR5 memory.
MLPerf training throughput maps directly to BF16 tensor core utilization across all GPUs, again benefiting from our bandwidth-maximized configuration.