Timeline
Team
Coaches
Ryan Scherbarth
Coach
Sr. Software Engineer at Tesla (ML & HPC Infra). Led UNM's team at SC23 and SC24, and multiple other HPC competitions.
Alex Knigge
Coach
Software Engineer at Sandia National Laboratories (HPC monitoring & perf). Led UNM's team at SC25 and multiple other HPC competitions.
Dr. Matthew Fricke
Faculty Advisor
Research Associate Professor at the University of New Mexico, and faculty sponsor of UNM's HPC team since its founding in 2023.
Cluster Configuration (Tentative)
3× Nodes
Network Architecture
With only 3 nodes, every pair of nodes connects directly via 4× 400Gb/s links, one port from each of a node's four dual-port ConnectX-7 NICs, giving 1.6Tb/s of bidirectional bandwidth per pair with zero switch hops. A 400G InfiniBand switch draws an estimated 2–4kW and adds a hop of latency; eliminating it returns that power directly to the GPU budget.
We also configure a rail-optimized network topology: each NIC, by slot index (CX7-0 through CX7-3), forms an independent IB rail connecting GPU N of every node. NCCL AllReduce traffic stays within a single rail, which is the condition that makes rail optimization meaningful. Note that this only works here because each 2-port NIC can reach both other nodes with one port each, completing the rail without a switch. A sketch of the NCCL settings we'd use to keep traffic on-rail follows.
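A minimal sketch of that launcher environment, assuming the NICs enumerate as mlx5_0 through mlx5_3 (hypothetical names; the real ones come from ibstat). NCCL_IB_HCA and NCCL_CROSS_NIC are standard NCCL variables, but the exact values would still be validated against competition workloads.

import os

# Restrict NCCL to the four rail NICs, one per GPU/NIC slot index.
# Device names are placeholders; substitute whatever ibstat reports.
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1,mlx5_2,mlx5_3"

# Keep each ring/channel on the same NIC index at both ends, i.e. on one rail.
os.environ["NCCL_CROSS_NIC"] = "0"

# These must be in the environment before NCCL initializes, e.g. exported by
# the mpirun/srun wrapper rather than set after the job has started.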
The switchless mesh doesn't generalize: a 4th node would exhaust all NIC ports and require a switch. But within the fixed 3-node competition constraint, it's a straightforward perf-per-watt gain worth taking.
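The port math behind that constraint, as a quick sanity check; the node, NIC, and port counts are taken from the configuration above.

NODES = 3            # set to 4 to see the assert fire
NICS_PER_NODE = 4    # CX7-0 through CX7-3
PORTS_PER_NIC = 2
LINK_GBPS = 400

# In the switchless mesh, each NIC spends one port per peer node on its rail.
ports_needed_per_nic = NODES - 1
assert ports_needed_per_nic <= PORTS_PER_NIC, "not enough ports per NIC: a switch is required"

# Every node pair is joined by one link per rail.
print(f"links per node pair: {NICS_PER_NODE}")
print(f"bandwidth per node pair: {NICS_PER_NODE * LINK_GBPS} Gb/s")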
In further pursuit of performance per watt, we plan to run IP over InfiniBand rather than maintain a dedicated front-end network. A separate front end isn't strictly necessary, and dropping it also saves the power an additional switch would draw.
Power Budget
Running uncapped exceeds the 10kW limit by ~2kW. The plan: throttle each H200 NVL to ~535W (76% TDP), holding all other components at full draw.
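The arithmetic behind that cap, as a rough sketch; the ~2kW overage is the only input beyond the published 700W TDP, so the result is an estimate rather than a measurement.

GPUS = 12
TDP_W = 700
CAP_W = 535
OVERAGE_W = 2_000    # uncapped draw exceeds the 10kW limit by roughly this much

savings_w = GPUS * (TDP_W - CAP_W)   # 12 x 165W = 1,980W
print(f"cap = {CAP_W}W ({CAP_W / TDP_W:.0%} of TDP)")
print(f"GPU savings = {savings_w}W vs ~{OVERAGE_W}W overage")
# The savings roughly match the overage, so the capped system lands at about
# the 10kW limit with every other component still at full draw.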
H200 NVL performance (and GPU performance in general) doesn't scale linearly with power: HBM3e bandwidth and tensor throughput retain roughly 85–90% at 76% TDP, which works out to roughly 1.1–1.2× the FLOPS/W of an uncapped card. Running uncapped recovers ~10–15% raw compute but immediately blows the budget. The throttled configuration achieves better FLOPS/W overall, which is what matters under competition scoring. The no-switch topology partially offsets the throttling by returning the 2–4kW a switch would have drawn to the GPU headroom, giving us a wider throttle margin than a conventional switched design would allow.
The H200 NVL was chosen over the H200 SXM for this competition specifically because of its efficiency characteristics at reduced TDP. Both are rated at 700W with 4.8TB/s of HBM3e bandwidth at 100% TDP, but the SXM form factor requires a dedicated server tray/baseboard that draws an additional ~50–100W of overhead per GPU regardless of throttle level, power that doesn't scale down when you cap the GPU. At our 535W target, an H200 SXM node effectively draws ~585–635W per GPU once baseboard overhead is included, compared to 535W flat for the NVL. Across 12 GPUs that difference is roughly 600–1,200W, close to two full GPUs' worth of headroom. In SC24's H100 NVL benchmarking, efficiency continued improving linearly all the way down to 50% TDP; we expect similar behavior here, meaning the 535W cap is conservative and may be dropped further once we have the hardware to benchmark against competition workloads.
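A minimal sketch of how we'd apply whatever cap the benchmarking settles on, assuming nvidia-smi is on PATH and the chosen value sits inside the card's supported power-limit range (both commands need root).

import subprocess

CAP_W = 535  # placeholder; replaced by whatever value benchmarking selects

# Enable persistence mode so driver state, including the cap, is retained
# while no process holds the GPUs; then apply the power limit. With no -i
# index given, nvidia-smi applies the setting to every GPU in the node.
subprocess.run(["nvidia-smi", "-pm", "1"], check=True)
subprocess.run(["nvidia-smi", "-pl", str(CAP_W)], check=True)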
We also evaluated the H100 NVL, B200, and B300. The H100 NVL fits the power budget comfortably but gives up 2.4× memory bandwidth for no meaningful cost benefit at competition scale. The B200 and B300 have a number of performance improvements, but their TDP would require throttling to a sub-optimal threshold to meet the power budget.
After its parameters are tuned, the HPL benchmark quickly becomes a bandwidth benchmark, which is where we see a meaningful performance improvement from the rail-optimized network topology and the removal of the network switch.
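For a sense of scale, a back-of-the-envelope HPL problem-size estimate; the 85% memory fill factor and the NB block size are placeholder tuning parameters rather than measured choices, and 141GB is the published HBM3e capacity per H200.

import math

GPUS = 12
MEM_PER_GPU_GB = 141   # published H200 HBM3e capacity
FILL = 0.85            # placeholder: fraction of GPU memory given to the HPL matrix
NB = 1024              # placeholder block size; tuned empirically per GPU generation

# Size the N x N double-precision matrix to fill most of aggregate GPU memory,
# then round N down to a multiple of NB.
total_bytes = GPUS * MEM_PER_GPU_GB * 1e9 * FILL
n = int(math.sqrt(total_bytes / 8) // NB) * NB
print(f"target problem size N ≈ {n:,}")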
HPCG, despite being an entirely different workload, still performs best in a throughput-maximized configuration: in this case, 4.8TB/s of HBM3e bandwidth per GPU.
GROMACS multi-node runs use GPU-accelerated PME, where peer-to-peer latency is the primary limiter; they also benefit from the zero-hop network architecture.
WRF domains can be particularly large, as we've seen in past competitions, which further justifies provisioning just over 1TB of DDR5 memory.
MLPerf training throughput maps directly to BF16 tensor core utilization across all GPUs, again benefiting from our bandwidth-maximized configuration.