HPC Vega Architecture
The tables below summarize the type and quantity of the major hardware components of the proposed solution for the Vega system:
Compute
GPU partition
| Category | Component | Quantity | Description |
|---|---|---|---|
| Infrastructure | Rack | 2 | XH2000 DLC rack with PSUs, HYC and IB HDR switches |
| Compute | GPU node | 60 | 4 x NVIDIA A100, 2 x AMD Rome 7H12, 512 GB RAM, 2 x HDR dual-port mezzanine, 1 x 1.92 TB M.2 SSD |
CPU partition
| Category | Component | Quantity | Description |
|---|---|---|---|
| Infrastructure | Rack | 10 | XH2000 DLC rack with PSUs, HYC and IB HDR switches |
| Compute | Standard CPU node | 768 | 256 blades of 3 compute nodes, each node with 2 x AMD Rome 7H12 (64c, 2.6 GHz, 280 W), 256 GB RAM, 1 x HDR100 single-port mezzanine, 1 x 1.92 TB M.2 SSD |
| Compute | Large-memory CPU node | 192 | 64 blades of 3 compute nodes, each node with 2 x AMD Rome 7H12 (64c, 2.6 GHz, 280 W), 1 TB RAM, 1 x HDR100 single-port mezzanine, 1 x 1.92 TB M.2 SSD |
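As a quick cross-check of the CPU partition table, the totals can be derived from the node counts. The following is a minimal sketch, assuming 128 cores per node (two 64-core sockets, as listed) and treating 1 TB of RAM as 1024 GB; the variable names are illustrative only.

```python
# Per-node figures taken from the CPU partition table.
CORES_PER_SOCKET = 64
SOCKETS_PER_NODE = 2

# name: (node count, RAM per node in GB; 1 TB is assumed to mean 1024 GB)
cpu_partition = {
    "standard": (768, 256),
    "large-memory": (192, 1024),
}

total_nodes = sum(nodes for nodes, _ in cpu_partition.values())
total_cores = total_nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET
total_ram_gb = sum(nodes * ram_gb for nodes, ram_gb in cpu_partition.values())

if __name__ == "__main__":
    print(f"CPU nodes: {total_nodes}")
    print(f"CPU cores: {total_cores}")
    print(f"RAM:       {total_ram_gb} GB")
```

Under these assumptions the standard and large-memory nodes contribute the same total amount of RAM, because the large-memory nodes are a quarter as numerous but carry four times the memory.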
Storage
HPST (High-Performance Storage Tier)
| Category | Component | Quantity | Description |
|---|---|---|---|
| Storage | Flash-based building block | 10 | 2U ES400NVX (per appliance: 23 x 6.4 TB NVMe, 8 x InfiniBand HDR100, 4 embedded Lustre VMs, 1 OST and MDT per VM) |
LCST (Large-Capacity Storage Tier)
| Category | Component | Quantity | Description |
|---|---|---|---|
| Storage | Storage node | 61 | Supermicro SuperStorage 6029P-E1CR24L with 2 x Intel Xeon Silver 4214R (12c, 2.4 GHz, 100 W), 256 GB RAM DDR4 RDIMM 2933 MT/s, 1 x 240 GB SSD, 2 x 6.4 TB NVMe, 24 x 16 TB HDD, 2 x 25GbE Mellanox ConnectX-4 DP, 1 x 1GbE IPMI |
| Internal Ceph network | Ethernet switch | 8 | Mellanox SN2010 (per switch: 18 x 25GbE + 4 x 100GbE ports) |
Login and virtualization
| Category | Component | Quantity | Description |
|---|---|---|---|
| CPU login | Login nodes | 4 | Atos BullSequana X430-A5 with 2 x AMD EPYC 7H12, 256 GB RAM DDR4 3200 MT/s, 2 x 7.6 TB U.2 SSD, 1 x 100GbE DP ConnectX-5, 1 x 100Gb IB HDR ConnectX-6 SP |
| GPU login | Login nodes | 4 | Atos BullSequana X430-A5 with 1 x NVIDIA Ampere A100 PCIe GPU and 2 x AMD EPYC 7452 (32c, 2.35 GHz, 155 W), 256 GB RAM DDR4 3200 MT/s, 2 x 7.6 TB U.2 SSD, 1 x 100GbE DP ConnectX-5, 1 x 100Gb IB HDR ConnectX-6 SP |
| Service | Virtualization/service nodes | 30 | Atos BullSequana X430-A5 with 2 x AMD EPYC 7502 (32c, 2.5 GHz, 180 W), 512 GB RAM DDR4 3200 MT/s, 2 x 7.6 TB U.2 SSD, 1 x 100GbE DP ConnectX-5, 1 x 100Gb IB HDR ConnectX-6 SP |
Network and interconnect infrastructure
| Category | Component | Quantity | Description |
|---|---|---|---|
| Interconnect network | IB switch | 68 | 40-port Mellanox HDR switches, Dragonfly+ topology |
| Interconnect ports | IB HDR100/200 ports on IB cards | 1230 | 960 compute, 60 (x2) GPU, 8 login, 30 virtualization, 10 (x8) HPST and 8 (x4) Skyway gateways, with Mellanox ConnectX-6 (single or dual port) |
| IPoIB gateway | IB/Ethernet data gateway | 4 | Mellanox Skyway IB-to-Ethernet gateway appliance (per gateway: 8 x IB and 8 x 100GbE ports) |
| Ethernet data network | Top-level switches | 2 | Cisco Nexus N3K-C3408-S, 192 x 100GbE ports activated |
| WAN connectivity | IP routers | 2 | Cisco Nexus N3K-C3636C-R, 5 x 100GbE to WAN (available by the end of 2021) |
| Top management network | 10GbE switch | 2 | Mellanox 2410 switches (per switch: 48 x 10GbE ports) |
| In-band/out-of-band management network | 1GbE switch | 4 | Mellanox 4610 switches (per switch: 48 x 1GbE + 2 x 10GbE ports) |
| Rack management network | WELB switch | 24 | Two integrated WELB (sWitch Ethernet Leaf Board) switches per rack, each with three 24-port Ethernet switch instances and one Ethernet Management Controller (EMC) |
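The InfiniBand port total in the table can be reproduced from its own breakdown. The sketch below simply multiplies each device count by its ports per device, using only the figures listed in the interconnect row; the group labels are illustrative. It also checks the WELB switch count against the 2 GPU and 10 CPU racks listed earlier.

```python
# (group, device count, IB ports per device), as listed in the interconnect row.
ib_port_groups = [
    ("compute nodes", 960, 1),
    ("GPU nodes", 60, 2),
    ("login nodes", 8, 1),
    ("virtualization nodes", 30, 1),
    ("flash storage appliances", 10, 8),
    ("Skyway gateways", 8, 4),
]

total_ib_ports = sum(count * ports for _, count, ports in ib_port_groups)

# Two WELB switches per rack; 2 GPU racks plus 10 CPU racks.
racks = 2 + 10
welb_switches = racks * 2

if __name__ == "__main__":
    for group, count, ports in ib_port_groups:
        print(f"{group:26s} {count:4d} x {ports} = {count * ports}")
    print(f"total IB ports: {total_ib_ports}")
    print(f"WELB switches:  {welb_switches}")
```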
GPU architecture
GPU specifications
| NVIDIA Datacenter GPU | NVIDIA A100 |
|---|---|
| GPU codename | GA100 |
| GPU architecture | Ampere |
| Launch date | May 2020 |
| GPU process | TSMC 7nm |
| Die size | 826 mm² |
| Transistor count | 54 billion |
| FP64 CUDA cores | 3,456 |
| FP32 CUDA cores | 6,912 |
| Tensor cores | 432 |
| Streaming Multiprocessors | 108 |
| Peak FP64 | 9.7 teraflops |
| Peak FP64 Tensor Core | 19.5 teraflops |
| Peak FP32 | 19.5 teraflops |
| Peak FP32 Tensor Core | 156 teraflops/312 teraflops* |
| Peak BFLOAT16 Tensor Core | 312 teraflops/624 teraflops* |
| Peak FP16 Tensor Core | 312 teraflops/624 teraflops* |
| Peak INT8 Tensor Core | 624 TOPS/1,248 TOPS* |
| Peak INT4 Tensor Core | 1,248 TOPS/2,496 TOPS* |
| Mixed-precision Tensor Core | 312 teraflops/624 teraflops* |
| Max TDP | 400 watts |

\* With structured sparsity.
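To put the per-GPU figures in context, the theoretical aggregate peak of the GPU partition can be estimated from the 60 GPU nodes with four A100 GPUs each listed in the GPU partition table. This is a back-of-the-envelope sketch: it multiplies vendor peak numbers and says nothing about measured performance. The function name is illustrative.

```python
# Figures from the tables above: 60 GPU nodes with 4 x A100 each.
GPU_NODES = 60
GPUS_PER_NODE = 4

# Per-GPU peak values in teraflops, as listed in the specification table.
PEAK_TFLOPS = {"FP64": 9.7, "FP64 Tensor Core": 19.5, "FP32": 19.5}

total_gpus = GPU_NODES * GPUS_PER_NODE


def aggregate_pflops(per_gpu_tflops: float) -> float:
    """Theoretical aggregate peak in petaflops (1 PFLOPS = 1000 TFLOPS)."""
    return total_gpus * per_gpu_tflops / 1000


if __name__ == "__main__":
    print(f"GPUs in the partition: {total_gpus}")
    for precision, tflops in PEAK_TFLOPS.items():
        print(f"{precision}: {aggregate_pflops(tflops):.2f} PFLOPS peak")
```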
NVIDIA System Management Interface
Run the tool with the nvidia-smi command; add the --help switch to list the general options.
Multi-Instance GPU (MIG) is currently not enabled on HPC Vega.
[root@gn01 ~]# nvidia-smi
Wed Jul 12 11:50:30 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:03:00.0 Off | 0 |
| N/A 50C P0 140W / 400W | 2584MiB / 40960MiB | 51% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB On | 00000000:44:00.0 Off | 0 |
| N/A 43C P0 56W / 400W | 8MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB On | 00000000:84:00.0 Off | 0 |
| N/A 44C P0 56W / 400W | 8MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-40GB On | 00000000:C4:00.0 Off | 0 |
| N/A 49C P0 83W / 400W | 2818MiB / 40960MiB | 51% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1503150 C ... 2570MiB |
| 3 N/A N/A 1510286 C ... 2802MiB |
+---------------------------------------------------------------------------------------+
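For scripting, the same information is easier to consume through the `--query-gpu` and `--format=csv` options of nvidia-smi than by reading the table layout. The sketch below is one possible approach, not a site-provided tool: the helper names (`parse_gpu_csv`, `query_gpus`, `idle_gpus`) are hypothetical, and `SAMPLE` is hand-written in the query's CSV layout using the values from the listing above, so the parsing can be tried on a machine without GPUs.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse the output of --format=csv,noheader,nounits into dictionaries."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        index, name, util, used, total = (field.strip() for field in record)
        rows.append({
            "index": int(index),
            "name": name,
            "util_percent": int(util),
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
        })
    return rows


def query_gpus() -> list[dict]:
    """Run nvidia-smi on the local node and parse its CSV output."""
    cmd = ["nvidia-smi", f"--query-gpu={','.join(QUERY_FIELDS)}",
           "--format=csv,noheader,nounits"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


def idle_gpus(rows: list[dict], max_util: int = 0) -> list[int]:
    """Indices of GPUs whose utilization does not exceed max_util percent."""
    return [row["index"] for row in rows if row["util_percent"] <= max_util]


# Hand-written sample mirroring the values in the listing above.
SAMPLE = """\
0, NVIDIA A100-SXM4-40GB, 51, 2584, 40960
1, NVIDIA A100-SXM4-40GB, 0, 8, 40960
2, NVIDIA A100-SXM4-40GB, 0, 8, 40960
3, NVIDIA A100-SXM4-40GB, 51, 2818, 40960
"""

if __name__ == "__main__":
    try:
        gpus = query_gpus()
    except (OSError, subprocess.CalledProcessError):
        gpus = parse_gpu_csv(SAMPLE)  # fall back when nvidia-smi is unavailable
    print("idle GPUs:", idle_gpus(gpus))
```

The parser assumes every field is numeric; a field that nvidia-smi reports as unavailable would need extra handling.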
Node topology

Check the topology of a GPU node with the nvidia-smi topo command.
[root@gn01 ~]# nvidia-smi topo -mp
GPU0 GPU1 GPU2 GPU3 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS SYS SYS 48-63,176-191 3 N/A
GPU1 SYS X SYS SYS PIX SYS 16-31,144-159 1 N/A
GPU2 SYS SYS X SYS SYS PIX 112-127,240-255 7 N/A
GPU3 SYS SYS SYS X SYS SYS 80-95,208-223 5 N/A
NIC0 SYS PIX SYS SYS X SYS
NIC1 SYS SYS PIX SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
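The matrix can be used to choose the network adapter closest to each GPU, which matters for GPU-to-network transfers. The sketch below hard-codes the GPU-to-NIC entries from the output above and ranks the link types from closest to farthest; that ordering is an assumption derived from the legend's descriptions, and the dictionary and function names are illustrative.

```python
# Link types from the legend, ranked from closest (0) to farthest (4).
# The ordering is an assumption based on the legend's descriptions.
LINK_RANK = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}

# GPU-to-NIC entries copied from the topology matrix above.
GPU_NIC_LINKS = {
    "GPU0": {"NIC0": "SYS", "NIC1": "SYS"},
    "GPU1": {"NIC0": "PIX", "NIC1": "SYS"},
    "GPU2": {"NIC0": "SYS", "NIC1": "PIX"},
    "GPU3": {"NIC0": "SYS", "NIC1": "SYS"},
}

# Device names from the NIC legend.
NIC_DEVICES = {"NIC0": "mlx5_0", "NIC1": "mlx5_1"}


def nearest_nic(gpu: str) -> tuple[str, str]:
    """Return (device name, link type) of the closest NIC for a GPU.

    On a tie, the first NIC in the matrix order is returned.
    """
    links = GPU_NIC_LINKS[gpu]
    best = min(links, key=lambda nic: LINK_RANK[links[nic]])
    return NIC_DEVICES[best], links[best]


if __name__ == "__main__":
    for gpu in GPU_NIC_LINKS:
        device, link = nearest_nic(gpu)
        print(f"{gpu}: {device} ({link})")
```

In this matrix only GPU1 and GPU2 share a PCIe bridge with an adapter (PIX); GPU0 and GPU3 reach both adapters across the SMP interconnect (SYS), so neither adapter is preferable for them.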
The NUMA ID of the CPU closest to a GPU is shown with the -C switch; select the GPU with the -i switch and a GPU index (0-3).
[root@gn01 ~]# nvidia-smi topo -C -i 0
NUMA IDs of closest CPU: 3
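The CPU Affinity column of the topology matrix above lists the cores closest to each GPU, and a process that drives a given GPU can be pinned to them. The following is a minimal sketch, assuming Linux (where `os.sched_setaffinity` is available) and hard-coding the affinity strings from the matrix; `parse_cpu_list` and `pin_to_gpu` are hypothetical helper names. Within a batch job the scheduler may already restrict the usable cores, so the sketch intersects the two sets before pinning.

```python
import os

# CPU affinity per GPU index, copied from the topology matrix above.
GPU_CPU_AFFINITY = {
    0: "48-63,176-191",
    1: "16-31,144-159",
    2: "112-127,240-255",
    3: "80-95,208-223",
}


def parse_cpu_list(spec: str) -> set[int]:
    """Expand a list such as '48-63,176-191' into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        first, _, last = part.strip().partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def pin_to_gpu(gpu_index: int) -> set[int]:
    """Pin the current process to the usable CPUs closest to a GPU (Linux only)."""
    wanted = parse_cpu_list(GPU_CPU_AFFINITY[gpu_index])
    usable = wanted & os.sched_getaffinity(0)  # respect existing restrictions
    if usable:
        os.sched_setaffinity(0, usable)
    return usable


if __name__ == "__main__":
    for gpu, spec in GPU_CPU_AFFINITY.items():
        print(f"GPU{gpu}: {len(parse_cpu_list(spec))} logical CPUs ({spec})")
```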
Show the most direct path between a selected pair of GPUs.
[root@gn01 ~]# nvidia-smi topo -p -i 0,2
Device 0 is connected to device 2 by way of an SMP interconnect link between NUMA nodes.