The most exciting thing about the Top500 rankings of supercomputers that come out each June and November is not who is at the top of the list. That is fun and interesting, of course, but the real value of the Top500 is the architectural lessons it gives us when we see new systems emerge on the list and we get to see how choices of compute, memory, interconnect, storage, and budget all play out at a system level and across countries and industries.
We would normally walk through the top ten machines on the list and then delve into the statistics embodied in the Top500. This time around, we have assembled a more usable feeds and speeds table of the salient characteristics of the top thirty machines on the list, which we feel is representative of the upper echelon of HPC supercomputing right now.
But first, before we do that, it is appropriate to review the performance development of the top and bottom systems that are tested using the High Performance LINPACK benchmark, as well as the total capacity represented in the entire list, over the past thirty years that the Top500 rankings have been compiled.
We have certainly strayed from the Moore's Law curve that taught us to expect exponential performance increases. And it is fair to say that we have a very lopsided market – at least if you think those who submit HPL test results to the Top500 organizers are representative – with some very large machines comprising a big portion of the aggregate compute capacity on the list and a slew of relatively small machines that are nowhere near as powerful making up the rest.
A treemap of systems by architecture, created by the very clever Top500 database and graphing software, shows visually just how lopsided this is, with each square representing a single machine and each color representing a particular compute engine choice in the November 2023 rankings:
Going clockwise from the upper left, the big burnt orange square is the “Fugaku” system at RIKEN Lab in Japan, based on Fujitsu’s A64FX Arm chip with big fat vectors. The olive green/brownish color in the largest square is the Frontier machine (hey, I see in color, I just don’t necessarily see the same colors you might) and the lighter version of this color is the AMD GPU neighborhood. The blue square at 5 o’clock is the half of the Aurora system that has been tested, and the purple area just below it is the “TaihuLight” system based on the Sunway SW26010 processor (similar in architecture to the A64FX in that it is a CPU with lots of vector engines) at the National Supercomputing Center in Wuxi, China. The Nvidia A100 neighborhood is immediately to the left, with the “Leonardo” system at CINECA in Italy, built by Atos, being the biggest square in this hood. The blue neighborhood at 6:30 as we move around is not just IBM, but includes the pre-exascale “Summit” machine at Oak Ridge and the companion “Sierra” machine at Lawrence Livermore National Laboratory. At the lower left you see that a new Microsoft machine called “Eagle,” which is running in its Azure cloud, is not only the third largest supercomputer in the official Top500 rankings but is also the biggest machine using Nvidia’s “Hopper” H100 GPU accelerators.
As you can see in the performance development chart above, the addition of some new machines among the top thirty has helped pull the aggregate 64-bit floating point performance of the 500 biggest systems to submit results upwards. Notably, this includes the long-awaited “Aurora” supercomputer built by Hewlett Packard Enterprise with CPU and GPU compute engines from Intel and the Slingshot interconnect from HPE. Or rather, it includes around half of that machine, which has just a tiny bit over 2 exaflops of peak theoretical performance and which is, as happens with any new architecture, going through a shakedown period where it is being tuned to deliver better computational efficiency than its current results show.
The “Frontier” system at Oak Ridge National Laboratory, comprised of custom “Trento” Epyc CPUs and “Aldebaran” MI250X GPUs from AMD, all lashed together with HPE’s Slingshot 11 interconnect, remains the number one system in the world as ranked by HPL performance. But there are two and possibly three different machines installed in China that could rival Frontier and a fully tested Aurora machine. We have placed them unofficially on the Top 30 list, using the expected peak and HPL performance as assessed by Hyperion Research, so we can reflect reality a bit better.
In the table above, the machines with light blue bars are using special accelerators that put what is in essence a CPU and a fat vector engine on a single-socket processor. Fugaku and TaihuLight do this. The machines with the gray bars are CPU-only machines or partitions of larger machines that are only based on CPUs. The remaining 22 machines (not including the two Chinese exascalers in bold red italics and in the yellow bars) are based on hybrid architectures that pair CPUs with accelerators, and most of the time the accelerator is a GPU from either Nvidia or AMD.
Considering that the compute engines in supercomputers are all expensive, we are always particularly interested in the computational efficiency of each machine, by which we mean the ratio of HPL performance to peak theoretical performance. The higher the ratio, the better we feel about an architecture. This is one of the few datasets that allows us to calculate this ratio across a wide array of architectures and system sizes, so we do what we can with what we have, but we know full well the limitations of using HPL as a sole performance metric for comparing supercomputers.
That said, we note that at 55.3 percent of peak, the HPL run on the new Aurora machine – or rather, on about half of it – was not as efficient as Argonne, Intel, and HPE had probably hoped. We estimated back in May of this year that with 63,744 geared down “Ponte Vecchio” Max GPUs running at 31.5 teraflops each, the Aurora machine would deliver 2.05 exaflops of peak. At that low computational efficiency, the Aurora machine fully scaled out would only hit 1.13 exaflops on the HPL test, which is less than the 1.17 exaflops that Frontier is delivering. At somewhere around 65 percent computational efficiency, Aurora should hit 1.31 exaflops, and at 70 percent, it could hit 1.41 exaflops.
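If you want to play with those projections yourself, the arithmetic is simple enough to sketch in a few lines of Python. The GPU count and per-GPU teraflops below are our own estimates from May, not official Intel figures, and the rounding will differ a touch from the numbers quoted above.

```python
# Back-of-the-envelope sketch of the Aurora HPL projections discussed above.
# The GPU count and per-GPU FP64 figure are our earlier estimates, not official specs.

gpu_count = 63_744        # "Ponte Vecchio" Max GPUs in the full Aurora machine (estimate)
tflops_per_gpu = 31.5     # geared-down FP64 peak per GPU, in teraflops (estimate)

peak_ef = gpu_count * tflops_per_gpu / 1e6   # teraflops -> exaflops, roughly 2 EF
print(f"Estimated peak: {peak_ef:.2f} EF")

# Project full-scale HPL at several computational efficiencies (HPL / peak)
for eff in (0.553, 0.65, 0.70):
    print(f"At {eff:.1%} efficiency: {peak_ef * eff:.2f} EF on HPL")
```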
We think Aurora will get more of its peak floating point oomph chewing on HPL as Intel and HPE test the machine at a fuller scale. This is Intel’s first version of its Xe Link interconnect, which is used to hook the Max GPUs to each other and to the “Sapphire Rapids” Xeon SP HBM processors in each Aurora node. Nvidia has shipped its fourth version of NVLink and AMD is on its third version of Infinity Fabric. These things take time.
There are several other machines in the top thirty of the November 2023 list that are below the average computational efficiency. And it is not like other machines did not start out at this level (or lower) before, and sometimes even after, their first appearance on the rankings. For instance, we heard a rumor as we were awaiting Frontier’s appearance that it was well below 50 percent computational efficiency, which is why we did not see it when we expected it. In that case, there was a new CPU, a new GPU, and a new interconnect that all had to be tuned up together at scale for the first time. Ditto for the accelerator cluster (ACC) portion of the “MareNostrum 5” system at Barcelona Supercomputing Center in Spain or the “ISEG” system at Nebius AI in the Netherlands.
We expect CPU-only machines to be more efficient because there is one less layer of networking involved. And indeed, if you average the computational efficiency of the eight CPU-only machines in the top thirty, you get 77.1 percent of peak flops for HPL, while the accelerated machines average 70.3 percent.
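For those who want to redo this kind of math against the raw Top500 data, here is a minimal sketch of the calculation. The machine entries below are placeholders for illustration, not the actual top thirty table.

```python
# Minimal sketch: average computational efficiency (HPL Rmax / Rpeak) per machine class.
# The sample entries are illustrative placeholders, not the real November 2023 data.

from statistics import mean

machines = [
    # (name, class, rmax_pflops, rpeak_pflops) -- made-up values for illustration
    ("cpu_only_a",    "cpu",         30.0,  38.0),
    ("cpu_only_b",    "cpu",         25.0,  33.0),
    ("accelerated_a", "accelerated", 500.0, 700.0),
    ("accelerated_b", "accelerated", 120.0, 180.0),
]

for cls in ("cpu", "accelerated"):
    effs = [rmax / rpeak for _, c, rmax, rpeak in machines if c == cls]
    print(f"{cls}: average efficiency {mean(effs):.1%}")
```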
There does not seem to be a discernable pattern if you plot computational efficiency against concurrency, so it is not like higher orders of concurrency mean lower computational efficiency:
If you plot this on a log scale, there is no pattern that pops out, either.
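If you want to reproduce these scatter plots from the table, something like the following matplotlib sketch will do it. The cores and efficiency arrays below are illustrative stand-ins rather than the full top thirty dataset.

```python
# Sketch of the efficiency-versus-concurrency scatter plots described above,
# drawn on both linear and log x-axes. The data arrays are stand-ins.

import matplotlib.pyplot as plt

cores = [8_699_904, 7_630_848, 4_742_808, 1_463_616, 680_960]   # illustrative core counts
efficiency = [71.1, 82.3, 55.3, 78.0, 65.0]                     # illustrative percent of peak

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))
for ax, scale in ((ax_lin, "linear"), (ax_log, "log")):
    ax.scatter(cores, efficiency)
    ax.set_xscale(scale)
    ax.set_xlabel(f"total cores ({scale} scale)")
    ax.set_ylabel("HPL efficiency (% of peak)")
plt.tight_layout()
plt.show()
```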
We would also like to know, of course, if there is any correlation between HPL performance and real HPC simulation and modeling workload performance. For a lot of workloads, the HPCG benchmark, which chews up exaflops and spits them out with just terrible levels of computational efficiency, is probably a better gauge. And that is a bitter pill to swallow.