

Sep 16, 2023
This story provides a guide on how to build a multi-GPU system for deep learning and will hopefully save you some research time and experimentation.
Build a multi-GPU system for training computer vision models and LLMs without breaking the bank! 🏦
Let’s start with the fun (and expensive 💸💸💸) part!
The main considerations when buying a GPU are:
- memory (VRAM)
- performance (Tensor cores, clock speed)
- slot width
- power (TDP)
Memory
For deep learning tasks nowadays we need a loooot of memory. LLMs are huge even to fine-tune, and computer vision tasks can get memory-intensive, especially with 3D networks. Naturally, the most important aspect to look for is the GPU VRAM. For LLMs I recommend at least 24 GB of memory, and for computer vision tasks I wouldn’t go below 12 GB.
Performance
The second criterion is performance, which can be estimated with FLOPS (floating-point operations per second).
The crucial number in the past was the number of CUDA cores in the chip. However, with the emergence of deep learning, NVIDIA has introduced specialized tensor cores that can perform many more FMA (Fused Multiply-Add) operations per clock. These are already supported by the main deep learning frameworks and are what you should look for in 2023.
Below you can find a chart of the raw performance of GPUs grouped by memory, which I compiled after quite some manual work:
Note that you have to be extra careful when comparing the performance of different GPUs. Tensor cores of different generations / architectures are not comparable. For instance, each A100 tensor core performs 256 FP16 FMA operations / clock while a V100 tensor core performs “only” 64. Additionally, older architectures (Turing, Volta) do not support 32-bit tensor operations. What makes the comparison more difficult is that NVIDIA doesn’t always report the FMA, not even in the whitepapers, and GPUs of the same architecture can have different FMAs. I kept banging my head with this 😵‍💫. Also note that NVIDIA often advertises the tensor FLOPS with sparsity, which is a feature usable only at inference time.
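To make the FLOPS estimate concrete, here is a minimal sketch of the usual back-of-the-envelope formula: tensor cores × FMA per clock × 2 (one multiply and one add) × clock speed. The tensor core counts and boost clocks below are my assumptions taken from public spec sheets, not numbers from the chart; only the FMA-per-clock figures come from the text above.

```python
def tensor_tflops(tensor_cores: int, fma_per_clock: int, boost_clock_ghz: float) -> float:
    """Estimate dense tensor throughput in TFLOPS.

    Each FMA counts as two floating-point operations (a multiply and an add).
    """
    flops = tensor_cores * fma_per_clock * 2 * boost_clock_ghz * 1e9
    return flops / 1e12


# Assumed spec-sheet values (tensor core count, boost clock in GHz).
# The FMA-per-clock figures are the ones quoted in the text.
a100 = tensor_tflops(tensor_cores=432, fma_per_clock=256, boost_clock_ghz=1.41)
v100 = tensor_tflops(tensor_cores=640, fma_per_clock=64, boost_clock_ghz=1.53)

print(f"A100: {a100:.0f} TFLOPS (FP16, dense)")
print(f"V100: {v100:.0f} TFLOPS (FP16, dense)")
```

With these inputs the estimate lands at roughly 312 and 125 TFLOPS. Keep in mind this is a peak figure without sparsity; real training throughput will be lower.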
In order to identify the best GPU with respect to price, I collected the eBay prices using the eBay API and computed the relative performance per dollar (USD) for new cards:
I did the same for used cards, but since the rankings don’t change too much I omit the plot.
To select the best GPU for your budget, you can pick one of the top GPUs for the largest memory you can afford. My recommendation would be:
If you want to dive into more technical aspects, I suggest reading Tim Dettmers’ excellent guide on Which GPU(s) to Get for Deep Learning.
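If you want to redo this kind of ranking with your own price data, a minimal sketch could look like the following. The card names, TFLOPS values and prices are hypothetical placeholders, not the numbers I collected; the score is simply TFLOPS per USD, normalized so that the best card gets 1.0.

```python
def relative_perf_per_dollar(cards):
    """Return {name: score}, where the best TFLOPS-per-USD card scores 1.0."""
    ratios = {name: tflops / price_usd for name, tflops, price_usd in cards}
    best = max(ratios.values())
    ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
    return {name: ratio / best for name, ratio in ranked}


# Hypothetical placeholder data: (name, TFLOPS, price in USD).
cards = [
    ("card_a", 80.0, 800.0),
    ("card_b", 160.0, 2000.0),
    ("card_c", 30.0, 500.0),
]

scores = relative_perf_per_dollar(cards)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Since memory matters more than raw speed, it makes sense to run this separately for each memory tier rather than across all cards at once.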
Slot width
When building a multi-GPU system, we need to plan how to physically fit the GPUs into a PC case. Since GPUs grow larger and larger, especially the gaming series, this becomes more of an issue. Consumer motherboards have up to 7 PCIe slots and PC cases are built around this setup. A 4090 can easily take up 4 slots depending on the manufacturer, so you can see why this becomes an issue. Additionally, we should leave at least 1 slot between GPUs that are not blower style or watercooled to avoid overheating. We have the following options:
Watercooling
Watercooled variants will take up to 2 slots but they are more expensive. You can alternatively convert an air-cooled GPU, but this will void the warranty. If you don’t get All-in-One (AIO) solutions you will need to build a custom watercooling loop. This is also true if you want to fit multiple watercooled GPUs, since the AIO radiators may not fit in the case. Building your own loop is risky and I wouldn’t personally do it with expensive cards. I would only buy AIO solutions directly from the manufacturers (risk averse 🙈).
Aircooled 2–3 slot cards and PCIe risers
In this scenario you interleave cards on PCIe slots and cards connected with PCIe riser cables. The PCIe riser cards can be placed somewhere inside the PC case or in the open air. In either case you should make sure the GPUs are secured (see also the section about PC cases).
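The slot arithmetic from this section can be sketched in a few lines. This is a simplified model under my own assumptions: cards are stacked directly in the case, the bottom card cannot overhang past the last slot, and the actual PCIe slot spacing of your motherboard is ignored. The helper name `slots_needed` is made up for illustration.

```python
CASE_SLOTS = 7  # consumer motherboards / cases offer up to 7 slots


def slots_needed(card_widths, gap=1):
    """Slots occupied when cards are stacked with `gap` empty slots between neighbours."""
    if not card_widths:
        return 0
    return sum(card_widths) + gap * (len(card_widths) - 1)


# Air-cooled (non-blower) cards get a 1-slot gap; watercooled cards need none.
setups = [
    ("two 4-slot air-cooled cards", [4, 4], 1),
    ("three 3-slot air-cooled cards", [3, 3, 3], 1),
    ("three 2-slot watercooled cards", [2, 2, 2], 0),
]

for label, widths, gap in setups:
    needed = slots_needed(widths, gap)
    verdict = "fits" if needed <= CASE_SLOTS else "does not fit"
    print(f"{label}: {needed} slots, {verdict} in {CASE_SLOTS}")
```

Setups that do not fit on the motherboard are exactly where riser cables or watercooled variants come in.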
Power (TDP)
Modern GPUs get more and more power hungry. For instance, a 4090 requires 450 W while an H100 can get up to 700 W. Apart from the power bill, fitting three or more GPUs becomes an issue. This is especially true in the US, where power sockets can deliver up to around 1800 W.
A solution to this problem, if you are getting close to the max power you can draw from your PSU / power socket, is power-limiting. All you need to reduce the max power a GPU can draw is:
sudo nvidia-smi -i <GPU_index> -pl <power_limit>
where:
- GPU_index: the index (number) of the card as it is shown with nvidia-smi
- power_limit: the power in W you want to use
Power-limiting by 10-20% has been shown to reduce performance by less than 5% and keeps the cards cooler (experiment by Puget Systems). Power-limiting four 3090s by 20%, for instance, will reduce their consumption to 1120 W, which can easily fit in a 1600 W PSU / 1800 W socket (assuming 400 W for the rest of the components).
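The power budget above is easy to check with a small script. This is only a sketch: the 350 W stock TDP for the 3090 is my assumption (it is what the 1120 W figure implies), and the script only prints the nvidia-smi commands rather than running them.

```python
def power_budget(num_gpus, gpu_tdp_w, limit_fraction, rest_of_system_w):
    """Return (per-GPU limit, total GPU draw, total system draw) in watts."""
    per_gpu_w = round(gpu_tdp_w * (1 - limit_fraction))
    gpus_total_w = per_gpu_w * num_gpus
    return per_gpu_w, gpus_total_w, gpus_total_w + rest_of_system_w


PSU_W = 1600  # PSU capacity used in the example above

# Assumption: 350 W stock TDP per 3090, limited by 20%, 400 W for everything else.
per_gpu, gpus_total, system_total = power_budget(
    num_gpus=4, gpu_tdp_w=350, limit_fraction=0.20, rest_of_system_w=400
)

print(f"Per-GPU limit: {per_gpu} W")
print(f"GPUs total:    {gpus_total} W")
print(f"System total:  {system_total} W (PSU: {PSU_W} W)")

# Print the commands to apply the limit; they are not executed here.
for gpu_index in range(4):
    print(f"sudo nvidia-smi -i {gpu_index} -pl {per_gpu}")
```

With these inputs the system lands at 1520 W, which leaves little headroom under the 1600 W PSU, so it is worth re-running the numbers with the actual TDP of your cards.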