Overview of the benchmarked GPUs
Although we tested only a small selection of the available GPUs, we think we covered the GPUs that are currently best suited for deep learning training and development, based on their compute and memory capabilities and their compatibility with current deep learning frameworks, namely PyTorch and TensorFlow.
For an updated version of the benchmarks, see: Deep Learning GPU Benchmark
For reference, the iconic deep learning GPUs GeForce GTX 1080 Ti, RTX 2080 Ti and Tesla V100 are also included to visualize the increase in compute performance over recent years.
GTX 1080TI
Suitable for: Workstations
Launch Date: 2017.03
Architecture: Pascal
VRAM Memory (GB): 11 (GDDR5X)
Cuda Cores: 3584
Tensor Cores: -
Power Consumption (Watt): 250
Memory Bandwidth (GB/s): 484
Geforce RTX 2080TI
Suitable for: Workstations
Launch Date: 2018.09
Architecture: Turing
VRAM Memory (GB): 11 (GDDR6)
Cuda Cores: 4352
Tensor Cores: 544
Power Consumption (Watt): 260
Memory Bandwidth (GB/s): 616
QUADRO RTX 5000
Suitable for: Workstations/Servers
Launch Date: 2018.08
Architecture: Turing
VRAM Memory (GB): 16 (GDDR6)
Cuda Cores: 3072
Tensor Cores: 384
Power Consumption (Watt): 230
Memory Bandwidth (GB/s): 448
Geforce RTX 3090
Suitable for: Workstations/Servers
Launch Date: 2020.09
Architecture: Ampere
VRAM Memory (GB): 24 (GDDR6X)
Cuda Cores: 10496
Tensor Cores: 328
Power Consumption (Watt): 350
Memory Bandwidth (GB/s): 936
RTX A5000
Suitable for: Workstations/Servers
Launch Date: 2021.04
Architecture: Ampere
VRAM Memory (GB): 24 (GDDR6)
Cuda Cores: 8192
Tensor Cores: 256
Power Consumption (Watt): 230
Memory Bandwidth (GB/s): 768
RTX A5500
Suitable for: Workstations/Servers
Launch Date: 2022.03
Architecture: Ampere
VRAM Memory (GB): 24 (GDDR6)
Cuda Cores: 10240
Tensor Cores: 320
Power Consumption (Watt): 230
Memory Bandwidth (GB/s): 768
RTX A6000
Suitable for: Workstations/Servers
Launch Date: 2020.10
Architecture: Ampere
VRAM Memory (GB): 48 (GDDR6)
Cuda Cores: 10752
Tensor Cores: 336
Power Consumption (Watt): 300
Memory Bandwidth (GB/s): 768
Geforce RTX 4090
Suitable for: Workstations
Launch Date: 2022.10
Architecture: Ada Lovelace
VRAM Memory (GB): 24 (GDDR6X)
Cuda Cores: 16384
Tensor Cores: 512
Power Consumption (Watt): 450
Memory Bandwidth (GB/s): 1008
RTX 6000 Ada
Suitable for: Workstations/Servers
Launch Date: 2022.09
Architecture: Ada Lovelace
VRAM Memory (GB): 48 (GDDR6)
Cuda Cores: 18176
Tensor Cores: 568
Power Consumption (Watt): 300
Memory Bandwidth (GB/s): 960
Tesla V100
Suitable for: Servers
Launch Date: 2017.05
Architecture: Volta
VRAM Memory (GB): 16 (HBM2)
Cuda Cores: 5120
Tensor Cores: 640
Power Consumption (Watt): 250
Memory Bandwidth (GB/s): 900
A100
Suitable for: Servers
Launch Date: 2020.05
Architecture: Ampere
VRAM Memory (GB): 40/80 (HBM2)
Cuda Cores: 6912
Tensor Cores: 512
Power Consumption (Watt): 300
Memory Bandwidth (GB/s): 1935 (80 GB PCIe)
H100
Suitable for: Servers
Launch Date: 2022.10
Architecture: Hopper
VRAM Memory (GB): 80 (HBM2e)
Cuda Cores: 14592
Tensor Cores: 456
Power Consumption (Watt): 350
Memory Bandwidth (GB/s): 2000
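To put some of the listed figures side by side, the following minimal Python sketch relates the memory bandwidth of several GPUs to that of the GTX 1080 Ti. The numbers are copied from the spec lists above; the dictionary and function names are illustrative only, and memory bandwidth is just one of several factors that determine training throughput.

```python
# Memory bandwidth in GB/s, copied from the spec lists above.
MEMORY_BANDWIDTH_GBS = {
    "GTX 1080 Ti": 484,
    "RTX 2080 Ti": 616,
    "Tesla V100": 900,
    "RTX 3090": 936,
    "RTX 4090": 1008,
    "A100 (80 GB PCIe)": 1935,
    "H100": 2000,
}


def relative_bandwidth(baseline="GTX 1080 Ti"):
    """Return each GPU's memory bandwidth as a multiple of the baseline GPU."""
    base = MEMORY_BANDWIDTH_GBS[baseline]
    return {gpu: bandwidth / base for gpu, bandwidth in MEMORY_BANDWIDTH_GBS.items()}


if __name__ == "__main__":
    for gpu, ratio in relative_bandwidth().items():
        print(f"{gpu:<20} {ratio:.2f}x")
```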
The Deep Learning Benchmark
We use the visual recognition model ResNet50 (version 1.5) for our benchmark. As a classic deep learning network whose 50-layer architecture combines convolutional and residual layers, it is still a good network for comparing achievable deep learning performance. Because it is used in many benchmarks, a close to optimal implementation is available, which drives the GPU to maximum performance and shows where the performance limits of the devices are.
The GPUs were compared using synthetic random image data to avoid the influence of external factors such as the type of dataset storage (SSD or HDD), the data loader and the data format.
Regarding the setup, two points are important. The first is XLA (Accelerated Linear Algebra), a TensorFlow performance feature that was declared stable a while ago but is still turned off by default. XLA optimizes the network graph by dynamically compiling parts of the network into kernels optimized for the specific device. This can yield performance benefits of 10% to 30% compared to the statically crafted TensorFlow kernels for the different layer types. The feature can be turned on with a simple option or environment flag to maximize execution performance.
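As a hedged illustration of the environment flag route: TensorFlow reads XLA options from the TF_XLA_FLAGS environment variable, and the value --tf_xla_auto_jit=2 requests automatic clustering. The helper name below is hypothetical, and the benchmark scripts may enable XLA in a different way.

```python
import os


def enable_xla_auto_jit(env=os.environ):
    """Add the XLA auto-clustering flag to TF_XLA_FLAGS, keeping existing flags."""
    flag = "--tf_xla_auto_jit=2"
    existing = env.get("TF_XLA_FLAGS", "")
    if flag not in existing.split():
        env["TF_XLA_FLAGS"] = (existing + " " + flag).strip()
    return env["TF_XLA_FLAGS"]


if __name__ == "__main__":
    # Set the flag before TensorFlow starts, e.g. at the very top of the script.
    print(enable_xla_auto_jit())
    # In-code alternative once TensorFlow is imported:
    #   import tensorflow as tf
    #   tf.config.optimizer.set_jit(True)
```

Setting the variable in the shell before launching the training script has the same effect and leaves the script itself unchanged.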
The second is the use of mixed precision. For inference jobs, a lower floating point precision is the standard way to improve performance. In most training situations, 16-bit float precision can also be applied with negligible loss in training accuracy, and it can speed up training jobs dramatically. Applying 16-bit float precision is not trivial, as the model layers have to be adjusted to use it. Because not all calculation steps should be done at the lower precision, this mixing of different bit widths is referred to as mixed precision.
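Why some steps should stay at higher precision can be shown with a small, framework-independent sketch. It repeatedly applies a small weight update, once with the weight rounded to 16-bit float after every step and once without rounding. The update size, the step count and the function names are illustrative assumptions, and Python's 64-bit float merely stands in for the higher-precision copy of the weights that frameworks typically keep.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def apply_updates(weight, update, steps, half_precision):
    """Add `update` to `weight` `steps` times, optionally storing it as float16."""
    for _ in range(steps):
        weight = weight + update
        if half_precision:
            weight = to_float16(weight)
    return weight


if __name__ == "__main__":
    # Near 1.0, neighbouring float16 values are about 0.001 apart, so an
    # update of 0.0001 is rounded away on every single step.
    print(apply_updates(1.0, 1e-4, 1000, half_precision=True))   # weight stays at 1.0
    print(apply_updates(1.0, 1e-4, 1000, half_precision=False))  # weight grows to about 1.1
```

In TensorFlow's Keras API, the corresponding switch is the global policy mixed_float16 (set with tf.keras.mixed_precision.set_global_policy), which runs layer computations in 16-bit float while keeping the variables in 32-bit float.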
The Python scripts used for the benchmark are available on GitHub here.
The Testing Environment
As AIME offers server and workstation solutions for deep learning tasks, we used our AIME A4000 server and our AIME T600 workstation for the benchmark.
Both are environments built for running multiple high performance GPUs: they provide the power delivery and cooling needed to reach and hold maximum performance, and each GPU runs in a PCIe 4.0 x16 slot directly connected to the CPU.
The technical specs to reproduce our benchmarks are:
For server compatible GPUs: AIME A4000, AMD EPYC 7543 (32 cores), 128 GB ECC RAM
For GPUs only available for workstations: T600, AMD Threadripper Pro 5955WX (16 cores), 128 GB ECC RAM
Using the AIME Machine Learning Container (MLC) management framework with the following setup:
- Ubuntu 20.04
- NVIDIA driver version 520.61.05
- CUDA 11.2
- CUDNN 8.2.0
- Tensorflow 2.9.0 (official build)
As the NVIDIA H100 and the RTX 4090 are not supported by the official TensorFlow build, the following configuration was used for these two GPUs:
- CUDA 11.8
- CUDNN 8.6.0
- Tensorflow 2.9.1 (NVIDIA build)
Single GPU Performance
Each result is the average number of images per second that could be trained while running 50 steps at the specified batch size. The average of three runs was taken; the start temperature of all GPUs was below 50° Celsius.
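The measurement procedure can be sketched roughly as follows. This is a minimal illustration and not the benchmark script itself: train_step is a placeholder for one training step on one batch, the function names are hypothetical, and the sketch does not exclude any warm-up steps, since the text does not say whether the real benchmark does.

```python
import time
from statistics import mean


def images_per_second(train_step, batch_size, steps=50, timer=time.perf_counter):
    """Run `steps` training steps and return the throughput in images per second."""
    start = timer()
    for _ in range(steps):
        train_step()
    elapsed = timer() - start
    return batch_size * steps / elapsed


def benchmark(train_step, batch_size, runs=3, steps=50, timer=time.perf_counter):
    """Average the throughput over several independent runs."""
    return mean(
        images_per_second(train_step, batch_size, steps, timer) for _ in range(runs)
    )


if __name__ == "__main__":
    # Dummy step that only sleeps; a real step would train on one batch.
    print(f"{benchmark(lambda: time.sleep(0.001), batch_size=64):.1f} images/s")
```

The timer parameter exists only so that the functions can be checked with a simulated clock.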
The GPU speed-up over a 32-core CPU reaches several orders of magnitude here, making GPU computing not only feasible but mandatory for high performance deep learning tasks.
Next are the results using mixed precision.
One can see that the mixed precision option can increase performance up to threefold.
Multi GPU Deep Learning Training Performance
The next level of deep learning performance is to distribute the work and training load across multiple GPUs. The AIME A4000 and the AIME T600 each support up to four server capable GPUs.
Deep learning scales well across multiple GPUs. The method of choice for multi GPU scaling is to spread the batch across the GPUs, so the effective (global) batch size is the sum of the local batch sizes of all GPUs in use. Each GPU calculates the backpropagation for its slice of the batch. The backpropagation results of all GPUs are then summed and averaged, the model weights are adjusted accordingly, and the updated weights have to be distributed back to all GPUs.
Concerning data exchange, communication peaks when the results of a batch are collected and the weights are adjusted before the next batch can be calculated. While the GPUs are calculating a batch, little or no communication happens between them.
In this standard solution for multi GPU scaling, one has to make sure that all GPUs run at the same speed; otherwise the slowest GPU becomes the bottleneck that all other GPUs have to wait for. Therefore, mixing different GPU types is not useful.
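The batch splitting and gradient averaging described above can be sketched in plain Python. This is a toy illustration with assumptions of its own: the GPUs are simulated by a loop, the model is a single weight w in y = w * x with a mean squared error loss, and all names are hypothetical. With equally sized slices, the averaged per-GPU gradient equals the gradient of the whole global batch.

```python
def local_gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x on one batch slice."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


def data_parallel_step(w, xs, ys, num_gpus, learning_rate=0.001):
    """One training step with the global batch split evenly across num_gpus."""
    if len(xs) % num_gpus != 0:
        raise ValueError("global batch must divide evenly across the GPUs")
    local_batch = len(xs) // num_gpus
    # Each (simulated) GPU computes the gradient for its own slice of the batch.
    gradients = [
        local_gradient(w, xs[i:i + local_batch], ys[i:i + local_batch])
        for i in range(0, len(xs), local_batch)
    ]
    # The per-GPU results are summed and averaged ...
    averaged = sum(gradients) / num_gpus
    # ... and the adjusted weight is what every GPU receives for the next batch.
    return w - learning_rate * averaged, averaged


if __name__ == "__main__":
    xs = [float(x) for x in range(1, 9)]   # global batch of 8 samples
    ys = [3.0 * x for x in xs]             # target function y = 3 * x
    new_w, gradient = data_parallel_step(0.0, xs, ys, num_gpus=4)
    print(new_w, gradient, local_gradient(0.0, xs, ys))
```

The averaging line is the only point where the simulated GPUs exchange data, which corresponds to the communication peak between batches.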
The next two graphs show how well the RTX 3090 scales using single precision and mixed precision.
A good, nearly constant scale factor of around 0.9 is reached, meaning that each additional GPU adds around 90% of its theoretical linear performance. A similar scale factor is obtained with mixed precision.
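One way to read this scale factor, stated here as an assumption, is that the throughput of n GPUs is roughly the single GPU throughput times 1 + 0.9 * (n - 1). The short sketch below expresses that reading; the function names are hypothetical, and the throughput values in the example are made-up placeholders, not measured results.

```python
def expected_rate(single_gpu_rate, num_gpus, factor=0.9):
    """Throughput expected if every additional GPU adds `factor` of one GPU."""
    return single_gpu_rate * (1 + factor * (num_gpus - 1))


def scale_factor(single_gpu_rate, multi_gpu_rate, num_gpus):
    """Share of one GPU's throughput contributed by each additional GPU."""
    if num_gpus < 2:
        raise ValueError("scale factor needs at least two GPUs")
    return (multi_gpu_rate / single_gpu_rate - 1) / (num_gpus - 1)


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    print(expected_rate(100.0, num_gpus=4))
    print(scale_factor(100.0, 370.0, num_gpus=4))
```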
Conclusions
Mixed precision can speed up training by more than a factor of 2
A feature definitely worth a look with regard to performance is switching training from float 32 precision to mixed precision. Depending on your constraints, this software adjustment can be a very efficient way to double performance.
Multi GPU scaling is more than feasible
Deep learning performance scales well across multiple GPUs, at least up to 4 GPUs: 2 GPUs can often outperform the next more powerful GPU in terms of price and performance.
Mixing of different GPU types is not useful
Best GPU for Deep Learning?
As in most cases, there is no simple answer to this question. Performance is certainly the most important aspect of a GPU used for deep learning tasks, but it is not the only one.
So it depends heavily on your requirements. Here are our assessments of the most promising deep learning GPUs:
RTX 3090
The RTX 3090 is still the flagship GPU of the RTX Ampere generation. It has an unbeaten price/performance ratio that still holds at the end of the year 2022. Only performance improvements or a price adjustment of the latest GPU generation will change this.
RTX A5000
The little brother of the RTX 3090. Now available in a similar price range as the RTX 3090 and with an impressive performance per watt ratio, the RTX A5000 has become a very interesting alternative to the RTX 3090.
RTX A6000
The bigger brother of the RTX 3090. With its 48 GB GDDR6 memory it is a more future proof version of the RTX 3090 which will also be able to load increasingly larger models.
NVIDIA A100 (40 GB and 80GB)
If the highest performance regardless of price and the highest performance density are needed, the NVIDIA A100 is the first choice: it delivers high end deep learning performance.
Its lower power consumption of 250/300 Watt, compared to the 700 Watt of a dual RTX 3090 setup with comparable performance, reaches a range where under sustained full load the difference in energy and cooling costs might become a factor to consider.
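A rough estimate of that difference can be sketched as follows. The two power figures come from the paragraph above; the assumption of full load around the clock and the electricity price are hypothetical placeholders, and cooling overhead is not included.

```python
HOURS_PER_YEAR = 24 * 365


def energy_kwh(power_watt, hours=HOURS_PER_YEAR):
    """Energy in kWh drawn at a constant power over the given number of hours."""
    return power_watt * hours / 1000


def yearly_cost_difference(watt_a, watt_b, price_per_kwh):
    """Extra yearly energy cost of setup A over setup B at sustained full load."""
    return (energy_kwh(watt_a) - energy_kwh(watt_b)) * price_per_kwh


if __name__ == "__main__":
    # 700 W: dual RTX 3090 setup, 300 W: A100, both figures from the text above.
    extra_kwh = energy_kwh(700) - energy_kwh(300)
    print(f"Extra energy per year: {extra_kwh:.0f} kWh")
    # 0.30 per kWh is a placeholder price, not a quoted tariff.
    print(f"Extra cost per year: {yearly_cost_difference(700, 300, 0.30):.2f}")
```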
Moreover, for solutions that need virtualization to run under a hypervisor, for example cloud rental services, it is currently the best choice for high-end deep learning training tasks.
An eight-GPU NVIDIA A100 setup, as is possible with the AIME A8000, catapults one into the multi petaFLOPS HPC computing range.
RTX 4090
The first available NVIDIA GPU of the Ada Lovelace generation. The first results are promising, but compatibility with current deep learning frameworks is a work in progress. In particular, multi-GPU support is not yet working reliably (December 2022). So currently the RTX 4090 is only recommended for single GPU systems.
NVIDIA H100
The NVIDIA H100 only became available in late 2022, and therefore its integration in deep learning frameworks (TensorFlow / PyTorch) is still lacking. As NVIDIA has promised larger performance improvements with CUDA version 12, the full potential of the H100 has yet to be discovered. We are curious.
For an updated version of the benchmarks, see: Deep Learning GPU Benchmark
Questions or remarks? Please contact us at: hello@aime.info