RTX 3090, 3080, 2080Ti ResNet benchmarks on TensorFlow containers
There’s still a huge shortage of NVidia RTX 3090 and 3080 cards right now (November 2020), and if you work in the AI field you are probably wondering how much better the new cost-efficient 30-series GPUs are than the previous 20-series. With the scarcity of cards comes a shortage of benchmark reports specific to our field, as few people are able to benchmark, and the benchmarks you do find sometimes look weird, in part because framework support for the new cards is still very poor.
The numbers mean images processed per second during mixed precision training.
I also ran AI-Benchmark.com's benchmark from inside the NGC 20.10-tf1 container.
I only tested TensorFlow even though I generally use PyTorch, because at the time I embarked on these benchmarks, PyTorch’s support for 3080 and 3090 cards was nonexistent, and a bit later my own networks performed at a level comparable to running on a 2060 Super GPU. Right now things aren’t so grim for PyTorch nightly’s 30-series performance anymore. I still decided to test only TensorFlow, for two reasons: I thought that somewhere in those benchmarks one can find the relative training performance difference to expect between 2080Ti and 3090/3080 cards in the long run, and I wanted to explore multiple dimensions inside TensorFlow.
One person requested that I also benchmark Waifu2x speed, so I used this cool Waifu2x ncnn Vulkan version to benchmark the speed of smart image upscaling. I can't say whether it uses FP32 or mixed precision; I would guess FP32. I ran this test on a folder of 5000 images like this:
time ./waifu2x-ncnn-vulkan -g 0 -i 512x512_images/ -o o/ -n 2
I then took the "real" time to calculate the time per image at 512x512 resolution and 140 kB average file size. Note that I benchmarked the 1080 Ti on a different CPU, one with fewer cores but much faster single-core and few-core performance: i7-9700k vs i9-7980XE for the 3080/3090. So here are the rates taken from the "real" time output:
|1080 Ti||4.9||i7-9700k||Browsing and stuff.|
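As a minimal sketch of how a rate can be derived from the "real" line, assuming bash-style `time` output: the helper names and the sample string below are made up for illustration and are not one of my measurements.

```python
import re


def parse_real_seconds(time_output: str) -> float:
    """Return the 'real' wall-clock time in seconds from bash `time` output."""
    # bash prints e.g. "real\t10m0.000s"; the minutes part is treated as optional.
    match = re.search(r"^real\s+(?:(\d+)m)?([\d.]+)s", time_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'real' line found in the time output")
    minutes = int(match.group(1) or 0)
    seconds = float(match.group(2))
    return minutes * 60 + seconds


def images_per_second(time_output: str, n_images: int) -> float:
    """Throughput over the whole folder, including file loading and saving."""
    return n_images / parse_real_seconds(time_output)


# Hypothetical sample output, for illustration only.
sample = "\nreal\t10m0.000s\nuser\t12m3.456s\nsys\t0m45.678s\n"
rate = images_per_second(sample, n_images=5000)
print(f"{rate:.2f} images/s, {1.0 / rate:.3f} s per image")
```

Because this uses wall-clock time, the rate includes disk I/O and image decoding, not just GPU work.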
Now let's push further from 1024x1024 to 2048x2048:
time ./waifu2x-ncnn-vulkan -g 0 -i o/ -o o2/ -n 2
I also tried running 2 instances at once, but it only made things slightly worse. I later discovered there was an additional parameter, "-t", which sets the tile size (a kind of batching within each image), so I tried raising it and got a slight boost.
time ./waifu2x-ncnn-vulkan -g 0 -i o/ -o o2/ -n 2 -t 512
From what I saw, when it comes to CPU, most of the work was done in 2 threads on the 18-core i9-7980XE. That's why few-core performance matters so much when choosing a CPU for this kind of workload.
NVidia’s TensorFlow Docker containers
Performance differs vastly between Nvidia’s TensorFlow containers, and puzzlingly so. Only with the 20.10 NGC container did TensorFlow start to officially support 3090 and 3080 cards, but much better performance can often be seen on the 20.08 container. This is especially so for the 3080: look at the se-resnext101 or resnext101 benchmark. If, like every other person who has posted AI training benchmarks so far, you restrict yourself to testing a single version, you won’t get much useful information. This is why I’ve benchmarked across 3 to 7 different TensorFlow containers and tested 4 different ResNet variants, one of them in both TensorFlow 1.x and TensorFlow 2.x NGC containers.
All results are for Automatic Mixed Precision training aka FP16 training. Nowadays it doesn’t make sense to train otherwise, at least on Volta-series cards and beyond.
The tests were run under Ubuntu 20.04 with Nvidia driver version 455.28 and native CUDA version 11.1. Of course, containers ship their own CUDA versions, even though nvidia-smi reports the native version when you launch it from inside a container.
For example, in the se-resnext101-32x4d training benchmark on NGC’s TFv1 20.10 container, a batch of 128 trained 2.5x faster than a minibatch of 96 images, and 1.6x faster than even a minibatch 1.5x larger (192 images).
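To make that comparison concrete, here is a small sketch that picks the fastest batch size from a sweep and reports how much slower the others are. The relative speeds are just the ratios quoted above, normalised so that batch 96 equals 1.0; the function names are my own for this illustration.

```python
def best_batch_size(throughput: dict) -> int:
    """Return the batch size with the highest throughput."""
    return max(throughput, key=throughput.get)


def slowdown_vs_best(throughput: dict) -> dict:
    """For each batch size, how many times slower it is than the best one."""
    best = throughput[best_batch_size(throughput)]
    return {batch: best / speed for batch, speed in throughput.items()}


# Relative training speed, normalised to batch 96 = 1.0 (ratios from the text).
relative_speed = {96: 1.0, 128: 2.5, 192: 2.5 / 1.6}

print("best batch size:", best_batch_size(relative_speed))
for batch, factor in sorted(slowdown_vs_best(relative_speed).items()):
    print(f"batch {batch}: {factor:.2f}x slower than the best")
```

The point is that throughput is not monotonic in batch size, so a sweep is needed rather than simply taking the largest batch that fits.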
It’s important to explore how much the 2.4x memory advantage of the 3090 over the 3080 matters, as that is where most of the price premium lies. At similar batch sizes, the 3090 is sometimes not that much faster than the 3080, with both unable to fully satisfy the hunger of roughly ten thousand cores. But give the 3090 a batch size boost and it shines. And the batch size can grow by much more than 2.4x: batch size only changes how much space the activations occupy, while the model and gradient sizes are a fixed memory cost.
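A back-of-the-envelope sketch of why the batch size can grow by more than 2.4x. The fixed cost (4096 MB) and per-sample activation cost (48 MB) are made-up numbers for illustration, not measurements of any network here; only the 10 GB and 24 GB card sizes are real.

```python
def max_batch_size(total_mb: int, fixed_mb: int, per_sample_mb: int) -> int:
    """Largest batch that fits when memory = fixed cost + batch * per-sample cost."""
    usable = total_mb - fixed_mb
    if usable <= 0:
        return 0
    return usable // per_sample_mb


# Hypothetical network: weights, gradients and optimizer state take a fixed
# 4096 MB, and each sample in the batch needs 48 MB of activations.
FIXED_MB = 4096
PER_SAMPLE_MB = 48

batch_3080 = max_batch_size(10 * 1024, FIXED_MB, PER_SAMPLE_MB)
batch_3090 = max_batch_size(24 * 1024, FIXED_MB, PER_SAMPLE_MB)

print("3080 max batch:", batch_3080)
print("3090 max batch:", batch_3090)
print(f"batch ratio: {batch_3090 / batch_3080:.2f}x for 2.4x the memory")
```

The larger the fixed cost relative to the card's memory, the bigger the gap between the memory ratio and the achievable batch size ratio.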
Another parameter I explore is the effect of XLA optimizations on training performance. This is as much to check on the work the TensorFlow team is doing as anything else. For some reason, on the NGC 20.09 TF1 container the RTX 3080/3090 performs worse with XLA optimization enabled. In some configurations, performance was up to 9x lower than would be expected based on neighboring configurations. I’ll alert TensorFlow devs to this.
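One simple way to spot such anomalies in a table of results is to compare each measurement with its immediate neighbours. This is only a sketch of that idea: the function name, the threshold and the throughput values are made up for illustration and are not taken from my results.

```python
from statistics import median


def flag_slow_outliers(throughputs: list, factor: float = 3.0) -> list:
    """Return indices whose value is more than `factor` times lower than
    the median of the immediate neighbours (previous and next entry)."""
    flagged = []
    for i, value in enumerate(throughputs):
        neighbours = throughputs[max(0, i - 1):i] + throughputs[i + 1:i + 2]
        if not neighbours:
            continue  # a single measurement has nothing to compare against
        if value * factor < median(neighbours):
            flagged.append(i)
    return flagged


# Hypothetical images/sec across neighbouring configurations.
sweep = [900.0, 950.0, 105.0, 1000.0, 980.0]
print("suspicious entries at indices:", flag_slow_outliers(sweep))
```

Anything flagged this way is worth re-running before concluding that the configuration itself is slow.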
30-series GPUs have seriously speedy memory, which gives the 3090 a 2x improvement over the 2080Ti in Ethereum mining speed (a memory-bottlenecked algorithm). This is where the bulk of the speedup is. The 3090’s 24 GB also provides a very considerable boost, up to 20%.
Benchmarks were run with the 20.10 version of the code, mostly as it can be found in NVidia’s Deep Learning Examples on GitHub. I did slightly change the ResNet-50 code run from the container’s workspace/nvidia-examples/cnn/resnet.py, though, as NVidia’s example code was restricted to using only a portion, about 80%, of the GPU’s memory at most. This is done in code via
… = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
and should be changed to
… = tf.GPUOptions(per_process_gpu_memory_fraction=0.99)
in this case. Otherwise, you unnecessarily limit your max batch size.
Interestingly, whether I launched this code on i7-4960X or i9-7980XE CPU didn't matter. If anything, most benchmarks were slightly better on i7-4960X. So, don't bother much about upgrading your CPU.
As for the GPU, when it comes to deep-learning-specific maths, the 30 series is only marginally faster than the 20 series. Both have their Tensor Core performance with 32-bit accumulate cut in half compared to unlocked versions of the same chips, such as the RTX Titan and Quadro cards (RTX 6000, A6000, etc.). That is half the rate of FP16-accumulate Tensor Core computation, which NVidia’s own researchers on mixed-precision training deem adequate for inference but not for training. Compute is even more strangled this time: the 2080 Ti’s CUDA core (not Tensor Core) FP16 compute wasn’t cut in half and ran at 2x its FP32 FLOPS, but this time around NVidia has decided to cut it too. So in the AI compute field the improvement is almost marginal, as FP16 compute has as many FLOPS as FP32 compute and only marginally more than that of the 2080 Ti. You can find this in NVidia’s PDF on 30-series specs.
So, as opposed to memory-bottlenecked low-FLOPS training with 1x1 and 3x3 convolutions and the like, if you’re looking for compute-intensive training with large “dense” / “fully-connected” non-convolutional layers, a better option is to buy used RTX Titan cards, now that their price and that of 20-series NVLink are falling rapidly thanks to the 3090. If you can, wait until spring 2021 for more 3090s to saturate demand: RTX Titan’s price should get smashed down to the $1000–$1500 range, as for most other tasks the 3090 is superior.
If you're planning to train large convolutional neural networks, the 3090 is probably even better than the V100, and certainly more cost-effective. Why better? Because with larger memory you can run even larger networks than you can on the V100. And if you're running a large network, even if you can run it at a very small batch size on the V100, you'll be able to run it much, much faster on the 3090, as the batch size won't have to be as tiny.
If you're sure you will only be training small networks and you don't need to cram too much compute on a single computer, 3080 will be more cost-efficient.
At least in the case of NGC containers, for some reason, TensorFlow v2 has crappy performance.
What to buy?
This will depend on your budget and needs. If your GPU budget is in the 1080 Ti - 2080 realm, go for the 2060 Super if 8GB in mixed precision is going to be enough for your needs. From what I previously tested under PyTorch, it's around 30% faster than the 1080 Ti because of the 2060 Super's good native support for mixed-precision training and pretty decent memory speed. My 2060 Super's memory also overclocked well, seemingly up to the speed of a 2080 Super.
If you can only afford 2080 Ti - it's a great card for deep learning.
Things get tricky when you can afford more and have to pay a premium to get a 30-series card. If you need more than 11GB of memory, go for the 3090. Also, don't forget that the 3080 only has 10GB, not 11GB. If your networks don't require a large amount of memory, you'll have to weigh things up. Weigh the costs of the cards, which will differ depending on where you live and the exact situation on your local market, against the ability to plug multiple cards into your rig. As for me personally, I run my GPU server separately from my desktop system. This way I don't have to listen to the noise this hardcore rig makes, and I was able to build the rig in the style of Ethereum miners. I'm using PCIe 3 x16 riser cables instead of plugging cards directly into the motherboard. With 30-series cards, plugging them in directly would not even be a possibility, as they take up a whole 3 PCIe slots in height. And so I have no issue with placement and no issue with heat: I run that server without any computer case, on a bare handmade skeletal wireframe that doesn't trap any heat, in my garage, where I don't have any additional heating.
Gathering and compiling this data took A LOT of time and effort, as did getting my hands on all 3 GPUs, the ones most coveted by neural network training amateurs all around the globe, and for a very good reason. More than a month after its release, the 3080 can still only be bought at a hefty price premium, through sheer luck, or possibly through programmer trickery. As for me, I got mine through a combination of premium and luck: I waited for decent deals on the second-hand market and jumped at chances to get my cards. This took up A LOT of time, so I'm not sure I generally recommend doing the same.