SOLVED: Stumped Benchmarking GPUs

I’ve been running simulations and render tests with TFD and Redshift in C4D and getting results that stump me a bit. I use both my local machine and various remote workstations in the cloud for sims and renders. Because I have a project coming up soon that will require some speed, I’ve been testing various workflows. Perhaps someone can shed light on the results below. I used the same scene and settings as a control and only switched machines as the variable. I have not tried turning off Defender scanning of the cache files. Windows 10 / Windows Server 2016. Timeline realtime update is unchecked/disabled.

SIM is running the simulation to 240 frames.
RSR is the Redshift render of a single frame.

Local - Core i5-8250U, 8GB RAM, GTX 1050 4GB, caching tested both on NVMe and a USB-C Thunderbolt Samsung T5 - same speed
ultrabook with manually set wattage to keep temperature and stability in check
SIM: 240 frames - 4 min
RSR: 800x800 1 frame - 1 min

Server 1 - 8-core Intel Xeon, 30GB RAM, Nvidia V100 (not sure if 16GB or 32GB), caching on persistent SSD in the server
(virtualized - Nvidia GRID)
SIM: 240 frames - 6 min
RSR: 800x800 1 frame - 10 sec

Server 2 - 16-core Intel Xeon, 60GB RAM, Nvidia P100 16GB, caching on persistent SSD in the server
(virtualized - Nvidia GRID)
SIM: 240 frames - 6 min
RSR: 800x800 1 frame - 10-14 sec

Server 3 - 16-core Intel Xeon, 64GB RAM, 4x Nvidia 1080 Ti, caching on persistent SSD in the server
(bare-metal server, no other users; talking with this provider about switching me to a bare-metal 2x P100 with 100GB RAM)
SIM: 240 frames - (1x 1080 Ti) 6 min
RSR: 800x800 1 frame - (4x 1080 Ti) 26 sec

Server 4 - 16-core Intel Xeon, 100GB RAM, Nvidia T4 16GB, caching on persistent SSD in the server
(virtualized - Nvidia GRID)
SIM: 240 frames - 5-6 min
RSR: 800x800 1 frame - 10 sec

Clearly the higher-end GPUs are doing fine in the rendering department. But I’m really stumped as to why the simulations are taking longer or hanging in some spots. I did find a range of keyframes in the project where things start to choke a bit, between frames 90 and 190. I changed some settings in my sim, such as voxel size. That made the bottlenecks/speed better, but they didn’t altogether go away. Doing this also changes the aesthetics of the sim, which isn’t to the project’s liking.

Also, it doesn’t explain why my local GTX 1050 is doing better. The only thing I can think of is that there is some bottleneck - drive, virtualization, or something else - that causes the servers to be slower.

Any ideas on where my blind spots or bottlenecks might be, so that I can plan the production schedule and choose a machine accordingly?

thanks in advance

Interesting, thanks for sharing these results.
Do you still have the log files from the simulations? They contain more detailed timings that can shed some light on where the time is spent. After the sim, make a copy of the file %USERPROFILE%\AppData\Roaming\jawset\turbulence.log and rename it to indicate which test run it belongs to. The files already contain the hardware specs, so a simple machine name is enough.
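In case it’s useful, here is a minimal helper sketch for that step; the machine-name argument and output naming are just my own convention, not anything TFD requires:

```python
# Minimal sketch: copy the TFD log after a sim run and tag the copy with a
# machine name and timestamp. The naming scheme is just a convention.
# Intended for Windows, where %USERPROFILE% expands as expected.
import os
import shutil
import sys
from datetime import datetime

def archive_turbulence_log(machine_name, dest_dir="."):
    log_path = os.path.expandvars(r"%USERPROFILE%\AppData\Roaming\jawset\turbulence.log")
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(dest_dir, f"turbulence_{machine_name}_{stamp}.log")
    shutil.copy2(log_path, dest)
    return dest

if __name__ == "__main__":
    # e.g. python archive_log.py server3
    print(archive_turbulence_log(sys.argv[1] if len(sys.argv) > 1 else "local"))
```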

Looking at the exact timings will be necessary to draw conclusions, but here are some possible reasons for these results.

  • Cache storage takes a big chunk of the per-frame time. All cached grids have to be transferred from GPU memory to CPU memory and on to the SSD. Depending on grid size and sim complexity, the actual sim compute can be less than the transfer. Consequently, making compute 2x faster will not make the per-frame sim time 2x faster. (A rough way to gauge your cache drive’s sustained write speed is sketched after this list.)
  • Virtualization is a potentially big variable. It can affect both GPU and CPU compute as well as bandwidth between all components including the SSD. Typically, this shows as high variance between multiple sim runs. The higher the variance, the more test runs you’ll need on the same setup to get good benchmarks.
  • In TFD, there are parts of the simulation setup that don’t run on the GPU - mostly bridging emitter data from C4D’s scene to the TFD simulation engine. Similar to high storage times, costly C4D scenes can overshadow the net sim compute time as well.
  • The fact that the cloud-based sim times are essentially the same, regardless of the GPU type, indicates that it’s likely storage or C4D processing that is bottlenecking these sims.
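
To gauge the first point, you could time sequential writes to each machine’s cache directory and compare. This is only a rough sketch; the 256 MB chunk size and file count are arbitrary placeholders, not actual TFD cache-frame sizes, and the path is hypothetical:

```python
# Rough sequential-write benchmark for a cache directory. The chunk size and
# file count are arbitrary placeholders, not actual TFD cache-frame sizes.
import os
import time

def write_throughput(cache_dir, chunk_mb=256, files=20):
    os.makedirs(cache_dir, exist_ok=True)
    chunk = os.urandom(chunk_mb * 1024 * 1024)  # incompressible data
    start = time.perf_counter()
    for i in range(files):
        path = os.path.join(cache_dir, f"bench_{i:03d}.tmp")
        with open(path, "wb") as f:
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # make sure the data actually reaches the drive
        os.remove(path)
    return chunk_mb * files / (time.perf_counter() - start)  # MB/s

if __name__ == "__main__":
    cache_dir = r"D:\tfd_cache_bench"  # hypothetical path; point this at your cache drive
    print(f"{write_throughput(cache_dir):.0f} MB/s")
```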

Again, the log files would provide additional information that would answer most of these questions.

This is my thought as well. I’m trying to run the cache on the same drive. (Aside from the bare-metal server, the virtualized machines only give you a 50GB drive to start. Mounting extra drives has me thinking they are elsewhere on the datacenter network.)

I’m going to re-run these tests after first disabling Defender scanning of the cache folder. I will take screenshots of each timing to upload along with the scene file (without cache) and logs. I will skip the render-time benchmarks since we can clearly see the higher-end GPUs are handling that fine. It’s just the simulations I’m trying to work up a solution/benchmark for.
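
For the Defender step, a folder exclusion can be added with the built-in Add-MpPreference PowerShell cmdlet. Here’s a minimal sketch driving it from Python with a placeholder cache path; it needs an elevated (admin) shell:

```python
# Minimal sketch: add a Windows Defender exclusion for the TFD cache folder so
# real-time scanning doesn't touch cache writes. Requires an elevated (admin) shell.
import subprocess

cache_dir = r"D:\tfd_cache"  # placeholder; substitute your actual cache path
subprocess.run(
    ["powershell", "-Command", f"Add-MpPreference -ExclusionPath '{cache_dir}'"],
    check=True,
)
```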

When I changed the voxel size from 0.65cm to 1cm, the bottlenecking was significantly reduced, but of course that changes the aesthetics. The bottlenecking happens every time in the 90 to 190 frame range, which is the area with some complexity. So I have a feeling it has to do with the virtual machines not handling the stack well in those frames.

Sounds like a storage bottleneck indeed - we’ll see.
An important storage-related optimization is to cache only the channels you need as shader inputs. Velocity caching in particular is costly since it consists of three grids. Unless you use Fluid Motion Blur or Velocity Displacement, you don’t need Velocity at render time. Often, only one or two channels (density and temperature or burn) are used for shading, but up to four channels are simulated.
Select the cached channels in the Container/Cache group.
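
To put numbers on that, here’s a back-of-envelope estimate of per-frame cache size; the container resolution and uncompressed 32-bit floats are illustrative assumptions, not the actual TFD cache format:

```python
# Back-of-envelope per-frame cache size. Resolution and float size are
# illustrative assumptions, not the actual TFD cache format.
res = (300, 300, 300)      # assumed container resolution in voxels
voxels = res[0] * res[1] * res[2]
bytes_per_voxel = 4        # assume uncompressed 32-bit floats

grids_all = 2 + 3          # density + temperature + velocity (three grids)
grids_min = 2              # density + temperature only

def mb_per_frame(grids):
    return voxels * bytes_per_voxel * grids / 1024**2

print(f"all channels:      {mb_per_frame(grids_all):.0f} MB/frame")
print(f"density+temp only: {mb_per_frame(grids_min):.0f} MB/frame")
```

In this example, dropping Velocity alone cuts what has to cross from GPU memory to the drive every frame by more than half.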

Disabled velocity caching - temp/density only
Disabled Defender scanning of cache folders
No change to project file; used SSD/OS drive on the remote machine

Benchmark 1
Local GTX 1050 4GB - Core i5-8250U - 8GB RAM - USB-C drive for cache
SIM: 2min14sec

Benchmark 2
Virtualized Nvidia Tesla T4 16GB - 16-thread Intel Xeon - 100GB RAM
SIM: 5min18sec

Benchmark 3
Bare metal server
4x 1080 Ti 11GB (only one used for sim) - 2x Intel Xeon E5-2609 v4 1.7 GHz - 64GB RAM
*doesn’t provide OpenGL feedback in C4D, but fine for sims and renders; working with the provider to switch to 2x P100 with GPU passthrough support for remote work
SIM: 3min52sec

Logs, project file, and screenshots linked here (620.0 KB)

Making the change you suggested burned right through the bottleneck on the local machine.

Great, thanks.

Benchmarks 2 and 3 are clearly storage bound.
The virtualized server from Benchmark 2 is storing to a network drive. You can see the path \\tsclient/E\benchmark\local-1050\1050-cache in the log file. As a result, storing every 3rd frame takes ~4 sec, or ~95% of the total frame time. It’s only every 3rd frame because that is when the write-cache is full and the pipeline has to wait until previous frames have been written.
Compare that to Benchmark 3, where storing every 6th frame takes ~0.6 sec, or ~50% of the total frame time. Much better, though still storage bound.
And finally, Benchmark 1 has no such catch-up frames at all. Storage times remain mostly around 0.04 sec, or ~10% of the total frame time. That means the storage system can drain the write-cache fast enough to avoid congestion. It would probably get those catch-up frames too if you cached more grids.

In conclusion, make sure to use a server that has a local SSD. In datacenter lingo, “dedicated” does not mean local. If in doubt, you can let TFD sim just one frame to cache and then check the log file. Search for “Using cache directory”. If the path next to it begins with \\, it’s not a local drive.
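
If you want to automate that check, here’s a small sketch that scans the log for that line and flags UNC paths; the exact layout of the log line is an assumption, so adjust the parsing if yours differs:

```python
# Scan turbulence.log for the cache directory and flag UNC (network) paths.
# The exact layout of the "Using cache directory" line is assumed here; adjust
# the parsing if your log formats it differently.
import os

log_path = os.path.expandvars(r"%USERPROFILE%\AppData\Roaming\jawset\turbulence.log")
with open(log_path, encoding="utf-8", errors="replace") as f:
    for line in f:
        if "Using cache directory" in line:
            path = line.split("Using cache directory", 1)[1].strip(' :\t\r\n"')
            kind = "network (UNC) path - not local" if path.startswith("\\\\") else "local path"
            print(f"{path} -> {kind}")
```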

So I was finally able to temporarily attach a local SSD of the machine the VM runs on, just for cache purposes. This brought Benchmark 2 to just under Benchmark 1, at 1min58sec.

Going to keep looking through my server documentation about optimizing this. :slight_smile:

Log and screenshots attached: t4-2.zip (154.3 KB)