2022-05-26 GPU, optimization

Increasing Gromacs Throughput with CUDA MPS

Summary: Running multiple copies of Gromacs on a GPU can roughly double the throughput obtainable from a single copy (for the case considered), although one copy of Amber is nearly as fast, and there is considerable run-to-run variation.

A user is interested in large-scale molecular dynamics computation involving very many cases, and in speeding that up as much as possible with either Gromacs or Amber, partly on a system with Summit-like IBM AC922 compute nodes using NVIDIA V100 GPUs (and their proprietary software, sigh). They were looking at an NVIDIA tutorial demonstrating speed-up with the CUDA Multi-Process Service (MPS) and MPI, running multiple processes per GPU on multiple GPUs. However, Gromacs has been found not to scale well to multiple GPUs, at least for the sort of systems of interest; the best throughput is from single-GPU runs, so it was worth investigating running multiple processes per GPU under MPS. (MPS seems to be intended and advertised for use with MPI, but it can be used for multiple independent processes.)

Their system of interest is “~450 k atoms that contains proteins, lipid membrane, and solvent”, with many runs to be done from the same starting coordinates and different starting velocities. The runs are thus expected to take similar times, and so probably don’t require any fancier scheduling for multiple processes on a GPU than simply starting several together.

There doesn’t seem to be much published with results like these, and it’s not clear how general they are in terms of the molecular system and the computer system involved, but it seems worth at least presenting some data.

Amber

Since Amber was reported to be faster than Gromacs on a relevant example, I looked at that initially. This was with the installed version, described as an Amber 20 version ‘modified for large systems’. I don’t know in what way it was modified, and it was slightly worrying that it reports floating point underflow errors with the input I was given, but I’ve assumed that’s not relevant for scaling tests, at least.

The single-GPU jobs of interest are constrained to ¼ of the resources of the 32-core, 4-GPU nodes. For convenience I started multiple independent processes with mpirun, though they don’t use the MPI library; the environment module involved forces openmpi (fine). In general it’s important not to have each instance confined to a single core (although Amber seems to be single-threaded on the CPU), so multiple processes were started with a permissive binding like the following (perhaps -bind-to none would be better):

nvidia-cuda-mps-control -d
  ...
mpirun -bind-to socket -n 4 ./amber.sh

where the script takes care of running the pmemd.cuda binary in a separate sub-directory per rank for the output files. Each instance ran 2000 steps to get a reasonable run time.
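For concreteness, the per-rank wrapper was something like the sketch below. The rank variable, the sub-directory naming, and the Amber input file names (md.in, prmtop, inpcrd) are illustrative assumptions, not the actual script:

#!/bin/bash
# Hypothetical per-rank wrapper for pmemd.cuda: each rank writes its output
# files into its own sub-directory so the copies don't clobber each other.
rank=${OMPI_COMM_WORLD_RANK:-0}        # set by openmpi's mpirun
mkdir -p rank$rank && cd rank$rank
exec pmemd.cuda -O -i ../md.in -p ../prmtop -c ../inpcrd \
     -o mdout -r restrt -x mdcrd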

A single process per GPU ran at 35.7 ns/day (varying by a few percent in different jobs).

When running more processes on the GPU, the sum of the throughput from all the processes was very close to the single-process value up to the maximum of six tried, e.g. 9.0 ns/day per process in the four-process case. Amber obviously makes efficient use of the GPU on its own, and there’s nothing to be gained from using MPS, so I looked at improvements for Gromacs instead.

Gromacs

Profiling

It’s obviously sensible to get some basic numbers initially for an example run.

Memory usage when running multiple processes isn’t an issue in this case. A run only used ~500 MB of the GPU’s 32 GB (as observed with nvidia-smi -l).
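The observation was with plain nvidia-smi -l; for the record, a slightly more targeted query might look like the line below (the query fields are standard nvidia-smi options, and the 5 s interval is arbitrary):

nvidia-smi -l 5 --query-gpu=memory.used,utilization.gpu --format=csv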

Here’s a sample timing summary from md.log, suggesting that the GPU, at least, isn’t saturated. When looking at overheads, like I/O, it’s worth knowing that this run was for 4 ps of simulation while the production length is intended to be ~50 ps.

On 1 MPI rank, each using 8 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1    8         21       0.708          2.899   4.5
 Launch GPU ops.        1    8       2001       0.309          1.266   2.0
 Force                  1    8       2001       0.112          0.457   0.7
 Wait PME GPU gather    1    8       2001       1.406          5.760   8.9
 Wait Bonded GPU        1    8         21       0.000          0.001   0.0
 Reduce GPU PME F       1    8       2001       0.636          2.605   4.0
 Wait GPU NB local                              1.399          5.729   8.9
 NB X/F buffer ops.     1    8       3981       1.999          8.188  12.7
 Write traj.            1    8          3       1.632          6.685  10.4
 Update                 1    8       2001       0.660          2.703   4.2
 Constraints            1    8       2003       2.704         11.075  17.2
 Rest                                           4.153         17.013  26.4
-----------------------------------------------------------------------------
 Total                                         15.718         64.382 100.0
-----------------------------------------------------------------------------

A sample nsys profile (below) does suggest scope for improvement with multiple processes. The GPU isn’t more than ~50% busy and the subsidiary threads show gaps between CPU activity spikes. There’s also a long setup hiatus for them, but this run is a lot shorter than a production one.

Profile for a four-thread test run generated with ‘Nsight Systems’ nsys profile ....


Expansion of the profile above around one of the CPU gaps in the bottom threads (associated with FFT setup according to the backtrace).
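For reference, an nsys invocation along these lines would produce such a profile; the output name, trace selection, and mdrun arguments here are illustrative guesses rather than the exact command used:

nsys profile -o gmx-4threads --trace=cuda,nvtx,osrt \
    gmx mdrun -bonded gpu -ntomp 4 -nsteps 2000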


Results

The Gromacs version used was the one installed on the system, version 2021.5, built with CUDA and without the MPI library, but otherwise with no details available about the build.

The recommendation for running it on the system is to use

gmx mdrun -pin on -ntmpi 1 -ntomp 32 -nb gpu -bonded gpu -pme gpu

but only -bonded was useful with gpu, and pinning wasn’t a help, at least with multiple processes. (It turns out that using -nstlist 400 provides a ~10% improvement, but one hopes it won’t affect the scaling tests.) So in the end, the script to run each rank in a separate directory, as above, used

gmx mdrun -bonded gpu -ntomp $OMP_NUM_THREADS ...

where OMP_NUM_THREADS was varied appropriately.
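Analogously to the Amber case, the per-rank script looked roughly like the sketch below; the rank environment variables, the topol.tpr input name, and the -deffnm output prefix are illustrative assumptions:

#!/bin/bash
# Hypothetical per-rank wrapper for gmx mdrun: each rank runs in its own
# sub-directory so logs and trajectories from the copies don't collide.
rank=${MV2_COMM_WORLD_RANK:-${OMPI_COMM_WORLD_RANK:-0}}   # mvapich2 or openmpi rank
mkdir -p rank$rank && cd rank$rank
exec gmx mdrun -bonded gpu -ntomp $OMP_NUM_THREADS \
     -s ../topol.tpr -nsteps 2000 -deffnm md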

The environment module for Gromacs forces the use of mvapich2; again the binary doesn’t use MPI, but the same mpirun invocation as for Amber happens to work for running multiple processes.

2000 steps were run in each case, to give a reasonable runtime of 10–20 s. Eight cores are available for each single-GPU run (32 in total, split between four GPUs).

Although it’s not immediately obvious from the profile, at least the case in question (the same as the Amber one) has something of a CPU bottleneck, and using the four SMT threads on each core helps a bit. Pinning isn’t an obvious improvement here, and hurts with multiple processes. See the following table for a single process.

Gromacs performance (ns/day) with varying number of OpenMP threads (T) and pinning, for a single process on a V100 GPU
          T=2   T=4   T=8   T=16  T=32
-pin off  10.9  16.7  18.2  18.4  22.9
-pin on   10.8  15.8  20.9  18.2  21.1

Turning to multiple processes with multiple threads, the following table shows that it is possible basically to double the performance of Gromacs by running multiple processes under MPS. However, the optimum is unclear, and there is variability at the ~10% level from run to run. Compare the initial T=4–6 results for four and six processes with the results at the bottom of the table; the latter were from a separate run to investigate scaling with more threads, as performance initially seemed still to be improving at six threads. Perhaps better statistics are called for.

Gromacs performance (ns/day) with varying number of threads (T) and processes on a V100. The bottom rows are from a separate job to investigate the continued increase with more threads.
procs  T=1   T=2   T=3   T=4   T=5   T=6
2      11.2  20.6  27.7  37.3  40.6  33.6
4      17.1  28.4  31.8  37.4  40.3  43.1
6      18.3  34.7  39.0  39.0  42.6  44.5
8      20.8  29.8  34.3  40.3  39.8  38.1
16     21.5  31.2  28.5  24.9  21.9  20.5

procs  T=4   T=5   T=6   T=7   T=8
4      34.8  34.6  35.2  36.3  36.9
6      39.9  39.9  40.3  40.6  37.4

Six processes with six threads each is perhaps a reasonable recommendation, although it is somewhat surprising that so many threads in total (36) are worthwhile on the 32 SMT threads of the eight cores allocated to one GPU.

There aren’t obvious significant periods in the profile where overlapping CPU work in one process with GPU work in another would be particularly useful. However, it seemed worth trying without the processes in lock-step. The script started for each MPI rank was modified to sleep for an extra second per rank, running for 5000 steps instead of 1000 to reduce edge effects (a sketch of the change follows the table below). The effect with four processes and eight threads is shown below. Another run with four processes and four threads gave totals of 36.9 and 37.0 ns/day synced and staggered respectively. Doing this doesn’t seem to make a significant difference, and probably there is enough jitter between the processes for different cases in practice to avoid any inefficiency from synchronization.

Effect of staggering the start times of each of four processes with eight threads: performance (ns/day) per rank and total (two different runs)
             rank 1  rank 2  rank 3  rank 4  total
synced          9.6     9.6     9.3     9.3   37.8
1s stagger      9.9     9.4     9.7     9.7   38.7

synced          9.9     9.8     9.8     9.8   39.3
1s stagger      9.7     9.8    10.0    10.2   39.7
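The staggering itself is trivial to add to the per-rank wrapper; a minimal sketch, reusing the hypothetical rank variable and file names from the wrapper sketch above, is:

#!/bin/bash
# Staggered-start variant of the per-rank wrapper: rank 0 starts at once,
# rank 1 after 1 s, rank 2 after 2 s, and so on.
rank=${MV2_COMM_WORLD_RANK:-${OMPI_COMM_WORLD_RANK:-0}}
sleep "$rank"
mkdir -p rank$rank && cd rank$rank
exec gmx mdrun -bonded gpu -ntomp $OMP_NUM_THREADS \
     -s ../topol.tpr -nsteps 5000 -deffnm md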

T4 GPUs

As well as the V100s, the system in question has nodes with T4 GPUs (rated at 8.1 TFLOPS SP and 70 W, c.f. the V100’s 15.7 TFLOPS SP and 300 W) for machine learning inference, and 40 cores compared with the 32 of the V100 nodes. Since they’re often unused, it’s worth seeing what performance they can provide.

With Amber, the performance in one run on a T4 was 10.7 ns/day, or < ⅓ of the V100’s.

The results for Gromacs as a function of number of processes and threads are tabulated below.

Gromacs performance (ns/day) with varying number of threads (T) and processes on a T4 GPU
procs  T=4   T=8   T=10  T=20
1      10.2  11.0  11.0  11.1
2      15.2  14.3  14.6  14.7
3      16.4  15.7  15.9  15.5
4      16.8  16.0  16.4  16.0
5      16.6  16.0  16.5  16.0
6      16.4  16.6  16.2  16.6

It may or may not be worth tying up these GPUs for a while, getting in this case ~⅓ the performance of the V100s, while using < ⅓ of the power (according to the published power envelopes for the two types — not measured).

MPS Issues

MPS caused problems when run interactively for tests on the login node. We’ve seen instances of it getting stuck, blocking others’ access to the GPUs, and in some cases spinning at 100% of a core and coming back in the same state after kill -9 from the user. SLURM cgroups do at least seem to be effective in killing it at the end of a batch job. (Although SLURM can start MPS, that is limited to a single GPU per node and isn’t configured on the system.)

We have security worries about MPS which may or may not be justified. The implications of one user being able to use an MPS daemon started by another, at least on the login node for testing, are unclear. Also, it uses fixed names in /tmp for the world-writable sockets it creates.
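One partial mitigation, at least inside batch jobs, is to point MPS at per-job pipe and log directories instead of the fixed /tmp defaults, and to shut the daemon down explicitly. A rough sketch follows; the directory layout and the gromacs.sh wrapper name are assumptions, while CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY are standard MPS environment variables:

#!/bin/bash
# Run MPS with per-user, per-job directories inside a SLURM batch job (sketch).
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-$USER-$SLURM_JOB_ID/pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-$USER-$SLURM_JOB_ID/log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

nvidia-cuda-mps-control -d                 # start the control daemon

mpirun -bind-to socket -n 6 ./gromacs.sh   # per-rank wrapper, as above

echo quit | nvidia-cuda-mps-control        # stop the daemon at the end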