2022-05-26
Increasing Gromacs Throughput with CUDA MPS
Summary: Running multiple copies of Gromacs concurrently on one GPU can roughly double the throughput obtainable from a single copy (for the case considered), although a single copy of Amber gives nearly the same throughput, and there is considerable run-to-run variation.
A user is interested in large-scale molecular dynamics computation involving very many cases, and in speeding that up as much as possible with either Gromacs or Amber, partly on a system with Summit-like IBM AC922 compute nodes using NVIDIA V100 GPUs (and their proprietary software, sigh). They were looking at an NVIDIA tutorial demonstrating speed-up with the CUDA multi-process service (MPS) and MPI running multiple processes per GPU on multiple GPUs. However, Gromacs has been found not to scale so well to multiple GPUs, at least for the sort of systems of interest; the best throughput is from single-GPU runs, and it was worth investigating running multiple processes per GPU under MPS. MPS seems to be intended and advertised for use with MPI, but it can be used for multiple independent processes.
Their system of interest is “~450 k atoms that contains proteins, lipid membrane, and solvent” with many runs to be done from the same starting coordinates and different starting velocities. They are thus expected to take similar times, and so probably don’t require any fancier scheduling for multiple processes on a GPU than starting several together.
There doesn’t seem to be much published with results like these, and it’s not clear how general they are in terms of the molecular system and the computer system involved, but it seems worth at least presenting some data.
Amber
Since Amber was reported to be faster than Gromacs on a relevant example, I looked at that initially. This was with the installed version, described as an Amber 20 version ‘modified for large systems’. I don’t know in what way it was modified, and it was slightly worrying that it reports floating point underflow errors with the input I was given, but I’ve assumed that’s not relevant for scaling tests, at least.
The single-GPU jobs of interest are constrained to ¼ of the resources of the 32-core, 4-GPU nodes. For convenience I started multiple independent processes with mpirun, though they don't use the MPI library. The environment module involved forces openmpi (fine). In general it's important not to have each instance confined to a single core (although Amber seems to be single-threaded on the CPU), so multiple processes were started with a permissive binding like the following (perhaps -bind-to none would be better):
nvidia-cuda-mps-control -d
...
mpirun -bind-to socket -n 4 ./amber.sh
where the script takes care of running the pmemd.cuda binary in a separate sub-directory per rank for the output files. Each instance ran 2000 steps to get a reasonable run time.
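For concreteness, here is a sketch of what such a wrapper might look like; the use of Open MPI's OMPI_COMM_WORLD_RANK and the Amber input file names are assumptions for illustration, not the script actually used.

```bash
#!/bin/bash
# Hypothetical amber.sh: give each rank its own directory so that output
# files from concurrent pmemd.cuda instances don't collide.
rank=${OMPI_COMM_WORLD_RANK:-0}
mkdir -p "run_$rank"
cd "run_$rank" || exit 1
# Input file names are placeholders; -O overwrites existing output files.
exec pmemd.cuda -O -i ../mdin -p ../prmtop -c ../inpcrd \
     -o mdout -r restrt -x mdcrd
```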
A single process per GPU ran at 35.7 ns/day (varying by a few percent in different jobs). When running more processes on the GPU, the sum of the throughput from all the processes was very close to the single-process value up to the maximum of six tried, e.g. 9.0 ns/day per process in the four-process case. Amber obviously makes efficient use of the GPU, and there's nothing to be gained from using MPS, so I looked at improvements for Gromacs.
Gromacs
Profiling
It’s obviously sensible to get some basic numbers initially for an example run.
Memory usage when running multiple processes isn't an issue in this case. A run only used ~500 MB of the GPU's 32 GB (as observed with nvidia-smi -l).
Here's a sample timing summary from md.log, suggesting the GPU at least isn't saturated. When looking at overheads, like I/O, it's worth knowing that this run was for 4 ps of simulation while the production length is intended to be ~50 ps.
```
On 1 MPI rank, each using 8 OpenMP threads

 Computing:            Num   Num      Call    Wall time     Giga-Cycles
                       Ranks Threads  Count      (s)       total sum    %
-----------------------------------------------------------------------------
 Neighbor search          1    8        21       0.708          2.899   4.5
 Launch GPU ops.          1    8      2001       0.309          1.266   2.0
 Force                    1    8      2001       0.112          0.457   0.7
 Wait PME GPU gather      1    8      2001       1.406          5.760   8.9
 Wait Bonded GPU          1    8        21       0.000          0.001   0.0
 Reduce GPU PME F         1    8      2001       0.636          2.605   4.0
 Wait GPU NB local                              1.399          5.729   8.9
 NB X/F buffer ops.       1    8      3981       1.999          8.188  12.7
 Write traj.              1    8         3       1.632          6.685  10.4
 Update                   1    8      2001       0.660          2.703   4.2
 Constraints              1    8      2003       2.704         11.075  17.2
 Rest                                           4.153         17.013  26.4
-----------------------------------------------------------------------------
 Total                                         15.718         64.382 100.0
-----------------------------------------------------------------------------
```
A sample nsys
profile (below) does suggest scope for improvement with
multiple processes. The GPU isn’t more than ~50% busy and the subsidiary
threads show gaps between CPU activity spikes. There’s also a long setup
hiatus for them, but this run is a lot shorter than a production one.
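For reference, a profile like that can be collected with Nsight Systems along roughly the following lines; the trace selection, output name, and mdrun options here are illustrative rather than a record of the command actually used.

```bash
# Hypothetical nsys invocation for a short profiling run.
nsys profile -t cuda,nvtx,osrt -o gmx-profile \
    gmx mdrun -bonded gpu -ntomp 8 -s topol.tpr
```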
Results
The Gromacs version used was the one installed on the system, version 2021.5, with CUDA, and without the MPI library, but otherwise with no details on the build.
The recommendation for running it on the system is to use
gmx mdrun -pin on -ntmpi 1 -ntomp 32 -nb gpu -bonded gpu -pme gpu
but only -bonded was useful with gpu, and pinning wasn't a help, at least with multiple processes. So in the end, the script to run each rank in a separate directory, as above, used
gmx mdrun -bonded gpu -ntomp $OMP_NUM_THREADS ...
where OMP_NUM_THREADS was varied appropriately. (It turns out that using -nstlist 400 provides ~10% improvement, but one hopes it won't affect the scaling tests.)
The environment module for Gromacs forces use of mvapich2, though again the
binary doesn’t use MPI, but the same mpirun
invocation as for Amber
happens to work to run multiple processes.
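For concreteness, a sketch of the per-rank Gromacs wrapper follows; the name gromacs.sh, the topol.tpr run input, and the use of Open MPI's OMPI_COMM_WORLD_RANK are assumptions for illustration.

```bash
#!/bin/bash
# Hypothetical gromacs.sh: run each rank's mdrun in its own directory so
# that the output files of concurrent instances don't collide.
rank=${OMPI_COMM_WORLD_RANK:-0}
mkdir -p "run_$rank"
cd "run_$rank" || exit 1
exec gmx mdrun -bonded gpu -ntomp "${OMP_NUM_THREADS:-8}" -s ../topol.tpr
```

It is launched with the same sort of mpirun line as for Amber, e.g. mpirun -bind-to socket -n 4 ./gromacs.sh, with the MPS daemon running.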
2000 steps were run in each case to take a reasonable runtime of 10–20 s. Eight cores are available for each single-GPU run (32 total split between four GPUs).
Although it's not immediately obvious from the profile, at least the case in question (the same as the Amber one) has something of a CPU bottleneck, and using the four SMT threads on each core helps a bit. Pinning isn't an obvious improvement here, and hurts with multiple processes. The following table gives the throughput (ns/day) for a single process as a function of the number of OpenMP threads T.
| | T=2 | T=4 | T=8 | T=16 | T=32 |
|---|---|---|---|---|---|
| -pin off | 10.9 | 16.7 | 18.2 | 18.4 | 22.9 |
| -pin on | 10.8 | 15.8 | 20.9 | 18.2 | 21.1 |
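The sort of sweep behind a table like this might look like the following sketch; the directory layout, the topol.tpr input, and the thread list are illustrative rather than a record of the actual runs.

```bash
#!/bin/bash
# Illustrative single-process sweep over OpenMP thread counts and pinning.
for t in 2 4 8 16 32; do
    for pin in off on; do
        dir="single_pin-${pin}_T$t"
        mkdir -p "$dir"
        (cd "$dir" && gmx mdrun -pin $pin -bonded gpu -ntomp $t -s ../topol.tpr)
    done
done
```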
Turning to multiple processes, each with multiple threads, the following table of total throughput (ns/day) summed over all processes shows that it is possible basically to double the performance of Gromacs with multiple processes under MPS. However, the optimum is unclear, and there is run-to-run variability at the ~10% level. Compare the initial T=4–6 results for four and six processes with the results at the bottom of the table; those were from a separate run to investigate scaling with more threads, as performance initially seemed still to be improving at six threads. Perhaps better statistics are called for.
| procs | T=1 | T=2 | T=3 | T=4 | T=5 | T=6 | T=7 | T=8 |
|---|---|---|---|---|---|---|---|---|
| 2 | 11.2 | 20.6 | 27.7 | 37.3 | 40.6 | 33.6 | | |
| 4 | 17.1 | 28.4 | 31.8 | 37.4 | 40.3 | 43.1 | | |
| 6 | 18.3 | 34.7 | 39.0 | 39.0 | 42.6 | 44.5 | | |
| 8 | 20.8 | 29.8 | 34.3 | 40.3 | 39.8 | 38.1 | | |
| 16 | 21.5 | 31.2 | 28.5 | 24.9 | 21.9 | 20.5 | | |
| 4 (second run) | | | | 34.8 | 34.6 | 35.2 | 36.3 | 36.9 |
| 6 (second run) | | | | 39.9 | 39.9 | 40.3 | 40.6 | 37.4 |
Perhaps six processes and six threads is a reasonable recommendation, although it is perhaps surprising that so many threads in total (36) are worthwhile on the 32 SMT threads of the eight cores allocated to one GPU.
There aren't obvious significant periods in the profile where overlapping CPU work in one process with GPU work in another would be particularly useful. However, it seemed worth trying without the processes in lock-step. The script started for each MPI rank was modified to sleep for an extra second per rank, running for 5000 steps instead of 1000 to reduce edge effects. The effect with four processes and eight threads is shown below (per-process and total throughput in ns/day). Another run with four processes and four threads gave totals of 36.9 and 37.0 synced and staggered respectively. Doing this doesn't seem to make a significant difference, and probably there is enough jitter between the processes for different cases in practice to avoid any inefficiency from synchronization.
| | process 1 | process 2 | process 3 | process 4 | total |
|---|---|---|---|---|---|
| synced | 9.6 | 9.6 | 9.3 | 9.3 | 37.8 |
| 1 s stagger | 9.9 | 9.4 | 9.7 | 9.7 | 38.7 |
| synced | 9.9 | 9.8 | 9.8 | 9.8 | 39.3 |
| 1 s stagger | 9.7 | 9.8 | 10.0 | 10.2 | 39.7 |
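The stagger itself only needs a small change to the per-rank wrapper, along these lines (again assuming Open MPI's OMPI_COMM_WORLD_RANK):

```bash
#!/bin/bash
# Hypothetical staggered version of the per-rank wrapper: delay each rank's
# start by one second per rank so the processes are not in lock-step.
rank=${OMPI_COMM_WORLD_RANK:-0}
sleep "$rank"
# ...then make the per-rank directory and run gmx mdrun as before.
```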
T4 GPUs
As well as the V100s, the system in question has nodes with T4 GPUs for machine learning inference (rated at 8.1 TFLOPS SP and 70 W, cf. the V100's 15.7 TFLOPS SP and 300 W), and 40 cores compared with the 32 of the V100 nodes. Since they're often unused, it's worth seeing what performance they can provide.
With Amber, the performance in one run on a T4 was 10.7 ns/day, or < ⅓ of the V100’s.
The results for Gromacs (total throughput in ns/day) as a function of the number of processes and threads are tabulated below.
| procs | T=4 | T=8 | T=10 | T=20 |
|---|---|---|---|---|
| 1 | 10.2 | 11.0 | 11.0 | 11.1 |
| 2 | 15.2 | 14.3 | 14.6 | 14.7 |
| 3 | 16.4 | 15.7 | 15.9 | 15.5 |
| 4 | 16.8 | 16.0 | 16.4 | 16.0 |
| 5 | 16.6 | 16.0 | 16.5 | 16.0 |
| 6 | 16.4 | 16.6 | 16.2 | 16.6 |
It may or may not be worth tying up these GPUs for a while, getting in this case ~⅓ the performance of the V100s, while using < ⅓ of the power (according to the published power envelopes for the two types — not measured).
MPS Issues
MPS caused problems when run interactively for tests on the login node. We’ve
seen instances stuck, blocking others’ access to the GPUs, and in some cases
using 100% of a core and restarting in the same condition after
kill -9
from the user. SLURM cgroups do at least seem to be effective
in killing it at the end of a batch job. (Although SLURM can start MPS,
that’s limited to a single GPU per node and isn’t configured on the system.)
We have security worries about MPS which may or may not be justified. The
implications of one user being able to use a process started by another, at
least on the login node for testing, are unclear. Also it uses fixed names in
/tmp
for the world-writable sockets it creates.
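As a sketch of keeping MPS contained within a batch job (not something that has been tested here), the daemon can be started at the beginning of the job and told to quit at the end, and the CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY environment variables should allow the sockets and logs to live somewhere per-user or per-job rather than under the default fixed names in /tmp:

```bash
#!/bin/bash
# Sketch of per-job MPS management, assuming the batch job has been
# allocated a single GPU; the directory locations are placeholders.
export CUDA_MPS_PIPE_DIRECTORY=$HOME/mps/$SLURM_JOB_ID/pipe
export CUDA_MPS_LOG_DIRECTORY=$HOME/mps/$SLURM_JOB_ID/log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

nvidia-cuda-mps-control -d                 # start the MPS control daemon

mpirun -bind-to socket -n 6 ./gromacs.sh   # the per-rank wrappers, as above

echo quit | nvidia-cuda-mps-control        # shut the daemon down at job end
```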