2022-01-11 optimization, GPU, linear algebra, shared libraries

Performance of CUDA-based Substitute BLAS

It is attractive to be able to improve the performance of BLAS-intensive code by substituting an alternative dynamically-linked BLAS library, as in a previous item. If you have NVIDIA GPUs, obviously you want to do that with a CUDA-based BLAS which operates on matrices in host memory (while sighing about proprietary software). As I’m working on a POWER9 system, with nodes similar to Summit’s (but with four 32GB V100s per node, not six), the relevant BLAS implementations are NVIDIA’s NVBLAS and IBM’s ESSL, a descendant of what was once available to me on a ‘Blue box’ (System/370)…

I couldn’t find data on how they performed relative to one another and with varying parameters (except for some information from OLCF on POWER8 systems, with the previous generation of the NVLINK fabric, and possibly older libraries), so I made measurements. Here are some results.

Only DGEMM performance is considered. (While NVBLAS appears to supply the set of GEMM, SYMM, SYR2K, SYRK, TRMM, and TRSM operations, it isn’t clear how they are all accelerated, and it would be worth measuring them all and comparing with ESSL, similarly to BLIS’ comparisons of CPU BLAS.)

What’s of interest is performance scaling as a function of matrix dimensions and number of GPUs, and any possible tuning. Only large square matrices are considered; thin matrices tend to need special treatment in CPU implementations, but presumably they would have to be extremely thin relative to their other dimension, given enough elements to make the GPU offload worthwhile.

Results were obtained on an AC922 POWER9 node, with one to four GPUs used. The DGEMM benchmark from OpenBLAS was used, with OPENBLAS_LOOPS=3 in the environment, i.e. three repetitions, to attempt to smooth the results. The benchmark takes arguments to specify a range of dimensions of a square matrix filled with random values, printing the performance in MFLOPS, and the time taken, for the DGEMM calculation. (The generation of the random fills is slow.) In initial measurements performance varied from run to run by up to about 10% in some cases, possibly due to variation between nodes, so the final results were all obtained in a single batch run on RHEL 7. NVBLAS was from CUDA 10.2.89, and ESSL was version 6.2 (which required CUDA 10.1.243). The OpenMP components could use the full 32 cores with no specified core binding.
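
For concreteness, a minimal harness in the spirit of the OpenBLAS benchmark might look like the following sketch (not the benchmark actually used; it assumes the conventional Fortran dgemm_ ABI with 32-bit integers). The same binary can then be pointed at OpenBLAS, NVBLAS, or ESSL’s libesslsmpcuda purely by changing which shared library satisfies dgemm_ at run time, for example on the link line or with LD_PRELOAD.

/* dgemm_bench.c: time one square DGEMM through whatever BLAS satisfies dgemm_
   at run time.  Assumes the conventional Fortran dgemm_ ABI with 32-bit
   integers (as for libesslsmpcuda).  Build e.g. with: cc -O2 dgemm_bench.c -lblas */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb,
            const double *beta, double *c, const int *ldc);

int main(int argc, char **argv)
{
    int n = argc > 1 ? atoi(argv[1]) : 4096;          /* square dimension */
    size_t elems = (size_t)n * n;
    double *a = malloc(elems * sizeof *a);
    double *b = malloc(elems * sizeof *b);
    double *c = malloc(elems * sizeof *c);
    if (!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }
    for (size_t i = 0; i < elems; i++) {              /* slow random fill */
        a[i] = rand() / (double)RAND_MAX;
        b[i] = rand() / (double)RAND_MAX;
        c[i] = 0.0;
    }
    double alpha = 1.0, beta = 0.0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    /* DGEMM performs a nominal 2*n^3 floating-point operations */
    printf("n=%d  %.2fs  %.1f GFLOPS\n", n, secs, 2.0 * n * n * n / secs / 1e9);
    free(a); free(b); free(c);
    return 0;
}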

ESSL

ESSL has more GPU-accelerated BLAS operations than NVBLAS, and also accelerates some of LAPACK. Of the multiple ESSL shared libraries, libesslsmpcuda provides the (hybrid) GPU support with 32-bit integers.

The relevant tuning for ESSL is through the environment variables ESSL_CUDA_HYBRID, to specify hybrid CPU/GPU use or not, and ESSL_CUDA_PIN, to specify whether or not to pin memory. Results were compared without pinning and with ESSL_CUDA_PIN=yes to pin on entry to DGEMM. The GPUs used by ESSL are determined by CUDA_VISIBLE_DEVICES as normal.

The wild swings in performance visible in the plot below are not reproducible in position and magnitude from run to run, but are generally present. They deserve investigation; it’s not obvious what might cause them.

ESSL performance with varying GPU numbers, memory pinned and not. [Gnuplot input]

NVBLAS

NVBLAS (annoyingly) requires a configuration file at runtime to control its behaviour, rather than environment variables. (The exception is that it consults NVBLAS_CONFIG_FILE for the file’s location, and prints an annoying message if that variable isn’t defined and it uses the default.)

Apart from logging, it determines the list of GPUs used, whether to pin memory, a tile dimension, and a fallback CPU BLAS to use for small dimensions at which the overhead of GPU use is too large. OpenMP-based OpenBLAS 0.3.10 was used for the fallback; it might have been better to use CPU OpenMP-based ESSL (ESSL provides 550 GFLOPS on 32 CPU cores, while OpenBLAS gets 450 GFLOPS), but the ‘small’ matrix performance isn’t of much interest here. It seems generally advantageous to use memory pinning, and so the measurements just vary the GPUs and tile size. Thus a configuration file nvblas.conf might look like

# with LD_LIBRARY_PATH set if necessary for openblas
NVBLAS_CPU_BLAS_LIB libopenblas.so.0
NVBLAS_AUTOPIN_MEM_ENABLED
NVBLAS_TILE_DIM 2048
NVBLAS_GPU_LIST 0 1

I couldn’t find the default tile size documented, but it is probably 1024, since the results are very similar when that is set explicitly.

Clearly performance doesn’t scale to all four GPUs with the 2048 tile size (and not perfectly to three). Increasing the tile size in the four-GPU case helps, but still only gives performance similar to three GPUs.

NVBLAS performance with varying number of GPUs and tile size. [Gnuplot input]
NVBLAS performance with varying number of GPUs and tile size fixed at 2048. [Gnuplot input]

Conclusions

Pinning memory on entry to BLAS (and unpinning on exit) is generally advantageous. According to the ESSL documentation, better performance may be obtained by pinning the matrix data in the caller, but that isn’t relevant if we’re just interested in transparently substituting for a CPU BLAS.
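
For illustration only, caller-side pinning generally amounts to page-locking the already-allocated host buffers so the library doesn’t have to pin and unpin inside every call; the following is a sketch using the CUDA runtime’s cudaHostRegister (my assumption about the general approach, not the interface ESSL documents for announcing pre-pinned data).

/* pin_sketch.c: sketch of caller-side pinning of host buffers with the CUDA
   runtime (illustrative; consult the ESSL documentation for the supported way
   to indicate the data are already pinned).  Build e.g. with:
       cc pin_sketch.c -I${CUDA_HOME}/include -L${CUDA_HOME}/lib64 -lcudart */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t n = 8192, bytes = n * n * sizeof(double);
    double *a = malloc(bytes), *b = malloc(bytes), *c = malloc(bytes);
    if (!a || !b || !c) return 1;

    /* Page-lock the existing allocations once, up front */
    if (cudaHostRegister(a, bytes, cudaHostRegisterDefault) != cudaSuccess ||
        cudaHostRegister(b, bytes, cudaHostRegisterDefault) != cudaSuccess ||
        cudaHostRegister(c, bytes, cudaHostRegisterDefault) != cudaSuccess) {
        fprintf(stderr, "pinning failed\n");
        return 1;
    }

    /* ... fill a and b, and make repeated dgemm_ calls on the pinned buffers ... */

    /* Unpin and free when finished */
    cudaHostUnregister(a); cudaHostUnregister(b); cudaHostUnregister(c);
    free(a); free(b); free(c);
    return 0;
}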

NVBLAS performance is generally somewhat better than ESSL’s, but setting the tile size appropriately is important.

It isn’t clear where the GPU/CPU cut-off is with either implementation. One could check at what dimension the CPU DGEMM fallback is invoked, or compare results with the slow reference BLAS.
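
For NVBLAS, one way to do the former (a sketch relying only on the documented NVBLAS_CPU_BLAS_LIB mechanism) would be to point NVBLAS_CPU_BLAS_LIB at a small shim library that logs the dimensions it is handed and forwards to the real CPU BLAS. Only dgemm_ is shown; the other routines NVBLAS can fall back to would presumably need the same treatment, and it is an assumption that NVBLAS tolerates a fallback library exporting only these symbols.

/* dgemm_shim.c: log the dimensions NVBLAS hands to its CPU fallback, then
   forward to the real OpenBLAS.  Build with:
       cc -shared -fPIC -o libdgemmshim.so dgemm_shim.c -ldl
   and set NVBLAS_CPU_BLAS_LIB to point at libdgemmshim.so. */
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

typedef void (*dgemm_t)(const char *, const char *, const int *, const int *,
                        const int *, const double *, const double *, const int *,
                        const double *, const int *, const double *, double *,
                        const int *);

void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb,
            const double *beta, double *c, const int *ldc)
{
    static dgemm_t real;   /* lazy lookup; not thread-safe, but adequate for logging */
    if (!real) {
        void *h = dlopen("libopenblas.so.0", RTLD_NOW | RTLD_LOCAL);
        if (!h || !(real = (dgemm_t)dlsym(h, "dgemm_"))) {
            fprintf(stderr, "dgemm shim: cannot find the real dgemm_\n");
            abort();
        }
    }
    fprintf(stderr, "CPU fallback dgemm: m=%d n=%d k=%d\n", *m, *n, *k);
    real(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}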

Using all four GPUs together effectively in one program requires two or four MPI ranks, since performance doesn’t scale properly past two GPUs per process. That may be due to crossing the X-bus between sockets, rather than NVLINK. It should be checked whether binding a thread per GPU to each socket makes a difference, or whether the library adjusts that sort of thing itself.

The peak performance of a V100 quoted by NVIDIA is ‘up to 7.8 TFLOPs’ double precision. While the figures are variable, the maximum DGEMM performance observed for a single GPU was 6.6 TFLOPS with ESSL and (more consistently) 6.9 TFLOPS with NVBLAS, i.e. roughly 85% and 88% of the quoted peak.

The matrix data are in host memory, which is ‘unified’ with GPU memory over the NVLINK, though the exact mechanism for that, and the performance relative to x86_64 systems, is unclear and probably worth investigating. With the maximum matrix size used here, chosen to keep measurement times reasonable, the data still fit in GPU memory (15GB total in 32GB). It might be worth investigating larger dimensions, exceeding GPU memory, in case there’s a need for such sizes.