Porting the code to NVIDIA GPUs using the OpenACC programming model
Currently, PALM-GPU usage has the following restrictions / requirements:
- 2d domain decomposition (or 1PE, single-core)
- cyclic lateral boundary conditions
- no humidity / cloud physics
- no canopy model
- no Lagrangian particle model
- random number generator needs to be ported
- most of I/O does not work
- MG solver does not work / needs to be ported
Tests can be done on our GPU workstation (2 * 8-core Intel CPUs, 2 NVIDIA Kepler K40 boards), which runs as a node of the cluster system at LUIS. Access is as follows:
ssh -X <your cluster username>@login.cluster.uni-hannover.de
After login, switch to our GPU-node by starting an interactive job:
qsub -X -I -l nodes=1:ppn=2,walltime=10000 -l mem=2GB -W x=PARTITION:muk
The session uses two Intel CPU cores for 10000 seconds. You will need two cores if you want to run PALM on both K40 boards.
Configuration file settings should be as follows:
%remote_username <replace>
#%modules pgi/14.6:mvapich2/2.0-pgi-cuda     # mvapich doesn't work so far
%modules pgi/14.6:openmpi/1.8.3-pgi-cuda
%tmp_user_catalog /tmp
%compiler_name mpif90
%compiler_name_ser pgf90
%cpp_options -Mpreprocess:-DMPI_REAL=MPI_DOUBLE_PRECISION:-DMPI_2REAL=MPI_2DOUBLE_PRECISION:-D__nopointer:-D__openacc:-D__cuda_fft:-D__lc
%mopts -j:1
%fopts -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-fastsse:-Mcuda=cuda6.0
%lopts -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-fastsse:-Mcuda=cuda6.0:-lcufft
Please note the cpp-directive settings (-D__nopointer, -D__openacc, -D__cuda_fft) and the CUDA library entry (-lcufft) in %lopts.
The nocache compiler switch is not required any more (earlier compiler versions, e.g. 13.6, showed a significant loss of performance if this switch was omitted). The time switch creates performance data and outputs it at the end of a run. Very useful!
It might be necessary to load the modules manually before calling palmbuild or palmrun:
module load pgi/14.6 openmpi/1.8.3-pgi-cuda
Furthermore, it is required to set the environment variables
export OMPI_COMM_WORLD_LOCAL_RANK=1
export PGI_ACC_NOSYNCQUEUE=1
before calling palmrun! The second one is absolutely required when using the CUDA-fft (fft_method = 'system-specific' plus -D__cuda_fft). If it is not set, the pressure solver does not reduce the divergence!
Compiler version 14.10 gives a runtime error when pres is called for the first time in init_3d_model:
cuEventRecord returned error 400: Invalid handle
I guess that this problem is also somehow connected with the usage of streams. I got the following information from Mat Colgrove (NVIDIA/PGI):
We were able to determine the issue with calling cuFFT (TPR#20579). In 14.4 we stopped using stream 0 as the default stream for OpenACC, since stream 0 has some special properties that hindered asynchronous behavior. The problem with that is that, if combined with calling CUDA code which still uses stream 0, the streams and hence the data can get out of sync. In 14.7, we'll change OpenACC to use stream 0 again if "-Mcuda" is used. In the meantime, you can set the environment variable "PGI_ACC_NOSYNCQUEUE=1" to work around the issue.
The workaround worked, e.g. for 14.6, but maybe not for 14.10? There might be a solution to this problem via explicit setting of CUDA streams, as described at https://www.olcf.ornl.gov/tutorials/mixing-openacc-with-gpu-libraries/#Fortran
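A minimal sketch of that explicit-stream approach, following the OLCF tutorial (the plan creation parameters are illustrative; cufft is the PGI-supplied cuFFT interface module):

SUBROUTINE bind_cufft_to_acc_stream
!
!--    Attach the cuFFT plan to the CUDA stream of the OpenACC synchronous
!--    queue, so that cuFFT calls and OpenACC kernels execute in order
       USE cufft
       USE openacc

       IMPLICIT NONE

       INTEGER ::  istat, plan

       istat = cufftPlan1d( plan, 256, CUFFT_D2Z, 1 )
       istat = cufftSetStream( plan, acc_get_cuda_stream( acc_async_sync ) )

END SUBROUTINE bind_cufft_to_acc_stream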
A test parameter-set can be found here:
/home/raasch/current_version/JOBS/acc_medium/INPUT/acc_medium_p3d
Please note that loop_optimization = 'acc' and psolver = 'poisfft' have to be set. fft_method = 'system-specific' is required to switch on the CUDA-fft; all other fft methods do not run on the GPU, i.e. they are extremely slow.
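The relevant entries of the parameter file then read as follows (a minimal sketch, assuming the usual &inipar namelist of the _p3d file; all remaining entries, e.g. the grid setup, are omitted here and can be taken from the acc_medium file above):

&inipar
   loop_optimization = 'acc',
   psolver           = 'poisfft',
   fft_method        = 'system-specific', /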
palmrun command to run on two GPU devices:
palmrun -r acc_medium -c lcmuk -K "parallel pgigpu146" -X2 -T2 -a "d3#"
Runs on a single GPU without MPI (i.e. no domain decomposition) require this configuration:
%compiler_name pgf90
%compiler_name_ser pgf90
%cpp_options -Mpreprocess:-D__nopointer:-D__openacc:-D__cuda_fft:-D__lc
%fopts -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-Mcray=pointer:-fastsse:-Mcuda=cuda6.0
%lopts -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-Mcray=pointer:-fastsse:-Mcuda=cuda6.0:-lcufft
Run it with
palmrun -r acc_medium -K pgigpu146 -a "d3#"
Some PGI compiler options for debugging:
-O0 -C -g -Mbounds -Mchkstk -traceback
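For a debug build, these can be put into %fopts in place of the optimization switches, e.g. (a sketch following the configuration-file syntax shown above):

%fopts -acc:-ta=tesla,6.0,time:-Minfo=acc:-O0:-C:-g:-Mbounds:-Mchkstk:-traceback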
Report on current activities:
r1015
prognostic equations (partly: q and sa are still missing), prandtl_fluxes, and diffusivities have been ported
additional versions of the tendency subroutines have been created (..._acc)
statistics are not ported at all
the speedup seems to be similar to what has been reported by Klaus Ketelsen
measurements with the Intel compiler on inferno still have to be carried out
r1111
The pressure solver (including the tridiagonal solver) has been almost completely ported; still missing are calculations in pres.
The CUDA fft has been implemented.
The GPU can now also be used in the single-core (non-MPI-parallel) version.
r1113
In single-core mode, the lateral boundary conditions run completely on the device. Most loops in pres have been ported. The vertical boundary conditions (boundary_conds) have been ported.
r1221
Reduction operations in pres and flow_statistics ported.
r1749
Partial adjustments for the new surface layer scheme. The version is (in principle) instrumented to run on multiple GPUs.
Results for a 256x256x64 grid (time in µs per gridpoint and timestep):
| no. | configuration | time | revision |
| 1 | 1 Tesla board, single-core (no MPI), pgi 13.6 | 0.33342 | r1221 |
| 2 | CPU single-core (no MPI), pgi 13.6 (cache, Temperton) | 2.34144 | r1221 |
The initialization time of the GPU (power-up) can be avoided by running /muksoft/packages/pgi/2013-136/linux86-64/13.6/bin/pgcudainit in the background.
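For example, started once per session, so that subsequent PALM runs do not pay the power-up cost:

/muksoft/packages/pgi/2013-136/linux86-64/13.6/bin/pgcudainit &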
For PGI compiler version 13.6, use "-ta=nocache" and set the environment variable PGI_ACC_SYNCHRONOUS=1. Otherwise, there will be a significant loss in performance (a factor of two!).
Next steps:
- porting of the MINLOC/MAXLOC operations in timestep; porting of disturb_fields (requires implementing a parallel random number generator)
- check the capability of parallel regions (can IF-constructs be removed from inner loops?)
- for MPI mode, update ghost boundaries only; overlap the update/MPI transfer with computation
- overlapping communication in pressure solver (alltoall operations)
- porting of remaining things (exchange_horiz_2d, calc_liquid_water_content, compute_vpt, averaging, I/O, etc.)
- ...
Work packages / questions for the EuroHack:
- getting the CUDA-aware MPI to run: for this, the routines time_integration and exchange_horiz in r1749 have to be replaced by the routines that I provided. If the exchange of ghost points runs sufficiently well, the next step would be to make the MPI_ALLTOALL in transpose.f90 CUDA-aware. This should be very easy: just add (e.g.) host_data use_device( f_inv, work ) clauses in front of the MPI_ALLTOALL calls and remove the existing update host and data copyin clauses (see the first sketch below this list). Also, the update host and update device clauses for array ar have to be removed in poisfft.
- CUDA-fft has been implemented and successfully tested in single-GPU (non-MPI) mode. It can be switched on with the parameter fft_method = 'system-specific'. Additionally, the compiler switch -D__cuda_fft and the linker option -lcufft have to be set. For an unknown reason, this method does not work in MPI mode (the pressure solver does not reduce the divergence).
- In general: do the existing clauses (e.g. loop vector / loop gang) give the best performance?
- For several routines, extra OpenACC versions had to be created, namely most of the tendency subroutines (advec_ws, diffusion_..., etc.), prognostic_equations, and flow_statistics. There are two main reasons for this: the PGI compiler (at least up to version 14.6) was unable to vectorize loops like DO k = nzb_s_inner(j,i), nzt, where the loop bound depends on the indices of the outer loops. Does this restriction still exist? Furthermore, reduction operations only work on single scalar variables. Since in flow_statistics reductions are carried out in loops over several elements of arrays, separate temporary scalar variables had to be introduced (see the second sketch below this list). Are reduction operations still restricted to scalar variables?
- In routine advec_ws I had to introduce another array, wall_flags_00, to hold the wall flags for bits 32-63 (see the third sketch below this list). It seems that the OpenACC/PGI compiler can only handle single-precision (32-bit) INTEGERs. Is that true?
- Routine fft_xy: the clause !$acc declare create( ar_tmp ) does not work from compiler version 14.1 on. Instead, I had to use !$acc data create( ar_tmp ) clauses (see the fourth sketch below this list). Why? Does this problem still exist in the current compiler version?
- Routine surface_layer_fluxes: there are some loops (DO WHILE, DO without a specific loop counter, etc.) which cannot be vectorized.
- Routine swap_timelevel: why can the compiler not vectorize Fortran array assignments like u = u_p (see the fifth sketch below this list)?
- Routine timestep: the Fortran functions MINLOC and MAXLOC, which are used in routine global_min_max, do not run on the GPU. Is there a better way to realize these functions than the OpenACC workaround that is programmed in routine timestep (see the last sketch below this list)?
- General question: Is it (meanwhile) possible to run code in parallel on the CPU and the GPU at the same time, or - in other words - to run something "in background" on the GPU?
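First sketch (CUDA-aware MPI_ALLTOALL): the host_data region passes the device addresses of f_inv and work to the CUDA-aware MPI library, so the update host / copyin clauses become unnecessary. The argument lists are illustrative, not the actual ones in transpose.f90, and both arrays are assumed to be device-resident from an enclosing data region:

!$acc host_data use_device( f_inv, work )
CALL MPI_ALLTOALL( f_inv, sendcount, MPI_REAL,                                 &
                   work,  recvcount, MPI_REAL, comm, ierr )
!$acc end host_data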
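Second sketch (extra OpenACC versions): simplified illustrations of the two compiler restrictions, not literal PALM code:

!-- (1) inner loop bound depends on the indices of the outer loops
!$acc kernels present( nzb_s_inner, tend )
DO  i = nxl, nxr
   DO  j = nys, nyn
      DO  k = nzb_s_inner(j,i)+1, nzt
         tend(k,j,i) = 0.0
      ENDDO
   ENDDO
ENDDO
!$acc end kernels

!-- (2) reductions require a scalar, hence the temporary s1 instead of an
!-- array element like sums_l(k,1,0) (horizontal sum at a fixed level k)
s1 = 0.0
!$acc parallel loop collapse( 2 ) reduction( +: s1 ) present( u )
DO  i = nxl, nxr
   DO  j = nys, nyn
      s1 = s1 + u(k,j,i)
   ENDDO
ENDDO
sums_l(k,1,0) = s1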
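Third sketch (wall_flags_00): how flag bits beyond 31 are accessed via the additional 32-bit array (the bit numbers and flag variables are illustrative):

!-- bits 0-31 are kept in wall_flags_0, bits 32-63 in wall_flags_00,
!-- shifted down by 32
IF ( BTEST( wall_flags_0(k,j,i), 31 ) )   flag_31 = .TRUE.   ! bit 31
IF ( BTEST( wall_flags_00(k,j,i), 0 ) )   flag_32 = .TRUE.   ! bit 32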
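Fourth sketch (fft_xy): the failing and the working variant side by side (declarations and array bounds omitted for brevity):

!-- fails from compiler version 14.1 on:
!$acc declare create( ar_tmp )

!-- workaround: explicit data region around the code using ar_tmp
!$acc data create( ar_tmp )
!-- ... FFT kernels working on ar_tmp ...
!$acc end data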
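Fifth sketch (swap_timelevel): wrapping the assignment in a kernels region is one way to keep it on the device (assuming u and u_p are present there); whether the compiler generates efficient code for this is part of the question above:

!$acc kernels present( u, u_p )
u = u_p
!$acc end kernels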
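Last sketch (timestep): one possible two-pass replacement for MINLOC, illustrative and not the workaround currently programmed in timestep: first reduce the minimum value, then locate its indices in a second pass:

!-- pass 1: reduce the minimum value of array a on the device
value_min = HUGE( 1.0 )
!$acc parallel loop collapse( 3 ) reduction( min: value_min ) present( a )
DO  i = nxl, nxr
   DO  j = nys, nyn
      DO  k = nzb+1, nzt
         value_min = MIN( value_min, a(k,j,i) )
      ENDDO
   ENDDO
ENDDO
!-- pass 2: a second (cheap) device loop finds the i/j/k where
!-- a(k,j,i) == value_min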