Porting the code to NVidia GPU using the openACC programming model

Currently, PALM-GPU usage has following restrictions / requirements:

  • 2d domain decomposition (or 1PE, single-core)
  • cyclic lateral boundary conditions
  • no humidity / cloud physics
  • no canopy model
  • no Lagrangian particle model
  • random number generator needs to be ported
  • most of I/O does not work
  • MG solver does not work / needs to be ported

Tests can be done on our GPU workstation (2*8 core Intel-CPU, 2 NVidia Kepler K40 boards), which runs as a node of the cluster-system at LUIS. Access as follows:

ssh -X <your cluster username>@login.cluster.uni-hannover.de

After login, switch to our GPU-node by starting an interactive job:

qsub -X -I -l nodes=1:ppn=2,walltime=10000 -l mem=2GB -W x=PARTITION:muk

The session uses 2 Intel-CPU-cores for 10000 seconds. You will need two cores if you want to run PALM on both of the two K40 boards.

Configuration file settings should be as followed:

%remote_username   <replace>                          
#%modules           pgi/14.6:mvapich2/2.0-pgi-cuda         # mvapich doesn't work so far
%modules           pgi/14.6:openmpi/1.8.3-pgi-cuda    
%tmp_user_catalog  /tmp                               
%compiler_name     mpif90                             
%compiler_name_ser pgf90                              
%cpp_options       -Mpreprocess:-DMPI_REAL=MPI_DOUBLE_PRECISION:-DMPI_2REAL=MPI_2DOUBLE_PRECISION:-D__nopointer:-D__openacc:-D__cuda_fft:-D__lc  
%mopts             -j:1                               
%fopts             -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-fastsse:-Mcuda=cuda6.0  
%lopts             -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-fastsse:-Mcuda=cuda6.0:-lcufft  

Please note settings of cpp-directives (-D__nopointer -D__openacc -D__cuda_fft + CUDA library path in lopts).
The nocache compiler switch is not required any more. (Earlier compiler versions, e.g. 13.6 gave a significant loss of performance in case of omitting this switch). The time-switch creates and outputs performance data at the end of a run. Very useful!

It might be necessary to load the modules manually before calling palmbuild or palmrun:

module load pgi/14.6 openmpi/1.8.3-pgi-cuda

Furthermore, it is required to set the environment variables


before calling palmrun! The second one is absolutely required in case of using the CUDA-fft (fft_method='system_specific' + -D__cuda_fft). If it is not used, the pressure solver does not reduce the divergence!

Compiler version 14.10 gives a runtime error when pres is called for the first time in init_3d_model:

cuEventRecord returned error 400: Invalid handle

I guess that this problem is also somehow connected with usage of streams. I got following informations from Mat Colgrove (NVidia/PGI):

We were able to determine the issue with calling cuFFT (TPR#20579).  In 14.4 we stopped using stream 0 as the default
stream for OpenACC since stream 0 has some special properties that made asynchronous behavior.  The problem with that
if combined with a calling a CUDA code, which still uses stream 0, the streams and hence the data can get out of sync.
In 14.7, we'll change OpenACC to use stream 0 again if "-Mcuda" is used.

In the meantime, you can set the environment variable "PGI_ACC_NOSYNCQUEUE=1" to work around the issue.

The workaround worked, e.g. for 14.6, but maybe not for 14.10? There might be a solution for this problem using explicit setting of CUDA streams, which is described under https://www.olcf.ornl.gov/tutorials/mixing-openacc-with-gpu-libraries/#Fortran

A test parameter-set can be found here:


Please note that loop_optimization = 'acc' and psolver = 'poisfft' have to be set. fft_method = 'system-specific' is required to switch on the CUDA-fft. All other fft-methods do not run on the GPU, i.e. they are extremely slow.

palmrun-command to run on two GPU-devices:

palmrun -r acc_medium -c lcmuk -K "parallel pgigpu146" -X2 -T2 -a "d3#"

Runs on a single GPU without MPI (i.e. no domain decomposition) require this configuration:

%compiler_name     pgf90                                                       
%compiler_name_ser pgf90                                                       
%cpp_options       -Mpreprocess:-D__nopointer:-D__openacc:-D__cuda_fft:-D__lc  
%fopts             -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-Mcray=pointer:-fastsse:-Mcuda=cuda6.0  
%lopts             -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-Mcray=pointer:-fastsse:-Mcuda=cuda6.0:-lcufft  

Run it with

palmrun -r acc_medium -K pgigpu146 -a "d3#"

Some PGI-compiler options for debugging:

-O0 -C -g -Mbounds -Mchkstk -traceback

Report on current activities:

prognostic equations (partly: q and sa is missing), prandtl_fluxes, and diffusivities have been ported
additional versions for tendency subroutines have been created (..._acc)
statistics are not ported at all
speedup seems to be similar to what have been reported by Klaus Ketelsen
measurements with Intel compiler on inferno still have to be carried out

Pressure solver (including the tridiagonal solver) has been almost completely ported. Still missing are calculations in pres.
CUDA fft has been implemented.
GPU can also been used in the single-core (non-MPI-parallel) version.

In single-core mode, lateral boundary conditions completely run on device. Most loops in pres ported. Vertical boundary conditions (boundary_conds) ported.

Reduction operations in pres and flow_statistics ported.

Partial adjustments for new surface layer scheme. Version is (in principle) instrumented to run on multiple GPUs

Results for 256x256x64 grid (time in micro-s per gridpoint and timestep):

.1 1*Tesla, single-core (no MPI), pgi13.6 0.33342 r1221
.2 single-core (no MPI), pgi13.6 (cache, Temperton) 2.34144 r1221

The initialization time of the GPU (power up) can be avoided by running /muksoft/packages/pgi/2013-136/linux86-64/13.6/bin/pgcudainit in background.

For PGI compiler version 13.6, use "-ta=nocache" and set environment variable PGI_ACC_SYNCHRONOUS=1. Otherwise, there will be a significant loss in performance (factor of two!).

Next steps:

  • porting of MIN/MAXLOC operations in (timestep, porting of disturb_fields (implement parallel random number generator)
  • check the capability of parallel regions (can IF-constructs be removed from inner loops?)
  • for MPI mode update ghost boundaries only, overlapping of update/MPI-transfer and computation
  • overlapping communication in pressure solver (alltoall operations)
  • porting of remaining things (exchange_horiz_2d, calc_liquid_water_content, compute_vpt, averaging, I/O, etc.)
  • ...

work packages / questions for the EuroHack:

  • getting the CUDA-aware MPI to run: for this routines time_integration and exchange_horiz in r1749 have to be replaced by the routines that I provided. If the exchange of ghost points is running sufficiently, the next step would be to make the MPI_ALLTOALL in transpose.f90 CUDA-aware. This should be very easy. Just add (e.g.) host_data use_device( f_inv, work ) clauses in front of the MPI_ALLTOALL calls and remove the existing update host and data copyin clauses. Also, update host and update device clauses for array ar have to be removed in poisfft.
  • CUDA-fft has been implemented and successfully tested for the single-GPU (non-MPI) mode. It can be switched on using parameter fft_method = 'system-specific. Additionally, the compiler-switch -D__cuda_fft and the linker option -lcufft have to be set. For an unknown reason, this method does not work in the MPI-mode (the pressure solver does not reduce the divergence).
  • In general: do the existing clauses (e.g. loop vector / loop gang give the best performance?
  • For several routines extra OpenACC-versions had to be created. These are most of the tendency-subroutines (advec_ws, diffusion_..., etc.), prognostic_equations and flow_statistics. There are two main reasons for this: the PGI compiler (at least until version 14.6) was unable to vectorize loops like DO k = nzb_s_inner(j,i), nzt where the index depends on the indices of the outer loops. Does this restriction still exist? Furthermore, reduction operations only work for single scalar variables. Since in flow_statistics reductions are carried out in loops on several elements of arrays, separate temporary scalar variables had to be introduced. Are reduction operations still restricted to scalar variables?
  • In routine advec_ws I had to introduce another array wall_flags_00 to hold wall flags for bits 32-63. It seems, that OpenACC/PGI-compiler can only handle single precision (32bit) INTEGER. Is that true?
  • Routine fft_xy: The clause !$acc declare create( ar_tmp ) does not work starting with the 14.1 compiler-version. Instead, I had to use !$acc data create( ar_tmp ) clauses. Why? Does this problem still exist for the current compiler version?
  • Routine surface_layer_fluxes: there are some loops (DO WHILE, DO without specific loop counter, etc.) which cannot be vectorized
  • Routine swap_timelevel: Why does the compiler cannot vectorize the FORTRAN vector assignments like u = u_p?
  • Routine timestep: FORTRAN functions MINLOC and MAXLOC, which are used in routine global_min_max, do not run on GPU. Is there a better way to realize these functions than the openacc-workaround that is programmed in routine timestep?
  • General question: Is it (meanwhile) possible to run code in parallel on the CPU and the GPU at the same time, or - in other words - to run something "in background" on the GPU?
Last modified 2 years ago Last modified on Nov 21, 2018 5:10:59 PM