Version 17 (modified by raasch, 9 years ago) (diff)

--

Porting the code to NVidia GPU using the openACC programming model

Currently, PALM-GPU usage has following restrictions / requirements:

  • 2d domain decomposition (or 1PE, single-core)
  • cyclic lateral boundary conditions
  • no humidity / cloud physics
  • no canopy model
  • no Lagrangian particle model

Tests can be done on our GPU workstation (2*8 core Intel-CPU, 2 NVidia Kepler K40 boards), which runs as a node of the cluster-system at LUIS. Access as follows:

ssh -X <your cluster username>@login.cluster.uni-hannover.de

After login, switch to our GPU-node by starting an interactive job:

qsub -X -I -l nodes=1:ppn=2,walltime=10000 -l mem=2GB -W x=PARTITION:muk

The session uses 2 Intel-CPU-cores for 10000 seconds. You will need two cores if you want to run PALM on both of the two K40 boards.

Configuration file settings should be as followed:

%remote_username   <replace>                          lcmuk parallel pgigpu146
#%modules           pgi/14.6:mvapich2/2.0-pgi-cuda    lcmuk parallel pgigpu146     # mvapich doesn't work so far
%modules           pgi/14.6:openmpi/1.8.3-pgi-cuda    lcmuk parallel pgigpu146
%tmp_user_catalog  /tmp                               lcmuk parallel pgigpu146
%compiler_name     mpif90                             lcmuk parallel pgigpu146
%compiler_name_ser pgf90                              lcmuk parallel pgigpu146
%cpp_options       -Mpreprocess:-DMPI_REAL=MPI_DOUBLE_PRECISION:-DMPI_2REAL=MPI_2DOUBLE_PRECISION:-D__nopointer:-D__openacc:-D__cuda_fft:-D__lc  lcmuk parallel pgigpu146
%mopts             -j:1                               lcmuk parallel pgigpu146
%fopts             -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-fastsse:-Mcuda=cuda6.0  lcmuk parallel pgigpu146
%lopts             -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-fastsse:-Mcuda=cuda6.0:-lcufft  lcmuk parallel pgigpu146

The nocache compiler switch is currently required. Otherwise there would be a significant loss of performance. It might be neccessary to load the modules manually before calling mbuild or mrun:

module load pgi/14.6 openmpi/1.8.3-pgi-cuda

Furthermore, it is required to set the environment variable

export OMPI_COMM_WORLD_LOCAL_RANK=1

before calling mrun! Compiler version 14.10 gives a runtime error when pres is called for the first time in init_3d_model. A test parameter-set:

/home/raasch/current_version/JOBS/acc_medium/INPUT/acc_medium_p3d

Here are some hints for running the single-GPU (no-MPI) version:
Compiler settings are given in

.../trunk/SCRIPTS/.mrun.config.imuk_gpu
.../trunk/INSTALL/MAKE.inc.pgi.openacc

Please note settings of cpp-directives (-D__openacc -D__cuda_fft + CUDA library path).
Test parameter set:

/home/raasch/current_version/JOBS/gputest/INPUT/gputest_p3d
/home/raasch/current_version/JOBS/acc_medium/INPUT/acc_medium_p3d

Please note that loop_optimization = 'acc', psolver = 'poisfft', and fft_method = 'system-specific' have to be set. Results of tests are stored in the respective MONITORING directory.

Report on current activities:

r1015
prognostic equations (partly: q and sa is missing), prandtl_fluxes, and diffusivities have been ported
additional versions for tendency subroutines have been created (..._acc)
statistics are not ported at all
speedup seems to be similar to what have been reported by Klaus Ketelsen
measurements with Intel compiler on inferno still have to be carried out

r1111
Pressure solver (including the tridiagonal solver) has been almost completely ported. Still missing are calculations in pres.
CUDA fft has been implemented.
GPU can also been used in the single-core (non-MPI-parallel) version.

r1113
In single-core mode, lateral boundary conditions completely run on device. Most loops in pres ported. Vertical boundary conditions (boundary_conds) ported.

r1221
Reduction operations in pres and flow_statistics ported.

r1749
Partial adjustments for new surface layer scheme. Version is (in principle) instrumented to run on multiple GPUs

Results for 256x256x64 grid (time in micro-s per gridpoint and timestep):

.1 1*Tesla, single-core (no MPI), pgi13.6 0.33342 r1221
.2 single-core (no MPI), pgi13.6 (cache, Temperton) 2.34144 r1221

The initialization time of the GPU (power up) can be avoided by running /muksoft/packages/pgi/2013-136/linux86-64/13.6/bin/pgcudainit in background.

For PGI compiler version 13.6, use "-ta=nocache" and set environment variable PGI_ACC_SYNCHRONOUS=1. Otherwise, there will be a significant loss in performance (factor of two!).

Next steps:

  • porting of MIN/MAXLOC operations in (timestep, porting of disturb_fields (implement parallel random number generator)
  • check the capability of parallel regions (can IF-constructs be removed from inner loops?)
  • for MPI mode update ghost boundaries only, overlapping of update/MPI-transfer and computation
  • overlapping communication in pressure solver (alltoall operations)
  • porting of remaining things (exchange_horiz_2d, calc_liquid_water_content, compute_vpt, averaging, I/O, etc.)
  • ...

work packages fpr the EuroHack:

  • getting the CUDA-aware MPI to run: for this routines time_integration and exchange_horiz in r1749 have to be replaced by the routines that I provided. If the exchange of ghost points is running sufficiently, the next step would be to make the MPI_ALLTOALL in transpose.f90 CUDA-aware. This should be very easy. Just add (e.g.) host_data use_device( f_inv, work ) clauses in front of the MPI_ALLTOALL calls and remove the existing update host and data copyin clauses. Also, update host and update device clauses for array ar have to be removed in poisfft.
  • CUDA-fft has been implemented and successfully tested for the single-GPU (non-MPI) mode. It can be switched on using parameter fft_method = 'system-specific. Additionally, the compiler-switch -D__cuda_fft and the linker option -lcufft have to be set. For an unknown reason, this method does not work in the MPI-mode (the pressure solver does not reduce the divergence).
  • In general: do the existing clauses (e.g. loop vector / loop gang give the best performance?
  • For several routines extra OpenACC-versions had to be created. These are most of the tendency-subroutines (advec_ws, diffusion_..., etc.), prognostic_equations and flow_statistics. There are two main reasons for this: the PGI compiler (at least until version 14.6) was unable to vectorize loops like DO k = nzb_s_inner(j,i), nzt where the index depends on the indices of the outer loops. Does this restriction still exist? Furthermore, reduction operations only work for single scalar variables. Since in flow_statistics reductions are carried out in loops on several elements of arrays, separate temporary scalar variables had to be introduced. Are reduction operations still restricted to scalar variables?
  • In routine advec_ws I had to introduce another array wall_flags_00 to hold wall flags for bits 32-63. It seems, that OpenACC/PGI-compiler can only handle single precision (32bit) INTEGER. Is that true?
  • Routine fft_xy: The clause !$acc declare create( ar_tmp ) does not work starting with the 14.1 compiler-version. Instead, I had to use !$acc data create( ar_tmp ) clauses. Why? Does this problem still exist for the current compiler version?
  • Routine surface_layer_fluxes: there are some loops (DO WHILE, DO without specific loop counter, etc.) which cannot be vectorized
  • Routine swap_timelevel: Why does the compiler cannot vectorize the FORTRAN vector assignments like u = u_p?
  • Routine timestep: Is there a chance that the FORTRAN functions MINLOC and MAXLOC, which are used in routine global_min_max, are directly supported on the GPU?