== Porting the code to NVidia GPU using the OpenACC programming model ==

Currently, PALM-GPU usage has the following restrictions / requirements:
 * 2d domain decomposition (or 1 PE, single-core)
 * cyclic lateral boundary conditions
 * no humidity / cloud physics
 * no canopy model
 * no Lagrangian particle model
 * random number generator needs to be ported
 * most of the I/O does not work
 * MG solver does not work / needs to be ported

Tests can be done on our GPU workstation (2*8-core Intel CPU, 2 NVidia Kepler K40 boards), which runs as a node of the cluster system at LUIS. Access as follows:
{{{
ssh -X <username>@login.cluster.uni-hannover.de
}}}
After login, switch to our GPU node by starting an interactive job:
{{{
qsub -X -I -l nodes=1:ppn=2,walltime=10000 -l mem=2GB -W x=PARTITION:muk
}}}
The session uses 2 Intel CPU cores for 10000 seconds. You will need two cores if you want to run PALM on both of the two K40 boards.

Configuration file settings should be as follows:
{{{
%remote_username    <username>
#%modules           pgi/14.6:mvapich2/2.0-pgi-cuda    # mvapich doesn't work so far
%modules            pgi/14.6:openmpi/1.8.3-pgi-cuda
%tmp_user_catalog   /tmp
%compiler_name      mpif90
%compiler_name_ser  pgf90
%cpp_options        -Mpreprocess:-DMPI_REAL=MPI_DOUBLE_PRECISION:-DMPI_2REAL=MPI_2DOUBLE_PRECISION:-D__nopointer:-D__openacc:-D__cuda_fft:-D__lc
%mopts              -j:1
%fopts              -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-fastsse:-Mcuda=cuda6.0
%lopts              -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-fastsse:-Mcuda=cuda6.0:-lcufft
}}}
Please note the settings of the cpp directives ({{{-D__nopointer -D__openacc -D__cuda_fft}}}) and the CUDA library path in {{{lopts}}}.\\
The {{{nocache}}} compiler switch '''is not required any more''' (earlier compiler versions, e.g. 13.6, gave a significant loss of performance if this switch was omitted). The {{{time}}} switch creates and outputs performance data at the end of a run. Very useful!
\\
\\
It might be necessary to load the modules manually before calling palmbuild or palmrun:
{{{
module load pgi/14.6 openmpi/1.8.3-pgi-cuda
}}}
Furthermore, it is required to set the environment variables
{{{
export OMPI_COMM_WORLD_LOCAL_RANK=1
export PGI_ACC_NOSYNCQUEUE=1
}}}
before calling palmrun! The second one is '''absolutely required''' when using the CUDA fft ({{{fft_method='system-specific'}}} + {{{-D__cuda_fft}}}). If it is not set, the pressure solver does not reduce the divergence!
\\
Compiler version 14.10 gives a runtime error when pres is called for the first time in init_3d_model:
{{{
cuEventRecord returned error 400: Invalid handle
}}}
I guess that this problem is also somehow connected with the usage of streams. I got the following information from Mat Colgrove (NVidia/PGI):
{{{
We were able to determine the issue with calling cuFFT (TPR#20579). In 14.4 we stopped using stream 0 as the default stream for OpenACC since stream 0 has some special properties that made asynchronous behavior. The problem with that if combined with a calling a CUDA code, which still uses stream 0, the streams and hence the data can get out of sync. In 14.7, we'll change OpenACC to use stream 0 again if "-Mcuda" is used. In the meantime, you can set the environment variable "PGI_ACC_NOSYNCQUEUE=1" to work around the issue.
}}}
The workaround works, e.g. for 14.6, but maybe not for 14.10?\\
There might be a solution for this problem using explicit setting of CUDA streams, which is described under [https://www.olcf.ornl.gov/tutorials/mixing-openacc-with-gpu-libraries/#Fortran].
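The following is a minimal, untested sketch along the lines of that tutorial. The routine name and the plan handle are placeholders; the point is only the combination of {{{cufftSetStream}}} and {{{acc_get_cuda_stream}}}, which (assuming PGI's {{{cufft}}} and {{{openacc}}} modules are available) forces cuFFT onto the same CUDA stream that OpenACC uses for its synchronous queue:
{{{
SUBROUTINE set_cufft_stream( plan )

!-- Sketch only: attach an existing cuFFT plan (created elsewhere, e.g. in
!-- fft_xy) to the CUDA stream of the OpenACC synchronous queue, so that
!-- cuFFT work and OpenACC kernels/updates are ordered on one stream.
    USE cufft      ! PGI-supplied cuFFT interface module (assumption)
    USE openacc

    IMPLICIT NONE

    INTEGER ::  plan     ! cuFFT plan handle
    INTEGER ::  istat

    istat = cufftSetStream( plan, acc_get_cuda_stream( acc_async_sync ) )

END SUBROUTINE set_cufft_stream
}}}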
A test parameter set can be found here:
{{{
/home/raasch/current_version/JOBS/acc_medium/INPUT/acc_medium_p3d
}}}
Please note that {{{loop_optimization = 'acc'}}} and {{{psolver = 'poisfft'}}} have to be set. {{{fft_method = 'system-specific'}}} is required to switch on the CUDA fft. All other fft methods do not run on the GPU, i.e. they are extremely slow.
\\
\\
palmrun command to run on two GPU devices:
{{{
palmrun -r acc_medium -c lcmuk -K "parallel pgigpu146" -X2 -T2 -a "d3#"
}}}
\\
Runs on a single GPU without MPI (i.e. no domain decomposition) require this configuration:
{{{
%compiler_name      pgf90
%compiler_name_ser  pgf90
%cpp_options        -Mpreprocess:-D__nopointer:-D__openacc:-D__cuda_fft:-D__lc
%fopts              -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-Mcray=pointer:-fastsse:-Mcuda=cuda6.0
%lopts              -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-Mcray=pointer:-fastsse:-Mcuda=cuda6.0:-lcufft
}}}
Run it with
{{{
palmrun -r acc_medium -K pgigpu146 -a "d3#"
}}}
\\
Some PGI compiler options for debugging:
{{{
-O0 -C -g -Mbounds -Mchkstk -traceback
}}}
\\
'''Report on current activities:'''

r1015 \\
prognostic equations (partly: q and sa are still missing), prandtl_fluxes, and diffusivities have been ported \\
additional versions of the tendency subroutines have been created ({{{..._acc}}}) \\
statistics are not ported at all \\
speedup seems to be similar to what has been reported by Klaus Ketelsen \\
measurements with the Intel compiler on {{{inferno}}} still have to be carried out

r1111 \\
Pressure solver (including the tridiagonal solver) has been almost completely ported. Still missing are calculations in pres. \\
CUDA fft has been implemented. \\
The GPU can also be used in the single-core (non-MPI-parallel) version.

r1113 \\
In single-core mode, lateral boundary conditions completely run on the device. Most loops in {{{pres}}} ported. Vertical boundary conditions ({{{boundary_conds}}}) ported.

r1221 \\
Reduction operations in {{{pres}}} and {{{flow_statistics}}} ported.

r1749 \\
Partial adjustments for the new surface layer scheme. The version is (in principle) instrumented to run on multiple GPUs.

'''Results for a 256x256x64 grid (time in µs per gridpoint and timestep):'''
||= no. =||= setup =||= time =||= revision =||
|| 1 || 1*Tesla, single-core (no MPI), pgi13.6 || 0.33342 || r1221 ||
|| 2 || single-core (no MPI), pgi13.6 (cache, Temperton) || 2.34144 || r1221 ||

The initialization time of the GPU (power up) can be avoided by running {{{/muksoft/packages/pgi/2013-136/linux86-64/13.6/bin/pgcudainit}}} in the background.

For PGI compiler version 13.6, use "-ta=nocache" and set the environment variable {{{PGI_ACC_SYNCHRONOUS=1}}}. Otherwise, there will be a significant loss in performance (a factor of two!).

'''Next steps:'''
 * porting of the MIN/MAXLOC operations in {{{timestep}}}, porting of {{{disturb_fields}}} (implement a parallel random number generator)
 * check the capability of parallel regions (can IF-constructs be removed from inner loops?)
 * for MPI mode, update ghost boundaries only and overlap the update/MPI transfer with computation (a sketch follows after this list)
 * overlapping communication in the pressure solver (alltoall operations)
 * porting of the remaining things (exchange_horiz_2d, calc_liquid_water_content, compute_vpt, averaging, I/O, etc.)
 * ...
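A rough sketch of the ghost-boundary overlap (array name, bounds, and the async queue number are illustrative, not taken from the PALM sources; only the x-direction is shown):
{{{
SUBROUTINE overlap_ghost_update( u, nzb, nzt, nys, nyn, nxl, nxr, nbgp )

!-- Sketch only: copy just the outermost non-ghost layers to the host on
!-- async queue 1, let the interior computation run concurrently on the
!-- default queue, then exchange the ghost points via MPI and copy the
!-- received layers back to the device.
    IMPLICIT NONE

    INTEGER ::  nzb, nzt, nys, nyn, nxl, nxr, nbgp, i, j, k
    REAL(8), DIMENSION(nzb:nzt+1,nys-nbgp:nyn+nbgp,nxl-nbgp:nxr+nbgp) ::  u

!-- In PALM the enclosing data region lives elsewhere (time_integration);
!-- it is included here only to make the sketch self-contained
    !$acc data copy( u )

!-- Asynchronous host update of the left/right boundary layers
    !$acc update host( u(:,:,nxl:nxl+nbgp-1), u(:,:,nxr-nbgp+1:nxr) ) async(1)

!-- Interior computation (independent of the ghost layers), default queue
    !$acc parallel loop collapse(3) present( u )
    DO  i = nxl+nbgp, nxr-nbgp
       DO  j = nys+nbgp, nyn-nbgp
          DO  k = nzb+1, nzt
             u(k,j,i) = u(k,j,i)    ! placeholder for the real stencil update
          ENDDO
       ENDDO
    ENDDO

!-- Wait for the boundary layers, exchange them on the host via MPI, then
!-- copy the received ghost layers back to the device
    !$acc wait(1)
!   CALL exchange_horiz( u, nbgp )    ! PALM routine, not defined in this sketch
    !$acc update device( u(:,:,nxl-nbgp:nxl-1), u(:,:,nxr+1:nxr+nbgp) ) async(1)
    !$acc wait(1)

    !$acc end data

END SUBROUTINE overlap_ghost_update
}}}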
'''Work packages / questions for the EuroHack:'''
 * Getting the CUDA-aware MPI to run: for this, the routines {{{time_integration}}} and {{{exchange_horiz}}} in r1749 have to be replaced by the routines that I provided. If the exchange of ghost points runs sufficiently well, the next step would be to make the {{{MPI_ALLTOALL}}} in {{{transpose.f90}}} CUDA-aware. This should be very easy: just add (e.g.) {{{host_data use_device( f_inv, work )}}} clauses in front of the {{{MPI_ALLTOALL}}} calls and remove the existing {{{update host}}} and {{{data copyin}}} clauses (see the first sketch after this list). Also, the {{{update host}}} and {{{update device}}} clauses for array {{{ar}}} have to be removed in {{{poisfft}}}.
 * CUDA fft has been implemented and successfully tested for the single-GPU (non-MPI) mode. It can be switched on using the parameter {{{fft_method = 'system-specific'}}}. Additionally, the compiler switch {{{-D__cuda_fft}}} and the linker option {{{-lcufft}}} have to be set. For an unknown reason, this method does not work in MPI mode (the pressure solver does not reduce the divergence).
 * In general: do the existing clauses (e.g. {{{loop vector / loop gang}}}) give the best performance?
 * For several routines, extra OpenACC versions had to be created. These are most of the tendency subroutines (advec_ws, diffusion_..., etc.), prognostic_equations, and flow_statistics. There are two main reasons for this: the PGI compiler (at least up to version 14.6) was unable to vectorize loops like {{{DO k = nzb_s_inner(j,i), nzt}}}, where the start index depends on the indices of the outer loops. Does this restriction still exist? Furthermore, reduction operations only work for single scalar variables. Since in {{{flow_statistics}}} reductions are carried out in loops over several elements of arrays, separate temporary scalar variables had to be introduced (see the second sketch after this list). Are reduction operations still restricted to scalar variables?
 * In routine {{{advec_ws}}} I had to introduce another array {{{wall_flags_00}}} to hold the wall flags for bits 32-63. It seems that the OpenACC/PGI compiler can only handle 32-bit INTEGERs. Is that true?
 * Routine {{{fft_xy}}}: the clause {{{!$acc declare create( ar_tmp )}}} does not work starting with the 14.1 compiler version. Instead, I had to use {{{!$acc data create( ar_tmp )}}} clauses. Why? Does this problem still exist for the current compiler version?
 * Routine {{{surface_layer_fluxes}}}: there are some loops (DO WHILE, DO without a specific loop counter, etc.) which cannot be vectorized.
 * Routine {{{swap_timelevel}}}: why can the compiler not vectorize FORTRAN vector assignments like {{{u = u_p}}}?
 * Routine {{{timestep}}}: the FORTRAN functions {{{MINLOC}}} and {{{MAXLOC}}}, which are used in routine {{{global_min_max}}}, do not run on the GPU. Is there a better way to realize these functions than the OpenACC workaround that is programmed in routine {{{timestep}}}? (A possible alternative is given as the third sketch after this list.)
 * General question: is it (meanwhile) possible to run code in parallel on the CPU and the GPU at the same time, or, in other words, to run something "in background" on the GPU?
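First sketch (CUDA-aware {{{MPI_ALLTOALL}}}): routine name, argument list, and communicator are placeholders, not the actual {{{transpose.f90}}} interface, and a CUDA-aware MPI (e.g. openmpi/1.8.3-pgi-cuda) is required. The device addresses of {{{f_inv}}} and {{{work}}} are passed to MPI via {{{host_data use_device}}}, replacing the former {{{update host}}} / {{{data copyin}}}:
{{{
SUBROUTINE cuda_aware_alltoall( f_inv, work, nelem, sendrecvcount, comm2d )

!-- Sketch only: f_inv and work are assumed to be present on the device
!-- (created by an enclosing data region); MPI operates directly on the
!-- device copies.
    USE MPI

    IMPLICIT NONE

    INTEGER ::  nelem, sendrecvcount, comm2d, ierr
    REAL(8), DIMENSION(nelem) ::  f_inv, work

    !$acc host_data use_device( f_inv, work )
    CALL MPI_ALLTOALL( f_inv, sendrecvcount, MPI_DOUBLE_PRECISION,            &
                       work,  sendrecvcount, MPI_DOUBLE_PRECISION,            &
                       comm2d, ierr )
    !$acc end host_data

END SUBROUTINE cuda_aware_alltoall
}}}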
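Second sketch (scalar-reduction workaround as used in {{{flow_statistics}}}; array names and bounds are illustrative): each array element that accumulates a sum is replaced by a local scalar, which OpenACC accepts as a reduction variable, and the result is stored back after the loop:
{{{
SUBROUTINE sum_up_profiles( u, v, sums_l, nzb, nzt, nys, nyn, nxl, nxr )

!-- Sketch only: sums_l(k,1) and sums_l(k,2) cannot be used directly as
!-- OpenACC reduction variables, so the temporary scalars s1 and s2 collect
!-- the horizontal sums for each level k.
    IMPLICIT NONE

    INTEGER ::  nzb, nzt, nys, nyn, nxl, nxr, i, j, k
    REAL(8), DIMENSION(nzb:nzt+1,nys:nyn,nxl:nxr) ::  u, v
    REAL(8), DIMENSION(nzb:nzt+1,2) ::  sums_l
    REAL(8) ::  s1, s2

    !$acc data copyin( u, v )
    DO  k = nzb, nzt+1
       s1 = 0.0_8
       s2 = 0.0_8
       !$acc parallel loop collapse(2) present( u, v ) reduction(+:s1,s2)
       DO  i = nxl, nxr
          DO  j = nys, nyn
             s1 = s1 + u(k,j,i)
             s2 = s2 + v(k,j,i)
          ENDDO
       ENDDO
       sums_l(k,1) = s1    ! stored on the host after the reduction
       sums_l(k,2) = s2
    ENDDO
    !$acc end data

END SUBROUTINE sum_up_profiles
}}}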
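Third sketch (a possible device-side replacement for MAXLOC using two scalar reductions; this is only a suggested alternative, not the workaround currently programmed in {{{timestep}}}): pass 1 reduces the maximum value, pass 2 reduces the smallest index at which it occurs:
{{{
SUBROUTINE global_max_with_index( a, n, amax, imax )

!-- Sketch only: replaces MAXLOC for a 1d array by two scalar reductions
!-- that OpenACC can handle on the device.
    IMPLICIT NONE

    INTEGER ::  n, i, imax
    REAL(8) ::  amax
    REAL(8), DIMENSION(n) ::  a

    amax = -HUGE( 1.0_8 )
    imax = n + 1

    !$acc data copyin( a )
!-- Pass 1: maximum value
    !$acc parallel loop reduction(max:amax)
    DO  i = 1, n
       amax = MAX( amax, a(i) )
    ENDDO

!-- Pass 2: smallest index at which the maximum occurs
    !$acc parallel loop reduction(min:imax)
    DO  i = 1, n
       IF ( a(i) == amax )  imax = MIN( imax, i )
    ENDDO
    !$acc end data

END SUBROUTINE global_max_with_index
}}}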