== Porting the code to NVidia GPUs using the OpenACC programming model

Currently, PALM-GPU usage has the following restrictions / requirements:

* 2d domain decomposition (or 1 PE, single-core)
* cyclic lateral boundary conditions
* no humidity / cloud physics
* no canopy model
* no Lagrangian particle model

Tests can be done on our GPU workstation (2*8-core Intel CPUs, 2 NVidia Kepler K40 boards), which runs as a node of the cluster system at LUIS. Access as follows:
{{{
ssh -X <username>@login.cluster.uni-hannover.de
}}}
After login, switch to our GPU node by starting an interactive job:
{{{
qsub -X -I -l nodes=1:ppn=2,walltime=10000 -l mem=2GB -W x=PARTITION:muk
}}}
The session uses 2 Intel CPU cores for 10000 seconds. You will need two cores if you want to run PALM on both of the K40 boards.

Configuration file settings should be as follows:
{{{
%remote_username    <username>    lcmuk parallel pgigpu146
#%modules           pgi/14.6:mvapich2/2.0-pgi-cuda     lcmuk parallel pgigpu146    # mvapich doesn't work so far
%modules            pgi/14.6:openmpi/1.8.3-pgi-cuda    lcmuk parallel pgigpu146
%tmp_user_catalog   /tmp          lcmuk parallel pgigpu146
%compiler_name      mpif90        lcmuk parallel pgigpu146
%compiler_name_ser  pgf90         lcmuk parallel pgigpu146
%cpp_options        -Mpreprocess:-DMPI_REAL=MPI_DOUBLE_PRECISION:-DMPI_2REAL=MPI_2DOUBLE_PRECISION:-D__nopointer:-D__openacc:-D__cuda_fft:-D__lc    lcmuk parallel pgigpu146
%mopts              -j:1          lcmuk parallel pgigpu146
%fopts              -acc:-ta=tesla,6.0,nocache,time:-Minfo=acc:-fastsse:-Mcuda=cuda6.0           lcmuk parallel pgigpu146
%lopts              -acc:-ta=tesla,6.0,nocache,time:-Minfo=acc:-fastsse:-Mcuda=cuda6.0:-lcufft   lcmuk parallel pgigpu146
}}}
The {{{nocache}}} compiler switch is currently required; otherwise there is a significant loss of performance.

It might be necessary to load the modules manually before calling mbuild or mrun:
{{{
module load pgi/14.6 openmpi/1.8.3-pgi-cuda
}}}
Furthermore, it is required to set the environment variable
{{{
export OMPI_COMM_WORLD_LOCAL_RANK=1
}}}
before calling mrun!

Compiler version 14.10 gives a runtime error when {{{pres}}} is called for the first time in {{{init_3d_model}}}.

A test parameter set:
{{{
/home/raasch/current_version/JOBS/acc_medium/INPUT/acc_medium_p3d
}}}

Here are some hints for running the single-GPU (no-MPI) version:\\
Compiler settings are given in
{{{
.../trunk/SCRIPTS/.mrun.config.imuk_gpu
.../trunk/INSTALL/MAKE.inc.pgi.openacc
}}}
Please note the settings of the cpp directives ({{{-D__openacc -D__cuda_fft}}} + CUDA library path).\\
Test parameter sets:
{{{
/home/raasch/current_version/JOBS/gputest/INPUT/gputest_p3d
/home/raasch/current_version/JOBS/acc_medium/INPUT/acc_medium_p3d
}}}
Please note that {{{loop_optimization = 'acc'}}}, {{{psolver = 'poisfft'}}}, and {{{fft_method = 'system-specific'}}} have to be set.

Results of tests are stored in the respective {{{MONITORING}}} directory.

'''Report on current activities:'''

r1015 \\
prognostic equations (partly: q and sa are still missing), prandtl_fluxes, and diffusivities have been ported \\
additional versions of the tendency subroutines have been created ({{{..._acc}}}) \\
statistics have not been ported at all \\
speedup seems to be similar to what has been reported by Klaus Ketelsen \\
measurements with the Intel compiler on {{{inferno}}} still have to be carried out

r1111 \\
The pressure solver (including the tridiagonal solver) has been almost completely ported. Still missing are calculations in {{{pres}}}. \\
The CUDA fft has been implemented. \\
The GPU can also be used in the single-core (non-MPI-parallel) version.
r1113 \\
In single-core mode, the lateral boundary conditions run completely on the device. Most loops in {{{pres}}} have been ported. Vertical boundary conditions ({{{boundary_conds}}}) have been ported.

r1221 \\
Reduction operations in {{{pres}}} and {{{flow_statistics}}} have been ported.

r1747 \\
Partial adjustments for the new surface layer scheme. The version is (in principle) instrumented to run on multiple GPUs.

'''Results for a 256x256x64 grid (time in micro-s per gridpoint and timestep):''' \\
||= no. =||= configuration =||= time =||= revision =||
||.1 ||1*Tesla, single-core (no MPI), pgi13.6 ||0.33342 ||r1221 ||
||.2 ||single-core (no MPI), pgi13.6 (cache, Temperton) ||2.34144 ||r1221 ||

The initialization time of the GPU (power-up) can be avoided by running {{{/muksoft/packages/pgi/2013-136/linux86-64/13.6/bin/pgcudainit}}} in the background.

For PGI compiler version 13.6, use "-ta=nocache" and set the environment variable {{{PGI_ACC_SYNCHRONOUS=1}}}. Otherwise, there will be a significant loss in performance (a factor of two!).

'''Next steps:'''

* porting of the MIN/MAXLOC operations in {{{timestep}}}, porting of {{{disturb_fields}}} (implement a parallel random number generator); see the reduction sketch below
* check the capability of parallel regions (can IF-constructs be removed from the inner loops?)
* for MPI mode, update ghost boundaries only; overlap the update/MPI transfer with computation
* overlap the communication in the pressure solver (alltoall operations)
* porting of remaining things (exchange_horiz_2d, calc_liquid_water_content, compute_vpt, averaging, I/O, etc.)
* ...
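
As an illustration of the general porting pattern behind the {{{..._acc}}} routines mentioned above, here is a minimal, hypothetical sketch (not taken from the PALM source; subroutine name, array names, and loop bounds are assumed) of a prognostic-equation-like loop offloaded with OpenACC. The arrays are assumed to be already present on the device, e.g. via an enclosing {{{!$acc data}}} region:
{{{
!-- Hypothetical sketch only: the kind of loop-level OpenACC offloading used for
!-- the ported ..._acc routines. Arrays are assumed to reside on the device
!-- already (enclosing !$acc data region), hence the present clause.
    SUBROUTINE tendency_acc_sketch( nzb, nzt, nys, nyn, nxl, nxr, dt, tend, s )

       IMPLICIT NONE

       INTEGER ::  i, j, k, nxl, nxr, nys, nyn, nzb, nzt
       REAL    ::  dt
       REAL, DIMENSION(nzb:nzt,nys:nyn,nxl:nxr) ::  s, tend

!$acc  kernels present( s, tend )
!$acc  loop independent collapse(3)
       DO  i = nxl, nxr
          DO  j = nys, nyn
             DO  k = nzb+1, nzt
!--             Simple explicit update; the real tendency terms are more involved
                s(k,j,i) = s(k,j,i) + dt * tend(k,j,i)
             ENDDO
          ENDDO
       ENDDO
!$acc  end kernels

    END SUBROUTINE tendency_acc_sketch
}}}
With {{{-Minfo=acc}}} (see {{{%fopts}}} above), the PGI compiler reports for each such loop whether a GPU kernel was generated and which implicit data transfers were inserted.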
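
The reduction sketch referred to in the first of the next steps: a minimal, hypothetical example (function name, variable names, and bounds are assumed) of the OpenACC reduction pattern already used for the ported reductions in {{{pres}}} and {{{flow_statistics}}}. Note that OpenACC offers reduction operators such as {{{max}}} and {{{min}}} but no direct MAXLOC equivalent, so the location of the extremum would have to be recovered in a separate step.
{{{
!-- Hypothetical sketch only: global maximum of |u| computed on the device with
!-- an OpenACC reduction. u is assumed to be present on the device already.
    REAL FUNCTION global_abs_max_acc( nzb, nzt, nys, nyn, nxl, nxr, u )

       IMPLICIT NONE

       INTEGER ::  i, j, k, nxl, nxr, nys, nyn, nzb, nzt
       REAL    ::  u_max
       REAL, DIMENSION(nzb:nzt,nys:nyn,nxl:nxr) ::  u

       u_max = 0.0
!$acc  parallel loop collapse(3) reduction(max:u_max) present( u )
       DO  i = nxl, nxr
          DO  j = nys, nyn
             DO  k = nzb+1, nzt
                u_max = MAX( u_max, ABS( u(k,j,i) ) )
             ENDDO
          ENDDO
       ENDDO

       global_abs_max_acc = u_max

    END FUNCTION global_abs_max_acc
}}}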