== Porting the code to NVIDIA GPUs using the OpenACC programming model

Currently, PALM-GPU usage has the following restrictions / requirements:
 * 2d domain decomposition (or 1 PE, single-core)
 * cyclic lateral boundary conditions
 * no humidity / cloud physics
 * no topography
 * no Lagrangian particle model

Tests can be done on host {{{inferno}}} only, using the PGI Fortran compiler. Required settings:
{{{
export LM_LICENSE_FILE=27000@lizenzserv.rrzn.uni-hannover.de
export PATH=/localdata/opt/mpich2/1.4.1p1/bin:$PATH
export PATH=$PATH:/muksoft/packages/intel/bin:/muksoft/bin
export PATH=$PATH:/localdata/opt/pgi/linux86-64/12.5/bin:/usr/local/cuda/bin
}}}

Compiler settings are given in
{{{
.../trunk/SCRIPTS/.mrun.config.imuk_gpu
}}}
Please note the settings of the cpp directives ({{{-D__openacc -D__cuda_fft}}}) and of the CUDA library path.\\

Test parameter set:
{{{
/home/raasch/current_version/JOBS/gputest/INPUT/gputest_p3d
}}}
Please note that {{{loop_optimization = 'acc'}}} and {{{fft_method = 'system-specific'}}} have to be set (a minimal namelist sketch is given at the end of this section). Results of the tests are stored in the respective {{{MONITORING}}} directory.

'''Report on current activities:'''

r1015 \\
prognostic equations have been ported (partly: q and sa are still missing), as well as prandtl_fluxes and diffusivities \\
additional versions of the tendency subroutines have been created ({{{..._acc}}}; see the illustrative sketch at the end of this section) \\
statistics have not been ported at all \\
the speedup seems to be similar to what has been reported by Klaus Ketelsen \\
measurements with the Intel compiler on {{{inferno}}} still have to be carried out

r1111 \\
the pressure solver (including the tridiagonal solver) has been ported almost completely; still missing are calculations in {{{pres}}} \\
the CUDA FFT has been implemented \\
the GPU can also be used in the single-core (non-MPI-parallel) version

'''Results for a 512x512x64 grid (time in µs per gridpoint and timestep):''' \\
|| 1 || 2*Tesla, quadcore, pgi || 0.32053 ||
|| 2 || 1*Tesla, single-core (no MPI), pgi || 0.54789 ||
|| 3 || quadcore, pgi || 0.78343 ||
|| 4 || quadcore, intel (on bora, cache-v) || 0.82395 ||

'''Next steps:'''
 * test the newest PGI 13.2 compiler version, port the reduction operations (especially in {{{flow_statistics}}}), check the capabilities of parallel regions
 * update ghost boundaries only, overlap the update/MPI with computation
 * remove the host/device data transfer in the single-core version (still required for the cyclic boundary conditions), in order to run the code completely on one GPU
 * overlap communication in the pressure solver (alltoall operations)
 * port the remaining parts (averaging, I/O, etc.)
 * ...
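
'''Example parameter settings (sketch):''' \\
The two parameters mentioned above are set in the parameter file ({{{..._p3d}}}). The following is only a minimal sketch, assuming they are placed in the {{{&inipar}}} namelist; the grid values are hypothetical and not taken from {{{gputest_p3d}}}:
{{{
&inipar loop_optimization = 'acc',        ! activate the OpenACC code branches
        fft_method = 'system-specific',   ! required so that the CUDA fft is used
        nx = 511, ny = 511, nz = 64, /    ! hypothetical grid matching the 512x512x64 test
}}}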
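
'''Illustrative OpenACC loop structure (sketch):''' \\
The following sketch only illustrates the general technique behind the {{{..._acc}}} versions of the tendency subroutines mentioned in the r1015 report; it is not the actual PALM code. The subroutine name, array bounds, and the placeholder tendency term are made up, and it assumes that the fields already reside on the device (e.g. inside an enclosing {{{!$acc data}}} region created by the calling routine):
{{{
!-- Illustrative sketch only -- not the actual PALM code. A hypothetical
!-- tendency kernel in the style of the ..._acc subroutines.
 SUBROUTINE example_tendency_acc( sk, tend )

    IMPLICIT NONE

!-- made-up subdomain bounds, only so that the sketch compiles stand-alone
    INTEGER, PARAMETER ::  nxl = 0, nxr = 511, nys = 0, nyn = 511,            &
                           nzb = 0, nzt = 64

    REAL, DIMENSION(nzb:nzt+1,nys-1:nyn+1,nxl-1:nxr+1) ::  sk, tend

    INTEGER ::  i, j, k

    !$acc kernels present( sk, tend )
    !$acc loop
    DO  i = nxl, nxr
       DO  j = nys, nyn
          !$acc loop vector( 32 )
          DO  k = nzb+1, nzt
!--          placeholder term; the real subroutines compute advection /
!--          diffusion tendencies here
             tend(k,j,i) = tend(k,j,i) - 0.5 * ( sk(k,j,i+1) - sk(k,j,i-1) )
          ENDDO
       ENDDO
    ENDDO
    !$acc end kernels

 END SUBROUTINE example_tendency_acc
}}}
Setting {{{loop_optimization = 'acc'}}} presumably selects these {{{..._acc}}} variants instead of the cache- or vector-optimized loop versions.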