== Porting the code to NVIDIA GPUs using the OpenACC programming model

Tests can currently be run only on host {{{inferno}}}, using the PGI Fortran compiler. The following environment settings are required:
{{{
export LM_LICENSE_FILE=27000@lizenzserv.rrzn.uni-hannover.de
export PATH=/localdata/opt/mpich2/1.4.1p1/bin:$PATH
export PATH=$PATH:/muksoft/packages/intel/bin:/muksoft/bin
export PATH=$PATH:/localdata/opt/pgi/linux86-64/12.5/bin:/usr/local/cuda/bin
}}}
Compiler settings are given in
{{{
.../trunk/SCRIPTS/.mrun.config.imuk_gpu
}}}
Please note the settings of the cpp-directives in that file.\\
Test parameter set:
{{{
/home/raasch/current_version/JOBS/gputest/INPUT/gputest_p3d
}}}
Please note that {{{loop_optimization = 'acc'}}} has to be set. The results of the tests are stored in the respective {{{MONITORING}}} directory.

'''Report on current activities:'''

r1015 \\
prognostic equations (partly: q and sa are still missing), prandtl_fluxes, and diffusivities have been ported \\
additional versions of the tendency subroutines have been created ({{{..._acc}}}) \\
statistics have not been ported at all \\
speedup seems to be similar to what has been reported by Klaus Ketelsen \\
measurements with the Intel compiler on {{{inferno}}} still have to be carried out
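
For readers unfamiliar with OpenACC, here is a minimal sketch of what an accelerator version of a tendency loop can look like. It is not taken from the actual source: the model code is Fortran, while this sketch is in C, and the routine name {{{diffusion_tendency_acc}}}, its arguments, and the 1-D diffusion formula are assumptions chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical example, not project code: adds a 1-D diffusion tendency
   km * (s[k+1] - 2*s[k] + s[k-1]) / dz^2 to tend[k] at all interior points. */
void diffusion_tendency_acc(int n, const double *s, double *tend,
                            double km, double dz)
{
    assert(n >= 3 && dz > 0.0);
    const double inv_dz2 = 1.0 / (dz * dz);

    /* Every interior point is independent of the others, so the loop can be
       offloaded. A compiler without OpenACC support ignores the pragma and
       runs the loop on the host. */
    #pragma acc parallel loop copyin(s[0:n]) copy(tend[0:n])
    for (int k = 1; k < n - 1; ++k) {
        tend[k] += km * (s[k + 1] - 2.0 * s[k] + s[k - 1]) * inv_dz2;
    }
}
```

The {{{copyin}}}/{{{copy}}} clauses in this sketch transfer the arrays on every call. In a real port one would probably keep the fields resident on the device with an enclosing data region, which is the same concern as the host-device transfer mentioned under the next steps.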

'''Results:''' \\
.6   pgf90 without any acc kernels \\
.31  last acc version \\
.32  ifort (on bora) using acc-branch \\
.34  ifort (on bora) using vector-branch \\\\

'''Next steps:'''

* porting the Poisson solver following Klaus' suggestions (there is still a bug in his latest version) and implementing a fast tridiagonal solver for the GPU
* creating a single-core version (without using MPI, so that host-device transfer is minimized)
* testing the PGI 12.6 compiler version, porting flow_statistics once reduction is implemented, and checking the capability of parallel regions
* updating ghost boundaries only; can the update/MPI exchange be overlapped with computation?
* overlapping communication
* porting the remaining parts (averaging, I/O, etc.)
* ...
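
The tridiagonal solver mentioned in the first item could be sketched as follows. The notes above do not say which algorithm is planned; this sketch assumes the standard Thomas algorithm without pivoting, applied to many independent columns in parallel. It is written in C rather than Fortran, and the names {{{thomas_solve}}} and {{{solve_columns}}} as well as the array layout are hypothetical.

```c
#include <assert.h>

/* Hypothetical sketch: solve one tridiagonal system with the Thomas algorithm.
   a: sub-diagonal (a[0] unused), b: diagonal, c: super-diagonal (c[n-1] unused),
   cp: scratch of length n, d: right-hand side on entry, solution on exit.
   No pivoting, so a diagonally dominant matrix is assumed. */
#pragma acc routine seq
static void thomas_solve(int n, const double *a, const double *b,
                         const double *c, double *cp, double *d)
{
    /* Forward elimination */
    cp[0] = c[0] / b[0];
    d[0] = d[0] / b[0];
    for (int k = 1; k < n; ++k) {
        const double m = 1.0 / (b[k] - a[k] * cp[k - 1]);
        cp[k] = c[k] * m;
        d[k] = (d[k] - a[k] * d[k - 1]) * m;
    }
    /* Back substitution */
    for (int k = n - 2; k >= 0; --k) {
        d[k] -= cp[k] * d[k + 1];
    }
}

/* Solve ncol systems that share the same matrix but have different right-hand
   sides. Column j uses d[j*n .. j*n+n-1] and its own slice of the scratch cp. */
void solve_columns(int ncol, int n, const double *a, const double *b,
                   const double *c, double *cp, double *d)
{
    assert(ncol > 0 && n > 1);

    /* The columns are independent, so one thread per column is possible.
       Without OpenACC support the pragma is ignored and the loop runs serially. */
    #pragma acc parallel loop copyin(a[0:n], b[0:n], c[0:n]) \
                              create(cp[0:ncol*n]) copy(d[0:ncol*n])
    for (int j = 0; j < ncol; ++j) {
        thomas_solve(n, a, b, c, cp + j * n, d + j * n);
    }
}
```

Each column is stored contiguously here for readability. On a GPU, an interleaved layout (neighbouring threads accessing neighbouring memory locations) may perform better; that would have to be measured.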