Version 2 (modified by raasch, 12 years ago)

Porting the code to NVIDIA GPUs using the OpenACC programming model

Tests can be done only on host inferno, using the PGI Fortran compiler. Required settings:

export LM_LICENSE_FILE=27000@lizenzserv.rrzn.uni-hannover.de
export PATH=/localdata/opt/mpich2/1.4.1p1/bin:$PATH
export PATH=$PATH:/muksoft/packages/intel/bin:/muksoft/bin
export PATH=$PATH:/localdata/opt/pgi/linux86-64/12.5/bin:/usr/local/cuda/bin
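After applying the settings above, a quick check that the tools are actually reachable can save a failed build. The following is only a minimal sketch: the helper name missing_tools is made up for illustration, and the tool names pgf90, mpif90 and nvcc are assumptions about what the directories added to PATH provide.

```shell
# Print every command from the argument list that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of the repository.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names are assumptions; adjust them to the local installation.
missing=$(missing_tools pgf90 mpif90 nvcc)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing
else
    echo "all tools found"
fi
```

If a tool is reported as missing, the corresponding export line has probably not been executed in the current shell.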

Compiler settings are given in

.../trunk/SCRIPTS/.mrun.config.imuk_gpu

Please note the settings of the cpp-directives.

Test parameter set:

/home/raasch/current_version/JOBS/gputest/INPUT/gputest_p3d

Please note that loop_optimization = 'acc' has to be set. Test results are stored in the respective MONITORING directory.
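Since a forgotten 'acc' setting is easy to overlook, the parameter file can be checked before a run is submitted. This is a rough sketch only: the function name uses_acc_loops is hypothetical, the parameter name is written as loop_optimization, and the namelist fragment in the demonstration (including the nx and ny values) is invented for illustration and is not the content of gputest_p3d.

```shell
# Succeeds (exit status 0) if the file given as $1 sets loop_optimization to 'acc'.
# Note: this is a simple text match; it does not skip commented-out lines.
uses_acc_loops() {
    grep -Eiq "loop_optimization[[:space:]]*=[[:space:]]*'acc'" "$1"
}

# Demonstration on a throw-away file with a hypothetical namelist fragment.
demo=$(mktemp)
printf "%s\n" " &inipar  nx = 39, ny = 39," "          loop_optimization = 'acc', /" > "$demo"
if uses_acc_loops "$demo"; then
    echo "loop_optimization = 'acc' is set"
else
    echo "loop_optimization = 'acc' is missing"
fi
rm -f "$demo"
```

For the real test case, the function would be called with the path of the parameter file given above.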

Report on current activities:

r1015
prognostic equations (partly: q and sa are missing), prandtl_fluxes, and diffusivities have been ported
additional versions for tendency subroutines have been created (..._acc)
statistics are not ported at all
speedup seems to be similar to what has been reported by Klaus Ketelsen
measurements with Intel compiler on inferno still have to be carried out

Results:
.6 pgf90 without any acc kernels
.31 last acc version
.32 ifort (on bora) using acc-branch
.34 ifort (on bora) using vector-branch
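The figures above carry no unit. Assuming they are timings measured in the same unit (an assumption, not something stated here), the ratio between two of them gives the relative speedup. A small sketch with a hypothetical helper named ratio:

```shell
# Ratio of a baseline figure ($1) to another figure ($2), rounded to two decimals.
# LC_ALL=C keeps the decimal point independent of the local language settings.
ratio() {
    LC_ALL=C awk -v base="$1" -v other="$2" 'BEGIN { printf "%.2f\n", base / other }'
}

# pgf90 without acc kernels versus the last acc version (figures from the list above).
ratio .6 .31
```

The same call with .32 or .34 as second argument compares the pgf90 baseline with the ifort runs on bora. Those runs were made on a different host, so such a comparison is only indicative.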

Next steps:

  • porting the Poisson solver following Klaus' suggestions (there is still a bug in his last version), implementing a fast tridiagonal solver for the GPU
  • creating a single core version (without using MPI, so that host-device transfer is minimized)
  • testing the PGI 12.6 compiler version, porting flow_statistics if reduction is implemented, checking the capability of parallel regions
  • updating ghost boundaries only; overlapping of update/MPI and computation?
  • overlapping communication
  • porting of remaining things (averaging, I/O, etc.)
  • ...