Changes between Version 3 and Version 4 of doc/tec/gpu


Timestamp:
Mar 9, 2013 1:27:18 AM
Author:
raasch
Comment:

--

  • doc/tec/gpu

    v3 v4  
    1 == Porting the code to NVidia GPU using the OpenACC programming model
     1== Porting the code to NVidia GPU using the openACC programming model
     2
     3Currently, PALM-GPU usage has following restrictions / requirements:
     4* 2d domain decomposition (or 1PE, single-core)
     5* cyclic lateral boundary conditions
     6* no humidity / cloud physics
     7* no topography
     8* no Lagrangian particle model
    29
    310Tests can be done on host {{{inferno}}} only, using the PGI Fortran compiler. Required settings:
     
    1219.../trunk/SCRIPTS/.mrun.config.imuk_gpu
    1320}}}
    14 Please note settings of cpp-directives.\\
     21Please note the settings of the cpp directives ({{{-D__openacc -D__cuda_fft}}}) and the CUDA library path.\\
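As an illustration only (a generic sketch, not taken from the PALM sources; program, array, and loop bounds are made up), the {{{__openacc}}} symbol typically guards the OpenACC code path so that the same source still compiles and runs without GPU support; {{{__cuda_fft}}} analogously enables the CUDA-based FFT.
{{{
!-- Hypothetical sketch: the GPU path is only compiled when -D__openacc is set
!-- (e.g. pgf90 -acc -Mpreprocess); without it, the plain CPU loop is used.
PROGRAM acc_guard_demo

   IMPLICIT NONE

   INTEGER             ::  k
   REAL, DIMENSION(64) ::  a

   a = 0.0

#if defined( __openacc )
!-- GPU path: the loop is offloaded as an OpenACC kernels region
   !$acc kernels copy( a )
   DO  k = 1, 64
      a(k) = a(k) + 1.0
   ENDDO
   !$acc end kernels
#else
!-- CPU fallback, compiled when -D__openacc is not given
   DO  k = 1, 64
      a(k) = a(k) + 1.0
   ENDDO
#endif

   PRINT*, 'sum of a = ', SUM( a )

END PROGRAM acc_guard_demo
}}}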
    1522Test parameter set:
    1623{{{
    1724/home/raasch/current_version/JOBS/gputest/INPUT/gputest_p3d
    1825}}}
    19 Please note that {{{loop_optimization = 'acc'}}} has to be set. Results of tests are stored in the respective {{{MONITORING}}} directory.
     26Please note that {{{loop_optimization = 'acc'}}} and {{{fft_method = 'system-specific'}}} have to be set. Results of tests are stored in the respective {{{MONITORING}}} directory.
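For illustration, the relevant lines of such a parameter file might look as follows (a sketch only, not the content of the {{{gputest_p3d}}} file above; the grid sizes are placeholders, and it is assumed here that both parameters belong to the {{{&inipar}}} namelist):
{{{
 &inipar  nx = 511, ny = 511, nz = 64,
          loop_optimization = 'acc',
          fft_method        = 'system-specific',
 /
}}}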
    2027
    2128'''Report on current activities:'''
     
    2835Measurements with the Intel compiler on {{{inferno}}} still have to be carried out.
    2936
    30 '''Results:''' \\
    31 .6   pgf90 without any acc kernels \\
    32 .31  last acc version \\
    33 .32  ifort (on bora) using acc-branch \\
    34 .34  ifort (on bora) using vector-branch \\\\
     37r1111 \\
     38The pressure solver (including the tridiagonal solver) has been almost completely ported. Still missing are some calculations in {{{pres}}}. \\
     39A CUDA FFT has been implemented. \\
     40The GPU can also be used in the single-core (non-MPI-parallel) version.
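The general pattern behind such a port can be illustrated with the following generic sketch (not code from PALM; a simple Jacobi iteration stands in for the actual solver): all arrays are kept inside one OpenACC data region, so that no host/device transfers occur between the individual kernels and iterations.
{{{
!-- Generic sketch: two GPU kernels per iteration, the fields p and p_new stay
!-- on the device for the whole iteration loop.
PROGRAM data_region_demo

   IMPLICIT NONE

   INTEGER ::  i, it, j
   REAL, DIMENSION(0:63,0:63) ::  p, p_new

   p = 0.0
   p(0,:) = 1.0       ! fixed boundary value

   !$acc data copy( p ) create( p_new )
   DO  it = 1, 100
!--   Jacobi sweep, executed as one GPU kernel
      !$acc parallel loop collapse( 2 ) present( p, p_new )
      DO  j = 1, 62
         DO  i = 1, 62
            p_new(i,j) = 0.25 * ( p(i-1,j) + p(i+1,j) + p(i,j-1) + p(i,j+1) )
         ENDDO
      ENDDO
!--   Copy the interior back into p, again on the device only
      !$acc parallel loop collapse( 2 ) present( p, p_new )
      DO  j = 1, 62
         DO  i = 1, 62
            p(i,j) = p_new(i,j)
         ENDDO
      ENDDO
   ENDDO
   !$acc end data

   PRINT*, 'p(32,32) = ', p(32,32)

END PROGRAM data_region_demo
}}}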
     41
     42'''Results for a 512x512x64 grid (time in µs per grid point and time step):''' \\
     43||.1 ||2*Tesla, quadcore, pgi             ||0.32053 ||
     44||.2 ||1*Tesla, single-core (no MPI), pgi ||0.54789 ||
     45||.3 ||quadcore, pgi                      ||0.78343 ||
     46||.4 ||quadcore, intel (on bora, cache-v) ||0.82395 ||
    3547
    3648'''Next steps:'''
    3749
    38 * porting the Poisson solver following Klaus' suggestions (there is still a bug in his last version), implement fast tridiagonal solver for GPU
    39 * creating a single core version (without using MPI, so that host-device transfer is minimized)
    40 * testing the PGI 12.6 compiler version, porting of flow_statistics if reduction is implemented, check the capability of parallel regions
    41 * update ghost boundaries only, overlapping of update/MPI and computation?
    42 * overlapping communication
     50* testing the newest PGI 13.2 compiler version, porting reduction operations (especially in {{{flow_statistics}}}), checking the capability of parallel regions
     51* updating ghost boundaries only, overlapping of update/MPI and computation (see the sketch below this list)
     52* removing the host/device data transfer in the single-core version, which is still required for the cyclic boundary conditions, in order to run the code completely on one GPU
     53* overlapping communication in the pressure solver (alltoall operations)
    4354* porting of remaining things (averaging, I/O, etc.)
    4455* ...
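A rough illustration of the ghost-boundary item above (a hypothetical sketch; array name, index bounds, and the MPI exchange, which is only indicated by comments, are made up): only the outermost planes of a subdomain are moved between device and host with {{{!$acc update}}}, while the interior can be computed asynchronously on the GPU in the meantime.
{{{
!-- Hypothetical sketch of overlapping the ghost-plane transfer/exchange with
!-- the computation of the interior.
PROGRAM ghost_update_demo

   IMPLICIT NONE

   INTEGER, PARAMETER ::  nx = 63, ny = 63, nz = 63
   INTEGER            ::  i, j, k
   REAL, DIMENSION(0:nz,-1:ny+1,-1:nx+1) ::  u

   u = 1.0

   !$acc data copy( u )

!-- Start the computation of the interior asynchronously on the GPU
   !$acc parallel loop collapse( 3 ) async( 1 )
   DO  i = 1, nx-1
      DO  j = 1, ny-1
         DO  k = 1, nz
            u(k,j,i) = u(k,j,i) + 0.1
         ENDDO
      ENDDO
   ENDDO

!-- Meanwhile, transfer only the outermost planes to the host ...
   !$acc update host( u(:,:,0), u(:,:,nx) )
!-- ... exchange them with the neighbouring PEs (MPI_SENDRECV etc.) ...
!-- ... and move the received ghost planes back to the device.
   !$acc update device( u(:,:,-1), u(:,:,nx+1) )

!-- Wait for the interior kernel before the field is used any further
   !$acc wait( 1 )

   !$acc end data

   PRINT*, 'u(1,1,1) = ', u(1,1,1)

END PROGRAM ghost_update_demo
}}}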