Changes between Version 18 and Version 19 of doc/tec/gpu
Timestamp: Feb 10, 2016 8:30:14 AM
doc/tec/gpu
[…]

 * no canopy model
 * no Lagrangian particle model
 * random number generator needs to be ported
 * most of the I/O does not work
 * MG solver does not work / needs to be ported

Tests can be done on our GPU workstation (2 x 8-core Intel CPUs, 2 NVIDIA Kepler K40 boards), which runs as a node of the cluster system at LUIS. Access as follows:

[…]

{{{
%lopts  -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-fastsse:-Mcuda=cuda6.0:-lcufft  lcmuk parallel pgigpu146
}}}
Please note the settings of the cpp directives ({{{-D__nopointer -D__openacc -D__cuda_fft}}}) and the CUDA library path in {{{lopts}}}.\\
The {{{nocache}}} compiler switch '''is not required any more''' (earlier compiler versions, e.g. 13.6, gave a significant loss of performance if this switch was omitted). The {{{time}}} switch creates and outputs performance data at the end of a run. Very useful! \\ \\

It might be necessary to load the modules manually before calling mbuild or mrun:

[…]

{{{
export PGI_ACC_NOSYNCQUEUE=1
}}}
before calling mrun! The second one ({{{PGI_ACC_NOSYNCQUEUE=1}}}) is '''absolutely required''' when the CUDA-fft is used ({{{fft_method = 'system-specific'}}} + {{{-D__cuda_fft}}}). If it is not set, the pressure solver does not reduce the divergence! \\

Compiler version 14.10 gives a runtime error when pres is called for the first time in init_3d_model:

[…]

{{{
In the meantime, you can set the environment variable "PGI_ACC_NOSYNCQUEUE=1" to work around the issue.
}}}
The workaround worked, e.g., for 14.6, but maybe not for 14.10?
\\

A test parameter set can be found here:
{{{
/home/raasch/current_version/JOBS/acc_medium/INPUT/acc_medium_p3d
}}}
Please note that {{{loop_optimization = 'acc'}}} and {{{psolver = 'poisfft'}}} have to be set. {{{fft_method = 'system-specific'}}} is required to switch on the CUDA-fft; all other fft-methods do not run on the GPU and are therefore extremely slow (a minimal namelist sketch is given below). Results of the tests are stored in the respective {{{MONITORING}}} directory. \\ \\

mrun command to run on two GPU devices:
{{{
mrun -d acc_medium -h lcmuk -K "parallel pgigpu146" -X2 -T2 -r "d3#"
}}}
\\
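To make the parameter settings above concrete, here is a minimal sketch of the corresponding entries in the _p3d parameter file. It is an illustration only: the {{{&inipar}}} group name and the inline comments are my assumptions, and all remaining entries (grid, numerics, output) have to be taken from the acc_medium example file.
{{{
&inipar loop_optimization = 'acc',             ! loop kernels optimized for OpenACC
        psolver           = 'poisfft',         ! pressure solver that runs on the GPU
        fft_method        = 'system-specific', ! switches on the CUDA-fft (with -D__cuda_fft)
 /
}}}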
Runs on a single GPU without MPI (i.e. no domain decomposition) require this configuration:
{{{
%compiler_name      pgf90  lcmuk pgigpu146
%compiler_name_ser  pgf90  lcmuk pgigpu146
%cpp_options        -Mpreprocess:-D__nopointer:-D__openacc:-D__cuda_fft:-D__lc  lcmuk pgigpu146
%fopts              -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-Mcray=pointer:-fastsse:-Mcuda=cuda6.0  lcmuk pgigpu146
%lopts              -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-Mcray=pointer:-fastsse:-Mcuda=cuda6.0:-lcufft  lcmuk pgigpu146
}}}
Run it with
{{{
mrun -d acc_medium -K pgigpu146 -r "d3#"
}}}
\\

Some PGI compiler options for debugging:
{{{
-O0 -C -g -Mbounds -Mchkstk -traceback
}}}
\\

'''Report on current activities:'''

[…]

'''work packages / questions for the EuroHack:'''

 * getting the CUDA-aware MPI to run: for this, routines {{{time_integration}}} and {{{exchange_horiz}}} in r1749 have to be replaced by the routines that I provided. If the exchange of ghost points works sufficiently well, the next step would be to make the {{{MPI_ALLTOALL}}} in {{{transpose.f90}}} CUDA-aware. This should be very easy: just add (e.g.) {{{host_data use_device( f_inv, work )}}} clauses in front of the {{{MPI_ALLTOALL}}} calls and remove the existing {{{update host}}} and {{{data copyin}}} clauses (see the sketch at the end of this page). Also, the {{{update host}}} and {{{update device}}} clauses for array {{{ar}}} have to be removed in {{{poisfft}}}.
[…]
 * Routine {{{surface_layer_fluxes}}}: there are some loops (DO WHILE, DO without a specific loop counter, etc.) which cannot be vectorized
 * Routine {{{swap_timelevel}}}: Why can the compiler not vectorize FORTRAN vector assignments like {{{u = u_p}}}?
 * Routine {{{timestep}}}: the FORTRAN functions {{{MINLOC}}} and {{{MAXLOC}}}, which are used in routine {{{global_min_max}}}, do not run on the GPU. Is there a better way to realize these functions than the OpenACC workaround that is programmed in routine {{{timestep}}}? (A generic two-pass sketch is given at the end of this page.)
 * General question: Is it (meanwhile) possible to run code in parallel on the CPU and the GPU at the same time, or, in other words, to run something "in the background" on the GPU?

The following section at the end of the page was removed in v19:
'''Things that still need to be ported:'''
 * multigrid-solver
 * cloud physics
 * stuff related to non-cyclic BC
 * random-number generator
 * the complete LPM
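For the CUDA-aware {{{MPI_ALLTOALL}}} work package above, the following Fortran/OpenACC fragment sketches the intended change. It is a hypothetical illustration, not the actual {{{transpose.f90}}} code: the subroutine name, dummy arguments, the {{{sendrecvcount}}}/{{{comm2d}}} names and the use of {{{MPI_REAL}}} are placeholders.
{{{
 SUBROUTINE transpose_alltoall_sketch( f_inv, work, sendrecvcount, comm2d )

!-- Illustration only. f_inv and work are assumed to be already present in
!-- GPU device memory (e.g. via an enclosing "!$acc data" region).
    USE MPI

    IMPLICIT NONE

    INTEGER ::  comm2d, ierr, sendrecvcount
    REAL    ::  f_inv(*), work(*)

!-- Old, host-staged pattern (to be removed): the existing
!-- "!$acc update host( ... )" / "!$acc data copyin( ... )" directives
!-- around the MPI_ALLTOALL call are deleted.

!-- CUDA-aware pattern: host_data/use_device passes the device addresses of
!-- f_inv and work directly to MPI (requires a CUDA-aware MPI library).
    !$acc host_data use_device( f_inv, work )
    CALL MPI_ALLTOALL( f_inv, sendrecvcount, MPI_REAL,                        &
                       work,  sendrecvcount, MPI_REAL, comm2d, ierr )
    !$acc end host_data

 END SUBROUTINE transpose_alltoall_sketch
}}}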
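Regarding the {{{MINLOC}}}/{{{MAXLOC}}} item above, one generic way to realize MINLOC with plain OpenACC is a two-pass reduction. The routine below is only a self-contained sketch under the assumption that the array is already present on the device; it is not the workaround that is actually programmed in routine {{{timestep}}}.
{{{
 SUBROUTINE gpu_minloc_sketch( a, n, value_min, idx_min )

!-- Generic two-pass substitute for MINLOC on the GPU (illustration only).
!-- Assumes that array a is already present in device memory.
    IMPLICIT NONE

    INTEGER ::  i, idx_min, n
    REAL    ::  a(n), value_min

!-- Pass 1: global minimum value via a standard min-reduction
    value_min = HUGE( value_min )
    !$acc parallel loop reduction(min:value_min) present(a)
    DO  i = 1, n
       value_min = MIN( value_min, a(i) )
    ENDDO

!-- Pass 2: index of one occurrence of the minimum. The exact comparison is
!-- safe because value_min is one of the array elements; the max-reduction
!-- makes the result deterministic (the largest matching index is returned).
    idx_min = 0
    !$acc parallel loop reduction(max:idx_min) present(a)
    DO  i = 1, n
       IF ( a(i) == value_min )  idx_min = MAX( idx_min, i )
    ENDDO

 END SUBROUTINE gpu_minloc_sketch
}}}
Compared to a single-pass MINLOC this costs one extra sweep over the array, but both loops map onto standard OpenACC reductions.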