Changes between Version 17 and Version 18 of doc/tec/gpu
Timestamp: Feb 9, 2016, 4:38:15 PM
doc/tec/gpu
{{{
%lopts -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-fastsse:-Mcuda=cuda6.0:-lcufft lcmuk parallel pgigpu146
}}}
The {{{nocache}}} compiler switch '''is not required any more''' (earlier compiler versions, e.g. 13.6, gave a significant loss of performance if this switch was omitted). The {{{time}}} switch creates and outputs performance data at the end of a run. Very useful! \\ \\
It might be necessary to load the modules manually before calling mbuild or mrun:
{{{
module load pgi/14.6 openmpi/1.8.3-pgi-cuda
}}}
Furthermore, it is required to set the environment variables
{{{
export OMPI_COMM_WORLD_LOCAL_RANK=1
export PGI_ACC_NOSYNCQUEUE=1
}}}
before calling mrun! The second one is '''absolutely required''' if the CUDA-fft is used ({{{fft_method='system_specific' + -D__cuda_fft}}}). If it is not set, the pressure solver does not reduce the divergence!

Compiler version 14.10 gives a runtime error when pres is called for the first time in init_3d_model:
{{{
cuEventRecord returned error 400: Invalid handle
}}}
I guess that this problem is also somehow connected with the usage of streams. I got the following information from Mat Colgrove (NVidia/PGI):
{{{
We were able to determine the issue with calling cuFFT (TPR#20579). In 14.4 we stopped using stream 0 as the default
stream for OpenACC since stream 0 has some special properties that made asynchronous behavior. The problem with that
if combined with a calling a CUDA code, which still uses stream 0, the streams and hence the data can get out of sync.
In 14.7, we'll change OpenACC to use stream 0 again if "-Mcuda" is used.

In the meantime, you can set the environment variable "PGI_ACC_NOSYNCQUEUE=1" to work around the issue.
}}}
A sketch illustrating this stream-synchronization issue is given at the end of this section.

A test parameter-set can be found here:
{{{
/home/raasch/current_version/JOBS/acc_medium/INPUT/acc_medium_p3d
}}}

…

* Routine {{{surface_layer_fluxes}}}: there are some loops (DO WHILE, DO without a specific loop counter, etc.) which cannot be vectorized
* Routine {{{swap_timelevel}}}: Why can't the compiler vectorize FORTRAN array assignments like {{{u = u_p}}}?
* Routine {{{timestep}}}: Is there a chance that the FORTRAN functions {{{MINLOC}}} and {{{MAXLOC}}}, which are used in routine {{{global_min_max}}}, are directly supported on the GPU? (A possible workaround is sketched at the end of this section.)

'''Things that still need to be ported:'''
* multigrid-solver
* cloud physics
* stuff related with non-cyclic BC
* random-number generator
* the complete LPM
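To make the stream problem described above more concrete, here is a minimal, purely illustrative Fortran/OpenACC sketch of the pattern that arises when OpenACC kernels and cuFFT work on the same device data. It is '''not''' the actual PALM code: the routine and array names, the hand-written cuFFT interface, and the explicit {{{!$acc wait}}} are assumptions for illustration only. The point is that the OpenACC work may run on a non-zero CUDA stream (with PGI 14.4 and later even the default queue does), while cuFFT without an explicitly set stream uses stream 0, so the two must be synchronized, either explicitly as below or by forcing OpenACC back onto stream 0 via {{{PGI_ACC_NOSYNCQUEUE=1}}}.
{{{
!
!-- Minimal, purely illustrative sketch (NOT the actual PALM code): an
!-- asynchronous OpenACC compute region followed by a cuFFT call on the same
!-- device data. The array "ar" is assumed to be present on the device
!-- (enclosing data region in the caller) and "plan" to be a valid cuFFT plan.
SUBROUTINE fft_stream_sketch( ar, plan, nx, ny )

   USE, INTRINSIC ::  ISO_C_BINDING

   IMPLICIT NONE

   INTERFACE
      INTEGER(C_INT) FUNCTION cufftExecZ2Z( plan, idata, odata, direction )  &
                              BIND( C, NAME='cufftExecZ2Z' )
         IMPORT ::  C_INT, C_DOUBLE_COMPLEX
         INTEGER(C_INT), VALUE                    ::  plan, direction
         COMPLEX(C_DOUBLE_COMPLEX), DIMENSION(*)  ::  idata, odata
      END FUNCTION cufftExecZ2Z
   END INTERFACE

   INTEGER(C_INT), PARAMETER ::  CUFFT_FORWARD = -1

   INTEGER(C_INT) ::  plan
   INTEGER        ::  i, ierr, j, nx, ny

   COMPLEX(C_DOUBLE_COMPLEX), DIMENSION(0:nx,0:ny) ::  ar

!
!-- OpenACC work on ar, placed on async queue 1, which the PGI runtime maps
!-- to a non-zero CUDA stream
   !$acc kernels present( ar ) async( 1 )
   DO  j = 0, ny
      DO  i = 0, nx
         ar(i,j) = ar(i,j) * 2.0_C_DOUBLE
      ENDDO
   ENDDO
   !$acc end kernels

!
!-- Make sure queue 1 has finished before cuFFT (stream 0) reads the data;
!-- otherwise the FFT may operate on stale values
   !$acc wait( 1 )

!
!-- Pass the device address of ar to cuFFT (in-place forward transform)
   !$acc host_data use_device( ar )
   ierr = cufftExecZ2Z( plan, ar, ar, CUFFT_FORWARD )
   !$acc end host_data

END SUBROUTINE fft_stream_sketch
}}}
In PALM the mismatch came from the compiler's default queue rather than from an explicit {{{async}}} clause; there the environment variable workaround quoted above (or, from 14.7 on, {{{-Mcuda}}} itself) is the simpler fix.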
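Regarding the {{{MINLOC}}}/{{{MAXLOC}}} question in the list of routine notes above: as long as these intrinsics are not supported inside accelerator regions, they can be emulated with plain loops and OpenACC reduction clauses. The following is only a rough, hypothetical sketch (routine and argument names are made up; this is not the PALM implementation of {{{global_min_max}}}): the first pass finds the maximum value with a {{{reduction(max:...)}}} clause, the second pass finds the smallest index at which it occurs.
{{{
!
!-- Rough, hypothetical sketch (not the PALM implementation of
!-- global_min_max): emulating MAXLOC with two OpenACC passes. The array w
!-- is assumed to be present on the device (enclosing data region).
SUBROUTINE global_max_sketch( w, n, wmax, imax )

   IMPLICIT NONE

   INTEGER               ::  i, imax, n
   REAL(8)               ::  wmax
   REAL(8), DIMENSION(n) ::  w

!
!-- Pass 1: maximum value via a reduction clause
   wmax = -HUGE( 1.0_8 )
   !$acc parallel loop present( w ) reduction( max: wmax )
   DO  i = 1, n
      wmax = MAX( wmax, w(i) )
   ENDDO

!
!-- Pass 2: smallest index at which the maximum occurs (this is what
!-- MAXLOC would return)
   imax = n + 1
   !$acc parallel loop present( w ) reduction( min: imax )
   DO  i = 1, n
      IF ( w(i) == wmax )  imax = MIN( imax, i )
   ENDDO

END SUBROUTINE global_max_sketch
}}}
The second pass costs an extra sweep over the data, so this is only attractive if it avoids copying the whole field back to the host.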