Changes between Version 18 and Version 19 of doc/tec/gpu
Timestamp: Feb 10, 2016 8:30:14 AM
doc/tec/gpu
[…]

 * no canopy model
 * no Lagrangian particle model
 * random number generator needs to be ported
 * most of the I/O does not work
 * MG solver does not work / needs to be ported

Tests can be done on our GPU workstation (2 x 8-core Intel CPUs, 2 NVIDIA Kepler K40 boards), which runs as a node of the cluster system at LUIS. Access as follows:

[…]

{{{
%lopts  -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-fastsse:-Mcuda=cuda6.0:-lcufft  lcmuk parallel pgigpu146
}}}
Please note the settings of the cpp directives ({{{-D__nopointer -D__openacc -D__cuda_fft}}}) and the CUDA library path in {{{lopts}}}.\\
The {{{nocache}}} compiler switch '''is not required any more''' (earlier compiler versions, e.g. 13.6, gave a significant loss of performance if this switch was omitted). The {{{time}}} switch creates and outputs performance data at the end of a run. Very useful! \\ \\

It might be necessary to load the modules manually before calling mbuild or mrun:

[…]

{{{
export PGI_ACC_NOSYNCQUEUE=1
}}}
before calling mrun! The second one ({{{PGI_ACC_NOSYNCQUEUE=1}}}) is '''absolutely required''' when the CUDA-fft is used ({{{fft_method = 'system-specific'}}} + {{{-D__cuda_fft}}}). If it is not set, the pressure solver does not reduce the divergence! \\

Compiler version 14.10 gives a runtime error when pres is called for the first time in init_3d_model:

[…]

{{{
In the meantime, you can set the environment variable "PGI_ACC_NOSYNCQUEUE=1" to work around the issue.
}}}
The workaround worked, e.g., for 14.6, but maybe not for 14.10?
\\

A test parameter set can be found here:
{{{
/home/raasch/current_version/JOBS/acc_medium/INPUT/acc_medium_p3d
}}}
Please note that {{{loop_optimization = 'acc'}}} and {{{psolver = 'poisfft'}}} have to be set. {{{fft_method = 'system-specific'}}} is required to switch on the CUDA-fft; all other fft-methods do not run on the GPU and are therefore extremely slow (a minimal namelist sketch is given below). Results of the tests are stored in the respective {{{MONITORING}}} directory. \\ \\

mrun command to run on two GPU devices:
{{{
mrun -d acc_medium -h lcmuk -K "parallel pgigpu146" -X2 -T2 -r "d3#"
}}}
\\
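To make the parameter settings above concrete, here is a minimal sketch of the corresponding entries in the _p3d parameter file. It is an illustration only: the {{{&inipar}}} group name and the inline comments are my assumptions, and all remaining entries (grid, numerics, output) have to be taken from the acc_medium example file.
{{{
&inipar loop_optimization = 'acc',             ! loop kernels optimized for OpenACC
        psolver           = 'poisfft',         ! pressure solver that runs on the GPU
        fft_method        = 'system-specific', ! switches on the CUDA-fft (with -D__cuda_fft)
 /
}}}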
Runs on a single GPU without MPI (i.e. no domain decomposition) require this configuration:
{{{
%compiler_name      pgf90  lcmuk pgigpu146
%compiler_name_ser  pgf90  lcmuk pgigpu146
%cpp_options        -Mpreprocess:-D__nopointer:-D__openacc:-D__cuda_fft:-D__lc  lcmuk pgigpu146
%fopts              -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-Mcray=pointer:-fastsse:-Mcuda=cuda6.0  lcmuk pgigpu146
%lopts              -acc:-ta=tesla,6.0,nocache,time:-Minline:-Minfo=acc:-Mcray=pointer:-fastsse:-Mcuda=cuda6.0:-lcufft  lcmuk pgigpu146
}}}
Run it with
{{{
mrun -d acc_medium -K pgigpu146 -r "d3#"
}}}
\\

Some PGI compiler options for debugging:
{{{
-O0 -C -g -Mbounds -Mchkstk -traceback
}}}
\\

'''Report on current activities:'''

[…]

'''work packages / questions for the EuroHack:'''

 * getting the CUDA-aware MPI to run: for this, routines {{{time_integration}}} and {{{exchange_horiz}}} in r1749 have to be replaced by the routines that I provided. If the exchange of ghost points works sufficiently well, the next step would be to make the {{{MPI_ALLTOALL}}} in {{{transpose.f90}}} CUDA-aware. This should be very easy: just add (e.g.) {{{host_data use_device( f_inv, work )}}} clauses in front of the {{{MPI_ALLTOALL}}} calls and remove the existing {{{update host}}} and {{{data copyin}}} clauses (see the sketch at the end of this page). Also, the {{{update host}}} and {{{update device}}} clauses for array {{{ar}}} have to be removed in {{{poisfft}}}.
[…]
 * Routine {{{surface_layer_fluxes}}}: there are some loops (DO WHILE, DO without a specific loop counter, etc.) which cannot be vectorized
 * Routine {{{swap_timelevel}}}: Why can the compiler not vectorize FORTRAN vector assignments like {{{u = u_p}}}?
 * Routine {{{timestep}}}: the FORTRAN functions {{{MINLOC}}} and {{{MAXLOC}}}, which are used in routine {{{global_min_max}}}, do not run on the GPU. Is there a better way to realize these functions than the OpenACC workaround that is programmed in routine {{{timestep}}}? (A generic two-pass sketch is given at the end of this page.)
 * General question: Is it (meanwhile) possible to run code in parallel on the CPU and the GPU at the same time, or, in other words, to run something "in the background" on the GPU?

The following section at the end of the page was removed in v19:
'''Things that still need to be ported:'''
 * multigrid-solver
 * cloud physics
 * stuff related to non-cyclic BC
 * random-number generator
 * the complete LPM
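For the CUDA-aware {{{MPI_ALLTOALL}}} work package above, the following Fortran/OpenACC fragment sketches the intended change. It is a hypothetical illustration, not the actual {{{transpose.f90}}} code: the subroutine name, dummy arguments, the {{{sendrecvcount}}}/{{{comm2d}}} names and the use of {{{MPI_REAL}}} are placeholders.
{{{
 SUBROUTINE transpose_alltoall_sketch( f_inv, work, sendrecvcount, comm2d )

!-- Illustration only. f_inv and work are assumed to be already present in
!-- GPU device memory (e.g. via an enclosing "!$acc data" region).
    USE MPI

    IMPLICIT NONE

    INTEGER ::  comm2d, ierr, sendrecvcount
    REAL    ::  f_inv(*), work(*)

!-- Old, host-staged pattern (to be removed): the existing
!-- "!$acc update host( ... )" / "!$acc data copyin( ... )" directives
!-- around the MPI_ALLTOALL call are deleted.

!-- CUDA-aware pattern: host_data/use_device passes the device addresses of
!-- f_inv and work directly to MPI (requires a CUDA-aware MPI library).
    !$acc host_data use_device( f_inv, work )
    CALL MPI_ALLTOALL( f_inv, sendrecvcount, MPI_REAL,                        &
                       work,  sendrecvcount, MPI_REAL, comm2d, ierr )
    !$acc end host_data

 END SUBROUTINE transpose_alltoall_sketch
}}}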
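Regarding the {{{MINLOC}}}/{{{MAXLOC}}} item above, one generic way to realize MINLOC with plain OpenACC is a two-pass reduction. The routine below is only a self-contained sketch under the assumption that the array is already present on the device; it is not the workaround that is actually programmed in routine {{{timestep}}}.
{{{
 SUBROUTINE gpu_minloc_sketch( a, n, value_min, idx_min )

!-- Generic two-pass substitute for MINLOC on the GPU (illustration only).
!-- Assumes that array a is already present in device memory.
    IMPLICIT NONE

    INTEGER ::  i, idx_min, n
    REAL    ::  a(n), value_min

!-- Pass 1: global minimum value via a standard min-reduction
    value_min = HUGE( value_min )
    !$acc parallel loop reduction(min:value_min) present(a)
    DO  i = 1, n
       value_min = MIN( value_min, a(i) )
    ENDDO

!-- Pass 2: index of one occurrence of the minimum. The exact comparison is
!-- safe because value_min is one of the array elements; the max-reduction
!-- makes the result deterministic (the largest matching index is returned).
    idx_min = 0
    !$acc parallel loop reduction(max:idx_min) present(a)
    DO  i = 1, n
       IF ( a(i) == value_min )  idx_min = MAX( idx_min, i )
    ENDDO

 END SUBROUTINE gpu_minloc_sketch
}}}
Compared to a single-pass MINLOC this costs one extra sweep over the array, but both loops map onto standard OpenACC reductions.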