Changes between Version 14 and Version 15 of doc/tec/gpu


Timestamp: Feb 8, 2016 4:09:56 PM
Author: raasch

* porting of remaining things ({{{exchange_horiz_2d}}}, {{{calc_liquid_water_content}}}, {{{compute_vpt}}}, averaging, I/O, etc.)
* ...

'''work packages for the EuroHack:'''

* getting the CUDA-aware MPI to run: for this, the routines {{{time_integration}}} and {{{exchange_horiz}}} in r1747 have to be replaced by the routines that I provided. If the exchange of ghost points runs sufficiently well, the next step would be to make the {{{MPI_ALLTOALL}}} in {{{transpose.f90}}} CUDA-aware. This should be very easy: just add (e.g.) {{{host_data use_device( f_inv, work )}}} clauses in front of the {{{MPI_ALLTOALL}}} calls and remove the existing {{{update host}}} and {{{data copyin}}} clauses (see the first sketch after this list). Also, the {{{update host}}} and {{{update device}}} clauses for array {{{ar}}} have to be removed in {{{poisfft}}}.
* CUDA-FFT has been implemented and successfully tested in single-GPU (non-MPI) mode. It can be switched on using parameter {{{fft_method = 'system-specific'}}}. Additionally, the compiler switch {{{-D__cuda_fft}}} and the linker option {{{-lcufft}}} have to be set (see the settings sketch after this list). For an unknown reason, this method does not work in MPI mode (the pressure solver does not reduce the divergence).
* In general: do the existing clauses (e.g. {{{loop vector}}} / {{{loop gang}}}) give the best performance? (An alternative scheduling to compare against is sketched after this list.)
* For several routines, separate OpenACC versions had to be created. These are most of the tendency subroutines ({{{advec_ws}}}, {{{diffusion_...}}}, etc.), {{{prognostic_equations}}}, and {{{flow_statistics}}}. There are two main reasons for this: the PGI compiler (at least up to version 14.6) was unable to vectorize loops like {{{DO  k = nzb_s_inner(j,i), nzt}}}, where the loop bound depends on the indices of the outer loops. Does this restriction still exist? Furthermore, reduction operations only work for single scalar variables. Since in {{{flow_statistics}}} reductions are carried out in loops over several elements of arrays, separate temporary scalar variables had to be introduced (see the reduction sketch after this list). Are reduction operations still restricted to scalar variables?
* In routine {{{advec_ws}}} I had to introduce another array {{{wall_flags_00}}} to hold the wall flags for bits 32-63. It seems that the OpenACC/PGI compiler can only handle single-precision (32-bit) INTEGERs. Is that true? (The flag splitting is sketched after this list.)
* Routine {{{fft_xy}}}: the clause {{{!$acc declare create( ar_tmp )}}} no longer works starting with compiler version 14.1. Instead, I had to use {{{!$acc data create( ar_tmp )}}} clauses (see the sketch after this list). Why? Does this problem still exist with the current compiler version?
* Routine {{{surface_layer_fluxes}}}: inlining of functions (a possible alternative using the {{{!$acc routine}}} directive is sketched after this list)
* Routine {{{swap_timelevel}}}: why is the compiler unable to vectorize FORTRAN vector assignments like {{{u = u_p}}}? (A loop-based workaround is sketched after this list.)
* Routine {{{timestep}}}: is there a chance that the FORTRAN functions {{{MINLOC}}} and {{{MAXLOC}}}, which are used in routine {{{global_min_max}}}, are directly supported on the GPU? (A two-pass fallback is sketched after this list.)
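
'''Sketches for the work packages above''' (illustrative only; variable names, bounds, and argument lists beyond those mentioned in the items are assumptions):

Sketch for the CUDA-aware {{{MPI_ALLTOALL}}}: a {{{host_data use_device}}} region hands the device addresses of {{{f_inv}}} and {{{work}}} directly to MPI, so the former {{{update host}}} / {{{data copyin}}} clauses around the call can be dropped.
{{{
!-- Illustrative only: the argument list and the message size are assumed;
!-- f_inv and work are expected to be present on the device already.
SUBROUTINE transpose_alltoall_sketch( f_inv, work, sendcount, comm )

   USE MPI
   IMPLICIT NONE

   INTEGER, INTENT(IN)    ::  sendcount, comm
   REAL(8), INTENT(INOUT) ::  f_inv(:), work(:)
   INTEGER                ::  ierr

!-- With CUDA-aware MPI the device pointers are passed directly,
!-- no update host / data copyin around the call is required
!$acc host_data use_device( f_inv, work )
   CALL MPI_ALLTOALL( f_inv, sendcount, MPI_DOUBLE_PRECISION,                 &
                      work,  sendcount, MPI_DOUBLE_PRECISION, comm, ierr )
!$acc end host_data

END SUBROUTINE transpose_alltoall_sketch
}}}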
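Settings sketch for the CUDA-FFT item (the namelist name is an assumption; the switch and linker option are those given above):
{{{
!-- Parameter file excerpt (namelist name assumed). In addition, the
!-- preprocessor switch -D__cuda_fft and the linker option -lcufft have
!-- to be set when building.
&inipar  fft_method = 'system-specific',  /
}}}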
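Sketch for the {{{loop gang}}} / {{{loop vector}}} question, comparing the current hand-picked schedule with a collapsed nest whose scheduling is left to the compiler (bounds and array names are generic placeholders):
{{{
SUBROUTINE tendency_sketch( a, tend, nzb, nzt, nys, nyn, nxl, nxr )

   IMPLICIT NONE
   INTEGER, INTENT(IN)    ::  nzb, nzt, nys, nyn, nxl, nxr
   REAL(8), INTENT(IN)    ::  a(nzb:nzt,nys:nyn,nxl:nxr)
   REAL(8), INTENT(INOUT) ::  tend(nzb:nzt,nys:nyn,nxl:nxr)
   INTEGER                ::  i, j, k

!-- Variant 1: explicit gang/vector clauses as currently used
!$acc parallel copyin( a ) copy( tend )
!$acc loop gang
   DO  i = nxl, nxr
!$acc loop vector
      DO  j = nys, nyn
         DO  k = nzb+1, nzt
            tend(k,j,i) = tend(k,j,i) + a(k,j,i)
         ENDDO
      ENDDO
   ENDDO
!$acc end parallel

!-- Variant 2 (to benchmark): collapsed nest, scheduling chosen by the compiler
!$acc parallel loop collapse(3) copyin( a ) copy( tend )
   DO  i = nxl, nxr
      DO  j = nys, nyn
         DO  k = nzb+1, nzt
            tend(k,j,i) = tend(k,j,i) + a(k,j,i)
         ENDDO
      ENDDO
   ENDDO

END SUBROUTINE tendency_sketch
}}}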
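Sketch for the scalar-reduction workaround in {{{flow_statistics}}}: the horizontal sums are accumulated in temporary scalars ({{{s1}}}, {{{s2}}}) because reductions into array elements such as {{{sums_l(k,1)}}} were rejected (names and bounds are illustrative):
{{{
SUBROUTINE profile_sums_sketch( u, v, sums_l, nzb, nzt, nys, nyn, nxl, nxr )

   IMPLICIT NONE
   INTEGER, INTENT(IN)    ::  nzb, nzt, nys, nyn, nxl, nxr
   REAL(8), INTENT(IN)    ::  u(nzb:nzt,nys:nyn,nxl:nxr),                     &
                              v(nzb:nzt,nys:nyn,nxl:nxr)
   REAL(8), INTENT(INOUT) ::  sums_l(nzb:nzt,2)

   INTEGER ::  i, j, k
   REAL(8) ::  s1, s2

   DO  k = nzb, nzt
!--   Reduce into temporary scalars instead of sums_l(k,1) / sums_l(k,2)
      s1 = 0.0_8
      s2 = 0.0_8
!$acc parallel loop collapse(2) copyin( u, v ) reduction(+:s1,s2)
      DO  i = nxl, nxr
         DO  j = nys, nyn
            s1 = s1 + u(k,j,i)
            s2 = s2 + v(k,j,i)
         ENDDO
      ENDDO
      sums_l(k,1) = sums_l(k,1) + s1
      sums_l(k,2) = sums_l(k,2) + s2
   ENDDO

END SUBROUTINE profile_sums_sketch
}}}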
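Sketch of the flag splitting behind {{{wall_flags_00}}}: bits 0-31 stay in the original 32-bit flag word and bits 32-63 go into the extra word, so only default (32-bit) INTEGERs appear in the accelerated loops (function name is illustrative):
{{{
!-- Returns .TRUE. if flag bit 'bit' (0-63) is set; bits 0-31 are taken from
!-- wall_flags_0, bits 32-63 from the additional word wall_flags_00.
LOGICAL FUNCTION flag_set( wall_flags_0, wall_flags_00, bit )

   IMPLICIT NONE
   INTEGER, INTENT(IN) ::  wall_flags_0, wall_flags_00, bit

   IF ( bit <= 31 )  THEN
      flag_set = BTEST( wall_flags_0, bit )
   ELSE
      flag_set = BTEST( wall_flags_00, bit - 32 )
   ENDIF

END FUNCTION flag_set
}}}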
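Sketch for the {{{fft_xy}}} workaround: instead of the module-level {{{!$acc declare create( ar_tmp )}}}, a structured data region creates the device copy for the lifetime of the routine (everything apart from the directive names is illustrative):
{{{
SUBROUTINE fft_xy_sketch( n )

   IMPLICIT NONE
   INTEGER, INTENT(IN)   ::  n
   REAL(8), DIMENSION(n) ::  ar_tmp
   INTEGER               ::  i

!-- Workaround: data create instead of declare create
!$acc data create( ar_tmp )
!$acc parallel loop present( ar_tmp )
   DO  i = 1, n
      ar_tmp(i) = 0.0_8
   ENDDO
!$acc end data

END SUBROUTINE fft_xy_sketch
}}}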
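Sketch for the {{{surface_layer_fluxes}}} item: one way to avoid manual inlining, provided the compiler supports it, is to mark the small functions with {{{!$acc routine seq}}} so they can be called from device loops (function name and formula are placeholders):
{{{
!-- Placeholder stability function; the body is not the real one from the code.
FUNCTION psi_m_sketch( zeta )  RESULT( psi )
!$acc routine seq

   IMPLICIT NONE
   REAL(8), INTENT(IN) ::  zeta
   REAL(8)             ::  psi

   psi = -5.0_8 * zeta

END FUNCTION psi_m_sketch
}}}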
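Sketch for the {{{swap_timelevel}}} item: a loop-based workaround for the vector assignment {{{u = u_p}}} (bounds are generic placeholders):
{{{
SUBROUTINE swap_sketch( u, u_p, nzb, nzt, nysg, nyng, nxlg, nxrg )

   IMPLICIT NONE
   INTEGER, INTENT(IN)  ::  nzb, nzt, nysg, nyng, nxlg, nxrg
   REAL(8), INTENT(OUT) ::  u(nzb:nzt+1,nysg:nyng,nxlg:nxrg)
   REAL(8), INTENT(IN)  ::  u_p(nzb:nzt+1,nysg:nyng,nxlg:nxrg)
   INTEGER              ::  i, j, k

!-- Explicit loops instead of the array assignment u = u_p
!$acc parallel loop collapse(3) copyin( u_p ) copyout( u )
   DO  i = nxlg, nxrg
      DO  j = nysg, nyng
         DO  k = nzb, nzt+1
            u(k,j,i) = u_p(k,j,i)
         ENDDO
      ENDDO
   ENDDO

END SUBROUTINE swap_sketch
}}}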
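Sketch for the {{{MINLOC}}}/{{{MAXLOC}}} question: a possible fallback for {{{global_min_max}}}, if the intrinsics are not supported on the device, is a two-pass reduction, first for the value and then for its (first) index (1-D case only; names are illustrative):
{{{
SUBROUTINE min_and_loc_sketch( a, n, amin, loc )

   IMPLICIT NONE
   INTEGER, INTENT(IN)  ::  n
   REAL(8), INTENT(IN)  ::  a(n)
   REAL(8), INTENT(OUT) ::  amin
   INTEGER, INTENT(OUT) ::  loc

   INTEGER ::  i

!-- Pass 1: minimum value
   amin = HUGE( amin )
!$acc parallel loop copyin( a ) reduction(min:amin)
   DO  i = 1, n
      amin = MIN( amin, a(i) )
   ENDDO

!-- Pass 2: smallest index holding that value (equivalent to MINLOC)
   loc = n + 1
!$acc parallel loop copyin( a ) reduction(min:loc)
   DO  i = 1, n
      IF ( a(i) == amin )  loc = MIN( loc, i )
   ENDDO

END SUBROUTINE min_and_loc_sketch
}}}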