'''Work packages for the EuroHack:'''
* Getting the CUDA-aware MPI to run: for this, the routines {{{time_integration}}} and {{{exchange_horiz}}} in r1747 have to be replaced by the routines that I provided. Once the exchange of ghost points runs reliably, the next step would be to make the {{{MPI_ALLTOALL}}} in {{{transpose.f90}}} CUDA-aware. This should be very easy: just add (e.g.) {{{host_data use_device( f_inv, work )}}} directives in front of the {{{MPI_ALLTOALL}}} calls and remove the existing {{{update host}}} and {{{data copyin}}} directives (see the first sketch below this list). Also, the {{{update host}}} and {{{update device}}} directives for array {{{ar}}} have to be removed in {{{poisfft}}}.
* A CUDA FFT has been implemented and successfully tested in single-GPU (non-MPI) mode. It can be switched on with the parameter {{{fft_method = 'system-specific'}}}; additionally, the compiler switch {{{-D__cuda_fft}}} and the linker option {{{-lcufft}}} have to be set. For an unknown reason, this method does not work in MPI mode (the pressure solver does not reduce the divergence).
* In general: do the existing clauses (e.g. {{{loop vector}}} / {{{loop gang}}}) give the best performance? (A typical tuning pattern is sketched below this list.)
* For several routines, separate OpenACC versions had to be created, namely for most of the tendency subroutines ({{{advec_ws}}}, {{{diffusion_...}}}, etc.), for {{{prognostic_equations}}}, and for {{{flow_statistics}}}. There are two main reasons for this: first, the PGI compiler (at least up to version 14.6) was unable to vectorize loops like {{{DO k = nzb_s_inner(j,i), nzt}}}, where the lower bound depends on the indices of the outer loops. Does this restriction still exist? Second, reduction operations only work on single scalar variables. Since {{{flow_statistics}}} carries out reductions on several array elements within its loops, separate temporary scalar variables had to be introduced (see the reduction sketch below this list). Are reduction operations still restricted to scalar variables?
* In routine {{{advec_ws}}} I had to introduce another array, {{{wall_flags_00}}}, to hold the wall flags for bits 32-63. It seems that OpenACC with the PGI compiler can only handle 32-bit INTEGERs. Is that true? (A bit-flag sketch follows below this list.)
* Routine {{{fft_xy}}}: the directive {{{!$acc declare create( ar_tmp )}}} does not work any more starting with compiler version 14.1; instead, I had to use {{{!$acc data create( ar_tmp )}}} regions (sketched below this list). Why? Does this problem still exist in the current compiler version?
* Routine {{{surface_layer_fluxes}}}: inlining of the functions called there is still an open task.
* Routine {{{swap_timelevel}}}: Why can the compiler not vectorize Fortran array assignments like {{{u = u_p}}}? (An explicit-loop workaround is sketched below this list.)
* Routine {{{timestep}}}: Is there a chance that the Fortran intrinsics {{{MINLOC}}} and {{{MAXLOC}}}, which are used in routine {{{global_min_max}}}, are directly supported on the GPU? (A manual two-pass fallback is sketched below this list.)
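A minimal sketch of the CUDA-aware {{{MPI_ALLTOALL}}} change in {{{transpose.f90}}} described in the first item. The count and communicator names ({{{sendrecvcount}}}, {{{comm1dx}}}) are placeholders, and a surrounding device data region for {{{f_inv}}} and {{{work}}} is assumed:

{{{
!-- Before: data are staged through the host
!   !$acc update host( f_inv )
!   CALL MPI_ALLTOALL( f_inv, sendrecvcount, MPI_REAL, &
!                      work,  sendrecvcount, MPI_REAL, comm1dx, ierr )
!   !$acc update device( work )

!-- After: hand the device addresses directly to the CUDA-aware MPI library
    !$acc host_data use_device( f_inv, work )
    CALL MPI_ALLTOALL( f_inv, sendrecvcount, MPI_REAL, &
                       work,  sendrecvcount, MPI_REAL, comm1dx, ierr )
    !$acc end host_data
}}}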
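Regarding the clause-tuning question: a typical experiment is to vary the gang/vector placement and the vector length on the standard loop nest, schematically (array names and bounds are placeholders):

{{{
    !$acc kernels present( tend, u )
    !$acc loop gang
    DO  i = nxl, nxr
       !$acc loop
       DO  j = nys, nyn
          !$acc loop vector( 128 )    ! try 32/64/128/256 here
          DO  k = nzb+1, nzt
             tend(k,j,i) = tend(k,j,i) + u(k,j,i)
          ENDDO
       ENDDO
    ENDDO
    !$acc end kernels
}}}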
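The {{{flow_statistics}}} workaround mentioned above, combined with the index-dependent loop bound: since the reduction target must be a scalar, a temporary {{{s1}}} collects the sum and is copied to the (here hypothetical) array element afterwards:

{{{
    s1 = 0.0
    !$acc parallel loop collapse( 2 ) reduction( +: s1 )                   &
    !$acc present( w, nzb_s_inner )
    DO  i = nxl, nxr
       DO  j = nys, nyn
!--       Lower bound depends on the outer indices; this is the loop form
!--       the PGI compiler could not vectorize
          DO  k = nzb_s_inner(j,i)+1, nzt
             s1 = s1 + w(k,j,i)
          ENDDO
       ENDDO
    ENDDO
    sums_l(nzb,3,0) = s1    ! reduction( +: sums_l(nzb,3,0) ) is rejected
}}}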
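For the {{{wall_flags_00}}} item, the split into two default-kind (32-bit) INTEGER arrays looks schematically like this (bit numbers are illustrative):

{{{
!-- Bits 0-31 are kept in wall_flags_0, bits 32-63 in wall_flags_00
    flag_a = IBITS( wall_flags_0(k,j,i),  5, 1 )        ! global bit  5
    flag_b = IBITS( wall_flags_00(k,j,i), 35 - 32, 1 )  ! global bit 35
}}}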
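The {{{fft_xy}}} workaround, sketched; the shape of {{{ar_tmp}}} and the worker call are placeholders:

{{{
    COMPLEX, DIMENSION(0:nx,nys:nyn) ::  ar_tmp    ! placeholder shape

!-- Device allocation at declaration time; broken from PGI 14.1 on:
!   !$acc declare create( ar_tmp )

!-- Workaround: an explicit data region around the usage
    !$acc data create( ar_tmp )
    CALL fft_x_work( ar, ar_tmp )    ! hypothetical worker routine
    !$acc end data
}}}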
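If the array assignment in {{{swap_timelevel}}} is not parallelized, the usual workaround is to spell the copy out as explicit loops (the ghost-point bounds {{{nxlg}}} etc. are placeholders):

{{{
!-- Array assignment the compiler does not move to the device:
!   u = u_p

!-- Equivalent explicit loops inside an OpenACC region:
    !$acc kernels present( u, u_p )
    DO  i = nxlg, nxrg
       DO  j = nysg, nyng
          DO  k = nzb, nzt+1
             u(k,j,i) = u_p(k,j,i)
          ENDDO
       ENDDO
    ENDDO
    !$acc end kernels
}}}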
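If {{{MINLOC}}}/{{{MAXLOC}}} remain unsupported on the device, a fallback for {{{global_min_max}}} is a two-pass reduction: first the value, then its index. Note that, unlike {{{MINLOC}}}, this returns the largest matching index; {{{a}}} and {{{n}}} are placeholders:

{{{
!-- Pass 1: minimum value (scalar min reduction is supported)
    value = HUGE( 1.0 )
    !$acc parallel loop reduction( min: value ) present( a )
    DO  i = 1, n
       value = MIN( value, a(i) )
    ENDDO

!-- Pass 2: recover a location of that minimum
    loc = 0
    !$acc parallel loop reduction( max: loc ) present( a )
    DO  i = 1, n
       IF ( a(i) == value )  loc = MAX( loc, i )
    ENDDO
}}}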