source: palm/trunk/SOURCE/transpose.f90 @ 1111

Last change on this file since 1111 was 1111, checked in by raasch, 12 years ago

New:
---

GPU porting of pres, swap_timelevel. Adjustments of openACC directives.
Further porting of poisfft, which now runs completely on GPU without any
host/device data transfer for serial an parallel runs (but parallel runs
require data transfer before and after the MPI transpositions).
GPU-porting of tridiagonal solver:
tridiagonal routines split into extermal subroutines (instead using CONTAINS),
no distinction between parallel/non-parallel in poisfft and tridia any more,
tridia routines moved to end of file because of probable bug in PGI compiler
(otherwise "invalid device function" is indicated during runtime).
(cuda_fft_interfaces, fft_xy, flow_statistics, init_3d_model, palm, poisfft, pres, prognostic_equations, swap_timelevel, time_integration, transpose)
output of accelerator board information. (header)

optimization of tridia routines: constant elements and coefficients of tri are
stored in seperate arrays ddzuw and tric, last dimension of tri reduced from 5 to 2,
(init_grid, init_3d_model, modules, palm, poisfft)

poisfft_init is now called internally from poisfft,
(Makefile, Makefile_check, init_pegrid, poisfft, poisfft_hybrid)

CPU-time per grid point and timestep is output to CPU_MEASURES file
(cpu_statistics, modules, time_integration)

Changed:


resorting from/to array work changed, work now has 4 dimensions instead of 1 (transpose)
array diss allocated only if required (init_3d_model)

pressure boundary condition "Neumann+inhomo" removed from the code
(check_parameters, header, poisfft, poisfft_hybrid, pres)

Errors:


bugfix: dependency added for cuda_fft_interfaces (Makefile)
bugfix: CUDA fft plans adjusted for domain decomposition (before they always
used total domain) (fft_xy)

  • Property svn:keywords set to Id
File size: 23.8 KB
RevLine 
[164]1 SUBROUTINE transpose_xy( f_in, work, f_out )
[1]2
[1036]3!--------------------------------------------------------------------------------!
4! This file is part of PALM.
5!
6! PALM is free software: you can redistribute it and/or modify it under the terms
7! of the GNU General Public License as published by the Free Software Foundation,
8! either version 3 of the License, or (at your option) any later version.
9!
10! PALM is distributed in the hope that it will be useful, but WITHOUT ANY
11! WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
12! A PARTICULAR PURPOSE.  See the GNU General Public License for more details.
13!
14! You should have received a copy of the GNU General Public License along with
15! PALM. If not, see <http://www.gnu.org/licenses/>.
16!
17! Copyright 1997-2012  Leibniz University Hannover
18!--------------------------------------------------------------------------------!
19!
[484]20! Current revisions:
[1]21! -----------------
[1111]22! openACC directives added,
23! resorting data from/to work changed, work got 4 dimensions instead of 1
[198]24!
25! Former revisions:
26! -----------------
27! $Id: transpose.f90 1111 2013-03-08 23:54:10Z raasch $
28!
[1107]29! 1106 2013-03-04 05:31:38Z raasch
30! preprocessor lines rearranged so that routines can also be used in serial
31! (non-parallel) mode
32!
[1093]33! 1092 2013-02-02 11:24:22Z raasch
34! unused variables removed
35!
[1037]36! 1036 2012-10-22 13:43:42Z raasch
37! code put under GPL (PALM 3.9)
38!
[1004]39! 1003 2012-09-14 14:35:53Z raasch
40! indices nxa, nya, etc. replaced by nx, ny, etc.
41!
[684]42! 683 2011-02-09 14:25:15Z raasch
43! openMP parallelization of transpositions for 2d-domain-decomposition
44!
[623]45! 622 2010-12-10 08:08:13Z raasch
46! optional barriers included in order to speed up collective operations
47!
[198]48! 164 2008-05-15 08:46:15Z raasch
[164]49! f_inv changed from subroutine argument to automatic array in order to do
50! re-ordering from f_in to f_inv in one step, one array work is needed instead
51! of work1 and work2
[1]52!
[198]53! February 2007
[3]54! RCS Log replace by Id keyword, revision history cleaned up
55!
[1]56! Revision 1.2  2004/04/30 13:12:17  raasch
57! Switched from mpi_alltoallv to the simpler mpi_alltoall,
58! all former transpose-routine files collected in this file, enlarged
59! transposition arrays introduced
60!
61! Revision 1.1  2004/04/30 13:08:16  raasch
62! Initial revision (collection of former routines transpose_xy, transpose_xz,
63!                   transpose_yx, transpose_yz, transpose_zx, transpose_zy)
64!
65! Revision 1.1  1997/07/24 11:25:18  raasch
66! Initial revision
67!
68!
69! Description:
70! ------------
71! Transposition of input array (f_in) from x to y. For the input array, all
72! elements along x reside on the same PE, while after transposition, all
73! elements along y reside on the same PE.
74!------------------------------------------------------------------------------!
75
76    USE cpulog
77    USE indices
78    USE interfaces
79    USE pegrid
80    USE transpose_indices
81
82    IMPLICIT NONE
83
[1111]84    INTEGER ::  i, j, k, l, ys
[1]85   
[1111]86    REAL ::  f_in(0:nx,nys_x:nyn_x,nzb_x:nzt_x), f_out(0:ny,nxl_y:nxr_y,nzb_y:nzt_y)
[1]87
[1111]88    REAL, DIMENSION(nyn_x-nys_x+1,nzb_y:nzt_y,nxl_y:nxr_y,0:pdims(2)-1) ::  work
89
90    !$acc declare create( f_inv )
91    REAL ::  f_inv(nys_x:nyn_x,nzb_x:nzt_x,0:nx)
92
93
[1]94!
95!-- Rearrange indices of input array in order to make data to be send
96!-- by MPI contiguous
[683]97!$OMP  PARALLEL PRIVATE ( i, j, k )
98!$OMP  DO
[1111]99    !$acc kernels present( f_in )
100    !$acc loop
[1003]101    DO  i = 0, nx
102       DO  k = nzb_x, nzt_x
[1111]103          !$acc loop vector( 32 )
[1003]104          DO  j = nys_x, nyn_x
[164]105             f_inv(j,k,i) = f_in(i,j,k)
[1]106          ENDDO
107       ENDDO
108    ENDDO
[1111]109    !$acc end kernels
[683]110!$OMP  END PARALLEL
[1]111
[1106]112    IF ( numprocs /= 1 )  THEN
113
114#if defined( __parallel )
[1]115!
[1106]116!--    Transpose array
117       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
118       IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
[1111]119       !$acc update host( f_inv )
120       CALL MPI_ALLTOALL( f_inv(nys_x,nzb_x,0),  sendrecvcount_xy, MPI_REAL, &
121                          work(1,nzb_y,nxl_y,0), sendrecvcount_xy, MPI_REAL, &
[1106]122                          comm1dy, ierr )
[1111]123       !$acc update device( work )
[1106]124       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
[1]125
126!
[1106]127!--    Reorder transposed array
[1111]128!$OMP  PARALLEL PRIVATE ( i, j, k, l, ys )
[683]129!$OMP  DO
[1106]130       DO  l = 0, pdims(2) - 1
131          ys = 0 + l * ( nyn_x - nys_x + 1 )
[1111]132          !$acc kernels present( f_out, work )
133          !$acc loop
[1106]134          DO  i = nxl_y, nxr_y
135             DO  k = nzb_y, nzt_y
[1111]136                !$acc loop vector( 32 )
[1106]137                DO  j = ys, ys + nyn_x - nys_x
[1111]138                   f_out(j,i,k) = work(j-ys+1,k,i,l)
[1106]139                ENDDO
[1]140             ENDDO
141          ENDDO
[1111]142          !$acc end kernels
[1]143       ENDDO
[683]144!$OMP  END PARALLEL
[1]145#endif
146
[1106]147    ELSE
148
149!
150!--    Reorder transposed array
151!$OMP  PARALLEL PRIVATE ( i, j, k )
152!$OMP  DO
[1111]153       !$acc kernels present( f_out )
154       !$acc loop
[1106]155       DO  k = nzb_y, nzt_y
156          DO  i = nxl_y, nxr_y
[1111]157             !$acc loop vector( 32 )
[1106]158             DO  j = 0, ny
159                f_out(j,i,k) = f_inv(j,k,i)
160             ENDDO
161          ENDDO
162       ENDDO
[1111]163       !$acc end kernels
[1106]164!$OMP  END PARALLEL
165
166    ENDIF
167
[1]168 END SUBROUTINE transpose_xy
169
170
[164]171 SUBROUTINE transpose_xz( f_in, work, f_out )
[1]172
173!------------------------------------------------------------------------------!
174! Description:
175! ------------
176! Transposition of input array (f_in) from x to z. For the input array, all
177! elements along x reside on the same PE, while after transposition, all
178! elements along z reside on the same PE.
179!------------------------------------------------------------------------------!
180
181    USE cpulog
182    USE indices
183    USE interfaces
184    USE pegrid
185    USE transpose_indices
186
187    IMPLICIT NONE
188
[1111]189    INTEGER ::  i, j, k, l, xs
[1]190   
[1111]191    REAL ::  f_in(0:nx,nys_x:nyn_x,nzb_x:nzt_x), f_out(1:nz,nys:nyn,nxl:nxr)
[1]192
[1111]193    REAL, DIMENSION(nys_x:nyn_x,nnx,nzb_x:nzt_x,0:pdims(1)-1) ::  work
[1]194
[1111]195    !$acc declare create( f_inv )
196    REAL ::  f_inv(nys:nyn,nxl:nxr,1:nz)
197
198
[1]199!
200!-- If the PE grid is one-dimensional along y, the array has only to be
201!-- reordered locally and therefore no transposition has to be done.
202    IF ( pdims(1) /= 1 )  THEN
[1106]203
204#if defined( __parallel )
[1]205!
206!--    Reorder input array for transposition
[1111]207!$OMP  PARALLEL PRIVATE ( i, j, k, l, xs )
[683]208!$OMP  DO
[1]209       DO  l = 0, pdims(1) - 1
210          xs = 0 + l * nnx
[1111]211          !$acc kernels present( f_in, work )
212          !$acc loop
[1003]213          DO  k = nzb_x, nzt_x
[164]214             DO  i = xs, xs + nnx - 1
[1111]215                !$acc loop vector( 32 )
[1003]216                DO  j = nys_x, nyn_x
[1111]217                   work(j,i-xs+1,k,l) = f_in(i,j,k)
[1]218                ENDDO
219             ENDDO
220          ENDDO
[1111]221          !$acc end kernels
[1]222       ENDDO
[683]223!$OMP  END PARALLEL
[1]224
225!
226!--    Transpose array
227       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
[622]228       IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
[1111]229       !$acc update host( work )
230       CALL MPI_ALLTOALL( work(nys_x,1,nzb_x,0), sendrecvcount_zx, MPI_REAL, &
231                          f_inv(nys,nxl,1),      sendrecvcount_zx, MPI_REAL, &
[1]232                          comm1dx, ierr )
[1111]233       !$acc update device( f_inv )
[1]234       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
235
236!
237!--    Reorder transposed array in a way that the z index is in first position
[683]238!$OMP  PARALLEL PRIVATE ( i, j, k )
239!$OMP  DO
[1111]240       !$acc kernels present( f_out )
241       !$acc loop
[1003]242       DO  k = 1, nz
243          DO  i = nxl, nxr
[1111]244             !$acc loop vector( 32 )
[1003]245             DO  j = nys, nyn
[164]246                f_out(k,j,i) = f_inv(j,i,k)
[1]247             ENDDO
248          ENDDO
249       ENDDO
[1111]250       !$acc end kernels
[683]251!$OMP  END PARALLEL
[1106]252#endif
253
[1]254    ELSE
[1106]255
[1]256!
257!--    Reorder the array in a way that the z index is in first position
[683]258!$OMP  PARALLEL PRIVATE ( i, j, k )
259!$OMP  DO
[1111]260       !$acc kernels present( f_in )
261       !$acc loop
[1003]262       DO  i = nxl, nxr
263          DO  j = nys, nyn
[1111]264             !$acc loop vector( 32 )
[1003]265             DO  k = 1, nz
[164]266                f_inv(j,i,k) = f_in(i,j,k)
[1]267             ENDDO
268          ENDDO
269       ENDDO
[1111]270       !$acc end kernels
[683]271!$OMP  END PARALLEL
[1]272
[683]273!$OMP  PARALLEL PRIVATE ( i, j, k )
274!$OMP  DO
[1111]275       !$acc kernels present( f_out )
276       !$acc loop
[1003]277       DO  k = 1, nz
278          DO  i = nxl, nxr
[1111]279             !$acc loop vector( 32 )
[1003]280             DO  j = nys, nyn
[164]281                f_out(k,j,i) = f_inv(j,i,k)
282             ENDDO
[1]283          ENDDO
284       ENDDO
[1111]285       !$acc end kernels
[683]286!$OMP  END PARALLEL
[1]287
[164]288    ENDIF
289
[1]290 END SUBROUTINE transpose_xz
291
292
[164]293 SUBROUTINE transpose_yx( f_in, work, f_out )
[1]294
295!------------------------------------------------------------------------------!
296! Description:
297! ------------
298! Transposition of input array (f_in) from y to x. For the input array, all
299! elements along y reside on the same PE, while after transposition, all
300! elements along x reside on the same PE.
301!------------------------------------------------------------------------------!
302
303    USE cpulog
304    USE indices
305    USE interfaces
306    USE pegrid
307    USE transpose_indices
308
309    IMPLICIT NONE
310
[1111]311    INTEGER ::  i, j, k, l, ys
[1]312   
[1111]313    REAL ::  f_in(0:ny,nxl_y:nxr_y,nzb_y:nzt_y), f_out(0:nx,nys_x:nyn_x,nzb_x:nzt_x)
[1]314
[1111]315    REAL, DIMENSION(nyn_x-nys_x+1,nzb_y:nzt_y,nxl_y:nxr_y,0:pdims(2)-1) ::  work
316
317    !$acc declare create( f_inv )
318    REAL ::  f_inv(nys_x:nyn_x,nzb_x:nzt_x,0:nx)
319
320
[1106]321    IF ( numprocs /= 1 )  THEN
322
[1]323#if defined( __parallel )
324!
[1106]325!--    Reorder input array for transposition
[1111]326!$OMP  PARALLEL PRIVATE ( i, j, k, l, ys )
[683]327!$OMP  DO
[1106]328       DO  l = 0, pdims(2) - 1
329          ys = 0 + l * ( nyn_x - nys_x + 1 )
[1111]330          !$acc kernels present( f_in, work )
331          !$acc loop
[1106]332          DO  i = nxl_y, nxr_y
333             DO  k = nzb_y, nzt_y
[1111]334                !$acc loop vector( 32 )
[1106]335                DO  j = ys, ys + nyn_x - nys_x
[1111]336                   work(j-ys+1,k,i,l) = f_in(j,i,k)
[1106]337                ENDDO
338             ENDDO
339          ENDDO
[1111]340          !$acc end kernels
[1106]341       ENDDO
342!$OMP  END PARALLEL
343
344!
345!--    Transpose array
346       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
347       IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
[1111]348       !$acc update host( work )
349       CALL MPI_ALLTOALL( work(1,nzb_y,nxl_y,0), sendrecvcount_xy, MPI_REAL, &
350                          f_inv(nys_x,nzb_x,0),  sendrecvcount_xy, MPI_REAL, &
[1106]351                          comm1dy, ierr )
[1111]352       !$acc update device( f_inv )
[1106]353       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
354#endif
355
356    ELSE
357
358!
359!--    Reorder array f_in the same way as ALLTOALL did it
360!$OMP  PARALLEL PRIVATE ( i, j, k )
361!$OMP  DO
[1111]362       !$acc kernels present( f_in )
363       !$acc loop
[1003]364       DO  i = nxl_y, nxr_y
365          DO  k = nzb_y, nzt_y
[1111]366             !$acc loop vector( 32 )
[1106]367             DO  j = 0, ny
368                f_inv(j,k,i) = f_in(j,i,k)
[1]369             ENDDO
370          ENDDO
371       ENDDO
[1111]372       !$acc end kernels
[683]373!$OMP  END PARALLEL
[1]374
[1106]375    ENDIF
[1]376
377!
378!-- Reorder transposed array in a way that the x index is in first position
[683]379!$OMP  PARALLEL PRIVATE ( i, j, k )
380!$OMP  DO
[1111]381    !$acc kernels present( f_out )
382    !$acc loop
[1003]383    DO  i = 0, nx
384       DO  k = nzb_x, nzt_x
[1111]385          !$acc loop vector( 32 )
[1003]386          DO  j = nys_x, nyn_x
[164]387             f_out(i,j,k) = f_inv(j,k,i)
[1]388          ENDDO
389       ENDDO
390    ENDDO
[1111]391    !$acc end kernels
[683]392!$OMP  END PARALLEL
[1]393
394 END SUBROUTINE transpose_yx
395
396
[164]397 SUBROUTINE transpose_yxd( f_in, work, f_out )
[1]398
399!------------------------------------------------------------------------------!
400! Description:
401! ------------
402! Transposition of input array (f_in) from y to x. For the input array, all
403! elements along y reside on the same PE, while after transposition, all
404! elements along x reside on the same PE.
405! This is a direct transposition for arrays with indices in regular order
406! (k,j,i) (cf. transpose_yx).
407!------------------------------------------------------------------------------!
408
409    USE cpulog
410    USE indices
411    USE interfaces
412    USE pegrid
413    USE transpose_indices
414
415    IMPLICIT NONE
416
[1092]417    INTEGER ::  i, j, k, l, m, xs
[1]418
[1003]419    REAL ::  f_in(1:nz,nys:nyn,nxl:nxr), f_inv(nxl:nxr,1:nz,nys:nyn), &
420             f_out(0:nx,nys_x:nyn_x,nzb_x:nzt_x),                     &
[164]421             work(nnx*nny*nnz)
[1]422
423#if defined( __parallel )
424
425!
426!-- Rearrange indices of input array in order to make data to be send
427!-- by MPI contiguous
[1003]428    DO  k = 1, nz
429       DO  j = nys, nyn
430          DO  i = nxl, nxr
[164]431             f_inv(i,k,j) = f_in(k,j,i)
[1]432          ENDDO
433       ENDDO
434    ENDDO
435
436!
437!-- Transpose array
438    CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
[622]439    IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
[1]440    CALL MPI_ALLTOALL( f_inv(nxl,1,nys), sendrecvcount_xy, MPI_REAL, &
[164]441                       work(1),          sendrecvcount_xy, MPI_REAL, &
[1]442                       comm1dx, ierr )
443    CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
444
445!
446!-- Reorder transposed array
447    m = 0
448    DO  l = 0, pdims(1) - 1
449       xs = 0 + l * nnx
[1003]450       DO  j = nys_x, nyn_x
451          DO  k = 1, nz
[1]452             DO  i = xs, xs + nnx - 1
453                m = m + 1
[164]454                f_out(i,j,k) = work(m)
[1]455             ENDDO
456          ENDDO
457       ENDDO
458    ENDDO
459
460#endif
461
462 END SUBROUTINE transpose_yxd
463
464
[164]465 SUBROUTINE transpose_yz( f_in, work, f_out )
[1]466
467!------------------------------------------------------------------------------!
468! Description:
469! ------------
470! Transposition of input array (f_in) from y to z. For the input array, all
471! elements along y reside on the same PE, while after transposition, all
472! elements along z reside on the same PE.
473!------------------------------------------------------------------------------!
474
475    USE cpulog
476    USE indices
477    USE interfaces
478    USE pegrid
479    USE transpose_indices
480
481    IMPLICIT NONE
482
[1111]483    INTEGER ::  i, j, k, l, zs
[1]484   
[1111]485    REAL ::  f_in(0:ny,nxl_y:nxr_y,nzb_y:nzt_y), f_out(nxl_z:nxr_z,nys_z:nyn_z,1:nz)
[1]486
[1111]487    REAL, DIMENSION(nxl_z:nxr_z,nzt_y-nzb_y+1,nys_z:nyn_z,0:pdims(1)-1) ::  work
488
489    !$acc declare create( f_inv )
490    REAL ::  f_inv(nxl_y:nxr_y,nzb_y:nzt_y,0:ny)
491
492
[1]493!
494!-- Rearrange indices of input array in order to make data to be send
495!-- by MPI contiguous
[683]496!$OMP  PARALLEL PRIVATE ( i, j, k )
497!$OMP  DO
[1111]498    !$acc kernels present( f_in )
499    !$acc loop
[1003]500    DO  j = 0, ny
501       DO  k = nzb_y, nzt_y
[1111]502          !$acc loop vector( 32 )
[1003]503          DO  i = nxl_y, nxr_y
[164]504             f_inv(i,k,j) = f_in(j,i,k)
[1]505          ENDDO
506       ENDDO
507    ENDDO
[1111]508    !$acc end kernels
[683]509!$OMP  END PARALLEL
[1]510
511!
512!-- Move data to different array, because memory location of work1 is
513!-- needed further below (work1 = work2).
514!-- If the PE grid is one-dimensional along y, only local reordering
515!-- of the data is necessary and no transposition has to be done.
516    IF ( pdims(1) == 1 )  THEN
[1106]517
[683]518!$OMP  PARALLEL PRIVATE ( i, j, k )
519!$OMP  DO
[1111]520       !$acc kernels present( f_out )
521       !$acc loop
[1003]522       DO  j = 0, ny
523          DO  k = nzb_y, nzt_y
[1111]524             !$acc loop vector( 32 )
[1003]525             DO  i = nxl_y, nxr_y
[164]526                f_out(i,j,k) = f_inv(i,k,j)
[1]527             ENDDO
528          ENDDO
529       ENDDO
[1111]530       !$acc end kernels
[683]531!$OMP  END PARALLEL
[1]532
[1106]533    ELSE
534
535#if defined( __parallel )
[1]536!
[1106]537!--    Transpose array
538       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
539       IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
[1111]540       !$acc update host( f_inv )
541       CALL MPI_ALLTOALL( f_inv(nxl_y,nzb_y,0),  sendrecvcount_yz, MPI_REAL, &
542                          work(nxl_z,1,nys_z,0), sendrecvcount_yz, MPI_REAL, &
[1106]543                          comm1dx, ierr )
[1111]544       !$acc update device( work )
[1106]545       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
[1]546
547!
[1106]548!--    Reorder transposed array
[1111]549!$OMP  PARALLEL PRIVATE ( i, j, k, l, zs )
[683]550!$OMP  DO
[1106]551       DO  l = 0, pdims(1) - 1
552          zs = 1 + l * ( nzt_y - nzb_y + 1 )
[1111]553          !$acc kernels present( f_out, work )
554          !$acc loop
[1106]555          DO  j = nys_z, nyn_z
556             DO  k = zs, zs + nzt_y - nzb_y
[1111]557                !$acc loop vector( 32 )
[1106]558                DO  i = nxl_z, nxr_z
[1111]559                   f_out(i,j,k) = work(i,k-zs+1,j,l)
[1106]560                ENDDO
[1]561             ENDDO
562          ENDDO
[1111]563          !$acc end kernels
[1]564       ENDDO
[683]565!$OMP  END PARALLEL
[1]566#endif
567
[1106]568   ENDIF
569
[1]570 END SUBROUTINE transpose_yz
571
572
[164]573 SUBROUTINE transpose_zx( f_in, work, f_out )
[1]574
575!------------------------------------------------------------------------------!
576! Description:
577! ------------
578! Transposition of input array (f_in) from z to x. For the input array, all
579! elements along z reside on the same PE, while after transposition, all
580! elements along x reside on the same PE.
581!------------------------------------------------------------------------------!
582
583    USE cpulog
584    USE indices
585    USE interfaces
586    USE pegrid
587    USE transpose_indices
588
589    IMPLICIT NONE
590
[1111]591    INTEGER ::  i, j, k, l, xs
[1]592   
[1111]593    REAL ::  f_in(1:nz,nys:nyn,nxl:nxr), f_out(0:nx,nys_x:nyn_x,nzb_x:nzt_x)
[1]594
[1111]595    REAL, DIMENSION(nys_x:nyn_x,nnx,nzb_x:nzt_x,0:pdims(1)-1) ::  work
596
597    !$acc declare create( f_inv )
[1106]598    REAL ::  f_inv(nys:nyn,nxl:nxr,1:nz)
[1]599
[1106]600
[1]601!
602!-- Rearrange indices of input array in order to make data to be send
603!-- by MPI contiguous
[683]604!$OMP  PARALLEL PRIVATE ( i, j, k )
605!$OMP  DO
[1106]606    !$acc kernels present( f_in )
607    !$acc loop
[1003]608    DO  k = 1,nz
609       DO  i = nxl, nxr
[1106]610          !$acc loop vector( 32 )
[1003]611          DO  j = nys, nyn
[164]612             f_inv(j,i,k) = f_in(k,j,i)
[1]613          ENDDO
614       ENDDO
615    ENDDO
[1111]616    !$acc end kernels
[683]617!$OMP  END PARALLEL
[1]618
619!
620!-- Move data to different array, because memory location of work1 is
621!-- needed further below (work1 = work2).
622!-- If the PE grid is one-dimensional along y, only local reordering
623!-- of the data is necessary and no transposition has to be done.
624    IF ( pdims(1) == 1 )  THEN
[1106]625
[683]626!$OMP  PARALLEL PRIVATE ( i, j, k )
627!$OMP  DO
[1106]628       !$acc kernels present( f_out )
629       !$acc loop
[1003]630       DO  k = 1, nz
631          DO  i = nxl, nxr
[1106]632             !$acc loop vector( 32 )
[1003]633             DO  j = nys, nyn
[164]634                f_out(i,j,k) = f_inv(j,i,k)
[1]635             ENDDO
636          ENDDO
637       ENDDO
[1111]638       !$acc end kernels
[683]639!$OMP  END PARALLEL
[1]640
[1106]641    ELSE
642
643#if defined( __parallel )
[1]644!
[1106]645!--    Transpose array
646       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
647       IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
[1111]648       !$acc update host( f_inv )
649       CALL MPI_ALLTOALL( f_inv(nys,nxl,1),      sendrecvcount_zx, MPI_REAL, &
650                          work(nys_x,1,nzb_x,0), sendrecvcount_zx, MPI_REAL, &
[1106]651                          comm1dx, ierr )
[1111]652       !$acc update device( work )
[1106]653       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
[1]654
655!
[1106]656!--    Reorder transposed array
[1111]657!$OMP  PARALLEL PRIVATE ( i, j, k, l, xs )
[683]658!$OMP  DO
[1106]659       DO  l = 0, pdims(1) - 1
660          xs = 0 + l * nnx
[1111]661          !$acc kernels present( f_out, work )
662          !$acc loop
[1106]663          DO  k = nzb_x, nzt_x
664             DO  i = xs, xs + nnx - 1
[1111]665                !$acc loop vector( 32 )
[1106]666                DO  j = nys_x, nyn_x
[1111]667                   f_out(i,j,k) = work(j,i-xs+1,k,l)
[1106]668                ENDDO
[1]669             ENDDO
670          ENDDO
[1111]671          !$acc end kernels
[1]672       ENDDO
[683]673!$OMP  END PARALLEL
[1]674#endif
675
[1106]676    ENDIF
677
[1]678 END SUBROUTINE transpose_zx
679
680
[164]681 SUBROUTINE transpose_zy( f_in, work, f_out )
[1]682
683!------------------------------------------------------------------------------!
684! Description:
685! ------------
686! Transposition of input array (f_in) from z to y. For the input array, all
687! elements along z reside on the same PE, while after transposition, all
688! elements along y reside on the same PE.
689!------------------------------------------------------------------------------!
690
691    USE cpulog
692    USE indices
693    USE interfaces
694    USE pegrid
695    USE transpose_indices
696
697    IMPLICIT NONE
698
[1111]699    INTEGER ::  i, j, k, l, zs
[1]700   
[1111]701    REAL ::  f_in(nxl_z:nxr_z,nys_z:nyn_z,1:nz), f_out(0:ny,nxl_y:nxr_y,nzb_y:nzt_y)
[1]702
[1111]703    REAL, DIMENSION(nxl_z:nxr_z,nzt_y-nzb_y+1,nys_z:nyn_z,0:pdims(1)-1) ::  work
704
705    !$acc declare create( f_inv )
706    REAL ::  f_inv(nxl_y:nxr_y,nzb_y:nzt_y,0:ny)
707
708
[1]709!
710!-- If the PE grid is one-dimensional along y, the array has only to be
711!-- reordered locally and therefore no transposition has to be done.
712    IF ( pdims(1) /= 1 )  THEN
[1106]713
714#if defined( __parallel )
[1]715!
716!--    Reorder input array for transposition
[1111]717!$OMP  PARALLEL PRIVATE ( i, j, k, l, zs )
[683]718!$OMP  DO
[1]719       DO  l = 0, pdims(1) - 1
[1003]720          zs = 1 + l * ( nzt_y - nzb_y + 1 )
[1111]721          !$acc kernels present( f_in, work )
722          !$acc loop
[1003]723          DO  j = nys_z, nyn_z
724             DO  k = zs, zs + nzt_y - nzb_y
[1111]725                !$acc loop vector( 32 )
[1003]726                DO  i = nxl_z, nxr_z
[1111]727                   work(i,k-zs+1,j,l) = f_in(i,j,k)
[1]728                ENDDO
729             ENDDO
730          ENDDO
[1111]731          !$acc end kernels
[1]732       ENDDO
[683]733!$OMP  END PARALLEL
[1]734
735!
736!--    Transpose array
737       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
[622]738       IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
[1111]739       !$acc update host( work )
740       CALL MPI_ALLTOALL( work(nxl_z,1,nys_z,0), sendrecvcount_yz, MPI_REAL, &
741                          f_inv(nxl_y,nzb_y,0),  sendrecvcount_yz, MPI_REAL, &
[1]742                          comm1dx, ierr )
[1111]743       !$acc update device( f_inv )
[1]744       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
[1106]745#endif
[1]746
747    ELSE
748!
[1106]749!--    Reorder the array in the same way like ALLTOALL did it
[683]750!$OMP  PARALLEL PRIVATE ( i, j, k )
751!$OMP  DO
[1111]752       !$acc kernels present( f_in )
753       !$acc loop
[1003]754       DO  k = nzb_y, nzt_y
755          DO  j = 0, ny
[1111]756             !$acc loop vector( 32 )
[1003]757             DO  i = nxl_y, nxr_y
[164]758                f_inv(i,k,j) = f_in(i,j,k)
759             ENDDO
760          ENDDO
761       ENDDO
[1111]762       !$acc end kernels
[683]763!$OMP  END PARALLEL
[1106]764
765    ENDIF
766
[164]767!
[1106]768!-- Reorder transposed array in a way that the y index is in first position
[683]769!$OMP  PARALLEL PRIVATE ( i, j, k )
770!$OMP  DO
[1111]771    !$acc kernels present( f_out )
772    !$acc loop
[1106]773    DO  k = nzb_y, nzt_y
774       DO  i = nxl_y, nxr_y
[1111]775          !$acc loop vector( 32 )
[1106]776          DO  j = 0, ny
777             f_out(j,i,k) = f_inv(i,k,j)
[1]778          ENDDO
779       ENDDO
[1106]780    ENDDO
[1111]781    !$acc end kernels
[683]782!$OMP  END PARALLEL
[164]783
[1]784 END SUBROUTINE transpose_zy
785
786
[164]787 SUBROUTINE transpose_zyd( f_in, work, f_out )
[1]788
789!------------------------------------------------------------------------------!
790! Description:
791! ------------
792! Transposition of input array (f_in) from z to y. For the input array, all
793! elements along z reside on the same PE, while after transposition, all
794! elements along y reside on the same PE.
795! This is a direct transposition for arrays with indices in regular order
796! (k,j,i) (cf. transpose_zy).
797!------------------------------------------------------------------------------!
798
799    USE cpulog
800    USE indices
801    USE interfaces
802    USE pegrid
803    USE transpose_indices
804
805    IMPLICIT NONE
806
807    INTEGER ::  i, j, k, l, m, ys
808   
[1003]809    REAL ::  f_in(1:nz,nys:nyn,nxl:nxr), f_inv(nys:nyn,nxl:nxr,1:nz), &
810             f_out(0:ny,nxl_yd:nxr_yd,nzb_yd:nzt_yd),                 &
[164]811             work(nnx*nny*nnz)
[1]812
813#if defined( __parallel )
814
815!
816!-- Rearrange indices of input array in order to make data to be send
817!-- by MPI contiguous
[1003]818    DO  i = nxl, nxr
819       DO  j = nys, nyn
820          DO  k = 1, nz
[164]821             f_inv(j,i,k) = f_in(k,j,i)
[1]822          ENDDO
823       ENDDO
824    ENDDO
825
826!
827!-- Move data to different array, because memory location of work1 is
828!-- needed further below (work1 = work2).
829!-- If the PE grid is one-dimensional along x, only local reordering
830!-- of the data is necessary and no transposition has to be done.
831    IF ( pdims(2) == 1 )  THEN
[1003]832       DO  k = 1, nz
833          DO  i = nxl, nxr
834             DO  j = nys, nyn
[164]835                f_out(j,i,k) = f_inv(j,i,k)
[1]836             ENDDO
837          ENDDO
838       ENDDO
839       RETURN
840    ENDIF
841
842!
843!-- Transpose array
844    CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
[622]845    IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
[1]846    CALL MPI_ALLTOALL( f_inv(nys,nxl,1), sendrecvcount_zyd, MPI_REAL, &
[164]847                       work(1),          sendrecvcount_zyd, MPI_REAL, &
[1]848                       comm1dy, ierr )
849    CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
850
851!
852!-- Reorder transposed array
853    m = 0
854    DO  l = 0, pdims(2) - 1
855       ys = 0 + l * nny
[1003]856       DO  k = nzb_yd, nzt_yd
857          DO  i = nxl_yd, nxr_yd
[1]858             DO  j = ys, ys + nny - 1
859                m = m + 1
[164]860                f_out(j,i,k) = work(m)
[1]861             ENDDO
862          ENDDO
863       ENDDO
864    ENDDO
865
866#endif
867
868 END SUBROUTINE transpose_zyd
Note: See TracBrowser for help on using the repository browser.