source: palm/trunk/SOURCE/transpose.f90 @ 1111

Last change on this file since 1111 was 1111, checked in by raasch, 11 years ago

New:
---

GPU porting of pres, swap_timelevel. Adjustments of openACC directives.
Further porting of poisfft, which now runs completely on GPU without any
host/device data transfer for serial an parallel runs (but parallel runs
require data transfer before and after the MPI transpositions).
GPU-porting of tridiagonal solver:
tridiagonal routines split into extermal subroutines (instead using CONTAINS),
no distinction between parallel/non-parallel in poisfft and tridia any more,
tridia routines moved to end of file because of probable bug in PGI compiler
(otherwise "invalid device function" is indicated during runtime).
(cuda_fft_interfaces, fft_xy, flow_statistics, init_3d_model, palm, poisfft, pres, prognostic_equations, swap_timelevel, time_integration, transpose)
output of accelerator board information. (header)

optimization of tridia routines: constant elements and coefficients of tri are
stored in seperate arrays ddzuw and tric, last dimension of tri reduced from 5 to 2,
(init_grid, init_3d_model, modules, palm, poisfft)

poisfft_init is now called internally from poisfft,
(Makefile, Makefile_check, init_pegrid, poisfft, poisfft_hybrid)

CPU-time per grid point and timestep is output to CPU_MEASURES file
(cpu_statistics, modules, time_integration)

Changed:


resorting from/to array work changed, work now has 4 dimensions instead of 1 (transpose)
array diss allocated only if required (init_3d_model)

pressure boundary condition "Neumann+inhomo" removed from the code
(check_parameters, header, poisfft, poisfft_hybrid, pres)

Errors:


bugfix: dependency added for cuda_fft_interfaces (Makefile)
bugfix: CUDA fft plans adjusted for domain decomposition (before they always
used total domain) (fft_xy)

  • Property svn:keywords set to Id
File size: 23.8 KB
Line 
1 SUBROUTINE transpose_xy( f_in, work, f_out )
2
3!--------------------------------------------------------------------------------!
4! This file is part of PALM.
5!
6! PALM is free software: you can redistribute it and/or modify it under the terms
7! of the GNU General Public License as published by the Free Software Foundation,
8! either version 3 of the License, or (at your option) any later version.
9!
10! PALM is distributed in the hope that it will be useful, but WITHOUT ANY
11! WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
12! A PARTICULAR PURPOSE.  See the GNU General Public License for more details.
13!
14! You should have received a copy of the GNU General Public License along with
15! PALM. If not, see <http://www.gnu.org/licenses/>.
16!
17! Copyright 1997-2012  Leibniz University Hannover
18!--------------------------------------------------------------------------------!
19!
20! Current revisions:
21! -----------------
22! openACC directives added,
23! resorting data from/to work changed, work got 4 dimensions instead of 1
24!
25! Former revisions:
26! -----------------
27! $Id: transpose.f90 1111 2013-03-08 23:54:10Z raasch $
28!
29! 1106 2013-03-04 05:31:38Z raasch
30! preprocessor lines rearranged so that routines can also be used in serial
31! (non-parallel) mode
32!
33! 1092 2013-02-02 11:24:22Z raasch
34! unused variables removed
35!
36! 1036 2012-10-22 13:43:42Z raasch
37! code put under GPL (PALM 3.9)
38!
39! 1003 2012-09-14 14:35:53Z raasch
40! indices nxa, nya, etc. replaced by nx, ny, etc.
41!
42! 683 2011-02-09 14:25:15Z raasch
43! openMP parallelization of transpositions for 2d-domain-decomposition
44!
45! 622 2010-12-10 08:08:13Z raasch
46! optional barriers included in order to speed up collective operations
47!
48! 164 2008-05-15 08:46:15Z raasch
49! f_inv changed from subroutine argument to automatic array in order to do
50! re-ordering from f_in to f_inv in one step, one array work is needed instead
51! of work1 and work2
52!
53! February 2007
54! RCS Log replace by Id keyword, revision history cleaned up
55!
56! Revision 1.2  2004/04/30 13:12:17  raasch
57! Switched from mpi_alltoallv to the simpler mpi_alltoall,
58! all former transpose-routine files collected in this file, enlarged
59! transposition arrays introduced
60!
61! Revision 1.1  2004/04/30 13:08:16  raasch
62! Initial revision (collection of former routines transpose_xy, transpose_xz,
63!                   transpose_yx, transpose_yz, transpose_zx, transpose_zy)
64!
65! Revision 1.1  1997/07/24 11:25:18  raasch
66! Initial revision
67!
68!
69! Description:
70! ------------
71! Transposition of input array (f_in) from x to y. For the input array, all
72! elements along x reside on the same PE, while after transposition, all
73! elements along y reside on the same PE.
74!------------------------------------------------------------------------------!
75
76    USE cpulog
77    USE indices
78    USE interfaces
79    USE pegrid
80    USE transpose_indices
81
82    IMPLICIT NONE
83
84    INTEGER ::  i, j, k, l, ys
85   
86    REAL ::  f_in(0:nx,nys_x:nyn_x,nzb_x:nzt_x), f_out(0:ny,nxl_y:nxr_y,nzb_y:nzt_y)
87
88    REAL, DIMENSION(nyn_x-nys_x+1,nzb_y:nzt_y,nxl_y:nxr_y,0:pdims(2)-1) ::  work
89
90    !$acc declare create( f_inv )
91    REAL ::  f_inv(nys_x:nyn_x,nzb_x:nzt_x,0:nx)
92
93
94!
95!-- Rearrange indices of input array in order to make data to be send
96!-- by MPI contiguous
97!$OMP  PARALLEL PRIVATE ( i, j, k )
98!$OMP  DO
99    !$acc kernels present( f_in )
100    !$acc loop
101    DO  i = 0, nx
102       DO  k = nzb_x, nzt_x
103          !$acc loop vector( 32 )
104          DO  j = nys_x, nyn_x
105             f_inv(j,k,i) = f_in(i,j,k)
106          ENDDO
107       ENDDO
108    ENDDO
109    !$acc end kernels
110!$OMP  END PARALLEL
111
112    IF ( numprocs /= 1 )  THEN
113
114#if defined( __parallel )
115!
116!--    Transpose array
117       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
118       IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
119       !$acc update host( f_inv )
120       CALL MPI_ALLTOALL( f_inv(nys_x,nzb_x,0),  sendrecvcount_xy, MPI_REAL, &
121                          work(1,nzb_y,nxl_y,0), sendrecvcount_xy, MPI_REAL, &
122                          comm1dy, ierr )
123       !$acc update device( work )
124       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
125
126!
127!--    Reorder transposed array
128!$OMP  PARALLEL PRIVATE ( i, j, k, l, ys )
129!$OMP  DO
130       DO  l = 0, pdims(2) - 1
131          ys = 0 + l * ( nyn_x - nys_x + 1 )
132          !$acc kernels present( f_out, work )
133          !$acc loop
134          DO  i = nxl_y, nxr_y
135             DO  k = nzb_y, nzt_y
136                !$acc loop vector( 32 )
137                DO  j = ys, ys + nyn_x - nys_x
138                   f_out(j,i,k) = work(j-ys+1,k,i,l)
139                ENDDO
140             ENDDO
141          ENDDO
142          !$acc end kernels
143       ENDDO
144!$OMP  END PARALLEL
145#endif
146
147    ELSE
148
149!
150!--    Reorder transposed array
151!$OMP  PARALLEL PRIVATE ( i, j, k )
152!$OMP  DO
153       !$acc kernels present( f_out )
154       !$acc loop
155       DO  k = nzb_y, nzt_y
156          DO  i = nxl_y, nxr_y
157             !$acc loop vector( 32 )
158             DO  j = 0, ny
159                f_out(j,i,k) = f_inv(j,k,i)
160             ENDDO
161          ENDDO
162       ENDDO
163       !$acc end kernels
164!$OMP  END PARALLEL
165
166    ENDIF
167
168 END SUBROUTINE transpose_xy
169
170
171 SUBROUTINE transpose_xz( f_in, work, f_out )
172
173!------------------------------------------------------------------------------!
174! Description:
175! ------------
176! Transposition of input array (f_in) from x to z. For the input array, all
177! elements along x reside on the same PE, while after transposition, all
178! elements along z reside on the same PE.
179!------------------------------------------------------------------------------!
180
181    USE cpulog
182    USE indices
183    USE interfaces
184    USE pegrid
185    USE transpose_indices
186
187    IMPLICIT NONE
188
189    INTEGER ::  i, j, k, l, xs
190   
191    REAL ::  f_in(0:nx,nys_x:nyn_x,nzb_x:nzt_x), f_out(1:nz,nys:nyn,nxl:nxr)
192
193    REAL, DIMENSION(nys_x:nyn_x,nnx,nzb_x:nzt_x,0:pdims(1)-1) ::  work
194
195    !$acc declare create( f_inv )
196    REAL ::  f_inv(nys:nyn,nxl:nxr,1:nz)
197
198
199!
200!-- If the PE grid is one-dimensional along y, the array has only to be
201!-- reordered locally and therefore no transposition has to be done.
202    IF ( pdims(1) /= 1 )  THEN
203
204#if defined( __parallel )
205!
206!--    Reorder input array for transposition
207!$OMP  PARALLEL PRIVATE ( i, j, k, l, xs )
208!$OMP  DO
209       DO  l = 0, pdims(1) - 1
210          xs = 0 + l * nnx
211          !$acc kernels present( f_in, work )
212          !$acc loop
213          DO  k = nzb_x, nzt_x
214             DO  i = xs, xs + nnx - 1
215                !$acc loop vector( 32 )
216                DO  j = nys_x, nyn_x
217                   work(j,i-xs+1,k,l) = f_in(i,j,k)
218                ENDDO
219             ENDDO
220          ENDDO
221          !$acc end kernels
222       ENDDO
223!$OMP  END PARALLEL
224
225!
226!--    Transpose array
227       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
228       IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
229       !$acc update host( work )
230       CALL MPI_ALLTOALL( work(nys_x,1,nzb_x,0), sendrecvcount_zx, MPI_REAL, &
231                          f_inv(nys,nxl,1),      sendrecvcount_zx, MPI_REAL, &
232                          comm1dx, ierr )
233       !$acc update device( f_inv )
234       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
235
236!
237!--    Reorder transposed array in a way that the z index is in first position
238!$OMP  PARALLEL PRIVATE ( i, j, k )
239!$OMP  DO
240       !$acc kernels present( f_out )
241       !$acc loop
242       DO  k = 1, nz
243          DO  i = nxl, nxr
244             !$acc loop vector( 32 )
245             DO  j = nys, nyn
246                f_out(k,j,i) = f_inv(j,i,k)
247             ENDDO
248          ENDDO
249       ENDDO
250       !$acc end kernels
251!$OMP  END PARALLEL
252#endif
253
254    ELSE
255
256!
257!--    Reorder the array in a way that the z index is in first position
258!$OMP  PARALLEL PRIVATE ( i, j, k )
259!$OMP  DO
260       !$acc kernels present( f_in )
261       !$acc loop
262       DO  i = nxl, nxr
263          DO  j = nys, nyn
264             !$acc loop vector( 32 )
265             DO  k = 1, nz
266                f_inv(j,i,k) = f_in(i,j,k)
267             ENDDO
268          ENDDO
269       ENDDO
270       !$acc end kernels
271!$OMP  END PARALLEL
272
273!$OMP  PARALLEL PRIVATE ( i, j, k )
274!$OMP  DO
275       !$acc kernels present( f_out )
276       !$acc loop
277       DO  k = 1, nz
278          DO  i = nxl, nxr
279             !$acc loop vector( 32 )
280             DO  j = nys, nyn
281                f_out(k,j,i) = f_inv(j,i,k)
282             ENDDO
283          ENDDO
284       ENDDO
285       !$acc end kernels
286!$OMP  END PARALLEL
287
288    ENDIF
289
290 END SUBROUTINE transpose_xz
291
292
293 SUBROUTINE transpose_yx( f_in, work, f_out )
294
295!------------------------------------------------------------------------------!
296! Description:
297! ------------
298! Transposition of input array (f_in) from y to x. For the input array, all
299! elements along y reside on the same PE, while after transposition, all
300! elements along x reside on the same PE.
301!------------------------------------------------------------------------------!
302
303    USE cpulog
304    USE indices
305    USE interfaces
306    USE pegrid
307    USE transpose_indices
308
309    IMPLICIT NONE
310
311    INTEGER ::  i, j, k, l, ys
312   
313    REAL ::  f_in(0:ny,nxl_y:nxr_y,nzb_y:nzt_y), f_out(0:nx,nys_x:nyn_x,nzb_x:nzt_x)
314
315    REAL, DIMENSION(nyn_x-nys_x+1,nzb_y:nzt_y,nxl_y:nxr_y,0:pdims(2)-1) ::  work
316
317    !$acc declare create( f_inv )
318    REAL ::  f_inv(nys_x:nyn_x,nzb_x:nzt_x,0:nx)
319
320
321    IF ( numprocs /= 1 )  THEN
322
323#if defined( __parallel )
324!
325!--    Reorder input array for transposition
326!$OMP  PARALLEL PRIVATE ( i, j, k, l, ys )
327!$OMP  DO
328       DO  l = 0, pdims(2) - 1
329          ys = 0 + l * ( nyn_x - nys_x + 1 )
330          !$acc kernels present( f_in, work )
331          !$acc loop
332          DO  i = nxl_y, nxr_y
333             DO  k = nzb_y, nzt_y
334                !$acc loop vector( 32 )
335                DO  j = ys, ys + nyn_x - nys_x
336                   work(j-ys+1,k,i,l) = f_in(j,i,k)
337                ENDDO
338             ENDDO
339          ENDDO
340          !$acc end kernels
341       ENDDO
342!$OMP  END PARALLEL
343
344!
345!--    Transpose array
346       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
347       IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
348       !$acc update host( work )
349       CALL MPI_ALLTOALL( work(1,nzb_y,nxl_y,0), sendrecvcount_xy, MPI_REAL, &
350                          f_inv(nys_x,nzb_x,0),  sendrecvcount_xy, MPI_REAL, &
351                          comm1dy, ierr )
352       !$acc update device( f_inv )
353       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
354#endif
355
356    ELSE
357
358!
359!--    Reorder array f_in the same way as ALLTOALL did it
360!$OMP  PARALLEL PRIVATE ( i, j, k )
361!$OMP  DO
362       !$acc kernels present( f_in )
363       !$acc loop
364       DO  i = nxl_y, nxr_y
365          DO  k = nzb_y, nzt_y
366             !$acc loop vector( 32 )
367             DO  j = 0, ny
368                f_inv(j,k,i) = f_in(j,i,k)
369             ENDDO
370          ENDDO
371       ENDDO
372       !$acc end kernels
373!$OMP  END PARALLEL
374
375    ENDIF
376
377!
378!-- Reorder transposed array in a way that the x index is in first position
379!$OMP  PARALLEL PRIVATE ( i, j, k )
380!$OMP  DO
381    !$acc kernels present( f_out )
382    !$acc loop
383    DO  i = 0, nx
384       DO  k = nzb_x, nzt_x
385          !$acc loop vector( 32 )
386          DO  j = nys_x, nyn_x
387             f_out(i,j,k) = f_inv(j,k,i)
388          ENDDO
389       ENDDO
390    ENDDO
391    !$acc end kernels
392!$OMP  END PARALLEL
393
394 END SUBROUTINE transpose_yx
395
396
397 SUBROUTINE transpose_yxd( f_in, work, f_out )
398
399!------------------------------------------------------------------------------!
400! Description:
401! ------------
402! Transposition of input array (f_in) from y to x. For the input array, all
403! elements along y reside on the same PE, while after transposition, all
404! elements along x reside on the same PE.
405! This is a direct transposition for arrays with indices in regular order
406! (k,j,i) (cf. transpose_yx).
407!------------------------------------------------------------------------------!
408
409    USE cpulog
410    USE indices
411    USE interfaces
412    USE pegrid
413    USE transpose_indices
414
415    IMPLICIT NONE
416
417    INTEGER ::  i, j, k, l, m, xs
418
419    REAL ::  f_in(1:nz,nys:nyn,nxl:nxr), f_inv(nxl:nxr,1:nz,nys:nyn), &
420             f_out(0:nx,nys_x:nyn_x,nzb_x:nzt_x),                     &
421             work(nnx*nny*nnz)
422
423#if defined( __parallel )
424
425!
426!-- Rearrange indices of input array in order to make data to be send
427!-- by MPI contiguous
428    DO  k = 1, nz
429       DO  j = nys, nyn
430          DO  i = nxl, nxr
431             f_inv(i,k,j) = f_in(k,j,i)
432          ENDDO
433       ENDDO
434    ENDDO
435
436!
437!-- Transpose array
438    CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
439    IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
440    CALL MPI_ALLTOALL( f_inv(nxl,1,nys), sendrecvcount_xy, MPI_REAL, &
441                       work(1),          sendrecvcount_xy, MPI_REAL, &
442                       comm1dx, ierr )
443    CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
444
445!
446!-- Reorder transposed array
447    m = 0
448    DO  l = 0, pdims(1) - 1
449       xs = 0 + l * nnx
450       DO  j = nys_x, nyn_x
451          DO  k = 1, nz
452             DO  i = xs, xs + nnx - 1
453                m = m + 1
454                f_out(i,j,k) = work(m)
455             ENDDO
456          ENDDO
457       ENDDO
458    ENDDO
459
460#endif
461
462 END SUBROUTINE transpose_yxd
463
464
465 SUBROUTINE transpose_yz( f_in, work, f_out )
466
467!------------------------------------------------------------------------------!
468! Description:
469! ------------
470! Transposition of input array (f_in) from y to z. For the input array, all
471! elements along y reside on the same PE, while after transposition, all
472! elements along z reside on the same PE.
473!------------------------------------------------------------------------------!
474
475    USE cpulog
476    USE indices
477    USE interfaces
478    USE pegrid
479    USE transpose_indices
480
481    IMPLICIT NONE
482
483    INTEGER ::  i, j, k, l, zs
484   
485    REAL ::  f_in(0:ny,nxl_y:nxr_y,nzb_y:nzt_y), f_out(nxl_z:nxr_z,nys_z:nyn_z,1:nz)
486
487    REAL, DIMENSION(nxl_z:nxr_z,nzt_y-nzb_y+1,nys_z:nyn_z,0:pdims(1)-1) ::  work
488
489    !$acc declare create( f_inv )
490    REAL ::  f_inv(nxl_y:nxr_y,nzb_y:nzt_y,0:ny)
491
492
493!
494!-- Rearrange indices of input array in order to make data to be send
495!-- by MPI contiguous
496!$OMP  PARALLEL PRIVATE ( i, j, k )
497!$OMP  DO
498    !$acc kernels present( f_in )
499    !$acc loop
500    DO  j = 0, ny
501       DO  k = nzb_y, nzt_y
502          !$acc loop vector( 32 )
503          DO  i = nxl_y, nxr_y
504             f_inv(i,k,j) = f_in(j,i,k)
505          ENDDO
506       ENDDO
507    ENDDO
508    !$acc end kernels
509!$OMP  END PARALLEL
510
511!
512!-- Move data to different array, because memory location of work1 is
513!-- needed further below (work1 = work2).
514!-- If the PE grid is one-dimensional along y, only local reordering
515!-- of the data is necessary and no transposition has to be done.
516    IF ( pdims(1) == 1 )  THEN
517
518!$OMP  PARALLEL PRIVATE ( i, j, k )
519!$OMP  DO
520       !$acc kernels present( f_out )
521       !$acc loop
522       DO  j = 0, ny
523          DO  k = nzb_y, nzt_y
524             !$acc loop vector( 32 )
525             DO  i = nxl_y, nxr_y
526                f_out(i,j,k) = f_inv(i,k,j)
527             ENDDO
528          ENDDO
529       ENDDO
530       !$acc end kernels
531!$OMP  END PARALLEL
532
533    ELSE
534
535#if defined( __parallel )
536!
537!--    Transpose array
538       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
539       IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
540       !$acc update host( f_inv )
541       CALL MPI_ALLTOALL( f_inv(nxl_y,nzb_y,0),  sendrecvcount_yz, MPI_REAL, &
542                          work(nxl_z,1,nys_z,0), sendrecvcount_yz, MPI_REAL, &
543                          comm1dx, ierr )
544       !$acc update device( work )
545       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
546
547!
548!--    Reorder transposed array
549!$OMP  PARALLEL PRIVATE ( i, j, k, l, zs )
550!$OMP  DO
551       DO  l = 0, pdims(1) - 1
552          zs = 1 + l * ( nzt_y - nzb_y + 1 )
553          !$acc kernels present( f_out, work )
554          !$acc loop
555          DO  j = nys_z, nyn_z
556             DO  k = zs, zs + nzt_y - nzb_y
557                !$acc loop vector( 32 )
558                DO  i = nxl_z, nxr_z
559                   f_out(i,j,k) = work(i,k-zs+1,j,l)
560                ENDDO
561             ENDDO
562          ENDDO
563          !$acc end kernels
564       ENDDO
565!$OMP  END PARALLEL
566#endif
567
568   ENDIF
569
570 END SUBROUTINE transpose_yz
571
572
573 SUBROUTINE transpose_zx( f_in, work, f_out )
574
575!------------------------------------------------------------------------------!
576! Description:
577! ------------
578! Transposition of input array (f_in) from z to x. For the input array, all
579! elements along z reside on the same PE, while after transposition, all
580! elements along x reside on the same PE.
581!------------------------------------------------------------------------------!
582
583    USE cpulog
584    USE indices
585    USE interfaces
586    USE pegrid
587    USE transpose_indices
588
589    IMPLICIT NONE
590
591    INTEGER ::  i, j, k, l, xs
592   
593    REAL ::  f_in(1:nz,nys:nyn,nxl:nxr), f_out(0:nx,nys_x:nyn_x,nzb_x:nzt_x)
594
595    REAL, DIMENSION(nys_x:nyn_x,nnx,nzb_x:nzt_x,0:pdims(1)-1) ::  work
596
597    !$acc declare create( f_inv )
598    REAL ::  f_inv(nys:nyn,nxl:nxr,1:nz)
599
600
601!
602!-- Rearrange indices of input array in order to make data to be send
603!-- by MPI contiguous
604!$OMP  PARALLEL PRIVATE ( i, j, k )
605!$OMP  DO
606    !$acc kernels present( f_in )
607    !$acc loop
608    DO  k = 1,nz
609       DO  i = nxl, nxr
610          !$acc loop vector( 32 )
611          DO  j = nys, nyn
612             f_inv(j,i,k) = f_in(k,j,i)
613          ENDDO
614       ENDDO
615    ENDDO
616    !$acc end kernels
617!$OMP  END PARALLEL
618
619!
620!-- Move data to different array, because memory location of work1 is
621!-- needed further below (work1 = work2).
622!-- If the PE grid is one-dimensional along y, only local reordering
623!-- of the data is necessary and no transposition has to be done.
624    IF ( pdims(1) == 1 )  THEN
625
626!$OMP  PARALLEL PRIVATE ( i, j, k )
627!$OMP  DO
628       !$acc kernels present( f_out )
629       !$acc loop
630       DO  k = 1, nz
631          DO  i = nxl, nxr
632             !$acc loop vector( 32 )
633             DO  j = nys, nyn
634                f_out(i,j,k) = f_inv(j,i,k)
635             ENDDO
636          ENDDO
637       ENDDO
638       !$acc end kernels
639!$OMP  END PARALLEL
640
641    ELSE
642
643#if defined( __parallel )
644!
645!--    Transpose array
646       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
647       IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
648       !$acc update host( f_inv )
649       CALL MPI_ALLTOALL( f_inv(nys,nxl,1),      sendrecvcount_zx, MPI_REAL, &
650                          work(nys_x,1,nzb_x,0), sendrecvcount_zx, MPI_REAL, &
651                          comm1dx, ierr )
652       !$acc update device( work )
653       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
654
655!
656!--    Reorder transposed array
657!$OMP  PARALLEL PRIVATE ( i, j, k, l, xs )
658!$OMP  DO
659       DO  l = 0, pdims(1) - 1
660          xs = 0 + l * nnx
661          !$acc kernels present( f_out, work )
662          !$acc loop
663          DO  k = nzb_x, nzt_x
664             DO  i = xs, xs + nnx - 1
665                !$acc loop vector( 32 )
666                DO  j = nys_x, nyn_x
667                   f_out(i,j,k) = work(j,i-xs+1,k,l)
668                ENDDO
669             ENDDO
670          ENDDO
671          !$acc end kernels
672       ENDDO
673!$OMP  END PARALLEL
674#endif
675
676    ENDIF
677
678 END SUBROUTINE transpose_zx
679
680
681 SUBROUTINE transpose_zy( f_in, work, f_out )
682
683!------------------------------------------------------------------------------!
684! Description:
685! ------------
686! Transposition of input array (f_in) from z to y. For the input array, all
687! elements along z reside on the same PE, while after transposition, all
688! elements along y reside on the same PE.
689!------------------------------------------------------------------------------!
690
691    USE cpulog
692    USE indices
693    USE interfaces
694    USE pegrid
695    USE transpose_indices
696
697    IMPLICIT NONE
698
699    INTEGER ::  i, j, k, l, zs
700   
701    REAL ::  f_in(nxl_z:nxr_z,nys_z:nyn_z,1:nz), f_out(0:ny,nxl_y:nxr_y,nzb_y:nzt_y)
702
703    REAL, DIMENSION(nxl_z:nxr_z,nzt_y-nzb_y+1,nys_z:nyn_z,0:pdims(1)-1) ::  work
704
705    !$acc declare create( f_inv )
706    REAL ::  f_inv(nxl_y:nxr_y,nzb_y:nzt_y,0:ny)
707
708
709!
710!-- If the PE grid is one-dimensional along y, the array has only to be
711!-- reordered locally and therefore no transposition has to be done.
712    IF ( pdims(1) /= 1 )  THEN
713
714#if defined( __parallel )
715!
716!--    Reorder input array for transposition
717!$OMP  PARALLEL PRIVATE ( i, j, k, l, zs )
718!$OMP  DO
719       DO  l = 0, pdims(1) - 1
720          zs = 1 + l * ( nzt_y - nzb_y + 1 )
721          !$acc kernels present( f_in, work )
722          !$acc loop
723          DO  j = nys_z, nyn_z
724             DO  k = zs, zs + nzt_y - nzb_y
725                !$acc loop vector( 32 )
726                DO  i = nxl_z, nxr_z
727                   work(i,k-zs+1,j,l) = f_in(i,j,k)
728                ENDDO
729             ENDDO
730          ENDDO
731          !$acc end kernels
732       ENDDO
733!$OMP  END PARALLEL
734
735!
736!--    Transpose array
737       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
738       IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
739       !$acc update host( work )
740       CALL MPI_ALLTOALL( work(nxl_z,1,nys_z,0), sendrecvcount_yz, MPI_REAL, &
741                          f_inv(nxl_y,nzb_y,0),  sendrecvcount_yz, MPI_REAL, &
742                          comm1dx, ierr )
743       !$acc update device( f_inv )
744       CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
745#endif
746
747    ELSE
748!
749!--    Reorder the array in the same way like ALLTOALL did it
750!$OMP  PARALLEL PRIVATE ( i, j, k )
751!$OMP  DO
752       !$acc kernels present( f_in )
753       !$acc loop
754       DO  k = nzb_y, nzt_y
755          DO  j = 0, ny
756             !$acc loop vector( 32 )
757             DO  i = nxl_y, nxr_y
758                f_inv(i,k,j) = f_in(i,j,k)
759             ENDDO
760          ENDDO
761       ENDDO
762       !$acc end kernels
763!$OMP  END PARALLEL
764
765    ENDIF
766
767!
768!-- Reorder transposed array in a way that the y index is in first position
769!$OMP  PARALLEL PRIVATE ( i, j, k )
770!$OMP  DO
771    !$acc kernels present( f_out )
772    !$acc loop
773    DO  k = nzb_y, nzt_y
774       DO  i = nxl_y, nxr_y
775          !$acc loop vector( 32 )
776          DO  j = 0, ny
777             f_out(j,i,k) = f_inv(i,k,j)
778          ENDDO
779       ENDDO
780    ENDDO
781    !$acc end kernels
782!$OMP  END PARALLEL
783
784 END SUBROUTINE transpose_zy
785
786
787 SUBROUTINE transpose_zyd( f_in, work, f_out )
788
789!------------------------------------------------------------------------------!
790! Description:
791! ------------
792! Transposition of input array (f_in) from z to y. For the input array, all
793! elements along z reside on the same PE, while after transposition, all
794! elements along y reside on the same PE.
795! This is a direct transposition for arrays with indices in regular order
796! (k,j,i) (cf. transpose_zy).
797!------------------------------------------------------------------------------!
798
799    USE cpulog
800    USE indices
801    USE interfaces
802    USE pegrid
803    USE transpose_indices
804
805    IMPLICIT NONE
806
807    INTEGER ::  i, j, k, l, m, ys
808   
809    REAL ::  f_in(1:nz,nys:nyn,nxl:nxr), f_inv(nys:nyn,nxl:nxr,1:nz), &
810             f_out(0:ny,nxl_yd:nxr_yd,nzb_yd:nzt_yd),                 &
811             work(nnx*nny*nnz)
812
813#if defined( __parallel )
814
815!
816!-- Rearrange indices of input array in order to make data to be send
817!-- by MPI contiguous
818    DO  i = nxl, nxr
819       DO  j = nys, nyn
820          DO  k = 1, nz
821             f_inv(j,i,k) = f_in(k,j,i)
822          ENDDO
823       ENDDO
824    ENDDO
825
826!
827!-- Move data to different array, because memory location of work1 is
828!-- needed further below (work1 = work2).
829!-- If the PE grid is one-dimensional along x, only local reordering
830!-- of the data is necessary and no transposition has to be done.
831    IF ( pdims(2) == 1 )  THEN
832       DO  k = 1, nz
833          DO  i = nxl, nxr
834             DO  j = nys, nyn
835                f_out(j,i,k) = f_inv(j,i,k)
836             ENDDO
837          ENDDO
838       ENDDO
839       RETURN
840    ENDIF
841
842!
843!-- Transpose array
844    CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
845    IF ( collective_wait )  CALL MPI_BARRIER( comm2d, ierr )
846    CALL MPI_ALLTOALL( f_inv(nys,nxl,1), sendrecvcount_zyd, MPI_REAL, &
847                       work(1),          sendrecvcount_zyd, MPI_REAL, &
848                       comm1dy, ierr )
849    CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
850
851!
852!-- Reorder transposed array
853    m = 0
854    DO  l = 0, pdims(2) - 1
855       ys = 0 + l * nny
856       DO  k = nzb_yd, nzt_yd
857          DO  i = nxl_yd, nxr_yd
858             DO  j = ys, ys + nny - 1
859                m = m + 1
860                f_out(j,i,k) = work(m)
861             ENDDO
862          ENDDO
863       ENDDO
864    ENDDO
865
866#endif
867
868 END SUBROUTINE transpose_zyd
Note: See TracBrowser for help on using the repository browser.