Home

Context Navigation

transpose.f90 @ 1111

Last change on this file since 1111 was 1111, checked in by raasch, 12 years ago

New:
---

GPU porting of pres, swap_timelevel. Adjustments of openACC directives.
Further porting of poisfft, which now runs completely on GPU without any
host/device data transfer for serial an parallel runs (but parallel runs
require data transfer before and after the MPI transpositions).
GPU-porting of tridiagonal solver:
tridiagonal routines split into extermal subroutines (instead using CONTAINS),
no distinction between parallel/non-parallel in poisfft and tridia any more,
tridia routines moved to end of file because of probable bug in PGI compiler
(otherwise "invalid device function" is indicated during runtime).
(cuda_fft_interfaces, fft_xy, flow_statistics, init_3d_model, palm, poisfft, pres, prognostic_equations, swap_timelevel, time_integration, transpose)
output of accelerator board information. (header)

optimization of tridia routines: constant elements and coefficients of tri are
stored in seperate arrays ddzuw and tric, last dimension of tri reduced from 5 to 2,
(init_grid, init_3d_model, modules, palm, poisfft)

poisfft_init is now called internally from poisfft,
(Makefile, Makefile_check, init_pegrid, poisfft, poisfft_hybrid)

CPU-time per grid point and timestep is output to CPU_MEASURES file
(cpu_statistics, modules, time_integration)

Changed:

resorting from/to array work changed, work now has 4 dimensions instead of 1 (transpose)
array diss allocated only if required (init_3d_model)

pressure boundary condition "Neumann+inhomo" removed from the code
(check_parameters, header, poisfft, poisfft_hybrid, pres)

Errors:

bugfix: dependency added for cuda_fft_interfaces (Makefile)
bugfix: CUDA fft plans adjusted for domain decomposition (before they always
used total domain) (fft_xy)

Property svn:keywords set to Id

File size: 23.8 KB

Rev	Line
[164]	1	SUBROUTINE transpose_xy( f_in, work, f_out )
[1]	2
[1036]	3	!--------------------------------------------------------------------------------!
	4	! This file is part of PALM.
	5	!
	6	! PALM is free software: you can redistribute it and/or modify it under the terms
	7	! of the GNU General Public License as published by the Free Software Foundation,
	8	! either version 3 of the License, or (at your option) any later version.
	9	!
	10	! PALM is distributed in the hope that it will be useful, but WITHOUT ANY
	11	! WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
	12	! A PARTICULAR PURPOSE. See the GNU General Public License for more details.
	13	!
	14	! You should have received a copy of the GNU General Public License along with
	15	! PALM. If not, see <http://www.gnu.org/licenses/>.
	16	!
	17	! Copyright 1997-2012 Leibniz University Hannover
	18	!--------------------------------------------------------------------------------!
	19	!
[484]	20	! Current revisions:
[1]	21	! -----------------
[1111]	22	! openACC directives added,
	23	! resorting data from/to work changed, work got 4 dimensions instead of 1
[198]	24	!
	25	! Former revisions:
	26	! -----------------
	27	! $Id: transpose.f90 1111 2013-03-08 23:54:10Z raasch $
	28	!
[1107]	29	! 1106 2013-03-04 05:31:38Z raasch
	30	! preprocessor lines rearranged so that routines can also be used in serial
	31	! (non-parallel) mode
	32	!
[1093]	33	! 1092 2013-02-02 11:24:22Z raasch
	34	! unused variables removed
	35	!
[1037]	36	! 1036 2012-10-22 13:43:42Z raasch
	37	! code put under GPL (PALM 3.9)
	38	!
[1004]	39	! 1003 2012-09-14 14:35:53Z raasch
	40	! indices nxa, nya, etc. replaced by nx, ny, etc.
	41	!
[684]	42	! 683 2011-02-09 14:25:15Z raasch
	43	! openMP parallelization of transpositions for 2d-domain-decomposition
	44	!
[623]	45	! 622 2010-12-10 08:08:13Z raasch
	46	! optional barriers included in order to speed up collective operations
	47	!
[198]	48	! 164 2008-05-15 08:46:15Z raasch
[164]	49	! f_inv changed from subroutine argument to automatic array in order to do
	50	! re-ordering from f_in to f_inv in one step, one array work is needed instead
	51	! of work1 and work2
[1]	52	!
[198]	53	! February 2007
[3]	54	! RCS Log replace by Id keyword, revision history cleaned up
	55	!
[1]	56	! Revision 1.2 2004/04/30 13:12:17 raasch
	57	! Switched from mpi_alltoallv to the simpler mpi_alltoall,
	58	! all former transpose-routine files collected in this file, enlarged
	59	! transposition arrays introduced
	60	!
	61	! Revision 1.1 2004/04/30 13:08:16 raasch
	62	! Initial revision (collection of former routines transpose_xy, transpose_xz,
	63	! transpose_yx, transpose_yz, transpose_zx, transpose_zy)
	64	!
	65	! Revision 1.1 1997/07/24 11:25:18 raasch
	66	! Initial revision
	67	!
	68	!
	69	! Description:
	70	! ------------
	71	! Transposition of input array (f_in) from x to y. For the input array, all
	72	! elements along x reside on the same PE, while after transposition, all
	73	! elements along y reside on the same PE.
	74	!------------------------------------------------------------------------------!
	75
	76	USE cpulog
	77	USE indices
	78	USE interfaces
	79	USE pegrid
	80	USE transpose_indices
	81
	82	IMPLICIT NONE
	83
[1111]	84	INTEGER :: i, j, k, l, ys
[1]	85
[1111]	86	REAL :: f_in(0:nx,nys_x:nyn_x,nzb_x:nzt_x), f_out(0:ny,nxl_y:nxr_y,nzb_y:nzt_y)
[1]	87
[1111]	88	REAL, DIMENSION(nyn_x-nys_x+1,nzb_y:nzt_y,nxl_y:nxr_y,0:pdims(2)-1) :: work
	89
	90	!$acc declare create( f_inv )
	91	REAL :: f_inv(nys_x:nyn_x,nzb_x:nzt_x,0:nx)
	92
	93
[1]	94	!
	95	!-- Rearrange indices of input array in order to make data to be send
	96	!-- by MPI contiguous
[683]	97	!$OMP PARALLEL PRIVATE ( i, j, k )
	98	!$OMP DO
[1111]	99	!$acc kernels present( f_in )
	100	!$acc loop
[1003]	101	DO i = 0, nx
	102	DO k = nzb_x, nzt_x
[1111]	103	!$acc loop vector( 32 )
[1003]	104	DO j = nys_x, nyn_x
[164]	105	f_inv(j,k,i) = f_in(i,j,k)
[1]	106	ENDDO
	107	ENDDO
	108	ENDDO
[1111]	109	!$acc end kernels
[683]	110	!$OMP END PARALLEL
[1]	111
[1106]	112	IF ( numprocs /= 1 ) THEN
	113
	114	#if defined( __parallel )
[1]	115	!
[1106]	116	!-- Transpose array
	117	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
	118	IF ( collective_wait ) CALL MPI_BARRIER( comm2d, ierr )
[1111]	119	!$acc update host( f_inv )
	120	CALL MPI_ALLTOALL( f_inv(nys_x,nzb_x,0), sendrecvcount_xy, MPI_REAL, &
	121	work(1,nzb_y,nxl_y,0), sendrecvcount_xy, MPI_REAL, &
[1106]	122	comm1dy, ierr )
[1111]	123	!$acc update device( work )
[1106]	124	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
[1]	125
	126	!
[1106]	127	!-- Reorder transposed array
[1111]	128	!$OMP PARALLEL PRIVATE ( i, j, k, l, ys )
[683]	129	!$OMP DO
[1106]	130	DO l = 0, pdims(2) - 1
	131	ys = 0 + l * ( nyn_x - nys_x + 1 )
[1111]	132	!$acc kernels present( f_out, work )
	133	!$acc loop
[1106]	134	DO i = nxl_y, nxr_y
	135	DO k = nzb_y, nzt_y
[1111]	136	!$acc loop vector( 32 )
[1106]	137	DO j = ys, ys + nyn_x - nys_x
[1111]	138	f_out(j,i,k) = work(j-ys+1,k,i,l)
[1106]	139	ENDDO
[1]	140	ENDDO
	141	ENDDO
[1111]	142	!$acc end kernels
[1]	143	ENDDO
[683]	144	!$OMP END PARALLEL
[1]	145	#endif
	146
[1106]	147	ELSE
	148
	149	!
	150	!-- Reorder transposed array
	151	!$OMP PARALLEL PRIVATE ( i, j, k )
	152	!$OMP DO
[1111]	153	!$acc kernels present( f_out )
	154	!$acc loop
[1106]	155	DO k = nzb_y, nzt_y
	156	DO i = nxl_y, nxr_y
[1111]	157	!$acc loop vector( 32 )
[1106]	158	DO j = 0, ny
	159	f_out(j,i,k) = f_inv(j,k,i)
	160	ENDDO
	161	ENDDO
	162	ENDDO
[1111]	163	!$acc end kernels
[1106]	164	!$OMP END PARALLEL
	165
	166	ENDIF
	167
[1]	168	END SUBROUTINE transpose_xy
	169
	170
[164]	171	SUBROUTINE transpose_xz( f_in, work, f_out )
[1]	172
	173	!------------------------------------------------------------------------------!
	174	! Description:
	175	! ------------
	176	! Transposition of input array (f_in) from x to z. For the input array, all
	177	! elements along x reside on the same PE, while after transposition, all
	178	! elements along z reside on the same PE.
	179	!------------------------------------------------------------------------------!
	180
	181	USE cpulog
	182	USE indices
	183	USE interfaces
	184	USE pegrid
	185	USE transpose_indices
	186
	187	IMPLICIT NONE
	188
[1111]	189	INTEGER :: i, j, k, l, xs
[1]	190
[1111]	191	REAL :: f_in(0:nx,nys_x:nyn_x,nzb_x:nzt_x), f_out(1:nz,nys:nyn,nxl:nxr)
[1]	192
[1111]	193	REAL, DIMENSION(nys_x:nyn_x,nnx,nzb_x:nzt_x,0:pdims(1)-1) :: work
[1]	194
[1111]	195	!$acc declare create( f_inv )
	196	REAL :: f_inv(nys:nyn,nxl:nxr,1:nz)
	197
	198
[1]	199	!
	200	!-- If the PE grid is one-dimensional along y, the array has only to be
	201	!-- reordered locally and therefore no transposition has to be done.
	202	IF ( pdims(1) /= 1 ) THEN
[1106]	203
	204	#if defined( __parallel )
[1]	205	!
	206	!-- Reorder input array for transposition
[1111]	207	!$OMP PARALLEL PRIVATE ( i, j, k, l, xs )
[683]	208	!$OMP DO
[1]	209	DO l = 0, pdims(1) - 1
	210	xs = 0 + l * nnx
[1111]	211	!$acc kernels present( f_in, work )
	212	!$acc loop
[1003]	213	DO k = nzb_x, nzt_x
[164]	214	DO i = xs, xs + nnx - 1
[1111]	215	!$acc loop vector( 32 )
[1003]	216	DO j = nys_x, nyn_x
[1111]	217	work(j,i-xs+1,k,l) = f_in(i,j,k)
[1]	218	ENDDO
	219	ENDDO
	220	ENDDO
[1111]	221	!$acc end kernels
[1]	222	ENDDO
[683]	223	!$OMP END PARALLEL
[1]	224
	225	!
	226	!-- Transpose array
	227	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
[622]	228	IF ( collective_wait ) CALL MPI_BARRIER( comm2d, ierr )
[1111]	229	!$acc update host( work )
	230	CALL MPI_ALLTOALL( work(nys_x,1,nzb_x,0), sendrecvcount_zx, MPI_REAL, &
	231	f_inv(nys,nxl,1), sendrecvcount_zx, MPI_REAL, &
[1]	232	comm1dx, ierr )
[1111]	233	!$acc update device( f_inv )
[1]	234	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
	235
	236	!
	237	!-- Reorder transposed array in a way that the z index is in first position
[683]	238	!$OMP PARALLEL PRIVATE ( i, j, k )
	239	!$OMP DO
[1111]	240	!$acc kernels present( f_out )
	241	!$acc loop
[1003]	242	DO k = 1, nz
	243	DO i = nxl, nxr
[1111]	244	!$acc loop vector( 32 )
[1003]	245	DO j = nys, nyn
[164]	246	f_out(k,j,i) = f_inv(j,i,k)
[1]	247	ENDDO
	248	ENDDO
	249	ENDDO
[1111]	250	!$acc end kernels
[683]	251	!$OMP END PARALLEL
[1106]	252	#endif
	253
[1]	254	ELSE
[1106]	255
[1]	256	!
	257	!-- Reorder the array in a way that the z index is in first position
[683]	258	!$OMP PARALLEL PRIVATE ( i, j, k )
	259	!$OMP DO
[1111]	260	!$acc kernels present( f_in )
	261	!$acc loop
[1003]	262	DO i = nxl, nxr
	263	DO j = nys, nyn
[1111]	264	!$acc loop vector( 32 )
[1003]	265	DO k = 1, nz
[164]	266	f_inv(j,i,k) = f_in(i,j,k)
[1]	267	ENDDO
	268	ENDDO
	269	ENDDO
[1111]	270	!$acc end kernels
[683]	271	!$OMP END PARALLEL
[1]	272
[683]	273	!$OMP PARALLEL PRIVATE ( i, j, k )
	274	!$OMP DO
[1111]	275	!$acc kernels present( f_out )
	276	!$acc loop
[1003]	277	DO k = 1, nz
	278	DO i = nxl, nxr
[1111]	279	!$acc loop vector( 32 )
[1003]	280	DO j = nys, nyn
[164]	281	f_out(k,j,i) = f_inv(j,i,k)
	282	ENDDO
[1]	283	ENDDO
	284	ENDDO
[1111]	285	!$acc end kernels
[683]	286	!$OMP END PARALLEL
[1]	287
[164]	288	ENDIF
	289
[1]	290	END SUBROUTINE transpose_xz
	291
	292
[164]	293	SUBROUTINE transpose_yx( f_in, work, f_out )
[1]	294
	295	!------------------------------------------------------------------------------!
	296	! Description:
	297	! ------------
	298	! Transposition of input array (f_in) from y to x. For the input array, all
	299	! elements along y reside on the same PE, while after transposition, all
	300	! elements along x reside on the same PE.
	301	!------------------------------------------------------------------------------!
	302
	303	USE cpulog
	304	USE indices
	305	USE interfaces
	306	USE pegrid
	307	USE transpose_indices
	308
	309	IMPLICIT NONE
	310
[1111]	311	INTEGER :: i, j, k, l, ys
[1]	312
[1111]	313	REAL :: f_in(0:ny,nxl_y:nxr_y,nzb_y:nzt_y), f_out(0:nx,nys_x:nyn_x,nzb_x:nzt_x)
[1]	314
[1111]	315	REAL, DIMENSION(nyn_x-nys_x+1,nzb_y:nzt_y,nxl_y:nxr_y,0:pdims(2)-1) :: work
	316
	317	!$acc declare create( f_inv )
	318	REAL :: f_inv(nys_x:nyn_x,nzb_x:nzt_x,0:nx)
	319
	320
[1106]	321	IF ( numprocs /= 1 ) THEN
	322
[1]	323	#if defined( __parallel )
	324	!
[1106]	325	!-- Reorder input array for transposition
[1111]	326	!$OMP PARALLEL PRIVATE ( i, j, k, l, ys )
[683]	327	!$OMP DO
[1106]	328	DO l = 0, pdims(2) - 1
	329	ys = 0 + l * ( nyn_x - nys_x + 1 )
[1111]	330	!$acc kernels present( f_in, work )
	331	!$acc loop
[1106]	332	DO i = nxl_y, nxr_y
	333	DO k = nzb_y, nzt_y
[1111]	334	!$acc loop vector( 32 )
[1106]	335	DO j = ys, ys + nyn_x - nys_x
[1111]	336	work(j-ys+1,k,i,l) = f_in(j,i,k)
[1106]	337	ENDDO
	338	ENDDO
	339	ENDDO
[1111]	340	!$acc end kernels
[1106]	341	ENDDO
	342	!$OMP END PARALLEL
	343
	344	!
	345	!-- Transpose array
	346	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
	347	IF ( collective_wait ) CALL MPI_BARRIER( comm2d, ierr )
[1111]	348	!$acc update host( work )
	349	CALL MPI_ALLTOALL( work(1,nzb_y,nxl_y,0), sendrecvcount_xy, MPI_REAL, &
	350	f_inv(nys_x,nzb_x,0), sendrecvcount_xy, MPI_REAL, &
[1106]	351	comm1dy, ierr )
[1111]	352	!$acc update device( f_inv )
[1106]	353	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
	354	#endif
	355
	356	ELSE
	357
	358	!
	359	!-- Reorder array f_in the same way as ALLTOALL did it
	360	!$OMP PARALLEL PRIVATE ( i, j, k )
	361	!$OMP DO
[1111]	362	!$acc kernels present( f_in )
	363	!$acc loop
[1003]	364	DO i = nxl_y, nxr_y
	365	DO k = nzb_y, nzt_y
[1111]	366	!$acc loop vector( 32 )
[1106]	367	DO j = 0, ny
	368	f_inv(j,k,i) = f_in(j,i,k)
[1]	369	ENDDO
	370	ENDDO
	371	ENDDO
[1111]	372	!$acc end kernels
[683]	373	!$OMP END PARALLEL
[1]	374
[1106]	375	ENDIF
[1]	376
	377	!
	378	!-- Reorder transposed array in a way that the x index is in first position
[683]	379	!$OMP PARALLEL PRIVATE ( i, j, k )
	380	!$OMP DO
[1111]	381	!$acc kernels present( f_out )
	382	!$acc loop
[1003]	383	DO i = 0, nx
	384	DO k = nzb_x, nzt_x
[1111]	385	!$acc loop vector( 32 )
[1003]	386	DO j = nys_x, nyn_x
[164]	387	f_out(i,j,k) = f_inv(j,k,i)
[1]	388	ENDDO
	389	ENDDO
	390	ENDDO
[1111]	391	!$acc end kernels
[683]	392	!$OMP END PARALLEL
[1]	393
	394	END SUBROUTINE transpose_yx
	395
	396
[164]	397	SUBROUTINE transpose_yxd( f_in, work, f_out )
[1]	398
	399	!------------------------------------------------------------------------------!
	400	! Description:
	401	! ------------
	402	! Transposition of input array (f_in) from y to x. For the input array, all
	403	! elements along y reside on the same PE, while after transposition, all
	404	! elements along x reside on the same PE.
	405	! This is a direct transposition for arrays with indices in regular order
	406	! (k,j,i) (cf. transpose_yx).
	407	!------------------------------------------------------------------------------!
	408
	409	USE cpulog
	410	USE indices
	411	USE interfaces
	412	USE pegrid
	413	USE transpose_indices
	414
	415	IMPLICIT NONE
	416
[1092]	417	INTEGER :: i, j, k, l, m, xs
[1]	418
[1003]	419	REAL :: f_in(1:nz,nys:nyn,nxl:nxr), f_inv(nxl:nxr,1:nz,nys:nyn), &
	420	f_out(0:nx,nys_x:nyn_x,nzb_x:nzt_x), &
[164]	421	work(nnxnnynnz)
[1]	422
	423	#if defined( __parallel )
	424
	425	!
	426	!-- Rearrange indices of input array in order to make data to be send
	427	!-- by MPI contiguous
[1003]	428	DO k = 1, nz
	429	DO j = nys, nyn
	430	DO i = nxl, nxr
[164]	431	f_inv(i,k,j) = f_in(k,j,i)
[1]	432	ENDDO
	433	ENDDO
	434	ENDDO
	435
	436	!
	437	!-- Transpose array
	438	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
[622]	439	IF ( collective_wait ) CALL MPI_BARRIER( comm2d, ierr )
[1]	440	CALL MPI_ALLTOALL( f_inv(nxl,1,nys), sendrecvcount_xy, MPI_REAL, &
[164]	441	work(1), sendrecvcount_xy, MPI_REAL, &
[1]	442	comm1dx, ierr )
	443	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
	444
	445	!
	446	!-- Reorder transposed array
	447	m = 0
	448	DO l = 0, pdims(1) - 1
	449	xs = 0 + l * nnx
[1003]	450	DO j = nys_x, nyn_x
	451	DO k = 1, nz
[1]	452	DO i = xs, xs + nnx - 1
	453	m = m + 1
[164]	454	f_out(i,j,k) = work(m)
[1]	455	ENDDO
	456	ENDDO
	457	ENDDO
	458	ENDDO
	459
	460	#endif
	461
	462	END SUBROUTINE transpose_yxd
	463
	464
[164]	465	SUBROUTINE transpose_yz( f_in, work, f_out )
[1]	466
	467	!------------------------------------------------------------------------------!
	468	! Description:
	469	! ------------
	470	! Transposition of input array (f_in) from y to z. For the input array, all
	471	! elements along y reside on the same PE, while after transposition, all
	472	! elements along z reside on the same PE.
	473	!------------------------------------------------------------------------------!
	474
	475	USE cpulog
	476	USE indices
	477	USE interfaces
	478	USE pegrid
	479	USE transpose_indices
	480
	481	IMPLICIT NONE
	482
[1111]	483	INTEGER :: i, j, k, l, zs
[1]	484
[1111]	485	REAL :: f_in(0:ny,nxl_y:nxr_y,nzb_y:nzt_y), f_out(nxl_z:nxr_z,nys_z:nyn_z,1:nz)
[1]	486
[1111]	487	REAL, DIMENSION(nxl_z:nxr_z,nzt_y-nzb_y+1,nys_z:nyn_z,0:pdims(1)-1) :: work
	488
	489	!$acc declare create( f_inv )
	490	REAL :: f_inv(nxl_y:nxr_y,nzb_y:nzt_y,0:ny)
	491
	492
[1]	493	!
	494	!-- Rearrange indices of input array in order to make data to be send
	495	!-- by MPI contiguous
[683]	496	!$OMP PARALLEL PRIVATE ( i, j, k )
	497	!$OMP DO
[1111]	498	!$acc kernels present( f_in )
	499	!$acc loop
[1003]	500	DO j = 0, ny
	501	DO k = nzb_y, nzt_y
[1111]	502	!$acc loop vector( 32 )
[1003]	503	DO i = nxl_y, nxr_y
[164]	504	f_inv(i,k,j) = f_in(j,i,k)
[1]	505	ENDDO
	506	ENDDO
	507	ENDDO
[1111]	508	!$acc end kernels
[683]	509	!$OMP END PARALLEL
[1]	510
	511	!
	512	!-- Move data to different array, because memory location of work1 is
	513	!-- needed further below (work1 = work2).
	514	!-- If the PE grid is one-dimensional along y, only local reordering
	515	!-- of the data is necessary and no transposition has to be done.
	516	IF ( pdims(1) == 1 ) THEN
[1106]	517
[683]	518	!$OMP PARALLEL PRIVATE ( i, j, k )
	519	!$OMP DO
[1111]	520	!$acc kernels present( f_out )
	521	!$acc loop
[1003]	522	DO j = 0, ny
	523	DO k = nzb_y, nzt_y
[1111]	524	!$acc loop vector( 32 )
[1003]	525	DO i = nxl_y, nxr_y
[164]	526	f_out(i,j,k) = f_inv(i,k,j)
[1]	527	ENDDO
	528	ENDDO
	529	ENDDO
[1111]	530	!$acc end kernels
[683]	531	!$OMP END PARALLEL
[1]	532
[1106]	533	ELSE
	534
	535	#if defined( __parallel )
[1]	536	!
[1106]	537	!-- Transpose array
	538	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
	539	IF ( collective_wait ) CALL MPI_BARRIER( comm2d, ierr )
[1111]	540	!$acc update host( f_inv )
	541	CALL MPI_ALLTOALL( f_inv(nxl_y,nzb_y,0), sendrecvcount_yz, MPI_REAL, &
	542	work(nxl_z,1,nys_z,0), sendrecvcount_yz, MPI_REAL, &
[1106]	543	comm1dx, ierr )
[1111]	544	!$acc update device( work )
[1106]	545	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
[1]	546
	547	!
[1106]	548	!-- Reorder transposed array
[1111]	549	!$OMP PARALLEL PRIVATE ( i, j, k, l, zs )
[683]	550	!$OMP DO
[1106]	551	DO l = 0, pdims(1) - 1
	552	zs = 1 + l * ( nzt_y - nzb_y + 1 )
[1111]	553	!$acc kernels present( f_out, work )
	554	!$acc loop
[1106]	555	DO j = nys_z, nyn_z
	556	DO k = zs, zs + nzt_y - nzb_y
[1111]	557	!$acc loop vector( 32 )
[1106]	558	DO i = nxl_z, nxr_z
[1111]	559	f_out(i,j,k) = work(i,k-zs+1,j,l)
[1106]	560	ENDDO
[1]	561	ENDDO
	562	ENDDO
[1111]	563	!$acc end kernels
[1]	564	ENDDO
[683]	565	!$OMP END PARALLEL
[1]	566	#endif
	567
[1106]	568	ENDIF
	569
[1]	570	END SUBROUTINE transpose_yz
	571
	572
[164]	573	SUBROUTINE transpose_zx( f_in, work, f_out )
[1]	574
	575	!------------------------------------------------------------------------------!
	576	! Description:
	577	! ------------
	578	! Transposition of input array (f_in) from z to x. For the input array, all
	579	! elements along z reside on the same PE, while after transposition, all
	580	! elements along x reside on the same PE.
	581	!------------------------------------------------------------------------------!
	582
	583	USE cpulog
	584	USE indices
	585	USE interfaces
	586	USE pegrid
	587	USE transpose_indices
	588
	589	IMPLICIT NONE
	590
[1111]	591	INTEGER :: i, j, k, l, xs
[1]	592
[1111]	593	REAL :: f_in(1:nz,nys:nyn,nxl:nxr), f_out(0:nx,nys_x:nyn_x,nzb_x:nzt_x)
[1]	594
[1111]	595	REAL, DIMENSION(nys_x:nyn_x,nnx,nzb_x:nzt_x,0:pdims(1)-1) :: work
	596
	597	!$acc declare create( f_inv )
[1106]	598	REAL :: f_inv(nys:nyn,nxl:nxr,1:nz)
[1]	599
[1106]	600
[1]	601	!
	602	!-- Rearrange indices of input array in order to make data to be send
	603	!-- by MPI contiguous
[683]	604	!$OMP PARALLEL PRIVATE ( i, j, k )
	605	!$OMP DO
[1106]	606	!$acc kernels present( f_in )
	607	!$acc loop
[1003]	608	DO k = 1,nz
	609	DO i = nxl, nxr
[1106]	610	!$acc loop vector( 32 )
[1003]	611	DO j = nys, nyn
[164]	612	f_inv(j,i,k) = f_in(k,j,i)
[1]	613	ENDDO
	614	ENDDO
	615	ENDDO
[1111]	616	!$acc end kernels
[683]	617	!$OMP END PARALLEL
[1]	618
	619	!
	620	!-- Move data to different array, because memory location of work1 is
	621	!-- needed further below (work1 = work2).
	622	!-- If the PE grid is one-dimensional along y, only local reordering
	623	!-- of the data is necessary and no transposition has to be done.
	624	IF ( pdims(1) == 1 ) THEN
[1106]	625
[683]	626	!$OMP PARALLEL PRIVATE ( i, j, k )
	627	!$OMP DO
[1106]	628	!$acc kernels present( f_out )
	629	!$acc loop
[1003]	630	DO k = 1, nz
	631	DO i = nxl, nxr
[1106]	632	!$acc loop vector( 32 )
[1003]	633	DO j = nys, nyn
[164]	634	f_out(i,j,k) = f_inv(j,i,k)
[1]	635	ENDDO
	636	ENDDO
	637	ENDDO
[1111]	638	!$acc end kernels
[683]	639	!$OMP END PARALLEL
[1]	640
[1106]	641	ELSE
	642
	643	#if defined( __parallel )
[1]	644	!
[1106]	645	!-- Transpose array
	646	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
	647	IF ( collective_wait ) CALL MPI_BARRIER( comm2d, ierr )
[1111]	648	!$acc update host( f_inv )
	649	CALL MPI_ALLTOALL( f_inv(nys,nxl,1), sendrecvcount_zx, MPI_REAL, &
	650	work(nys_x,1,nzb_x,0), sendrecvcount_zx, MPI_REAL, &
[1106]	651	comm1dx, ierr )
[1111]	652	!$acc update device( work )
[1106]	653	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
[1]	654
	655	!
[1106]	656	!-- Reorder transposed array
[1111]	657	!$OMP PARALLEL PRIVATE ( i, j, k, l, xs )
[683]	658	!$OMP DO
[1106]	659	DO l = 0, pdims(1) - 1
	660	xs = 0 + l * nnx
[1111]	661	!$acc kernels present( f_out, work )
	662	!$acc loop
[1106]	663	DO k = nzb_x, nzt_x
	664	DO i = xs, xs + nnx - 1
[1111]	665	!$acc loop vector( 32 )
[1106]	666	DO j = nys_x, nyn_x
[1111]	667	f_out(i,j,k) = work(j,i-xs+1,k,l)
[1106]	668	ENDDO
[1]	669	ENDDO
	670	ENDDO
[1111]	671	!$acc end kernels
[1]	672	ENDDO
[683]	673	!$OMP END PARALLEL
[1]	674	#endif
	675
[1106]	676	ENDIF
	677
[1]	678	END SUBROUTINE transpose_zx
	679
	680
[164]	681	SUBROUTINE transpose_zy( f_in, work, f_out )
[1]	682
	683	!------------------------------------------------------------------------------!
	684	! Description:
	685	! ------------
	686	! Transposition of input array (f_in) from z to y. For the input array, all
	687	! elements along z reside on the same PE, while after transposition, all
	688	! elements along y reside on the same PE.
	689	!------------------------------------------------------------------------------!
	690
	691	USE cpulog
	692	USE indices
	693	USE interfaces
	694	USE pegrid
	695	USE transpose_indices
	696
	697	IMPLICIT NONE
	698
[1111]	699	INTEGER :: i, j, k, l, zs
[1]	700
[1111]	701	REAL :: f_in(nxl_z:nxr_z,nys_z:nyn_z,1:nz), f_out(0:ny,nxl_y:nxr_y,nzb_y:nzt_y)
[1]	702
[1111]	703	REAL, DIMENSION(nxl_z:nxr_z,nzt_y-nzb_y+1,nys_z:nyn_z,0:pdims(1)-1) :: work
	704
	705	!$acc declare create( f_inv )
	706	REAL :: f_inv(nxl_y:nxr_y,nzb_y:nzt_y,0:ny)
	707
	708
[1]	709	!
	710	!-- If the PE grid is one-dimensional along y, the array has only to be
	711	!-- reordered locally and therefore no transposition has to be done.
	712	IF ( pdims(1) /= 1 ) THEN
[1106]	713
	714	#if defined( __parallel )
[1]	715	!
	716	!-- Reorder input array for transposition
[1111]	717	!$OMP PARALLEL PRIVATE ( i, j, k, l, zs )
[683]	718	!$OMP DO
[1]	719	DO l = 0, pdims(1) - 1
[1003]	720	zs = 1 + l * ( nzt_y - nzb_y + 1 )
[1111]	721	!$acc kernels present( f_in, work )
	722	!$acc loop
[1003]	723	DO j = nys_z, nyn_z
	724	DO k = zs, zs + nzt_y - nzb_y
[1111]	725	!$acc loop vector( 32 )
[1003]	726	DO i = nxl_z, nxr_z
[1111]	727	work(i,k-zs+1,j,l) = f_in(i,j,k)
[1]	728	ENDDO
	729	ENDDO
	730	ENDDO
[1111]	731	!$acc end kernels
[1]	732	ENDDO
[683]	733	!$OMP END PARALLEL
[1]	734
	735	!
	736	!-- Transpose array
	737	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
[622]	738	IF ( collective_wait ) CALL MPI_BARRIER( comm2d, ierr )
[1111]	739	!$acc update host( work )
	740	CALL MPI_ALLTOALL( work(nxl_z,1,nys_z,0), sendrecvcount_yz, MPI_REAL, &
	741	f_inv(nxl_y,nzb_y,0), sendrecvcount_yz, MPI_REAL, &
[1]	742	comm1dx, ierr )
[1111]	743	!$acc update device( f_inv )
[1]	744	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
[1106]	745	#endif
[1]	746
	747	ELSE
	748	!
[1106]	749	!-- Reorder the array in the same way like ALLTOALL did it
[683]	750	!$OMP PARALLEL PRIVATE ( i, j, k )
	751	!$OMP DO
[1111]	752	!$acc kernels present( f_in )
	753	!$acc loop
[1003]	754	DO k = nzb_y, nzt_y
	755	DO j = 0, ny
[1111]	756	!$acc loop vector( 32 )
[1003]	757	DO i = nxl_y, nxr_y
[164]	758	f_inv(i,k,j) = f_in(i,j,k)
	759	ENDDO
	760	ENDDO
	761	ENDDO
[1111]	762	!$acc end kernels
[683]	763	!$OMP END PARALLEL
[1106]	764
	765	ENDIF
	766
[164]	767	!
[1106]	768	!-- Reorder transposed array in a way that the y index is in first position
[683]	769	!$OMP PARALLEL PRIVATE ( i, j, k )
	770	!$OMP DO
[1111]	771	!$acc kernels present( f_out )
	772	!$acc loop
[1106]	773	DO k = nzb_y, nzt_y
	774	DO i = nxl_y, nxr_y
[1111]	775	!$acc loop vector( 32 )
[1106]	776	DO j = 0, ny
	777	f_out(j,i,k) = f_inv(i,k,j)
[1]	778	ENDDO
	779	ENDDO
[1106]	780	ENDDO
[1111]	781	!$acc end kernels
[683]	782	!$OMP END PARALLEL
[164]	783
[1]	784	END SUBROUTINE transpose_zy
	785
	786
[164]	787	SUBROUTINE transpose_zyd( f_in, work, f_out )
[1]	788
	789	!------------------------------------------------------------------------------!
	790	! Description:
	791	! ------------
	792	! Transposition of input array (f_in) from z to y. For the input array, all
	793	! elements along z reside on the same PE, while after transposition, all
	794	! elements along y reside on the same PE.
	795	! This is a direct transposition for arrays with indices in regular order
	796	! (k,j,i) (cf. transpose_zy).
	797	!------------------------------------------------------------------------------!
	798
	799	USE cpulog
	800	USE indices
	801	USE interfaces
	802	USE pegrid
	803	USE transpose_indices
	804
	805	IMPLICIT NONE
	806
	807	INTEGER :: i, j, k, l, m, ys
	808
[1003]	809	REAL :: f_in(1:nz,nys:nyn,nxl:nxr), f_inv(nys:nyn,nxl:nxr,1:nz), &
	810	f_out(0:ny,nxl_yd:nxr_yd,nzb_yd:nzt_yd), &
[164]	811	work(nnxnnynnz)
[1]	812
	813	#if defined( __parallel )
	814
	815	!
	816	!-- Rearrange indices of input array in order to make data to be send
	817	!-- by MPI contiguous
[1003]	818	DO i = nxl, nxr
	819	DO j = nys, nyn
	820	DO k = 1, nz
[164]	821	f_inv(j,i,k) = f_in(k,j,i)
[1]	822	ENDDO
	823	ENDDO
	824	ENDDO
	825
	826	!
	827	!-- Move data to different array, because memory location of work1 is
	828	!-- needed further below (work1 = work2).
	829	!-- If the PE grid is one-dimensional along x, only local reordering
	830	!-- of the data is necessary and no transposition has to be done.
	831	IF ( pdims(2) == 1 ) THEN
[1003]	832	DO k = 1, nz
	833	DO i = nxl, nxr
	834	DO j = nys, nyn
[164]	835	f_out(j,i,k) = f_inv(j,i,k)
[1]	836	ENDDO
	837	ENDDO
	838	ENDDO
	839	RETURN
	840	ENDIF
	841
	842	!
	843	!-- Transpose array
	844	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
[622]	845	IF ( collective_wait ) CALL MPI_BARRIER( comm2d, ierr )
[1]	846	CALL MPI_ALLTOALL( f_inv(nys,nxl,1), sendrecvcount_zyd, MPI_REAL, &
[164]	847	work(1), sendrecvcount_zyd, MPI_REAL, &
[1]	848	comm1dy, ierr )
	849	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
	850
	851	!
	852	!-- Reorder transposed array
	853	m = 0
	854	DO l = 0, pdims(2) - 1
	855	ys = 0 + l * nny
[1003]	856	DO k = nzb_yd, nzt_yd
	857	DO i = nxl_yd, nxr_yd
[1]	858	DO j = ys, ys + nny - 1
	859	m = m + 1
[164]	860	f_out(j,i,k) = work(m)
[1]	861	ENDDO
	862	ENDDO
	863	ENDDO
	864	ENDDO
	865
	866	#endif
	867
	868	END SUBROUTINE transpose_zyd

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: palm/trunk/SOURCE/transpose.f90 @ 1111

Download in other formats: