Home

Context Navigation

poisfft_hybrid.f90 @ 1111

Last change on this file since 1111 was 1111, checked in by raasch, 11 years ago

New:
---

GPU porting of pres, swap_timelevel. Adjustments of openACC directives.
Further porting of poisfft, which now runs completely on GPU without any
host/device data transfer for serial an parallel runs (but parallel runs
require data transfer before and after the MPI transpositions).
GPU-porting of tridiagonal solver:
tridiagonal routines split into extermal subroutines (instead using CONTAINS),
no distinction between parallel/non-parallel in poisfft and tridia any more,
tridia routines moved to end of file because of probable bug in PGI compiler
(otherwise "invalid device function" is indicated during runtime).
(cuda_fft_interfaces, fft_xy, flow_statistics, init_3d_model, palm, poisfft, pres, prognostic_equations, swap_timelevel, time_integration, transpose)
output of accelerator board information. (header)

optimization of tridia routines: constant elements and coefficients of tri are
stored in seperate arrays ddzuw and tric, last dimension of tri reduced from 5 to 2,
(init_grid, init_3d_model, modules, palm, poisfft)

poisfft_init is now called internally from poisfft,
(Makefile, Makefile_check, init_pegrid, poisfft, poisfft_hybrid)

CPU-time per grid point and timestep is output to CPU_MEASURES file
(cpu_statistics, modules, time_integration)

Changed:

resorting from/to array work changed, work now has 4 dimensions instead of 1 (transpose)
array diss allocated only if required (init_3d_model)

pressure boundary condition "Neumann+inhomo" removed from the code
(check_parameters, header, poisfft, poisfft_hybrid, pres)

Errors:

bugfix: dependency added for cuda_fft_interfaces (Makefile)
bugfix: CUDA fft plans adjusted for domain decomposition (before they always
used total domain) (fft_xy)

Property svn:keywords set to Id

File size: 33.3 KB

Rev	Line
[1]	1	MODULE poisfft_hybrid_mod
[808]	2
[1036]	3	!--------------------------------------------------------------------------------!
	4	! This file is part of PALM.
	5	!
	6	! PALM is free software: you can redistribute it and/or modify it under the terms
	7	! of the GNU General Public License as published by the Free Software Foundation,
	8	! either version 3 of the License, or (at your option) any later version.
	9	!
	10	! PALM is distributed in the hope that it will be useful, but WITHOUT ANY
	11	! WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
	12	! A PARTICULAR PURPOSE. See the GNU General Public License for more details.
	13	!
	14	! You should have received a copy of the GNU General Public License along with
	15	! PALM. If not, see <http://www.gnu.org/licenses/>.
	16	!
	17	! Copyright 1997-2012 Leibniz University Hannover
	18	!--------------------------------------------------------------------------------!
	19	!
[257]	20	! Current revisions:
[1]	21	! -----------------
[1111]	22	! poisfft_hybrid_ini is now called internally from poisfft_hybrid,
	23	! ibc_p_b = 2 removed
[1]	24	!
	25	! Former revisions:
	26	! -----------------
[3]	27	! $Id: poisfft_hybrid.f90 1111 2013-03-08 23:54:10Z raasch $
[392]	28	!
[1107]	29	! 1106 2013-03-04 05:31:38Z raasch
	30	! calls of fft_x, fft_y replaced by fft_x_1d, fft_y_1d
	31	!
[1037]	32	! 1036 2012-10-22 13:43:42Z raasch
	33	! code put under GPL (PALM 3.9)
	34	!
[1014]	35	! 1013 2012-09-21 07:03:55Z raasch
	36	! FLOAT type conversion replaced by REAL
	37	!
[810]	38	! 809 2012-01-30 13:32:58Z maronga
	39	! Bugfix: replaced .AND. and .NOT. with && and ! in the preprocessor directives
	40	!
[808]	41	! 807 2012-01-25 11:53:51Z maronga
	42	! New cpp directive "__check" implemented which is used by check_namelist_files
	43	! (most of the code is unneeded by check_namelist_files).
	44	!
[668]	45	! 667 2010-12-23 12:06:00Z suehring/gryschka
	46	! ddzu replaced by ddzu_pres due to changes in zu(0)
	47	!
[482]	48	! 415 2009-12-15 10:26:23Z raasch
[807]	49	! Dimension of array stat in cascade change to prevent type problems with___
[482]	50	! mpi2 libraries
	51	!
[392]	52	! 274 2009-03-26 15:11:21Z heinze
	53	! Output of messages replaced by message handling routine.
	54	!
	55	! Feb. 2007
[3]	56	! RCS Log replace by Id keyword, revision history cleaned up
	57	!
[1]	58	! Revision 1.11 2004/04/30 12:43:14 raasch
	59	! Renaming of fft routines, additional argument in calls of fft_y_m
	60	!
	61	! Revision 1.2 2002/12/19 16:08:31 raasch
	62	! Preprocessor directive KKMP introduced (OMP does NOT work),
	63	! array tri will be a shared array in OpenMP loop, to get better cache
	64	! utilization, the i index (x-direction) will be executed in stride
	65	! "istride" as outer loop and in a shorter inner loop,
	66	! overlapping of computation and communication realized by new routine
	67	! poisfft_hybrid_nodes, name of old routine poisfft_hybrid changed to
	68	! poisfft_hybrid_omp, STOP statement replaced by call of subroutine local_stop
	69	!
	70	!
	71	! Description:
	72	! ------------
	73	! Solution of the Poisson equation with a 2D spectral method.
	74	! Hybrid version for parallel computers using a 1D domain decomposition,
	75	! realized with MPI, along x and parallelization with OPEN-MP along y
	76	! (routine poisfft_hybrid_omp). In a second version (poisfft_hybrid_nodes),
	77	! optimization is realized by overlapping of computation and communication
	78	! and by simultaneously executing as many communication calls as switches
	79	! per logical partition (LPAR) are available. This version comes into
	80	! effect if more than one node is used and if the environment variable
	81	! tasks_per_node is set in a way that it can be devided by switch_per_lpar
	82	! without any rest.
	83	!
	84	! WARNING: In case of OpenMP, there are problems with allocating large
	85	! arrays in parallel regions.
	86	!
	87	! Copyright Klaus Ketelsen / Siegfried Raasch May 2002
	88	!------------------------------------------------------------------------------!
	89
	90	USE fft_xy
	91	USE indices
	92	USE pegrid
	93
	94	IMPLICIT NONE
	95
	96	INTEGER, PARAMETER :: switch_per_lpar = 2
	97
	98	INTEGER, SAVE :: nxl_a, nxr_a, & ! total x dimension
	99	nxl_p, nxr_p, & ! partial x dimension
	100	nys_a, nyn_a, & ! total y dimension
	101	nys_p, nyn_p, & ! partial y dimension
	102
	103	npe_s, & ! total number of PEs for solver
	104	nwords, & ! number of points to be exchanged
	105	! with MPI_ALLTOALL
	106	n_omp_threads ! number of OpenMP threads
	107
	108	!
	109	!-- Variables for multi node version (cluster version) using routine
	110	!-- poisfft_hybrid_nodes
	111	INTEGER, SAVE :: comm_nodes, & ! communicater nodes
	112	comm_node_all, & ! communicater all PEs node version
	113	comm_tasks, & ! communicater tasks
	114	me, me_node, me_task,& ! identity of this PE
	115	nodes, & ! number of nodes
	116	tasks_per_logical_node = -1 ! default no cluster
	117
[1111]	118	LOGICAL, SAVE :: poisfft_initialized = .FALSE.
[807]	119
[1111]	120
[807]	121	PRIVATE
	122
	123
[809]	124	#if ! defined ( __check )
[807]	125	PUBLIC poisfft_hybrid, poisfft_hybrid_ini
	126
	127
[1]	128	!
	129	!-- Public interfaces
	130	INTERFACE poisfft_hybrid_ini
	131	MODULE PROCEDURE poisfft_hybrid_ini
	132	END INTERFACE poisfft_hybrid_ini
	133
	134	INTERFACE poisfft_hybrid
	135	MODULE PROCEDURE poisfft_hybrid
	136	END INTERFACE poisfft_hybrid
	137
	138	!
	139	!-- Private interfaces
	140	INTERFACE poisfft_hybrid_omp
	141	MODULE PROCEDURE poisfft_hybrid_omp
	142	END INTERFACE poisfft_hybrid_omp
	143
	144	INTERFACE poisfft_hybrid_omp_vec
	145	MODULE PROCEDURE poisfft_hybrid_omp_vec
	146	END INTERFACE poisfft_hybrid_omp_vec
	147
	148	INTERFACE poisfft_hybrid_nodes
	149	MODULE PROCEDURE poisfft_hybrid_nodes
	150	END INTERFACE poisfft_hybrid_nodes
	151
	152	INTERFACE tridia_hybrid
	153	MODULE PROCEDURE tridia_hybrid
	154	END INTERFACE tridia_hybrid
	155
	156	INTERFACE cascade
	157	MODULE PROCEDURE cascade
	158	END INTERFACE cascade
[807]	159	#else
	160	PUBLIC poisfft_hybrid_ini
[1]	161
[807]	162	!
	163	!-- Public interfaces
	164	INTERFACE poisfft_hybrid_ini
	165	MODULE PROCEDURE poisfft_hybrid_ini
	166	END INTERFACE poisfft_hybrid_ini
	167	#endif
	168
[1]	169	CONTAINS
	170
[807]	171
[1]	172	SUBROUTINE poisfft_hybrid_ini
	173
	174	USE control_parameters
	175	USE pegrid
	176
	177	IMPLICIT NONE
	178
	179	CHARACTER(LEN=8) :: cdummy
	180	INTEGER :: idummy, istat
	181	INTEGER, DIMENSION(2) :: coords, dims
	182
	183	LOGICAL, DIMENSION(2) :: period = .false., re_dims
	184
	185
	186	!
	187	!-- Set the internal index values for the hybrid solver
	188	#if defined( __parallel )
	189	npe_s = pdims(1)
	190	#else
	191	npe_s = 1
	192	#endif
	193	nxl_a = 0
	194	nxr_a = nx
	195	nxl_p = 0
	196	nxr_p = ( ( nx+1 ) / npe_s ) - 1
	197	nys_a = nys
	198	nyn_a = nyn
	199	nys_p = 0
	200	nyn_p = ( ( ny+1 ) / npe_s ) - 1
	201
	202	nwords = ( nxr_p-nxl_p+1 ) * nz * ( nyn_p-nys_p+1 )
	203
[809]	204	#if defined( __KKMP ) && ! defined ( __check )
[1]	205	CALL LOCAL_GETENV( 'OMP_NUM_THREADS', 15, cdummy, idummy )
	206	READ ( cdummy, '(I8)' ) n_omp_threads
[257]	207	IF ( n_omp_threads > 1 ) THEN
	208	WRITE( message_string, * ) 'Number of OpenMP threads = ', &
	209	n_omp_threads
	210	CALL message( 'poisfft_hybrid_ini', 'PA0280', 0, 0, 0, 6, 0 )
[1]	211	ENDIF
	212	#else
	213	n_omp_threads = 1
	214	#endif
	215	!
	216	!-- Initialize the one-dimensional FFT routines
	217	CALL fft_init
	218
	219	!
	220	!-- Setup for multi node version (poisfft_hybrid_nodes)
	221	IF ( n_omp_threads == 1 .AND. &
	222	( host(1:4) == 'ibmh' .OR. host(1:4) == 'ibmb' ) ) THEN
	223
	224	IF ( tasks_per_node /= -9999 ) THEN
	225	!
	226	!-- Multi node version requires that the available number of
	227	!-- switches per logical partition must be an integral divisor
	228	!-- of the chosen number of tasks per node
	229	IF ( MOD( tasks_per_node, switch_per_lpar ) == 0 ) THEN
	230	!
	231	!-- Set the switch which decides about usage of the multi node
	232	!-- version
	233	IF ( tasks_per_node / switch_per_lpar > 1 .AND. &
	234	numprocs > tasks_per_node ) THEN
	235	tasks_per_logical_node = tasks_per_node / switch_per_lpar
	236	ENDIF
	237
[257]	238	IF ( tasks_per_logical_node > -1 ) THEN
	239
	240	WRITE( message_string, * ) 'running optimized ', &
	241	'multinode version', &
	242	'&switch_per_lpar = ', &
	243	switch_per_lpar, &
	244	'&tasks_per_lpar = ', &
	245	tasks_per_node, &
	246	'tasks_per_logical_node = ', &
	247	tasks_per_logical_node
[274]	248	CALL message( 'poisfft_hybrid_ini', 'PA0281', 0, 0, 0, 6, 0 )
[257]	249
[1]	250	ENDIF
	251
	252	ENDIF
	253	ENDIF
	254	ENDIF
	255
	256	!
	257	!-- Determine sub-topologies for multi node version
	258	IF ( tasks_per_logical_node >= 2 ) THEN
	259
[809]	260	#if defined( __parallel ) && ! defined ( __check )
[1]	261	nodes = ( numprocs + tasks_per_logical_node - 1 ) / &
	262	tasks_per_logical_node
	263	dims(1) = nodes
	264	dims(2) = tasks_per_logical_node
	265
	266	CALL MPI_CART_CREATE( comm2d, 2, dims, period, .FALSE., &
	267	comm_node_all, istat )
	268	CALL MPI_COMM_RANK( comm_node_all, me, istat )
	269
	270	re_dims(1) = .TRUE.
	271	re_dims(2) = .FALSE.
	272	CALL MPI_CART_SUB( comm_node_all, re_dims, comm_nodes, istat )
	273	CALL MPI_COMM_RANK( comm_nodes, me_node, istat )
	274
	275	re_dims(1) = .FALSE.
	276	re_dims(2) = .TRUE.
	277	CALL MPI_CART_SUB( comm_node_all, re_dims, comm_tasks, istat )
	278	CALL MPI_COMM_RANK( comm_tasks, me_task, istat )
	279
	280	! write(0,*) 'who am i',myid,me,me_node,me_task,nodes,&
	281	! tasks_per_logical_node
[809]	282	#elif ! defined( __parallel )
[257]	283	message_string = 'parallel environment (MPI) required'
	284	CALL message( 'poisfft_hybrid_ini', 'PA0282', 1, 2, 0, 6, 0 )
[1]	285	#endif
	286	ENDIF
	287
[1111]	288	poisfft_initialized = .TRUE.
	289
[1]	290	END SUBROUTINE poisfft_hybrid_ini
	291
[809]	292	#if ! defined ( __check )
[1]	293	SUBROUTINE poisfft_hybrid( ar )
	294
	295	USE control_parameters
	296	USE interfaces
	297
	298	IMPLICIT NONE
	299
	300	REAL, DIMENSION(1:nz,nys:nyn,nxl:nxr) :: ar
	301
[1111]	302	IF ( .NOT. poisfft_initialized ) CALL poisfft_hybrid_ini
	303
[1]	304	IF ( host(1:3) == 'nec' ) THEN
	305	CALL poisfft_hybrid_omp_vec( ar )
	306	ELSE
	307	IF ( tasks_per_logical_node == -1 ) THEN
	308	CALL poisfft_hybrid_omp( ar )
	309	ELSE
	310	CALL poisfft_hybrid_nodes( ar )
	311	ENDIF
	312	ENDIF
	313
	314	END SUBROUTINE poisfft_hybrid
	315
	316
	317	SUBROUTINE poisfft_hybrid_omp ( ar )
	318
	319	USE cpulog
	320	USE interfaces
	321
	322	IMPLICIT NONE
	323
	324	INTEGER, PARAMETER :: istride = 4 ! stride of i loop
	325	INTEGER :: i, ii, ir, iei, iouter, istat, j, jj, k, m, n, jthread
	326
	327	REAL, DIMENSION(1:nz,nys:nyn,nxl:nxr) :: ar
	328
	329	REAL, DIMENSION(0:nx) :: fftx_ar
	330	REAL, DIMENSION(0:ny,istride) :: ffty_ar
	331
	332	REAL, DIMENSION(0:nx,nz) :: tri_ar
	333
	334	REAL, DIMENSION(nxl_p:nxr_p,nz,nys_p:nyn_p,npe_s) :: work1, work2
	335	#if defined( __KKMP )
	336	INTEGER :: omp_get_thread_num
	337	REAL, DIMENSION(:,:,:,:), ALLOCATABLE :: tri
	338	ALLOCATE( tri(5,0:nx,0:nz-1,n_omp_threads ) )
	339	#else
	340	REAL, DIMENSION(5,0:nx,0:nz-1,1) :: tri
	341	#endif
	342
	343
	344	CALL cpu_log( log_point_s(30), 'poisfft_hybrid_omp', 'start' )
	345
[1106]	346	CALL cpu_log( log_point_s(7), 'fft_y_1d', 'start' )
[1]	347
	348	!$OMP PARALLEL PRIVATE (i,iouter,ii,ir,iei,j,k,m,n,ffty_ar)
	349	!$OMP DO
	350	!
	351	!-- Store grid points to be transformed on a 1d-array, do the fft
	352	!-- and sample the results on a 4d-array
	353	DO iouter = nxl_p, nxr_p, istride ! stride loop, better cache
	354	iei = MIN( iouter+istride-1, nxr_p )
	355	DO k = 1, nz
	356
	357	DO i = iouter, iei
	358	ii = nxl + i
	359	ir = i - iouter + 1
	360
	361	DO j = nys_a, nyn_a
	362	ffty_ar(j,ir) = ar(k,j,ii)
	363	ENDDO
	364
[1106]	365	CALL fft_y_1d( ffty_ar(:,ir), 'forward' )
[1]	366	ENDDO
	367
	368	m = nys_a
	369	DO n = 1, npe_s
	370	DO j = nys_p, nyn_p
	371	DO i = iouter, iei
	372	ir = i - iouter + 1
	373	work1(i,k,j,n) = ffty_ar(m,ir)
	374	ENDDO
	375	m = m+1
	376	ENDDO
	377	ENDDO
	378
	379	ENDDO
	380	ENDDO
	381	!$OMP END PARALLEL
	382
[1106]	383	CALL cpu_log( log_point_s(7), 'fft_y_1d', 'pause' )
[1]	384
	385	#if defined( __parallel )
	386	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
	387
	388	CALL MPI_ALLTOALL( work1(nxl_p,1,nys_p,1), nwords, MPI_REAL, &
	389	work2(nxl_p,1,nys_p,1), nwords, MPI_REAL, &
	390	comm2d, istat )
	391
	392	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
	393	#else
	394	work2 = work1
	395	#endif
	396
[1106]	397	CALL cpu_log( log_point_s(33), 'fft_x_1d + tridia', 'start' )
[1]	398
	399	#if defined( __KKMP )
	400	!$OMP PARALLEL PRIVATE (i,j,jj,k,m,n,fftx_ar,tri_ar,jthread)
	401	!$OMP DO
	402	DO j = nys_p, nyn_p
	403	jthread = omp_get_thread_num() + 1
	404	#else
	405	DO j = nys_p, nyn_p
	406	jthread = 1
	407	#endif
	408	DO k = 1, nz
	409
	410	m = nxl_a
	411	DO n = 1, npe_s
	412	DO i = nxl_p, nxr_p
	413	fftx_ar(m) = work2(i,k,j,n)
	414	m = m+1
	415	ENDDO
	416	ENDDO
	417
[1106]	418	CALL fft_x_1d( fftx_ar, 'forward' )
[1]	419
	420	DO i = nxl_a, nxr_a
	421	tri_ar(i,k) = fftx_ar(i)
	422	ENDDO
	423
	424	ENDDO
	425
	426	jj = myid * (nyn_p-nys_p+1) + j
	427	CALL tridia_hybrid( jj, tri_ar, tri(:,:,:,jthread))
	428
	429	DO k = 1, nz
	430	DO i = nxl_a, nxr_a
	431	fftx_ar(i) = tri_ar (i,k)
	432	ENDDO
	433
[1106]	434	CALL fft_x_1d( fftx_ar, 'backward' )
[1]	435
	436	m = nxl_a
	437	DO n = 1, npe_s
	438	DO i = nxl_p, nxr_p
	439	work2(i,k,j,n) = fftx_ar(m)
	440	m = m+1
	441	ENDDO
	442	ENDDO
	443
	444	ENDDO
	445	ENDDO
	446	#if defined( __KKMP )
	447	!$OMP END PARALLEL
	448	#endif
	449
[1106]	450	CALL cpu_log( log_point_s(33), 'fft_x_1d + tridia', 'stop' )
[1]	451
	452	#if defined( __parallel )
	453	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
	454	nwords = (nxr_p-nxl_p+1) * nz * (nyn_p-nys_p+1)
	455
	456	CALL MPI_ALLTOALL( work2(nxl_p,1,nys_p,1), nwords, MPI_REAL, &
	457	work1(nxl_p,1,nys_p,1), nwords, MPI_REAL, &
	458	comm2d, istat )
	459
	460	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
	461	#else
	462	work1 = work2
	463	#endif
	464
[1106]	465	CALL cpu_log( log_point_s(7), 'fft_y_1d', 'continue' )
[1]	466
	467	!$OMP PARALLEL PRIVATE (i,iouter,ii,ir,iei,j,k,m,n,ffty_ar)
	468	!$OMP DO
	469	DO iouter = nxl_p, nxr_p, istride
	470	iei = MIN( iouter+istride-1, nxr_p )
	471	DO k = 1, nz
	472
	473	m = nys_a
	474	DO n = 1, npe_s
	475	DO j = nys_p, nyn_p
	476	DO i = iouter, iei
	477	ir = i - iouter + 1
	478	ffty_ar(m,ir) = work1 (i,k,j,n)
	479	ENDDO
	480	m = m+1
	481	ENDDO
	482	ENDDO
	483
	484	DO i = iouter, iei
	485	ii = nxl + i
	486	ir = i - iouter + 1
[1106]	487	CALL fft_y_1d( ffty_ar(:,ir), 'backward' )
[1]	488
	489	DO j = nys_a, nyn_a
	490	ar(k,j,ii) = ffty_ar(j,ir)
	491	ENDDO
	492	ENDDO
	493
	494	ENDDO
	495	ENDDO
	496	!$OMP END PARALLEL
	497
[1106]	498	CALL cpu_log( log_point_s(7), 'fft_y_1d', 'stop' )
[1]	499
	500	CALL cpu_log( log_point_s(30), 'poisfft_hybrid_omp', 'stop' )
	501
	502	#if defined( __KKMP )
	503	DEALLOCATE( tri )
	504	#endif
	505
	506	END SUBROUTINE poisfft_hybrid_omp
	507
	508
	509	SUBROUTINE poisfft_hybrid_omp_vec ( ar )
	510
	511	USE cpulog
	512	USE interfaces
	513
	514	IMPLICIT NONE
	515
	516	INTEGER, PARAMETER :: istride = 4 ! stride of i loop
	517	INTEGER :: i, ii, ir, iei, iouter, istat, j, jj, k, m, n, jthread
	518
	519	REAL, DIMENSION(0:nx,nz) :: tri_ar
	520
	521	REAL, DIMENSION(1:nz,nys:nyn,nxl:nxr) :: ar
	522
	523	REAL, DIMENSION(0:ny+3,nz,nxl_p:nxr_p) :: ffty_ar3
	524	REAL, DIMENSION(0:nx+3,nz,nys_p:nyn_p) :: fftx_ar3
	525
	526	REAL, DIMENSION(nxl_p:nxr_p,nz,nys_p:nyn_p,npe_s) :: work1, work2
	527	#if defined( __KKMP )
	528	INTEGER :: omp_get_thread_num
	529	REAL, DIMENSION(:,:,:,:), ALLOCATABLE :: tri
	530	ALLOCATE( tri(5,0:nx,0:nz-1,n_omp_threads ) )
	531	#else
	532	REAL, DIMENSION(5,0:nx,0:nz-1,1) :: tri
	533	#endif
	534
	535
	536	CALL cpu_log( log_point_s(30), 'poisfft_hybrid_vec', 'start' )
	537
	538	CALL cpu_log( log_point_s(7), 'fft_y_m', 'start' )
	539
	540	!$OMP PARALLEL PRIVATE (i,j,k,m,n)
	541	!$OMP DO
	542	!
	543	!-- Store grid points to be transformed on a 1d-array, do the fft
	544	!-- and sample the results on a 4d-array
	545	DO i = nxl_p, nxr_p
	546
	547	DO j = nys_a, nyn_a
	548	DO k = 1, nz
	549	ffty_ar3(j,k,i) = ar(k,j,i+nxl)
	550	ENDDO
	551	ENDDO
	552
	553	CALL fft_y_m( ffty_ar3(:,:,i), ny+3, 'forward' )
	554	ENDDO
	555
	556	!$OMP DO
	557	DO k = 1, nz
	558	m = nys_a
	559	DO n = 1, npe_s
	560	DO j = nys_p, nyn_p
	561	DO i = nxl_p, nxr_p
	562	work1(i,k,j,n) = ffty_ar3(m,k,i)
	563	ENDDO
	564	m = m+1
	565	ENDDO
	566	ENDDO
	567	ENDDO
	568	!$OMP END PARALLEL
	569
	570	CALL cpu_log( log_point_s(7), 'fft_y_m', 'pause' )
	571
	572	#if defined( __parallel )
	573	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
	574	CALL MPI_ALLTOALL( work1(nxl_p,1,nys_p,1), nwords, MPI_REAL, &
	575	work2(nxl_p,1,nys_p,1), nwords, MPI_REAL, &
	576	comm2d, istat )
	577	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
	578	#else
	579	work2 = work1
	580	#endif
	581
	582	CALL cpu_log( log_point_s(33), 'fft_x_m + tridia', 'start' )
	583
	584	#if defined( __KKMP )
	585	!$OMP PARALLEL PRIVATE (i,j,jj,k,m,n,tri_ar,jthread)
	586	!$OMP DO
	587	DO j = nys_p, nyn_p
	588	jthread = omp_get_thread_num() + 1
	589	#else
	590	DO j = nys_p, nyn_p
	591	jthread = 1
	592	#endif
	593	DO k = 1, nz
	594
	595	m = nxl_a
	596	DO n = 1, npe_s
	597	DO i = nxl_p, nxr_p
	598	fftx_ar3(m,k,j) = work2(i,k,j,n)
	599	m = m+1
	600	ENDDO
	601	ENDDO
	602	ENDDO
	603
	604	CALL fft_x_m( fftx_ar3(:,:,j), 'forward' )
	605
	606	DO k = 1, nz
	607	DO i = nxl_a, nxr_a
	608	tri_ar(i,k) = fftx_ar3(i,k,j)
	609	ENDDO
	610	ENDDO
	611
	612	jj = myid * (nyn_p-nys_p+1) + j
	613	CALL tridia_hybrid( jj, tri_ar, tri(:,:,:,jthread))
	614
	615	DO k = 1, nz
	616	DO i = nxl_a, nxr_a
	617	fftx_ar3(i,k,j) = tri_ar (i,k)
	618	ENDDO
	619	ENDDO
	620
	621	CALL fft_x_m( fftx_ar3(:,:,j), 'backward' )
	622
	623	DO k = 1, nz
	624	m = nxl_a
	625	DO n = 1, npe_s
	626	DO i = nxl_p, nxr_p
	627	work2(i,k,j,n) = fftx_ar3(m,k,j)
	628	m = m+1
	629	ENDDO
	630	ENDDO
	631	ENDDO
	632
	633	ENDDO
	634	#if defined( __KKMP )
	635	!$OMP END PARALLEL
	636	#endif
	637
	638	CALL cpu_log( log_point_s(33), 'fft_x_m + tridia', 'stop' )
	639
	640	#if defined( __parallel )
	641	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
	642	nwords = (nxr_p-nxl_p+1) * nz * (nyn_p-nys_p+1)
	643	CALL MPI_ALLTOALL( work2(nxl_p,1,nys_p,1), nwords, MPI_REAL, &
	644	work1(nxl_p,1,nys_p,1), nwords, MPI_REAL, &
	645	comm2d, istat )
	646	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
	647	#else
	648	work1 = work2
	649	#endif
	650
	651	CALL cpu_log( log_point_s(7), 'fft_y_m', 'continue' )
	652
	653	!$OMP PARALLEL PRIVATE (i,iouter,ii,ir,iei,j,k,m,n)
	654	!$OMP DO
	655	DO k = 1, nz
	656	m = nys_a
	657	DO n = 1, npe_s
	658	DO j = nys_p, nyn_p
	659	DO i = nxl_p, nxr_p
	660	ffty_ar3(m,k,i) = work1(i,k,j,n)
	661	ENDDO
	662	m = m+1
	663	ENDDO
	664	ENDDO
	665	ENDDO
	666
	667	!$OMP DO
	668	DO i = nxl_p, nxr_p
	669	CALL fft_y_m( ffty_ar3(:,:,i), ny+3, 'backward' )
	670	DO j = nys_a, nyn_a
	671	DO k = 1, nz
	672	ar(k,j,i+nxl) = ffty_ar3(j,k,i)
	673	ENDDO
	674	ENDDO
	675	ENDDO
	676	!$OMP END PARALLEL
	677
	678	CALL cpu_log( log_point_s(7), 'fft_y_m', 'stop' )
	679
	680	CALL cpu_log( log_point_s(30), 'poisfft_hybrid_vec', 'stop' )
	681
	682	#if defined( __KKMP )
	683	DEALLOCATE( tri )
	684	#endif
	685
	686	END SUBROUTINE poisfft_hybrid_omp_vec
	687
	688
	689	SUBROUTINE poisfft_hybrid_nodes ( ar )
	690
	691	USE cpulog
	692	USE interfaces
	693
	694	IMPLICIT NONE
	695
	696	INTEGER, PARAMETER :: istride = 4 ! stride of i loop
	697	INTEGER :: i, iei, ii, iouter, ir, istat, j, jj, k, m, &
	698	n, nn, nt, nw1, nw2
	699
	700	REAL, DIMENSION(1:nz,nys:nyn,nxl:nxr) :: ar
	701
	702	REAL, DIMENSION(0:nx) :: fftx_ar
	703	REAL, DIMENSION(0:ny,istride) :: ffty_ar
	704
	705	REAL, DIMENSION(0:nx,nz) :: tri_ar
	706
	707	REAL, DIMENSION(nxl_p:nxr_p,nz,tasks_per_logical_node, &
	708	nodes,nys_p:nyn_p) :: work1,work2
	709	REAL, DIMENSION(5,0:nx,0:nz-1) :: tri
	710
	711
	712	CALL cpu_log( log_point_s(30), 'poisfft_hybrid_nodes', 'start' )
	713
[1106]	714	CALL cpu_log( log_point_s(7), 'fft_y_1d', 'start' )
[1]	715
	716	!
	717	!-- Store grid points to be transformed on a 1d-array, do the fft
	718	!-- and sample the results on a 4d-array
	719	DO iouter = nxl_p, nxr_p, istride ! stride loop, better cache
	720	iei = MIN( iouter+istride-1, nxr_p )
	721	DO k = 1, nz
	722
	723	DO i = iouter, iei
	724	ii = nxl + i
	725	ir = i - iouter + 1
	726
	727	DO j = nys_a, nyn_a
	728	ffty_ar(j,ir) = ar(k,j,ii)
	729	ENDDO
	730
[1106]	731	CALL fft_y_1d( ffty_ar(:,ir), 'forward' )
[1]	732	ENDDO
	733
	734	m = nys_a
	735	DO nn = 1, nodes
	736	DO nt = 1, tasks_per_logical_node
	737	DO j = nys_p, nyn_p
	738	DO i = iouter, iei
	739	ir = i - iouter + 1
	740	work1(i,k,nt,nn,j) = ffty_ar(m,ir)
	741	ENDDO
	742	m = m+1
	743	ENDDO
	744	ENDDO
	745	ENDDO
	746
	747	ENDDO
	748	ENDDO
	749
[1106]	750	CALL cpu_log( log_point_s(7), 'fft_y_1d', 'pause' )
[1]	751
	752	CALL cpu_log( log_point_s(32), 'alltoall_task', 'start' )
	753	nw1 = SIZE( work1, 1 ) * SIZE( work1, 2 )
	754	DO nn = 1, nodes
	755	DO j = nys_p, nyn_p
	756	#if defined( __parallel )
	757	CALL MPI_ALLTOALL( work1(nxl_p,1,1,nn,j), nw1, MPI_REAL, &
	758	work2(nxl_p,1,1,nn,j), nw1, MPI_REAL, &
	759	comm_tasks, istat )
	760	#endif
	761	ENDDO
	762	ENDDO
	763	CALL cpu_log( log_point_s(32), 'alltoall_task', 'stop' )
	764
	765
	766	DO j = nys_p, nyn_p
	767
	768	CALL cascade( 1, j, nys_p, nyn_p )
	769	nw2 = nw1 * SIZE( work1, 3 )
	770	CALL cpu_log( log_point_s(37), 'alltoall_node', 'start' )
	771	#if defined( __parallel )
	772	CALL MPI_ALLTOALL( work2(nxl_p,1,1,1,j), nw2, MPI_REAL, &
	773	work1(nxl_p,1,1,1,j), nw2, MPI_REAL, &
	774	comm_nodes, istat )
	775	#endif
	776	CALL cpu_log( log_point_s(37), 'alltoall_node', 'pause' )
	777	CALL cascade( 2, j, nys_p, nyn_p )
	778
[1106]	779	CALL cpu_log( log_point_s(33), 'fft_x_1d + tridia', 'start' )
[1]	780	DO k = 1, nz
	781
	782	m = nxl_a
	783	DO nn = 1, nodes
	784	DO nt = 1, tasks_per_logical_node
	785	DO i = nxl_p, nxr_p
	786	fftx_ar(m) = work1(i,k,nt,nn,j)
	787	m = m+1
	788	ENDDO
	789	ENDDO
	790	ENDDO
	791
[1106]	792	CALL fft_x_1d( fftx_ar, 'forward' )
[1]	793
	794	DO i = nxl_a, nxr_a
	795	tri_ar(i,k) = fftx_ar(i)
	796	ENDDO
	797
	798	ENDDO
	799
	800	jj = myid * (nyn_p-nys_p+1) + j
	801	CALL tridia_hybrid( jj, tri_ar, tri(:,:,:) )
	802
	803	DO k = 1, nz
	804	DO i = nxl_a, nxr_a
	805	fftx_ar(i) = tri_ar(i,k)
	806	ENDDO
	807
[1106]	808	CALL fft_x_1d( fftx_ar, 'backward' )
[1]	809
	810	m = nxl_a
	811	DO nn = 1, nodes
	812	DO nt = 1, tasks_per_logical_node
	813	DO i = nxl_p, nxr_p
	814	work1(i,k,nt,nn,j) = fftx_ar(m)
	815	m = m+1
	816	ENDDO
	817	ENDDO
	818	ENDDO
	819	ENDDO
	820
[1106]	821	CALL cpu_log( log_point_s(33), 'fft_x_1d + tridia', 'stop' )
[1]	822	nw2 = nw1 * SIZE( work1, 3 )
	823	CALL cpu_log( log_point_s(37), 'alltoall_node', 'continue' )
	824	#if defined( __parallel )
	825	CALL MPI_ALLTOALL( work1(nxl_p,1,1,1,j), nw2, MPI_REAL, &
	826	work2(nxl_p,1,1,1,j), nw2, MPI_REAL, &
	827	comm_nodes, istat )
	828	#endif
	829	CALL cpu_log( log_point_s(37), 'alltoall_node', 'stop' )
	830
	831	ENDDO
	832
	833	CALL cpu_log( log_point_s(32), 'alltoall_task', 'start' )
	834	DO nn = 1, nodes
	835	DO j = nys_p, nyn_p
	836	#if defined( __parallel )
	837	CALL MPI_ALLTOALL( work2(nxl_p,1,1,nn,j), nw1, MPI_REAL, &
	838	work1(nxl_p,1,1,nn,j), nw1, MPI_REAL, &
	839	comm_tasks, istat )
	840	#endif
	841	ENDDO
	842	ENDDO
	843	CALL cpu_log( log_point_s(32), 'alltoall_task', 'stop' )
	844
[1106]	845	CALL cpu_log( log_point_s(7), 'fft_y_1d', 'continue' )
[1]	846
	847	DO iouter = nxl_p, nxr_p, istride
	848	iei = MIN( iouter+istride-1, nxr_p )
	849	DO k = 1, nz
	850
	851	m = nys_a
	852	DO nn = 1, nodes
	853	DO nt = 1, tasks_per_logical_node
	854	DO j = nys_p, nyn_p
	855	DO i = iouter, iei
	856	ir = i - iouter + 1
	857	ffty_ar(m,ir) = work1(i,k,nt,nn,j)
	858	ENDDO
	859	m = m+1
	860	ENDDO
	861	ENDDO
	862	ENDDO
	863
	864	DO i = iouter, iei
	865	ii = nxl + i
	866	ir = i - iouter + 1
[1106]	867	CALL fft_y_1d( ffty_ar(:,ir), 'backward' )
[1]	868
	869	DO j = nys_a, nyn_a
	870	ar(k,j,ii) = ffty_ar(j,ir)
	871	ENDDO
	872	ENDDO
	873
	874	ENDDO
	875	ENDDO
	876
[1106]	877	CALL cpu_log( log_point_s(7), 'fft_y_1d', 'stop' )
[1]	878
	879	CALL cpu_log( log_point_s(30), 'poisfft_hybrid_nodes', 'stop' )
	880
	881	END SUBROUTINE poisfft_hybrid_nodes
	882
	883
	884
	885	SUBROUTINE tridia_hybrid( j, ar, tri )
	886
	887	USE arrays_3d
	888	USE control_parameters
	889	USE grid_variables
	890
	891	IMPLICIT NONE
	892
	893	INTEGER :: i, j, k, nnyh
	894	REAL, DIMENSION(0:nx,nz) :: ar
	895	REAL, DIMENSION(0:nx,0:nz-1) :: ar1
	896	REAL, DIMENSION(5,0:nx,0:nz-1) :: tri
	897
	898	nnyh = (ny+1) / 2
	899
	900	tri = 0.0
	901	!
	902	!-- Define constant elements of the tridiagonal matrix.
	903	DO k = 0, nz-1
	904	DO i = 0,nx
[667]	905	tri(2,i,k) = ddzu_pres(k+1) * ddzw(k+1)
	906	tri(3,i,k) = ddzu_pres(k+2) * ddzw(k+1)
[1]	907	ENDDO
	908	ENDDO
	909
	910	IF ( j <= nnyh ) THEN
	911	CALL maketri_hybrid( j )
	912	ELSE
	913	CALL maketri_hybrid( ny+1-j)
	914	ENDIF
	915	CALL zerleg_hybrid
	916	CALL substi_hybrid( ar, tri )
	917
	918	CONTAINS
	919
	920	SUBROUTINE maketri_hybrid( j )
	921
	922	!----------------------------------------------------------------------!
	923	! maketri !
	924	! !
	925	! computes the i- and j-dependent component of the matrix !
	926	!----------------------------------------------------------------------!
	927
	928	USE constants
	929
	930	IMPLICIT NONE
	931
	932	INTEGER :: i, j, k, nnxh
	933	REAL :: a, c
	934
	935	REAL, DIMENSION(0:nx) :: l
	936
	937
	938	nnxh = (nx+1) / 2
	939	!
	940	!-- Provide the tridiagonal matrix for solution of the Poisson equation
	941	!-- in Fourier space. The coefficients are computed following the method
	942	!-- of Schmidt et al. (DFVLR-Mitteilung 84-15) --> departs from Stephan
	943	!-- Siano's original version.
	944	DO i = 0,nx
	945	IF ( i >= 0 .AND. i < nnxh ) THEN
	946	l(i) = 2.0 * ( 1.0 - COS( ( 2.0 * pi * i ) / &
[1013]	947	REAL( nx+1 ) ) ) / ( dx * dx ) + &
[1]	948	2.0 * ( 1.0 - COS( ( 2.0 * pi * j ) / &
[1013]	949	REAL( ny+1 ) ) ) / ( dy * dy )
[1]	950	ELSEIF ( i == nnxh ) THEN
	951	l(i) = 2.0 * ( 1.0 - COS( ( 2.0 * pi * ( nx+1-i ) ) / &
[1013]	952	REAL( nx+1 ) ) ) / ( dx * dx ) + &
[1]	953	2.0 * ( 1.0 - COS( ( 2.0 * pi * j ) / &
[1013]	954	REAL(ny+1) ) ) / ( dy * dy )
[1]	955	ELSE
	956	l(i) = 2.0 * ( 1.0 - COS( ( 2.0 * pi * ( nx+1-i ) ) / &
[1013]	957	REAL( nx+1 ) ) ) / ( dx * dx ) + &
[1]	958	2.0 * ( 1.0 - COS( ( 2.0 * pi * j ) / &
[1013]	959	REAL( ny+1 ) ) ) / ( dy * dy )
[1]	960	ENDIF
	961	ENDDO
	962
	963	DO k = 0,nz-1
	964	DO i = 0, nx
[667]	965	a = -1.0 * ddzu_pres(k+2) * ddzw(k+1)
	966	c = -1.0 * ddzu_pres(k+1) * ddzw(k+1)
[1]	967	tri(1,i,k) = a + c - l(i)
	968	ENDDO
	969	ENDDO
[1111]	970	IF ( ibc_p_b == 1 ) THEN
[1]	971	DO i = 0,nx
	972	tri(1,i,0) = tri(1,i,0) + tri(2,i,0)
	973	ENDDO
	974	ENDIF
	975	IF ( ibc_p_t == 1 ) THEN
	976	DO i = 0,nx
	977	tri(1,i,nz-1) = tri(1,i,nz-1) + tri(3,i,nz-1)
	978	ENDDO
	979	ENDIF
	980
	981	END SUBROUTINE maketri_hybrid
	982
	983
	984	SUBROUTINE zerleg_hybrid
	985
	986	!----------------------------------------------------------------------!
	987	! zerleg !
	988	! !
	989	! Splitting of the tridiagonal matrix (Thomas algorithm) !
	990	!----------------------------------------------------------------------!
	991
	992	USE indices
	993
	994	IMPLICIT NONE
	995
	996	INTEGER :: i, k
	997
	998	!
	999	!-- Splitting
	1000	DO i = 0, nx
	1001	tri(4,i,0) = tri(1,i,0)
	1002	ENDDO
	1003	DO k = 1, nz-1
	1004	DO i = 0,nx
	1005	tri(5,i,k) = tri(2,i,k) / tri(4,i,k-1)
	1006	tri(4,i,k) = tri(1,i,k) - tri(3,i,k-1) * tri(5,i,k)
	1007	ENDDO
	1008	ENDDO
	1009
	1010	END SUBROUTINE zerleg_hybrid
	1011
	1012	SUBROUTINE substi_hybrid( ar, tri )
	1013
	1014	!----------------------------------------------------------------------!
	1015	! substi !
	1016	! !
	1017	! Substitution (Forward and Backward) (Thomas algorithm) !
	1018	!----------------------------------------------------------------------!
	1019
	1020	IMPLICIT NONE
	1021
	1022	INTEGER :: i, j, k
	1023	REAL, DIMENSION(0:nx,nz) :: ar
	1024	REAL, DIMENSION(0:nx,0:nz-1) :: ar1
	1025	REAL, DIMENSION(5,0:nx,0:nz-1) :: tri
	1026
	1027	!
	1028	!-- Forward substitution
	1029	DO i = 0, nx
	1030	ar1(i,0) = ar(i,1)
	1031	ENDDO
	1032	DO k = 1, nz - 1
	1033	DO i = 0,nx
	1034	ar1(i,k) = ar(i,k+1) - tri(5,i,k) * ar1(i,k-1)
	1035	ENDDO
	1036	ENDDO
	1037
	1038	!
	1039	!-- Backward substitution
	1040	DO i = 0,nx
	1041	ar(i,nz) = ar1(i,nz-1) / tri(4,i,nz-1)
	1042	ENDDO
	1043	DO k = nz-2, 0, -1
	1044	DO i = 0,nx
	1045	ar(i,k+1) = ( ar1(i,k) - tri(3,i,k) * ar(i,k+2) ) &
	1046	/ tri(4,i,k)
	1047	ENDDO
	1048	ENDDO
	1049
	1050	END SUBROUTINE substi_hybrid
	1051
	1052	END SUBROUTINE tridia_hybrid
	1053
	1054
	1055	SUBROUTINE cascade( loca, j, nys_p, nyn_p )
	1056
	1057	USE cpulog
	1058
	1059	IMPLICIT NONE
	1060
	1061	INTEGER :: ier, j, loca, nyn_p, nys_p, req, reqa(1)
	1062	INTEGER, SAVE :: tag = 10
	1063	#if defined( __parallel )
[415]	1064	INTEGER, DIMENSION(MPI_STATUS_SIZE) :: stat
	1065	INTEGER, DIMENSION(MPI_STATUS_SIZE,1) :: stata
[1]	1066	#endif
	1067
	1068	REAL :: buf, buf1
	1069
	1070
	1071	buf = 1.0
	1072	buf1 = 1.1
	1073	IF ( me_node == 0 ) THEN ! first node only
	1074
	1075	SELECT CASE ( loca )
	1076
	1077	CASE ( 1 ) ! before alltoall
	1078
	1079	IF( me_task > 0 ) THEN ! first task does not wait
	1080	#if defined( __parallel )
	1081	CALL MPI_SENDRECV( buf, 1, MPI_REAL, me_task-1, 0, &
	1082	buf1, 1, MPI_REAL, me_task-1, 0, &
[415]	1083	comm_tasks, stat, ierr )
[1]	1084	#endif
	1085	ELSEIF ( j > nys_p ) THEN
	1086	req = 0
	1087	tag = MOD( tag-10, 10 ) + 10
	1088	#if defined( __parallel )
	1089	CALL MPI_IRECV( buf, 1, MPI_REAL, tasks_per_logical_node-1,&
	1090	tag, comm_tasks, req, ierr )
	1091	reqa = req
[415]	1092	CALL MPI_WAITALL( 1, reqa, stata, ierr )
[1]	1093	#endif
	1094	ENDIF
	1095
	1096	CASE ( 2 ) ! after alltoall
	1097
	1098	IF ( me_task < tasks_per_logical_node-1 ) THEN ! last task
	1099	#if defined( __parallel )
	1100	CALL MPI_SENDRECV( buf, 1, MPI_REAL, me_task+1, 0, &
	1101	buf1, 1, MPI_REAL, me_task+1, 0, &
	1102	comm_tasks, stat, ierr)
	1103	#endif
	1104	ELSEIF ( j < nyn_p ) THEN
	1105	req = 0
	1106	tag = MOD( tag-10, 10 ) + 10
	1107	#if defined( __parallel )
	1108	CALL MPI_ISEND( buf, 1, MPI_REAL, 0, tag, comm_tasks, req, &
	1109	ierr )
	1110	#endif
	1111	ENDIF
	1112
	1113	END SELECT
	1114
	1115	ENDIF
	1116
	1117	END SUBROUTINE cascade
[807]	1118	#endif
[1]	1119	END MODULE poisfft_hybrid_mod

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: palm/trunk/SOURCE/poisfft_hybrid.f90 @ 1111

Download in other formats: