Home

Context Navigation

poisfft.f90 @ 1111

Last change on this file since 1111 was 1111, checked in by raasch, 11 years ago

New:
---

GPU porting of pres, swap_timelevel. Adjustments of openACC directives.
Further porting of poisfft, which now runs completely on GPU without any
host/device data transfer for serial an parallel runs (but parallel runs
require data transfer before and after the MPI transpositions).
GPU-porting of tridiagonal solver:
tridiagonal routines split into extermal subroutines (instead using CONTAINS),
no distinction between parallel/non-parallel in poisfft and tridia any more,
tridia routines moved to end of file because of probable bug in PGI compiler
(otherwise "invalid device function" is indicated during runtime).
(cuda_fft_interfaces, fft_xy, flow_statistics, init_3d_model, palm, poisfft, pres, prognostic_equations, swap_timelevel, time_integration, transpose)
output of accelerator board information. (header)

optimization of tridia routines: constant elements and coefficients of tri are
stored in seperate arrays ddzuw and tric, last dimension of tri reduced from 5 to 2,
(init_grid, init_3d_model, modules, palm, poisfft)

poisfft_init is now called internally from poisfft,
(Makefile, Makefile_check, init_pegrid, poisfft, poisfft_hybrid)

CPU-time per grid point and timestep is output to CPU_MEASURES file
(cpu_statistics, modules, time_integration)

Changed:

resorting from/to array work changed, work now has 4 dimensions instead of 1 (transpose)
array diss allocated only if required (init_3d_model)

pressure boundary condition "Neumann+inhomo" removed from the code
(check_parameters, header, poisfft, poisfft_hybrid, pres)

Errors:

bugfix: dependency added for cuda_fft_interfaces (Makefile)
bugfix: CUDA fft plans adjusted for domain decomposition (before they always
used total domain) (fft_xy)

Property svn:keywords set to Id

File size: 45.7 KB

Line
1	MODULE poisfft_mod
2
3	!--------------------------------------------------------------------------------!
4	! This file is part of PALM.
5	!
6	! PALM is free software: you can redistribute it and/or modify it under the terms
7	! of the GNU General Public License as published by the Free Software Foundation,
8	! either version 3 of the License, or (at your option) any later version.
9	!
10	! PALM is distributed in the hope that it will be useful, but WITHOUT ANY
11	! WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
12	! A PARTICULAR PURPOSE. See the GNU General Public License for more details.
13	!
14	! You should have received a copy of the GNU General Public License along with
15	! PALM. If not, see <http://www.gnu.org/licenses/>.
16	!
17	! Copyright 1997-2012 Leibniz University Hannover
18	!--------------------------------------------------------------------------------!
19	!
20	! Current revisions:
21	! -----------------
22	! further openACC porting of non-parallel (MPI) branch:
23	! tridiagonal routines split into extermal subroutines (instead using CONTAINS),
24	! no distinction between parallel/non-parallel in poisfft and tridia any more,
25	! tridia routines moved to end of file because of probable bug in PGI compiler
26	! (otherwise "invalid device function" is indicated during runtime),
27	! optimization of tridia routines: constant elements and coefficients of tri are
28	! stored in seperate arrays ddzuw and tric, last dimension of tri reduced from 5
29	! to 2,
30	! poisfft_init is now called internally from poisfft, maketri is called from
31	! poisfft_init,
32	! ibc_p_b = 2 removed
33	!
34	! Former revisions:
35	! -----------------
36	! $Id: poisfft.f90 1111 2013-03-08 23:54:10Z raasch $
37	!
38	! 1106 2013-03-04 05:31:38Z raasch
39	! routines fftx, ffty, fftxp, fftyp removed, calls replaced by fft_x, fft_y,
40	! in the 1D-decomposition routines fft_x, ffty are replaced by fft_x_1d,
41	! fft_y_1d
42	!
43	! 1103 2013-02-20 02:15:53Z raasch
44	! tri, ar, and ar1 arguments in tridia-routines (2d) are removed because they
45	! sometimes cause segmentation faults with intel 12.1 compiler
46	!
47	! 1092 2013-02-02 11:24:22Z raasch
48	! unused variables removed
49	!
50	! 1036 2012-10-22 13:43:42Z raasch
51	! code put under GPL (PALM 3.9)
52	!
53	! 2012-09-21 07:03:55Z raasch
54	! FLOAT type conversion replaced by REAL
55	!
56	! 1003 2012-09-14 14:35:53Z raasch
57	! indices nxa, nya, etc. replaced by nx, ny, etc.
58	!
59	! 940 2012-07-09 14:31:00Z raasch
60	! special handling of tri-array as an argument in tridia_1dd routines switched
61	! off because it caused segmentation faults with intel 12.1 compiler
62	!
63	! 877 2012-04-03 11:21:44Z suehring
64	! Bugfix: Avoid divisions by zero in case of using a 'neumann' bc for the
65	! pressure at the top of the model domain.
66	!
67	! 809 2012-01-30 13:32:58Z maronga
68	! Bugfix: replaced .AND. and .NOT. with && and ! in the preprocessor directives
69	!
70	! 807 2012-01-25 11:53:51Z maronga
71	! New cpp directive "__check" implemented which is used by check_namelist_files
72	! (most of the code is unneeded by check_namelist_files).
73	!
74	! 763 2011-10-06 09:32:09Z suehring
75	! Comment added concerning the last change.
76	!
77	! 761 2011-10-05 17:58:52Z suehring
78	! Bugfix: Avoid divisions by zero in case of using a 'neumann' bc for the
79	! pressure at the top of the model domain.
80	!
81	! 696 2011-03-18 07:03:49Z raasch
82	! work_fftx removed from PRIVATE clauses in fftx_tr_xy and tr_yx_fftx
83	!
84	! 683 2011-02-09 14:25:15Z raasch
85	! openMP parallelization for 2d-domain-decomposition
86	!
87	! 667 2010-12-23 12:06:00Z suehring/gryschka
88	! ddzu replaced by ddzu_pres due to changes in zu(0)
89	!
90	! 622 2010-12-10 08:08:13Z raasch
91	! optional barriers included in order to speed up collective operations
92	!
93	! 377 2009-09-04 11:09:00Z raasch
94	! __lcmuk changed to __lc to avoid problems with Intel compiler on sgi-ice
95	!
96	! 164 2008-05-15 08:46:15Z raasch
97	! Arguments removed from transpose routines
98	!
99	! 128 2007-10-26 13:11:14Z raasch
100	! Bugfix: wavenumber calculation for even nx in routines maketri
101	!
102	! 85 2007-05-11 09:35:14Z raasch
103	! Bugfix: work_fft*_vec removed from some PRIVATE-declarations
104	!
105	! 76 2007-03-29 00:58:32Z raasch
106	! Tridiagonal coefficients adjusted for Neumann boundary conditions both at
107	! the bottom and the top.
108	!
109	! RCS Log replace by Id keyword, revision history cleaned up
110	!
111	! Revision 1.24 2006/08/04 15:00:24 raasch
112	! Default setting of the thread number tn in case of not using OpenMP
113	!
114	! Revision 1.23 2006/02/23 12:48:38 raasch
115	! Additional compiler directive in routine tridia_1dd for preventing loop
116	! exchange on NEC-SX6
117	!
118	! Revision 1.20 2004/04/30 12:38:09 raasch
119	! Parts of former poisfft_hybrid moved to this subroutine,
120	! former subroutine changed to a module, renaming of FFT-subroutines and
121	! -module, FFTs completely substituted by calls of fft_x and fft_y,
122	! NAG fft used in the non-parallel case completely removed, l in maketri
123	! is now a 1d-array, variables passed by modules instead of using parameter
124	! lists, enlarged transposition arrays introduced
125	!
126	! Revision 1.1 1997/07/24 11:24:14 raasch
127	! Initial revision
128	!
129	!
130	! Description:
131	! ------------
132	! See below.
133	!------------------------------------------------------------------------------!
134
135	!--------------------------------------------------------------------------!
136	! poisfft !
137	! !
138	! Original version: Stephan Siano (pois3d) !
139	! !
140	! Institute of Meteorology and Climatology, University of Hannover !
141	! Germany !
142	! !
143	! Version as of July 23,1996 !
144	! !
145	! !
146	! Version for parallel computers: Siegfried Raasch !
147	! !
148	! Version as of July 03,1997 !
149	! !
150	! Solves the Poisson equation with a 2D spectral method !
151	! d^2 p / dx^2 + d^2 p / dy^2 + d^2 p / dz^2 = s !
152	! !
153	! Input: !
154	! real ar contains in the (nnx,nny,nnz) elements, !
155	! starting from the element (1,nys,nxl), the !
156	! values for s !
157	! real work Temporary array !
158	! !
159	! Output: !
160	! real ar contains the solution for p !
161	!--------------------------------------------------------------------------!
162
163	USE fft_xy
164	USE indices
165	USE transpose_indices
166
167	IMPLICIT NONE
168
169	LOGICAL, SAVE :: poisfft_initialized = .FALSE.
170
171	REAL, DIMENSION(:,:), ALLOCATABLE :: ddzuw
172
173	PRIVATE
174
175	#if ! defined ( __check )
176	PUBLIC poisfft, poisfft_init
177
178	INTERFACE poisfft
179	MODULE PROCEDURE poisfft
180	END INTERFACE poisfft
181
182	INTERFACE poisfft_init
183	MODULE PROCEDURE poisfft_init
184	END INTERFACE poisfft_init
185	#else
186	PUBLIC poisfft_init
187
188	INTERFACE poisfft_init
189	MODULE PROCEDURE poisfft_init
190	END INTERFACE poisfft_init
191	#endif
192
193	CONTAINS
194
195	SUBROUTINE poisfft_init
196
197	USE arrays_3d, ONLY: ddzu_pres, ddzw
198
199	IMPLICIT NONE
200
201	INTEGER :: k
202
203
204	CALL fft_init
205
206	ALLOCATE( ddzuw(0:nz-1,3) )
207
208	DO k = 0, nz-1
209	ddzuw(k,1) = ddzu_pres(k+1) * ddzw(k+1)
210	ddzuw(k,2) = ddzu_pres(k+2) * ddzw(k+1)
211	ddzuw(k,3) = -1.0 * &
212	( ddzu_pres(k+2) * ddzw(k+1) + ddzu_pres(k+1) * ddzw(k+1) )
213	ENDDO
214	!
215	!-- Calculate constant coefficients of the tridiagonal matrix
216	#if ! defined ( __check )
217	CALL maketri
218	#endif
219
220	poisfft_initialized = .TRUE.
221
222	END SUBROUTINE poisfft_init
223
224
225	#if ! defined ( __check )
226	SUBROUTINE poisfft( ar, work )
227
228	USE cpulog
229	USE interfaces
230	USE pegrid
231
232	IMPLICIT NONE
233
234	REAL, DIMENSION(1:nz,nys:nyn,nxl:nxr) :: ar, work
235
236
237	CALL cpu_log( log_point_s(3), 'poisfft', 'start' )
238
239	IF ( .NOT. poisfft_initialized ) CALL poisfft_init
240
241	!
242	!-- Two-dimensional Fourier Transformation in x- and y-direction.
243	IF ( pdims(2) == 1 .AND. pdims(1) > 1 ) THEN
244
245	!
246	!-- 1d-domain-decomposition along x:
247	!-- FFT along y and transposition y --> x
248	CALL ffty_tr_yx( ar, work, ar )
249
250	!
251	!-- FFT along x, solving the tridiagonal system and backward FFT
252	CALL fftx_tri_fftx( ar )
253
254	!
255	!-- Transposition x --> y and backward FFT along y
256	CALL tr_xy_ffty( ar, work, ar )
257
258	ELSEIF ( pdims(1) == 1 .AND. pdims(2) > 1 ) THEN
259
260	!
261	!-- 1d-domain-decomposition along y:
262	!-- FFT along x and transposition x --> y
263	CALL fftx_tr_xy( ar, work, ar )
264
265	!
266	!-- FFT along y, solving the tridiagonal system and backward FFT
267	CALL ffty_tri_ffty( ar )
268
269	!
270	!-- Transposition y --> x and backward FFT along x
271	CALL tr_yx_fftx( ar, work, ar )
272
273	ELSE
274
275	!
276	!-- 2d-domain-decomposition or no decomposition (1 PE run)
277	!-- Transposition z --> x
278	CALL cpu_log( log_point_s(5), 'transpo forward', 'start' )
279	CALL transpose_zx( ar, work, ar )
280	CALL cpu_log( log_point_s(5), 'transpo forward', 'pause' )
281
282	CALL cpu_log( log_point_s(4), 'fft_x', 'start' )
283	CALL fft_x( ar, 'forward' )
284	CALL cpu_log( log_point_s(4), 'fft_x', 'pause' )
285
286	!
287	!-- Transposition x --> y
288	CALL cpu_log( log_point_s(5), 'transpo forward', 'continue' )
289	CALL transpose_xy( ar, work, ar )
290	CALL cpu_log( log_point_s(5), 'transpo forward', 'pause' )
291
292	CALL cpu_log( log_point_s(7), 'fft_y', 'start' )
293	CALL fft_y( ar, 'forward' )
294	CALL cpu_log( log_point_s(7), 'fft_y', 'pause' )
295
296	!
297	!-- Transposition y --> z
298	CALL cpu_log( log_point_s(5), 'transpo forward', 'continue' )
299	CALL transpose_yz( ar, work, ar )
300	CALL cpu_log( log_point_s(5), 'transpo forward', 'stop' )
301
302	!
303	!-- Solve the tridiagonal equation system along z
304	CALL cpu_log( log_point_s(6), 'tridia', 'start' )
305	CALL tridia( ar )
306	CALL cpu_log( log_point_s(6), 'tridia', 'stop' )
307
308	!
309	!-- Inverse Fourier Transformation
310	!-- Transposition z --> y
311	CALL cpu_log( log_point_s(8), 'transpo invers', 'start' )
312	CALL transpose_zy( ar, work, ar )
313	CALL cpu_log( log_point_s(8), 'transpo invers', 'pause' )
314
315	CALL cpu_log( log_point_s(7), 'fft_y', 'continue' )
316	CALL fft_y( ar, 'backward' )
317	CALL cpu_log( log_point_s(7), 'fft_y', 'stop' )
318
319	!
320	!-- Transposition y --> x
321	CALL cpu_log( log_point_s(8), 'transpo invers', 'continue' )
322	CALL transpose_yx( ar, work, ar )
323	CALL cpu_log( log_point_s(8), 'transpo invers', 'pause' )
324
325	CALL cpu_log( log_point_s(4), 'fft_x', 'continue' )
326	CALL fft_x( ar, 'backward' )
327	CALL cpu_log( log_point_s(4), 'fft_x', 'stop' )
328
329	!
330	!-- Transposition x --> z
331	CALL cpu_log( log_point_s(8), 'transpo invers', 'continue' )
332	CALL transpose_xz( ar, work, ar )
333	CALL cpu_log( log_point_s(8), 'transpo invers', 'stop' )
334
335	ENDIF
336
337	CALL cpu_log( log_point_s(3), 'poisfft', 'stop' )
338
339	END SUBROUTINE poisfft
340
341
342
343	SUBROUTINE ffty_tr_yx( f_in, work, f_out )
344
345	!------------------------------------------------------------------------------!
346	! Fourier-transformation along y with subsequent transposition y --> x for
347	! a 1d-decomposition along x
348	!
349	! ATTENTION: The performance of this routine is much faster on the NEC-SX6,
350	! if the first index of work_ffty_vec is odd. Otherwise
351	! memory bank conflicts may occur (especially if the index is a
352	! multiple of 128). That's why work_ffty_vec is dimensioned as
353	! 0:ny+1.
354	! Of course, this will not work if users are using an odd number
355	! of gridpoints along y.
356	!------------------------------------------------------------------------------!
357
358	USE control_parameters
359	USE cpulog
360	USE indices
361	USE interfaces
362	USE pegrid
363	USE transpose_indices
364
365	IMPLICIT NONE
366
367	INTEGER :: i, iend, iouter, ir, j, k
368	INTEGER, PARAMETER :: stridex = 4
369
370	REAL, DIMENSION(0:ny,stridex) :: work_ffty
371	#if defined( __nec )
372	REAL, DIMENSION(0:ny+1,1:nz,nxl:nxr) :: work_ffty_vec
373	#endif
374	REAL, DIMENSION(1:nz,0:ny,nxl:nxr) :: f_in
375	REAL, DIMENSION(nnx,1:nz,nys_x:nyn_x,pdims(1)) :: f_out
376	REAL, DIMENSION(nxl:nxr,1:nz,0:ny) :: work
377
378	!
379	!-- Carry out the FFT along y, where all data are present due to the
380	!-- 1d-decomposition along x. Resort the data in a way that x becomes
381	!-- the first index.
382	CALL cpu_log( log_point_s(7), 'fft_y_1d', 'start' )
383
384	IF ( host(1:3) == 'nec' ) THEN
385	#if defined( __nec )
386	!
387	!-- Code optimized for vector processors
388	!$OMP PARALLEL PRIVATE ( i, j, k )
389	!$OMP DO
390	DO i = nxl, nxr
391
392	DO j = 0, ny
393	DO k = 1, nz
394	work_ffty_vec(j,k,i) = f_in(k,j,i)
395	ENDDO
396	ENDDO
397
398	CALL fft_y_m( work_ffty_vec(:,:,i), ny+1, 'forward' )
399
400	ENDDO
401
402	!$OMP DO
403	DO k = 1, nz
404	DO j = 0, ny
405	DO i = nxl, nxr
406	work(i,k,j) = work_ffty_vec(j,k,i)
407	ENDDO
408	ENDDO
409	ENDDO
410	!$OMP END PARALLEL
411	#endif
412
413	ELSE
414
415	!
416	!-- Cache optimized code.
417	!-- The i-(x-)direction is split into a strided outer loop and an inner
418	!-- loop for better cache performance
419	!$OMP PARALLEL PRIVATE (i,iend,iouter,ir,j,k,work_ffty)
420	!$OMP DO
421	DO iouter = nxl, nxr, stridex
422
423	iend = MIN( iouter+stridex-1, nxr ) ! Upper bound for inner i loop
424
425	DO k = 1, nz
426
427	DO i = iouter, iend
428
429	ir = i-iouter+1 ! counter within a stride
430	DO j = 0, ny
431	work_ffty(j,ir) = f_in(k,j,i)
432	ENDDO
433	!
434	!-- FFT along y
435	CALL fft_y_1d( work_ffty(:,ir), 'forward' )
436
437	ENDDO
438
439	!
440	!-- Resort
441	DO j = 0, ny
442	DO i = iouter, iend
443	work(i,k,j) = work_ffty(j,i-iouter+1)
444	ENDDO
445	ENDDO
446
447	ENDDO
448
449	ENDDO
450	!$OMP END PARALLEL
451
452	ENDIF
453	CALL cpu_log( log_point_s(7), 'fft_y_1d', 'pause' )
454
455	!
456	!-- Transpose array
457	#if defined( __parallel )
458	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
459	IF ( collective_wait ) CALL MPI_BARRIER( comm2d, ierr )
460	CALL MPI_ALLTOALL( work(nxl,1,0), sendrecvcount_xy, MPI_REAL, &
461	f_out(1,1,nys_x,1), sendrecvcount_xy, MPI_REAL, &
462	comm1dx, ierr )
463	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
464	#endif
465
466	END SUBROUTINE ffty_tr_yx
467
468
469	SUBROUTINE tr_xy_ffty( f_in, work, f_out )
470
471	!------------------------------------------------------------------------------!
472	! Transposition x --> y with a subsequent backward Fourier transformation for
473	! a 1d-decomposition along x
474	!------------------------------------------------------------------------------!
475
476	USE control_parameters
477	USE cpulog
478	USE indices
479	USE interfaces
480	USE pegrid
481	USE transpose_indices
482
483	IMPLICIT NONE
484
485	INTEGER :: i, iend, iouter, ir, j, k
486	INTEGER, PARAMETER :: stridex = 4
487
488	REAL, DIMENSION(0:ny,stridex) :: work_ffty
489	#if defined( __nec )
490	REAL, DIMENSION(0:ny+1,1:nz,nxl:nxr) :: work_ffty_vec
491	#endif
492	REAL, DIMENSION(nnx,1:nz,nys_x:nyn_x,pdims(1)) :: f_in
493	REAL, DIMENSION(1:nz,0:ny,nxl:nxr) :: f_out
494	REAL, DIMENSION(nxl:nxr,1:nz,0:ny) :: work
495
496	!
497	!-- Transpose array
498	#if defined( __parallel )
499	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
500	IF ( collective_wait ) CALL MPI_BARRIER( comm2d, ierr )
501	CALL MPI_ALLTOALL( f_in(1,1,nys_x,1), sendrecvcount_xy, MPI_REAL, &
502	work(nxl,1,0), sendrecvcount_xy, MPI_REAL, &
503	comm1dx, ierr )
504	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
505	#endif
506
507	!
508	!-- Resort the data in a way that y becomes the first index and carry out the
509	!-- backward fft along y.
510	CALL cpu_log( log_point_s(7), 'fft_y_1d', 'continue' )
511
512	IF ( host(1:3) == 'nec' ) THEN
513	#if defined( __nec )
514	!
515	!-- Code optimized for vector processors
516	!$OMP PARALLEL PRIVATE ( i, j, k )
517	!$OMP DO
518	DO k = 1, nz
519	DO j = 0, ny
520	DO i = nxl, nxr
521	work_ffty_vec(j,k,i) = work(i,k,j)
522	ENDDO
523	ENDDO
524	ENDDO
525
526	!$OMP DO
527	DO i = nxl, nxr
528
529	CALL fft_y_m( work_ffty_vec(:,:,i), ny+1, 'backward' )
530
531	DO j = 0, ny
532	DO k = 1, nz
533	f_out(k,j,i) = work_ffty_vec(j,k,i)
534	ENDDO
535	ENDDO
536
537	ENDDO
538	!$OMP END PARALLEL
539	#endif
540
541	ELSE
542
543	!
544	!-- Cache optimized code.
545	!-- The i-(x-)direction is split into a strided outer loop and an inner
546	!-- loop for better cache performance
547	!$OMP PARALLEL PRIVATE ( i, iend, iouter, ir, j, k, work_ffty )
548	!$OMP DO
549	DO iouter = nxl, nxr, stridex
550
551	iend = MIN( iouter+stridex-1, nxr ) ! Upper bound for inner i loop
552
553	DO k = 1, nz
554	!
555	!-- Resort
556	DO j = 0, ny
557	DO i = iouter, iend
558	work_ffty(j,i-iouter+1) = work(i,k,j)
559	ENDDO
560	ENDDO
561
562	DO i = iouter, iend
563
564	!
565	!-- FFT along y
566	ir = i-iouter+1 ! counter within a stride
567	CALL fft_y_1d( work_ffty(:,ir), 'backward' )
568
569	DO j = 0, ny
570	f_out(k,j,i) = work_ffty(j,ir)
571	ENDDO
572	ENDDO
573
574	ENDDO
575
576	ENDDO
577	!$OMP END PARALLEL
578
579	ENDIF
580
581	CALL cpu_log( log_point_s(7), 'fft_y_1d', 'stop' )
582
583	END SUBROUTINE tr_xy_ffty
584
585
586	SUBROUTINE fftx_tri_fftx( ar )
587
588	!------------------------------------------------------------------------------!
589	! FFT along x, solution of the tridiagonal system and backward FFT for
590	! a 1d-decomposition along x
591	!
592	! WARNING: this subroutine may still not work for hybrid parallelization
593	! with OpenMP (for possible necessary changes see the original
594	! routine poisfft_hybrid, developed by Klaus Ketelsen, May 2002)
595	!------------------------------------------------------------------------------!
596
597	USE control_parameters
598	USE cpulog
599	USE grid_variables
600	USE indices
601	USE interfaces
602	USE pegrid
603	USE transpose_indices
604
605	IMPLICIT NONE
606
607	INTEGER :: i, j, k, m, n, omp_get_thread_num, tn
608
609	REAL, DIMENSION(0:nx) :: work_fftx
610	REAL, DIMENSION(0:nx,1:nz) :: work_trix
611	REAL, DIMENSION(nnx,1:nz,nys_x:nyn_x,pdims(1)) :: ar
612	REAL, DIMENSION(:,:,:,:), ALLOCATABLE :: tri
613
614
615	CALL cpu_log( log_point_s(33), 'fft_x_1d + tridia', 'start' )
616
617	ALLOCATE( tri(5,0:nx,0:nz-1,0:threads_per_task-1) )
618
619	tn = 0 ! Default thread number in case of one thread
620	!$OMP PARALLEL DO PRIVATE ( i, j, k, m, n, tn, work_fftx, work_trix )
621	DO j = nys_x, nyn_x
622
623	!$ tn = omp_get_thread_num()
624
625	IF ( host(1:3) == 'nec' ) THEN
626	!
627	!-- Code optimized for vector processors
628	DO k = 1, nz
629
630	m = 0
631	DO n = 1, pdims(1)
632	DO i = 1, nnx
633	work_trix(m,k) = ar(i,k,j,n)
634	m = m + 1
635	ENDDO
636	ENDDO
637
638	ENDDO
639
640	CALL fft_x_m( work_trix, 'forward' )
641
642	ELSE
643	!
644	!-- Cache optimized code
645	DO k = 1, nz
646
647	m = 0
648	DO n = 1, pdims(1)
649	DO i = 1, nnx
650	work_fftx(m) = ar(i,k,j,n)
651	m = m + 1
652	ENDDO
653	ENDDO
654
655	CALL fft_x_1d( work_fftx, 'forward' )
656
657	DO i = 0, nx
658	work_trix(i,k) = work_fftx(i)
659	ENDDO
660
661	ENDDO
662
663	ENDIF
664
665	!
666	!-- Solve the linear equation system
667	CALL tridia_1dd( ddx2, ddy2, nx, ny, j, work_trix, tri(:,:,:,tn) )
668
669	IF ( host(1:3) == 'nec' ) THEN
670	!
671	!-- Code optimized for vector processors
672	CALL fft_x_m( work_trix, 'backward' )
673
674	DO k = 1, nz
675
676	m = 0
677	DO n = 1, pdims(1)
678	DO i = 1, nnx
679	ar(i,k,j,n) = work_trix(m,k)
680	m = m + 1
681	ENDDO
682	ENDDO
683
684	ENDDO
685
686	ELSE
687	!
688	!-- Cache optimized code
689	DO k = 1, nz
690
691	DO i = 0, nx
692	work_fftx(i) = work_trix(i,k)
693	ENDDO
694
695	CALL fft_x_1d( work_fftx, 'backward' )
696
697	m = 0
698	DO n = 1, pdims(1)
699	DO i = 1, nnx
700	ar(i,k,j,n) = work_fftx(m)
701	m = m + 1
702	ENDDO
703	ENDDO
704
705	ENDDO
706
707	ENDIF
708
709	ENDDO
710
711	DEALLOCATE( tri )
712
713	CALL cpu_log( log_point_s(33), 'fft_x_1d + tridia', 'stop' )
714
715	END SUBROUTINE fftx_tri_fftx
716
717
718	SUBROUTINE fftx_tr_xy( f_in, work, f_out )
719
720	!------------------------------------------------------------------------------!
721	! Fourier-transformation along x with subsequent transposition x --> y for
722	! a 1d-decomposition along y
723	!
724	! ATTENTION: The NEC-branch of this routine may significantly profit from
725	! further optimizations. So far, performance is much worse than
726	! for routine ffty_tr_yx (more than three times slower).
727	!------------------------------------------------------------------------------!
728
729	USE control_parameters
730	USE cpulog
731	USE indices
732	USE interfaces
733	USE pegrid
734	USE transpose_indices
735
736	IMPLICIT NONE
737
738	INTEGER :: i, j, k
739
740	REAL, DIMENSION(0:nx,1:nz,nys:nyn) :: work_fftx
741	REAL, DIMENSION(1:nz,nys:nyn,0:nx) :: f_in
742	REAL, DIMENSION(nny,1:nz,nxl_y:nxr_y,pdims(2)) :: f_out
743	REAL, DIMENSION(nys:nyn,1:nz,0:nx) :: work
744
745	!
746	!-- Carry out the FFT along x, where all data are present due to the
747	!-- 1d-decomposition along y. Resort the data in a way that y becomes
748	!-- the first index.
749	CALL cpu_log( log_point_s(4), 'fft_x_1d', 'start' )
750
751	IF ( host(1:3) == 'nec' ) THEN
752	!
753	!-- Code for vector processors
754	!$OMP PARALLEL PRIVATE ( i, j, k )
755	!$OMP DO
756	DO i = 0, nx
757
758	DO j = nys, nyn
759	DO k = 1, nz
760	work_fftx(i,k,j) = f_in(k,j,i)
761	ENDDO
762	ENDDO
763
764	ENDDO
765
766	!$OMP DO
767	DO j = nys, nyn
768
769	CALL fft_x_m( work_fftx(:,:,j), 'forward' )
770
771	DO k = 1, nz
772	DO i = 0, nx
773	work(j,k,i) = work_fftx(i,k,j)
774	ENDDO
775	ENDDO
776
777	ENDDO
778	!$OMP END PARALLEL
779
780	ELSE
781
782	!
783	!-- Cache optimized code (there might be still a potential for better
784	!-- optimization).
785	!$OMP PARALLEL PRIVATE (i,j,k)
786	!$OMP DO
787	DO i = 0, nx
788
789	DO j = nys, nyn
790	DO k = 1, nz
791	work_fftx(i,k,j) = f_in(k,j,i)
792	ENDDO
793	ENDDO
794
795	ENDDO
796
797	!$OMP DO
798	DO j = nys, nyn
799	DO k = 1, nz
800
801	CALL fft_x_1d( work_fftx(0:nx,k,j), 'forward' )
802
803	DO i = 0, nx
804	work(j,k,i) = work_fftx(i,k,j)
805	ENDDO
806	ENDDO
807
808	ENDDO
809	!$OMP END PARALLEL
810
811	ENDIF
812	CALL cpu_log( log_point_s(4), 'fft_x_1d', 'pause' )
813
814	!
815	!-- Transpose array
816	#if defined( __parallel )
817	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
818	IF ( collective_wait ) CALL MPI_BARRIER( comm2d, ierr )
819	CALL MPI_ALLTOALL( work(nys,1,0), sendrecvcount_xy, MPI_REAL, &
820	f_out(1,1,nxl_y,1), sendrecvcount_xy, MPI_REAL, &
821	comm1dy, ierr )
822	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
823	#endif
824
825	END SUBROUTINE fftx_tr_xy
826
827
828	SUBROUTINE tr_yx_fftx( f_in, work, f_out )
829
830	!------------------------------------------------------------------------------!
831	! Transposition y --> x with a subsequent backward Fourier transformation for
832	! a 1d-decomposition along x
833	!------------------------------------------------------------------------------!
834
835	USE control_parameters
836	USE cpulog
837	USE indices
838	USE interfaces
839	USE pegrid
840	USE transpose_indices
841
842	IMPLICIT NONE
843
844	INTEGER :: i, j, k
845
846	REAL, DIMENSION(0:nx,1:nz,nys:nyn) :: work_fftx
847	REAL, DIMENSION(nny,1:nz,nxl_y:nxr_y,pdims(2)) :: f_in
848	REAL, DIMENSION(1:nz,nys:nyn,0:nx) :: f_out
849	REAL, DIMENSION(nys:nyn,1:nz,0:nx) :: work
850
851	!
852	!-- Transpose array
853	#if defined( __parallel )
854	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'start' )
855	IF ( collective_wait ) CALL MPI_BARRIER( comm2d, ierr )
856	CALL MPI_ALLTOALL( f_in(1,1,nxl_y,1), sendrecvcount_xy, MPI_REAL, &
857	work(nys,1,0), sendrecvcount_xy, MPI_REAL, &
858	comm1dy, ierr )
859	CALL cpu_log( log_point_s(32), 'mpi_alltoall', 'stop' )
860	#endif
861
862	!
863	!-- Carry out the FFT along x, where all data are present due to the
864	!-- 1d-decomposition along y. Resort the data in a way that y becomes
865	!-- the first index.
866	CALL cpu_log( log_point_s(4), 'fft_x_1d', 'continue' )
867
868	IF ( host(1:3) == 'nec' ) THEN
869	!
870	!-- Code optimized for vector processors
871	!$OMP PARALLEL PRIVATE ( i, j, k )
872	!$OMP DO
873	DO j = nys, nyn
874
875	DO k = 1, nz
876	DO i = 0, nx
877	work_fftx(i,k,j) = work(j,k,i)
878	ENDDO
879	ENDDO
880
881	CALL fft_x_m( work_fftx(:,:,j), 'backward' )
882
883	ENDDO
884
885	!$OMP DO
886	DO i = 0, nx
887	DO j = nys, nyn
888	DO k = 1, nz
889	f_out(k,j,i) = work_fftx(i,k,j)
890	ENDDO
891	ENDDO
892	ENDDO
893	!$OMP END PARALLEL
894
895	ELSE
896
897	!
898	!-- Cache optimized code (there might be still a potential for better
899	!-- optimization).
900	!$OMP PARALLEL PRIVATE (i,j,k)
901	!$OMP DO
902	DO j = nys, nyn
903	DO k = 1, nz
904
905	DO i = 0, nx
906	work_fftx(i,k,j) = work(j,k,i)
907	ENDDO
908
909	CALL fft_x_1d( work_fftx(0:nx,k,j), 'backward' )
910
911	ENDDO
912	ENDDO
913
914	!$OMP DO
915	DO i = 0, nx
916	DO j = nys, nyn
917	DO k = 1, nz
918	f_out(k,j,i) = work_fftx(i,k,j)
919	ENDDO
920	ENDDO
921	ENDDO
922	!$OMP END PARALLEL
923
924	ENDIF
925	CALL cpu_log( log_point_s(4), 'fft_x_1d', 'stop' )
926
927	END SUBROUTINE tr_yx_fftx
928
929
930	SUBROUTINE ffty_tri_ffty( ar )
931
932	!------------------------------------------------------------------------------!
933	! FFT along y, solution of the tridiagonal system and backward FFT for
934	! a 1d-decomposition along y
935	!
936	! WARNING: this subroutine may still not work for hybrid parallelization
937	! with OpenMP (for possible necessary changes see the original
938	! routine poisfft_hybrid, developed by Klaus Ketelsen, May 2002)
939	!------------------------------------------------------------------------------!
940
941	USE control_parameters
942	USE cpulog
943	USE grid_variables
944	USE indices
945	USE interfaces
946	USE pegrid
947	USE transpose_indices
948
949	IMPLICIT NONE
950
951	INTEGER :: i, j, k, m, n, omp_get_thread_num, tn
952
953	REAL, DIMENSION(0:ny) :: work_ffty
954	REAL, DIMENSION(0:ny,1:nz) :: work_triy
955	REAL, DIMENSION(nny,1:nz,nxl_y:nxr_y,pdims(2)) :: ar
956	REAL, DIMENSION(:,:,:,:), ALLOCATABLE :: tri
957
958
959	CALL cpu_log( log_point_s(39), 'fft_y_1d + tridia', 'start' )
960
961	ALLOCATE( tri(5,0:ny,0:nz-1,0:threads_per_task-1) )
962
963	tn = 0 ! Default thread number in case of one thread
964	!$OMP PARALLEL DO PRIVATE ( i, j, k, m, n, tn, work_ffty, work_triy )
965	DO i = nxl_y, nxr_y
966
967	!$ tn = omp_get_thread_num()
968
969	IF ( host(1:3) == 'nec' ) THEN
970	!
971	!-- Code optimized for vector processors
972	DO k = 1, nz
973
974	m = 0
975	DO n = 1, pdims(2)
976	DO j = 1, nny
977	work_triy(m,k) = ar(j,k,i,n)
978	m = m + 1
979	ENDDO
980	ENDDO
981
982	ENDDO
983
984	CALL fft_y_m( work_triy, ny, 'forward' )
985
986	ELSE
987	!
988	!-- Cache optimized code
989	DO k = 1, nz
990
991	m = 0
992	DO n = 1, pdims(2)
993	DO j = 1, nny
994	work_ffty(m) = ar(j,k,i,n)
995	m = m + 1
996	ENDDO
997	ENDDO
998
999	CALL fft_y_1d( work_ffty, 'forward' )
1000
1001	DO j = 0, ny
1002	work_triy(j,k) = work_ffty(j)
1003	ENDDO
1004
1005	ENDDO
1006
1007	ENDIF
1008
1009	!
1010	!-- Solve the linear equation system
1011	CALL tridia_1dd( ddy2, ddx2, ny, nx, i, work_triy, tri(:,:,:,tn) )
1012
1013	IF ( host(1:3) == 'nec' ) THEN
1014	!
1015	!-- Code optimized for vector processors
1016	CALL fft_y_m( work_triy, ny, 'backward' )
1017
1018	DO k = 1, nz
1019
1020	m = 0
1021	DO n = 1, pdims(2)
1022	DO j = 1, nny
1023	ar(j,k,i,n) = work_triy(m,k)
1024	m = m + 1
1025	ENDDO
1026	ENDDO
1027
1028	ENDDO
1029
1030	ELSE
1031	!
1032	!-- Cache optimized code
1033	DO k = 1, nz
1034
1035	DO j = 0, ny
1036	work_ffty(j) = work_triy(j,k)
1037	ENDDO
1038
1039	CALL fft_y_1d( work_ffty, 'backward' )
1040
1041	m = 0
1042	DO n = 1, pdims(2)
1043	DO j = 1, nny
1044	ar(j,k,i,n) = work_ffty(m)
1045	m = m + 1
1046	ENDDO
1047	ENDDO
1048
1049	ENDDO
1050
1051	ENDIF
1052
1053	ENDDO
1054
1055	DEALLOCATE( tri )
1056
1057	CALL cpu_log( log_point_s(39), 'fft_y_1d + tridia', 'stop' )
1058
1059	END SUBROUTINE ffty_tri_ffty
1060
1061
1062	SUBROUTINE tridia_1dd( ddx2, ddy2, nx, ny, j, ar, tri )
1063
1064	!------------------------------------------------------------------------------!
1065	! Solves the linear system of equations for a 1d-decomposition along x (see
1066	! tridia)
1067	!
1068	! Attention: when using the intel compilers older than 12.0, array tri must
1069	! be passed as an argument to the contained subroutines. Otherwise
1070	! addres faults will occur. This feature can be activated with
1071	! cpp-switch __intel11
1072	! On NEC, tri should not be passed (except for routine substi_1dd)
1073	! because this causes very bad performance.
1074	!------------------------------------------------------------------------------!
1075
1076	USE arrays_3d
1077	USE control_parameters
1078
1079	USE pegrid
1080
1081	IMPLICIT NONE
1082
1083	INTEGER :: i, j, k, nnyh, nx, ny, omp_get_thread_num, tn
1084
1085	REAL :: ddx2, ddy2
1086
1087	REAL, DIMENSION(0:nx,1:nz) :: ar
1088	REAL, DIMENSION(5,0:nx,0:nz-1) :: tri
1089
1090
1091	nnyh = ( ny + 1 ) / 2
1092
1093	!
1094	!-- Define constant elements of the tridiagonal matrix.
1095	!-- The compiler on SX6 does loop exchange. If 0:nx is a high power of 2,
1096	!-- the exchanged loops create bank conflicts. The following directive
1097	!-- prohibits loop exchange and the loops perform much better.
1098	! tn = omp_get_thread_num()
1099	! WRITE( 120+tn, * ) '+++ id=',myid,' nx=',nx,' thread=', omp_get_thread_num()
1100	! CALL local_flush( 120+tn )
1101	!CDIR NOLOOPCHG
1102	DO k = 0, nz-1
1103	DO i = 0,nx
1104	tri(2,i,k) = ddzu_pres(k+1) * ddzw(k+1)
1105	tri(3,i,k) = ddzu_pres(k+2) * ddzw(k+1)
1106	ENDDO
1107	ENDDO
1108	! WRITE( 120+tn, * ) '+++ id=',myid,' end of first tridia loop thread=', omp_get_thread_num()
1109	! CALL local_flush( 120+tn )
1110
1111	IF ( j <= nnyh ) THEN
1112	#if defined( __intel11 )
1113	CALL maketri_1dd( j, tri )
1114	#else
1115	CALL maketri_1dd( j )
1116	#endif
1117	ELSE
1118	#if defined( __intel11 )
1119	CALL maketri_1dd( ny+1-j, tri )
1120	#else
1121	CALL maketri_1dd( ny+1-j )
1122	#endif
1123	ENDIF
1124	#if defined( __intel11 )
1125	CALL split_1dd( tri )
1126	#else
1127	CALL split_1dd
1128	#endif
1129	CALL substi_1dd( ar, tri )
1130
1131	CONTAINS
1132
1133	#if defined( __intel11 )
1134	SUBROUTINE maketri_1dd( j, tri )
1135	#else
1136	SUBROUTINE maketri_1dd( j )
1137	#endif
1138
1139	!------------------------------------------------------------------------------!
1140	! computes the i- and j-dependent component of the matrix
1141	!------------------------------------------------------------------------------!
1142
1143	USE constants
1144
1145	IMPLICIT NONE
1146
1147	INTEGER :: i, j, k, nnxh
1148	REAL :: a, c
1149
1150	REAL, DIMENSION(0:nx) :: l
1151
1152	#if defined( __intel11 )
1153	REAL, DIMENSION(5,0:nx,0:nz-1) :: tri
1154	#endif
1155
1156
1157	nnxh = ( nx + 1 ) / 2
1158	!
1159	!-- Provide the tridiagonal matrix for solution of the Poisson equation in
1160	!-- Fourier space. The coefficients are computed following the method of
1161	!-- Schmidt et al. (DFVLR-Mitteilung 84-15), which departs from Stephan
1162	!-- Siano's original version by discretizing the Poisson equation,
1163	!-- before it is Fourier-transformed
1164	DO i = 0, nx
1165	IF ( i >= 0 .AND. i <= nnxh ) THEN
1166	l(i) = 2.0 * ( 1.0 - COS( ( 2.0 * pi * i ) / &
1167	REAL( nx+1 ) ) ) * ddx2 + &
1168	2.0 * ( 1.0 - COS( ( 2.0 * pi * j ) / &
1169	REAL( ny+1 ) ) ) * ddy2
1170	ELSE
1171	l(i) = 2.0 * ( 1.0 - COS( ( 2.0 * pi * ( nx+1-i ) ) / &
1172	REAL( nx+1 ) ) ) * ddx2 + &
1173	2.0 * ( 1.0 - COS( ( 2.0 * pi * j ) / &
1174	REAL( ny+1 ) ) ) * ddy2
1175	ENDIF
1176	ENDDO
1177
1178	DO k = 0, nz-1
1179	DO i = 0, nx
1180	a = -1.0 * ddzu_pres(k+2) * ddzw(k+1)
1181	c = -1.0 * ddzu_pres(k+1) * ddzw(k+1)
1182	tri(1,i,k) = a + c - l(i)
1183	ENDDO
1184	ENDDO
1185	IF ( ibc_p_b == 1 ) THEN
1186	DO i = 0, nx
1187	tri(1,i,0) = tri(1,i,0) + tri(2,i,0)
1188	ENDDO
1189	ENDIF
1190	IF ( ibc_p_t == 1 ) THEN
1191	DO i = 0, nx
1192	tri(1,i,nz-1) = tri(1,i,nz-1) + tri(3,i,nz-1)
1193	ENDDO
1194	ENDIF
1195
1196	END SUBROUTINE maketri_1dd
1197
1198
1199	#if defined( __intel11 )
1200	SUBROUTINE split_1dd( tri )
1201	#else
1202	SUBROUTINE split_1dd
1203	#endif
1204
1205	!------------------------------------------------------------------------------!
1206	! Splitting of the tridiagonal matrix (Thomas algorithm)
1207	!------------------------------------------------------------------------------!
1208
1209	IMPLICIT NONE
1210
1211	INTEGER :: i, k
1212
1213	#if defined( __intel11 )
1214	REAL, DIMENSION(5,0:nx,0:nz-1) :: tri
1215	#endif
1216
1217
1218	!
1219	!-- Splitting
1220	DO i = 0, nx
1221	tri(4,i,0) = tri(1,i,0)
1222	ENDDO
1223	DO k = 1, nz-1
1224	DO i = 0, nx
1225	tri(5,i,k) = tri(2,i,k) / tri(4,i,k-1)
1226	tri(4,i,k) = tri(1,i,k) - tri(3,i,k-1) * tri(5,i,k)
1227	ENDDO
1228	ENDDO
1229
1230	END SUBROUTINE split_1dd
1231
1232
1233	SUBROUTINE substi_1dd( ar, tri )
1234
1235	!------------------------------------------------------------------------------!
1236	! Substitution (Forward and Backward) (Thomas algorithm)
1237	!------------------------------------------------------------------------------!
1238
1239	IMPLICIT NONE
1240
1241	INTEGER :: i, k
1242
1243	REAL, DIMENSION(0:nx,nz) :: ar
1244	REAL, DIMENSION(0:nx,0:nz-1) :: ar1
1245	REAL, DIMENSION(5,0:nx,0:nz-1) :: tri
1246
1247	!
1248	!-- Forward substitution
1249	DO i = 0, nx
1250	ar1(i,0) = ar(i,1)
1251	ENDDO
1252	DO k = 1, nz-1
1253	DO i = 0, nx
1254	ar1(i,k) = ar(i,k+1) - tri(5,i,k) * ar1(i,k-1)
1255	ENDDO
1256	ENDDO
1257
1258	!
1259	!-- Backward substitution
1260	!-- Note, the add of 1.0E-20 in the denominator is due to avoid divisions
1261	!-- by zero appearing if the pressure bc is set to neumann at the top of
1262	!-- the model domain.
1263	DO i = 0, nx
1264	ar(i,nz) = ar1(i,nz-1) / ( tri(4,i,nz-1) + 1.0E-20 )
1265	ENDDO
1266	DO k = nz-2, 0, -1
1267	DO i = 0, nx
1268	ar(i,k+1) = ( ar1(i,k) - tri(3,i,k) * ar(i,k+2) ) &
1269	/ tri(4,i,k)
1270	ENDDO
1271	ENDDO
1272
1273	!
1274	!-- Indices i=0, j=0 correspond to horizontally averaged pressure.
1275	!-- The respective values of ar should be zero at all k-levels if
1276	!-- acceleration of horizontally averaged vertical velocity is zero.
1277	IF ( ibc_p_b == 1 .AND. ibc_p_t == 1 ) THEN
1278	IF ( j == 0 ) THEN
1279	DO k = 1, nz
1280	ar(0,k) = 0.0
1281	ENDDO
1282	ENDIF
1283	ENDIF
1284
1285	END SUBROUTINE substi_1dd
1286
1287	END SUBROUTINE tridia_1dd
1288
1289
1290	SUBROUTINE tridia( ar )
1291
1292	!------------------------------------------------------------------------------!
1293	! solves the linear system of equations:
1294	!
1295	! -(4 pi^2(i^2/(dx^2nnx^2)+j^2/(dy^2nny^2))+
1296	! 1/(dzu(k)dzw(k))+1/(dzu(k-1)dzw(k)))*p(i,j,k)+
1297	! 1/(dzu(k)dzw(k))p(i,j,k+1)+1/(dzu(k-1)dzw(k))p(i,j,k-1)=d(i,j,k)
1298	!
1299	! by using the Thomas algorithm
1300	!------------------------------------------------------------------------------!
1301
1302	USE arrays_3d
1303
1304	IMPLICIT NONE
1305
1306	INTEGER :: i, j, k
1307
1308	!$acc declare create( tri )
1309	REAL, DIMENSION(nxl_z:nxr_z,nys_z:nyn_z,0:nz-1,2) :: tri
1310
1311	REAL :: ar(nxl_z:nxr_z,nys_z:nyn_z,1:nz)
1312
1313
1314	CALL split( tri )
1315	CALL substi( ar, tri )
1316
1317	END SUBROUTINE tridia
1318
1319
1320	SUBROUTINE maketri
1321
1322	!------------------------------------------------------------------------------!
1323	! Computes the i- and j-dependent component of the matrix
1324	!------------------------------------------------------------------------------!
1325
1326	USE arrays_3d, ONLY: tric
1327	USE constants
1328	USE control_parameters
1329	USE grid_variables
1330
1331	IMPLICIT NONE
1332
1333	INTEGER :: i, j, k, nnxh, nnyh
1334
1335	!$acc declare create( ll )
1336	REAL :: ll(nxl_z:nxr_z,nys_z:nyn_z)
1337
1338
1339	nnxh = ( nx + 1 ) / 2
1340	nnyh = ( ny + 1 ) / 2
1341
1342	!
1343	!-- Provide the constant coefficients of the tridiagonal matrix for solution
1344	!-- of the Poisson equation in Fourier space.
1345	!-- The coefficients are computed following the method of
1346	!-- Schmidt et al. (DFVLR-Mitteilung 84-15), which departs from Stephan
1347	!-- Siano's original version by discretizing the Poisson equation,
1348	!-- before it is Fourier-transformed.
1349
1350	!$acc kernels present( tric )
1351	!$acc loop vector( 32 )
1352	DO j = nys_z, nyn_z
1353	DO i = nxl_z, nxr_z
1354	IF ( j >= 0 .AND. j <= nnyh ) THEN
1355	IF ( i >= 0 .AND. i <= nnxh ) THEN
1356	ll(i,j) = 2.0 * ( 1.0 - COS( ( 2.0 * pi * i ) / &
1357	REAL( nx+1 ) ) ) / ( dx * dx ) + &
1358	2.0 * ( 1.0 - COS( ( 2.0 * pi * j ) / &
1359	REAL( ny+1 ) ) ) / ( dy * dy )
1360	ELSE
1361	ll(i,j) = 2.0 * ( 1.0 - COS( ( 2.0 * pi * ( nx+1-i ) ) / &
1362	REAL( nx+1 ) ) ) / ( dx * dx ) + &
1363	2.0 * ( 1.0 - COS( ( 2.0 * pi * j ) / &
1364	REAL( ny+1 ) ) ) / ( dy * dy )
1365	ENDIF
1366	ELSE
1367	IF ( i >= 0 .AND. i <= nnxh ) THEN
1368	ll(i,j) = 2.0 * ( 1.0 - COS( ( 2.0 * pi * i ) / &
1369	REAL( nx+1 ) ) ) / ( dx * dx ) + &
1370	2.0 * ( 1.0 - COS( ( 2.0 * pi * ( ny+1-j ) ) / &
1371	REAL( ny+1 ) ) ) / ( dy * dy )
1372	ELSE
1373	ll(i,j) = 2.0 * ( 1.0 - COS( ( 2.0 * pi * ( nx+1-i ) ) / &
1374	REAL( nx+1 ) ) ) / ( dx * dx ) + &
1375	2.0 * ( 1.0 - COS( ( 2.0 * pi * ( ny+1-j ) ) / &
1376	REAL( ny+1 ) ) ) / ( dy * dy )
1377	ENDIF
1378	ENDIF
1379	ENDDO
1380	ENDDO
1381
1382	!$acc loop
1383	DO k = 0, nz-1
1384	DO j = nys_z, nyn_z
1385	!$acc loop vector( 32 )
1386	DO i = nxl_z, nxr_z
1387	tric(i,j,k) = ddzuw(k,3) - ll(i,j)
1388	ENDDO
1389	ENDDO
1390	ENDDO
1391	!$acc end kernels
1392
1393	IF ( ibc_p_b == 1 ) THEN
1394	!$acc kernels present( tric )
1395	!$acc loop
1396	DO j = nys_z, nyn_z
1397	DO i = nxl_z, nxr_z
1398	tric(i,j,0) = tric(i,j,0) + ddzuw(0,1)
1399	ENDDO
1400	ENDDO
1401	!$acc end kernels
1402	ENDIF
1403	IF ( ibc_p_t == 1 ) THEN
1404	!$acc kernels present( tric )
1405	!$acc loop
1406	DO j = nys_z, nyn_z
1407	DO i = nxl_z, nxr_z
1408	tric(i,j,nz-1) = tric(i,j,nz-1) + ddzuw(nz-1,2)
1409	ENDDO
1410	ENDDO
1411	!$acc end kernels
1412	ENDIF
1413
1414	END SUBROUTINE maketri
1415
1416
1417	SUBROUTINE substi( ar, tri )
1418
1419	!------------------------------------------------------------------------------!
1420	! Substitution (Forward and Backward) (Thomas algorithm)
1421	!------------------------------------------------------------------------------!
1422
1423	USE control_parameters
1424
1425	IMPLICIT NONE
1426
1427	INTEGER :: i, j, k
1428
1429	REAL :: ar(nxl_z:nxr_z,nys_z:nyn_z,1:nz)
1430	REAL, DIMENSION(nxl_z:nxr_z,nys_z:nyn_z,0:nz-1,2) :: tri
1431
1432	!$acc declare create( ar1 )
1433	REAL, DIMENSION(nxl_z:nxr_z,nys_z:nyn_z,0:nz-1) :: ar1
1434
1435	!
1436	!-- Forward substitution
1437	DO k = 0, nz - 1
1438	!$acc kernels present( ar, tri )
1439	!$acc loop
1440	DO j = nys_z, nyn_z
1441	DO i = nxl_z, nxr_z
1442
1443	IF ( k == 0 ) THEN
1444	ar1(i,j,k) = ar(i,j,k+1)
1445	ELSE
1446	ar1(i,j,k) = ar(i,j,k+1) - tri(i,j,k,2) * ar1(i,j,k-1)
1447	ENDIF
1448
1449	ENDDO
1450	ENDDO
1451	!$acc end kernels
1452	ENDDO
1453
1454	!
1455	!-- Backward substitution
1456	!-- Note, the 1.0E-20 in the denominator is due to avoid divisions
1457	!-- by zero appearing if the pressure bc is set to neumann at the top of
1458	!-- the model domain.
1459	DO k = nz-1, 0, -1
1460	!$acc kernels present( ar, tri )
1461	!$acc loop
1462	DO j = nys_z, nyn_z
1463	DO i = nxl_z, nxr_z
1464
1465	IF ( k == nz-1 ) THEN
1466	ar(i,j,k+1) = ar1(i,j,k) / ( tri(i,j,k,1) + 1.0E-20 )
1467	ELSE
1468	ar(i,j,k+1) = ( ar1(i,j,k) - ddzuw(k,2) * ar(i,j,k+2) ) &
1469	/ tri(i,j,k,1)
1470	ENDIF
1471	ENDDO
1472	ENDDO
1473	!$acc end kernels
1474	ENDDO
1475
1476	!
1477	!-- Indices i=0, j=0 correspond to horizontally averaged pressure.
1478	!-- The respective values of ar should be zero at all k-levels if
1479	!-- acceleration of horizontally averaged vertical velocity is zero.
1480	IF ( ibc_p_b == 1 .AND. ibc_p_t == 1 ) THEN
1481	IF ( nys_z == 0 .AND. nxl_z == 0 ) THEN
1482	!$acc kernels loop present( ar )
1483	DO k = 1, nz
1484	ar(nxl_z,nys_z,k) = 0.0
1485	ENDDO
1486	ENDIF
1487	ENDIF
1488
1489	END SUBROUTINE substi
1490
1491
1492	SUBROUTINE split( tri )
1493
1494	!------------------------------------------------------------------------------!
1495	! Splitting of the tridiagonal matrix (Thomas algorithm)
1496	!------------------------------------------------------------------------------!
1497
1498	USE arrays_3d, ONLY: tric
1499
1500	IMPLICIT NONE
1501
1502	INTEGER :: i, j, k
1503
1504	REAL, DIMENSION(nxl_z:nxr_z,nys_z:nyn_z,0:nz-1,2) :: tri
1505
1506	!
1507	!-- Splitting
1508	!$acc kernels present( tri, tric )
1509	!$acc loop
1510	DO j = nys_z, nyn_z
1511	!$acc loop vector( 32 )
1512	DO i = nxl_z, nxr_z
1513	tri(i,j,0,1) = tric(i,j,0)
1514	ENDDO
1515	ENDDO
1516	!$acc end kernels
1517
1518	DO k = 1, nz-1
1519	!$acc kernels present( tri, tric )
1520	!$acc loop
1521	DO j = nys_z, nyn_z
1522	!$acc loop vector( 32 )
1523	DO i = nxl_z, nxr_z
1524	tri(i,j,k,2) = ddzuw(k,1) / tri(i,j,k-1,1)
1525	tri(i,j,k,1) = tric(i,j,k) - ddzuw(k-1,2) * tri(i,j,k,2)
1526	ENDDO
1527	ENDDO
1528	!$acc end kernels
1529	ENDDO
1530
1531	END SUBROUTINE split
1532
1533	#endif
1534
1535	END MODULE poisfft_mod

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: palm/trunk/SOURCE/poisfft.f90 @ 1111

Download in other formats: