source: palm/trunk/SOURCE/cpu_statistics.f90 @ 1111

Last change on this file since 1111 was 1111, checked in by raasch, 11 years ago

New:
---

GPU porting of pres, swap_timelevel. Adjustments of OpenACC directives.
Further porting of poisfft, which now runs completely on the GPU without any
host/device data transfer for serial and parallel runs (but parallel runs
require data transfer before and after the MPI transpositions).
GPU porting of the tridiagonal solver:
tridiagonal routines split into external subroutines (instead of using CONTAINS),
no distinction between parallel/non-parallel in poisfft and tridia any more,
tridia routines moved to the end of the file because of a probable bug in the PGI
compiler (otherwise "invalid device function" is reported at runtime).
(cuda_fft_interfaces, fft_xy, flow_statistics, init_3d_model, palm, poisfft, pres, prognostic_equations, swap_timelevel, time_integration, transpose)
output of accelerator board information (header)

optimization of tridia routines: constant elements and coefficients of tri are
stored in separate arrays ddzuw and tric, last dimension of tri reduced from 5 to 2,
(init_grid, init_3d_model, modules, palm, poisfft)

poisfft_init is now called internally from poisfft,
(Makefile, Makefile_check, init_pegrid, poisfft, poisfft_hybrid)

CPU-time per grid point and timestep is output to CPU_MEASURES file
(cpu_statistics, modules, time_integration)
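
For reference, this new output boils down to the following calculation (it mirrors
the code in cpu_statistics shown below; log_point(1)%sum holds the total CPU time
of the run, and the result is given in microseconds per grid point and timestep):

   IF ( nr_timesteps_this_run /= 0 )  THEN
!
!--   total CPU time divided by the number of grid points and timesteps
      average_cputime = log_point(1)%sum / REAL( (nx+1) * (ny+1) * nz ) / &
                        REAL( nr_timesteps_this_run ) * 1E6   ! in micro-sec
   ELSE
      average_cputime = -1.0   ! no timesteps yet, value flagged as invalid
   ENDIF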

Changed:


resorting from/to array work changed, work now has 4 dimensions instead of 1 (transpose)
array diss allocated only if required (init_3d_model)

pressure boundary condition "Neumann+inhomo" removed from the code
(check_parameters, header, poisfft, poisfft_hybrid, pres)

Errors:


bugfix: dependency added for cuda_fft_interfaces (Makefile)
bugfix: CUDA FFT plans adjusted for domain decomposition (before, they always
used the total domain) (fft_xy)
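
A minimal sketch of the idea behind this fix, assuming a Fortran wrapper around
cuFFT's cufftPlan1d (the wrapper name CUFFT_PLAN1D, the plan handle plan_xf and
the subdomain bounds nys, nyn, nzb_x, nzt_x are assumptions for illustration,
not quoted from fft_xy):

   ! hypothetical sketch: the batch count of the 1d x-FFT plan is taken from
   ! the subdomain extent of the transposed array handled by this PE, no longer
   ! from the total (nx+1)*(ny+1)*nz domain
   CALL CUFFT_PLAN1D( plan_xf, nx+1, CUFFT_D2Z, (nyn-nys+1) * (nzt_x-nzb_x+1) )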

  • Property svn:keywords set to Id
File size: 11.6 KB
 SUBROUTINE cpu_statistics

!--------------------------------------------------------------------------------!
! This file is part of PALM.
!
! PALM is free software: you can redistribute it and/or modify it under the terms
! of the GNU General Public License as published by the Free Software Foundation,
! either version 3 of the License, or (at your option) any later version.
!
! PALM is distributed in the hope that it will be useful, but WITHOUT ANY
! WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
! A PARTICULAR PURPOSE.  See the GNU General Public License for more details.
!
! You should have received a copy of the GNU General Public License along with
! PALM. If not, see <http://www.gnu.org/licenses/>.
!
! Copyright 1997-2012  Leibniz University Hannover
!--------------------------------------------------------------------------------!
!
! Current revisions:
! -----------------
! output of grid point numbers and average CPU time per grid point and timestep
!
! Former revisions:
! -----------------
! $Id: cpu_statistics.f90 1111 2013-03-08 23:54:10Z raasch $
!
! 1092 2013-02-02 11:24:22Z raasch
! unused variables removed
!
! 1036 2012-10-22 13:43:42Z raasch
! code put under GPL (PALM 3.9)
!
! 1015 2012-09-27 09:23:24Z raasch
! output of accelerator board information
!
! 683 2011-02-09 14:25:15Z raasch
! output of handling of ghostpoint exchange
!
! 622 2010-12-10 08:08:13Z raasch
! output of handling of collective operations
!
! 222 2009-01-12 16:04:16Z letzel
! Bugfix for nonparallel execution
!
! 197 2008-09-16 15:29:03Z raasch
! Format adjustments in order to allow CPU# > 999,
! data are collected from PE0 in an ordered sequence which seems to avoid
! hanging of processes on SGI-ICE
!
! 82 2007-04-16 15:40:52Z raasch
! Preprocessor directives for old systems removed
!
! RCS Log replace by Id keyword, revision history cleaned up
!
! Revision 1.13  2006/04/26 12:10:51  raasch
! Output of number of threads per task, max = min in case of 1 PE
!
! Revision 1.1  1997/07/24 11:11:11  raasch
! Initial revision
!
!
! Description:
! ------------
! Analysis and output of the cpu-times measured. All PE results are collected
! on PE0 in order to calculate the mean cpu-time over all PEs and other
! statistics. The output is sorted according to the amount of cpu-time consumed
! and output on PE0.
!------------------------------------------------------------------------------!

    USE control_parameters
    USE cpulog
    USE indices,  ONLY: nx, ny, nz
    USE pegrid

    IMPLICIT NONE

    INTEGER    ::  i, ii(1), iii, sender
    REAL       ::  average_cputime
    REAL, SAVE ::  norm = 1.0
    REAL, DIMENSION(:),   ALLOCATABLE ::  pe_max, pe_min, pe_rms, sum
    REAL, DIMENSION(:,:), ALLOCATABLE ::  pe_log_points


!
!-- Compute cpu-times in seconds
    log_point%mtime  = log_point%mtime  / norm
    log_point%sum    = log_point%sum    / norm
    log_point%vector = log_point%vector / norm
    WHERE ( log_point%counts /= 0 )
       log_point%mean = log_point%sum / log_point%counts
    END WHERE


!
!-- Collect cpu-times from all PEs and calculate statistics
    IF ( myid == 0 )  THEN
!
!--    Allocate and initialize temporary arrays needed for statistics
       ALLOCATE( pe_max( SIZE( log_point ) ), pe_min( SIZE( log_point ) ), &
                 pe_rms( SIZE( log_point ) ),                              &
                 pe_log_points( SIZE( log_point ), 0:numprocs-1 ) )
       pe_min = log_point%sum
       pe_max = log_point%sum    ! need to be set in case of 1 PE
       pe_rms = 0.0

#if defined( __parallel )
!
!--    Receive data from all PEs
       DO  i = 1, numprocs-1
          CALL MPI_RECV( pe_max(1), SIZE( log_point ), MPI_REAL, &
                         i, i, comm2d, status, ierr )
          sender = status(MPI_SOURCE)
          pe_log_points(:,sender) = pe_max
       ENDDO
       pe_log_points(:,0) = log_point%sum   ! Results from PE0
!
!--    Calculate mean of all PEs, store it on log_point%sum
!--    and find minimum and maximum
       DO  iii = 1, SIZE( log_point )
          DO  i = 1, numprocs-1
             log_point(iii)%sum = log_point(iii)%sum + pe_log_points(iii,i)
             pe_min(iii) = MIN( pe_min(iii), pe_log_points(iii,i) )
             pe_max(iii) = MAX( pe_max(iii), pe_log_points(iii,i) )
          ENDDO
          log_point(iii)%sum = log_point(iii)%sum / numprocs
!
!--       Calculate rms
          DO  i = 0, numprocs-1
             pe_rms(iii) = pe_rms(iii) + ( &
                                 pe_log_points(iii,i) - log_point(iii)%sum &
                                         )**2
          ENDDO
          pe_rms(iii) = SQRT( pe_rms(iii) / numprocs )
       ENDDO
    ELSE
!
!--    Send data to PE0 (pe_max is used as temporary storage to send
!--    the data in order to avoid sending the data type log)
       ALLOCATE( pe_max( SIZE( log_point ) ) )
       pe_max = log_point%sum
       CALL MPI_SEND( pe_max(1), SIZE( log_point ), MPI_REAL, 0, myid, comm2d, &
                      ierr )
#endif

    ENDIF

!
!-- Write cpu-times
    IF ( myid == 0 )  THEN
!
!--    Re-store sums
       ALLOCATE( sum( SIZE( log_point ) ) )
       WHERE ( log_point%counts /= 0 )
          sum = log_point%sum
       ELSEWHERE
          sum = -1.0
       ENDWHERE

!
!--    Get total time in order to calculate CPU-time per gridpoint and timestep
       IF ( nr_timesteps_this_run /= 0 )  THEN
          average_cputime = log_point(1)%sum / REAL( (nx+1) * (ny+1) * nz ) / &
                            REAL( nr_timesteps_this_run ) * 1E6  ! in micro-sec
       ELSE
          average_cputime = -1.0
       ENDIF

!
!--    Write cpu-times sorted by size
       CALL check_open( 18 )
#if defined( __parallel )
       WRITE ( 18, 100 )  TRIM( run_description_header ),                          &
                          numprocs * threads_per_task, pdims(1), pdims(2),         &
                          threads_per_task, nx+1, ny+1, nz, nr_timesteps_this_run, &
                          average_cputime

       IF ( num_acc_per_node /= 0 )  WRITE ( 18, 108 )  num_acc_per_node
       WRITE ( 18, 110 )
#else
       WRITE ( 18, 100 )  TRIM( run_description_header ),                          &
                          numprocs * threads_per_task, 1, 1,                       &
                          threads_per_task, nx+1, ny+1, nz, nr_timesteps_this_run, &
                          average_cputime

       IF ( num_acc_per_node /= 0 )  WRITE ( 18, 109 )  num_acc_per_node
       WRITE ( 18, 110 )
#endif
       DO
          ii = MAXLOC( sum )
          i = ii(1)
          IF ( sum(i) /= -1.0 )  THEN
             WRITE ( 18, 102 ) &
                log_point(i)%place, log_point(i)%sum,                &
                log_point(i)%sum / log_point(1)%sum * 100.0,         &
                log_point(i)%counts, pe_min(i), pe_max(i), pe_rms(i)
             sum(i) = -1.0
          ELSE
             EXIT
          ENDIF
       ENDDO
    ENDIF


!
!-- The same procedure again for the individual measurements.
!
!-- Compute cpu-times in seconds
    log_point_s%mtime  = log_point_s%mtime  / norm
    log_point_s%sum    = log_point_s%sum    / norm
    log_point_s%vector = log_point_s%vector / norm
    WHERE ( log_point_s%counts /= 0 )
       log_point_s%mean = log_point_s%sum / log_point_s%counts
    END WHERE

!
!-- Collect cpu-times from all PEs and calculate statistics
#if defined( __parallel )
!
!-- Set barrier in order to avoid that PE0 receives log_point_s-data
!-- while still busy with receiving log_point-data (see above)
    CALL MPI_BARRIER( comm2d, ierr )
#endif
    IF ( myid == 0 )  THEN
!
!--    Initialize temporary arrays needed for statistics
       pe_min = log_point_s%sum
       pe_max = log_point_s%sum    ! need to be set in case of 1 PE
       pe_rms = 0.0

#if defined( __parallel )
!
!--    Receive data from all PEs
       DO  i = 1, numprocs-1
          CALL MPI_RECV( pe_max(1), SIZE( log_point ), MPI_REAL, &
                         MPI_ANY_SOURCE, MPI_ANY_TAG, comm2d, status, ierr )
          sender = status(MPI_SOURCE)
          pe_log_points(:,sender) = pe_max
       ENDDO
       pe_log_points(:,0) = log_point_s%sum   ! Results from PE0
!
!--    Calculate mean of all PEs, store it on log_point_s%sum
!--    and find minimum and maximum
       DO  iii = 1, SIZE( log_point )
          DO  i = 1, numprocs-1
             log_point_s(iii)%sum = log_point_s(iii)%sum + pe_log_points(iii,i)
             pe_min(iii) = MIN( pe_min(iii), pe_log_points(iii,i) )
             pe_max(iii) = MAX( pe_max(iii), pe_log_points(iii,i) )
          ENDDO
          log_point_s(iii)%sum = log_point_s(iii)%sum / numprocs
!
!--       Calculate rms
          DO  i = 0, numprocs-1
             pe_rms(iii) = pe_rms(iii) + ( &
                                 pe_log_points(iii,i) - log_point_s(iii)%sum &
                                         )**2
          ENDDO
          pe_rms(iii) = SQRT( pe_rms(iii) / numprocs )
       ENDDO
    ELSE
!
!--    Send data to PE0 (pe_max is used as temporary storage to send
!--    the data in order to avoid sending the data type log)
       pe_max = log_point_s%sum
       CALL MPI_SEND( pe_max(1), SIZE( log_point ), MPI_REAL, 0, 0, comm2d, &
                      ierr )
#endif

    ENDIF

!
!-- Write cpu-times
    IF ( myid == 0 )  THEN
!
!--    Re-store sums
       WHERE ( log_point_s%counts /= 0 )
          sum = log_point_s%sum
       ELSEWHERE
          sum = -1.0
       ENDWHERE

!
!--    Write cpu-times sorted by size
       WRITE ( 18, 101 )
       DO
          ii = MAXLOC( sum )
          i = ii(1)
          IF ( sum(i) /= -1.0 )  THEN
             WRITE ( 18, 102 ) &
                log_point_s(i)%place, log_point_s(i)%sum, &
                log_point_s(i)%sum / log_point(1)%sum * 100.0, &
                log_point_s(i)%counts, pe_min(i), pe_max(i), pe_rms(i)
             sum(i) = -1.0
          ELSE
             EXIT
          ENDIF
       ENDDO

!
!--    Output of handling of MPI operations
       IF ( collective_wait )  THEN
          WRITE ( 18, 103 )
       ELSE
          WRITE ( 18, 104 )
       ENDIF
       IF ( synchronous_exchange )  THEN
          WRITE ( 18, 105 )
       ELSE
          WRITE ( 18, 106 )
       ENDIF

!
!--    Empty lines in order to create a gap to the results of the model
!--    continuation runs
       WRITE ( 18, 107 )

!
!--    Unit 18 is not needed anymore
       CALL close_file( 18 )

    ENDIF


100 FORMAT (A/11('-')//'CPU measures for ',I5,' PEs (',I5,'(x) * ',I5,'(y', &
            &') tasks *',I5,' threads):'//                                  &
            'gridpoints (x/y/z): ',20X,I5,' * ',I5,' * ',I5/                &
            'nr of timesteps: ',22X,I6/                                     &
            'cpu time per grid point and timestep: ',5X,F8.5,' * 10**-6 s')

101 FORMAT (/'special measures:'/ &
            &'-----------------------------------------------------------', &
            &'--------------------')

102 FORMAT (A20,2X,F9.3,2X,F7.2,1X,I7,3(1X,F9.3))
103 FORMAT (/'Barriers are set in front of collective operations')
104 FORMAT (/'No barriers are set in front of collective operations')
105 FORMAT (/'Exchange of ghostpoints via MPI_SENDRCV')
106 FORMAT (/'Exchange of ghostpoints via MPI_ISEND/MPI_IRECV')
107 FORMAT (//)
108 FORMAT ('Accelerator boards per node: ',14X,I2)
109 FORMAT ('Accelerator boards: ',23X,I2)
110 FORMAT ('----------------------------------------------------------',   &
            &'------------'//&
            &'place:                        mean        counts      min  ', &
            &'     max       rms'/ &
            &'                           sec.      %                sec. ', &
            &'     sec.      sec.'/  &
            &'-----------------------------------------------------------', &
            &'-------------------')

 END SUBROUTINE cpu_statistics