MOSAIK/PALM-4U simulation status and results

First building-resolving large-eddy simulations for entire Berlin

Presentation at ICUC10, New York City, August 2018

Validation runs

VDI 3783 Part 9

Status message: completed

Description Date of issue Closing date
Validation completed 23/07/2019 23/07/2019

Validation protocol and details: here


Run 01 (VALM01): Winter 2017 Berlin, Jan 17 06:00 UTC - Jan 18 06:00

Status message: debugging

Description Date of issue Closing date Further remarks
Start of VALM01 testing 01/01/2019
... ...
Crash in nested runs. April-May ... Smaller test simulations run well.
Bug in radiation in a nested run 21/05/2019 22/05/2019
... ...
Fix numerical issues that lead to unrealistic concentrations of chemical compounds in case of (offline) nesting. July end of July
Memory demand for calculation of view factors (radiative transfer) 01/08/2019 03/08/2019 The OOM killer aborted processes randomly. Leaving some cores on each node unused fixed this.
Parent and child grids do not overlap (required after revision of the nesting) 05/08/2019 05/08/2019 New child drivers are required as the number of grid points changed.
Driver problem with bridges at the boundary. 05/08/2019 05/08/2019 Child was moved a few meters northward.
Bug in building parameters, wrong dimension in static input file 08/08/2019 08/08/2019
Bug when green roofs are present. 09/08/2019 10/10/2019 Green roofs were disabled in the simulation. Fixed now.
MPI network problems on the Cray machine in Berlin 10/08/2019 19/08/2019 Recurring MPI failures at different locations in the code; they appeared only in the large winter IOP simulations, not in smaller ones. As a consequence, all runs were carried out on the Atos machine in Göttingen.
Minor bug in new implementation of external radiative forcing 21/08/2019 21/08/2019
Failure due to MPI errors. 05/09/2019 21/09/2019 Appeared also in smaller test simulations. Needs to be fixed on the HLRN side.
Emission module caused model crash 23/09/2019 27/09/2019
MPI error just at the beginning of the run - HLRN internal problem: ofi fabric is not available ... 24/09/2019 ? Error does not appear any more.
Crashed with error message corrupted double-linked list 01/10/2019 07/10/2019 Scheduling does not seem to be working properly: the job had been queued for 4 days. Further debug messages were implemented to narrow down the location. The message comes from the parent domain. However, the memory-consuming sky-view factors were calculated.
Crashed with An allocatable array is already allocated 09/10/2019 10/10/2019
Crashed by an MPI error 13/10/2019 Parent finished initialization. Crashes in an MPI_ALLGATHER call in surface_data_output_init. Might be connected to the HLRN network problem (14/10/2019).
Crashed again with error message corrupted double-linked list in child simulation. 17/10/2019 Parent finished initialization. Crashes again in surface_data_output_init. Next step: switch off surface-data output, as this has no priority at the moment. Note: due to limited resources at HLRN, queuing times for simulations are quite long, sometimes several days.
Crashed with Floating divide by zero 23/10/2019 Error seems to be raised within routine drydepo_aero_zhang_vd. Error occurs after time stepping started (initialization finished). Further debugging for this error is ongoing.
Start child-only simulation 23/10/2019 Due to continuous errors within the nested simulation, a non-nested (child-only) simulation is started to get first results for evaluation. Simulation is still running (06/11/2019).
Finished child-only simulation 27/11/2019 Simulation crashed at 11:57:34.95 UTC with an input/output error. Data up to that point is saved.
Crashes in MPI_INIT 03/11/2019 Simulation crashed several times in MPI_INIT (environment problems).
Crash 03/12/2019 Program aborted due to the check of surface_fractions. The check was revised so that surface fractions can also be set at building grid points.
Crash 06/12/2019 HDF5 error; could not be reproduced.
Crash 08/12/2019 Floating invalid in advection for the u-component at the first timestep. Unfortunately, this error could not be reproduced. Remark: jobs were queued for about a week on HLRN due to insufficient capacity, so investigations and bug tracing were delayed.
Parent simulation 23/12/2019 Investigation proceeds on HLRN Berlin. The parent simulation runs for an hour; results look plausible.
Nested simulation - numerical issues 27/12/2019 Nested simulation ran for 1 minute. However, large oscillations in the u- and v-components could be observed within the child. I hypothesize that this is due to the 3D initialization of the child from the parent. Because of mismatches in the building configuration (caused by the large grid aspect ratio), many grid points in the child remain zero after initialization, even though these grid points belong to the atmosphere. Since the mass flux is largely affected by this, strong oscillations arise within the child and finally lead to a crash.
Nested simulation 02/01/2020 The Lustre system in Berlin is still not fully set up, so simulations repeatedly hang or crash due to the slow filesystem.
Nested simulation - initial run 03/01/2020 Lustre filesystem issues seem to be solved for now. Initialization of the child has been changed: the child is now initialized via the dynamic driver rather than via the coupler. This way all atmosphere grid points are initialized appropriately. The nested simulation is at t = 30 min. First estimate of duration: 12 h of real time on 6720 cores simulates about 1 h. With 30 hrs simulation time (00:00:00 UTC - 06:00:00 UTC, next day), we will need about 30 restarts. Since the machine in Berlin is now starting to fill up with other users, we will only be able to do one simulation per day (optimistic scenario), so this will take at least one month.
Nested simulation - restart run 09/01/2020 Simulation crashes while reading the restart data for one PE in the child.
Nested simulation - restart run 29/01/2020 After recurrent maintenance-related breaks on HLRN, the restart simulation was started again. The simulation alternately crashes either with an HDF5 error in the parent or while reading the restart data. In the parent this happens while reading the NetCDF input data. On most ranks there is no problem with the NetCDF input; on some ranks, however, NF90_INQUIRE and NF90_INQUIRE_VARIABLE produce NetCDF error codes. In the child, the error is reproducible: even if the initial simulation is run again, the problem occurs. This happens only at specific ranks. We will downscale the simulation to debug this more efficiently. (Un)fortunately, these problems no longer occur now that HLRN runs more stably, so the reason for these crashes cannot be traced back.
Nested simulation 06/02/2020 After several fixes on the HLRN side, I started the whole simulation again with debug prints. The initial simulation did not show any problems. The following restart run also ran fine, with no problems with NF90_INQUIRE or with empty binary files. The second restart run is queued now. We are at t ~ 2940 s.
Nested simulation 10/02/2020 We are at 03:00 UTC. Model run crashed in biometeorology_mod at first timestep after restart. The crash could be traced back to a NaN in pt_av at a single grid point. All other quantities, including pt, look reasonable.
Nested simulation 27/02/2020 Simulation was started again. This time we reached 04:00 UTC. The simulation now crashes again after a restart while reading the array "surf_h(0)%end_index", where some unreasonable values occur. On all other processes, values for this array look correct.
Nested simulation 12/03/2020 After several optimizations were made in the synthetic turbulence generator and some minor bugs were fixed, the simulation was started again. The Berlin complex is under maintenance now.
Nested simulation 25/03/2020 Simulation ran until exactly 05:00 UTC, then crashed with a floating overflow in the child domain. The last restart time was 04:55 UTC; flow fields and surface data look reasonable. Restarting from the last restart step with the traceback option and print statements revealed a floating overflow in the output of the averaged 3D variable 'theta' at grid point (k,j,i) = (97,117,968), which is far away from any building. We think this is also related to a restart problem where faulty data is read for pt_av. Proceeding without averaged data output worked.
Nested simulation 01/04/2020 Simulation has reached 06:05 UTC. At the moment we are out of computing time. The IOP has started, i.e. measurements are being output. However, it turned out that the unstructured output of the virtual measurements currently consumes far too much CPU time. With a smaller number of processes in test simulations this did not become obvious; with a large number of processes, however, the probability that I/O processes interfere with each other becomes higher, so the I/O slowdown becomes more pronounced. We first need to accelerate the output before we can proceed. Moreover, with further debugging, the cause of the restart failures could most probably be narrowed down to file-system issues rather than PALM-internal problems (a trouble ticket is being sent to the computing center).
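
The duration estimate in the 03/01/2020 entry can be reproduced with a short calculation. The sketch below is only an illustration: the function name restart_plan and its parameter names are hypothetical, and the inputs are the figures quoted in that entry (12 h of real time on 6720 cores per simulated hour, 30 h of simulated time, one job per day). The core-hour total is derived from those figures and is not stated in the log.

```python
import math


def restart_plan(sim_hours, sim_hours_per_job, wall_hours_per_job, cores, jobs_per_day):
    """Estimate jobs, calendar days and core hours for a chain of restart runs.

    All parameter names are hypothetical; the model simply assumes that every
    job advances the simulation by the same amount of simulated time.
    """
    # Number of jobs (initial run plus restarts) needed to cover the period.
    jobs = math.ceil(sim_hours / sim_hours_per_job)
    # Calendar days if only a limited number of jobs can run per day.
    days = math.ceil(jobs / jobs_per_day)
    # Total core hours consumed by all jobs.
    core_hours = jobs * wall_hours_per_job * cores
    return jobs, days, core_hours


# Figures quoted in the 03/01/2020 entry.
jobs, days, core_hours = restart_plan(
    sim_hours=30,
    sim_hours_per_job=1,
    wall_hours_per_job=12,
    cores=6720,
    jobs_per_day=1,
)
print(f"jobs: {jobs}, calendar days: {days}, core hours: {core_hours}")
```

With one job per day this gives 30 jobs over 30 days, which matches the "at least one month" quoted in the entry; queuing delays and crashes would only lengthen it.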

Run 02 (VALM02): Summer 2018 Berlin, Jul 16 06:00 UTC - Jul 18 06:00

Status message: preparation

Description Date of issue Closing date Further remarks
Preparing input files for VALM02 30/01/2020
Dynamic driver 13/03/2020 Error in inifor prevents dynamic-driver creation: inifor: ERROR: PALM-4U grid extends above COSMO-DE model top. Bug fixing is in progress. DWD created a preliminary driver with which further testing can be done.
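
The condition named in the inifor error message can be illustrated with a small pre-check. This is a sketch under stated assumptions, not inifor's actual implementation: it assumes a uniform vertical grid without stretching, and the function names, parameter names and all numbers in the example calls are hypothetical.

```python
def grid_top_height(nz, dz, origin_z):
    """Height (m above sea level) of the top of a uniform vertical grid.

    nz, dz and origin_z are hypothetical names: number of vertical levels,
    vertical grid spacing (m) and height of the grid origin (m).
    """
    return origin_z + nz * dz


def fits_below_model_top(nz, dz, origin_z, model_top):
    """Return True if the grid top does not exceed the driving model's top."""
    return grid_top_height(nz, dz, origin_z) <= model_top


# Hypothetical numbers, chosen only to show both outcomes.
print(fits_below_model_top(nz=200, dz=10.0, origin_z=40.0, model_top=3000.0))
print(fits_below_model_top(nz=400, dz=10.0, origin_z=40.0, model_top=3000.0))
```

Running such a check before creating the dynamic driver would flag a grid that is too tall without having to wait for the driver generation to fail.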

Run 03 (VALM03): Winter 2017 Stuttgart, Feb 14 06:00 UTC - Feb 16 06:00

Status message: unscheduled


Run 04 (VALM04): Winter 2017 Berlin, Jul 08 04:00 UTC - Jul 09 19:00

Status message: unscheduled


Run 05 (VALM05): Hamburg, Wind tunnel

Status message: completed

Description Date of issue Closing date
Production run 18/04/2019 29/04/2019


Run 06 (VALM06): Summer 2017 Berlin, Jul 30 06:00 UTC - Aug 01 06:00

Status message: unscheduled

Last modified on Apr 2, 2020 1:03:09 PM
