Changes between Version 18 and Version 19 of doc/app/runs


Timestamp:
Apr 13, 2021 2:13:06 PM
Author:
raasch
Comment:

--

  • doc/app/runs

    v18 v19  
    11
    2 = Simulation chains / Restart runs =
     2= Job chains / Restart runs =
    33
    4 Batch systems generally limit the CPU time that is allowed to be requested by a job, e.g. to a maximum of 12 hours or 24 hours. If a simulation needs more time to run, it has to be split into several parts/jobs. The first job is called the ''initial'' job, the others ''restart'' jobs. Together they form a so-called ''job chain''. Restart jobs require as input the values of all flow variables as calculated in the final time step of the previous job. They need to be output by the previous job in a so-called ''restart-file'' which is a required input file for the restart job.
     4Batch systems generally limit the CPU time that a job may request, e.g. to a maximum of 12 or 24 hours. If a simulation needs more time to run, it has to be split into several parts/jobs. The first job is called the ''initial'' run or job, the others are called ''restart'' runs/jobs. Together they form a so-called ''job chain''. A restart run requires as input the state of all flow variables as calculated in the final time step of the previous run. The previous run has to write this state to a so-called ''restart file'', which is a required input file for the next restart run.
    55
    6 {{{palmrun}}} allows you to automatically generate job chains and to handle the restart files. Of course, automatic generation does not work if you run PALM in interactive mode. The following chapter describes
     6{{{palmrun}}} allows you to automatically generate job chains and to handle the restart files. Of course, automatic generation does not work if you run PALM in interactive mode.
    77
    8 A job started by '''[../../app/palmrun palmrun]''' will - according to its requested computing time, its memory size requirement and the number of necessary processing elements (on parallel computers) - be queued by the queuing-system of the local or remote computer into a suitable job class which fulfills these requirements. Each job class permits only jobs with certain maximum requirements (e.g. the allowed CPU time or the maximum number of cores that can be used by the job). The job classes are important for the scheduling process of the computer. Jobs with small requirements usually come to execution very fast, jobs with higher requirements must wait longer (sometimes several days).\\\\
    9 Before the start of a model run the user must estimate how much CPU time the model will need for the simulation. The necessary time in seconds has to be indicated with the '''palmrun''' option {{{-t}}} and may have an influence on the job class into which the job is queued. Due to the fact that the model usually uses a variable time step and thus the number of time steps to be executed and consequently the time needed by the model is not known at the beginning, this can be measured only very roughly in many cases. So it may happen that the model needs more time than indicated by the option {{{-t}}}, which normally causes the scheduler to terminate the job as soon as the available CPU time is consumed. In principle one could solve this problem by setting a very generously estimated value for {{{-t}}}, but this will possibly lead to the disadvantage that the queued job has to wait longer for execution.\\\\
     8A job started by '''[../../app/palmrun palmrun]''' is queued by the queuing system of the local or remote computer into a suitable job class that fulfills the requirements set via the {{{palmrun}}} options {{{-t}}}, {{{-m}}}, {{{-X}}}, and {{{-T}}}, which define the requested CPU time, memory, and number of cores. Each job class permits only jobs with certain maximum requirements (e.g. the allowed CPU time or the maximum number of cores that can be used by the job). Some queuing systems automatically sort jobs into the respective job class, others require you to set the class explicitly. You can set the job class via the {{{palmrun}}} option {{{-q}}}. The job classes are important for the scheduling process of the computer. Jobs with small requirements usually start quickly, while jobs with larger requirements must wait longer (sometimes several days). Be aware that the available job classes vary considerably among computer centers.
     9
     10Before starting a run, you have to estimate how much CPU time your complete simulation will need. The required time in seconds has to be given with the '''palmrun''' option {{{-t}}}. Because the model uses a variable time step by default, the number of time steps to be carried out, and consequently the time required to finish the simulation, can often be estimated only roughly. So it may happen that more time is needed to finish the simulation than indicated by option {{{-t}}}. In that case the job scheduler will normally terminate the job as soon as the available CPU time is consumed. In principle, you may avoid this problem by setting a very generously estimated value for {{{-t}}}, but the maximum allowed CPU time is often limited by job class restrictions.
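The arithmetic behind splitting a simulation can be sketched as follows. This is a minimal illustration, not part of PALM or palmrun: the function name `jobs_in_chain` and the example numbers are assumptions, and the sketch ignores the time each job spends reading and writing restart files.

```python
import math


def jobs_in_chain(total_cpu_seconds: float, limit_seconds: float) -> int:
    """Return the minimum number of jobs (initial run plus restart runs)
    needed when each job may use at most limit_seconds of CPU time.

    Hypothetical helper for illustration; overhead for restart I/O is ignored.
    """
    if total_cpu_seconds <= 0 or limit_seconds <= 0:
        raise ValueError("both times must be positive")
    return math.ceil(total_cpu_seconds / limit_seconds)


# Assumed example: a simulation estimated at 30 hours of CPU time,
# run in a job class that allows at most 12 hours per job.
n_jobs = jobs_in_chain(30 * 3600, 12 * 3600)
print(n_jobs)  # 3 jobs: one initial run and two restart runs
```

Since the estimate for the total CPU time is rough, the actual chain may turn out longer than this lower bound.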
     11
    1012To avoid this problem '''palmrun''' offers the possibility of so-called '''restart runs'''. During the model run, PALM continuously checks how much time is left for the execution of the job. If the simulation has not been completed shortly before this time expires, the model stops and writes the values of (nearly) all model variables (especially the 3d-prognostic quantities) in binary form to a file (local name [../iofiles#BINOUT BINOUT]). After copying the output files requested by the user, '''palmrun''' automatically starts a restart run. For this purpose, a new '''palmrun''' call is issued automatically on the user's local computer; '''palmrun''' thus calls itself. The options of this call largely correspond to those selected in the initial call of '''palmrun'''. The model restarts, reads the previously written binary data at the beginning, and continues the run from that state. If the CPU time of this job is again not sufficient to finish the simulation, another restart run is started at the end of the job, and so on, until the time to be simulated is reached. In this way a set of restart runs can develop, a so-called job chain. The first run of this chain (model start at t=0) is called the '''initial run'''.\\\\
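The control flow of such a job chain can be sketched with a toy model. This is only an illustration under stated assumptions, not PALM's or palmrun's actual implementation: the names `run_job` and `run_chain`, the fixed time step, the constant CPU cost per step, and the `reserve` parameter (CPU time kept back for writing the restart data) are all hypothetical.

```python
def run_job(sim_time, end_time, cpu_limit, cpu_per_step, dt, reserve):
    """Advance the toy model within one job.

    Returns (new_sim_time, finished). The job stops early as soon as
    another step would leave less than `reserve` CPU seconds, which
    stands in for the time needed to write the restart data.
    """
    cpu_used = 0.0
    while sim_time < end_time:
        if cpu_limit - cpu_used - cpu_per_step < reserve:
            return sim_time, False  # not finished: a restart run is needed
        sim_time += dt
        cpu_used += cpu_per_step
    return sim_time, True


def run_chain(end_time, cpu_limit, cpu_per_step, dt, reserve):
    """Run the initial job and as many restart jobs as needed.

    Returns the total number of jobs in the chain.
    """
    sim_time = 0.0  # the initial run starts at t=0
    jobs = 0
    finished = False
    while not finished:
        start = sim_time
        # Each restart job continues from the state the previous job left.
        sim_time, finished = run_job(
            sim_time, end_time, cpu_limit, cpu_per_step, dt, reserve
        )
        jobs += 1
        if not finished and sim_time == start:
            raise RuntimeError("CPU limit too small to make any progress")
    return jobs


# Assumed numbers: 100 s simulated time, 1 s steps costing 1 CPU second
# each, 40 CPU seconds per job, 5 CPU seconds reserved for restart output.
print(run_chain(100.0, 40.0, 1.0, 1.0, 5.0))  # 3
```

With these numbers each job can take at most 35 steps, so the chain consists of the initial run and two restart runs.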
    1113Working with restart runs and their generation through '''palmrun''' requires certain entries in the palmrun-configuration file and in the parameter file, which are described and explained in the following. The configuration file should contain the following entries: