Changes between Version 1 and Version 2 of doc/app/runs
- Timestamp: Sep 16, 2010 9:14:57 AM
doc/app/runs
A job started by '''mrun''' is queued by the queuing system of the remote computer into a job class that matches its requested computing time, its memory requirement and (on parallel computers) the number of processing elements it needs. Each job class admits only jobs up to certain maximum requirements (e.g. the job class {{{cdev}}} on the IBM Regatta "hanni" of the HLRN admits only jobs that request no more than 7200 seconds of computing time and no more than 32 processing elements). The job classes are important for the scheduling process of the computer. Jobs with small requirements usually start very quickly, while jobs with higher requirements must wait longer (sometimes several days).\\\\
Before starting a model run, the user must estimate how much CPU time the model will need for the simulation. This time, in seconds, has to be given with the '''mrun option''' [[`-t`]] and influences the job class into which the job is queued. Because the model usually uses a variable time step, the number of time steps to be executed, and consequently the time needed by the model, is not known in advance, so in many cases it can only be estimated very roughly. The model may therefore need more time than given with the option {{{-t}}}, which normally causes the job to abort as soon as the available CPU time is consumed. In principle one could solve this problem by setting a very generous value for {{{-t}}}, but the queued job may then have to wait longer for execution.\\\\
To avoid this problem '''mrun''' offers so-called '''restart runs'''. During the model run PALM continuously checks how much time is left for the execution of the job. If the run has not finished shortly before this time expires, the model stops and writes the values of (nearly) all model variables in binary form to a file (local name [../iofiles#BINOUT BINOUT]). After copying the output files required by the user, '''mrun''' automatically starts a restart run. For this purpose a new '''mrun''' call is issued automatically on the user's local computer; '''mrun''' thus calls itself. The options of this call largely correspond to those the user selected in the initial call of '''mrun'''. The model restarts, this time reading in the previously written binary data at the beginning and continuing the run with them. If the CPU time of this job is again not sufficient to complete the run, another restart run is started at the end of the job, and so on, until the time to be simulated by the model is reached. Thus a set of restart runs can develop, a so-called job chain.
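How such a job chain develops can be illustrated with a small, self-contained Python simulation. It is only a sketch under simplifying assumptions that are not taken from PALM or '''mrun''': every time step is assumed to cost the same CPU time (the real model uses a variable time step), and the names `simulate_job_chain`, `cpu_per_step` and `reserve` are hypothetical.

```python
def simulate_job_chain(total_steps, cpu_per_step, job_time_limit, reserve):
    """Return the number of time steps executed by each job of a chain.

    total_steps    -- time steps needed to reach the simulated end time
    cpu_per_step   -- CPU seconds per time step (assumed constant here)
    job_time_limit -- CPU seconds available to each job
    reserve        -- CPU seconds kept back for writing restart data
    All names and the constant step cost are illustrative assumptions.
    """
    if job_time_limit - reserve < cpu_per_step:
        # Even a single step would already eat into the reserve.
        raise ValueError("job time limit too small for one time step")

    steps_done = 0
    steps_per_job = []
    while steps_done < total_steps:      # one iteration per job
        used = 0.0
        steps_this_job = 0
        while steps_done < total_steps:  # time stepping inside one job
            used += cpu_per_step
            steps_done += 1
            steps_this_job += 1
            if job_time_limit - used < reserve:
                break                    # stop; the next job continues
        steps_per_job.append(steps_this_job)
    return steps_per_job


# First entry: initial run; further entries: restart runs.
print(simulate_job_chain(1000, 2.0, 900, 60))  # [421, 421, 158]
```

In this example the chain consists of three jobs: the first entry corresponds to the first job of the chain and the other two to restart runs.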
The first run of this chain (model start at t=0) is called the '''initial run'''.\\\\
Working with restart runs and their generation through '''mrun''' requires certain entries in the mrun configuration file and in the parameter file, which are described and explained below. The configuration file must contain the following entries (example for the IBM Regatta of the HLRN):
…
mrun -h ibmh -d abcde -t 900 -r "d3# restart"
}}}
The specification of the environment variable {{{write_binary}}}, which must be assigned the value {{{true}}}, is essential. Only then does the model write binary-coded data for a possible restart run to the local file [../iofiles#BINOUT BINOUT] at the end of the run. This output file must then be stored as a permanent file by means of an appropriate file connection statement (last line of the example above). As you can see, both instructions (variable declaration and connection statement) are only carried out by '''mrun''' if the character string {{{restart}}} is given for the option {{{-r}}} in the '''mrun''' call. Thus the example above can also be used if no restart runs are intended; in that case the character string {{{restart}}} can simply be omitted from the option {{{-r}}}.\\\\
Only if {{{write_binary=true}}} is specified is the model instructed to compute the remaining CPU time after each time step and to stop shortly before this time expires if the run has not been completed by then. More precisely, the stop takes place when the difference between the available job time (determined by the '''mrun''' option {{{-t}}}) and the time used so far by the job becomes smaller than the time given by the model variable [../d3par#termination_time_needed termination_time_needed].
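The stop criterion just described can be sketched in a few lines of Python. This is an illustration only, not PALM's actual code: the function name `must_stop_for_restart` and its argument names are hypothetical, and only the comparison against '''termination_time_needed''' is taken from the description above.

```python
def must_stop_for_restart(job_time_limit, time_used, termination_time_needed):
    """Return True once the remaining job time drops below the reserve.

    job_time_limit          -- seconds requested with the mrun option -t
    time_used               -- seconds consumed by the job so far
    termination_time_needed -- seconds reserved for writing restart data etc.
    (Function and argument names are hypothetical.)
    """
    remaining = job_time_limit - time_used
    return remaining < termination_time_needed


# A job started with -t 900 and a reserve of 60 seconds:
print(must_stop_for_restart(900, 800, 60))  # False: 100 s remain
print(must_stop_for_restart(900, 850, 60))  # True: only 50 s remain
```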
With the variable '''termination_time_needed''' the user specifies how much time is needed for the binary output of the restart data as well as for the subsequent data archiving, transfer of result data, etc. (as long as this is part of the job). Thus, as soon as the remaining job time is less than '''termination_time_needed''', the model stops the time stepping and writes the data for a restart run to the local binary file [../iofiles#BINOUT BINOUT]. The so-called initialization parameters are also written to this file (see [[chapter 4.0]]). In a last step the model creates another file with the local name [[CONTINUE_RUN]]. The presence of this file signals to '''mrun''' that a restart run must be started and leads to the start of an appropriate job.\\\\
The initial phase of a restart run requires different actions than the initial phase of an initial run. In this case the model must first read in the binary data written by the preceding run. It also reads the initialization parameters from this file, so these do not need to be given in the parameter file (local name [../iofiles#PARIN PARIN]). If they are given nevertheless and their values deviate from those of the initial run, the new values are ignored. There is exactly one exception to this rule: the initialization parameter [../inipar#initializing_actions initializing_actions] determines whether the job is a restart run or an initial run. If '''initializing_actions''' = ''read_restart_data'', it is a restart run, otherwise an initial run. It follows that the model needs two different parameter files (local name PARIN) for job chains. One is needed for the initial run and contains all initialization parameters set by the user; the other one is needed for restart runs.
The latter only contains the initialization parameter '''initializing_actions''', which must have the value ''read_restart_data'' (initialization parameters with values different from the initial run may also appear in this file, but they will be ignored). Therefore the user must create two different parameter files in order to operate job chains. Since the model always expects the parameter file on the local file PARIN, two different file connection statements must be given for this file in the configuration file. One may be active only for the initial run, the other only for restart runs. The '''mrun''' call for the initial run shown above activates the first of the two specified connection statements, because the character string {{{d3#}}} given with the option {{{-r}}} matches the character string in the third column of the connection statement. Obviously the next statement must be active
{{{