= Job chains / Restart runs =

Batch systems generally limit the CPU time that is allowed to be requested by a job, e.g. to a maximum of 12 hours or 24 hours. If a simulation needs more time to run, it has to be split into several parts/jobs. The first job is called the ''initial'' run or job, the others are called ''restart'' runs/jobs. Together they form a so-called ''job chain''. Restart runs require as input the state of all flow variables as they were calculated in the final time step of the previous run. They need to be output by the previous run into a so-called ''restart-file'' which is a required input file for the (next) restart run. '''[wiki:doc/app/palmrun palmrun]''' allows you to automatically generate job chains and to handle the restart files. Of course, automatic generation does not work if you run PALM in interactive mode.

A job started by '''[wiki:doc/app/palmrun palmrun]''' will be queued by the queuing-system of the local or remote computer into a suitable job class which fulfills the requirements that are set via '''[wiki:doc/app/palmrun palmrun]''' options {{{-t}}}, {{{-m}}}, {{{-X}}}, and {{{-T}}}, which define the requested CPU-time, memory, and number of cores. Each job class permits only jobs with certain maximum requirements (e.g. the allowed CPU time or the maximum number of cores that can be used by the job). Some queuing systems automatically sort the jobs into the respective job class, others require an explicit setting of the class for which the job shall be queued. You can set the job class via '''[wiki:doc/app/palmrun palmrun]''' option {{{-q}}}. The job classes are important for the scheduling process of the computer. Jobs with small requirements usually come to execution very fast, jobs with larger requirements must wait longer (sometimes several days). Be aware that the available job classes vary a lot among different computer centers.

Before starting a run, you have to estimate how much CPU time your complete simulation will need. The required time in seconds has to be given with '''[wiki:doc/app/palmrun palmrun]''' option {{{-t}}}. Because the model uses a variable time step by default, the number of time steps to be carried out, and consequently the time required to finish the simulation, can often only be estimated roughly. So it may happen that more time is needed to finish the simulation than indicated by option {{{-t}}}. That will normally cause the job scheduler to terminate the job as soon as the available CPU time is consumed. In principle, you may avoid this problem by setting a very generously estimated value for {{{-t}}}, but the maximum allowed CPU time is often limited due to job class restrictions.

'''Restart runs''' are the method to circumvent these job class restrictions. During the time stepping, PALM is able to continuously check how much time is left for the execution of the job. If the run cannot be completed before this time expires, PALM stops and outputs (nearly) all model variables (especially the 3d-prognostic quantities) in binary format to a file (or folder) with local name [../iofiles#BINOUT BINOUT]. After the local output files have been saved, '''[wiki:doc/app/palmrun palmrun]''' automatically generates a restart run. For this purpose a new '''[wiki:doc/app/palmrun palmrun]''' call is automatically initiated, i.e. '''[wiki:doc/app/palmrun palmrun]''' recursively calls itself. The '''[wiki:doc/app/palmrun palmrun]''' options of this call correspond to those of the initial call. PALM then restarts, reads at the beginning the binary data that have been written by the previous run, and continues the simulation from this final state of the previous run.
If the simulation still cannot be finished, another restart run is generated, etc., until the time to be simulated is reached (this is the one set via parameter {{{end_time}}}). This way a whole set of restart runs may be generated, a so-called job chain.

Restart runs require certain entries in the file-connection file (see [source:palm/trunk/SCRIPTS/.palm.iofiles .palm.iofiles], and its [wiki:doc/app/palm_iofiles description]) and in the parameter file, which are described below. The following entries are important and are already contained in the default file-connection file:
{{{
PARIN      in:tr      d3#       $base_data/$run_identifier/INPUT            _p3d*
PARIN      in:tr      d3r       $base_data/$run_identifier/INPUT            _p3dr*
BININ      in:lnpe    d3r       $restart_data_path/$run_identifier/RESTART  _d3d*
#
BINOUT*    out:lnpe   restart   $restart_data_path/$run_identifier/RESTART  _d3d
}}}
The '''[wiki:doc/app/palmrun palmrun]''' call for the initial run of the job chain reads e.g.:
{{{
palmrun -c default -r abcde -t 900 -X 96 -T 48 -a "d3# restart" -b -q queue_name
}}}
Giving the activation string {{{restart}}} as argument of option {{{-a}}} is essential. Only in that case does the model write binary data for a restart run to the local file [../iofiles#BINOUT BINOUT] (when running on more than one core, BINOUT is a folder). The local output file is then saved to a permanent file as defined in the file connection statement for BINOUT. The last line of the file connection example above shows that the file is only saved if the activation string {{{restart}}} has been set. Likewise, only if {{{restart}}} is specified as activation string is PALM instructed to compute the remaining CPU time after each time step and to stop shortly before this time expires if the run cannot be completed.
More precisely, the stop takes place when the difference between the available job time (determined by the '''[wiki:doc/app/palmrun palmrun]''' option {{{-t}}}) and the time used by the job so far becomes smaller than the time given by the runtime parameter [../runtime_parameters#termination_time_needed termination_time_needed]. The runtime parameter '''termination_time_needed''' tells PALM how much time is required for copying the binary data for restart runs, as well as for other pre- or post-processing steps that are done within the job. Thus, as soon as the remaining job time is less than '''termination_time_needed''', PALM interrupts the time stepping and outputs the restart data to local file/folder [../iofiles#BINOUT BINOUT]. The [../initialization_parameters initialization parameters] are also added to that file. In a last step, PALM creates a flag file with local name {{{CONTINUE_RUN}}}. The presence of this file signals to '''[wiki:doc/app/palmrun palmrun]''' that a restart run needs to be generated, and '''[wiki:doc/app/palmrun palmrun]''' then creates and starts a respective job.

Within PALM, the initial phase of a restart run requires different actions than that of an initial run. In case of a restart, PALM first needs to read the data written by the preceding run and also reads the initialization parameters from the same file. Therefore, these parameters do not need to be provided in the parameter file (local name [../iofiles#PARIN PARIN]). If they are provided anyway and their values differ from the respective values of the initial run, these settings are ignored. There is exactly one exception to this rule: the initialization parameter [../initialization_parameters#initializing_actions initializing_actions] determines whether the job is a restart run or an initial run. If '''initializing_actions''' = '' 'read_restart_data','' then it is a restart run, otherwise an initial run.
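
The stop criterion based on '''termination_time_needed''' described above can be illustrated with a small Python sketch. It is not PALM code: the function names {{{must_stop}}} and {{{simulate_job}}} and the per-step CPU costs are hypothetical, and only the comparison itself follows the description given here.
{{{#!python
def must_stop(job_time_limit, time_used, termination_time_needed):
    """True if the remaining job time is smaller than the reserved time."""
    remaining = job_time_limit - time_used
    return remaining < termination_time_needed


def simulate_job(job_time_limit, termination_time_needed, step_costs):
    """Toy time-stepping loop (hypothetical, for illustration only).

    step_costs holds the CPU seconds consumed by each time step.
    Returns (steps_done, needs_restart).
    """
    time_used = 0.0
    for steps_done, cost in enumerate(step_costs, start=1):
        time_used += cost
        steps_left = len(step_costs) - steps_done
        # The check is made after each time step, as described in the text.
        if steps_left > 0 and must_stop(job_time_limit, time_used,
                                        termination_time_needed):
            # At this point PALM would write BINOUT and create CONTINUE_RUN.
            return steps_done, True
    return len(step_costs), False
}}}
With {{{-t 900}}} and an assumed '''termination_time_needed''' of 300 s, a job whose steps each cost 100 s would stop after the seventh step, because only 200 s remain at that point.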
The previous explanation makes it clear that the model needs two different parameter files (local name PARIN) for job chains. One is required for the initial run and contains all initialization parameters, the other one is needed for restart runs. The latter only needs to contain the initialization parameter '''initializing_actions''' (any other initialization parameters may appear in this file, but they will be ignored), which needs to be set to '' 'read_restart_data'.'' So you need to provide two different parameter files if you want to carry out restart runs.

Since PALM always expects the parameters to be in the local file PARIN, regardless of whether it is an initial or a restart run, two different file connection statements must be given for that file in the file-connection file. One is active for the initial run only, the other one only for restart runs. The '''[wiki:doc/app/palmrun palmrun]''' call for the initial run shown above activates the first of the two specified file connection statements for PARIN, because the activation string {{{d3#}}} given with the option {{{-a}}} coincides with the string in the third column of the file connection statement. Consequently, the next statement
{{{
PARIN      in:tr      d3r       $base_data/$run_identifier/INPUT            _p3dr*
}}}
must be active for the restart runs. Given that this statement only becomes active with option {{{-a "d3r"}}}, and that the '''[wiki:doc/app/palmrun palmrun]''' call for this restart run is generated automatically (not by you), '''[wiki:doc/app/palmrun palmrun]''' needs to replace {{{"d3#"}}} of the initial run with {{{"d3r"}}} for the restart run. In fact, for restart runs all {{{"#"}}} characters within the arguments given for option {{{-a}}} are replaced by {{{"r"}}}.
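
The selection of file connection statements by activation strings, and the replacement of {{{"#"}}} by {{{"r"}}} for restart runs, can be sketched as follows. This is a simplified Python illustration and not the actual palmrun implementation: the helper names and the tuple layout are assumptions, and each statement is assumed to carry exactly one activation string in its third column.
{{{#!python
# File connection statements from the example above, as
# (local name, attributes, activation string, path, suffix).
FILE_CONNECTIONS = [
    ("PARIN", "in:tr", "d3#", "$base_data/$run_identifier/INPUT", "_p3d*"),
    ("PARIN", "in:tr", "d3r", "$base_data/$run_identifier/INPUT", "_p3dr*"),
    ("BININ", "in:lnpe", "d3r",
     "$restart_data_path/$run_identifier/RESTART", "_d3d*"),
    ("BINOUT*", "out:lnpe", "restart",
     "$restart_data_path/$run_identifier/RESTART", "_d3d"),
]


def restart_activation_strings(activation_strings):
    """Turn the -a argument of the initial run into that of a restart run."""
    return activation_strings.replace("#", "r")


def active_connections(connections, activation_strings):
    """Return the statements whose third column matches an activation string."""
    active = set(activation_strings.split())
    return [conn for conn in connections if conn[2] in active]
}}}
For the initial run ({{{"d3# restart"}}}) this selects the {{{_p3d*}}} statement for PARIN and the BINOUT statement. For the generated restart run ({{{"d3r restart"}}}) it selects the {{{_p3dr*}}} statement, BININ, and BINOUT.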
This way, following the above '''[wiki:doc/app/palmrun palmrun]''' example, the initial run will use the permanent file
{{{
~/palm/current_version/JOBS/abcde/INPUT/abcde_p3d
}}}
while restart runs will use
{{{
~/palm/current_version/JOBS/abcde/INPUT/abcde_p3dr
}}}
The binary restart data (see [../iofiles#BININ BININ]) is provided as input file only in case of restart runs, because {{{"d3r"}}} appears as activation string in the respective file connection statement (see the above example). The permanent names of this input file (local name BININ) and the corresponding output file (local name [../iofiles#BINOUT BINOUT]) are identical and read
{{{
$restart_data_path/abcde/RESTART/abcde_d3d
}}}
However, '''[wiki:doc/app/palmrun palmrun]''' does not overwrite the restart data from the previous job with the new data that is output at the end of the current run. Instead, the local output file BINOUT is copied to a permanent file with a cycle number suffix, e.g.
{{{
$restart_data_path/abcde/RESTART/abcde_d3d.001
}}}
If a file with that cycle number already exists, the cycle number is incremented and {{{abcde_d3d.002}}} will be created. For the restart data input file, the highest existing cycle of the respective permanent file is used. In the example given above, the initial run creates the permanent file {{{.../abcde_d3d.000}}}, the first restart run uses this file and creates {{{.../abcde_d3d.001}}}, the second restart run creates {{{.../abcde_d3d.002}}}, etc.

You can still access all files created by the runs after the job chain has finished. For example, this allows you to re-run the model starting from different positions of the job chain by manually calling '''[wiki:doc/app/palmrun palmrun]''' with activation string {{{d3r}}}. In that case you also need to remove all file cycles beyond the one you want to start from, because the highest existing cycle is always used as input.
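
The cycle number handling described above can be sketched in Python as follows. The function names are hypothetical and this is not how palmrun is implemented. The sketch only assumes what the example states: cycles are three-digit suffixes, the first file gets cycle {{{000}}}, input uses the highest existing cycle, and output uses the next free one.
{{{#!python
import re


def existing_cycles(existing_names, base_name):
    """Return the sorted cycle numbers found for base_name (e.g. abcde_d3d)."""
    pattern = re.compile(re.escape(base_name) + r"\.(\d+)")
    found = []
    for name in existing_names:
        match = pattern.fullmatch(name)
        if match:
            found.append(int(match.group(1)))
    return sorted(found)


def restart_input(existing_names, base_name):
    """File a restart run would read: the highest existing cycle, or None."""
    found = existing_cycles(existing_names, base_name)
    return f"{base_name}.{found[-1]:03d}" if found else None


def next_output(existing_names, base_name):
    """File the current run would create: one cycle above the highest."""
    found = existing_cycles(existing_names, base_name)
    next_cycle = found[-1] + 1 if found else 0
    return f"{base_name}.{next_cycle:03d}"
}}}
The sketch also shows why higher cycles must be removed before manually restarting from an earlier position: with {{{abcde_d3d.000}}} to {{{abcde_d3d.002}}} present, the input would always be {{{abcde_d3d.002}}}.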

= Handling of large (restart) files =

Copying very large files like restart data files to and from '''[wiki:doc/app/palmrun palmrun's]''' temporary working directory may take a long time. During that time the cores requested for the job are idle and may consume a significant amount of the job time without doing anything. The copy time can be saved by using a file link instead of copying the data.
{{{
cp large_local_file large_permanent_file                            # may take long time
ln existing_large_local_TARGET_file LINK_NAME_to_large_local_file   # is done immediately, i.e. requires almost no time
}}}
You can tell '''[wiki:doc/app/palmrun palmrun]''' to use {{{ln}}} instead of {{{cp}}} by setting the file attribute {{{ln}}} in the respective file connection statement, e.g.:
{{{
BININ      in:lnpe    d3r       $restart_data_path/$run_identifier/RESTART  _d3d*
BINOUT*    out:lnpe   restart   $restart_data_path/$run_identifier/RESTART  _d3d
}}}
However, generating a link requires that both the target and the linked file are located on the same physical file system. Otherwise, a normal copy will be done instead and the advantage of using the {{{ln}}} command is lost. Most computing centers provide a file system for fast I/O and this should be used as '''[wiki:doc/app/palmrun palmrun's]''' temporary working directory, which you can set via environment variable {{{restart_data_path}}} in the configuration file. Since the LINK_NAME should be on the same file system, the user should provide a directory on that file system for storing the large files.

= Short instructions for carrying out job chains / restart runs =

 1. In your {{{INPUT}}} folder, create an additional parameter file with suffix {{{_p3dr}}} as a copy of an existing file with suffix {{{_p3d}}}, e.g.
    {{{
    cd ..../JOBS/abcde/INPUT
    cp abcde_p3d abcde_p3dr
    }}}
 2. Edit the new file with suffix {{{_p3dr}}} and set the initialization parameter to {{{initializing_actions = 'read_restart_data'}}}.
 3. Start the initial run of the job via command
    {{{
    palmrun .... -a "d3# restart" -b
    }}}
    If restarts are required, '''[wiki:doc/app/palmrun palmrun]''' will generate and submit the restart jobs automatically.
 4. For manually generating a restart job, replace the activation string {{{d3#}}} with {{{d3r}}}:
    {{{
    palmrun .... -a "d3r restart" -b
    }}}
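
Steps 1 and 2 of the short instructions can also be scripted. The following Python sketch is only an illustration and not part of PALM: the function name {{{make_restart_parameters}}} is hypothetical, and it assumes that '''initializing_actions''' is already set in the {{{_p3d}}} file and that its value is enclosed in single quotes.
{{{#!python
import re


def make_restart_parameters(p3d_text):
    """Return the _p3dr content derived from the text of a _p3d file.

    Only the value of initializing_actions is changed. All other
    initialization parameters are kept (they are ignored in restart runs).
    """
    pattern = re.compile(r"(initializing_actions\s*=\s*)'[^']*'")
    new_text, count = pattern.subn(r"\g<1>'read_restart_data'", p3d_text)
    if count == 0:
        raise ValueError("initializing_actions not found in parameter file")
    return new_text


# Illustrative parameter file content (values are made up).
EXAMPLE_P3D = """&initialization_parameters
    nx = 39, ny = 39, nz = 40,
    initializing_actions = 'set_constant_profiles',
/
"""
}}}
To apply it, read {{{abcde_p3d}}}, pass its content to the function, and write the result to {{{abcde_p3dr}}} in the same {{{INPUT}}} folder.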