= Monitoring batch jobs: PALM watchdog

A batch job monitoring tool (watchdog) called '''palm_wd''' is available as of revision r1611. It is based on Python 2.7 and Qt4. To use the watchdog, Qt and Python version 2.7 or higher must be installed on the local host as well as on each remote host to be monitored. Please note that the watchdog only shows jobs that it has captured before, i.e. very short jobs and job protocol transfer jobs might not be shown.\\\\

[[Image(palm_wd.png, 20%, border=0, center, nolink)]]

== Configuration of the watchdog

The watchdog consists of two scripts: palm_wd (the watchdog client, to be run on the local host) and palm_wdd (the server, to be located on each remote host to be monitored). Before running the watchdog, both client and server require system-specific configuration, which has to be provided in the configuration files {{{.wd.config}}} and {{{.wdd.config}}}:

 1. Make a copy of {{{trunk/SCRIPTS/palm_wd_files/.wd.config.default}}}, rename it to {{{.wd.config}}}, and move it to your local palm directory (e.g. {{{~/palm/current_version/}}}). Then edit the file (shown here for HLRN-III):
    {{{
    [Hannover]
    hostname=hlogin.hlrn.de
    username=

    [Berlin]
    hostname=blogin.hlrn.de
    username=

    [Settings]
    update_frequency=10
    }}}
    For each remote host to be monitored, create a separate section with a description of your choice (here "Hannover" and "Berlin"). {{{hostname}}} is the IP address or name of the remote host (a passwordless login via ssh key is assumed to be available), and {{{username}}} is the user name on the remote host. The automatic update frequency ({{{update_frequency}}}) must be given in minutes.
 2. palm_wdd requires system-specific configuration. Make a copy of {{{trunk/SCRIPTS/palm_wd_files/.wdd.config.default}}} for each host to be monitored, rename it to {{{.wdd.config}}}, and edit the files appropriately.
    For HLRN-III, {{{.wdd.config}}} reads:
    {{{
    [Settings]
    readqueue="showq | egrep"
    tmpdir="/gfs1/tmp/"
    canceljob="canceljob"
    checkjob="checkjob"
    realname_grep="AName"
    starttime="showstart"
    starttime_grep="start in"
    }}}
    As the queuing system may vary between computing systems, it is not possible to provide detailed instructions on how to set this configuration. If you are struggling with the configuration, please feel free to create a [/newticket new ticket].
 3. Now copy palm_wdd and the configuration files into the $HOME directory of each remote host. For HLRN-III (here the local configuration file is named {{{.wdd.config.hlrnIII}}} and is renamed to {{{.wdd.config}}} during the copy):
    {{{
    scp palm_wdd nikname@hlogin.hlrn.de:
    scp .wdd.config.hlrnIII nikname@hlogin.hlrn.de:.wdd.config
    scp palm_wdd nikname@blogin.hlrn.de:
    scp .wdd.config.hlrnIII nikname@blogin.hlrn.de:.wdd.config
    }}}

== Running the watchdog

The watchdog can be started either by typing
{{{
palm_wd
}}}
into the shell, or via the palmrungui (Start -> Start watchdog). A window (see screenshot) should appear on the screen.

[[Image(palm_wd_action.png, 70%, center, margin-right=2, margin-bottom=5, margin-top=2, border=0, nolink)]]

== Documentation

The watchdog is to a large extent self-explanatory. The following features, however, might require a short description.

=== Progress bar

The progress bar displays the progress of each job that is currently running (status "Running"). The progress is calculated from the current simulation time and the end_time/restart_time of the job. PALM writes this information to the file PROGRESS, which resides in the temporary directory of the job.

=== Right mouse click: Show details

Information on the job from the queuing system is displayed.

=== Right mouse click: Cancel job

The job will be canceled without any confirmation prompt.

=== Right mouse click: Force stop

Unlike "Cancel job", this action makes the job initiate a proper termination of the run, i.e. data will be processed and restart binary data will possibly be written.
=== Right mouse click: Force restart

The job will initiate a proper termination of the run, followed by an automatic restart of the job chain. This only works properly if the job has been configured for restart runs; otherwise this action is identical to "Force stop".

=== Status: "Completed."

A job with status "Completed." has not necessarily finished successfully. All jobs that have been removed from the queuing system (by cancellation, finishing, or crashing) are labeled "Completed.". The progress bar for completed jobs shows the last value obtained by the watchdog.

=== Tooltips on "Job" and "Remaining time"

When hovering over an item in the "Job" column, the queuing name will be shown. When hovering over an item in the "Remaining time" column, the start time estimated by the queuing system will be shown.
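== Example: checking a client configuration

The client configuration described under "Configuration of the watchdog" is a plain INI-style file, so it can be checked with a few lines of Python before starting the watchdog. The following is only a minimal sketch and not the way palm_wd itself reads the file: it assumes Python 3 and its standard {{{configparser}}} module, the function name {{{read_wd_config}}} is hypothetical, and the user name {{{nikname}}} is a placeholder taken from the scp example.

```python
import configparser

# Sample content mirroring the .wd.config example in this page.
# "nikname" is a placeholder user name, not a real account.
SAMPLE = """\
[Hannover]
hostname=hlogin.hlrn.de
username=nikname

[Berlin]
hostname=blogin.hlrn.de
username=nikname

[Settings]
update_frequency=10
"""


def read_wd_config(text):
    """Return (hosts, update_frequency_in_minutes) from .wd.config-style text.

    hosts maps each section description to a (hostname, username) tuple.
    The [Settings] section is treated separately and is not a host.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)

    hosts = {}
    for section in parser.sections():
        if section == "Settings":
            continue
        hosts[section] = (
            parser.get(section, "hostname"),
            parser.get(section, "username"),
        )

    minutes = parser.getint("Settings", "update_frequency")
    return hosts, minutes


hosts, minutes = read_wd_config(SAMPLE)
for description, (hostname, username) in hosts.items():
    # Each entry corresponds to one remote host monitored via ssh.
    print(f"{description}: {username}@{hostname}")
print(f"update every {minutes} min")
```

To check a real file instead of the sample text, read it first, e.g. {{{read_wd_config(open(".wd.config").read())}}}. A missing {{{hostname}}}, {{{username}}}, or {{{update_frequency}}} entry raises a {{{configparser}}} error, which makes typos in the file easy to spot.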