= Monitoring batch jobs: PALM watchdog

From revision r1611 on, a batch job monitoring tool (watchdog) called '''palm_wd''' is available. It is based on Python 2.7 and Qt4.\\\\

[[Image(palm_logo_wd.ico, border=0, center, nolink)]]

== Configuration of the watchdog

The watchdog consists of two scripts: palm_wd (the watchdog client, to be run on the local host) and palm_wdd (the server, to be placed on each remote host that is to be monitored). Before running the watchdog, both client and server require system-specific configuration:

 1. In palm_wd, create one item for each remote host in each of the three lists hostname, username and description, e.g.
    {{{
    hostname    = ["hlogin.hlrn.de", "blogin.hlrn.de"]
    username    = ["nikname"       , "nikname"       ]
    description = ["Hannover"      , "Berlin"        ]
    }}}
    Here, hostname is the host name or IP address of the remote host (assuming that a passwordless login via ssh key is available), username is the user name on the remote host, and description is an arbitrary string to identify the host. Items at the same position in the three lists belong to the same host. Additionally, the update_frequency can be adjusted:
    {{{
    update_frequency = 600*1000
    }}}
 2. In palm_wdd, system-specific settings must be made. The default is configured for the Cray-XC40 at HLRN-III and reads
    {{{
    cmd_readqueue      = "showq | egrep "
    cmd_tmpdir         = "/gfs1/tmp/"
    cmd_canceljob      = "canceljob"
    cmd_checkjob       = "checkjob"
    cmd_realname_grep  = "AName"
    cmd_starttime      = "showstart"
    cmd_starttime_grep = "start in"
    }}}
    For other hosts, the parameters above must be adjusted appropriately.
 3. Copy palm_wdd into the $HOME directory of each remote host, e.g. for HLRN-III:
    {{{
    scp palm_wdd nikname@hlogin.hlrn.de:
    scp palm_wdd nikname@blogin.hlrn.de:
    }}}
 4. Create the database files for the watchdog in your working directory:
    {{{
    cp $PALM_BIN/palm_wd_files/.wd.olddata $HOME/palm/current_version
    cp $PALM_BIN/palm_wd_files/.wd.newdata $HOME/palm/current_version
    }}}

== Running the watchdog

The watchdog can be started either by typing
{{{
palm_wd
}}}
in the shell, or via mrungui (Start -> Start watchdog). A window (see screenshot) should appear on the screen.

[[Image(palm_wd_action.png, 70%, center, margin-right=2, margin-bottom=5, margin-top=2, border=0, nolink)]]

== Documentation

The watchdog is to a large extent self-explanatory. The following features, however, require a short description.

=== Progress bar

The progress bar displays the progress of each job that is currently running (status "Running"). The progress is calculated from the current simulation time and the end_time/restart_time of the job. PALM writes this information to the file PROGRESS, which resides in the temporary directory of the job.

=== Right mouse click: Show details

Information on the job from the queuing system is displayed.

=== Right mouse click: Cancel job

The job will be canceled without any confirmation prompt.

=== Right mouse click: Force stop

Unlike "Cancel job", the job will initiate a proper termination of the run, i.e. data will be processed and, if applicable, restart binary data will be written.

=== Right mouse click: Force restart

The job will initiate a proper termination of the run, followed by an automatic restart of the job chain. This only works properly if the job has been configured for restart runs; otherwise this action is identical to "Force stop".
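
=== Note on the progress calculation

The calculation described under "Progress bar" can be illustrated with a minimal sketch. This is not the actual palm_wd code: the function name progress_percent and its two arguments are hypothetical, the format of the PROGRESS file is not parsed here, and it is an assumption that the caller passes whichever of end_time or restart_time applies to the run. The result is clamped so that a bar widget never receives a value outside 0 to 100.

```python
def progress_percent(current_time, target_time):
    """Return the job progress in percent, clamped to [0, 100].

    current_time: current simulation time in seconds
    target_time:  end_time or restart_time of the run in seconds
                  (assumption: the caller picks whichever applies)
    """
    if target_time <= 0:
        # No meaningful target time is known, so report no progress.
        return 0.0
    # Explicit float conversion keeps the division correct under Python 2.7.
    fraction = float(current_time) / float(target_time)
    return max(0.0, min(100.0, 100.0 * fraction))


if __name__ == "__main__":
    # 900 s simulated out of a 3600 s target
    print(progress_percent(900.0, 3600.0))  # prints 25.0
```

A value computed this way could be passed directly to a progress bar that expects a percentage.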