clMonitor

dfos = Data Flow Operations System, the common tool set for DFO

- v1.0: released
- v1.1: suppression of the "Interactive jobs" flag (no longer needed)

The tool visualizes the processing queue on the QC cluster.
  Note: this tool is not part of the standard dfos tool suite. It needs to be installed only for the instruments on the QC cluster, and for those on the dfo blades which are enabled for exporting jobs to the QC cluster using XportJob.
[ used databases ]  none
[ used dfos tools ] condor_status, condor_q; XportJob
[ output used by ]  clMonitor.html, exported to http://qcweb.hq.eso.org/CLUSTER/monitor
[ upload/download ] upload: clMonitor.html (see output); download: config.clMonitor (with wget)

Other tools involved in MEF processing are: createJob, getStatusAB, processQC, scoreQC.
What's MEF processing?

clMonitor

Description

- tool monitors MEF cascades
- tool monitors condor execution

This tool visualizes the processing situation on the QC cluster. The QC cluster is currently used as "home platform" for VIRCAM and HAWKI, and can also be used as "QC grid" to export compute jobs from the dfo blades to the QC cluster whenever resources are available there.

The clMonitor gives an overview of the current condor activity (nodes executing condor jobs) and of the pending queue. Its main purpose is to give feedback about current and future processing jobs, so that an external user can decide whether to submit additional compute jobs.

It has four links:
cluster | queue | ganglia | installation


Cluster.

The command condor_status is called, and its response is visualized:

[Cluster overview graphic: nodes (e.g. qc05, qc10, qc15, qc20) shown colour-coded as busy | idle | reserved/not available]

'busy' indicates that condor currently executes a job on that node; 'idle' means no current condor job; 'reserved/not available' means the node is not configured for condor jobs, or is currently not available. Note that this overview indicates only the condor situation. Some nodes are reserved for QC jobs or as condor_master, so they might be busy with non-condor jobs that would not be indicated on this monitor. Likewise, an idle condor node might actually be running a non-condor job.

If a node is condor-active, the job is displayed (e.g. processAB), along with the AB name and the user ID.

node CPU status load CMD AB user
qc01 #1 Busy 0.000 processAB HAWKI.2010-03-30T10:05:39.231_tpl.ab hawki
qc01 #2 Idle 0.000  
qc02 #1 Idle 0.000  
qc02 #2 Idle 0.000  
qc03 #1 Idle 0.000  
qc03 #2 Idle 0.000  
qc06 #1 Idle 0.120  
qc06 #2 Idle 0.000  
qc07 #1 Idle 0.000  
qc07 #2 Idle 0.000  
qc09 #1 Idle 0.000  
qc09 #2 Idle 0.000  
qc10 #1 Busy 0.000 processAB HAWKI.2010-03-30T10:03:47.149_tpl.ab hawki
qc10 #2 Idle 0.240  
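The busy/idle summary shown above could be derived from condor_status output along these lines. This is a sketch only: the column layout and the two-slots-per-node assumption are mine, and the sample data stands in for a live condor_status call.

```shell
# Sketch only: summarize Busy/Idle slots per node from condor_status-style
# lines ("slotN@node  Activity  load"). A live version would pipe the real
# 'condor_status' command instead of the sample text below.
condor_sample='slot1@qc01 Busy 0.000
slot2@qc01 Idle 0.000
slot1@qc10 Busy 0.000
slot2@qc10 Idle 0.240'

echo "$condor_sample" | awk '
    { split($1, a, "@"); node = a[2]
      if ($2 == "Busy") busy[node]++
      total[node]++ }
    END { for (n in total) printf "%s: %d/%d slots busy\n", n, busy[n], total[n] }
' | sort
```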

The clMonitor has a flag monitoring interactive jobs (certifyProducts, updateDP, releaseDP) of the MAIN_USER and of OTHER_USERs (see configuration below). If any such interactive job is detected on the configured QC blades, the flag turns red, to indicate the need for coordination with the process owner before starting compute jobs.

Queue. If there are more jobs in the condor queue than currently executable, they are listed in the second tab called "queue".

Ganglia. This is a link to the Ganglia performance monitor for the QC cluster blades.

Installation. This link is potentially useful only for the guest accounts using the XportJob mechanism. The completeness of the installation of these guest accounts is monitored here.


Output

How to install

How to use

Type clMonitor -h for a quick help, and clMonitor -v for the version number. Type

clMonitor

to create or refresh http://qcweb/~qc/CLUSTER/monitor/clMonitor.html. You can call this tool only on the QC cluster.

While calling on the command line is possible anytime, the tool is running operationally in an infinite loop in one main instance, currently on vircam, as 'watch -n 60 clMonitor'. Thereby it is automatically refreshed every 60 seconds. The HTML output forces the browser to refresh at the same rate. Because that near-real time mode causes some load, any additional calls using 'watch' should be avoided.
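The browser-side refresh at the same 60-second rate is typically achieved with a meta-refresh tag in the generated page. The fragment below is a sketch of what the HTML output might contain; the exact markup emitted by clMonitor is an assumption.

```shell
# Sketch: write an HTML page that tells the browser to reload itself every
# 60 s, matching the 'watch -n 60' interval. The markup is illustrative.
OUT=/tmp/clMonitor.html
cat > "$OUT" <<'EOF'
<html>
<head><meta http-equiv="refresh" content="60"></head>
<body><!-- node table, queue, ganglia and installation links go here --></body>
</html>
EOF
```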

Configuration file

There is a master configuration file http://www.eso.org/~qc/dfos/tools/config.clMonitor that is downloaded automatically. It overwrites the local version under $DFO_CONFIG_DIR/config.clMonitor.local. The download is done via wget and has a timeout protection (10 sec). On timeout, the local version is read instead. It is pointless to edit the local version (it gets overwritten on the next execution).
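The download-with-fallback behaviour described above can be sketched as follows. The function name and the exact wget options are illustrative; only the 10-second timeout and the fallback to the local copy come from the text.

```shell
# Sketch of the config download with timeout protection. fetch_config is a
# hypothetical helper; -T sets the wget timeout, -t 1 limits it to one attempt.
fetch_config() {
    url=$1; local_file=$2
    if wget -q -T 10 -t 1 -O "$local_file.new" "$url"; then
        mv "$local_file.new" "$local_file"    # master copy overwrites local version
    else
        rm -f "$local_file.new"               # timeout: keep existing local version
        echo "download failed, reading local $local_file" >&2
    fi
}

# usage (paths as documented):
# fetch_config http://www.eso.org/~qc/dfos/tools/config.clMonitor \
#              "$DFO_CONFIG_DIR/config.clMonitor.local"
```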

The central configuration file can in principle be edited by anyone, after coordination.

It contains the geometry of the node table, and a label marking each node's current function (e.g. condor_master or condor_execution):

Section 1: Users on cluster

MAIN_USER: can check for OTHER_USERs' interactive processes
OTHER_USER: they have full dfos accounts there
GUEST users: they use the QC grid (only dfos AB execution with XPORT tools)

MAIN_USER    vircam              name of main user (the one who runs 'watch -n 60 clMonitor')
OTHER_USER   hawki  qc08         name and QC blade for home account
                                 [there will be more in the future]

Section 2: Cluster node table

CPU@node     slot1@qc01          CPU name and blade name
geom         ROW1                this node to appear in row #1
role         condor_execution    'condor_execution', or e.g. 'VIRCAM_dfo' if not condor_execution
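Putting the keywords above together, a config.clMonitor might look roughly like this. The layout and comment style are assumptions based on the keys and sample values listed above.

```
# Section 1: users on cluster
MAIN_USER    vircam                # runs 'watch -n 60 clMonitor'
OTHER_USER   hawki   qc08          # name and QC blade for home account

# Section 2: cluster node table
CPU@node     slot1@qc01            # CPU name and blade name
geom         ROW1                  # node appears in row #1
role         condor_execution      # or e.g. 'VIRCAM_dfo'
```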

Operational aspects