[Document retrieved from a search-engine cache. Original document: http://www.eso.org/~qc/dfos/clMonitor.html, last modified Mon Sep 26 12:29:57 2011.]
Common DFOS tools: dfos = Data Flow Operations System, the common tool set for DFO

- v1.0: released
The tool visualizes the processing queue on the QC cluster.
Note: this tool is not part of the standard dfos tool suite. It needs to be installed only for the instruments on the QC cluster, and for those on the dfo blades which are enabled for exporting jobs to the QC cluster using XportJob.
See also: tool monitors MEF cascades; tool monitors condor execution.
This tool visualizes the processing situation on the QC cluster. The QC cluster is currently used as "home platform" for VIRCAM and HAWKI, and can also be used as "QC grid" to export compute jobs from the dfo blades to the QC cluster whenever resources are available there.
The clMonitor gives an overview of the current condor activity (nodes executing condor jobs) and of the pending queue. Its main purpose is to give feedback about current and future processing jobs, so that an external user can decide about submitting additional compute jobs.
It has four links: cluster | queue | ganglia | installation
Cluster. The command condor_status is called and its response is visualized:
Cluster overview (node grid, e.g.):

qc05 | qc10 | qc15 | qc20

Legend: busy | idle | reserved/not available
'busy' indicates that condor currently executes a job on that node; 'idle' means no current condor job; 'reserved/not available' means the node is not configured for condor jobs, or is currently not available. Note that this overview indicates only the condor situation. Some nodes are reserved for QC jobs or as condor_master, so they might be busy with non-condor jobs that would not be indicated on this monitor. Likewise, an idle condor node might actually be running a non-condor job.
If a node is condor-active, the job is displayed (e.g. processAB), along with the AB name and the user ID.
node | CPU | status | load | CMD | AB | user |
qc01 | #1 | Busy | 0.000 | processAB | HAWKI.2010-03-30T10:05:39.231_tpl.ab | hawki |
qc01 | #2 | Idle | 0.000 | |||
qc02 | #1 | Idle | 0.000 | |||
qc02 | #2 | Idle | 0.000 | |||
qc03 | #1 | Idle | 0.000 | |||
qc03 | #2 | Idle | 0.000 | |||
qc06 | #1 | Idle | 0.120 | |||
qc06 | #2 | Idle | 0.000 | |||
qc07 | #1 | Idle | 0.000 | |||
qc07 | #2 | Idle | 0.000 | |||
qc09 | #1 | Idle | 0.000 | |||
qc09 | #2 | Idle | 0.000 | |||
qc10 | #1 | Busy | 0.000 | processAB | HAWKI.2010-03-30T10:03:47.149_tpl.ab | hawki |
qc10 | #2 | Idle | 0.240 |
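A per-state summary like the table above could be derived from a condor_status-style slot listing. The following is a minimal sketch; the sample text is hard-coded here to mirror the table, whereas on the cluster the same pipeline would read from `condor_status` directly (whose exact column layout depends on the HTCondor version):

```shell
# Count Busy vs Idle slots from a condor_status-style "slot state" listing.
sample='slot1@qc01 Busy
slot2@qc01 Idle
slot1@qc10 Busy
slot2@qc10 Idle'
echo "$sample" | awk '{count[$2]++}
    END {print "Busy:", count["Busy"]+0; print "Idle:", count["Idle"]+0}'
```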
The clMonitor has a flag monitoring interactive jobs (certifyProducts, updateDP, releaseDP) of the MAIN_USER and of OTHER_USERs (see configuration below). If any such interactive job is detected on the configured QC blades, the flag turns red, to indicate the need for coordination with the process owner before starting compute jobs.
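The flag logic could work along these lines. This is a hypothetical sketch: the user/command listing is hard-coded sample data (operationally it would come from e.g. a process listing on the configured QC blades), and only the three interactive job names are taken from the text:

```shell
# Raise a red flag if any configured interactive job is found
# in a "user command" listing (sample data below).
flag=green
while read -r user cmd; do
    case "$cmd" in
        certifyProducts|updateDP|releaseDP)
            flag=red
            echo "interactive job: $cmd ($user)" ;;
    esac
done <<'EOF'
vircam certifyProducts
hawki processAB
EOF
echo "flag=$flag"
```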
Queue. If there are more jobs in the condor queue than currently executable, they are listed in the second tab called "queue".
Ganglia. This is a link to the Ganglia performance monitor for the QC cluster blades.
Installation. This link is potentially useful only for the guest accounts using the XportJob mechanism. The completeness of the installation of these guest accounts is monitored here.
This tool is not part of the standard dfos tool suite. Download the installation tarball from http://www.eso.org/~qc/dfos/ into $DFO_INSTALL_DIR/clMonitor. Then:
cd $DFO_INSTALL_DIR/clMonitor
tar xvf clMonitor.tar
cat README_clMonitor
mv clMonitor $DFO_BIN_DIR
Type clMonitor -h for a quick help, and clMonitor -v for the version number. Type
clMonitor
to create or refresh http://qcweb/~qc/CLUSTER/monitor/clMonitor.html. You can call this tool only on the QC cluster.
While calling the tool on the command line is possible at any time, it runs operationally in an infinite loop in one main instance (currently on vircam) as 'watch -n 60 clMonitor', so it is automatically refreshed every 60 seconds. The HTML output forces the browser to refresh at the same rate. Because this near-real-time mode causes some load, additional calls using 'watch' should be avoided.
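The browser-side refresh described above is conventionally done with an HTML meta tag; a sketch of what the generated page could emit (the actual header written by clMonitor may differ):

```shell
# The operational loop is equivalent to:
#   while true; do clMonitor; sleep 60; done
# and the generated HTML can enforce the matching 60-second browser
# refresh with a standard meta tag:
printf '<meta http-equiv="refresh" content="60">\n'
```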
There is a master configuration file http://www.eso.org/~qc/dfos/tools/config.clMonitor that is downloaded automatically. It overwrites the local version under $DFO_CONFIG_DIR/config.clMonitor.local. The download is done via wget and has a timeout protection (10 sec). On timeout, the local version is read instead. It is pointless to edit the local version (it gets overwritten on the next execution).
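The download-with-fallback behaviour can be sketched as follows; the wget options and the temporary-file handling are assumptions modelled on the text (10-second timeout, keep the local copy on failure):

```shell
# Fetch the master config with a 10 s timeout; on failure keep the
# existing local version.
CFG_URL=http://www.eso.org/~qc/dfos/tools/config.clMonitor
LOCAL=${DFO_CONFIG_DIR:-/tmp}/config.clMonitor.local
if wget -q -T 10 -t 1 -O "$LOCAL.new" "$CFG_URL"; then
    mv "$LOCAL.new" "$LOCAL"    # master copy overwrites local version
else
    rm -f "$LOCAL.new"          # timeout: fall back to local version
    echo "download failed, using local config"
fi
```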
The central configuration file can in principle be edited by anyone, after coordination.
It contains the geometry of the node table and, for each node, a label marking its current function (e.g. condor_master or condor_execution):
Section 1: Users on cluster

key | value | description
MAIN_USER | vircam | name of the main user (the one who runs 'watch -n 60 clMonitor')
OTHER_USER | hawki qc08 | name and QC blade for the home account
[there will be more in the future]

Section 2: Cluster node table

key | value | description
CPU@node | slot1@qc01 | CPU name and blade name
geom | ROW1 | this node to appear in row #1
role | condor_execution | 'condor_execution', or e.g. 'VIRCAM_dfo' if not condor_execution
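Put together, a minimal config.clMonitor using the keys above might look like this (the values are the examples from the table; the exact file syntax is an assumption):

```
# Section 1: users on cluster
MAIN_USER   vircam
OTHER_USER  hawki  qc08

# Section 2: cluster node table
CPU@node    slot1@qc01
geom        ROW1
role        condor_execution
```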