Document taken from a search engine cache. Address of the original document: http://www.eso.org/~qc/dfos/clMonitor.html
Date modified: Mon Sep 26 12:29:57 2011
Date indexed: Tue Oct 2 01:31:46 2012

clMonitor

Common DFOS tools:
Documentation

dfos = Data Flow Operations System, the common tool set for DFO

new:

- v1.0: released
- v1.1: suppression of "Interactive jobs" (not needed anymore)

The tool visualizes the processing queue on the QC cluster.

Note: this tool is not part of the standard dfos tool suite. It needs to be installed only for the instruments on the QC cluster, and for those on the dfo blades which are enabled for exporting jobs to the QC cluster using XportJob.

databases	none
dfos tools	[condor_status, condor_q] XportJob
output	clMonitor.html, exported to http://qcweb.hq.eso.org/CLUSTER/monitor
upload/download	upload: clMonitor.html (see output); down: config.clMonitor (with wget)
MEF processing	other tools involved in MEF processing are: createJob, getStatusAB, processQC, scoreQC

clMonitor

Description

	tool monitors MEF cascades
	tool monitors condor execution

This tool visualizes the processing situation on the QC cluster. The QC cluster is currently used as "home platform" for VIRCAM and HAWKI, and can also be used as "QC grid" to export compute jobs from the dfo blades to the QC cluster whenever resources are available there.

The clMonitor gives an overview of the current condor activity (nodes executing condor jobs) and the pending queue. Its main purpose is to give feedback about the current and future processing jobs, so that an external user can decide whether to submit additional compute jobs.

It has four links:

cluster | queue | ganglia | installation

Cluster. The command condor_status is called and its response is visualized:

Cluster overview: [graphic: map of the cluster nodes, with labels qc05, qc10, qc15, qc20]

Legend: busy | idle | reserved/not available

'busy' indicates that condor currently executes a job on that node; 'idle' means no current condor job; 'reserved/not available' means the node is not configured for condor jobs, or is currently not available. Note that this overview indicates only the condor situation. Some nodes are reserved for QC jobs or as condor_master, so they might be busy with non-condor jobs that would not be indicated on this monitor. Likewise, an idle condor node might actually be running a non-condor job.

If a node is condor-active, the job is displayed (e.g. processAB), along with the AB name and the user ID.

node	CPU	status	load	CMD	AB	user
qc01	#1	Busy	0.000	processAB	HAWKI.2010-03-30T10:05:39.231_tpl.ab	hawki
qc01	#2	Idle	0.000

qc02	#1	Idle	0.000
qc02	#2	Idle	0.000

qc03	#1	Idle	0.000
qc03	#2	Idle	0.000

qc06	#1	Idle	0.120
qc06	#2	Idle	0.000

qc07	#1	Idle	0.000
qc07	#2	Idle	0.000

qc09	#1	Idle	0.000
qc09	#2	Idle	0.000

qc10	#1	Busy	0.000	processAB	HAWKI.2010-03-30T10:03:47.149_tpl.ab	hawki
qc10	#2	Idle	0.240
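The rows above are regular enough to be summarized by a script. The following is a minimal sketch in Python, assuming the rows are tab-separated exactly as displayed in the table (this is the monitor's table layout, not the raw output of condor_status). The names parse_rows and summarize are hypothetical and not part of clMonitor.

```python
from collections import Counter

# A few rows copied from the table above (tab-separated).
SAMPLE_ROWS = """\
qc01\t#1\tBusy\t0.000\tprocessAB\tHAWKI.2010-03-30T10:05:39.231_tpl.ab\thawki
qc01\t#2\tIdle\t0.000
qc10\t#1\tBusy\t0.000\tprocessAB\tHAWKI.2010-03-30T10:03:47.149_tpl.ab\thawki
qc10\t#2\tIdle\t0.240
"""

# Column names as given in the table header.
FIELDS = ("node", "cpu", "status", "load", "cmd", "ab", "user")


def parse_rows(text):
    """Turn tab-separated monitor rows into dictionaries.

    Idle rows carry only the first four columns, so the missing
    fields (cmd, ab, user) are filled with None.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip the blank separator lines between nodes
        parts = line.split("\t")
        parts += [None] * (len(FIELDS) - len(parts))
        row = dict(zip(FIELDS, parts))
        row["load"] = float(row["load"])
        rows.append(row)
    return rows


def summarize(rows):
    """Count CPU slots per status (e.g. Busy, Idle)."""
    return Counter(row["status"] for row in rows)


rows = parse_rows(SAMPLE_ROWS)
summary = summarize(rows)
print(summary["Busy"], "busy,", summary["Idle"], "idle")
```

Such a summary reflects only the condor situation; as noted above, an idle condor slot may still be running a non-condor job.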

The clMonitor has a flag monitoring interactive jobs (certifyProducts, updateDP, releaseDP) of the MAIN_USER and of OTHER_USERs (see configuration below). If any such interactive job is detected on the configured QC blades, the flag turns red, to indicate the need for coordination with the process owner before starting compute jobs.
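The flag logic can be illustrated with a short Python sketch. It assumes that the list of running processes is already available as (user, command) pairs; how clMonitor actually collects them is not described here. The function name find_interactive_jobs and the example process list are hypothetical.

```python
# Interactive jobs named in the text above.
INTERACTIVE_JOBS = {"certifyProducts", "updateDP", "releaseDP"}


def find_interactive_jobs(processes, watched_users):
    """Return the (user, command) pairs that should raise the flag.

    processes:     iterable of (user, command) pairs
    watched_users: the MAIN_USER plus all configured OTHER_USERs
    """
    return [
        (user, command)
        for user, command in processes
        if user in watched_users and command in INTERACTIVE_JOBS
    ]


# Hypothetical process list, for illustration only.
processes = [
    ("hawki", "processAB"),
    ("hawki", "certifyProducts"),
    ("guest", "updateDP"),  # not a watched user, so ignored
]
hits = find_interactive_jobs(processes, watched_users={"vircam", "hawki"})
flag_is_red = bool(hits)
print("flag red:", flag_is_red, hits)
```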

Queue. If there are more jobs in the condor queue than currently executable, they are listed in the second tab called "queue".

Ganglia. This is a link to the Ganglia performance monitor for the QC cluster blades.

Installation. This link is potentially useful only for the guest accounts using the XportJob mechanism. The completeness of the installation of these guest accounts is monitored here.

Output

clMonitor.html | clMonitor1.html | clMonitor2.html exported to http://qcweb/~qc/CLUSTER/monitor/clMonitor.html

How to install

This tool is not part of the standard dfos tool suite. Download the installation tarball from http://www.eso.org/~qc/dfos/ into $DFO_INSTALL_DIR/clMonitor. Then:

cd $DFO_INSTALL_DIR/clMonitor
tar xvf clMonitor.tar
cat README_clMonitor
mv clMonitor $DFO_BIN_DIR

How to use

Type clMonitor -h for quick help, and clMonitor -v for the version number. Type

clMonitor

to create or refresh http://qcweb/~qc/CLUSTER/monitor/clMonitor.html. You can call this tool only on the QC cluster.

The tool can be called on the command line at any time. Operationally, it runs in an infinite loop in one main instance, currently on vircam, as 'watch -n 60 clMonitor', so the output is refreshed automatically every 60 seconds. The HTML output forces the browser to refresh at the same rate. Because this near-real-time mode causes some load, avoid any additional calls using 'watch'.
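One common way to make a browser reload a page at a fixed rate is the HTML meta refresh tag. The sketch below, in Python, shows a page generated with a 60-second refresh. Whether clMonitor uses this particular mechanism is an assumption, and render_page is a hypothetical helper, not part of the tool.

```python
from html import escape

REFRESH_SECONDS = 60  # same rate as 'watch -n 60 clMonitor'


def render_page(title, body_html, refresh=REFRESH_SECONDS):
    """Build a minimal HTML page that asks the browser to reload itself.

    The meta refresh tag is standard HTML; its use here is an
    assumption about how the monitor page could be written.
    """
    return (
        "<html><head>\n"
        f'<meta http-equiv="refresh" content="{int(refresh)}">\n'
        f"<title>{escape(title)}</title>\n"
        "</head><body>\n"
        f"{body_html}\n"
        "</body></html>\n"
    )


page = render_page("QC cluster monitor", "<p>cluster overview goes here</p>")
print(page)
```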

Configuration file

There is a master configuration file http://www.eso.org/~qc/dfos/tools/config.clMonitor that is downloaded automatically. It overwrites the local version under $DFO_CONFIG_DIR/config.clMonitor.local. The download is done via wget and has a timeout protection (10 sec). On timeout, the local version is read instead. It is pointless to edit the local version (it gets overwritten on the next execution).
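The download-with-fallback behaviour can be sketched as follows in Python. The tool itself uses wget; urllib is swapped in here only for illustration. The function names fetch_master and load_config are hypothetical, and the demo simulates a timeout instead of contacting the server.

```python
import tempfile
import urllib.request
from pathlib import Path

MASTER_URL = "http://www.eso.org/~qc/dfos/tools/config.clMonitor"
TIMEOUT_SECONDS = 10  # timeout protection mentioned above


def fetch_master(url=MASTER_URL, timeout=TIMEOUT_SECONDS):
    """Download the master configuration; raises OSError on failure or timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def load_config(local_path, fetch=fetch_master):
    """Return the configuration text, preferring the master copy.

    On success the local copy is overwritten; on a timeout or network
    error the existing local copy is read instead.
    """
    local_path = Path(local_path)
    try:
        text = fetch()
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return local_path.read_text()
    local_path.write_text(text)
    return text


def _simulated_timeout():
    raise TimeoutError("simulated download timeout")


# Demo with placeholder content; no network access is needed.
with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "config.clMonitor.local"
    local.write_text("# old local copy\n")
    fallback_text = load_config(local, fetch=_simulated_timeout)
    refreshed_text = load_config(local, fetch=lambda: "# fresh master copy\n")
    stored_text = local.read_text()

print(repr(fallback_text), repr(refreshed_text), repr(stored_text))
```

The second call shows why editing the local file is pointless: a successful download replaces it.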

The central configuration file can in principle be edited by anyone, after coordination.

It contains the geometry of the node table and, for each node, a label marking its current function (e.g. condor_master or condor_execution):

Section 1: Users on cluster
MAIN_USER: can check for OTHER_USERs' interactive processes
OTHER_USER: they have full dfos accounts there
GUEST users: they use the QC grid (only dfos AB execution with XPORT tools)
MAIN_USER	vircam		name of main user (the one who runs 'watch -n 60 clMonitor')
OTHER_USER	hawki	qc08	name and QC blade for home account
[there will be more in the future]
Section 2: Cluster node table
CPU@node	slot1@qc01		CPU name and blade name
geom	ROW1		this node to appear in row #1
role	condor_execution		'condor_execution', or e.g. 'VIRCAM_dfo' if not condor_execution
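To show how such a file could be read, here is a minimal Python sketch. It assumes the entries are tab-separated (key, value, optional second value, comment) and that each CPU@node line opens a node entry to which the following geom and role lines belong. Both are inferences from the listing above, not a documented format. parse_config is a hypothetical name.

```python
# Sample entries taken from the listing above (tab-separated).
SAMPLE = (
    "MAIN_USER\tvircam\t\tname of main user\n"
    "OTHER_USER\thawki\tqc08\tname and QC blade for home account\n"
    "CPU@node\tslot1@qc01\t\tCPU name and blade name\n"
    "geom\tROW1\t\tthis node to appear in row #1\n"
    "role\tcondor_execution\t\trole of the node\n"
)


def parse_config(text):
    """Parse users and the node table from config-style text."""
    config = {"main_user": None, "other_users": [], "nodes": []}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2 or not parts[0].strip():
            continue  # skip blank lines and anything without a value
        key, value = parts[0].strip(), parts[1].strip()
        extra = parts[2].strip() if len(parts) > 2 else ""
        if key == "MAIN_USER":
            config["main_user"] = value
        elif key == "OTHER_USER":
            config["other_users"].append({"user": value, "blade": extra})
        elif key == "CPU@node":
            # 'slot1@qc01' -> CPU slot 'slot1' on blade 'qc01'
            cpu, _, node = value.partition("@")
            config["nodes"].append(
                {"cpu": cpu, "node": node, "geom": None, "role": None}
            )
        elif key in ("geom", "role") and config["nodes"]:
            # attach to the most recently opened node entry
            config["nodes"][-1][key] = value
    return config


parsed = parse_config(SAMPLE)
print(parsed)
```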

Operational aspects

The output is linked to the dfoMonitor as optional frame on the right side.
The output is linked to the public monitor page as "QC cluster monitor".
