http://github.com/srcc-msu/octotron (in English)
A state-of-the-art supercomputer is an extremely complex, expensive, and energy-intensive system. Every one of its components is unreliable and can fail at any time, which may lead not only to application failures but even to equipment damage.
With this in mind, we maintain a project that aims to provide the highest possible safety of supercomputer hardware together with the highest possible utilization of computing resources. The general requirements we specified for the system,
called Octotron, are:
The key feature of the Octotron system is that it represents the model of supercomputer operation as an extended multigraph. The vertices of the graph represent supercomputer components (nodes, queues, software, etc.), while its edges represent relations between them ("contain", "chill", "power", etc.). The vertices carry attributes, which represent the components' properties received from the monitoring system (temperatures, counters, etc.), and rules, which are functions for failure detection. When a rule detects a failure, it calls a reaction, e.g. sending a message or running a script. With appropriate rules, the graph structure allows us to trace how a failure propagates from its source. We use Neo4j as the graph storage engine and Python to describe the model, rules, and reactions.
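The idea of vertices with attributes, rules, and reactions can be illustrated with a minimal sketch in plain Python. This is not Octotron's actual API and does not use Neo4j: the names `Vertex`, `ModelGraph`, `too_hot`, and `downstream`, as well as the component names and temperature values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Vertex:
    """A component with monitored attributes and failure-detection rules."""
    name: str
    attributes: Dict[str, float] = field(default_factory=dict)
    # Each rule returns a failure message, or None if the vertex looks healthy.
    rules: List[Callable[["Vertex"], Optional[str]]] = field(default_factory=list)


class ModelGraph:
    """Hypothetical in-memory model; edges are (source, relation, target)."""

    def __init__(self) -> None:
        self.vertices: Dict[str, Vertex] = {}
        # A list of triples permits parallel edges, as in a multigraph.
        self.edges: List[Tuple[str, str, str]] = []

    def add_vertex(self, name: str, **attributes: float) -> Vertex:
        vertex = Vertex(name, dict(attributes))
        self.vertices[name] = vertex
        return vertex

    def add_edge(self, source: str, relation: str, target: str) -> None:
        self.edges.append((source, relation, target))

    def downstream(self, name: str) -> List[str]:
        """Names of all vertices reachable from `name` along outgoing edges."""
        seen: List[str] = []
        stack = [name]
        while stack:
            current = stack.pop()
            for source, _relation, target in self.edges:
                if source == current and target not in seen and target != name:
                    seen.append(target)
                    stack.append(target)
        return sorted(seen)

    def check(self, reaction: Callable[[str, str, List[str]], None]) -> None:
        """Run every rule; call the reaction for each detected failure."""
        for vertex in self.vertices.values():
            for rule in vertex.rules:
                message = rule(vertex)
                if message is not None:
                    reaction(vertex.name, message, self.downstream(vertex.name))


def too_hot(limit: float) -> Callable[[Vertex], Optional[str]]:
    """Build a rule that fires when the 'temperature' attribute exceeds limit."""
    def rule(vertex: Vertex) -> Optional[str]:
        temperature = vertex.attributes.get("temperature")
        if temperature is not None and temperature > limit:
            return f"temperature {temperature} exceeds {limit}"
        return None
    return rule


# Hypothetical model: a chiller cools a rack that contains two nodes.
graph = ModelGraph()
chiller = graph.add_vertex("chiller1", temperature=31.0)
graph.add_vertex("rack1")
graph.add_vertex("node1", temperature=55.0)
graph.add_vertex("node2", temperature=54.0)
graph.add_edge("chiller1", "chill", "rack1")
graph.add_edge("rack1", "contain", "node1")
graph.add_edge("rack1", "contain", "node2")
chiller.rules.append(too_hot(25.0))

# The reaction here only records the alert; a real one might send a message.
alerts: List[Tuple[str, str, List[str]]] = []
graph.check(lambda name, message, affected: alerts.append((name, message, affected)))
```

In this sketch the reaction receives the failing vertex together with every component reachable from it, which is one simple way to reason about which parts of the machine a failure at its source may affect.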
Octotron is available under the open-source MIT license. We currently use Octotron to control the MSU "Chebyshev" and "Lomonosov" supercomputers. Examples of errors detected by Octotron include temperatures exceeding their thresholds; time drift, SSH/MPI unavailability, and high load average on computing nodes; suspicious states of the job queue; and a growing number of errors on network interfaces.
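A few of the checks listed above could be expressed as simple threshold and growth tests. The sketch below is only an illustration under stated assumptions: the attribute names, the limits, and the `detect_problems` function are hypothetical and are not taken from the Octotron code base.

```python
from typing import Dict, List, Sequence

# All limits below are hypothetical values chosen for illustration.
TEMPERATURE_LIMIT_C = 80.0
DRIFT_TOLERANCE_S = 5.0
LOAD_PER_CORE_LIMIT = 1.5
ERROR_GROWTH_LIMIT = 100


def counter_growth(samples: Sequence[int]) -> int:
    """Increase of an error counter between the first and last sample."""
    if len(samples) < 2:
        return 0
    return samples[-1] - samples[0]


def detect_problems(node: Dict) -> List[str]:
    """Return descriptions of the problems found in one node's attributes."""
    problems: List[str] = []
    if node["temperature_c"] > TEMPERATURE_LIMIT_C:
        problems.append("temperature above threshold")
    if abs(node["clock_s"] - node["reference_clock_s"]) > DRIFT_TOLERANCE_S:
        problems.append("time drift")
    if node["load_average"] / node["cores"] > LOAD_PER_CORE_LIMIT:
        problems.append("high load average")
    if counter_growth(node["interface_errors"]) > ERROR_GROWTH_LIMIT:
        problems.append("growing network interface errors")
    return problems


# Hypothetical monitoring data for a single computing node.
sample_node = {
    "temperature_c": 62.0,
    "clock_s": 1000.0,
    "reference_clock_s": 1012.0,
    "load_average": 8.0,
    "cores": 8,
    "interface_errors": [10, 40, 500],
}
problems = detect_problems(sample_node)
```

For the sample data, the clock differs from the reference by 12 seconds and the interface error counter grows by 490, so those two checks fire while the temperature and load checks do not.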
Presentation materials: