Original document: http://theory.sinp.msu.ru/pipermail/ru-ngi/2012q1/000399.html (modified Fri Feb 10 13:27:37 2012)
Eygene, could you share the little script that produces such a nice chart for the queues?

Victor

on 10.02.2012 12:02, Eygene Ryabinkin wrote:
> Fri, Feb 10, 2012 at 11:52:53AM +0400, Victor Kotlyar wrote:
>> I've noticed that over the last two days the ATLAS pilot has changed its behavior.
>>
>> Some "long" jobs were started:
>>
>> resources_used.cput = 48:32:40
>> resources_used.mem = 1151908kb
>> resources_used.vmem = 2387800kb
>> resources_used.walltime = 48:47:45
>>
>> In the PanDA monitor the number of our analysis jobs has dropped, and the production
>> section shows 0 and the word "test" (same as for RRC-KI).
>
> For us, the ATLAS job situation over the last 3 days has been more or less stable:
> {{{
> 08.02.2012
> ==========
>
> *queue atlas, 3391 jobs, failed 0.06%, killed 0.00%, canceled 0.03%:
>   1 canceled jobs
>   3388 jobs with code 0
>   2 jobs with code 1
>   quality assessor says: wow, shit, 2% or even less of errors?
>   Does our cluster work at all? Or you're killing every job? ;))
>   Memory consumption
>          0 - 1Mb ==>    19 ]
>      1Mb - 100Mb ==>  3221 ]==============================
>      100Mb - 1Gb ==>   126 ]=
>        1Gb - 2Gb ==>    22 ]
>        2Gb - 3Gb ==>     3 ]
>   Vmem consumption
>          0 - 1Mb ==>    19 ]
>      1Mb - 100Mb ==>     5 ]
>      100Mb - 1Gb ==>  3272 ]==============================
>        1Gb - 2Gb ==>    90 ]
>        2Gb - 3Gb ==>     2 ]
>   3.2Gb - 3.5Gb* ==>     3 ]
>   CPU time consumption
>         0 - 1min ==>  3224 ]==============================
>     1min - 10min ==>   101 ]
>    10min - 1hour ==>    62 ]
>    6hours - 1day ==>     4 ]
>   Walltime consumption
>         0 - 1min ==>    50 ]
>     1min - 10min ==>  3009 ]==============================
>    10min - 1hour ==>   312 ]===
>   1hour - 6hours ==>    15 ]
>    6hours - 1day ==>     4 ]
>     1day - 2days ==>     1 ]
>
>
> 09.02.2012
> ==========
>
> *queue atlas, 11395 jobs, failed 0.00%, killed 0.00%, canceled 0.01%:
>   1 canceled jobs
>   11394 jobs with code 0
>   quality assessor says: wow, shit, 2% or even less of errors?
>   Does our cluster work at all? Or you're killing every job? ;))
>   Memory consumption
>          0 - 1Mb ==>  1764 ]=====
>      1Mb - 100Mb ==>  8884 ]==============================
>      100Mb - 1Gb ==>   433 ]=
>        1Gb - 2Gb ==>   314 ]=
>   Vmem consumption
>          0 - 1Mb ==>  1764 ]=====
>      1Mb - 100Mb ==>   165 ]
>      100Mb - 1Gb ==>  8955 ]==============================
>        1Gb - 2Gb ==>   461 ]=
>        2Gb - 3Gb ==>    50 ]
>   CPU time consumption
>         0 - 1min ==> 10690 ]==============================
>     1min - 10min ==>   275 ]
>    10min - 1hour ==>   112 ]
>   1hour - 6hours ==>   261 ]
>    6hours - 1day ==>    57 ]
>   Walltime consumption
>         0 - 1min ==>  2230 ]=======
>     1min - 10min ==>  8408 ]==============================
>    10min - 1hour ==>   392 ]=
>   1hour - 6hours ==>   280 ]
>    6hours - 1day ==>    84 ]
>     1day - 2days ==>     1 ]
>
>
> 10.02.2012
> ==========
>
> *queue atlas, 2162 jobs, failed 0.00%, killed 0.00%, canceled 0.00%:
>   2162 jobs with code 0
>   quality assessor says: wow, shit, 2% or even less of errors?
>   Does our cluster work at all? Or you're killing every job? ;))
>   Memory consumption
>          0 - 1Mb ==>   165 ]===
>      1Mb - 100Mb ==>  1547 ]==============================
>      100Mb - 1Gb ==>   129 ]==
>        1Gb - 2Gb ==>   302 ]=====
>        2Gb - 3Gb ==>    19 ]
>   Vmem consumption
>          0 - 1Mb ==>   165 ]===
>      1Mb - 100Mb ==>    14 ]
>      100Mb - 1Gb ==>  1610 ]==============================
>        1Gb - 2Gb ==>   298 ]=====
>        2Gb - 3Gb ==>    75 ]=
>   CPU time consumption
>         0 - 1min ==>  1729 ]==============================
>     1min - 10min ==>   228 ]===
>    10min - 1hour ==>    80 ]=
>   1hour - 6hours ==>   125 ]==
>   Walltime consumption
>         0 - 1min ==>   199 ]===
>     1min - 10min ==>  1526 ]==============================
>    10min - 1hour ==>   285 ]=====
>   1hour - 6hours ==>   152 ]==
> }}}
> Yesterday there were, admittedly, a few long jobs, but so far that is small change:
> less than 1/2 percent of all jobs.
>
> But if ATLAS has changed something in how it distributes or launches jobs, or in
> anything else, we would certainly like to know about it.
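Victor's question asks for the script behind the chart, and that script is not part of this thread. The Python below is only a hypothetical sketch of how such a report could be produced, not Eygene's script. It assumes Torque/PBS accounting records of the form `date;E;jobid;key=value ...` that carry the `queue`, `Exit_status` and `resources_used.*` fields quoted above; the record layout, all function names and the sample records are assumptions for illustration. Bucket edges are read off the report labels, and the bar rule (bar length = floor(count * 30 / largest count in the histogram)) is inferred by checking it against the numbers in the report. The failed/killed/canceled percentages, the odd `3.2Gb - 3.5Gb*` bucket and the quality-assessor remark are not reproduced, since their definitions are not given.

```python
"""Hypothetical per-queue report in the style of the chart above (not the original script)."""
from collections import Counter

BAR_WIDTH = 30  # the longest bar in the report is 30 '=' characters

# Bucket edges read off the report labels; lower bound inclusive, upper exclusive.
KB_PER_MB, KB_PER_GB = 1024, 1024 * 1024  # Torque reports memory in kb
MEM_BUCKETS = [
    ("0 - 1Mb", 0, KB_PER_MB),
    ("1Mb - 100Mb", KB_PER_MB, 100 * KB_PER_MB),
    ("100Mb - 1Gb", 100 * KB_PER_MB, KB_PER_GB),
    ("1Gb - 2Gb", KB_PER_GB, 2 * KB_PER_GB),
    ("2Gb - 3Gb", 2 * KB_PER_GB, 3 * KB_PER_GB),
]
MIN, HOUR, DAY = 60, 3600, 86400
TIME_BUCKETS = [
    ("0 - 1min", 0, MIN),
    ("1min - 10min", MIN, 10 * MIN),
    ("10min - 1hour", 10 * MIN, HOUR),
    ("1hour - 6hours", HOUR, 6 * HOUR),
    ("6hours - 1day", 6 * HOUR, DAY),
    ("1day - 2days", DAY, 2 * DAY),
]


def parse_record(line):
    """Split an assumed 'date;type;jobid;key=value ...' accounting line into a dict."""
    _date, rtype, jobid, attrs = line.rstrip("\n").split(";", 3)
    fields = dict(kv.split("=", 1) for kv in attrs.split() if "=" in kv)
    fields["record_type"], fields["jobid"] = rtype, jobid
    return fields


def hms_to_seconds(value):
    """Convert 'HH:MM:SS' (hours may exceed 24) to seconds."""
    h, m, s = (int(x) for x in value.split(":"))
    return h * 3600 + m * 60 + s


def kb_to_int(value):
    """Convert '1151908kb' to 1151908."""
    return int(value.rstrip("kb"))


def histogram(title, values, buckets):
    """Text histogram; bar length is floor(count * BAR_WIDTH / largest count)."""
    counts = Counter()
    for v in values:
        for label, lo, hi in buckets:
            if lo <= v < hi:
                counts[label] += 1
                break  # values outside every bucket are silently ignored
    top = max(counts.values(), default=0)
    lines = [title]
    for label, _lo, _hi in buckets:
        n = counts[label]
        if n:  # empty buckets are omitted, as in the report
            lines.append(f"  {label:>14} ==> {n:5d} ]" + "=" * (n * BAR_WIDTH // top))
    return "\n".join(lines)


def queue_report(log_lines, queue):
    """Exit-code counts and resource histograms for finished ('E') jobs of one queue."""
    jobs = [r for r in map(parse_record, log_lines)
            if r["record_type"] == "E" and r.get("queue") == queue]
    codes = Counter(int(r.get("Exit_status", -1)) for r in jobs)
    out = [f"*queue {queue}, {len(jobs)} jobs:"]
    out += [f"  {n} jobs with code {code}" for code, n in sorted(codes.items())]
    for title, key, conv, buckets in [
        ("Memory consumption", "resources_used.mem", kb_to_int, MEM_BUCKETS),
        ("Vmem consumption", "resources_used.vmem", kb_to_int, MEM_BUCKETS),
        ("CPU time consumption", "resources_used.cput", hms_to_seconds, TIME_BUCKETS),
        ("Walltime consumption", "resources_used.walltime", hms_to_seconds, TIME_BUCKETS),
    ]:
        out.append(histogram(title, [conv(r[key]) for r in jobs if key in r], buckets))
    return "\n".join(out)


if __name__ == "__main__":
    sample = [  # made-up records containing only the fields used above
        "02/10/2012 12:00:00;E;1.ce;queue=atlas Exit_status=0 "
        "resources_used.cput=20:00:00 resources_used.mem=1151908kb "
        "resources_used.vmem=2387800kb resources_used.walltime=20:15:00",
        "02/10/2012 12:05:00;E;2.ce;queue=atlas Exit_status=0 "
        "resources_used.cput=00:00:30 resources_used.mem=51200kb "
        "resources_used.vmem=512000kb resources_used.walltime=00:05:00",
    ]
    print(queue_report(sample, "atlas"))
```

Note that with these buckets a job like Victor's (walltime 48:47:45, just over two days) falls outside every time bucket and is dropped; a real script would need an extra bucket for such jobs.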