Document retrieved from a search engine cache. Original document address: http://theory.sinp.msu.ru/pipermail/ru-ngi/2012q1/000399.html
Last modified: Fri Feb 10 13:27:37 2012. Indexed: Tue Oct 2 03:14:34 2012.
Eygene, could you share the script that produces such a nice
picture of the queues?
Victor
on 10.02.2012 12:02, Eygene Ryabinkin wrote:
> Fri, Feb 10, 2012 at 11:52:53AM +0400, Victor Kotlyar wrote:
>> I've noticed that over the last two days the ATLAS pilot has changed its behavior.
>>
>> Some "long" jobs were launched:
>>
>> resources_used.cput = 48:32:40
>> resources_used.mem = 1151908kb
>> resources_used.vmem = 2387800kb
>> resources_used.walltime = 48:47:45
>>
>> In the panda monitor the number of analysis jobs at our site has dropped,
>> and the production section shows 0 together with the word "test" (same as for RRC-KI)
>
> At our site the situation with ATLAS jobs has been more or less stable over the last 3 days:
> {{{
> 08.02.2012
> ==========
>
> *queue atlas, 3391 jobs, failed 0.06%, killed 0.00%, canceled 0.03%:
> 1 canceled jobs
> 3388 jobs with code 0
> 2 jobs with code 1
> quality assessor says: wow, shit, 2% or even less of errors?
> Does our cluster work at all? Or you're killing every job? ;))
> Memory consumption
> 0 - 1Mb ==> 19 ]
> 1Mb - 100Mb ==> 3221 ]==============================
> 100Mb - 1Gb ==> 126 ]=
> 1Gb - 2Gb ==> 22 ]
> 2Gb - 3Gb ==> 3 ]
> Vmem consumption
> 0 - 1Mb ==> 19 ]
> 1Mb - 100Mb ==> 5 ]
> 100Mb - 1Gb ==> 3272 ]==============================
> 1Gb - 2Gb ==> 90 ]
> 2Gb - 3Gb ==> 2 ]
> 3.2Gb - 3.5Gb* ==> 3 ]
> CPU time consumption
> 0 - 1min ==> 3224 ]==============================
> 1min - 10min ==> 101 ]
> 10min - 1hour ==> 62 ]
> 6hours - 1day ==> 4 ]
> Walltime consumption
> 0 - 1min ==> 50 ]
> 1min - 10min ==> 3009 ]==============================
> 10min - 1hour ==> 312 ]===
> 1hour - 6hours ==> 15 ]
> 6hours - 1day ==> 4 ]
> 1day - 2days ==> 1 ]
>
>
> 09.02.2012
> ==========
>
> *queue atlas, 11395 jobs, failed 0.00%, killed 0.00%, canceled 0.01%:
> 1 canceled jobs
> 11394 jobs with code 0
> quality assessor says: wow, shit, 2% or even less of errors?
> Does our cluster work at all? Or you're killing every job? ;))
> Memory consumption
> 0 - 1Mb ==> 1764 ]=====
> 1Mb - 100Mb ==> 8884 ]==============================
> 100Mb - 1Gb ==> 433 ]=
> 1Gb - 2Gb ==> 314 ]=
> Vmem consumption
> 0 - 1Mb ==> 1764 ]=====
> 1Mb - 100Mb ==> 165 ]
> 100Mb - 1Gb ==> 8955 ]==============================
> 1Gb - 2Gb ==> 461 ]=
> 2Gb - 3Gb ==> 50 ]
> CPU time consumption
> 0 - 1min ==> 10690 ]==============================
> 1min - 10min ==> 275 ]
> 10min - 1hour ==> 112 ]
> 1hour - 6hours ==> 261 ]
> 6hours - 1day ==> 57 ]
> Walltime consumption
> 0 - 1min ==> 2230 ]=======
> 1min - 10min ==> 8408 ]==============================
> 10min - 1hour ==> 392 ]=
> 1hour - 6hours ==> 280 ]
> 6hours - 1day ==> 84 ]
> 1day - 2days ==> 1 ]
>
>
> 10.02.2012
> ==========
>
> *queue atlas, 2162 jobs, failed 0.00%, killed 0.00%, canceled 0.00%:
> 2162 jobs with code 0
> quality assessor says: wow, shit, 2% or even less of errors?
> Does our cluster work at all? Or you're killing every job? ;))
> Memory consumption
> 0 - 1Mb ==> 165 ]===
> 1Mb - 100Mb ==> 1547 ]==============================
> 100Mb - 1Gb ==> 129 ]==
> 1Gb - 2Gb ==> 302 ]=====
> 2Gb - 3Gb ==> 19 ]
> Vmem consumption
> 0 - 1Mb ==> 165 ]===
> 1Mb - 100Mb ==> 14 ]
> 100Mb - 1Gb ==> 1610 ]==============================
> 1Gb - 2Gb ==> 298 ]=====
> 2Gb - 3Gb ==> 75 ]=
> CPU time consumption
> 0 - 1min ==> 1729 ]==============================
> 1min - 10min ==> 228 ]===
> 10min - 1hour ==> 80 ]=
> 1hour - 6hours ==> 125 ]==
> Walltime consumption
> 0 - 1min ==> 199 ]===
> 1min - 10min ==> 1526 ]==============================
> 10min - 1hour ==> 285 ]=====
> 1hour - 6hours ==> 152 ]==
> }}}
> Yesterday, of course, there were a few long jobs, but so far that is peanuts,
> less than 1/2 percent of the total.
>
> But if ATLAS has changed something in its job distribution or submission
> strategy, or in anything else, we would certainly like to know about it.
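
For readers wondering how a picture like the one above can be drawn: below is a minimal sketch in Python. It is not Eygene's script, which does not appear in this thread. It assumes the durations arrive as Torque-style "HH:MM:SS" strings (as in the resources_used lines quoted above), reuses the bin labels from the report, and scales bars by integer division against the largest bin with a width of 30 characters, which is consistent with the bars shown in the report. The function names, the catch-all last bin, and the treatment of values that fall exactly on a bin edge are all assumptions made for illustration.

```python
from collections import Counter

# Upper bin edges in seconds; labels are copied from the report above.
# The final catch-all bin is an assumption, it does not occur in the report.
TIME_BINS = [
    (60, "0 - 1min"),
    (600, "1min - 10min"),
    (3600, "10min - 1hour"),
    (6 * 3600, "1hour - 6hours"),
    (24 * 3600, "6hours - 1day"),
    (48 * 3600, "1day - 2days"),
    (float("inf"), "2days and more"),
]

BAR_WIDTH = 30  # the longest bar in the report is 30 characters wide


def parse_hms(value):
    """Convert a Torque-style 'HH:MM:SS' duration to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def bin_label(seconds, bins=TIME_BINS):
    """Return the label of the first bin whose upper edge exceeds the value.

    A value exactly on an edge goes to the higher bin (an assumption).
    """
    for upper, label in bins:
        if seconds < upper:
            return label
    raise ValueError("no bin for value %r" % (seconds,))


def render(counts, bins=TIME_BINS, width=BAR_WIDTH):
    """Render non-empty bins as 'label ==> count ]====' lines."""
    if not counts:
        return []
    peak = max(counts.values())
    lines = []
    for _, label in bins:
        n = counts.get(label, 0)
        if n == 0:
            continue  # empty bins are skipped, as in the report
        bar = "=" * (n * width // peak)  # integer division rounds down
        lines.append("%s ==> %d ]%s" % (label, n, bar))
    return lines


def histogram(durations):
    """Build histogram lines from an iterable of 'HH:MM:SS' strings."""
    counts = Counter(bin_label(parse_hms(d)) for d in durations)
    return render(counts)
```

Calling render() with the walltime counts reported for 10.02.2012 (199, 1526, 285 and 152 jobs) gives bars of 3, 30, 5 and 2 characters, matching the report. The 48:47:45 walltime from Victor's message lands in the catch-all bin.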