Debugging Batch LRP crashes

This is the start of a procedure for determining what is causing spike to crash, whether in the batch_lrp or the batch_marker code.

Typically, SPIKE worked the day before and now does not, so something in the system has probably changed.

How to figure it out.

Batch_lrp

First, look in the log files. They are currently stored in:

/data/aphotic1/lrp/batch_lrp_log.xxx

where xxx is mon,tue,wed,...
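
As a minimal sketch, the log for a given day could be located as follows. The helper name lrp_log_path is hypothetical, and the sketch assumes the suffix is the lowercase three-letter English day abbreviation, as the mon, tue, wed pattern suggests.

```shell
# Hypothetical helper: build the batch_lrp log path for a day name.
# Assumes the suffix is the lowercase three-letter day abbreviation.
lrp_log_path() {
    day=$(printf '%s' "$1" | cut -c1-3 | tr '[:upper:]' '[:lower:]')
    printf '/data/aphotic1/lrp/batch_lrp_log.%s\n' "$day"
}

# Example: show the last 40 lines of today's log.
# tail -n 40 "$(lrp_log_path "$(LC_ALL=C date +%a)")"
```

Forcing LC_ALL=C in the example keeps date from printing a localized day name.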

Look at the last entry in the log file. If it says something about not enough memory, examine the machine on which the process runs and see if there is another spike image running on it. If so, that process needs to be killed.
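
A rough sketch of the check for a leftover image, assuming the process command line contains the word spike (the actual process name is not given here, and the helper name filter_spike_procs is hypothetical):

```shell
# Hypothetical helper: filter a process listing (on stdin) for lines that
# mention spike, skipping the grep command itself.
filter_spike_procs() {
    grep -i 'spike' | grep -v 'grep'
}

# Example on the machine that runs batch_lrp (ps flags vary by system):
# ps -ef | filter_spike_procs
# After confirming that a listed image is a stray one, kill it by its PID:
# kill PID
```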

If it died in the middle of an activity, determine which proposal it was working on. Try running that particular proposal through a simulation of the batch_lrp process and try to duplicate the crash.

To simulate the steps that the batch_lrp uses:

Start a new spike image in emacs:

(setf *casm-break-on-errors* t) ;;might help Larry in debugging

(load-a-control-file "nocturnal-control-file")

(load-props :props '(pppp))

(load-pws "97017A")

(run-scheduler :sogs-start "97.044" :sogs-end "97.250" :to-schedule :both)

If that does not work, try to make a list of the visits that changed since the last time it worked correctly:

cd $TRANS

ls -l *tic | grep 'Mmm dd' > ~/a.a

where Mmm dd is Jan 23, Jul  3, etc. Be careful about the exact date format: ls pads a single-digit day, so you must put two spaces between the month and the day. From the visit list in ~/a.a, form a proposal list and run it through the batch process.
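
The padding rule can be sketched as a small helper that builds the grep pattern. The name ls_date_pattern is hypothetical, and the sketch assumes ls -l prints English month abbreviations.

```shell
# Hypothetical helper: build the 'Mmm dd' pattern the way ls -l prints it,
# padding a single-digit day with a leading space.
ls_date_pattern() {
    day=${2#0}                  # drop a leading zero so 08 is not read as octal
    printf '%s %2d\n' "$1" "$day"
}

# Example: list the visits changed on Jul 3 and save them.
# cd "$TRANS" && ls -l *tic | grep "$(ls_date_pattern Jul 3)" > ~/a.a
```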

If that does not work, determine whether a calendar was loaded. If so, make an SU list from that calendar, turn it into a proposal list, and run that through the batch process.

If that does not work, examine the control files, the autoscheduler criteria files, etc. and see if they were changed in the failure time frame. If the problem is in one of these files, it should show up when the above process is run on any external unexecuted proposal.
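
One way to see which files changed in the failure time frame is to compare them against a reference file stamped with the last known good time. This is only a sketch: the helper name changed_since, the reference file, and the directory are placeholders, since the location of the control files is not given here.

```shell
# Hypothetical helper: list regular files under a directory that were
# modified more recently than a reference file.
changed_since() {
    find "$1" -type f -newer "$2" -print
}

# Example: stamp a reference file with the time of the last good run
# (timestamp and directory are placeholders), then compare.
# touch -t 199701230000 /tmp/last_good
# changed_since /path/to/control-files /tmp/last_good
```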

If that does not work, load all the proposals that the lrp uses, and when it crashes call a software developer:

Start a new spike image in emacs:

(setf *casm-break-on-errors* t) ;;might help Larry in debugging

(load-a-control-file "nocturnal-control-file")

(load "/marian/u1/kinzel/find-lrp.lis")

(load-props)

(load-pws "97017A")

(run-scheduler :sogs-start "97.044" :sogs-end "97.250" :to-schedule :both)

Note that these steps are not sequential. You could try the different proposal lists on different machines at the same time.

Also, depending on the crash, spike might write out an image:

/data/aphotic1/lrp/images/crash-odd.image (or crash-even.image).

This image should be cleaned up as soon as it has been analysed. First, it can be fairly large. Second, if spike crashes again and tries to save to an existing image, it will die without overwriting it, so the information from the new crash can be lost.
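
A small sketch for spotting leftover crash images and their sizes. The helper name list_crash_images is hypothetical, and the sketch assumes the images follow the crash-*.image naming shown above.

```shell
# Hypothetical helper: report crash images in a directory with their sizes
# in kilobytes, so leftovers can be spotted and removed once analysed.
list_crash_images() {
    for f in "$1"/crash-*.image; do
        [ -e "$f" ] || continue     # the pattern matched nothing
        du -k "$f"
    done
}

# Example (pass the images directory named above):
# list_crash_images /path/to/lrp/images
```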

batch_markers problems

While not a crash, it can still cause problems:

We get email for the lrp account:
Your "cron" job

/home/lrp/batch_markers.tcl spike-control-file m aphotic.sogs.stsci.edu gandalf.sogs.stsci.edu mingus.sogs.stsci.edu
hal.sogs.stsci.edu shanara.sogs.stsci.edu styx.sogs.stsci.edu

produced the following output:

permission denied
This has been traced to a problem in one of the processes that the marker queue runs. To find it, do a directory listing:


cd /cerb/data1/operational/markers
ls *error*
6904-marker-error.acf
Displaying the error file should identify the offending file:


cat 6904-marker-error.acf

(load-marker-error-info :prop 6904  :version "C" :marker-type "ACF" :start 3064010709 :end 3064015035 :time 4326 :user "LRP" :machine "gullveig" 
:error-string "creating #p\"/cerb/data1/operational/casm-diagnostics/6904-97.cv-diag\"
(which translates to
\"/cerb/data1/operational/casm-diagnostics/6904-97.cv-diag\") resulted in
error: Permission denied."
-- many more lines ---
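
Rather than reading the whole file, the path can be pulled out with sed. This is a sketch under the assumption that the error file stores the path as #p\"...\" exactly as in the listing above; the helper name marker_error_path is hypothetical.

```shell
# Hypothetical helper: print the path quoted after #p in a marker error file.
# Assumes the path appears as #p\"...\" with literal backslashes.
marker_error_path() {
    sed -n 's/.*#p\\"\([^\\"]*\)\\".*/\1/p' "$1"
}

# Example: go straight to the long listing of the offending file.
# ls -l "$(marker_error_path 6904-marker-error.acf)"
```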

A listing should show the problem:


> ls -l /cerb/data1/operational/casm-diagnostics/6904-97.cv-diag
-rw-r--r--   1 aroman   spb          166 Dec  4 14:10 /cerb/data1/operational/casm-diagnostics/6904-97.cv-diag

This shows that Tony (aroman) is the owner and that no one else in the group, including spike, can change the file. Look to see if other files have the same protection problem and ask the owners to do a chmod g+w on the files. That should solve the problem.
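
To look for other files with the same protection problem, a sketch along these lines may help. The helper name not_group_writable is hypothetical; the find test ! -perm -020 selects files whose group write bit is not set.

```shell
# Hypothetical helper: long-list regular files in a directory tree that lack
# group write permission, so that their owners can be contacted.
not_group_writable() {
    find "$1" -type f ! -perm -020 -exec ls -l {} +
}

# Example:
# not_group_writable /cerb/data1/operational/casm-diagnostics
```

The long listing includes the owner column, which shows whom to ask for the chmod g+w.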