Start times in the 4 wapps differ in pulsar observing
06jan07
An archive of the headers from the wapp
pulsar files is made each month. At the end of the month, all of the pulsar
files on disc are scanned and the headers are archived. Many pulsar data
files are removed from the online disc before the end of each month, so
this process only archives a fraction of the data actually taken.
An observation can use 1 to 4 wapps. When starting
an observation, the 4 wapps are told to configure themselves and then start
taking data on the next hardware 1 second tick. The start time for each
wapp is stored in the header for that wapp's datafile.
The start of data taking is determined by a hardware
1 second tick (locked to the hydrogen maser). The time recorded in the
header comes from the local clock on each wapp. This time is maintained
by the ntp daemon on each wapp.
There has been a problem where the time on each
wapp was drifting (see ntp
problems with wapps). This seemed to clear up when the wapp linux kernels
were upgraded.
I recently went through the wapp pulsar archive and
looked at the number of times the start times for the different wapps used
in an observation differed. The data covered jan05 through jan07.
The time difference was usually 1 second.
The plots show when
the wapp start_times differed (.ps) (.pdf):
-
Top plot. Total observations, total mismatched observations: The
black line is the total number of observations that were looked at by month.
The red line is the total number of these observations that had mismatches
in at least one of the header start times.
-
Bottom: The fraction of time there was a mismatch: This
is the number of mismatches divided by the total observations. There
was a large drop in jun05 when the wapp4 kernel was updated. This update
solved the ntp time drift. The fraction of time there is a mismatch
still runs between 5 and 20%.
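The two plotted quantities can be reproduced with a few lines of code. This is only a minimal sketch: the record layout (a month label plus the list of header start times, in seconds, for the wapps used) and the function name mismatch_fraction_by_month are assumptions made for illustration, not the archive's real format, and the example records are made up.

```python
from collections import defaultdict

def mismatch_fraction_by_month(observations):
    """observations: iterable of (month, start_times), where start_times is the
    list of header start times (seconds) for the wapps used in one observation.
    Returns {month: (total, mismatched, fraction)}.
    The record layout is hypothetical, chosen only for this sketch."""
    totals = defaultdict(int)
    mismatched = defaultdict(int)
    for month, start_times in observations:
        totals[month] += 1
        # A mismatch means at least one header start time differs from the others.
        if len(set(start_times)) > 1:
            mismatched[month] += 1
    return {m: (totals[m], mismatched[m], mismatched[m] / totals[m])
            for m in totals}

# Made-up example records, not values from the archive.
example = [
    ("jan05", [100, 100, 100, 100]),
    ("jan05", [200, 201, 200, 200]),   # one wapp is 1 second late
    ("feb05", [300, 300]),
]
stats = mismatch_fraction_by_month(example)
```

Here stats maps each month to the total (black line), the mismatched count (red line) and their ratio (bottom plot).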
The question is whether the start time in the header
is the real start time (one or more wapps started late) or whether all of the
wapps started together and the time in the header is wrong. When the ntp time
was drifting then the time in the header could have been wrong. After the
kernels were upgraded my guess is that the times in the header are correct
and the wapps are actually starting at different times.
How pulsar observations are
started.
As far as I can tell, the wapp startup sequence
for pulsar observing is:
-
The cima gui (wapp_sendhdr() in wapp.tcl) loops over the wapps, filling in the
wapp header for each wapp and then sending the message to wappcon
on that wapp. Each wapp receives the request spaced by the time it
takes to go through this loop (it looks like computing the polyco coefficients
might take a little while).
-
wappcon on each wapp receives the wapp header from a socket. It
waits for wapprt to be not busy and then loads
the header into shared memory for wapprt to grab.
-
wapprt gets the header out of shared memory and then prepares to
take data. This includes:
-
Configures the wapp for requested data taking mode.
-
Checks available disc space on all discs
-
Preallocates all files needed for the observation.
-
It then waits for the clock time to be .5 seconds before the next second.
It does this by:
-
Get the current fraction of a second (FracSec) from the cpu time.
-
If FracSec < .5 seconds, wait for .5 - FracSec. You should wake
up at .5 seconds before the next tick. You stay in the current second.
-
If FracSec > .5 then wait for 1.5 - FracSec. This will put you at .5 seconds
into the next second.
-
The above algorithm has a built-in race condition:
-
If a wapp gets to the wait code before .5 seconds then it starts on the
next tick.
-
If a wapp gets to the wait code on or after .5 seconds then it will wait
1 extra second.
-
The algorithm has the following problems:
-
The requests to each wapp are not issued simultaneously.
-
After receiving the start request different wapps could have different
setup times (especially allocating and checking the discs).
-
There does not seem to be any time synchronization when the initial request
from the gui is sent. It is sent whenever the user pushes the button. You'd
like to see it happen around a 1 second tick. That would give the maximum time
before the .5 second threshold arrives. The worst time would be if the user pushed
the button a few milliseconds before .5 seconds of the current second.
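The wait step and its race can be sketched in a few lines. This is an illustration of the algorithm as described above, not the wapprt source: the function names wait_time and start_tick are made up, and a FracSec of exactly .5 is treated as the "on or after .5" case.

```python
import math

def wait_time(frac_sec):
    """Seconds to sleep so that we wake up .5 seconds into a second.
    Follows the description above; exactly .5 is treated as 'on or after .5'
    (an assumption)."""
    if frac_sec < 0.5:
        return 0.5 - frac_sec      # stay in the current second
    return 1.5 - frac_sec          # wake at .5 of the next second

def start_tick(arrival_time):
    """Hardware tick (integer second) on which data taking starts for a wapp
    that reaches the wait code at arrival_time (seconds)."""
    frac_sec = arrival_time - math.floor(arrival_time)
    wake = arrival_time + wait_time(frac_sec)
    # Data taking starts on the first 1 second tick after waking up.
    return math.floor(wake) + 1

# Two wapps reaching the wait code 20 milliseconds apart,
# on either side of the .5 second threshold.
early = start_tick(10.49)
late = start_tick(10.51)
print(early, late)
```

With these two arrival times the first wapp starts on tick 11 and the second on tick 12, which is the 1 second difference seen in the headers.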