Kaplan-Meier Estimator (Equation of the Week)
The Kaplan-Meier (K-M) estimator is the non-parametric maximum likelihood estimator of the survival probability of items in a sample. “Survival” here is a historical holdover, because the method was first developed to estimate patient survival chances in medicine, but in general it can be thought of as a form of cumulative probability. It is of great importance in astronomy because so much of our data comes as upper or lower limits, and this estimator provides an excellent way to estimate the fraction of objects that may be below (or above) certain flux levels. The application of K-M to astronomy was explored in depth in the mid-1980s by Jurgen Schmitt (1985, ApJ, 293, 178), Feigelson & Nelson (1985, ApJ, 293, 192), and Isobe, Feigelson, & Nelson (1986, ApJ, 306, 490). [See also Hyunsook's primer.] It has been coded up and is available for use as part of the ASURV package.
Consider a simple case where you have observations of the luminosities of N sources. Let us say that all N sources have been detected, with luminosities L_i, i=1..N, ordered such that L_i < L_{i+1}. Then it is easy to see that the fraction of sources above each L_i can be written as the sequence
{N-1, N-2, N-3, …, 2, 1, 0}/N
The K-M estimator is a generalized form that describes this sequence, and is written as a product. The probability that an object in the sample has luminosity greater than L_k is
S(L > L_1) = (N-1)/N
S(L > L_2) = (N-1)/N * ((N-1)-1)/(N-1) = (N-1)/N * (N-2)/(N-1) = (N-2)/N
S(L > L_3) = (N-1)/N * ((N-1)-1)/(N-1) * ((N-2)-1)/(N-2) = (N-3)/N
…
S(L > L_k) = Π_{i=1..k} (n_i - 1)/n_i = (N-k)/N
where n_i is the number of objects still remaining at luminosity level L ≥ L_i, and at each stage the count is decremented by one to account for the drop in the sample size.
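For concreteness, here is a minimal sketch of this bookkeeping in Python (the function name km_detected and the example luminosities are purely illustrative, not part of any package):

```python
import numpy as np

def km_detected(L):
    """K-M survival estimate when every source is detected.

    Returns, for each sorted luminosity L_k, the fraction of the
    sample with luminosity above L_k, built as the running product
    prod_{i<=k} (n_i - 1)/n_i = (N - k)/N.
    """
    L = np.sort(np.asarray(L, dtype=float))
    N = len(L)
    n = N - np.arange(N)          # n_i = objects still in the sample at L_i
    S = np.cumprod((n - 1) / n)   # (N-1)/N, (N-2)/N, ..., 0/N
    return L, S

# Illustrative data: five detected luminosities (arbitrary units)
L, S = km_detected([3.0, 1.0, 4.0, 1.5, 2.0])
print(S)  # [0.8 0.6 0.4 0.2 0. ] -- matches {N-1, ..., 1, 0}/N
```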
That was for the case when all the objects are detected. But now suppose some are not, and only upper limits to their luminosities are available. A specific value of L cannot be assigned to these objects, and the only thing we can say is that they will “drop out” of the set at some stage. In other words, the sample will be “censored”. The K-M estimator is easily altered to account for this, by changing the decrement in each term of the product to include the censored points. Thus, the general K-M estimator is
S(L > L_k) = Π_{i=1..k} (n_i - c_i)/n_i
where c_i is the number of objects that drop out between L_{i-1} and L_i.
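Here is a sketch of how the censored product might be computed, using the common product-limit convention in which detections supply the decrement at each level while upper limits simply shrink the at-risk pool between levels (the name km_censored and the toy data are illustrative):

```python
import numpy as np

def km_censored(L, detected):
    """Product-limit (K-M) estimate of S(L' > L) with upper limits.

    L        : measured luminosities (detections) or upper limits
    detected : boolean array, True where L is a detection

    Detections contribute a factor (n - d)/n at each distinct detected
    value; upper limits only reduce the at-risk count n between levels.
    """
    L = np.asarray(L, dtype=float)
    detected = np.asarray(detected, dtype=bool)
    order = np.argsort(L)
    L, detected = L[order], detected[order]

    S = 1.0
    levels, surv = [], []
    n = len(L)                        # objects still "at risk"
    i = 0
    while i < len(L):
        j, d = i, 0
        while j < len(L) and L[j] == L[i]:
            d += detected[j]          # detections at this level
            j += 1
        if d > 0:
            S *= (n - d) / n
            levels.append(L[i])
            surv.append(S)
        n -= (j - i)                  # detections and limits alike leave
        i = j
    return np.array(levels), np.array(surv)

# Example: three detections and two upper limits (illustrative values)
L   = [1.0, 2.0, 2.5, 3.0, 4.0]
det = [True, True, False, True, False]
print(km_censored(L, det))   # levels [1, 2, 3], S = [0.8, 0.6, 0.3]
```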
Note that the K-M estimator is a maximum likelihood estimator of the cumulative probability (actually one minus the cumulative probability as it is usually understood), and uncertainties on it must be estimated via Monte Carlo or bootstrap techniques [or not: see the comments below].
brianISU:
Quick note: one is not limited to Monte Carlo or bootstrap techniques for an understanding of the uncertainty in the estimation. There are large-sample pointwise approximations to the estimated CDF that use a Taylor approximation to S(L) (otherwise known as the delta method, or propagation of error). If one is comfortable with this approach, the approximate variance of S(L) is given by what is known as Greenwood’s formula, which then leads to approximate confidence intervals for S(L). However, if the sample size is not large, one can get nonsense intervals like S(L) < 0. Another approach is then to apply the logit function, base confidence intervals on the sampling distribution of this transformation, and take the preimage of the logit function to translate the logit intervals back to the probability scale. These can be quick and dirty ways to get an understanding of the uncertainty involved in the estimation. See the book “Statistical Methods for Reliability Data” by Meeker and Escobar for more info.
http://www.amazon.com/Statistical-Methods-Reliability-Probability-Statistics/dp/0471143286/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1215652613&sr=8-1
07-09-2008, 9:21 pm
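As a rough illustration of the Greenwood variance and the logit-transformed intervals described above, here is a hedged sketch (numpy only; the function name, the z = 1.96 multiplier for ~95% coverage, and the toy inputs are illustrative choices):

```python
import numpy as np

def greenwood_ci(S, d, n, z=1.96):
    """Approximate pointwise intervals for a K-M estimate.

    S : K-M survival values at each distinct detected level
    d : number of detections at each level
    n : number at risk at each level
    (Assumes 0 < S < 1 and n > d so the formulas stay finite.)
    """
    S, d, n = (np.asarray(a, dtype=float) for a in (S, d, n))
    var = S**2 * np.cumsum(d / (n * (n - d)))   # Greenwood's formula
    se = np.sqrt(var)

    plain = (S - z * se, S + z * se)            # can spill outside [0, 1]

    # Delta method on logit(S) = log(S/(1-S)), then back-transform;
    # this keeps the interval inside (0, 1) by construction.
    se_logit = se / (S * (1 - S))
    lo = S / (S + (1 - S) * np.exp(z * se_logit))
    hi = S / (S + (1 - S) * np.exp(-z * se_logit))
    return plain, (lo, hi)

# Example with the toy numbers from the censored sketch above:
print(greenwood_ci(S=[0.8, 0.6, 0.3], d=[1, 1, 1], n=[5, 4, 2]))
```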
vlk:
Thanks for the correction! I didn’t know about this at all.
07-09-2008, 9:45 pm
brianISU:
No problem. I find it interesting that there are bootstrap methods as well. Do you perhaps have an example of this that I could study? I am really interested in the bootstrap.
07-09-2008, 10:17 pm
vlk:
Take a look at Appendix G of Schmitt 1985 linked to above. This paper is very detailed, but the following papers that cite it (usually because they have done some kind of bootstrap KM estimator of X-ray luminosity functions) largely skip over the methodology.
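For a flavor of the resampling idea, here is a generic sketch (not necessarily Schmitt's exact Appendix G procedure; it reuses the km_censored sketch from the post above, and the helper name bootstrap_km is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_km(L, detected, levels, nboot=1000):
    """Bootstrap scatter of a K-M curve at fixed luminosity levels.

    Resamples the (L, detected) pairs with replacement and recomputes
    the estimate each time with km_censored (defined above), returning
    the standard deviation of S(L > level) over the resamples.
    """
    L = np.asarray(L, dtype=float)
    detected = np.asarray(detected, dtype=bool)
    curves = np.empty((nboot, len(levels)))
    for b in range(nboot):
        idx = rng.integers(0, len(L), len(L))
        lv, S = km_censored(L[idx], detected[idx])
        if len(lv) == 0:             # pathological resample: no detections
            curves[b] = 1.0
            continue
        # Step-function lookup: S at the largest detected level <= x
        pos = np.searchsorted(lv, levels, side="right") - 1
        curves[b] = np.where(pos >= 0, S[np.maximum(pos, 0)], 1.0)
    return curves.std(axis=0)

# e.g., with the toy censored data from the earlier sketch:
# bootstrap_km(L, det, levels=[1.0, 2.0, 3.0])
```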
07-09-2008, 11:01 pm
vlk:
PS: I notice now that Feigelson and Nelson (1985) actually discuss the Greenwood formula in their paper!
07-09-2008, 11:07 pm
hlee:
Thanks, both! Your discussion helped guide me through survival analysis. Vinay's post particularly helped me relate the survival curve to the luminosity function. Once I retrieve the relevant statistics papers I saw years back (I couldn't comprehend them at the time), I'll put up some references, and probably examples if there are some similarities.
I'd like to ask an off-the-track question about the bootstrap and Monte Carlo. Bootstrapping, whether parametric or nonparametric (or some other modification), involves resampling from the data, whereas Monte Carlo, regardless of whether the parameters are known or not, simulates samples from a distribution with known or estimated parameters (the latter uses the data to estimate the parameters and brings up testing issues, like goodness-of-fit tests). One impression I have obtained from reading astro-ph preprints is that the word bootstrap is sometimes quoted as a replacement for the Monte Carlo method.
No one taught me what Monte Carlo is; I just acquired it through books and papers. Sometimes I feel that the concept of Monte Carlo simulation differs across disciplines (based on conversations with engineering students).
07-10-2008, 1:42 pm
vlk:
I suspect that “Monte Carlo bootstrap” is a term and technique only astronomers use: if you have measurements y_i, i=1..N, you draw N times from {y_i} with replacement (the bootstrap part), and then jiggle them according to their measurement errors sigma(y), usually as N(y_i, s_i) (the Monte Carlo part), before computing whatever summary statistic you are interested in. There may be other variations on that.
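A minimal sketch of that recipe (the arrays y and s for the measurements and their one-sigma errors, and the choice of summary statistic, are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def mc_bootstrap(y, s, stat=np.median, nsim=1000):
    """The "Monte Carlo bootstrap" recipe described above.

    Draw N points from {y_i} with replacement (bootstrap), jiggle each
    draw by its Gaussian error (Monte Carlo), compute the summary
    statistic, and report the spread of the results as the uncertainty.
    """
    y = np.asarray(y, dtype=float)
    s = np.asarray(s, dtype=float)
    out = np.empty(nsim)
    for k in range(nsim):
        idx = rng.integers(0, len(y), len(y))   # resample with replacement
        jiggled = rng.normal(y[idx], s[idx])    # perturb by N(y_i, s_i)
        out[k] = stat(jiggled)
    return out.std()

# Illustrative measurements and errors:
print(mc_bootstrap(y=[1.2, 0.9, 1.5, 1.1], s=[0.1, 0.2, 0.1, 0.15]))
```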
07-10-2008, 5:14 pm
brianISU:
Another note, just for completeness. One thing that I recommend doing with the K-M estimate in hand is to plot it on some sort of “probability paper” (I figured I'd reference a term from an earlier topic). This is a simple and effective way to test distributional fit, especially if the CDF has been linearized: if the distribution is a good fit, it will appear as a straight line on the plot, which is much easier to recognize than a curve.
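For instance, a minimal sketch using Weibull probability paper, where S(L) = exp(-(L/a)^b) linearizes as log(-log S) versus log L (the model choice and the numbers are purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Suppose levels, S come from a K-M fit such as km_censored above;
# drop any S = 0 tail before taking logs. Toy values for illustration:
levels = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
S      = np.array([0.9, 0.7, 0.5, 0.3, 0.1])

# If S(L) = exp(-(L/a)^b), then log(-log S) = b*log L - b*log a,
# a straight line in log L when the Weibull model fits.
plt.plot(np.log(levels), np.log(-np.log(S)), "o")
plt.xlabel("log L")
plt.ylabel("log(-log S)")
plt.title("K-M estimate on Weibull probability paper")
plt.show()
```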
07-10-2008, 9:46 pm
hlee:
Are there proofs or empirical studies from a mathematics/statistics viewpoint that this Monte Carlo Bootstrap works?
07-15-2008, 1:03 pm
vlk:
Not that I know of.
From an empirical standpoint, I haven’t seen it give nonsense results. And why should it? It is a frequentist universe on your desktop.
07-15-2008, 1:32 pm
hlee:
You mean Bayesian, right? According to your description, “jiggle them” corresponds to Monte Carlo or “data augmentation”, and the bootstrap corresponds to estimation or “parameter draws”, if I make an analogy with Gibbs sampling. I don't think a frequentist would do the bootstrap and Monte Carlo at the same time; they complement each other in some sense, i.e., doing both is redundant. Bootstrap describes a methodology for statistical inference, and Monte Carlo implies a random-number-generating mechanism. I was asking for references with sensible results, since there's almost no chance that someone would just throw out a Monte Carlo bootstrap without justification (proofs or empirical results).
I understand you don't like the bootstrap, and I don't like methods with heavy computations (the Monte Carlo bootstrap sounds wasteful to me). Unless someone makes a point about why this Monte Carlo bootstrap works (theoretical) or clearly shows the results (empirical), we cannot discuss whether the results are sensible or not.
07-16-2008, 10:19 am
hlee:
This commenting within a comment cannot be edited. There are things called the bootstrap after bootstrap and the jackknife after bootstrap, which I want to distinguish from the Monte Carlo bootstrap.
07-16-2008, 10:26 am
vlk:
No, not Bayesian. Frequentist. Like simulating large numbers of independent observations non-parametrically.
07-16-2008, 2:27 pm