Astrostatistics: Goodness-of-Fit and All That!
During the International X-ray Summer School, as a project presentation, I tried to explain the inadequate practice of χ^2 statistics in astronomy. If your best fit is biased (any misidentification of the model easily causes such a bias), do not use χ^2 statistics to obtain a 1σ error expecting a 68% chance of capturing the true parameter.
Later, I decided to investigate the subject further, and this paper came along: Astrostatistics: Goodness-of-Fit and All That! by Babu and Feigelson.
First, the authors pointed out that the χ^2 method 1) is inappropriate when errors are non-Gaussian, 2) does not provide clear decision procedures between models with different numbers of parameters or between acceptable models, and 3) makes it difficult to obtain confidence intervals on parameters when complex correlations between the parameters are present. As a remedy to the χ^2 method, they introduced distribution-free tests, such as the Kolmogorov-Smirnov (K-S) test, the Cramer-von Mises (C-vM) test, and the Anderson-Darling (A-D) test. Among these distribution-free tests, the K-S test is well known to astronomers, but it is often overlooked that the results from these tests become unreliable when the data come from a multivariate distribution. Furthermore, the K-S test fails when the same data set is used both for estimating parameters and for computing the empirical distribution function.
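For readers who want to try these tests, here is a minimal sketch in Python with SciPy (the toy Gaussian sample and the library calls are my own illustration, not an example from the paper; cramervonmises requires SciPy 1.6 or later):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.normal(loc=0.0, scale=1.0, size=200)  # toy univariate sample

# K-S and C-vM compare the empirical distribution with a fully
# specified null CDF, here the standard normal N(0, 1).
ks = stats.kstest(data, stats.norm.cdf)           # Kolmogorov-Smirnov
cvm = stats.cramervonmises(data, stats.norm.cdf)  # Cramer-von Mises
# stats.anderson estimates loc/scale internally and returns critical
# values already adjusted for that estimation.
ad = stats.anderson(data, dist='norm')            # Anderson-Darling

print(f"K-S:  D = {ks.statistic:.4f}, p = {ks.pvalue:.3f}")
print(f"C-vM: W = {cvm.statistic:.4f}, p = {cvm.pvalue:.3f}")
print(f"A-D:  A^2 = {ad.statistic:.4f}, 5% critical value = {ad.critical_values[2]:.3f}")
```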
The authors proposed resampling schemes to overcome these shortcomings, presenting both parametric and nonparametric bootstrap methods, and then advanced to model comparison, particularly for models that are not nested. The best-fit model can be chosen from among the candidate models based on their distances (e.g., the Kullback-Leibler distance) to the unknown hypothetical true model.
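To illustrate the parametric bootstrap remedy, here is a hedged sketch (again a toy Gaussian model of my own choosing, not an example from the paper): the K-S statistic is computed against the fitted CDF, and its null distribution is rebuilt by simulating from the fitted model and refitting each replicate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=1.0, scale=2.0, size=200)  # toy data set

# Fit the model on the same data that will be tested (the problematic step).
mu_hat, sig_hat = np.mean(data), np.std(data, ddof=1)

# K-S statistic against the *fitted* CDF; the tabulated K-S null
# distribution is invalid here because the parameters were estimated.
d_obs = stats.kstest(data, stats.norm(mu_hat, sig_hat).cdf).statistic

# Parametric bootstrap: simulate from the fitted model, refit each
# replicate, and recompute the K-S statistic to build the correct null.
B = 1000
d_boot = np.empty(B)
for b in range(B):
    x = rng.normal(mu_hat, sig_hat, size=data.size)
    m, s = np.mean(x), np.std(x, ddof=1)
    d_boot[b] = stats.kstest(x, stats.norm(m, s).cdf).statistic

p_value = np.mean(d_boot >= d_obs)  # calibrated goodness-of-fit p-value
print(f"D_obs = {d_obs:.4f}, bootstrap p-value = {p_value:.3f}")
```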
vlk (08-17-2007, 6:22 pm):
Is it a fair summary of their paper to say that “you can use K-S, C-vM, and A-D to do parameter estimation and goodness-of-fit, provided that you recalibrate the statistic used in each case with parametric or non-parametric bootstrap”? If so, surely you do the same with chi^2? What is the specific advantage of K-S, etc., over chi^2?
hlee (08-17-2007, 6:40 pm):
First, the K-S test is a distribution-free test. Second, χ^2 methods produce results from an approximation based on the χ^2 distribution. To satisfy such an approximation, the data and the model have to meet regularity conditions, which in general are never discussed in astronomical papers. To avoid having to check the regularity conditions, it is convenient to use a distribution-free test. Yet, these distribution-free tests are not always efficient.
Another point I'd like to make is that these distribution-free tests are just tests (H0: your model is a good fit vs. Ha: it is not). Parameter estimation is a separate process. It is well known that using the data to estimate parameters and then using the same data for testing introduces bias. The astronomers' χ^2 methods have not been paying attention to this bias, and the way a confidence interval is constructed from χ^2 is built on a bias-free regime.
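This bias is easy to demonstrate numerically. In the sketch below (a toy setup of my own), data that truly come from a Gaussian model are tested with the tabulated K-S null distribution after the parameters have been estimated from the same data; the rejection rate at the nominal 5% level falls far below 5%, so the test is badly miscalibrated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, trials, alpha = 100, 2000, 0.05
naive_reject = 0
for _ in range(trials):
    x = rng.normal(size=n)                      # data really are N(0, 1)
    m, s = np.mean(x), np.std(x, ddof=1)        # parameters estimated from x
    p = stats.kstest(x, stats.norm(m, s).cdf).pvalue  # tabulated K-S null
    naive_reject += (p < alpha)

# Typically far below the nominal 0.05 once the same data are used
# for both estimation and testing.
print(f"naive rejection rate: {naive_reject / trials:.3f} (nominal {alpha})")
```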
In conclusion, the summary is not fair. Your summary was not the authors’ intention.
vlk (08-17-2007, 7:27 pm):
OK, I see that they replace the usual goodness-of-fit with the Kullback-Leibler distance measure. But otherwise, intention or not, Babu & Feigelson do say in the abstract that they “combine K-S statistics with bootstrap resampling to achieve unbiased parameter confidence bands.” That reads to me as recalibrating their preferred statistic on the fly. Would you please therefore explain why my summary was not fair?
On the separate question of using the chi^2, please note that I am not suggesting the use of the standard chisq distribution, but the calculation of chi^2, the number, and a recalibration of chi^2, the distribution, the same way as B&F do for the K-S statistic. What is wrong with that?
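For concreteness, the recalibration I have in mind would look something like this sketch (a hypothetical one-parameter linear model with known Gaussian errors; the setup is mine, not B&F's): compute the minimum chi^2 on the data, then build its null distribution by parametric bootstrap rather than reading it off the tabulated chi^2 distribution.

```python
import numpy as np

rng = np.random.default_rng(11)
x = np.linspace(0, 1, 40)
sigma = 0.3

def min_chi2(y):
    # One-parameter linear model y = a*x fit by weighted least squares;
    # the minimizing a has a closed form.
    a_hat = np.sum(x * y) / np.sum(x * x)
    return a_hat, np.sum(((y - a_hat * x) / sigma) ** 2)

y_obs = 1.5 * x + rng.normal(0, sigma, size=x.size)
a_hat, chi2_obs = min_chi2(y_obs)

# Parametric bootstrap: simulate from the fitted model, refit, and
# collect the minimum chi^2 values to form the calibrated null.
chi2_boot = np.empty(2000)
for b in range(chi2_boot.size):
    y_sim = a_hat * x + rng.normal(0, sigma, size=x.size)
    chi2_boot[b] = min_chi2(y_sim)[1]

p_value = np.mean(chi2_boot >= chi2_obs)
print(f"chi2_obs = {chi2_obs:.2f}, bootstrap p-value = {p_value:.3f}")
```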
hlee (08-19-2007, 10:34 pm):
The K-S tests were built upon the null hypothesis F_n(x) = F(x), not upon \hat{\theta} = \theta_0 (a parameter). The confidence band is laid upon the empirical distribution; to put it in simple words, it is a confidence band on the estimated p-value. I didn't see that they included any parameter estimation and its confidence band.
The χ^2 method provides a best fit based on least-squares methods. If the errors of the response variable are normally distributed, then this least-squares method yields the same solution as the maximum likelihood estimator. Thanks to Wald (1949) and a series of subsequent works, we know that this ML estimator is consistent and asymptotically normal, so that the χ^2 number + 1 can lead to a 68% confidence interval (for one parameter). This interval is only valid when the ML estimator is consistent. Protassov et al. (2002) showed some of the regularity conditions on the model needed to reach this consistent estimator. I'm not sure all astronomical models satisfy these conditions for adopting the χ^2 method to get the desired confidence interval. Many suffer from problems of identifiability, the type of parameter space (compact, open, closed), continuity, or differentiability.
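As a toy illustration of the χ^2_min + 1 rule (a sketch of my own, assuming a one-parameter linear model with known Gaussian errors, where the rule is exact):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 30)
sigma = 0.5
y = 2.0 * x + rng.normal(0, sigma, size=x.size)   # one-parameter model y = a*x

# Scan chi^2 over a grid of the single parameter a.
a_grid = np.linspace(1.0, 3.0, 2001)
chi2 = np.array([np.sum(((y - a * x) / sigma) ** 2) for a in a_grid])

a_best = a_grid[np.argmin(chi2)]
inside = a_grid[chi2 <= chi2.min() + 1.0]          # Delta chi^2 <= 1 region
print(f"a_hat = {a_best:.3f}, 68% interval = [{inside.min():.3f}, {inside.max():.3f}]")
```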
Beyond the least-squares and maximum likelihood methods, there are numerous ways to obtain a best fit based on the conditions of the target model. Although I used strong words in saying your statement was not fair, if you say those physical models satisfy the regularity conditions, or that empirically the chosen model is always physically right, or that the χ^2 number is absolutely right due to physics, I have no reason to object to the traditional practice of χ^2.
However, I want to make the point that parameter estimation and hypothesis testing (including confidence intervals) are better kept separate. Unless you have very strong evidence of consistency, I don't think it is suitable to use your estimated parameter value for testing (or for a confidence interval) based upon the χ^2 method. You may use the restricted maximum likelihood (REML) estimator if you know the range (in general, many observations and parameters should be positive). You can penalize the χ^2. You can use the best linear unbiased estimator (BLUE, the least-squares equivalent for simple linear models). You can use M-estimators, and so on (see the sketch below). Once you show that these methods produce an estimator consistent for the true (unknown) parameter value, the modified χ^2 (modified via a penalty, a change of norm, a restriction on the parameter space, etc.) leads to a confidence interval for the parameter of interest.
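As one concrete instance from this list, here is a hedged sketch of an M-estimator fit using SciPy's Huber loss (a toy linear model with injected outliers; the particular numbers and names are my own):

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.size)
y[::10] += 8.0  # a few outliers that would bias ordinary least squares

def residuals(theta):
    a, b = theta
    return y - (a * x + b)

# loss='huber' turns plain least squares into a Huber M-estimator,
# down-weighting the outliers instead of squaring them.
fit = least_squares(residuals, x0=[1.0, 0.0], loss='huber', f_scale=1.0)
print("robust slope, intercept:", fit.x)
```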
I know I have beaten around the bush. The cited paper does not provide an absolute solution, but it shows how to improve the current practice of goodness-of-fit tests and why the bootstrap works to improve the K-S tests. The more complicated (physical) models become, the more carefully the statistics must be chosen, which I consider to have been ignored in the astronomical community. If one wants to summarize data (extract information through statistics), one should use correct statistics. This hasn't happened with a 100% success rate; I hope the rate goes up, close to 100%.
[Comment: Regularity conditions, consistency, asymptotics, and other arcane terminology drive astronomers away from classical statistics and toward Bayesian statistics, as long as physical models exist.]
hlee (10-01-2007, 2:07 pm):
Dr. Babu, Director of the Center for Astrostatistics, kindly sent a comment on his paper for the slog:
Gracie (11-19-2008, 1:15 pm):
“If one wants to summarize data (extract information through statistics), one should use correct statistics.” Agree. Don't compare apples to oranges.
Joseph Hilbe (03-14-2009, 5:32 pm):
Everyone interested in astrostatistics is invited to attend the Astrostatistics Interest Group business and papers meetings at the International Statistical Institute's biennial meetings from August 16-22, 2009. We are scheduled to have the business/organizational meeting on August 21 at the Durban, South Africa Convention Center at 11:15 AM. If interested, please contact me at either hilbe -at- asu.edu or j.m.hilbe -at- gmail.com.
Joseph Hilbe
Chair, ISI Astrostatistics Interest Group
JPL/Caltech and Arizona State University