Locfit: An Introduction

By Catherine Loader

Statistical Computing and Graphics Newsletter, April 1997

1  What is Locfit?

Locfit is a software package performing local regression, local likelihood, and related smoothing procedures. It is designed to be used in S/S-Plus, making extensive use of the data management and graphical facilities of that language. The interface uses the modeling language and the classes and methods of S, so users familiar with the existing modeling software in S (Chambers and Hastie 1992) should find Locfit easy to use. Most of the numerical routines of Locfit are written in C, and it is also possible to use Locfit as a stand-alone C program. Locfit can be obtained via the WWW at http://www.locfit.info/; substantial online documentation can also be found at this address.

2  Local Regression

Local regression was applied in a variety of fields in the late 19th and early 20th centuries; see, for example, Henderson (1916). The current popularity of local regression as a statistical procedure is largely due to the Lowess procedure (Cleveland 1979) and Loess (Cleveland and Devlin 1988).

The underlying model for local regression is

    Y_i = \mu(x_i) + \epsilon_i;

the function \mu(x) is assumed to be smooth and is estimated by fitting a polynomial model (most commonly, linear or quadratic) within a sliding window. That is, for each fitting point x, we consider a locally weighted least squares criterion:

    \sum_{i=1}^{n} W\left(\frac{x_i - x}{h}\right) \left(Y_i - (a_0 + a_1 (x_i - x))\right)^2.    (1)

By default, Locfit uses the tricube weight function

    W(v) = (1 - |v|^3)^3 for |v| < 1, and W(v) = 0 otherwise.

The bandwidth h controls the smoothness of the fit. A large h may result in oversmoothing, or miss important features in the data, while a small h may result in a fit that is too noisy. The simplest choice is to take h constant; often it may be desirable to vary h with the fitting point x.

The local least squares criterion (1) is easily minimized to produce estimates \hat{a}_0 and \hat{a}_1. The local linear estimate of \mu(x) is \hat{\mu}(x) = \hat{a}_0. Note that each least squares problem produces \hat{\mu}(x) for a single point x; to estimate at additional points, the local weights change and a new least squares problem must be solved.

Our example uses the ethanol dataset, studied extensively in Cleveland (1993), and fits a local quadratic model:

> fit.et <- locfit(NOx~E, data=ethanol,
+   alpha=0.5)
> plot(fit.et, get.data = T)

Three arguments are given to the locfit() function. The model formula, NOx~E, specifies a response variable NOx and a predictor E. The data=ethanol argument provides a data frame where the variables may be found. The smoothing parameter is given by alpha=0.5; this gives a nearest-neighbor based bandwidth covering 50% of the data. Figure 1 shows the plot of the fit.

Bivariate local regression arises when there are two predictor variables, x_i = (x_{i,1}, x_{i,2}). The localization weights then become

    w_i(x) = W\left(\frac{\|x_i - x\|}{h}\right),

where \|\cdot\| denotes the Euclidean norm. The local quadratic model around a point x = (u, v) is

    \mu(x_i) \approx a_0 + a_1 (x_{i,1} - u) + a_2 (x_{i,2} - v) + a_3 (x_{i,1} - u)^2 + a_4 (x_{i,1} - u)(x_{i,2} - v) + a_5 (x_{i,2} - v)^2.

These terms are substituted into the local least squares criterion (1). Again, we minimize over the coefficients and take \hat{\mu}(x) = \hat{a}_0.
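As a concrete illustration of the local least squares criterion (1), here is a minimal Python sketch of local linear fitting with the tricube weight function. This is an independent illustration of the mathematics, not the Locfit implementation; the function names are mine.

```python
import numpy as np

def tricube(v):
    """Tricube weight: (1 - |v|^3)^3 for |v| < 1, else 0."""
    v = np.abs(v)
    return np.where(v < 1, (1 - v**3)**3, 0.0)

def local_linear(x0, x, y, h):
    """Minimize the locally weighted least squares criterion (1)
    at fitting point x0; return the local intercept, mu-hat(x0)."""
    w = tricube((x - x0) / h)
    X = np.column_stack([np.ones_like(x), x - x0])
    WX = X * w[:, None]
    # Weighted normal equations: (X'WX) a = X'Wy
    a = np.linalg.solve(WX.T @ X, WX.T @ y)
    return a[0]  # mu-hat(x0) = a0-hat
```

As the text notes, a separate least squares problem must be solved at each fitting point, since the weights change with x0.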



Figure 2: Bivariate local quadratic fit for the ethanol dataset.

Figure 1: Local quadratic fit for the ethanol dataset.

In Locfit, multivariate local regression is requested by adding additional terms on the right-hand side of the formula. For example, the ethanol dataset contains a second predictor variable, C:

> fit.et2 <- locfit(NOx~E+C, data=ethanol,
+   alpha=0.5, scale=0)
> plot(fit.et2, type="persp")

The scale argument allows the user to specify different scales for each variable, used in computing the neighborhood weights. When scale=0 is given, each variable is scaled by its standard deviation. The two-dimensional fit can be displayed in several formats: contour plots (the default); perspective plots (Figure 2); and cross sections, using trellis displays (Becker, Cleveland, and Clark 1997).

3  Local Likelihood

Local likelihood fitting was developed by Tibshirani (1984) and Tibshirani and Hastie (1987). The procedure is applicable in situations, such as binary data, where an additive Gaussian model is inappropriate as an error structure. In local likelihood, we simply replace the local least squares criterion by an appropriate local log-likelihood criterion. For binary data, the local log-likelihood is

    \sum_{i=1}^{n} w_i(x) \left(Y_i \log(p_i) + (1 - Y_i) \log(1 - p_i)\right),

where p_i = p(x_i). We could model p(x) directly using local polynomials; however, it is usually preferable (and the Locfit default) to model via the logistic link function, \theta(x) = \log(p(x)/(1 - p(x))). As in the local regression case, we approximate \theta(x) locally by a polynomial, then choose the polynomial coefficients to maximize the likelihood.

By changing the likelihood and weighting scheme, local likelihood estimates can be obtained in numerous different settings. Those supported in Locfit include likelihood regression models; density estimation; conditional hazard rate estimation; and censored likelihood models.

As an example, we consider a mortality dataset from Henderson and Sheppard (1919). This consists of three variables: age; the number of patients of each age; and the number of deaths at each age. Local logistic regression is requested by the family="binomial" argument, and the number of patients is passed as the weights argument:

> fit.mo <- locfit(deaths~age, weights=n,
+   family="binomial", data=morths,
+   alpha=0.5)
> plot(fit.mo, get.data=T)

Figure 3 displays the result. Note that while estimation is performed on the logistic scale, the result is automatically back-transformed to the probability scale.
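To make the local likelihood idea concrete, the following Python sketch maximizes the local binomial log-likelihood at a single fitting point by Newton-Raphson on the logistic scale. It is a simplified illustration under my own assumptions (fixed bandwidth, tricube weights, hypothetical function names), not the Locfit algorithm.

```python
import numpy as np

def tricube(v):
    v = np.abs(v)
    return np.where(v < 1, (1 - v**3)**3, 0.0)

def local_logistic(x0, x, y, h, degree=1, iters=25):
    """Maximize the local binomial log-likelihood at x0 by Newton
    iterations on the logistic scale; return p-hat(x0).
    y may hold 0/1 responses or fractional success proportions."""
    w = tricube((x - x0) / h)
    # Local polynomial basis in (x - x0): columns [1, (x-x0), ...]
    X = np.vander(x - x0, degree + 1, increasing=True)
    a = np.zeros(degree + 1)
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(X @ a)))
        grad = X.T @ (w * (y - p))                      # score
        hess = (X * (w * p * (1 - p))[:, None]).T @ X   # -Hessian
        a = a + np.linalg.solve(hess, grad)
    # Back-transform the local intercept to the probability scale
    return 1 / (1 + np.exp(-a[0]))
```

As in the text: the polynomial approximates \theta(x) locally, the coefficients maximize the weighted likelihood, and the estimate is back-transformed to the probability scale.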

Our example estimates the density of the durations of 107 eruptions of the Old Faithful geyser:

> fit.of <- locfit(~geyser, flim=c(1,6),
+   alpha=c(0.15,0.9))
> plot(fit.of, get.data=T, mpv=200)
> fit.og <- locfit(~geyser, flim=c(1,6),
+   alpha=c(0.15,0.7), link="ident")
> plot(fit.og, get.data=T, mpv=200)
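For intuition: with a local constant model and the log link, maximizing Locfit's local likelihood density criterion reduces to a kernel density estimate with the normalized tricube kernel. A small Python sketch of that special case follows; it is an illustration of the reduction (with my own function name), not Locfit's actual fitting procedure.

```python
import numpy as np

# Integral of the tricube weight (1 - |v|^3)^3 over [-1, 1]
TRICUBE_INT = 81 / 70

def kde_tricube(x0, data, h):
    """Local constant, log-link density estimate: solving the
    score equation for the constant log f gives
    f-hat(x0) = sum_i W((x_i - x0)/h) / (n * h * int W)."""
    v = np.abs((data - x0) / h)
    w = np.where(v < 1, (1 - v**3)**3, 0.0)
    return w.sum() / (len(data) * h * TRICUBE_INT)
```

Higher-degree local polynomials on the log scale (the Locfit default is local quadratic) go beyond this simple kernel form.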


Note the two components of the smoothing parameter alpha: the first is a nearest-neighbor component, and the second a fixed component. At each fitting point, both components are evaluated, and the larger bandwidth is used in the local likelihood. From Figure 4, the log link provides a visually more appealing estimate; Loader (1995) provides substantial evidence that it is also a better estimate. While asymptotic theory suggests the choice of link should have little effect on the estimate, in practice it is often advantageous to choose a link mapping the parameter space to (-∞, ∞). The Locfit default satisfies this for all models.
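The two-component rule described above can be sketched directly: evaluate the nearest-neighbor bandwidth and the fixed bandwidth at a fitting point and take the larger. A Python illustration (the function name is mine; the fixed component is treated as a bandwidth on the scale of the predictor):

```python
import numpy as np

def bandwidth(x0, x, alpha_nn, alpha_fixed):
    """Combine nearest-neighbor and fixed bandwidth components,
    as for alpha=c(alpha_nn, alpha_fixed): use the larger."""
    d = np.sort(np.abs(x - x0))
    k = int(np.ceil(alpha_nn * len(x)))
    h_nn = d[k - 1]  # distance to the k-th nearest neighbor
    return max(h_nn, alpha_fixed)
```

In sparse regions the nearest-neighbor distance dominates; in dense regions the fixed component prevents the bandwidth from collapsing.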


Figure 3: Mortality data of Henderson and Sheppard: local logistic fit.

For density estimation, the appropriate local likelihood criterion is

    \sum_{i=1}^{n} w_i(x) \log f(x_i) - n \int W\left(\frac{u - x}{h}\right) f(u)\,du;

see Loader (1996). By default, we use the log link; that is, log f(x) is modeled by local polynomials. In Locfit, density estimation is requested with family="density"; this becomes the default if no response is given in the formula. When link="ident" is given, a local polynomial model for the density itself is used. Local quadratic fitting with the identity link is one construction of the fourth-order kernel estimate discussed in Section 6.2.3.1 of Scott (1992).

Figure 4: Local quadratic density estimates for the Old Faithful data: log link (top) and identity link (fourth-order kernel, bottom).

4  Model Assessment

It is well known that smoothing parameters have a critical influence on the smooth curve: a large bandwidth leads to an oversmoothed curve that may inadequately model or completely miss important features, while a small bandwidth may undersmooth the curve, resulting in a fit that is visually too noisy.

A number of tools are available to help assess the performance of smooths. Global criteria such as cross validation and generalized cross validation (Craven and Wahba 1979) estimate the average squared prediction error, while the M statistic of Cleveland and Devlin (1988) estimates the average squared estimation error. Other tools, such as residual plots and confidence bands, attempt to assess the importance of individual features and check other modeling assumptions.

For example, in Figure 3 the smooth appears to fit the data nicely up to age 90, while the data become wild at larger ages. But the number of patients is very small at these larger ages, making it impossible to tell from Figure 3 whether there is any problem. Instead, we examine residuals:

> plot(morths$age, residuals(fit.mo),
+   xlab="age", ylab="residual")

Several possible definitions of residuals for generalized linear models are given in Section 2.4 of McCullagh and Nelder (1989). By default, Locfit uses deviance residuals; when the smooth is adequate, their distribution is approximately N(0, 1). Figure 5 shows there is no problem at the right end; if anything, a small amount of overdispersion in the range 60 < age < 80 is possible.

Figure 5: Deviance residuals for the mortality dataset.

Global goodness-of-fit criteria are readily computed. For example, the generalized cross validation statistic is

    GCV = \frac{n \sum_{i=1}^{n} (Y_i - \hat{\mu}(x_i))^2}{(n - \mathrm{tr}(H))^2},

where H is the hat matrix. The "locfit" object contains all the necessary components to compute this; the gcv() function provided with Locfit makes a call to locfit() and extracts the necessary components:

> gcv(NOx~E, data=ethanol, alpha=0.5)
      lik     infl     vari       gcv
 -4.53376 7.013307 6.487449 0.1216589
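For any linear smoother with fitted values ŷ = Hy, the GCV statistic can be computed directly from the hat matrix. A minimal Python sketch for a local linear smoother with tricube weights (an independent illustration with my own function names; Locfit computes these quantities internally):

```python
import numpy as np

def tricube(v):
    v = np.abs(v)
    return np.where(v < 1, (1 - v**3)**3, 0.0)

def hat_matrix(x, h):
    """Hat matrix H of the local linear smoother at the data points:
    row i holds the weights with mu-hat(x_i) = H[i] @ y."""
    n = len(x)
    H = np.zeros((n, n))
    for i, x0 in enumerate(x):
        w = tricube((x - x0) / h)
        X = np.column_stack([np.ones(n), x - x0])
        WX = X * w[:, None]
        # mu-hat(x0) = e1' (X'WX)^{-1} X'W y; take the first row
        H[i] = np.linalg.solve(WX.T @ X, WX.T)[0]
    return H

def gcv(x, y, h):
    """GCV = n * RSS / (n - tr(H))^2 for the local linear fit."""
    H = hat_matrix(x, h)
    n = len(x)
    rss = np.sum((y - H @ y) ** 2)
    return n * rss / (n - np.trace(H)) ** 2
```

Here tr(H) plays the role of the fitted degrees of freedom (the infl component of the gcv() output above).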


The lik component is -0.5 times the residual sum of squares; infl is tr(H); vari is tr(H'H); and gcv is the GCV score. We can easily loop through gcv() for several smoothing parameters:

> alpha <- seq(0.2, 0.8, by=0.05)
> g <- matrix(nrow=length(alpha), ncol=4)
> for(i in 1:length(alpha))
+   g[i, ] <- gcv(NOx~E, data=ethanol,
+     alpha=alpha[i])
> plot(g[,3], g[,4], ylim=c(0,0.2),
+   xlab="Fitted DF", ylab="GCV")

The plot is displayed in Figure 6. Note that the plot is fairly flat from about 6 to 16 degrees of freedom. This situation is not uncommon, and reflects the difficulty of purely data-based bandwidth selection. Looking at Figure 1, one might argue that the peak should be much flatter than the smooth displays, or possibly even bimodal. The flatness of GCV simply reflects this uncertainty.

Recent literature on bandwidth selection (e.g., Ruppert, Sheather, and Wand 1995) has strongly criticized cross validation and related procedures as being too variable and unreliable. In fact, a careful analysis shows the variability of cross validation is not the problem, but rather a symptom of how difficult data-based model selection is; for example, it reflects the uncertainty as to the correct amount of smoothing of the peak in Figure 1. Plug-in selectors, often claimed to be less variable, have not magically answered this uncertainty, but effectively make strong prior assumptions as to the correct amount of smoothing. This point is discussed further in Loader (1995), where it is shown that overreliance on bandwidth selectors has led to questionable conclusions on some standard examples.

Figure 6: Generalized cross validation plot for the ethanol dataset.

5  Locally Adaptive Fitting

Sometimes, we may be blessed with large datasets with low noise. In such cases, we can try to choose a separate bandwidth for each smoothing point. Locfit provides one such method for doing so, based on a localized version of AIC. Figure 7 shows an example, using one of the four examples popularized by Donoho and Johnstone (1994). The S commands producing this example are:

> x <- seq(0, 1, length.out=2048)
> y <- 20*sqrt(x*(1-x))*sin((2*pi*1.05)/
+   (x+0.05))+rnorm(2048)
> plot(y~x)
> fit.ad <- locfit(y~x, maxk=500,
+   alpha=c(0,0,log(2048)))
> plot(fit.ad, mpv = 2048)
> plot(predict(fit.ad, what="band"),
+   type="p")

The locally adaptive fit is requested by providing a third component to the smoothing parameter alpha; this specifies the penalty for the `number of parameters' in the local AIC criterion. Usually, a value slightly larger than 2 produces the most satisfactory results.
While examples like Figure 7 are challenging from an algorithmic viewpoint (it is hard to make a computer draw a smooth curve through the data), they are quite simple from a statistical viewpoint (the structure is quite obvious, and a perfectly adequate smooth would be obtained with a pen). For the types of data statisticians often face - more noise, and less obvious structure - locally adaptive smoothing is likely to be less satisfactory. The model selection uncertainty identified in the previous section applies equally to locally adaptive selection, and there can be no guarantee that the smooth produced by Locfit (or any other locally adaptive smoother) matches what the user desires.

Figure 7: Locally adaptive fit. Dataset (top); smooth fit (middle); and bandwidths used (bottom).

6  Classification

Classification problems, where one attempts to classify observations into two or more classes, can be addressed using either logistic regression or density estimation. As an example, we consider classifying the versicolor and virginica species from Fisher's iris dataset, based on the petal measurements. Local logistic regression (the default for a T/F response) is used:

> fit.ir <- locfit(I(species=="virginica")~
+   petal.wid+petal.len, scale=0, data=iris)
> plot(fit.ir, v = 0.5)
> plotbyfactor(petal.wid, petal.len, species,
+   data=iris, pch=c("O","+"), col=c(1,1),
+   add=T, lg=c(1,7))

Figure 8 shows the resulting fit, with the single contour plotting the classification boundary. We can also estimate the error rate, using cross validation:


Figure 8: Classification boundary for Fisher's iris data.

> table(fitted(fit.ir, cv=T) > 0.5,
+   iris$species)
        versicolor virginica
  FALSE         47         2
  TRUE           3        48

Here, we estimate the cross validated fit, and hence the misclassification rate is estimated as 5/100.

7  Computational Algorithms

Modern workstations and PCs are sufficiently fast that direct implementation of local regression is not a problem for datasets with a few hundred points. For larger datasets, or iterative procedures such as local likelihood, computational approximations become useful.

The computational methods used in Locfit develop the ideas used in Loess (Cleveland and Grosse 1991). Roughly, the local regression or likelihood is fitted directly at a small number of points, and the fit at these points is smoothly interpolated to obtain the fit at the remaining points. Locfit differs from Loess in the way points are selected for direct fitting: while Loess bases its choice on the density of the data points, Locfit uses the bandwidths at the fitting points. This allows Locfit to adapt to different bandwidth schemes: fixed, nearest-neighbor, and locally adaptive. The power of this approach becomes apparent in the third panel of Figure 7: the direct fit is performed at just 108 points, far fewer than the 2048 data points, and most of the fitting points lie in the interval 0 ≤ x ≤ 0.2, where the smallest bandwidths are used and the locally adaptive procedure is relatively cheap.
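The fit-directly-then-interpolate strategy can be illustrated with a toy Python sketch: fit at a coarse set of points and interpolate everywhere else. This is a deliberate simplification (fixed evaluation grid, linear interpolation, my own function names); Locfit's actual algorithm chooses evaluation points adaptively from the local bandwidths and uses a smoother interpolant.

```python
import numpy as np

def fit_at(g, x, y, h):
    """Local linear fit at a single point g with tricube weights;
    returns the local intercept, i.e. mu-hat(g)."""
    w = np.clip(1 - np.abs((x - g) / h) ** 3, 0, None) ** 3
    # polyfit's w multiplies residuals, so pass sqrt of the weights;
    # the last coefficient is the value of the fit at (x - g) = 0
    return np.polyfit(x - g, y, 1, w=np.sqrt(w))[-1]

def smooth_by_interpolation(x, y, h, n_direct=20):
    """Direct fits on a coarse grid, interpolated to all of x."""
    grid = np.linspace(x.min(), x.max(), n_direct)
    fits = np.array([fit_at(g, x, y, h) for g in grid])
    return np.interp(x, grid, fits)
```

The payoff is the same as described above: n_direct local fits instead of one per data point, at the cost of interpolation error between grid points.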


8  Conclusions

This article has outlined the main ideas underlying Locfit, and presented examples showing some of the main capabilities. The web pages referred to in Section 1 contain a number of other applications, including several models with censored data, and details of many more options. Of course, the best way for readers to decide whether Locfit is useful is to download and try it!

Acknowledgements

I thank Mark Hansen for inviting this article, and for helpful comments on presentation.

References

Becker, R. A., Cleveland, W. S., and Clark, L. (1997). Trellis graphics. Web page. http://cm.bell-labs.com/stat/project/trellis/

Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Pacific Grove, California: Wadsworth & Brooks/Cole.

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74, 829-836.

Cleveland, W. S. (1993). Visualizing Data. Summit, New Jersey: Hobart Press.

Cleveland, W. S. and Devlin, S. J. (1988). Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association 83, 596-610.

Cleveland, W. S. and Grosse, E. H. (1991). Computational methods for local regression. Statistics and Computing 1, 47-62.

Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik 31, 377-403.

Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425-455.

Henderson, R. (1916). Note on graduation by adjusted average. Transactions of the Actuarial Society of America 17, 43-48.

Henderson, R. and Sheppard, H. N. (1919). Graduation of Mortality and other Tables. New York: Actuarial Society of America.

Loader, C. R. (1995). Old Faithful erupts: Bandwidth selection reviewed. Technical report, Bell Laboratories, Murray Hill, NJ. http://cm.bell-labs.com/stat/doc/95.9.ps

Loader, C. R. (1996). Local likelihood density estimation. The Annals of Statistics 24, 1602-1618.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. London: Chapman and Hall.

Ruppert, D., Sheather, S. J., and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association 90, 1257-1270.

Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice and Visualization. New York: John Wiley & Sons.

Tibshirani, R. J. (1984). Local Likelihood Estimation. Ph.D. thesis, Department of Statistics, Stanford University.

Tibshirani, R. J. and Hastie, T. J. (1987). Local likelihood estimation. Journal of the American Statistical Association 82, 559-567.

Catherine Loader
Lucent Technologies
locfit@herine.net


A  C version commands

For C version users, this appendix gives the commands to produce (or approximate) the S figures in this paper.

Figure 1:

locfit> locfit NOx~E data=ethanol alpha=0.5
locfit> plotfit data=T

Figure 2:

locfit> locfit NOx~E+C data=ethanol alpha=0.5 scale=0
locfit> plotfit type=w

Figure 3:

locfit> locfit deaths~age weights=n family=binomial data=morths alpha=0.5
locfit> plotfit data=T

Figure 4:

locfit> locfit ~geyser data=geyser flim=1,6 alpha=0.15,0.9
locfit> plotfit data=T m=200
locfit> locfit ~geyser data=geyser flim=1,6 alpha=0.15,0.7 link=ident
locfit> plotfit data=T m=200

Figure 5:

locfit> locfit deaths~age weights=n family=binomial data=morths alpha=0.5
locfit> res=residuals
locfit> plotdata age res xlab=age ylab=residual

Figure 6: This example is in two parts. The commands for the first part must be stored in a file (say gcv.cmd) and run in batch mode (locfit gcv.cmd). The second part, actually producing the plot, can be run interactively.

gcv=def like infl vari -2*88*like/((88-infl)*(88-infl))
readdata ethanol
outf gcv.out
for 4 16
  locfit NOx~E alpha=i/20
  gcv
endfor
exit

locfit> readfile gcv.out lk nu1 nu2 g
locfit> plotdata nu1 g xlab=Fitted_DF ylab=GCV

Figure 7:


locfit> design n=2048 x=seq(0,1)
locfit> design y=20*sqrt(x*(1-x))*sin(2.1*3.14159265/(x+0.05))+rnorm()
locfit> plotdata x y
locfit> locfit y~x maxk=500 alpha=0,0,log(2048)
locfit> plotfit m=2048
locfit> knots x h
locfit> plotdata x h

Figure 8:

locfit> locfit species~petwid+petlen scale=0 family=binomial data=iris
locfit> plotfit split=0.5 data=T type=cq
locfit> fit=fitted cv=T
locfit> design pred=fit>0.5
locfit> table species pred

    1   47    3
    2    2   48

The split=0.5 argument to plotfit requests a single contour at the 0.5 level. data=T adds data to the plot. type=cq specifies a contour plot (type=c) for the fit component of the plot, and colour-coded points (type=q) for the data.
