Original document: http://zebu.uoregon.edu/1999/es202/l19.html
Last modified: Tue Mar 2 21:53:19 1999
Yet correlation analysis is arguably the single most important thing
that one does with a data set. Such an analysis can help define
trends, make predictions and uncover root causes for certain
phenomena.
While there are standard tools for performing correlation
analyses (these will be provided to you later), the analysis is often
done poorly. As a result, a lot of erroneous analysis gets published,
in virtually all fields.
Often there is simply not enough
data to adequately define a correlation. This allows one to make
ridiculous predictions which, although they can be supported by
the data, make no sense.
A favorite example:
Here is a prediction that I made in the year 1839 (that was in the pre-internet era):
All presidents that are elected in a year that ends with a zero will die in office:
Gee, 7 events in a row. Pretty good prediction, huh!
Conclusion: It's probably safe to run for office in the year 2000
Basics of Correlation:
Correlation can be used to summarise the amount of association between 2 continuous variables. Plotting a "scatter" yields a "cloud" of points:
In general, we measure correlation by a parameter known as the correlation coefficient, r .
r lies between -1 and 1
Mathematically, r is defined as
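(The expression below is the standard Pearson product-moment form, which is assumed here because the notes do not name the estimator. N is the number of data points, and the bars denote the means of x and y.)

```latex
r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}
```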
But we don't really care about this - we only care about using the value of r as a rough guide to how well two variables are correlated. Usually your eye is a good estimator of r.
Regression is now built into the tool
Let's look at some examples using correlation and regression analysis.
An example data set:
The goal here is to find the best relation between Y, the dependent variable, and X, the independent variable.
X is the variable that we would measure, because Y is more difficult, and in some cases might be impossible, to measure.
Since we are measuring X, the role of measurement error will become important. More on that later.
   X      Y    Other
 10.0   12.5   22.0
  8.5   11.1   18.0
 16.8   22.3   19.5
 11.2   15.4   15.5
 17.8   25.3   12.2
  5.4    8.4   11.6
 21.6   32.6    7.4
  9.6   18.5    0.8
 14.0   15.3   30.5
 13.5   16.8   22.7
The correlation between X and Y is shown here:
Ypred = 1.39X + 0.03 ; dispersion = 2.53 ; r = 0.94
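The fit quoted above can be checked with a short ordinary least-squares calculation. This is a minimal sketch in plain Python, assuming the notes use an unweighted least-squares line and the Pearson coefficient. The function name linear_fit is illustrative and not part of the course tool.

```python
import math

# Example data set from the notes
X = [10.0, 8.5, 16.8, 11.2, 17.8, 5.4, 21.6, 9.6, 14.0, 13.5]
Y = [12.5, 11.1, 22.3, 15.4, 25.3, 8.4, 32.6, 18.5, 15.3, 16.8]

def linear_fit(x, y):
    """Unweighted least-squares line y = slope*x + intercept, plus Pearson r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross products about the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

slope, intercept, r = linear_fit(X, Y)
print(f"Ypred = {slope:.2f}X + {intercept:.2f} ; r = {r:.2f}")
```

For this data the sketch gives a slope near 1.39, an intercept near 0.03 and r near 0.94, in line with the values quoted above. The dispersion is left out because the notes do not say which estimator they use for it.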
Let's calculate the residuals for each data point now. Here residual = Ypred - Y, and significance = residual / dispersion.
   X      Y    Y-pred  Residual  Significance
 10.0   12.5   13.93     1.43       0.56
  8.5   11.1   11.85     0.75       0.29
 16.8   22.3   23.38     1.08       0.43
 11.2   15.4   15.60     0.20       0.08
 17.8   25.3   24.77    -0.52      -0.21
  5.4    8.4    7.53    -0.86      -0.34
 21.6   32.6   30.05    -2.54      -1.00
  9.6   18.5   13.37    -5.12      -2.02
 14.0   15.3   19.49     4.19       1.65
 13.5   16.8   18.80     2.00       0.79
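The residual table can be regenerated from the quoted fit. The sketch below assumes the rounded coefficients (1.39 and 0.03) and the dispersion (2.53) given in the notes, and it uses the sign convention that the table values imply, residual = Ypred - Y.

```python
# Example data set from the notes
X = [10.0, 8.5, 16.8, 11.2, 17.8, 5.4, 21.6, 9.6, 14.0, 13.5]
Y = [12.5, 11.1, 22.3, 15.4, 25.3, 8.4, 32.6, 18.5, 15.3, 16.8]

SLOPE, INTERCEPT = 1.39, 0.03   # rounded fit quoted in the notes
DISPERSION = 2.53               # dispersion quoted in the notes

rows = []
for x, y in zip(X, Y):
    y_pred = SLOPE * x + INTERCEPT
    residual = y_pred - y                  # sign convention implied by the table
    significance = residual / DISPERSION   # residual in units of the dispersion
    rows.append((x, y, y_pred, residual, significance))

for x, y, y_pred, residual, significance in rows:
    print(f"{x:5.1f} {y:5.1f} {y_pred:7.2f} {residual:7.2f} {significance:7.2f}")

# The most deviant point is the one with the largest absolute residual
worst = max(rows, key=lambda row: abs(row[3]))
print("most deviant point: X =", worst[0], ", Y =", worst[1])
```

The last printed digit can differ from the table in a few rows because of rounding. The most deviant point is (9.6, 18.5), about two dispersions away from the line.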
Try rejection analysis to improve the fit (mainly lower the scatter). Reject the most deviant point in the above.
That new relation is plotted here:
Ypred = 1.48X -1.76 ; dispersion = 1.96
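One way to sketch this single rejection step is to fit, drop the point with the largest absolute residual, and fit again. This assumes an unweighted least-squares line. The helper name linear_fit is illustrative.

```python
# Example data set from the notes
X = [10.0, 8.5, 16.8, 11.2, 17.8, 5.4, 21.6, 9.6, 14.0, 13.5]
Y = [12.5, 11.1, 22.3, 15.4, 25.3, 8.4, 32.6, 18.5, 15.3, 16.8]

def linear_fit(x, y):
    """Unweighted least-squares line; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# First pass: fit all ten points and find the most deviant one
slope, intercept = linear_fit(X, Y)
residuals = [slope * x + intercept - y for x, y in zip(X, Y)]
worst = max(range(len(X)), key=lambda i: abs(residuals[i]))

# Second pass: refit with that single point rejected
X_kept = [x for i, x in enumerate(X) if i != worst]
Y_kept = [y for i, y in enumerate(Y) if i != worst]
new_slope, new_intercept = linear_fit(X_kept, Y_kept)

print(f"rejected point: X = {X[worst]}, Y = {Y[worst]}")
print(f"Ypred = {new_slope:.2f}X + {new_intercept:.2f}")
```

The rejected point is (9.6, 18.5), and the refit comes out near 1.48X - 1.76, matching the relation quoted above.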
This representation of the data is more reliable and robust. Can anything be done to further reduce the scatter?
In most cases, no. In this case, however, we have a third variable labelled Other. What happens if we plot the residuals against Other?
residual = 0.30*Other -4.75
Here residual = Ypred - Yactual = (1.39X + 0.03) - Yactual, the sign convention of the residual table above,
so we have Yactual = Ypred - residual
or Yactual = 1.39X + 0.03 - (0.30*Other - 4.75) = 1.39X - 0.30*Other + 4.78
The good correlation between the residuals and another variable in this case allows us to make a linear combination to further reduce the scatter. To wit:
   X      Y    Other  Y-pred  Residual
 10.0   12.5   22.0   12.08    -0.42
  8.5   11.1   18.0   11.20     0.10
 16.8   22.3   19.5   22.28    -0.02
 11.2   15.4   15.5   15.70     0.30
 17.8   25.3   12.2   25.86     0.56
  5.4    8.4   11.6    8.80     0.40
 21.6   32.6    7.4   32.58    -0.02
  9.6   18.5    0.8   17.88    -0.61
 14.0   15.3   30.5   15.09    -0.21
 13.5   16.8   22.7   16.73    -0.07
The dispersion has dropped from 2.53 to 0.3 with the addition of this second term. Hence, Y can be predicted from X alone, but it can be predicted far more accurately from X and Other together.
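The two-step procedure (fit Y against X, then fit the residuals against Other) can be sketched as follows. It assumes the rounded first-pass fit from the notes and the residual convention of the tables, residual = Ypred - Y. The scatter is measured as a root-mean-square residual, which is an assumption, since the notes do not define dispersion.

```python
import math

# Example data set from the notes
X = [10.0, 8.5, 16.8, 11.2, 17.8, 5.4, 21.6, 9.6, 14.0, 13.5]
Y = [12.5, 11.1, 22.3, 15.4, 25.3, 8.4, 32.6, 18.5, 15.3, 16.8]
OTHER = [22.0, 18.0, 19.5, 15.5, 12.2, 11.6, 7.4, 0.8, 30.5, 22.7]

def linear_fit(x, y):
    """Unweighted least-squares line; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rms(values):
    """Root-mean-square of a list of residuals (assumed scatter measure)."""
    return math.sqrt(sum(v * v for v in values) / len(values))

SLOPE, INTERCEPT = 1.39, 0.03   # rounded first-pass fit quoted in the notes

# Step 1: residuals of the X-only fit, with residual = Ypred - Y
y_pred = [SLOPE * x + INTERCEPT for x in X]
residuals = [yp - y for yp, y in zip(y_pred, Y)]

# Step 2: fit those residuals against the third variable
k, c = linear_fit(OTHER, residuals)

# Step 3: subtract the fitted residual to get the corrected prediction
y_corrected = [yp - (k * o + c) for yp, o in zip(y_pred, OTHER)]
new_residuals = [yc - y for yc, y in zip(y_corrected, Y)]

print(f"residual = {k:.2f}*Other + {c:.2f}")
print(f"scatter before: {rms(residuals):.2f}, after: {rms(new_residuals):.2f}")
```

The fitted residual relation comes out close to 0.30*Other - 4.75, and the scatter of the corrected predictions is several times smaller than that of the X-only fit.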
One final point about measurement errors in X.
Suppose I have two relations involving different quantities but both use the same independent variable X.
Relation 1:
Y1 = 1.5X + 1.5 ; with a dispersion of 0.5 units
Relation 2:
Y2 = 6.0X + 2.5 ; with a dispersion of 0.3 units
Suppose that I can only make measurements of X which are accurate to 10%. This means that, despite a lower dispersion, Y2 is less well determined than Y1!
Example: X = 10 +/- 1
For X = 10: Y1 = 1.5*10 + 1.5 = 16.5
For X = 11: Y1 = 1.5*11 + 1.5 = 18.0
So a 10% uncertainty in X translates into a +/- 1.5 unit uncertainty
in Y1.
For X = 10: Y2 = 6.0*10 + 2.5 = 62.5
For X = 11: Y2 = 6.0*11 + 2.5 = 68.5
So a 10% uncertainty in X translates into a +/- 6.0 unit uncertainty in Y2, far larger than the 0.3 unit dispersion of the relation itself.
So relations which have steep slopes require that X be measured very accurately.
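For a straight line the arithmetic above reduces to (uncertainty in Y) = |slope| * (uncertainty in X). The sketch below applies that to both relations. The function name and the dictionary layout are illustrative, not taken from the notes.

```python
def propagated_uncertainty(slope, x, fractional_error):
    """Uncertainty in Y = slope*X + b caused by a fractional error in X."""
    return abs(slope) * x * fractional_error

# (slope, intercept, dispersion) for the two relations in the notes
relations = {
    "Y1": (1.5, 1.5, 0.5),
    "Y2": (6.0, 2.5, 0.3),
}

x = 10.0                 # measured value of X
fractional_error = 0.10  # X is only accurate to 10%

for name, (slope, intercept, dispersion) in relations.items():
    y = slope * x + intercept
    dy = propagated_uncertainty(slope, x, fractional_error)
    print(f"{name} = {y:.1f} +/- {dy:.1f} from X error "
          f"(dispersion of the relation: {dispersion})")
```

In both relations the uncertainty inherited from X exceeds the intrinsic dispersion, and for the steeper relation Y2 it does so by a much larger factor.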
Final points about regression: