Correlations in Data

Документ взят из кэша поисковой машины. Адрес оригинального документа : http://zebu.uoregon.edu/1999/es202/l19.html
Дата изменения: Tue Mar 2 21:53:19 1999
Дата индексирования: Tue Oct 2 01:15:10 2012
Кодировка:
Поисковые слова: легирование

Undoubtedly the biggest arguments that exist in social and/or natural sciences are about correlations. The rift between social scientists and physical scientists is largely driven by disagreements about how rigorous correlations have been derived.

Yet correlation analysis is argueably the single most important thing that one does with a data set. Such an analysis can help define trends, make predictions and uncover root causes for certain phenomena.

While there are standard tools for performing correlation analyses (this will be provided to you later) it is often done poorly. As a result, lots of erroneous analysis gets published, in virtually all fields.

Often times, there is simply not enough data to adequately define a correlation. This allows one to make ridiculous predictions which, although they can be supported by the data, make no sense.

A favorite example:

Here is a prediction that I made in the year 1839 (that was in the pre-internet era):

All presidents that are elected in a year that ends with a zero will die in office:

1840: William Henry Harrison dies in April of 1841
1860: Abraham Lincoln Assassinated
1880: Garfield Assassinated
1900: McKinley Assassinated
1920: Harding Died of illness
1940: Roosevelt Died of illness
1960: Kennedy Assassinated
Gee 7 events in a row, pretty good prediction, huh!
1980: Reagan Almost Assassinated
Conclusion: Its probably safe to run for office in the year 2000

Basics of Correlation:

You have two sampling variables (X and Y). Does the value of one variable depend on the other or are the variables random.

Correlation then determines the probability that the two variables are randomly correlated.

Correlations can be strong or weak.

Strong correlations are extremely useful in identifying root causes and/or what the most important variables are

Weak correlations open the way for tremendous ambiguity

If two variables are correlated, it means that one variable can be written as a function of the other.
Math Anxiety Alert While functions can be complicated, here we will only be concerned with linear correlation:

Y = AX + B
A = the slope of the correlation
B = a calibration constant (often called a zeropoint)
if A is negative, the variables are anti-correlated

Correlation can be used to summarise the amount of association between 2 continuous variables. Plotting a "scatter" yields a "cloud"of points :

A positive association between the x and y variable helps you to predict the value of the other. If there is little or no association the "cloud"is more spread out and information about one variable can not really be discerned from the other variable.

These "clouds" have the same values for the centre, defined by the mean x and y values. Furthermore, the dispersion in the X variable is the same as that in the Y variable. But (A) is tightly clustered and (B) is loosely clustered. The amount of clustering, i.e. the strength of association, is summarised by the correlation coefficient.

In general, we measure correlation by a parameter known as the correlation coefficient, r .

r is between - 1 and 1

Mathematically, r is defined as

But we don't really care about this - we only care about using the value of r as a rough guide to how well two variables are correlated. Usually your eye is a good estimator of r.

Regression is now built into the tool

Let's look at the some examples using correlation and regression analysis.

An example data set:

The Goal here is to find the best relation between, Y the dependent variable, and X- the independent variable.

X is the variable that would measure because Y is more difficult, and in some cases might be impossble to Measure.

Since we are measuring X - the role of measurement error will be come important. More on that later.


     X        Y       Other 

   10.0      12.5      22.0
    8.5      11.1      18.0
   16.8      22.3      19.5
   11.2      15.4      15.5
   17.8      25.3      12.2
    5.4       8.4      11.6
   21.6      32.6       7.4
    9.6      18.5       0.8
   14.0      15.3      30.5
   13.5      16.8      22.7

The correlation between X and Y is shown here:

Y_pred = 1.39X + 0.03 ; dispersion = 2.53 ; r = 0.94

Let's calculate the residuals for each data point now.

     X        Y       Y-pred     Residual    Significance

   10.0      12.5     13.93        1.43           0.56
    8.5      11.1     11.85        0.75           0.29
   16.8      22.3     23.38        1.08           0.43
   11.2      15.4     15.60        0.20           0.08
   17.8      25.3     24.77       -0.52          -0.21
    5.4       8.4      7.53       -0.86          -0.34
   21.6      32.6     30.05       -2.54          -1.00
    9.6      18.5     13.37       -5.12          -2.02
   14.0      15.3     19.49        4.19           1.65
   13.5      16.8     18.80        2.00           0.79

Try rejection analysis to improve the fit (mainly lower the scatter). Reject the most deviant point in the above.

That new relation is plotted here:

New relations has lower scatter by about 20%:

Y_pred = 1.48X -1.76 ; dispersion = 1.96

This representation of the data is a more reliable and robust. Can anything be done to further reduce the scatter?

In most cases usually not. In this case, however, we have a third variable labelled other. What happens if we plot the residuals against other?

In this case there is a clear correlation in that

residual = 0.30*Other -4.75

since residual = Y_actual - Y_pred = Y_actual - 1.39X + 0.03

so we have Y_actual = Y_pred + residual

or Y_actual = 1.39X +0.03 + 0.30*Other - 4.75

The good correlation between the residuals and another variable in this case allows us to make a linear combination to further reduce the scatter. To wit

    X          Y      Other       Y-pred        Residual
 
   10.0      12.5     22.0         12.08         -0.42
    8.5      11.1     18.0         11.20          0.10
   16.8      22.3     19.5         22.28         -0.02
   11.2      15.4     15.5         15.70          0.30
   17.8      25.3     12.2         25.86          0.56
    5.4       8.4     11.6          8.80          0.40
   21.6      32.6      7.4         32.58         -0.02
    9.6      18.5      0.8         17.88         -0.61
   14.0      15.3     30.5         15.09         -0.21
   13.5      16.8     22.7         16.73         -0.07

The dispersion has lowered from 2.53 to 0.3 with the addition of this second term. Hence, Y can be predicted from X but can be predicted very accurately from X and Other.

One final point about measurement errors in X.

Suppose I have two relations involving different quantities but both use the same independent variable X.

Relation 1:

Y₁ = 1.5X + 1.5 ; with a dispersion of 0.5 units

Relation 2:

Y₂ = 6.0X + 2.5 ; with a dispersion of 0.3 units

Suppose that I can only make measurements of X which are accurate to 10%. This means that, despite a lower disperions, Y₂ is less well determined than Y₁!

Example: x = 10 +/- 1

Y₁ = 1.5*10 +1.5 = 16.5
Y₁ = 1.5*11 +1.5 = 18.0
So 10% uncertainty in X translates into +/- 1.5 unit uncertainty in Y.

For Y₂ = 6.0*10 +2.5 = 62.5
For Y₂ = 6.0*11 +2.5 = 68.5
So 10% uncertainty in X translates into +/- 6.0 unit uncertainty

So relations which have steep slopes require that X be measured very accurately.

Final points about regression:

Sometimes correlations can be defined by a single point which significantly extends the range of the data.
Sometimes a few deviant points (devaint due to measurement error or poor sampling) can degrade the correlation. One cycle of rejection should occur for most data sets.

JAVA Applet

Previous Lecture Next Lecture Course Page