Document taken from a search engine cache. Original document address: http://zebu.uoregon.edu/1998/es202/distribution.html
Last modified: Wed Jan 28 02:39:56 1998
Indexed: Tue Oct 2 07:14:11 2012

A distribution represents the number of times that a variable takes each particular value. A distribution can assume any form, but most are well approximated by certain standard forms, as we will see later. A distribution is also bounded by its minimum and maximum values; the interval between them is called the range of the data.

Example
Let's return to the rainfall example for Eugene. The following is the annual rainfall data for 1970 through 1995 (26 years); each row gives the rainfall in inches followed by the year:

50.09 1970
60.67 1971
57.56 1972
55.30 1973
56.78 1974
51.77 1975
34.78 1976
46.91 1977
39.50 1978
51.28 1979
51.34 1980
55.90 1981
60.02 1982
64.01 1983
58.30 1984
33.83 1985
52.90 1986
44.80 1987
47.75 1988
40.66 1989
55.47 1990
48.44 1991
47.60 1992
53.73 1993
52.37 1994
65.56 1995

In this data, the range runs from a minimum of roughly 34 inches per year to a maximum of roughly 66 inches per year. Over this interval we have 26 separate measurements that define the distribution. We want to use this distribution to get a feel for the rainfall characteristics in Eugene. This brings up the important issue of
Resolution vs. Frequency of Events
For instance, here is the distribution if you make the resolution so fine that each individual year is plotted:

This representation of the data is not useful. There are many more bins than there are data points, and it's very difficult to see what the most probable value or range of values really is. In general, the number of bins should be no more than half the number of data points. Here we have 26 data points, so we should use no more than 13 bins. Rounding the range of the data outward to 30 to 70 inches and choosing 10 bins gives a bin width of 4 inches. In general, the size of a bin should correspond to some level of significance that you think exists in the data.
For instance, differences of 0.1 inches in annual rainfall from one year to the next are probably not significant. Differences of 1 or 2 inches might be. You can use that to determine the size of the bin, but you also need enough data points per bin to make the results meaningful.
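The rule of thumb and the bin-width arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names `max_bins` and `bin_width` are made up here, and the 30 to 70 inch limits are the rounded range used in the text rather than the exact minimum and maximum.

```python
# Annual rainfall for Eugene in inches, in the order listed in the table above.
rainfall = [
    50.09, 60.67, 57.56, 55.30, 56.78, 51.77, 34.78, 46.91, 39.50,
    51.28, 51.34, 55.90, 60.02, 64.01, 58.30, 33.83, 52.90, 44.80,
    47.75, 40.66, 55.47, 48.44, 47.60, 53.73, 52.37, 65.56,
]

def max_bins(n_points):
    """Rule of thumb from the text: at most half as many bins as data points."""
    return n_points // 2

def bin_width(low, high, n_bins):
    """Width of each bin when [low, high] is split into n_bins equal parts."""
    return (high - low) / n_bins

limit = max_bins(len(rainfall))   # upper limit on the number of bins
width = bin_width(30.0, 70.0, 10) # rounded range (an assumption), 10 bins

print("bins allowed:", limit)
print("bin width for 10 bins:", width, "inches")
```

Ten bins sits under the limit, and the resulting 4-inch width is wide enough to sit well above the year-to-year differences the text considers insignificant.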
Here is the data with a bin size of 1 inch:

Again, the number of data points per bin is too small for this representation of the data to be meaningful.
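To put a number on how sparse the 1-inch binning is, the following sketch counts how many 1-inch bins are empty and how many values the fullest bin holds. The 30 to 70 inch limits are the rounded range used in the text, and labelling each bin by its lower edge is simply a convenient choice for this example.

```python
import math
from collections import Counter

# Annual rainfall for Eugene in inches, in the order listed in the table above.
rainfall = [
    50.09, 60.67, 57.56, 55.30, 56.78, 51.77, 34.78, 46.91, 39.50,
    51.28, 51.34, 55.90, 60.02, 64.01, 58.30, 33.83, 52.90, 44.80,
    47.75, 40.66, 55.47, 48.44, 47.60, 53.73, 52.37, 65.56,
]

# Assign each value to a 1-inch bin labelled by its lower edge (e.g. 50.09 -> 50).
counts = Counter(math.floor(x) for x in rainfall)

low, high = 30, 70  # rounded range (an assumption taken from the text)
n_bins = high - low
empty = sum(1 for edge in range(low, high) if counts[edge] == 0)
largest = max(counts.values())

print(f"{n_bins} bins, {empty} empty, largest count = {largest}")
# prints: 40 bins, 21 empty, largest count = 3
```

With more than half of the bins empty and no bin holding more than three values, the shape of the distribution is dominated by noise.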
Increasing the bin size to 4 inches (thereby decreasing the number of bins to 10, since 10 x 4 = 40 inches covers the 30 to 70 inch range) yields:

This is a meaningful distribution, as it now directly shows the probable range of values to be 45 to 55 inches.
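The 4-inch binning can be reproduced as a simple text histogram. This is a minimal sketch, not the plotting method used for the original figure: the `histogram` helper is hypothetical, and bins are taken as half-open intervals starting at 30 inches (a value exactly on an edge goes into the higher bin).

```python
# Annual rainfall for Eugene in inches, in the order listed in the table above.
rainfall = [
    50.09, 60.67, 57.56, 55.30, 56.78, 51.77, 34.78, 46.91, 39.50,
    51.28, 51.34, 55.90, 60.02, 64.01, 58.30, 33.83, 52.90, 44.80,
    47.75, 40.66, 55.47, 48.44, 47.60, 53.73, 52.37, 65.56,
]

def histogram(data, low, width, n_bins):
    """Count how many values fall in each bin [low + i*width, low + (i+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        i = int((x - low) // width)
        if 0 <= i < n_bins:  # ignore anything outside the binned range
            counts[i] += 1
    return counts

counts = histogram(rainfall, low=30, width=4, n_bins=10)

# One row per bin: lower edge, upper edge, and one star per year in that bin.
for i, c in enumerate(counts):
    lo = 30 + 4 * i
    print(f"{lo:2d}-{lo + 4:2d} | {'*' * c}")
```

The counts come out as 1, 1, 2, 1, 4, 7, 5, 3, 2, 0, peaking in the 50 to 54 inch bin, with most values between 46 and 58 inches. That is broadly consistent with the 45 to 55 inch range read off the plot.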
Also be aware of the Ross Perot type of distribution, in which the vertical axis is exaggerated in order to give a false impression of the data.
For instance, the following represents my average leftover beer money at the end of each month for 1995. This is well known to be a leading economic indicator:

Jan 50
Feb 57
Mar 62
Apr 48
May 73
Jun 51
Jul 58
Aug 49
Sep 71
Oct 61
Nov 46
Dec 38

A normal representation of the data in real dollars looks like this:

An exaggerated vertical scale, on which the difference between units is not very significant, makes it look as if my beer money suffered a catastrophe, indicating that the economy is going to hell. Note that the Y-axis does not go to zero!
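The size of the distortion can be estimated by comparing drawn bar heights rather than actual values. In the sketch below, the truncated baseline of 35 dollars is an assumed value chosen for illustration (the original plot's axis limits are not given), and `apparent_ratio` is a hypothetical helper.

```python
# Leftover beer money in dollars at the end of each month of 1995 (from the table above).
beer_money = {
    "Jan": 50, "Feb": 57, "Mar": 62, "Apr": 48, "May": 73, "Jun": 51,
    "Jul": 58, "Aug": 49, "Sep": 71, "Oct": 61, "Nov": 46, "Dec": 38,
}

def apparent_ratio(a, b, baseline):
    """Ratio of the drawn heights of two values when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)

dec, may = beer_money["Dec"], beer_money["May"]  # lowest and highest months

honest = apparent_ratio(dec, may, baseline=0)      # axis starts at zero
truncated = apparent_ratio(dec, may, baseline=35)  # assumed truncated axis

print(f"axis from 0:  December looks like {honest:.0%} of May")
print(f"axis from 35: December looks like {truncated:.0%} of May")
```

With the axis starting at zero, December is drawn at about half the height of May, which matches the real dollar amounts. With the assumed baseline of 35, the same two months appear to differ by more than a factor of ten.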
