Document retrieved from a search engine cache. Original document address: http://zebu.uoregon.edu/1998/es202/distribution.html
Last modified: Wed Jan 28 02:39:56 1998. Indexed: Tue Oct 2 07:14:11 2012.

Let's return to the rainfall example for Eugene. The following is the annual rainfall data (in inches) for 1970 through 1995:

- 50.09 1970
- 60.67 1971
- 57.56 1972
- 55.30 1973
- 56.78 1974
- 51.77 1975
- 34.78 1976
- 46.91 1977
- 39.50 1978
- 51.28 1979
- 51.34 1980
- 55.90 1981
- 60.02 1982
- 64.01 1983
- 58.30 1984
- 33.83 1985
- 52.90 1986
- 44.80 1987
- 47.75 1988
- 40.66 1989
- 55.47 1990
- 48.44 1991
- 47.60 1992
- 53.73 1993
- 52.37 1994
- 65.56 1995

Resolution vs. Frequency of Events

For instance, here is the distribution if you make the resolution so fine that each individual year is plotted:

This representation of the data is not useful. There are many more bins than there are data points, and it's very difficult to see what the most probable value or range of values really is. In general, the number of bins that you divide your data into should be no more than half the number of data points. For this case we have 26 data points, and therefore we should not have more than 13 bins. The total range of the data is from 30 to 70 inches, so if we choose 10 bins then each bin width is 4 inches. In general, the size of a bin should correspond to some level of significance that you think exists in the data.
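The arithmetic behind this rule of thumb can be sketched in a few lines of Python. This is only an illustration: the helper names `max_bins` and `bin_width` are made up here, and the 30 to 70 inch range is the rounded range quoted in the text, not the exact minimum and maximum of the data. The rainfall list above holds 26 values.

```python
# Annual rainfall for Eugene in inches, 1970 through 1995 (copied from the list above).
rainfall = [
    50.09, 60.67, 57.56, 55.30, 56.78, 51.77, 34.78, 46.91, 39.50,
    51.28, 51.34, 55.90, 60.02, 64.01, 58.30, 33.83, 52.90, 44.80,
    47.75, 40.66, 55.47, 48.44, 47.60, 53.73, 52.37, 65.56,
]

def max_bins(n_points):
    """Rule of thumb from the text: at most half as many bins as data points."""
    return n_points // 2

def bin_width(low, high, n_bins):
    """Width of each bin when the range [low, high] is split into equal bins."""
    return (high - low) / n_bins

n_points = len(rainfall)
print("data points:", n_points)            # 26
print("upper limit on bins:", max_bins(n_points))  # 13
print("bin width for 10 bins:", bin_width(30, 70, 10))  # 4.0
```

Ten bins sits comfortably under the limit, which is why a 4-inch bin width is a reasonable choice here.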

For instance, differences of 0.1 inches in annual rainfall from one year to the next are probably not significant. Differences of 1 or 2 inches might be. You can use that to determine the size of the bin, but you also have to have enough data per bin to make the results meaningful.

Here is the data with a bin size of 1 inch:

Again, the number of data points per bin is too small for this representation of the data to be meaningful.
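To see how sparse the 1-inch binning is without the plot, one can count the years per bin directly. The sketch below assumes each bin covers a whole inch starting at an integer value, so a year's bin is simply the floor of its rainfall; the variable names are illustrative.

```python
import math
from collections import Counter

# Annual rainfall for Eugene in inches, 1970 through 1995 (copied from the list above).
rainfall = [
    50.09, 60.67, 57.56, 55.30, 56.78, 51.77, 34.78, 46.91, 39.50,
    51.28, 51.34, 55.90, 60.02, 64.01, 58.30, 33.83, 52.90, 44.80,
    47.75, 40.66, 55.47, 48.44, 47.60, 53.73, 52.37, 65.56,
]

# With a 1-inch bin width, each value falls in the bin that starts at its floor.
counts = Counter(math.floor(x) for x in rainfall)

n_bins = 70 - 30                 # 40 one-inch bins across the 30 to 70 inch range
occupied = len(counts)           # bins holding at least one year
fullest = max(counts.values())   # largest number of years in any single bin

print(f"{occupied} of {n_bins} bins are occupied; the fullest holds {fullest} years")
# prints: 19 of 40 bins are occupied; the fullest holds 3 years
```

More than half the bins are empty and no bin holds more than three years, so no clear peak can emerge.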

Increasing the bin size to 4 inches (thereby decreasing the number of bins to 10, since 10 x 4 = 40 inches, which is the range of the data) yields:

This is a meaningful distribution, as it now directly shows the probable range of values to be about 45 to 55 inches.
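The 4-inch binning can be reproduced as a rough text histogram. This is a minimal sketch, assuming bins start at 30 inches and are half-open (a value on an edge goes into the higher bin); `bin_counts` is a hypothetical helper, not something from the original page.

```python
# Annual rainfall for Eugene in inches, 1970 through 1995 (copied from the list above).
rainfall = [
    50.09, 60.67, 57.56, 55.30, 56.78, 51.77, 34.78, 46.91, 39.50,
    51.28, 51.34, 55.90, 60.02, 64.01, 58.30, 33.83, 52.90, 44.80,
    47.75, 40.66, 55.47, 48.44, 47.60, 53.73, 52.37, 65.56,
]

def bin_counts(data, low, width, n_bins):
    """Count how many values fall in each half-open bin [edge, edge + width)."""
    counts = [0] * n_bins
    for x in data:
        index = int((x - low) // width)
        if 0 <= index < n_bins:   # ignore anything outside the chosen range
            counts[index] += 1
    return counts

counts = bin_counts(rainfall, low=30, width=4, n_bins=10)

# Print one row per bin, with one '#' per year in that bin.
for i, c in enumerate(counts):
    start = 30 + 4 * i
    print(f"{start}-{start + 4} in | {'#' * c}")
```

The three fullest bins run from 46 to 58 inches and hold 16 of the 26 years, which is roughly consistent with the range read off the plot.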

Also be aware of the Ross Perot type of distribution, in which the vertical axis is exaggerated in order to give a false impression of the data.

For instance, the following represents my average leftover beer money (in dollars) at the end of each month for 1995. This is well known to be a leading economic indicator:

- Jan 50
- Feb 57
- Mar 62
- Apr 48
- May 73
- Jun 51
- Jul 58
- Aug 49
- Sep 71
- Oct 61
- Nov 46
- Dec 38

A normal representation of the data in real dollars looks like this:

An exaggerated vertical scale, in which the difference between units on the scale is not very significant, makes the data look like I had a catastrophe in my beer money, indicating that the economy is going to hell. Note that the Y-axis does not go to zero!
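The size of the distortion can be estimated by comparing how tall two bars appear when the axis starts at zero and when it starts higher up. The sketch below is illustrative only: the axis minimum of 35 dollars is an assumed value (the actual scale of the original figure is not known), and `apparent_ratio` is a made-up helper.

```python
# Leftover beer money in dollars for each month of 1995 (copied from the list above).
beer_money = {
    "Jan": 50, "Feb": 57, "Mar": 62, "Apr": 48, "May": 73, "Jun": 51,
    "Jul": 58, "Aug": 49, "Sep": 71, "Oct": 61, "Nov": 46, "Dec": 38,
}

def apparent_ratio(a, b, axis_min):
    """Ratio of the drawn heights of values a and b when the axis starts at axis_min."""
    return (a - axis_min) / (b - axis_min)

lowest = min(beer_money.values())    # 38 (Dec)
highest = max(beer_money.values())   # 73 (May)

honest = apparent_ratio(lowest, highest, axis_min=0)
# axis_min=35 is an assumption chosen for illustration, not the original figure's scale.
exaggerated = apparent_ratio(lowest, highest, axis_min=35)

print(f"axis from 0:  lowest month looks {honest:.0%} as tall as the highest")
print(f"axis from 35: lowest month looks {exaggerated:.0%} as tall as the highest")
```

With the axis starting at zero, December is drawn a little over half as tall as May; with the truncated axis it shrinks to under a tenth, even though the underlying numbers are the same.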