Original document: http://zebu.uoregon.edu/1998/es202/distribution.html
Last modified: Wed Jan 28 02:39:56 1998

A distribution represents the number of times that a variable takes a particular value. A distribution can assume any form, but most can be well approximated by certain standard forms, as we will see later. A distribution is also bounded by its minimum and maximum values; the interval between them is called the range of the data.


Let's return to the rainfall example for Eugene. The following is the annual rainfall data for the last 25 years:

In these data, the range runs from a minimum of approximately 30 inches per year to a maximum of 66 inches per year. Over this interval we have 25 separate measurements that define the distribution. We want to use this distribution to get a feel for the rainfall characteristics in Eugene. This brings up the important point of

Resolution vs. Frequency of Events

For instance, here is the distribution if you make the resolution so fine that each individual year is plotted:

This representation of the data is not useful. There are many more bins than there are data points, and it's very difficult to see what the most probable value or range of values really is. In general, the number of bins that you divide your data into should be no more than half the number of data points. For this case we have 25 data points, and therefore we should not have any more than 12 bins. Rounding the range of the data out to 30 to 70 inches gives a span of 40 inches, so if we choose 10 bins then each bin is 4 inches wide. In general, the size of a bin should correspond to some level of significance that you think exists in the data.
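The arithmetic in the rule of thumb above can be sketched in a few lines of Python. The function names `max_bins` and `bin_width` are illustrative choices, not anything from the lecture; the numbers (25 points, 30 to 70 inches, 10 bins) are the ones quoted in the text.

```python
def max_bins(n_points):
    """Rule of thumb from the text: at most half as many bins as data points."""
    return n_points // 2


def bin_width(low, high, n_bins):
    """Width of each bin when the interval [low, high] is split into n_bins equal bins."""
    return (high - low) / n_bins


n_points = 25          # years of rainfall data
low, high = 30, 70     # inches, the rounded range used in the text

print(max_bins(n_points))         # 12
print(bin_width(low, high, 10))   # 4.0
```

Choosing 10 bins rather than the limit of 12 keeps the bin edges on whole numbers of inches.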

For instance, differences of 0.1 inches in annual rainfall from one year to the next are probably not significant. Differences of 1 or 2 inches might be. You can use that to determine the size of the bin, but you also need enough data points per bin to make the results meaningful.

Here is the data with a bin size of 1 inch:

Again, the number of data points per bin is too small for this representation of the data to be meaningful.

Increasing the bin size to 4 inches (thereby decreasing the number of bins to 10, since 10 x 4 = 40 inches, which is the range of the data) yields:

This is a meaningful distribution as it now shows the probable range of values directly to be 45--55 inches.
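The effect of bin size can be tried out with a small sketch like the one below. The Eugene rainfall record itself is not reproduced here, so the 25 values are synthetic: they are drawn from a made-up bell-shaped distribution (mean 50, spread 8, both assumptions) and clipped to the 30 to 70 inch range. The `histogram` helper is likewise a hypothetical name written for this illustration.

```python
import random


def histogram(values, low, high, width):
    """Count how many values fall in each bin of the given width over [low, high]."""
    n_bins = int(round((high - low) / width))
    counts = [0] * n_bins
    for v in values:
        if not (low <= v <= high):
            continue  # ignore anything outside the plotted range
        # A value exactly equal to `high` is placed in the last bin.
        i = min(int((v - low) // width), n_bins - 1)
        counts[i] += 1
    return counts


# Synthetic stand-in data, NOT the real Eugene measurements.
rng = random.Random(0)
values = [min(max(rng.gauss(50, 8), 30), 70) for _ in range(25)]

fine = histogram(values, 30, 70, 1)     # 40 bins, 1 inch each
coarse = histogram(values, 30, 70, 4)   # 10 bins, 4 inches each

# Print the coarse histogram as rows of asterisks.
for i, count in enumerate(coarse):
    left = 30 + 4 * i
    print(f"{left:2d}-{left + 4:2d} in: {'*' * count}")
```

With 1 inch bins most bins hold zero or one point, while the 4 inch bins collect several points each, which is what makes a peak visible.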

Also be aware of the Ross Perot type of distribution, in which the vertical axis is exaggerated in order to give a false impression of the data.

For instance, the following represents my average leftover beer money at the end of each month for 1995. This is well known to be a leading economic indicator:

A normal representation of the data in real dollars looks like this:

An exaggerated vertical scale, in which each division on the axis represents an insignificant difference, makes the data look like I had a catastrophe in my beer money, indicating the economy is going to hell. Note that the y-axis does not go to zero!
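A rough way to quantify this distortion is to compare how tall two values are drawn above the bottom of the axis. The sketch below uses hypothetical dollar amounts (100 and 95), not the beer money figures from the plot, and the helper name `drawn_height_ratio` is invented for this example.

```python
def drawn_height_ratio(a, b, axis_min=0.0):
    """Ratio of the plotted heights of a and b when the y-axis starts at axis_min."""
    return (a - axis_min) / (b - axis_min)


start, end = 100.0, 95.0   # hypothetical amounts: a real drop of 5%

# Axis starting at zero: the second point is drawn 95% as high as the first.
print(drawn_height_ratio(end, start))                # 0.95

# Axis starting at 90: the same point is drawn only half as high.
print(drawn_height_ratio(end, start, axis_min=90))   # 0.5
```

The data are identical in both cases; only the choice of where the axis starts turns a 5% change into something that looks like a 50% collapse.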
