Deliverable 1: Descriptive Statistics
(10 points) Our main emphasis in the course is "inferential" statistics, which means taking samples and drawing inferences or conclusions about the population.
Access raw data or a database from government, business, health, and similar official Web sites pertaining to your area of interest. Collect at least 30 pieces of numerical (quantitative) metric data (see p.15) but no more than an n of 50 (30-50 observations and only one theme). If you have a sample larger than 50, randomly select a subset so your n (sample size) is no more than 50. Explain where these data came from and why they are of interest to you.
Describe the population and the variable. From the data, plot a histogram, a stem-and-leaf diagram and an ogive (polygon). Also calculate the mean, median, mode, range, standard deviation, and quartiles of the data. Create a boxplot. Explain what this analysis tells you.
In a separate appendix (or spreadsheet), list all 30-50 observations labeled from 1 to 30 (up to 50 if n=50) so I can duplicate your work if necessary.
Since you will be doing a histogram, you will need to select a sample that consists of numerical (quantitative) data, not categorical (qualitative) data (see p.15). A bar graph is not the same as a histogram, so in Excel, click bars (right click properties) /format data series/options/gap width (should equal zero) and this should get rid of the gaps (histograms do not contain gaps; bar graphs are used only for categorical data). Your textbook describes other methods to provide an acceptable histogram and other graphics. If you have a category/class in the data with zero observations, then try to get rid of the gap by extending the width of the class interval, or at the very least explain it in your comments. Histograms and other descriptive statistics should not add to the confusion or generate more questions but should answer and explain the data. Look at your descriptive statistics and ask whether a reader would have any questions, and whether you can answer them by modifying the descriptive statistics or adding a comment or label. See also p. 81, exhibits 2.1 and 2.2.
I went to fedstats.gov, which has a huge range of statistical tables. Then I found a page with statistics on trade (http://ita.doc.gov/td/industry/otea/usfth/tabcon.html) and found a table with total US exports by product (http://ita.doc.gov/td/industry/otea/usfth/aggregate/H03T41.html). If you only look at the data from the most recent year (2003), which I saved as an Excel file, then this is an example of cross-sectional data rather than time-series data, as specified by your professor.
There are over 450 different categories of exported goods, but you only need 50, so we're going to select 50 at random. In the Excel file, the categories are in rows 3 to 456, so we're going to choose 50 random numbers between 3 and 456 and use only those categories. I found a random number generator (http://graphpad.com/quickcalcs/randomN1.cfm) and did this. Here are the numbers that were generated:
Each value was randomly selected, with an equal chance of choosing any integer between 3 and 456.
These are the rows that we're going to select to do the analyses with. These are in the second sheet in the Excel file.
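If you would rather not rely on a web page, the same kind of selection can be sketched in a few lines of Python. This is only an illustration, not how the numbers above were produced: the function name and the seed are made up for the example, and it assumes you want 50 different rows (no repeats), which an online generator may or may not guarantee.

```python
import random

FIRST_ROW, LAST_ROW = 3, 456   # spreadsheet rows holding the export categories
SAMPLE_SIZE = 50               # the assignment allows at most 50 observations

def pick_rows(seed=None):
    """Return SAMPLE_SIZE distinct row numbers, each row equally likely to be chosen."""
    rng = random.Random(seed)
    # range() excludes its upper bound, so add 1 to include LAST_ROW.
    return sorted(rng.sample(range(FIRST_ROW, LAST_ROW + 1), SAMPLE_SIZE))

# The seed is arbitrary (an assumption for this sketch); it only makes the draw repeatable.
rows = pick_rows(seed=2003)
print(rows)
```

Fixing the seed means anyone who reruns the script gets the same 50 rows, which helps when someone needs to duplicate your work.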
The data we're going to be working with is a random sample from the population of all US exports in 2003. The variable we're interested in is the value of those exports in millions of dollars.
Descriptive Statistics (mean, median, mode, range, standard deviation, quartiles)
All of these calculations can be found in the third Excel sheet.
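If you want to double-check the spreadsheet outside Excel, the statistics in this section can be sketched with Python's standard statistics module. The eight values below are hypothetical placeholders, not the export data, and the function name describe is made up for this example. Quartile conventions differ between tools, so the quartiles here may not match Excel's exactly.

```python
import statistics

def describe(values):
    """Return the descriptive statistics the assignment asks for."""
    ordered = sorted(values)
    # n=4 splits the data into quarters; the three cut points are Q1, Q2 (median), Q3.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "modes": statistics.multimode(ordered),   # a list, since ties are possible
        "range": ordered[-1] - ordered[0],        # largest minus smallest
        "stdev": statistics.stdev(ordered),       # sample standard deviation (n - 1)
        "quartiles": (q1, q2, q3),
    }

# Hypothetical values for illustration only; substitute your own 30-50 observations.
sample = [2, 4, 4, 5, 7, 9, 10, 12]
summary = describe(sample)
for name, value in summary.items():
    print(name, value)
```

statistics.stdev divides by n - 1, which is the usual choice when the data are a sample rather than the whole population.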
Mean: The mean is the average of all the numbers. You calculate it by adding all the numbers together and dividing by the number of observations:
127,519/50 = 2550.38
The average value of an export is 2550.38 million dollars ($2,550,380,000).
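The arithmetic above is easy to check in Python. This short sketch uses only the two numbers already given (the total of 127,519 and the 50 observations); the variable names are just for illustration.

```python
total_value = 127_519   # sum of the 50 sampled export values, in millions of dollars
n = 50                  # number of observations

mean_millions = total_value / n
# Convert to dollars before dividing so the result stays exact.
mean_dollars = total_value * 1_000_000 / n

print(f"{mean_millions:.2f} million dollars")
print(f"${mean_dollars:,.0f}")
```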
Median: The median is the middle number. You find it by sorting all the numbers from smallest to largest, then taking the number in the middle of the list. When there is an even number of observations (like here), there is no "middle" number. In that case, you take the average of the two middle numbers. You can do this by hand if you want, but 50 numbers is a long list to sort and go ...
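The sort-and-pick-the-middle rule described above can be sketched in Python as follows. The six values are hypothetical and only show the even-count case; with 50 observations, the two middle numbers would be the 25th and 26th in the sorted list.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: a single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the middle pair

# Hypothetical values for illustration only.
example = [12, 3, 40, 7, 25, 18]
print(median(example))   # sorted: 3, 7, 12, 18, 25, 40, so (12 + 18) / 2
```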
A sample of numerical data (values of US exports in 2003) was used in an extensive descriptive analysis.
The mean, median, mode, range, standard deviation, and quartiles (five-number summary) were calculated. Definitions of each of those statistics and explanations of how to calculate them are included in the solution.
Then, graphs (histogram, stem-and-leaf, ogive, boxplot) describing the data were created. Definitions of each of those kinds of graphs and explanations of how to make them in Excel are included in the solution.
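The numbers behind a histogram and an ogive are just class counts and running totals, so they can be sketched without any charting tool. The Python below is an illustration under assumptions: the function name, the class width, and the eight data values are all hypothetical, and the solution itself builds these graphs in Excel.

```python
def frequency_table(values, class_width, start):
    """Count observations per class [lower, lower + class_width) and accumulate them."""
    n_classes = int((max(values) - start) // class_width) + 1
    counts = [0] * n_classes
    for v in values:
        counts[int((v - start) // class_width)] += 1

    table, running = [], 0
    for i, count in enumerate(counts):
        running += count                      # cumulative frequency, plotted on the ogive
        lower = start + i * class_width
        table.append((lower, lower + class_width, count, running))
    return table

# Hypothetical values for illustration only; substitute your own observations.
sample = [2, 4, 4, 5, 7, 9, 10, 12]
table = frequency_table(sample, class_width=5, start=0)
for lower, upper, count, running in table:
    # The '#' bar is a text histogram; 'cum' is the height of the ogive at 'upper'.
    print(f"{lower:>4} to {upper:<4} {'#' * count:<10} cum={running}")
```

If a class comes out with zero observations, rerunning with a larger class_width is one way to close the gap, in the same spirit as widening the class interval in Excel.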
Finally, there is a discussion of what can be learned from the descriptive statistics and the graphs.