# Statistics: Analyzing Statistics

Consider the file "advertising.xls" (attached), which lists magazine titles, the cost of a full-color page advertisement (page), audience (subscribers), male percentage of subscribers, and household income. The objective of this project is to find out whether there is any relationship among the variables using regression analysis techniques. You are to write a report about your findings after analyzing the data set. The report should go beyond a basic regression: for example, you may use tools such as confidence interval estimates and one- or two-sample tests on the data to improve the quality of your report.

a) State your statistical objective for this data set.

b) Perform exploratory data analysis (Section 3.3), such as numerical measures or the box-and-whisker plot for this data set.

c) Construct scatter diagrams for pairs of variables. Describe the relationship that you may see. Do these appear to have some association (linear or non-linear)?

d) Does a linear model appear to hold for any pair of variables? You may want to run some testing to substantiate why or why not.

e) Apply the best-subsets approach to model building to see if there is any variable that shouldn't be used for this model.

f) Consider the male percentage of subscribers as categorical data: for example, if it is more than 66%, code it as "male magazine"; between 33% and 66% as "gender free"; and less than 33% as "female magazine." Then introduce dummy variables for these categories. Will this give you a meaningful (better) output for this model, given that some households use male names to subscribe to any magazine? Can you introduce any other dummy variables to improve your analysis? A new dummy variable can be created from the data itself or from external data.

g) Once you determine which variables are to be used, perform a multiple regression analysis, including a check for collinearity, on this subset of variables.

h) Summarize and comment on your results.
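The coding scheme described in part f) can be sketched in Python as follows. This is a minimal illustration, not part of the original solution: the function names `classify_magazine` and `dummy_code` are hypothetical, and treating "gender free" as the baseline category (and assigning the exact boundary values 33% and 66% to it) is an assumption, since the problem statement does not say how ties are handled.

```python
def classify_magazine(male_pct):
    """Map the male share of subscribers (in percent) to the three labels in part f)."""
    if male_pct > 66:
        return "male magazine"
    if male_pct < 33:
        return "female magazine"
    # Assumption: the boundary values 33 and 66 fall into the middle category.
    return "gender free"


def dummy_code(male_pct):
    """Return (is_male_magazine, is_female_magazine).

    Three categories need only two dummy variables; "gender free" is the
    assumed baseline and is coded as (0, 0).
    """
    label = classify_magazine(male_pct)
    return (int(label == "male magazine"), int(label == "female magazine"))


# Illustrative inputs: the minimum, median, and maximum male percentages
# reported in the solution's descriptive statistics.
for pct in (3.6, 45.35, 88.5):
    print(pct, classify_magazine(pct), dummy_code(pct))
```

Using two dummies rather than three avoids perfect collinearity with the regression intercept; which category serves as the baseline only changes how the coefficients are interpreted.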

#### Solution Preview

a) State your statistical objective for this data set.

First, we want to examine whether there is a significant relationship between any pair of variables.

Second, we want to evaluate

b) Perform exploratory data analysis (Section 3.3), such as numerical measures or the box-and-whisker plot for this data set.

To understand these four variables, we run Descriptive Statistics under Data Analysis in Excel. The outputs are:

| Statistic | Cost ($)/Page (color ad) | Audience (x 1000) |
| --- | --- | --- |
| Mean | 87496.875 | 1177.729 |
| Standard Error | 6748.024319 | 168.4355 |
| Median | 75880 | 673.85 |
| Mode | 162000 | #N/A |
| Standard Deviation | 46751.68388 | 1166.955 |
| Sample Variance | 2185719946 | 1361785 |
| Kurtosis | -0.407283613 | 2.309768 |
| Skewness | 0.6934077 | 1.618175 |
| Range | 180900 | 5028 |
| Minimum | 17100 | 164.5 |
| Maximum | 198000 | 5192.5 |
| Sum | 4199850 | 56531 |
| Count | 48 | 48 |

| Statistic | Male (%) | Median Income |
| --- | --- | --- |
| Mean | 41.26667 | 48024.06 |
| Standard Error | 3.759262 | 1548.718 |
| Median | 45.35 | 46251 |
| Mode | 68.8 | #N/A |
| Standard Deviation | 26.04493 | 10729.84 |
| Sample Variance | 678.3384 | 1.15E+08 |
| Kurtosis | -1.43734 | 0.601742 |
| Skewness | 0.035482 | -0.10276 |
| Range | 84.9 | 56759 |
| Minimum | 3.6 | 15734 |
| Maximum | 88.5 | 72493 |
| Sum | 1980.8 | 2305155 |
| Count | 48 | 48 |

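From these outputs, the skewness values show that the distributions of cost, audience, and male percentage are positively skewed, while the distribution of median income is negatively skewed. The skew is pronounced only for audience (1.62) and moderate for cost (0.69); male percentage (0.04) and median income (-0.10) are close to symmetric.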

Further, neither median income nor audience has a mode (Excel reports #N/A), which means no value repeats in those columns.
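The Standard Error row in the Excel output is the sample standard deviation divided by the square root of the count, and it is also the building block for the confidence interval estimates mentioned in the problem statement. The sketch below recomputes it from the reported means and standard deviations and forms a 95% interval for each mean. The critical value `T_CRIT` is an assumption taken from a t table (two-sided 95%, 47 degrees of freedom), and the helper names are hypothetical.

```python
import math

N = 48  # Count reported for every variable

# Mean and sample standard deviation copied from the Excel output.
summary = {
    "cost": (87496.875, 46751.68388),
    "audience": (1177.729, 1166.955),
    "male_pct": (41.26667, 26.04493),
    "median_income": (48024.06, 10729.84),
}

# Assumption: t critical value for a two-sided 95% interval with N - 1 = 47
# degrees of freedom, read from a t table rather than computed here.
T_CRIT = 2.0117


def standard_error(sd, n):
    """Standard error of the mean: s / sqrt(n)."""
    return sd / math.sqrt(n)


def mean_ci(mean, sd, n, t_crit):
    """Confidence interval for the mean: mean +/- t * s / sqrt(n)."""
    margin = t_crit * standard_error(sd, n)
    return mean - margin, mean + margin


for name, (mean, sd) in summary.items():
    se = standard_error(sd, N)
    low, high = mean_ci(mean, sd, N, T_CRIT)
    print(f"{name}: SE = {se:.4f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The recomputed standard errors should agree with the Excel row up to rounding, which is a quick consistency check on the table.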

For the audience data, Q1 = 365.9 and Q3 = 1667.1, so the interquartile range is IQR = Q3 - Q1 ≈ 1,301.2 and the upper fence is Q3 + 1.5 × IQR = 3618.925. Two observations in the data set lie above this fence and are therefore outliers: 4091.7 and 5192.5. The lower fence, Q1 - 1.5 × IQR, is negative, so there are no outliers on the low side.
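The 1.5 × IQR rule used for the audience variable can be written out as a short sketch. The quartiles are the rounded values quoted in the text, so the upper fence comes out at roughly 3618.9 rather than exactly 3618.925. The full data set is not reproduced here; the candidate list contains only audience values that appear in the text or the summary table, and the function name `tukey_fences` is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Rounded quartiles of the audience variable, as quoted in the text.
Q1, Q3 = 365.9, 1667.1
lower, upper = tukey_fences(Q1, Q3)

# Only values quoted in the document: minimum, median, and the two largest.
candidates = [164.5, 673.85, 4091.7, 5192.5]
outliers = [x for x in candidates if x < lower or x > upper]

print(f"lower fence = {lower:.1f}, upper fence = {upper:.1f}")
print("flagged:", outliers)
```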

c) Construct scatter diagrams for pairs of variables. Describe the relationship that you may see. Do these appear to have some association (linear or non-linear)?

From this scatter plot, we could see that there is a positive linear relationship between cost and audience.

From this scatter plot, the relationship between male and cost is nonlinear.

From this scatter plot, the relationship between cost and median income is nonlinear.

From this scatter plot, we could conclude that the relationship between audience and male is non-linear.

From this scatterplot, we could see that the relationship between audience and median income is linear (negative).

From this scatter plot, we could see that the relationship between male and median income is linear (positive).

d) Does a linear model appear to hold for any pair of variables? You may want to run some testing to substantiate why or why not.

To see whether there is a significant linear relationship between any pair of variables, we first compute the correlation matrix for the data set and obtain the following output:

| | Cost ($)/Page (color ad) | Audience (x 1000) | Male (%) | Median Income |
| --- | --- | --- | --- | --- |
| Cost ($)/Page (color ad) | 1 | | | |
| Audience (x 1000) | 0.881209517 | 1 | | |
| Male (%) | -0.097135749 | -0.196588991 | 1 | |
| Median Income | -0.191974589 | -0.386158169 | 0.568937143 | 1 |
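One standard way to check whether a sample correlation differs significantly from zero is the t test with statistic t = r·sqrt(n - 2) / sqrt(1 - r²) on n - 2 degrees of freedom. The sketch below applies it to the six correlations in the matrix with n = 48. It is offered as an illustration of that common test and is not necessarily the procedure the original solution goes on to use. The critical value `T_CRIT` is an assumption read from a t table (two-sided 5% level, 46 degrees of freedom), and the short variable labels are hypothetical.

```python
import math

N = 48  # number of magazines

# Assumption: two-sided 5% critical value of t with N - 2 = 46 degrees of
# freedom, taken from a t table.
T_CRIT = 2.0129

# Pairwise correlations copied from the correlation matrix.
correlations = {
    ("cost", "audience"): 0.881209517,
    ("cost", "male"): -0.097135749,
    ("cost", "income"): -0.191974589,
    ("audience", "male"): -0.196588991,
    ("audience", "income"): -0.386158169,
    ("male", "income"): 0.568937143,
}


def corr_t_stat(r, n):
    """t statistic for H0: population correlation equals zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


results = {pair: corr_t_stat(r, N) for pair, r in correlations.items()}
significant = sorted(pair for pair, t in results.items() if abs(t) > T_CRIT)

for pair, t in results.items():
    verdict = "significant" if abs(t) > T_CRIT else "not significant"
    print(f"{pair[0]} vs {pair[1]}: t = {t:.3f} ({verdict} at the 5% level)")
```

Under these assumptions, a pair is flagged when its |t| exceeds the critical value; squaring r also gives the share of variance a simple linear regression on that pair would explain.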

To further substantiate if the linear relationship is significant, we need to ...

#### Solution Summary

The expert analyzes statistics for statistical objectives.