# OLS and Robust OLS Regression Analysis with Stata

Heteroskedasticity Diagnostics and Corrections

For this exercise, use newschools9810.dta. Please download the do-file for this assignment, "Class 8 Exercise 2014.do," from the course website, and perform all the required statistical operations as directed below.

Please submit:

a) A printed Stata log file documenting that you have completed each of the operations directed below, along with the results. As always, be sure to run only your corrected do-file so that your printed log file does not contain errors.

b) A printed write-up, which includes typed answers to the questions below.

1. Estimate an OLS regression with total expenditures per pupil as the dependent variable. Use the following independent variables:

• percent black

• percent free lunch

• total enrollment

• total enrollment squared

• percent Hispanic

• percent Asian

• percent full-time special education students

• percent immigrant

• percent female

• math z score

• year dummies

• a middle school dummy

• borough dummies
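A minimal sketch of how the specification in question 1 might be set up. Every variable name below (`exp_pp`, `pct_black`, `enroll`, `boro`, and so on) is hypothetical, because the posting does not list the names used in newschools9810.dta; check the real names before running anything.

```stata
* All variable names below are hypothetical; confirm the real ones with -describe-.
use newschools9810.dta, clear
describe

* c.enroll##c.enroll adds enrollment and its square.
* i.year and i.boro expand into year and borough dummies (both must be numeric).
regress exp_pp pct_black pct_frl c.enroll##c.enroll pct_hisp pct_asian ///
    pct_sped pct_immig pct_female math_z i.year middle i.boro
```

If the dataset already contains a squared-enrollment variable or ready-made dummies, those can be listed directly instead of using factor-variable notation.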

2. Graph the residuals versus total enrollment.

a. Copy and paste the graph into your typed write-up for submission.

b. What does the graph suggest about the possibility of heteroskedasticity?
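One possible way to produce the plot for question 2, again with hypothetical variable names and a hypothetical output file name. A residual spread that widens or narrows as enrollment changes would be consistent with heteroskedasticity; a roughly even band around zero would not.

```stata
* Hypothetical variable names; adjust to the dataset.
quietly regress exp_pp pct_black pct_frl c.enroll##c.enroll pct_hisp pct_asian ///
    pct_sped pct_immig pct_female math_z i.year middle i.boro

predict uhat, residuals       // uhat: new variable holding the OLS residuals
scatter uhat enroll, yline(0) // residuals against total enrollment
graph export resid_enroll.png, replace
```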

3. Re-estimate the equation in 1, and perform a White test using estat imtest with the white option.

a. Copy and paste the results of the White test into your typed write-up for submission.

b. Is there heteroskedasticity? How do you know?
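A sketch of the White test in question 3, assuming the same hypothetical variable names. The test is run directly after the plain OLS fit; its null hypothesis is homoskedasticity, so a small p-value is evidence against constant error variance.

```stata
* Hypothetical variable names; adjust to the dataset.
quietly regress exp_pp pct_black pct_frl c.enroll##c.enroll pct_hisp pct_asian ///
    pct_sped pct_immig pct_female math_z i.year middle i.boro

* White's general test for heteroskedasticity (H0: homoskedastic errors).
estat imtest, white
```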

4. Re-estimate the equation in 1 using robust standard errors. How do the standard errors estimated with the model in this question differ from the standard errors estimated with the model in question 1? Is this what you expected? Why?
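For question 4, one way to line up the two sets of standard errors is to store both fits and tabulate them side by side. This is only a sketch: the variable names and the stored-estimate names `ols` and `robust` are hypothetical.

```stata
* Hypothetical variable names; adjust to the dataset.
local rhs pct_black pct_frl c.enroll##c.enroll pct_hisp pct_asian ///
    pct_sped pct_immig pct_female math_z i.year middle i.boro

quietly regress exp_pp `rhs'
estimates store ols

quietly regress exp_pp `rhs', vce(robust)
estimates store robust

* Coefficients are identical across the two columns; only the standard errors change.
estimates table ols robust, se
```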

© BrainMass Inc. brainmass.com October 25, 2018, 9:32 am https://brainmass.com/statistics/regression-analysis/ols-ols-robust-regression-analysis-stata-577478

#### Solution Summary

This solution provides a detailed explanation of regression analysis in Stata. It discusses regression models that use the original variables, dummy variables, and transformed variables, and it interprets the regression output from each model, with particular attention to the heteroskedasticity assumption and the associated tests.

Fixed model - Robust Regression Interpretation of results

Question One:

This problem employs a dataset on labor markets in 23 OECD countries for the years 1980 to 1998.

The variables used in the analysis (followed by descriptive statistics) are:

1. Productivity index [prod] = An index measuring country i's economic output (GDP) per hour worked in year t, normalized such that each country's index = 100 in 1995.

2. Unemployment rate [unr] = The total number of unemployed workers in country i and year t divided by the total number of labor force participants in that country and year, multiplied by 100.

3. Union density [ud] = The ratio of total reported union members (minus retired and unemployed members) in country i and year t to the total number of employees earning wages or salaries in that country and year, multiplied by 100.

4. Public sector growth [gempl] = The one-year percentage growth (from year t-1 to year t) in public sector employment in country i (measured as a proportion, 0 to 1).

5. USD exchange rate growth [usd] = The one-year percentage growth (from year t-1 to year t) in the value of country i's currency relative to the US dollar (measured as a proportion, 0 to 1).

6. Labor force (1K) [lf] = The total number of labor force participants in country i and year t, in thousands.

(See attached)

1. While there are no missing years in the dataset, there are missing observations for some of the variables.

a. If there were no missing values for any variables, how many observations (country-years) would there be for every variable in the summary table of descriptive statistics presented above?

b. Given the number of observations for each variable shown in the summary table, knowing there are no missing years in the data, and knowing that Stata regression drops a case when there are any missing values for any variable for a given country in a given year, what is the maximum number of countries that can be used in a regression analysis (assuming nothing is done to replace missing values)?

c. Given that there are 19 years of data in the regression analyses presented in Table 1, how many countries were used in the analyses?

d. Could we estimate the effect of usd on prod with FE if the value of every country's currency (relative to the US dollar) remained the same over the sample time period? Why or why not? Please answer in 2-3 sentences.

2. Write the general equations for the specifications in columns (1) and (2). Use lowercase b for the regression coefficients and, where appropriate, a to indicate fixed effects and/or T to indicate time effects. Use the variable names presented in brackets [ ] on the prior page, and use subscripts as appropriate. You do not have to include an error term.

Column 1:

Column 2:

3. Using the models estimated without time effects, interpret:

A. The effect of a 5-percentage-point increase in union density.

B. The effect of a 10-percent increase in the growth of the public sector.

4. Compare the specifications with time effects to those without time effects. What do the differences in the statistically significant coefficients imply about the time effects? Note that the time effects are jointly statistically significant with a p-value of 0.00. Please answer in 2-3 sentences.
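A rough sketch of how country fixed-effects models with and without time effects could be estimated in Stata. Table 1 is not reproduced here, so treating `prod` as the dependent variable and the other five variables as regressors is an assumption, as are the panel identifier names `country` and `year`.

```stata
* country and year are hypothetical names for the panel identifiers;
* country must be numeric (use -encode- first if it is stored as a string).
xtset country year

* Country fixed effects only.
xtreg prod unr ud gempl usd lf, fe

* Country fixed effects plus year (time) effects.
xtreg prod unr ud gempl usd lf i.year, fe

* Joint F-test that all year effects are zero.
testparm i.year
```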

Question Two:

Table 2 presents results of a study of the effect of differences in the fraction of new immigrants on crime rates in U.S. metropolitan areas (MAs) over nine years.

a. Write the general equation for the regression in column 2. Use β for the regression coefficients (not the actual numbers in column 2), use the variable names presented in the table in brackets [ ], and use subscripts as appropriate. If appropriate, use MA and T as fixed effects.

b. Using the results in column 2:

i. Ignoring significance, what is the effect of a 20-percentage-point (.2 fraction) increase in new immigrants on the overall crime rate?

ii. Form a 95% confidence interval around the effect you've just calculated.

c. Two parts:

i. In what two ways does the coefficient on Fraction of new immigrants [IMM] differ between columns 1 and 2?

ii. Why does it differ and what does this indicate about the estimates in columns 1 and 2?

d. Using the results in column 2: For an MA with 10% (.10 fraction) Hispanic population, what is the effect of a one percentage point (.01 fraction) increase in the percent female on the metropolitan crime rate?

e. Using the results in column 2, what is the effect of a one percent increase in population of an MA on the overall crime rate of the MA?

f. Two parts:

i. What hypotheses do the p values for F's in column 2 at the bottom of the table test? (Hint: There are two different p values (F's) and thus two different hypotheses.)

ii. What do you conclude from the tests?
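For part b, the scaled effect and its confidence interval can be obtained from the reported numbers by hand: multiply both the coefficient and its standard error by 0.2, then take the scaled coefficient plus or minus roughly 1.96 scaled standard errors. With the data in hand, `lincom` does the same calculation. The sketch below assumes the column 2 model is the most recent estimation command and that the regressor is named `IMM`, as in the table's bracket notation.

```stata
* Assumes the column 2 regression has just been estimated
* and that the fraction of new immigrants is stored as IMM.
* Reports 0.2 times the coefficient, its standard error, and a 95% interval.
lincom 0.2*IMM, level(95)
```

Because the dependent variable is the log of the crime rate, the resulting effect and interval are expressed in log points.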

Table 2

Regression coefficients: Log metropolitan area (MA) overall crime rate (CR) on various variables

(See attached)

Source: Calculations from Current Population Surveys (CPSs) and Uniform Crime Reports (UCR)

Notes: Robust standard errors are in parentheses; a constant is included but not shown.

~ p-value from an F-test

z: "Fraction" varies from 0 to 1 and differs in measurement from percent, which varies from 0 to 100.