Each line of the data set has an identification number and provides information on 11 other variables for a single hospital. The data presented here are for the 1975-76 study period. The 12 variables are:
Number  Variable name                      Description
1       Identification number              1-113
2       Length of stay                     Average length of stay of all patients in hospital (in days)
3       Age                                Average age of patients (in years)
4       Infection risk                     Average estimated probability of acquiring infection in hospital
5       Routine culturing ratio            Ratio of number of cultures performed to number of patients without signs or symptoms of hospital-acquired infection, times 100
6       Routine chest X-ray ratio          Ratio of number of X-rays performed to number of patients without signs or symptoms of pneumonia, times 100
7       Number of beds                     Average number of beds in hospital during study period
8       Medical school affiliation         1=Yes, 2=No
9       Region                             Geographic region, where: 1=NE, 2=NC, 3=S, 4=W
10      Average daily census               Average number of patients in hospital per day during study period
11      Number of nurses                   Average number of full-time equivalent registered and licensed practical nurses during study period (number full time plus one half the number part time)
12      Available facilities and services  Percent of 35 potential facilities and services that are provided by the hospital
The data are in a file called "senic.txt". Use SAS to solve all problems, and attach all code.
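As a starting point, the file can be read into a SAS data set along the following lines. This is only a sketch: it assumes "senic.txt" is whitespace-delimited, has no header row, and lists the 12 variables in the order given above. The short variable names (los, infrisk, cult, and so on) are made up for illustration and are not part of the data file.

```sas
/* Read the SENIC data; adjust the path to wherever senic.txt is stored. */
data senic;
    infile "senic.txt";   /* assumed: blank-delimited, no header line */
    /* Hypothetical names, in the order of variables 1-12 in the table */
    input id los age infrisk cult xray beds medschool region census nurses fac;
run;

/* Quick checks that the file was read as expected */
proc print data=senic(obs=5);
run;

proc means data=senic n mean std min max;
run;
```

If the read is correct, PROC MEANS should report the same number of observations for every variable, and the identification number should range over the values listed in the table.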
1. Two models have been proposed for predicting the average length of patient stay in a hospital (Y). Model I utilizes as predictor variables age, infection risk, and available facilities and services. Model II uses as predictor variables number of beds, infection risk, and available facilities and services.
a. For each of the two proposed models, fit a first-order regression model with three predictor variables.
b. Obtain the correlation matrix for Model I and for Model II. Interpret these results.
c. Calculate R^2 for each model. Is one model clearly preferable in terms of this measure?
d. For each model, obtain the residuals. In terms of the residuals, is one model clearly more appropriate than the other?
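One possible way to set up problem 1 in SAS is sketched below. It assumes the data file layout and the hypothetical variable names (los, age, infrisk, beds, fac) described in the comments; none of these names come from the assignment itself. PROC CORR gives the correlation matrices, and PROC REG reports the fit statistics and saves residuals for each model.

```sas
data senic;
    infile "senic.txt";   /* assumed: blank-delimited, no header line */
    input id los age infrisk cult xray beds medschool region census nurses fac;
run;

ods graphics on;          /* lets PROC REG draw its residual diagnostics */

/* Correlation matrices of the response and predictors for each model */
proc corr data=senic;
    var los age infrisk fac;      /* Model I  */
run;
proc corr data=senic;
    var los beds infrisk fac;     /* Model II */
run;

/* First-order fits; R-square appears in the fit statistics table */
proc reg data=senic;
    ModelI: model los = age infrisk fac;
    output out=resid1 r=resid p=fitted;
run; quit;

proc reg data=senic;
    ModelII: model los = beds infrisk fac;
    output out=resid2 r=resid p=fitted;
run; quit;
```

The data sets resid1 and resid2 hold the residuals and fitted values, so the two models can be compared with the same residual plots.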
2. For each geographic region, regress infection risk against the predictor variables age, routine culturing ratio, average daily census, and available facilities and services.
a. Use a first-order regression model with four predictor variables. State the estimated regression functions.
b. Are the estimated regression functions similar for the four regressions? Discuss.
c. Calculate MSE and R^2 for each region. Are these measures similar for the four regions? Discuss.
d. Obtain the residuals for each fitted model. State your findings.
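The four regional regressions can be run in one step with BY-group processing. The sketch below assumes the same hypothetical variable names and file layout as in the comments; it is one reasonable setup, not the only one.

```sas
data senic;
    infile "senic.txt";   /* assumed: blank-delimited, no header line */
    input id los age infrisk cult xray beds medschool region census nurses fac;
run;

/* BY-group processing requires the data to be sorted by region */
proc sort data=senic;
    by region;
run;

/* One first-order fit per region (1=NE, 2=NC, 3=S, 4=W) */
proc reg data=senic;
    by region;
    model infrisk = age cult census fac;
    output out=resid_by_region r=resid p=fitted;
run; quit;
```

For each region, the ANOVA table gives the error mean square (MSE) and the fit statistics give R-square. The data set resid_by_region keeps the residuals along with the region code, so they can be plotted separately for each region.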
3. For predicting the average length of patient stay in a hospital (Y), it has been decided to include age and infection risk as predictor variables. Assume a first-order model.
a. For each of the following variables, calculate the coefficient of partial determination given that age and infection risk are included in the model: routine culturing ratio, average daily census, and available facilities and services.
b. Using the F test statistic, test whether or not the variable determined to be best in part (a) is helpful in the regression model when age and infection risk are included in the model; use . Hint: use a partial F-test. How many tests do you need for this problem?
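A sketch of how problem 3 might be approached in SAS follows, again with hypothetical variable names. The PCORR2 option of PROC REG prints squared partial correlations based on Type II sums of squares; for the variable added last, that value is the coefficient of partial determination given age and infection risk, that is, SSR(Xk | age, infection risk) / SSE(age, infection risk). The TEST statement gives the partial F statistic for dropping a single variable. It is attached to the census model purely as a placeholder and should be moved to whichever variable turns out best in part (a).

```sas
data senic;
    infile "senic.txt";   /* assumed: blank-delimited, no header line */
    input id los age infrisk cult xray beds medschool region census nurses fac;
run;

proc reg data=senic;
    /* Read the PCORR2 value on the row of the third (added) variable */
    AddCult:   model los = age infrisk cult   / pcorr2;
    AddCensus: model los = age infrisk census / pcorr2;
    /* Placeholder partial F-test: H0 is that the added coefficient is 0.
       A TEST statement applies to the MODEL statement just before it. */
    PartialF:  test census = 0;
    AddFac:    model los = age infrisk fac    / pcorr2;
run; quit;
```

With one added variable, the partial F statistic equals the square of the t statistic reported for that variable in the parameter estimates table, which gives a simple cross-check.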
4. Fit a second-order model for Y with available facilities and services as the predictor. Test whether the quadratic term can be dropped from the model. State the null and alternative hypotheses, perform the appropriate test, and draw conclusions.
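For problem 4, a second-order model can be written directly in PROC GLM, which accepts a product such as fac*fac of a continuous variable as a model effect. The sketch below assumes Y is length of stay (named los here) and uses the same hypothetical names as in the comments. The hypotheses are H0: the coefficient of the squared term is 0, against Ha: it is not 0.

```sas
data senic;
    infile "senic.txt";   /* assumed: blank-delimited, no header line */
    input id los age infrisk cult xray beds medschool region census nurses fac;
run;

proc glm data=senic;
    /* Linear and quadratic terms in available facilities and services */
    model los = fac fac*fac;
run; quit;
```

The t test for fac*fac in the parameter estimates table, or equivalently its Type III F test, addresses the quadratic term. Centering the predictor before squaring it is a common way to reduce the correlation between the linear and quadratic terms; it does not change the test of the quadratic term.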
5. Length of stay is to be predicted, and the pool of potential predictor variables includes all variables in the data set. It is believed that a model with as the response variable and the predictor variables in first-order terms will be appropriate. Using the variables above and appropriate indicator variables for region, perform the following analysis.
a. Obtain the correlation matrix of the X variables. Is there evidence of strong pairwise linear association among the predictor variables here?
b. Find the best subset according to the adjusted R-square criterion.
c. Obtain the three best subsets according to the Cp criterion.
d. Obtain three best subsets according to BIC scores.
e. Using stepwise model selection, what is the best overall model?
f. Is there any indication of a multicollinearity problem in the best overall model?
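The steps of problem 5 could be organized roughly as follows. Several things here are assumptions and should be adjusted as needed: the variable names are hypothetical, untransformed length of stay (los) stands in for the response, region 4 (W) is taken as the reference level, and the SBC option of PROC REG (Schwarz's Bayesian criterion) is used for the BIC scores. PROC REG also has a separate BIC option, which computes Sawa's criterion. The variables listed in the macro variable chosen are placeholders only, not the result of the selection.

```sas
data senic;
    infile "senic.txt";   /* assumed: blank-delimited, no header line */
    input id los age infrisk cult xray beds medschool region census nurses fac;
run;

data senic5;
    set senic;
    med  = (medschool = 1);   /* 1 = affiliated, 0 = not affiliated */
    reg1 = (region = 1);      /* NE */
    reg2 = (region = 2);      /* NC */
    reg3 = (region = 3);      /* S; region 4 (W) is the reference */
run;

%let xvars = age infrisk cult xray beds med reg1 reg2 reg3 census nurses fac;

/* (a) Correlation matrix of the X variables */
proc corr data=senic5;
    var &xvars;
run;

/* (b), (c), (e) Best subsets and stepwise selection */
proc reg data=senic5;
    BestAdjRsq: model los = &xvars / selection=adjrsq best=1;
    BestCp:     model los = &xvars / selection=cp best=3;
    Stepwise:   model los = &xvars / selection=stepwise;  /* default entry/stay levels */
run; quit;

/* (d) Save every subset with its SBC, then list the three smallest */
proc reg data=senic5 outest=subsets noprint;
    model los = &xvars / selection=rsquare sbc;
run; quit;

proc sort data=subsets;
    by _sbc_;
run;

proc print data=subsets(obs=3);
run;

/* (f) Variance inflation factors for the chosen model.
   Placeholder list: replace with the variables actually selected in (e). */
%let chosen = age infrisk census;
proc reg data=senic5;
    model los = &chosen / vif;
run; quit;
```

In this setup each region indicator can enter or leave a subset on its own; if the three indicators should be kept or dropped together, the selection has to be arranged differently.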
The solution provides a step-by-step method for carrying out the multiple regression analysis in SAS. Formulas for the calculations and interpretations of the results are also included.