# Regression Analysis and ANOVA

Kindly generate the necessary charts using Genstat software. Please complete Questions 2 and 3 in the attached document.

See attached file for full problem description.

-----------------------------------------------------------------------------

Question 2

An experiment was carried out to investigate the effect of three factors on the survival of the bacterium Salmonella typhimurium. Three levels of sorbic acid, three pH (acidity) levels and six levels of water activity were used. The experiment was run once at each possible combination of factor levels. The response variable was the logarithm of the density of bacteria (per milliliter) seven days after treatment started. (Previous experience with data like these indicated that the log transformation would be appropriate.)

(i) Suppose you are going to perform the usual significance tests in analyzing these data. Which main effects and/or interactions would you have to leave out of the model initially? Give your reasons for leaving out any interactions or main effects.

(ii) Using the model you described in part (a)(i), analyze the data using the GENSTAT analysis of variance commands. Your analysis should include appropriate plots of residuals. Write a brief report on the results of your analysis. The report should make it clear how the mean response does (or does not) depend on the levels of the three factors concerned. You may well wish to include appropriate plots and/or tables of means. Your report should also make it clear whether there is anything about the data, or about the assumptions you have made in the analysis, that throws doubt on your conclusions.

Does the analysis you have carried out throw any light on the appropriateness of your choice in part (a)(i) of interactions or main effects to omit?

(iii) The omission of certain main effects and/or interactions, which you justified in part (a)(i) and carried out in part (a)(ii), can be avoided by not treating all the explanatory variables as factors. Without actually doing any calculations, briefly explain how this could be done.

Question 3

(a) As part of a study on the nutritional quality of oats, six varieties of oats, labeled 1 to 6, were to be compared. A standard amount of each variety was grown on each plot, and the plots were laid out in a field in six blocks, each containing six plots. Each variety was randomly allocated to one plot within each block. The response measured was the percentage of protein in the oats produced. The data from this experiment are given in the file oatprote.gsh. The response variable is labeled protein and the variety and block numbers are given in the factors variety and block, respectively.

(i) Produce a scatter plot of the response against variety number, with block as the grouping factor. Does there appear to be any difference between varieties? Does the block structure appear to have any effect on the response?

(ii) Your plot in part (a)(i) should not give any immediate cause for concern about the assumptions underlying the appropriate analysis of variance model. Therefore, produce the appropriate ANOVA table and use it to answer the question 'Is there any difference in percentage of protein between the different varieties of oats?'

(iii) To check the appropriateness of the model GENSTAT is using, produce the usual set of residual plots. Are any of the assumptions of the model in doubt?

(iv) Calculate a point estimate of the difference in mean percentage of protein between oats of variety 3 and oats of variety 5. Calculate a 95% confidence interval for this difference. Is it plausible that these two varieties of oats do not, in fact, differ in average protein content?
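For readers who want to see the arithmetic behind parts (ii) and (iv) outside Genstat, here is a minimal Python sketch of a randomized complete block ANOVA and of the confidence interval for a difference between two treatment means. The data are made up (3 treatments in 4 blocks) and are not the values in oatprote.gsh; the function names `rcbd_anova` and `diff_ci` are hypothetical. The critical value 2.447 is the tabulated t quantile for 6 residual degrees of freedom. For the oat design (6 varieties in 6 blocks) the residual degrees of freedom would be 25, with a tabulated value of about 2.060.

```python
from math import sqrt
from statistics import mean


def rcbd_anova(table):
    """ANOVA for a randomized complete block design.

    table[i][j] is the single response for treatment i in block j.
    """
    t = len(table)       # number of treatments (varieties)
    b = len(table[0])    # number of blocks
    grand = mean(x for row in table for x in row)
    trt_means = [mean(row) for row in table]
    blk_means = [mean(table[i][j] for i in range(t)) for j in range(b)]

    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_trt = b * sum((m - grand) ** 2 for m in trt_means)
    ss_blk = t * sum((m - grand) ** 2 for m in blk_means)
    ss_res = ss_total - ss_trt - ss_blk   # what blocks and treatments leave over

    df_trt, df_blk = t - 1, b - 1
    df_res = df_trt * df_blk
    ms_trt, ms_res = ss_trt / df_trt, ss_res / df_res
    return {
        "trt_means": trt_means,
        "ss_trt": ss_trt, "ss_blk": ss_blk, "ss_res": ss_res,
        "df_trt": df_trt, "df_blk": df_blk, "df_res": df_res,
        "ms_res": ms_res, "F_trt": ms_trt / ms_res,
    }


def diff_ci(anova, i, j, n_blocks, t_crit):
    """Point estimate and CI for (mean of treatment i) - (mean of treatment j)."""
    diff = anova["trt_means"][i] - anova["trt_means"][j]
    se = sqrt(2 * anova["ms_res"] / n_blocks)  # each mean averages n_blocks plots
    return diff, (diff - t_crit * se, diff + t_crit * se)


# Hypothetical data: rows are treatments, columns are blocks.
data = [
    [10.0, 12.0, 11.0, 13.0],
    [12.0, 14.0, 13.0, 15.0],
    [11.0, 12.0, 13.0, 12.0],
]
result = rcbd_anova(data)
# t_crit = 2.447 is the 97.5% t quantile for 6 df (from standard tables).
diff, ci = diff_ci(result, 1, 0, n_blocks=4, t_crit=2.447)
print(result["F_trt"], diff, ci)
```

If the interval excludes zero, a zero difference between the two treatments is not plausible at the 5% level; if it includes zero, it is.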

(b) An experiment was conducted by students at the Ohio State University in 1993 to explore the nature of the relationship between a person's heart rate and the frequency at which that person stepped up and down on steps of various heights. The response variable, hrinc, was the ratio of the person's measured heart rate after the exercise to that measured before. There were two different step heights, height, coded 0 (for 5.75 inches) and 1 (for 11.5 inches); and there were three rates of stepping, freq, coded 0 (14 steps/min), 1 (21 steps/min) and 2 (28 steps/min).

Each subject performed the activity for three minutes. One experimenter counted the subject's pulse for 20 seconds before and after each trial. Another experimenter kept track of the time spent stepping. Each subject was always measured by the same pair of experimenters. Each subject/experimenters combination was treated as a block (called experimt). Subjects always rested between trials. The data are in stepping.gsh.

The six possible combinations of height and freq are coded

1 for {height = 0, freq = 0}, 2 for {height = 0, freq = 1}

3 for {height = 0, freq = 2}, 4 for {height = 1, freq = 0}

5 for {height = 1, freq = 1}, 6 for {height = 1, freq = 2}

and these values are also given in stepping.gsh, in the factor treat. Each subject/experimenters combination had time to perform five trials, and there were six such combinations in all. The experimental layout is given as follows.

(i) What sort of experimental design is this? Explain your answer.

(ii) This experiment can be analyzed using the General Analysis of Variance option in the Design field of the Analysis of Variance dialogue box, provided that the appropriate entries are made in the Treatment Structure and Block Structure fields. With respect to the former, do not use the single treatment factor treat, but instead use the factorial structure of its components height and freq in the usual way. Perform this analysis. Produce appropriate residual plots. Are the assumptions of the model justified? What model is suggested as being appropriate to explain these data?

(iii) Under the model you suggested at the end of part (b)(ii), what is a point estimate of the mean heart rate increase factor (hrinc) at the higher step height and the fastest rate of stepping?

#### Solution Preview

I hope my answers to questions 2 and 3 help you do them in Genstat. See the text below and the attached file for the work on these questions.

For the questions you did, my comments are in italics. Besides my comments in the attached files, make sure you check your spelling! I think you actually understand this quite well -- there were only a few instances for which I disagreed with your analyses. Just keep doing what you've been doing.

As for your question ("Could you kindly let me know the tests used to do the questions, so that I can try them out using Genstat to generate the results? For example, for ANOVA, what test did you use, and what should I input in the variate, block, etc.?"), I think you'll find the answers in my explanations. Basically, if it's a grouping variable (like "block" or "experimt"), put it in block. If it is a variable that you're testing, put it in "variate."

-------------------------------------------------------------------

Question 2

An experiment was carried out to investigate the effect of three factors on the survival of the bacterium Salmonella typhimurium. Three levels of sorbic acid, three pH (acidity) levels and six levels of water activity were used. The experiment was run once at each possible combination of factor levels. The response variable was the logarithm of the density of bacteria (per milliliter) seven days after treatment started. (Previous experience with data like these indicated that the log transformation would be appropriate.)

[Source: Mead, R. (1988) The Design of Experiments, Cambridge, Cambridge University Press.]

The data are stored in data file salmonel.gsh, with the response variable labeled as response and the three treatment factors as sorbic, pH and activity respectively. Assume that you have done some preliminary analysis and plotting of these data, and you have concluded that it appears to be acceptable to analyze them using analysis of variance.

(i) Suppose you are going to perform the usual significance tests in analyzing these data. Which main effects and/or interactions would you have to leave out of the model initially? Give your reasons for leaving out any interactions or main effects.

We are testing if 3 levels of sorbic acid, 3 pH levels, and/or 6 levels of water activity affect the density of bacteria.
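As a complementary check that is not part of the original solution, it can help to count degrees of freedom before looking at any plots. The short Python sketch below assumes only what the question states (3 x 3 x 6 levels, one run per combination); the variable and function names are my own. It shows that the full factorial model uses up every available degree of freedom, so no residual is left for significance tests unless some term, conventionally the three-way interaction, is left out.

```python
from itertools import combinations
from math import prod

# Number of levels of each factor, as stated in the question.
levels = {"sorbic": 3, "pH": 3, "activity": 6}

# One run at each combination of levels, so 3 * 3 * 6 observations.
n_obs = prod(levels.values())


def term_df(term):
    """Degrees of freedom of a main effect or interaction among the given factors."""
    return prod(levels[f] - 1 for f in term)


# All main effects, two-way interactions and the three-way interaction.
terms = [c for k in range(1, len(levels) + 1) for c in combinations(levels, k)]
for term in terms:
    print(".".join(term), term_df(term))

full_df = sum(term_df(t) for t in terms)
residual_full = (n_obs - 1) - full_df            # residual df in the full model
three_way = term_df(("sorbic", "pH", "activity"))
residual_without_3way = residual_full + three_way  # df freed by dropping that term

print("total df:", n_obs - 1)
print("residual df, full model:", residual_full)
print("residual df, three-way interaction omitted:", residual_without_3way)
```

Which terms to drop beyond that is a judgment call, and the plots discussed below are one way to inform it.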

I started by looking at plots of estimated marginal means (the mean response at each level of a variable, averaged over the levels of the other independent variables). For each of the plots, one of the independent variables is along the x-axis, another independent variable is shown as separate lines, and the mean response is on the y-axis. If the lines are horizontal, the variable on the x-axis most likely does not affect the response, and if the slopes of the lines in the plot differ, there is most likely an interaction between the two variables.
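To make the "parallel lines" idea concrete, here is a small Python sketch that computes cell means for two factors and then the gap between two lines at each x-axis level. The data, the factor levels (`a1`, `a2`, `b1`, `b2`) and the function names are all hypothetical and are not taken from salmonel.gsh. A roughly constant gap corresponds to parallel lines (no sign of interaction); a gap that changes size or sign corresponds to lines with different slopes.

```python
from collections import defaultdict
from statistics import mean


def cell_means(records):
    """Mean response for each (line factor level, x-axis factor level) cell."""
    cells = defaultdict(list)
    for line_level, x_level, y in records:
        cells[(line_level, x_level)].append(y)
    return {key: mean(values) for key, values in cells.items()}


def gap_between_lines(means, line_a, line_b, x_levels):
    """Vertical distance between two lines of the plot at each x-axis level."""
    return [means[(line_a, x)] - means[(line_b, x)] for x in x_levels]


# Hypothetical records: (level of the "lines" factor, level of the x-axis factor, response)
records = [
    ("a1", "b1", 4.0), ("a1", "b1", 4.2),
    ("a1", "b2", 5.0), ("a1", "b2", 5.2),
    ("a2", "b1", 4.5), ("a2", "b1", 4.7),
    ("a2", "b2", 4.6), ("a2", "b2", 4.4),
]

means = cell_means(records)
gaps = gap_between_lines(means, "a1", "a2", ["b1", "b2"])
print(means)
print(gaps)  # a gap that changes sign means the lines cross
```

A plot or a table of gaps like this is only suggestive; the formal test of an interaction is its line in the ANOVA table.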

First 2 plots: Notice that the variables pH and activity do not interact (for each level of activity, the response level is the same regardless of pH). There might be a slight interaction between pH and sorbic acid (response increases - slightly - at a pH of 5.5 at all sorbic acid levels except for the highest one).

Second 2 plots: The first plot confirms that there is probably an interaction between sorbic acid and pH level. There might also be one between sorbic acid and activity.

Third 2 plots: It also looks as if there is an interaction between activity and pH.

The main effect we should definitely use is activity (it looks as if increased activity leads to increased response). We should NOT include pH as a main effect factor, because the plots are flat across different pH levels. I am undecided about whether or not to include sorbic acid, but I think we should, since it looks as if ...

#### Solution Summary

The solution consists of complete answers and explanations to questions 2 and 3, as well as a review of student-submitted answers to questions 1 and 4. The analyses were done using SPSS.