# Survival, correlation, and regression

1. Variables x and y each have standard deviations of 20. Their correlation is 0.6. The best fit line passes through the Y axis at Y = 40.

Write the regression line.

If a subject is 10 on x, what do you predict for Y?
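One way to check the arithmetic in question 1 is sketched below. It assumes the usual least-squares relationship for regressing Y on x, slope = r × (SD of y / SD of x), and takes the intercept directly from the problem statement. The variable and function names are illustrative only.

```python
# Summary statistics given in question 1.
sd_x = 20.0
sd_y = 20.0
r = 0.6
intercept = 40.0  # value of Y where the line crosses the Y axis

# Least-squares slope for regressing Y on x (assumed relationship).
slope = r * sd_y / sd_x


def predict(x):
    """Predicted Y for a given x on the fitted line."""
    return intercept + slope * x


print(f"Y = {slope:.1f} * x + {intercept:.0f}")
print(f"Predicted Y at x = 10: {predict(10):.1f}")
```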

2. You are looking through a book of new car ratings, and you decide to see how car weight influences EPA mileage. You randomly pick 10 cars and get the following data:

| Wt (1000 lb) | EPA highway mileage |
|---|---|
| 3.3 | 25 |
| 3.4 | 30 |
| 2.3 | 33 |
| 2.2 | 36 |
| 2.9 | 28 |
| 2.6 | 31 |
| 2.4 | 37 |
| 3.0 | 28 |
| 2.0 | 35 |
| 4.2 | 26 |

You plug the numbers into your calculator, and get the regression equation as follows:

EPA mileage = -5.2 (Wt) + 45.6

Plot the numbers and the regression line. Provide clear and appropriate labels.
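Before plotting, it can help to confirm that the calculator's equation matches the data. The sketch below refits the line by ordinary least squares in plain Python, then evaluates the fitted line at the lightest and heaviest cars, which gives two points for drawing it. The helper name `least_squares` is illustrative, not part of the exercise.

```python
# Data from question 2.
weights = [3.3, 3.4, 2.3, 2.2, 2.9, 2.6, 2.4, 3.0, 2.0, 4.2]  # 1000 lb
mileage = [25, 30, 33, 36, 28, 31, 37, 28, 35, 26]            # EPA highway mpg


def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares(weights, mileage)
print(f"EPA mileage = {slope:.1f} (Wt) + {intercept:.1f}")

# Two points on the fitted line, useful as endpoints when drawing it.
for wt in (min(weights), max(weights)):
    print(f"Wt = {wt}: predicted mileage = {slope * wt + intercept:.1f}")
```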

3. With respect to a scatterplot, how would a correlation of 1.0 differ from a correlation of 0.9?

4. What does it tell you if the Pearson correlation coefficient between variables x and y is 1.0?

5. You come across a data set in which the Spearman correlation between two variables is 1.0. A friend asks you whether that means that the data, if plotted, would fall on a straight line. You answer:

(a) Yes, since the interpretation is the same as for a Pearson correlation.

(b) No, for a Spearman correlation, 2.0 is perfect association. 1.0 is only modest.

(c) No, but the means are necessarily the same.

(d) No, but the subjects should be in the same order on the two variables.

(e) No, the Spearman correlation is looking for non-linear associations.
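To explore how the two coefficients can differ, the following sketch computes both on a small made-up data set in which y rises with x along a curve (y = x³). The data and helper names are assumptions for illustration, and the simple ranking helper assumes there are no tied values.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """Ranks starting at 1 for the smallest value (assumes no ties)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: y increases with x, but not along a straight line.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```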

6. Which of the following would most likely exhibit a correlation of 1.0? (What would you expect the others to be -- positive, negative, 1.0, -1.0, or 0?)

(a) Fahrenheit temperatures with Celsius temperatures.

(b) Age in years with remaining time till death.

(c) Height in inches with time to take an exam.

(d) Height in inches with weight in kilograms.

(e) Systolic blood pressure with diastolic blood pressure.

7. A researcher studying multiple sclerosis patients uses an imaging technique to measure the cross-sectional area of the spinal cord at the C2 level. This area is found to have a correlation of -0.75 with duration of the illness, p < 0.001. The negative correlation, -0.75, suggests:

(a) There is no association between area and duration.

(b) People who have had the disease longer tend to have larger cord areas.

(c) People who have had the disease longer tend to have smaller cord areas.

(d) Both long and short durations have small areas; the intermediate durations have the largest areas.

(e) Both long and short durations have large areas; the intermediate durations have the smallest areas.

8. In the study described in question 7, the p value of < 0.001, together with the observed correlation, provides good statistical evidence that:

(a) The population correlation is < -0.75.

(b) The population correlation is less than 0.

(c) The population correlation is within 0.001 of -0.75.

(d) The population correlation is > 0.75.

(e) No conclusion can be drawn because the p value is too small to be statistically significant.

9. A researcher follows 100 patients with a rare cancer to determine their prognosis (how long they live). Some die during the study period, others are lost to follow-up while still alive, and still others are alive when the study period ends. It is apparent that patients who are doing well are more likely to be lost to follow-up, since the ill patients require ongoing care.

(a) Which observations in the study are "censored"?

(b) Would Kaplan-Meier curves be appropriate here? Why or why not?

10. Ten subjects enter a study. Five are alive when the study ends; these subjects were followed for 3, 3, 4, 4, and 5 years. The other five died during the study period at 1, 1, 2, 3, and 5 years. Construct a K-M survival curve from these data. Show your work.
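Hand calculations for question 10 can be checked with a short product-limit sketch like the one below. It assumes the common convention that a subject censored at time t is still counted as at risk for a death occurring at t. The function name `kaplan_meier` is illustrative and does not come from any library.

```python
def kaplan_meier(times, died):
    """Return [(time, survival)] at each distinct death time.

    times: follow-up time for each subject, in years.
    died:  True if the subject died at that time, False if censored.
    Subjects censored at time t count as at risk for deaths at t.
    """
    survival = 1.0
    curve = []
    death_times = sorted({t for t, d in zip(times, died) if d})
    for t in death_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, d in zip(times, died) if d and u == t)
        survival *= (at_risk - deaths) / at_risk
        curve.append((t, survival))
    return curve


# Data from question 10: five censored subjects, then five deaths.
times = [3, 3, 4, 4, 5, 1, 1, 2, 3, 5]
died = [False] * 5 + [True] * 5

curve = kaplan_meier(times, died)
for t, s in curve:
    print(f"year {t}: S(t) = {s:.2f}")
```

The curve is a step function: it stays flat between death times and drops only at the years listed.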

# Logistic regression and survival

1. List at least two advantages of using logistic regression instead of a chi-square test for independence and/or trend.

2. Consider 2619 people who survived a first myocardial infarction and were discharged from hospital. You want to find out whether smoking status is a risk factor for another heart attack after the first myocardial infarction. Explain when you would use:

(a) Logistic Regression

(b) Kaplan-Meier curve (That is, what kind of information would you need in order to use each of the two methods?)

3. The presence or absence of symptoms of coronary artery disease (SCAD) was assessed in walk-in patients at a clinic. A logistic regression model was fitted to examine whether age is associated with coronary artery disease. The estimated coefficient for age (that is, the slope, b) from the logistic regression analysis was 0.080, with a p-value of 0.008. The intercept or constant (a) was -3.6431.

(a) What would you conclude from the above information?

(b) What is the odds ratio for each 1-year increase in age?

(c) Estimate the odds that a 40-year-old person has SCAD.

(d) What is the odds ratio of having SCAD for a 65-year-old person compared with a 40-year-old person?
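The quantities asked for in parts (b) through (d) all follow from the fitted model log(odds) = a + b × age. The sketch below shows that arithmetic under the assumption that age enters the model linearly, in years, with no other covariates. The variable names are illustrative.

```python
import math

# Fitted coefficients given in question 3.
a = -3.6431  # intercept (constant)
b = 0.080    # slope for age, per year


def odds(age):
    """Odds of SCAD at a given age under log(odds) = a + b * age."""
    return math.exp(a + b * age)


# (b) Odds ratio for a 1-year increase in age.
odds_ratio_per_year = math.exp(b)

# (c) Odds of SCAD for a 40-year-old.
odds_40 = odds(40)

# (d) Odds ratio comparing age 65 with age 40, equal to exp(b * 25).
odds_ratio_65_vs_40 = odds(65) / odds(40)

print(f"OR per year:     {odds_ratio_per_year:.3f}")
print(f"Odds at age 40:  {odds_40:.3f}")
print(f"OR, 65 vs 40:    {odds_ratio_65_vs_40:.3f}")
```

Odds can be converted to a probability with odds / (1 + odds) if a probability is wanted instead.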

4. Catheters were placed in 119 kidney dialysis patients using one of two techniques (surgical or percutaneous), giving two groups. The time to infection (if any) was documented. The p-value for the log-rank test comparing the two groups is 0.1117.

(a) Using a significance level of 0.05, is there a significant difference in the time to infection between the two groups?

(b) What would be the effect on the K-M curve if the researcher incorrectly just discarded all the censored observations?

5. In a study of the long-term outcomes of a certain cancer, patients who are doing well are more likely to drop out of the study than those who experience relapses or symptoms. What will the effect of this differential retention be on the survival curve?

#### Solution Summary

A multi-answer solution discussing regression lines, correlations, and survival rates.