Regression model validation is the process of deciding whether the numerical results quantifying hypothesized relationships between variables, obtained from regression analysis, are acceptable as descriptions of the data. The validation process can involve analysing the goodness of fit of the regression, analysing whether the regression residuals are random, and checking whether the model’s predictive performance deteriorates substantially when applied to data that were not used in model estimation.
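The last of these checks, comparing performance on the estimation data with performance on data held back from estimation, can be sketched as follows. This is a minimal illustration on synthetic data, not a prescribed procedure: the helper names `design`, `fit_ols` and `mse`, the 30/30 split, and the choice of polynomial degrees 1 and 8 are all assumptions made for the example.

```python
import numpy as np

def design(x, degree):
    """Polynomial design matrix with columns 1, x, ..., x**degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_ols(X, y):
    """Least-squares coefficients for the regression of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def mse(X, y, beta):
    """Mean squared prediction error of the fitted coefficients on (X, y)."""
    return float(np.mean((y - X @ beta) ** 2))

# Synthetic data (assumed for illustration): the true relationship is linear.
rng = np.random.default_rng(0)
n = 60
x = rng.uniform(-1.0, 1.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# First half is used for estimation, second half is held back.
x_fit, y_fit = x[:30], y[:30]
x_new, y_new = x[30:], y[30:]

results = {}
for degree in (1, 8):
    beta = fit_ols(design(x_fit, degree), y_fit)
    results[degree] = {
        "in_sample_mse": mse(design(x_fit, degree), y_fit, beta),
        "out_of_sample_mse": mse(design(x_new, degree), y_new, beta),
    }

for degree, errors in results.items():
    print(degree, errors)
```

The more flexible degree-8 model cannot have a larger in-sample error than the degree-1 model, because the two are nested. A model whose error rises sharply on the held-back data, relative to its in-sample error, is a candidate for rejection on validation grounds.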
A high R² does not guarantee that the model fits the data well. As Anscombe’s quartet shows, a high R² can occur in the presence of misspecification of the functional form of a relationship or in the presence of outliers that distort the true relationship. A further problem with R² as a measure of model validity is that it can always be increased by adding more variables to the model, except in the unlikely event that the additional variables are exactly uncorrelated with the dependent variable in the data sample being used.
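The effect of adding variables can be seen in a small simulation. The sketch below is an illustration under assumed conditions: the data are synthetic, the helper `r_squared` is a name introduced here, and the five extra regressors are pure noise with no relationship to the response. For ordinary least squares with nested sets of regressors, R² cannot decrease when a column is added.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of a least-squares fit of y on X (X must include an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    ss_res = np.sum(residuals ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Synthetic data (assumed for illustration): y depends on x only.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Start with intercept + x, then append irrelevant noise regressors one at a time.
X = np.column_stack([np.ones(n), x])
r2_values = [r_squared(X, y)]
for _ in range(5):
    X = np.column_stack([X, rng.normal(size=n)])
    r2_values.append(r_squared(X, y))

print(r2_values)
```

Each entry of `r2_values` is at least as large as the one before it, even though none of the added columns carries any information about `y`. This is why R² alone is a poor basis for comparing models with different numbers of variables.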
The residuals from a fitted model are the differences between the responses observed at each combination of values of the explanatory variables and the corresponding predictions of the response computed using the regression function. If the model fitted to the data were correct, the residuals would approximate the random errors that make the relationship between the explanatory variables and the response variable a statistical relationship.
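As a rough illustration of this definition, the sketch below computes residuals as observed responses minus fitted predictions for two models of the same synthetic data. The data-generating process (a quadratic relationship), the helper name `ols_residuals`, and the use of a correlation with x² as the diagnostic are assumptions chosen for the example, not a standard test.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals: observed responses minus least-squares predictions."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Synthetic data (assumed for illustration): the true relationship is quadratic.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-2.0, 2.0, size=n)
y = 1.0 + 0.5 * x + 1.5 * x**2 + rng.normal(scale=0.3, size=n)

X_linear = np.column_stack([np.ones(n), x])            # misspecified: omits x**2
X_quadratic = np.column_stack([np.ones(n), x, x**2])   # matches the true form

res_linear = ols_residuals(X_linear, y)
res_quadratic = ols_residuals(X_quadratic, y)

# Residuals from a correct model should show no systematic pattern;
# here the omitted term x**2 is used as a simple probe for leftover structure.
corr_linear = np.corrcoef(res_linear, x**2)[0, 1]
corr_quadratic = np.corrcoef(res_quadratic, x**2)[0, 1]

print(corr_linear, corr_quadratic)
```

The residuals of the misspecified linear model remain strongly correlated with x², which signals structure the model failed to capture. The residuals of the quadratic model are uncorrelated with x² by construction, since least-squares residuals are orthogonal to every included regressor, so in practice residual plots and other diagnostics are needed to judge whether what remains looks like random error.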