Can you help with multi-variable regression?

XLSTAT support center

This tutorial will help you do a multiple linear regression in Excel set up and interpreted using the XLSTAT software. Linear regression is based on ordinary least squares (OLS).
Not sure if this is the modeling feature you're looking for? Further information can be found here.

Data for performing a multiple linear regression

You can download an Excel sheet with the data and the results by clicking the button below:
Data download
The data are from Lewis T. and Taylor L.R. (1967). Introduction to Experimental Ecology, New York: Academic Press, Inc. The data is about 237 children, described by their gender, age in inches (1 inch = 2.54 cm), and weight in English. Pound (1 pound = 0.45 kg).

Purpose of this tutorial

Using multiple linear regression, it should be determined how the weight of the children varies depending on height and age and whether a linear model makes sense. The linear regression method belongs to the large family of GLM (Generalized Linear Models), just like ANCOVA and ANOVA. This data set is also used in both tutorials on simple linear regression and on ANCOVA, with height, age and gender as explanatory variables.

Set up a multiple linear regression

After opening XLSTAT, choose the command XLSTAT / Data Modeling / Linear Regression.

After clicking the button, the corresponding linear regression dialog box appears. You can now select the data in the Excel sheet. There are several ways to select the data in the XLSTAT dialog windows. In the examined example, the data starts on the first line; it is therefore faster to use the column selection. Therefore, the selections appear in the form of columns in the dialog box below. The Variable Labels option is enabled because the first line of data contains the names of the variables. Select the "Dependent Variables" (or Model Variables) as Weight and Age. The quantitative explanatory variables are height and age.

In the tab output let's activate the option Type III SSto view the associated results.

The calculations begin as soon as the button OK is clicked.

Interpret the results of a multiple linear regression

If you have activated the option "Confirm selection" in the XLSTAT options, XLSTAT asks you to confirm the number of rows and columns of the selection. The first table shows the goodness-of-fit coefficients for the model. The R² (coefficient of determination) indicates the percentage of variability of the dependent variable, which is described by the explanatory variable. The closer R² is to 1, the better the match.

In this particular case, height and age account for 63% of the variability in weight. The rest of the variability is due to effects (other explanatory variables) that were not included in the analysis.

It is important to examine the results of the analysis of variance table (see below). These results allow us to decide whether the explanatory variables bring significant information (null hypothesis H0) into the model or not. In other words, this is a way of checking whether it makes sense to use the mean to describe the entire population, or whether the information brought in by the explanatory variable (s) is valuable.

Fisher's F test is used. Based on the fact that the probability that corresponds to the F value is less than 0.0001, the risk is less than 0.01% that the assumption of the null hypothesis (no influence of the explanatory variables) is wrong. Therefore, one can safely conclude that the three variables bring significant information.

The next table shows the Type III SS. These results determine whether or not a variable brings significant information to the model when all other variables are already present in the model.

The following table shows model details. This table is useful when predictions are needed or when you want to compare the coefficients of the model for a given population with those for another population (it can be used to compare the models for girls and boys). One can see that the 95% confidence interval of the parameter of the variable size is very narrow, with the p-value for the parameter age being almost 0. This indicates that the effect of the age variable is smaller than the effect of the size variable. The model equation is written below the table. It is found that for a given height, age has a positive effect on weight. As the age increases by a month, the weight increases by 0.2 pounds.

The following table and diagram correspond to the standardized regression coefficients (sometimes also referred to as beta coefficients). These allow the influence and significance of the various variables on the dependent variable to be compared.

The next table shows the residuals. This makes it possible to examine each standardized residual in more detail. According to the assumption of the linear regression model, these residuals should be normally distributed. This means that 95% of the residuals should be in the interval [-1.96, 1.96]. All values ​​outside the interval are potential outliers or suggest that the assumption of normality is wrong. We used XLSTAT's DataFlagger to highlight the residuals outside the interval [-1.96, 1.96].

So one can find 15 of the 237 residuals that lie outside the range [-1.96, 1.96], which is 6.3% instead of 5%. A more detailed analysis of the residuals can be found in the ANCOVA tutorial.

The diagram below allows the predictions and observed values ​​to be compared.

The residuals histogram allows you to quickly see the residuals that are outside the range [-2, 2].

Conclusion for this multiple linear regression

The bottom line is that height and age account for 63% of the variability in weight. A significant amount of information is not explained by the regression model used. In the ANCOVA tutorial, the gender variable is added to the model in order to improve the model's goodness of fit.