Regression analysis: What it means and how to interpret the outcome?

To understand regression analysis, what it means, and how to interpret its outcomes, let’s start with a simple situation involving two variables. Other blog posts cover how to run a simple regression.

The term regression was first used as a statistical concept in 1877 by Sir Francis Galton. He used the word ‘regression’ as the name of the general process, wherein multiple data points regress towards the (regression) line (to minimize the error) that defines the relationship between the variables of interest (generally one or more predictor or independent variables and the dependent variable).

What does it mean?

When we explore the relationship between two variables, we generally start by measuring the correlation between them. Correlations are very helpful, but we can take the relationship further with a regression analysis, which gives us the power to predict the value of one variable from the other.

For example, we might want to understand how overcrowding per census tract predicts the crime rate per tract. City planners can use this information to control overcrowding and crime in the city, which can lead to better use of resources.

Regression analysis is used to estimate the effect of the independent variable on the dependent variable, often as a step toward making a causal inference. Remember, a regression alone does not establish causality: causal inference requires correlation between the two variables, temporal ordering, and ruling out plausible rival explanations. If we use one variable to predict our dependent variable, we call it a simple regression. Things in real life are rarely this simple. There is usually more than one independent variable that influences the dependent variable. For example, crime is not just a product of overcrowding. Factors like the availability of jobs or the unemployment rate, violence in the neighborhood, and the education level of people in the neighborhood are also variables of interest that could predict crime. When we use more than one independent variable to predict our dependent variable, we call this multiple regression.

One predictor or independent variable – Simple Regression

Multiple predictors or independent variables – Multiple Regression

Depending on how our IV (independent variable/s) affects the DV (dependent variable), we may see a linear relationship between the two variables, which can be represented by a straight line, or a non-linear relationship, which can be represented by a curved line. For example, at a constant speed there is a linear relationship between distance traveled and time: the longer the distance to be traveled, the longer it takes, and the relationship can be drawn as a straight line. On the other hand, the relationship between age and our ability to learn is non-linear. As infants and toddlers, we are slower to learn; our ability to learn grows at a much faster rate during our school and adolescent years, and then it eventually slows down again.

Regression analysis uses current and past data to predict the dependent variable from one predictor variable (simple regression) or multiple variables (multiple regression) with minimum error.

Regression analysis is a great tool as it allows us to go beyond the current data collected. We use the following equation to represent regression analysis.

Observed value of Y (DV) = Model + Error (E)

The above equation simply means that each observed value of our dependent variable (Y) can be written as the value predicted by the model that best fits our data, plus some error. We generally use a linear model in regression. That is, we summarize our data set with a straight line. The model in the above equation gets replaced by our independent variables or predictors. For a simple regression with one predictor X, this gives Y = b₀ + b₁X + E.
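To make the "model plus error" idea concrete, here is a minimal Python sketch. The data values and the coefficients b0 and b1 are hypothetical, chosen only for illustration (they are not estimated from anything). It shows that every observed Y splits into the value the line predicts and a leftover error.

```python
# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]        # independent variable (IV)
y = [52.0, 55.0, 61.0, 64.0, 68.0]   # dependent variable (DV)

# Assumed coefficients of the straight-line model (not estimated here).
b0 = 48.0  # intercept
b1 = 4.0   # slope

# Model part: the value of Y the line predicts for each X.
predicted = [b0 + b1 * xi for xi in x]

# Error part: what is left over after the model has made its prediction.
errors = [yi - pi for yi, pi in zip(y, predicted)]

for xi, yi, pi, ei in zip(x, y, predicted, errors):
    print(f"X={xi:.0f}  observed={yi:.1f}  model={pi:.1f}  error={ei:+.1f}")
```

Adding each error back to the corresponding model value returns the observed Y exactly, which is all the equation says.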

The regression line is the line that best describes the past and present data, and then provides an estimate of the DV for a given value of IV.

The straight line can be defined by the intercept (b₀) and by the rate of change or slope (b₁) between X and Y. Once we have the intercept and the slope, we can plug in any value of X (IV) and get a corresponding value for Y (DV). The straight line that summarizes the data is called the regression line. Since regression analysis is used for prediction, we want a line that best fits the data and minimizes the error of prediction. That is, it should go through or be close to as many data points as possible, so that the differences between the observed data points and the line are as small as possible. The problem, however, is that when the errors are simply added up (positive and negative errors together), many different lines fit the data with the same minimum total error.

Additionally, when we simply add up the errors, the errors from points above the line and the errors from points below the line have opposite signs, so they can cancel each other out and give a total close to zero even when large errors are present. As you can imagine, this is not that useful. Different people could also take the same data and reach different conclusions. This is avoided by squaring the error for each point and adding up the squares. Choosing the line that minimizes this sum of squared errors (the least-squares criterion) produces one and only one line.
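The cancellation problem and the least-squares fix can be sketched in a few lines of Python. The data are hypothetical, and the helper names fit_line and errors are made up for this illustration. Any line that passes through the point (mean of X, mean of Y) has raw errors that add up to zero, so the raw sum cannot tell a good line from a bad one. The sum of squared errors can.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


def errors(x, y, b0, b1):
    """Observed Y minus the Y predicted by the line b0 + b1 * X."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Hypothetical data, for illustration only (mean of X is 3, mean of Y is 6).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

# Three candidate lines, all passing through the point (3, 6).
candidates = {
    "flat line at mean of Y": (6.0, 0.0),
    "steep line": (-9.0, 5.0),
    "least-squares line": fit_line(x, y),
}

for name, (b0, b1) in candidates.items():
    errs = errors(x, y, b0, b1)
    raw_sum = sum(errs)
    squared_sum = sum(e * e for e in errs)
    print(f"{name:24s} sum of errors = {raw_sum:+.3f}   "
          f"sum of squared errors = {squared_sum:.3f}")
```

All three lines have a raw error sum of essentially zero, yet their sums of squared errors differ widely, and the least-squares line has the smallest.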

In effect, the regression line is the line that passes through the data points in such a way that it minimizes the total squared error (the squared vertical distance between each actual point and the line). It is called a regression line as all the points regress towards the line. The regression line is also referred to as a model. For multiple regression, the model looks like the following.

Y = b₀ + b₁X₁ + b₂X₂ + … + b_nX_n + E
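A multiple regression with two predictors can be sketched in plain Python as follows. This is only one way to do it, under stated assumptions: the helper names solve and fit_multiple are made up for this example, the coefficients are found by solving the least-squares normal equations with Gaussian elimination, and the data are hypothetical values generated without noise from Y = 10 + 2·X₁ + 0.5·X₂, so that we can check the fit recovers those numbers. In practice a statistics package would normally do this work.

```python
def solve(a, b):
    """Solve the linear system a * z = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot spot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def fit_multiple(rows, y):
    """Least-squares coefficients [b0, b1, ..., bn].

    rows holds one list of predictor values per observation.
    """
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical, noise-free data built from Y = 10 + 2*X1 + 0.5*X2.
rows = [(1, 4), (2, 1), (3, 5), (4, 2), (5, 7), (6, 3)]
y = [10 + 2 * x1 + 0.5 * x2 for x1, x2 in rows]

coefficients = fit_multiple(rows, y)
print([round(b, 6) for b in coefficients])
```

Because the data contain no error term, the fitted intercept and slopes match the values used to build them (up to rounding). With real data, the leftover error E would be non-zero.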

Interpreting the outcome

In the absence of any predictive regression model, our best estimate of Y for any value of X would be the average value of Y. Because this gives the same predicted Y for every possible value of X, it is the model of no relationship between X and Y. We use the average value of Y as a baseline against which to measure the improvement in predictive capacity achieved by our regression model.

We test the validity of the model using an F-test. The F-test is the statistic that tells us whether the model (our independent variable/s) is any better at predicting the outcome variable than having no model at all, that is, than simply using the average of Y. A significant (p<0.05) F-test indicates that our independent variable (or variables in multiple regression) predicts the dependent variable better than the average alone. That is, your regression line fits the data better than a model with no independent variable/s. We can also calculate R², which shows the proportion of improvement due to the model. R² is interpreted as the proportion of variance in Y explained by the model. In simple regression, taking the square root of this value gives us the size of the Pearson r correlation coefficient; the sign of r is the same as the sign of the slope.
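Here is a small Python sketch of how these quantities relate in a simple regression. The data are hypothetical and the variable names are chosen just for this example. It compares the error of the average-only baseline with the error of the regression line, then computes R² and the F statistic. The p-value is not computed here, because it requires the F distribution, which a statistics package would normally supply.

```python
import math

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
k = 1  # number of predictors in a simple regression

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares slope and intercept.
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Error of the baseline model that always predicts the average of Y.
ss_total = sum((yi - mean_y) ** 2 for yi in y)
# Error left over after using the regression line.
ss_residual = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
# Improvement due to the model.
ss_model = ss_total - ss_residual

r_squared = ss_model / ss_total
f_stat = (ss_model / k) / (ss_residual / (n - k - 1))

# Pearson correlation computed directly, for comparison with R squared.
pearson_r = sxy / math.sqrt(sxx * ss_total)

print(f"R squared      = {r_squared:.4f}")
print(f"F statistic    = {f_stat:.1f} on {k} and {n - k - 1} degrees of freedom")
print(f"sqrt(R squared) = {math.sqrt(r_squared):.4f}, |r| = {abs(pearson_r):.4f}")
```

The F statistic is the improvement per predictor divided by the leftover error per remaining degree of freedom, so a large value means the line predicts much better than the average alone.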

The F-test also ‘jointly’ tests the coefficients of the independent variables (b₁, b₂,…, b_n): it asks whether all the slopes are zero at the same time. The t-tests, in contrast, check whether each individual coefficient (the intercept b₀ and the slopes b₁, b₂,…, b_n) in the model is different from zero. That is, whether each individual variable is an important predictor or not. Sometimes, you might have a significant F-test even though no individual independent variable is significant. This means that individually the predictors do not add much, but jointly they can do a better job of predicting the dependent variable. This often happens when the predictors are strongly correlated with one another.
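For a simple regression, the t statistics for the intercept and the slope can be sketched as below, again with hypothetical data and variable names chosen for this example. Each t value is the coefficient divided by its standard error. As with the F-test, the p-values are left out because they need the t distribution from a statistics package. With a single predictor, the square of the slope's t value equals the F statistic, which is why the two tests always agree in simple regression.

```python
import math

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares slope and intercept.
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Mean squared error: leftover error per remaining degree of freedom (n - 2).
ss_residual = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = ss_residual / (n - 2)

# Standard errors of the slope and the intercept.
se_b1 = math.sqrt(mse / sxx)
se_b0 = math.sqrt(mse * (1.0 / n + mean_x ** 2 / sxx))

# t statistics: how many standard errors each coefficient is from zero.
t_b1 = b1 / se_b1
t_b0 = b0 / se_b0

# F statistic of the same model, for comparison with the slope's t value.
ss_total = sum((yi - mean_y) ** 2 for yi in y)
f_stat = (ss_total - ss_residual) / mse

print(f"slope     b1 = {b1:.3f}, t = {t_b1:.2f}")
print(f"intercept b0 = {b0:.3f}, t = {t_b0:.2f}")
print(f"t(b1) squared = {t_b1 ** 2:.1f}, F = {f_stat:.1f}")
```

In this made-up data set the slope is many standard errors away from zero, while the intercept is not, so only the slope would be judged different from zero.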

Cite this Article (APA)

Trivedi, C. (2020, November 23). Regression analysis: What it means and how to interpret the outcome? ConceptsHacked. https://conceptshacked.com/regression-analysis/