Threats to internal validity

Threats to internal validity are characteristics of a research design that jeopardize our ability to interpret results and draw appropriate inferences. Assessing internal validity means evaluating how well the research design supports causal inference. To infer causality we must establish three conditions: covariation, temporal precedence, and the ruling out of rival explanations. In that sense, we can think of threats to internal validity as plausible rival explanations.

Threats to internal validity can be divided into four categories. These are

Reverse causation
Time threats
Group threats
Mortality

Reverse causation

Correlational designs that measure all the variables at the same time risk the threat of reverse causation. That is, such designs cannot tell whether the cause came before the effect in time, so they fail to establish temporal precedence. The ex post facto design also suffers from this threat. For example, suppose we use a cross-sectional survey to measure depression and unemployment and find a strong correlation between the two. We might be inclined to conclude that depression causes unemployment. But long periods of unemployment can also lead to depression. We call this reverse or reciprocal causation. It is especially easy to overlook when the assumed direction fits a cultural stereotype.

Here is another example. Say we find that better schools have better-educated principals and that schools with frequent changes in leadership have lower morale. We might be tempted to think that the principal’s education level and stable leadership cause better schools. Similarly, when we observe a correlation between teachers’ behavior and students’ behavior, we are more inclined to think that the teachers’ behavior influences the students’ behavior (a cultural stereotype). Yet reverse causation might be at play. Better schools might be able to retain well-educated principals, while poorer schools might drive better-educated principals to leave for another job.

Another threat to correlational designs is spurious causation. For example, we might observe that marital status and wine consumption are positively correlated. We might then infer that consuming wine causes people to get married, or that marriage drives individuals to consume more wine. Neither inference makes sense if a third variable, such as age, explains the relationship. Age and marital status are positively correlated, and so are age and wine consumption. That is, variation in age accounts for the high initial correlation between marital status and wine consumption. To test for a spurious relationship we can calculate a partial correlation coefficient between marital status and wine consumption after controlling for age. A non-significant partial correlation would be consistent with a spurious relationship, although it cannot prove one. It might instead indicate that age is a mediating variable, which would need to be tested separately. Spurious relationships are the reason we are often reminded that correlation does not imply causation.
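The partial correlation check described above can be sketched in a few lines. This is a minimal illustration on simulated data, not an analysis of real survey results: the sample size, the age range, and the way marital status and wine consumption depend on age are all assumptions made up for the example. By construction, age drives both variables and they never influence each other.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical simulated data (assumption): age drives both variables,
# and marital status and wine consumption do not affect each other.
rng = random.Random(42)
age = [rng.uniform(20, 70) for _ in range(5000)]
married = [1.0 if a + rng.gauss(0, 10) > 45 else 0.0 for a in age]
wine = [0.1 * a + rng.gauss(0, 1) for a in age]

raw_r = pearson(married, wine)
partial_r = partial_corr(married, wine, age)
print(f"raw r = {raw_r:.2f}, partial r (age controlled) = {partial_r:.2f}")
```

In this simulation the raw correlation is clearly positive, while the partial correlation falls close to zero once age is controlled, which is the pattern we would expect from a spurious relationship.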

Time threats

These threats to internal validity consist of changes in the DV over time within the participants/subjects because of factors other than changes in the IV. There are four types of time threats.

History
Maturation
Test-reactivity
Instrumentation

History

History is a threat when the difference in the DV is due to an external event that occurred between the pretest and posttest measures and is unrelated to the IV. Simply put, we have to ask ourselves, “Did some unanticipated event occur while the experiment was in progress, and did the event affect the dependent variable?” History is any event, other than a planned treatment, that occurs between the pretest and posttest measurement and influences the posttest measurement of the dependent variable. History is most damaging in single-group pretest-posttest designs, where no comparison group is exposed to the same event.

Time 1                                  Time 2
Pretest         External event          Posttest

For example, suppose we are evaluating the effectiveness of a campaign that informs people about the benefits of vaccination. Our DV is people’s attitude towards vaccination. Let’s say that the campaign starts in November 2019 and ends in May 2020. We take a pretest of our DV in November and a posttest at the end of May, and we find a big increase between the two. If we conclude that the campaign was extremely effective, we might be incorrect. The COVID-19 pandemic began during that period. It is quite possible that the change in our DV is due to the pandemic and not to the campaign itself.

You might have guessed that the way to control history threats is to have a control group. This way the control group will be exposed to the same external event as the treatment group. Any change in the DV because of history will also be observed in the control group which makes it easier to isolate the effect of the campaign in the above example. However, we should make sure that the groups are either created through random assignment or are equivalent before the treatment is administered.

Longitudinal studies are more susceptible to this threat to internal validity. Seasonal variations can also contribute to this threat. Non-equivalent control group (NECG) designs are particularly vulnerable to this threat. We can also see a selection by history interaction, where the treatment group reacts differently to the external event than the control group does. Testing for pretest equivalence, especially in NECG designs, can be very handy. As explained earlier, NECG designs also benefit from matching or having multiple control groups, whenever possible.

Another obvious solution to controlling history threats to internal validity is to limit the time duration between pretest and posttest.

November 2019       March 2020 - COVID-19       May 2020
Pretest             Treatment group             Posttest (ΔT + ΔH)
Pretest             Control group               Posttest (ΔH)

Here ΔT represents the change in DV due to the treatment and ΔH represents the change in DV due to the history.
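The subtraction implied by the layout above can be sketched as follows. The scores are invented purely for illustration, and the sketch assumes the two groups are equivalent and exposed to the same external event, so that the control group’s change captures ΔH alone.

```python
from statistics import fmean


def change(pre, post):
    """Mean posttest score minus mean pretest score for one group."""
    return fmean(post) - fmean(pre)


# Hypothetical attitude scores on a 0-100 scale (made-up values).
treat_pre, treat_post = [48, 52, 50, 50], [68, 72, 71, 69]
ctrl_pre, ctrl_post = [49, 51, 50, 50], [61, 63, 62, 62]

delta_treatment_group = change(treat_pre, treat_post)  # ΔT + ΔH
delta_control_group = change(ctrl_pre, ctrl_post)      # ΔH only

# Subtracting the control group's change removes the shared history effect.
estimated_treatment_effect = delta_treatment_group - delta_control_group  # ΔT

print(delta_treatment_group, delta_control_group, estimated_treatment_effect)
```

If the groups are not equivalent, or if they react differently to the external event, this difference no longer isolates ΔT.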

Maturation

Maturation is another time threat that results in a change in the DV because of participants’ normal development during the experiment. That is, maturation is any physical or mental change, such as growing older, becoming more tired or less interested, or developing cognitively, that occurs over time in a participant and affects the participant’s performance on the dependent variable. Again, this threat mainly affects single-group pretest-posttest designs, longitudinal designs, and NECG designs.

For example, suppose we are conducting a study of intellectual functioning following periods of anesthesia in elderly patients undergoing coronary bypass graft surgery. We give them cognitive tests before surgery and then 1 week, 1 month, 3 months, and 12 months after the surgery. If we observe a decline in intellectual functioning, we must ask whether the decline is due to the surgery or to age-related cognitive decline in elderly patients. The most serious threat to the internal validity of this study is maturation.

Similar to the history threat, maturation can be managed by using a control group created through random assignment. With an NECG design, we can use matching, have multiple control groups wherever possible, and check for pretest equivalence. Selection by maturation interaction can also be a threat here. That is, the treatment and control groups may differ in the rate at which participants are maturing in intellectual functioning before the surgery. This is referred to as differential maturation. The treatment group measures the change due to the surgery plus the change due to maturation (ΔT + ΔM), while the control group measures only the change due to maturation (ΔM), so the difference between the two isolates ΔT. Another strategy is to limit the time between the pretest and posttest measures. We should also avoid rapidly maturing samples whenever possible.

Reactivity

Reactivity is the third time threat to internal validity. It is the change in the DV due to participants’ reaction to the pretest, also called pretest sensitization. Here the testing itself is the source of change. We give a test to establish a baseline measure and then give the same test again to demonstrate change. However, participants may perform better simply because they are more familiar with the test, because the pretest somehow influences them to change their behavior, or because they are aware of their participation in a study.

For example, people commonly improve on standardized tests such as intelligence tests, SATs, or GREs when they take them a second time. Suppose we are evaluating a study on creating awareness of the health benefits of regular exercise. We decide to develop a pretest measure of the weekly time participants spend on regular exercise. Our pretest measure might make participants aware of what they ought to be doing for their health. Participants might decide to engage in more regular exercise even before the study starts.

This affects all designs that use a pretest measure. An obvious solution to this problem is to have a control group that will measure this effect. We can also use the Solomon four-group design, which lets us estimate the effect of pretesting and separate it from the treatment effect. Another possible solution is to administer the pretest in a separate session so that the participants do not perceive it to be related to the main experiment (disguise the pretest). We can also use different but equivalent tests. Finally, we can avoid the pretest or design an unobtrusive pretest. Pretesting is generally done so that the pretest measure can be used as a covariate. This increases statistical power, but there is a trade-off in research design between pretest sensitization and power.
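To make the Solomon four-group logic concrete, here is a minimal sketch that treats the four posttest means as a 2 x 2 layout (pretested or not, treated or not). The group means are hypothetical, and a real analysis would test these contrasts statistically rather than only compute them.

```python
# Hypothetical mean posttest scores (weekly exercise minutes) for the
# four groups of a Solomon four-group design. All values are made up.
posttest_means = {
    ("pretested", "treated"): 150.0,
    ("pretested", "untreated"): 110.0,
    ("unpretested", "treated"): 130.0,
    ("unpretested", "untreated"): 100.0,
}


def solomon_effects(m):
    """Return treatment, pretest, and interaction contrasts from cell means."""
    pt, pu = m[("pretested", "treated")], m[("pretested", "untreated")]
    ut, uu = m[("unpretested", "treated")], m[("unpretested", "untreated")]
    treatment_effect = (pt + ut) / 2 - (pu + uu) / 2  # averaged over pretesting
    pretest_effect = (pt + pu) / 2 - (ut + uu) / 2    # averaged over treatment
    # Does the treatment work differently when a pretest was given?
    sensitization = (pt - pu) - (ut - uu)
    return treatment_effect, pretest_effect, sensitization


treatment_effect, pretest_effect, sensitization = solomon_effects(posttest_means)
print(treatment_effect, pretest_effect, sensitization)
```

A non-zero pretest contrast suggests that taking the pretest changed posttest scores on its own, and a non-zero interaction suggests that the pretest changed how participants responded to the treatment.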

Summary

History, maturation, and reactivity threats are most serious in single-group pretest-posttest designs; correlational, longitudinal, and NECG designs are also vulnerable.
A control group can help isolate these effects as long as the groups are equivalent.
Multiple control groups can be employed when it is difficult to find one equivalent control group.
Shorten the time duration between pretest and posttest measures.

Instrumentation

Instrumentation is a threat when the change in the DV is due to a change in the way the measure is defined or the data are collected. Put simply, any change in the way the dependent variable is measured results in the threat of instrumentation. That is, the nature of the measurement changes rather than the participants. It is called instrumentation because it relates to a change in the instrument. It is also referred to as measurement decay. It affects longitudinal designs. If one measure is used in the pretest and another in the posttest, instrumentation can be an issue.

Instrumentation can also be a problem when data collectors are employed as observers, scorers, raters, or recorders. Individual characteristics of the data collectors, such as age, gender, language spoken, and experience, can introduce bias in the observations. Long-drawn data collection processes can eventually lead to fatigue, which in turn leads to scoring differences. Standardization of the data collection processes can help to a certain extent.

Obesity has been a lingering problem in the US. The estimated annual cost of obesity was nearly $173 billion in 2019. Obesity has been measured differently over the years. First, we used weight and height as a measure of obesity, then body mass index (BMI), and now waist circumference and waist-to-height ratio. It is important to note that there is no actual change in the DV here; only how we measure it changes. So when comparing obesity rates across the years, the difference might be due to instrumentation.
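A small sketch can show how the same people produce different obesity rates under two measures. The five records are invented, and the waist-to-height cut-off of 0.5 is used here only as an illustrative assumption; the BMI cut-off of 30 is the conventional threshold for obesity.

```python
# Hypothetical people: (weight_kg, height_m, waist_cm). Made-up values.
people = [
    (95.0, 1.75, 104.0),
    (82.0, 1.80, 96.0),
    (70.0, 1.65, 88.0),
    (60.0, 1.70, 74.0),
    (88.0, 1.68, 99.0),
]


def obese_by_bmi(weight_kg, height_m, cutoff=30.0):
    """Classify using body mass index (kg / m^2)."""
    return weight_kg / height_m ** 2 >= cutoff


def obese_by_whtr(waist_cm, height_m, cutoff=0.5):
    """Classify using waist-to-height ratio; the cutoff is an assumption."""
    return waist_cm / (height_m * 100) >= cutoff


bmi_rate = sum(obese_by_bmi(w, h) for w, h, _ in people) / len(people)
whtr_rate = sum(obese_by_whtr(waist, h) for _, h, waist in people) / len(people)

# Same people, same bodies, different instrument: the rates disagree.
print(f"BMI-based rate: {bmi_rate:.0%}, waist-to-height rate: {whtr_rate:.0%}")
```

Nobody in this toy sample changed, yet the measured rate differs between the two definitions, which is exactly the kind of difference that could be mistaken for a real trend.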

The solution again is to have a control group to capture this effect. However, bias introduced by data collectors working as observers can be hard to identify. In such cases, it is also harder to predict and control this threat.

Time threats

Actual changes in DV
- History
- Maturation
- Reactivity
No actual change in the DV
- Instrumentation

Group threats

These threats to internal validity pose a rival explanation for group differences other than the experimental manipulation (IV). That is, the groups might not be equivalent to each other at the start of the experiment. These threats affect between-subjects and mixed designs. These threats are:

Selection
Regression to the mean / Statistical regression

Selection

Selection threat is operational when post-test differences between groups are due to pre-existing differences between groups. Put simply, the groups are not equivalent to each other at the beginning of the experiment. Selection threat also indicates the presence of confounding variables. Quasi-experimental designs, NECG designs, and even true experiments where random assignment fails are vulnerable to this threat. The pre-existing differences then serve as rival explanations. Let’s take a look at an example.

Let’s say we are interested in evaluating the effectiveness of a new teaching method on students’ achievement scores. We take two classes in the school, provide the treatment to one, and keep the other as a control group. After the study, we find out that the students in the control group are higher achievers than those in the experimental group. We find no treatment effect. Our failure to find an effect might be due to the pre-existing differences between the groups. Maybe the children in the control group were already academically ahead of those in the treatment group. An obvious solution is the use of random assignment, which should take care of the pre-existing differences.
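Random assignment itself is simple to sketch. The example below assumes individual students (rather than intact classes) can be assigned, and it uses simulated pretest scores; the function name and the seeds are hypothetical choices for illustration.

```python
import random
from statistics import fmean


def randomly_assign(participants, seed=None):
    """Shuffle participants and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical students with simulated pretest achievement scores.
score_rng = random.Random(0)
students = [{"id": i, "pretest": score_rng.gauss(70, 10)} for i in range(200)]

treatment, control = randomly_assign(students, seed=1)

# Check pretest equivalence: the gap should be small relative to the spread.
pretest_gap = (fmean([s["pretest"] for s in treatment])
               - fmean([s["pretest"] for s in control]))
print(f"pretest gap between groups: {pretest_gap:.2f}")
```

Random assignment makes the groups equivalent on average, not in every sample, so checking the pretest gap is still worthwhile, especially with small groups.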

The presence of confounding variables is another reason for pre-existing differences between the groups. Elsewhere I have discussed how the ‘seniority’ of typists can be a confounding variable that creates pre-existing differences between the groups.

Regression to the mean

When participants are selected based on extreme scores, their scores tend to move toward the mean when they are tested again. This is called regression to the mean. That is, on average, high pretest scores will get lower on the posttest and low pretest scores will improve. There are pre-existing differences between the groups. Again, this threat affects quasi-experiments, NECG designs, and single-group designs. The pre-existing differences are a form of bias. This is common when the treatment is made available either to those with special merits or to those with special needs. For instance, patients who come to psychotherapy when they are extremely distressed are likely to be less distressed on subsequent occasions, even if psychotherapy has no effect.

For example, suppose a mental health clinic refers all clinically depressed patients to a new treatment program aimed at improving well-being. It then finds improvements in the levels of depression measured after the program is implemented. To conclude that the new treatment program is effective might be wrong. Starting the experiment with people who were already experiencing symptoms of depression means starting with extreme pretest values. Depression is not a normal condition, and over time, even without treatment, people tend to feel better (moving toward the normal mean). Regression to the mean is a serious threat to internal validity in this case.

Another example is when children with the worst reading scores are selected to participate in a reading course. Improvements at the end of the course might be due to regression to the mean and not to the course’s effectiveness. If the children had been tested again before the course started, they would likely have obtained better scores anyway. Maturation is also a threat in this case.

The solution again is the use of random assignment to eliminate pre-existing differences between the groups. If avoiding people with extreme scores is not possible, the best solution is to create a larger group of people with extreme scores and then randomly assign people from that larger group to differently treated groups. This can unconfound regression because it affects both groups in the same way.
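Both points can be seen in a short simulation. It is a sketch under stated assumptions: each observed score is a stable ability plus independent measurement noise, no treatment is applied to anyone, and all the numbers (means, spreads, sample size, the 10% cut) are invented for illustration.

```python
import random
from statistics import fmean

rng = random.Random(7)

# Hypothetical reading scores: stable ability plus independent noise on
# each testing occasion. No treatment is applied to anyone.
ability = [rng.gauss(50, 10) for _ in range(20000)]
pretest = [a + rng.gauss(0, 10) for a in ability]
posttest = [a + rng.gauss(0, 10) for a in ability]

# Select the 10% of children with the lowest pretest scores.
order = sorted(range(len(pretest)), key=lambda i: pretest[i])
extreme = order[: len(order) // 10]

# Their scores rise on retest even though nothing was done.
gain_without_treatment = fmean([posttest[i] - pretest[i] for i in extreme])

# Randomly split the extreme scorers into two arms: both regress alike,
# so a comparison between the arms is not distorted by regression.
rng.shuffle(extreme)
half = len(extreme) // 2
arm_a, arm_b = extreme[:half], extreme[half:]
gain_a = fmean([posttest[i] - pretest[i] for i in arm_a])
gain_b = fmean([posttest[i] - pretest[i] for i in arm_b])

print(f"gain with no treatment: {gain_without_treatment:.1f}")
print(f"gain in arm A: {gain_a:.1f}, gain in arm B: {gain_b:.1f}")
```

The selected group improves substantially without any intervention, while the two randomly formed arms show nearly the same gain. If one arm had received a treatment, the difference between the arms, rather than the raw gain, would be the quantity to interpret.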

Mortality

The last threat to internal validity is mortality or attrition. It refers to the loss of participants midway through the experiment. That is, participants drop out of the experiment for various reasons and the experiment loses the posttest observations for these participants. Mortality is a problem when the participants dropping out have some common characteristics that are relevant to the experiment.

For example, many weight loss treatments see participants dropping out as the treatment gets tougher over time. Only those who are persistent stay on and complete it. To infer that the treatment is effective would then be incorrect, as only those who were already persistent completed the treatment. In that sense, the participants who completed the treatment were systematically different from the initial sample. Mortality or attrition is therefore a special case of selection occurring after the treatment has started. However, it cannot be controlled by random assignment.

Needless to say, this threat affects longitudinal designs. A control group is not helpful here if there is a differential mortality rate between the treatment and control groups. One way to deal with the issue of mortality is to reduce the time between pretest and posttest. Some researchers also recommend rewards for participants and providing make-up sessions wherever possible. In any case, we should be mindful of it, and when it is suspected we should check whether there are any systematic differences between those who left and those who didn’t.
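The two checks mentioned above, differential attrition between groups and systematic differences between dropouts and completers, can be sketched as follows. The records and the choice of baseline weight as the comparison variable are hypothetical.

```python
from statistics import fmean

# Hypothetical records: (group, baseline_weight_kg, completed_study).
participants = [
    ("treatment", 92.0, True),
    ("treatment", 105.0, False),
    ("treatment", 88.0, True),
    ("treatment", 110.0, False),
    ("control", 95.0, True),
    ("control", 101.0, True),
    ("control", 90.0, True),
    ("control", 108.0, False),
]


def attrition_rate(records, group):
    """Share of a group's participants who did not complete the study."""
    rows = [r for r in records if r[0] == group]
    return sum(1 for r in rows if not r[2]) / len(rows)


def baseline_mean(records, completed):
    """Mean baseline weight of completers (True) or dropouts (False)."""
    return fmean([r[1] for r in records if r[2] == completed])


treatment_attrition = attrition_rate(participants, "treatment")
control_attrition = attrition_rate(participants, "control")
completer_baseline = baseline_mean(participants, completed=True)
dropout_baseline = baseline_mean(participants, completed=False)

print(f"attrition: treatment {treatment_attrition:.0%}, control {control_attrition:.0%}")
print(f"baseline weight: completers {completer_baseline:.1f}, dropouts {dropout_baseline:.1f}")
```

In this toy data set the treatment group loses more participants than the control group, and dropouts started out heavier than completers, so a posttest comparison of completers alone would be hard to interpret.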

Conclusion

Internal validity threats are the reasons we can be partly or completely wrong when we make inferences about covariation, causation, and constructs. Threats serve a valuable function. They help us anticipate the criticisms of inferences that experience has shown occur frequently, and we should try to rule them out using design elements. We do that by analyzing which threat applies in any particular case, evaluating its plausibility, and then applying design elements and/or statistical control. Here are some design and other forms of control we can employ to reduce different types of threats to internal validity.

Design controls to threats to internal validity

Sources

Aldrich, J. (1995). Correlations genuine and spurious in Pearson and Yule. Statistical Science, 10(4), 364–376.

Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research on teaching. In N. Gage (Ed.), Handbook of research on teaching (pp. 171–246). Chicago: Rand-McNally.

Rosenbaum, P. R. (1987). The role of a second control group in an observational study (with discussion). Statistical Science, 2(3), 292–316. doi:10.1214/ss/1177013232

Shadish, W., Cook, T., & Campbell, D. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.