Validity in research

Validity is arguably one of the most important properties of measurement in scientific research, as every aspect of research design can affect the validity of the research. Cook and Campbell (1979) defined it as “the best available approximation to the truth or falsity of propositions” (p. 37). The Standards for Educational and Psychological Testing (Standards), a joint product of the AERA, APA, and NCME, describes it as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (p. 11). Let’s unpack this.

As discussed earlier, research methods include designing the research, deciding how the data will be collected, and analyzing the data. The researcher therefore needs to know the logic behind selecting a particular method, because that logic is what justifies the choice of methods and supports valid inferences.

In social science research, we generally deal with constructs. Constructs are abstract, unmeasured aspects of people, events, things, or mental states that cannot be directly observed or manipulated. Overcrowding, stress, love, intelligence, and aggression, for example, are all abstract concepts. A theory, in turn, is a formalized set of constructs that organizes observations and inferences and can predict and explain a phenomenon of interest. A theory links two or more constructs and specifies the suspected nature of the relationship between them. The abstract nature of constructs, however, makes them hard to observe directly, so each construct must be converted into an operational construct through the process of operationalization.

The way we represent and measure an abstract construct for a given research study is called operationalization.

Here is an example of how the construct of depression is measured, or operationalized, by the CDC. Operational constructs are the test specifications, scales or instruments, and other elements of assessment that theoretically represent and empirically measure theoretical constructs. An instrument or scale is a tool for measuring, observing, or documenting data.

Validity, then, refers to this crucial relationship between the theoretical construct and the operational construct: the extent or degree to which the operational construct measures what it is supposed to measure. Put simply, it is the question we have to ask ourselves: “Are we measuring what we think we are measuring?” This is crucial to the relevance of any inferences we can make. Validity is thus both a theoretical and an empirical concept. We need theoretical evidence to show we are measuring the right construct and empirical evidence to show we have measured the right construct, which together enable us to draw correct inferences.

In a nutshell, measurement validity is asking, “Are we measuring what we think we are measuring?”

This means that a validity argument should integrate various threads of evidence that support the operationalization of theoretical constructs and the interpretation of those constructs for the study’s purpose. Because validity rests on an accumulation of evidence, it is best viewed along a continuum from weak to strong rather than as all-or-nothing.

There are three major ways to provide evidence for validity: content, criterion, and construct validity. Together these are referred to as measurement validity, that is, how well a measure reflects what it is supposed to measure. For example, depression is a complex construct measured through nine related concepts: sadness (dysphoria), loss of interest (anhedonia), appetite, sleep, thinking/concentration, guilt (worthlessness), tiredness (fatigue), movement (agitation), and suicidal ideation. The assumption is that each related concept measures a part of the total construct of depression.
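To make this concrete, here is a minimal sketch of how such an operationalization might be scored. The item labels and the 0–3 response scale below are illustrative assumptions modeled on PHQ-9-style instruments, not the exact CDC tool:

```python
# Hypothetical PHQ-9-style scoring: operationalizing "depression" as the
# sum of nine item responses, each rated 0 (not at all) to 3 (nearly every day).
ITEMS = [
    "anhedonia", "dysphoria", "sleep", "fatigue", "appetite",
    "worthlessness", "concentration", "agitation", "suicidal_ideation",
]

def depression_score(responses: dict[str, int]) -> int:
    """Sum the nine item responses into a total score (0-27)."""
    for item in ITEMS:
        if not 0 <= responses[item] <= 3:
            raise ValueError(f"{item} must be rated 0-3")
    return sum(responses[item] for item in ITEMS)

example = {item: 1 for item in ITEMS}  # mild responses on every item
print(depression_score(example))       # -> 9
```

Each item contributes one piece of the total construct; the sum is the operational construct that stands in for the unobservable theoretical one.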

Content validity

Content validity refers to the extent to which the items or questions on an instrument are fairly representative of the construct it is intended to measure. That is, you can assess the validity of the instrument by examining its content. For example, a final exam in a sociology class should indicate how much knowledge students have gained. It is impossible to cover every detail of the course, but we do expect the questions on the exam to be fairly representative of the whole course. We can judge the content validity of the exam by checking how representative the questions are of the course material.

Put simply, it is a means to validate an instrument (the exam in the previous example) used to sample a content domain (the material covered in the course). The best way to ascertain content validity is to ask a subject matter expert, which means it rests on expert judgment. Typically, the judgments of multiple experts are used to establish this validity.

A clear definition of the content domain is thus key for the experts to establish content validity. Subject matter experts first establish face validity by checking whether the items or questions appear to represent what the instrument is attempting to measure. Then they look for underrepresentation (relevant items are missing) or overrepresentation (irrelevant items are included) of the construct’s content. If all is good in their view, the instrument is said to have established content validity. In general, we should look for evidence of content relevance, representativeness, and technical quality of items. That is, content validity focuses on the quality of the measure rather than the interpretations. It also provides evidence for construct validity, as it shows that the measure covers the intended domain of the construct being measured.

Advantages

  • Once you have a subject matter expert, it is relatively easy to establish.
  • It provides evidence for construct validity.

Disadvantages

  • It is subjective and thus prone to confirmatory bias. That is, a priori ideas about what the content should be are inseparable from the process of establishing this validity. To minimize this, multiple subject matter experts are generally used, which is also why content validity usually plays a supporting role alongside criterion and construct validity.
  • The subjective nature makes it imprecise for scientific purposes.
  • Often the intent of the items is obvious, revealing what the instrument is measuring, which can lead to social desirability bias.

Criterion validity

When an existing measure is accepted as the best available measure of the target construct (the gold standard), we refer to it as a criterion. The existing measure is called the criterion measure, and the measure we want to validate is the new (target) measure. Criterion validity, then, is the degree to which the new measure and the criterion measure correlate with each other. Such a correlation coefficient is referred to as a validity coefficient. Criterion validity is generally established by either a predictive or a concurrent approach. In the predictive approach, the target measure is used to predict the criterion measure; that is, the target measure is obtained before the criterion measure. Let’s break this down with an example.

Criterion validity is the degree to which the new measure and the criterion measure correlate with each other.

Universities are interested in admitting high-quality students who will succeed in their academic life. The known and well-established criterion for college success is the grade point average (GPA, the criterion measure). However, when making admission decisions the GPA is not yet available, so a reasonable alternative is needed that can predict how students might perform academically. Scores on a standardized aptitude test such as the SAT can serve as a predictor of academic performance, and the correlation between SAT scores (the predictor, or proxy) and GPA (the criterion) serves as evidence of criterion validity. Universities might also want to add proxy measures such as the strength of recommendations and fit to the program to obtain better predictions. In that situation, regression analysis can be used to establish criterion validity (effect size).
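As a minimal sketch of the predictive case, with entirely made-up admission records, the validity coefficient is simply the correlation between the proxy and the criterion:

```python
import numpy as np
from scipy import stats

# Hypothetical data: SAT scores collected at admission (predictor)
# and first-year GPA observed later (criterion).
sat = np.array([1050, 1180, 1260, 1340, 1400, 1450, 1520, 1560])
gpa = np.array([2.6, 2.9, 3.1, 3.0, 3.4, 3.3, 3.7, 3.8])

# The validity coefficient is the Pearson correlation between
# the new (predictor) measure and the criterion measure.
r, p = stats.pearsonr(sat, gpa)
print(f"validity coefficient r = {r:.2f} (p = {p:.3f})")
```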

Another example is the development of heart disease later in life. Here the criterion measure could be experiencing a heart attack. Doctors often use proxy measures such as current diet, exercise behavior, blood pressure, and family history to predict future heart-related health issues. Criterion validity can be established using correlation and/or regression techniques to quantify the relationship between the proxy measures and the criterion measure. Researchers routinely use such proxy measures to predict outcomes like job performance, voting outcomes, medical diagnoses, product demand, and production needs.
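Where several proxy measures jointly predict one criterion, regression quantifies the relationship. Below is a sketch on synthetic data (the variable names and effect sizes are invented for illustration), using logistic regression since the criterion here, experiencing a heart attack, is binary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Synthetic proxy measures (standardized for simplicity).
blood_pressure = rng.normal(size=n)
diet_quality = rng.normal(size=n)
family_history = rng.integers(0, 2, size=n)

# Synthetic criterion: heart attack (1) or not (0), loosely driven by the proxies.
logit = 0.9 * blood_pressure - 0.6 * diet_quality + 0.8 * family_history - 1.0
heart_attack = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([blood_pressure, diet_quality, family_history])
model = LogisticRegression().fit(X, heart_attack)

# In-sample fit stands in for the strength of the proxy-criterion relationship.
print(f"coefficients: {model.coef_.round(2)}")
print(f"accuracy: {model.score(X, heart_attack):.2f}")
```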

Establishing criterion validity using the predictive approach is also referred to as predictive validity. In the concurrent approach, the criterion measure and the proxy (new) measure are obtained at approximately the same time; both tap into the same construct. The primary reasons for obtaining a new measure when an established criterion measure exists are cost and convenience. For example, rapid home COVID-19 (antigen) tests correlate highly with the polymerase chain reaction (PCR) test for COVID-19. PCR tests are considered the gold standard for accuracy, but they are expensive, have lengthy turnaround times, and must be analyzed in a laboratory. Antigen tests, on the other hand, have a faster turnaround time, are cheaper, and can be performed at home. The high agreement between the two tests establishes concurrent validity for the antigen tests.
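Here is a minimal sketch of the concurrent case with fabricated paired results (not real test data). On 0/1 codes the Pearson correlation is the phi coefficient, so the same validity-coefficient logic applies:

```python
import numpy as np
from scipy import stats

# Fabricated paired results from the same people at the same time:
# 1 = positive, 0 = negative.
pcr     = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1])
antigen = np.array([1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1])

# On 0/1 data the Pearson correlation is the phi coefficient;
# a high value is evidence of concurrent validity for the cheaper test.
phi, _ = stats.pearsonr(pcr, antigen)
agreement = np.mean(pcr == antigen)
print(f"phi = {phi:.2f}, raw agreement = {agreement:.0%}")
```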

Advantages

  • When a criterion measure is available, a single easily interpreted validity coefficient (e.g., correlation coefficient) or regression analysis (when there are multiple proxy measures) can be used.
  • It is objective.

Disadvantages

  • A criterion measure relevant to the desired decision must exist. It is difficult to find criterion measures for complex constructs like self-esteem or depression.
  • It also assumes that the criterion measures are valid and free from bias.
  • It needs the criterion measure to be reliable, stable, and replicable.
  • The selection of the criterion measure can be affected by the availability and cost of the measure.
  • Range-restricted samples attenuate the correlation between the proxy measure and the criterion measure, and small sample sizes make the validity coefficient unstable.

Construct validity

When we want to establish validity evidence for a complex, multifaceted, theory-based construct, such as global self-esteem, leadership, or emotional intelligence, we need to look at how well the measure or instrument reflects the target construct. Construct validity is the collection of all evidence that supports the interpretation and use of the score as a measure of the construct.

Given that such constructs are highly abstract, construct validity is more difficult to establish. Our best option is to gather evidence that strengthens our confidence in the construct validity of the measure. This can be accomplished by examining the internal structure of the instrument, its relationship with other instruments that supposedly measure the same construct, and the relationships among indicators of different constructs within a theory.

Some tests and instruments are designed to measure a single construct, while others are designed to measure more than one dimension of a construct. For example, transformational leadership is measured through five related dimensions: idealized attributes, idealized behaviors, inspirational motivation, intellectual stimulation, and individualized consideration. Evidence of construct validity is often examined through the internal structure of the measure. That is, a researcher collects data using the new measure and tests it with a statistical procedure called factor analysis. Factor analysis helps determine these dimensions (also referred to as factors) and how they relate to each other. It tells the researcher whether the instrument is unidimensional or multidimensional and whether the measure meets its theoretical specifications. Researchers also examine the homogeneity of each set of unidimensional items by calculating item-total correlations, intercorrelations among the subscales representing each dimension, and coefficient alpha.
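Coefficient alpha, mentioned above, is straightforward to compute from an item-score matrix. The sketch below uses synthetic responses; a real study would first check dimensionality with a factor analysis (for example, via the factor_analyzer package):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_respondents x k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Synthetic responses: 100 people answering 5 items on a 1-5 scale,
# all driven by one underlying trait so the items hang together.
rng = np.random.default_rng(1)
trait = rng.normal(size=(100, 1))
items = np.clip(np.rint(3 + trait + rng.normal(scale=0.7, size=(100, 5))), 1, 5)

print(f"alpha = {cronbach_alpha(items):.2f}")  # high -> homogeneous items
```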

The second method is to correlate the measure with other measures that are thought to reflect the same construct. For example, the Transformational Leadership Inventory (TLI) and the Multifactor Leadership Questionnaire (MLQ) both measure transformational leadership. Since both measure the same concept, scores from the two scales should correlate highly. Establishing validity this way is called convergent validity: valid measures of the same construct should agree with each other. Put simply, we have convergent validity when measures that are expected to be related are in fact related. Conversely, we have discriminant validity when scores from instruments designed to measure unrelated constructs are unrelated to each other. For example, the MLQ is, and should be, empirically distinguishable from the Authentic Leadership Questionnaire (ALQ), as the two measure theoretically different constructs.
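The expected pattern is easy to illustrate with simulated scores (TLI, MLQ, and ALQ below are just labels on synthetic columns, not real instrument data): convergent measures correlate highly, discriminant ones do not:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Simulate two measures of the same construct (TLI and MLQ share a
# common factor) and one measure of a different construct (ALQ is independent).
transformational = rng.normal(size=n)
tli = transformational + rng.normal(scale=0.5, size=n)
mlq = transformational + rng.normal(scale=0.5, size=n)
alq = rng.normal(size=n)

r = np.corrcoef([tli, mlq, alq])
print(f"TLI-MLQ (convergent):   r = {r[0, 1]:.2f}")  # expected: high
print(f"MLQ-ALQ (discriminant): r = {r[1, 2]:.2f}")  # expected: near zero
```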

Lastly, what if we have two measures of the same construct of emotional intelligence, call them EI1 and EI2, but they do not converge? Which one should we use? In such a situation we look for evidence of construct validity in how well each measure confirms the theory. For example, suppose a theory states that exposure to a certain kind of training program leads to an increase in emotional intelligence. To gather construct validity evidence, we need to assume two things: first, that the theory is correct, and second, that emotional intelligence will increase after the training program is delivered. Then, if EI1 shows the predicted increase and EI2 does not, we can conclude that EI1 reflects the construct of emotional intelligence better than EI2. This is very similar to criterion validity, where we assume the criterion is valid; here we assume the theory is valid and that the change we observe is real.
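One way to picture this comparison, on entirely synthetic pre/post data with invented effect sizes: deliver the training and see which measure shows the theory-predicted gain:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 60

# Synthetic pre/post scores: EI1 registers the training effect, EI2 barely does.
ei1_pre = rng.normal(50, 10, n); ei1_post = ei1_pre + rng.normal(5, 5, n)
ei2_pre = rng.normal(50, 10, n); ei2_post = ei2_pre + rng.normal(1, 5, n)

for name, pre, post in [("EI1", ei1_pre, ei1_post), ("EI2", ei2_pre, ei2_post)]:
    t, p = stats.ttest_rel(post, pre)                   # paired t-test on the gain
    d = (post - pre).mean() / (post - pre).std(ddof=1)  # effect size (Cohen's d)
    print(f"{name}: mean gain = {(post - pre).mean():.1f}, d = {d:.2f}, p = {p:.3f}")
```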

Establishing construct validity requires that the conclusions drawn from research hold up across contexts. In this sense, construct validity often helps in refining and clarifying the theory itself.

Advantages

  • It’s a comprehensive way to look at the validity of measurement.
  • There are multiple ways to establish evidence for construct validity such as content validity, convergent validity, and discriminant validity.
  • Construct validity can help refine and clarify the theory itself.

Disadvantages

  • It is relatively difficult and time-consuming to establish construct validity.
  • The abstract nature of constructs compounds this difficulty, since the construct itself can never be observed directly.

Several types of validity have been explained here, but one thing we must remember is that there is no single right way to validate a measure. Rather, the intended use of the measure helps determine the most appropriate type of validity. An even better approach is to examine components of content, criterion, and construct validity in unison to yield a comprehensive and integrated approach to validation.

But validity also relates to the ability of the research design to provide evidence of a cause-and-effect relationship between the independent and the dependent variable. This type of validity is known as internal validity. Additionally, validity relates to our ability to draw valid inferences from the research: our ability to generalize results from the study population to the target population is known as external validity. Then there is ecological validity, the extent to which a research situation represents the natural social environment or the “real world.” I cover these in a separate post.

Overall, validity refers to the accuracy and trustworthiness of the instrument or scale, data, experiment or observation, analysis, and findings of the research.

Sources:

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061–1071.

Brewer, M. B., & Crano, W. D. (2014). Research design and issues of validity. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (pp. 11-26). Cambridge University Press.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.

Campbell, D. T. (1988). Definitional versus multiple operationalism. In E. S. Overman (Ed.), Methodology and epistemology for social science: Selected papers (pp. 32–36). Chicago, IL: University of Chicago Press.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin.

Dooley, D. (2001). Social research methods (4th ed.). Prentice Hall.

Malloy, E., & Kavussanu, M. (2021). A comparison of authentic and transformational leadership in sport. Journal of Applied Social Psychology, 51(7), 636–646. https://doi.org/10.1111/jasp.12769

Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14(4), 5–8.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education and Macmillan.

Price, L. R. (2017). Psychometric methods: Theory into practice. The Guilford Press.
