Assessing reliability in research methods

Reliability, fundamentally, concerns the extent to which a measure, an experiment, or a test yields the same results on repeated trials. However, the measurement of any phenomenon invariably contains some amount of chance error, so even repeated measures of the same characteristic for the same individual might not duplicate one another. At the same time, we can and should expect consistent results on repeated measurement from a good experiment, test, or instrument. This consistency is what we refer to as reliability. Reliability, thus, is a matter of degree. Four major ways of assessing reliability are test-retest, parallel test, internal consistency, and inter-rater reliability. In theory, reliability is the ratio of true score variance to observed score variance.

Reliability = True score variance / (True score variance + Error variance)

Reliability is largely an empirical issue, concerned with the performance of an empirical measure. Measurement error can be random or non-random. Random errors are chance factors that distort the true score, and they are inversely proportional to the degree of reliability. Random errors are also unsystematic: they are as likely to overestimate the true score as to underestimate it, which means that across multiple observations they should roughly even out. Random error can affect most stages of the research process and is thus endemic to research, whether in the social, physical, or natural sciences. A good research design is therefore one that minimizes error and maximizes reliability. However, reliable measures are not necessarily valid; rather, reliability is a prerequisite for a measure to be valid. Together, reliability and validity are the first line of defense against spurious or incorrect conclusions.
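To make this variance decomposition concrete, here is a minimal simulation sketch in Python (not from the original article, and the numbers are arbitrary): observed scores are generated as true scores plus random error, the errors roughly cancel out, and the reliability comes out as the share of observed-score variance contributed by the true scores.

```python
import numpy as np

rng = np.random.default_rng(42)

n_people = 10_000
true_scores = rng.normal(loc=50, scale=10, size=n_people)  # true score variance ~ 100
errors = rng.normal(loc=0, scale=5, size=n_people)         # error variance ~ 25
observed = true_scores + errors                            # observed = true + random error

reliability = true_scores.var() / observed.var()
print(f"Reliability  ~ {reliability:.2f}")   # close to 100 / (100 + 25) = 0.80
print(f"Mean error   ~ {errors.mean():.3f}") # random errors roughly even out
```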

Figure: Relationship between Reliability and Validity

Reliability is also viewed differently in qualitative and quantitative research. In quantitative research, reliability can be characterized as the extent to which multiple researchers reach similar conclusions when they replicate an experiment under identical conditions. That is, reliable scientific research should produce similar results if it is carried out in the same or a similar way. Reliability is thus not specific to a particular study; it is an essential indicator of study quality (along with validity), and it is how science builds on itself through transparency and scrutiny. In qualitative research, however, there is no uniform understanding of reliability; it often translates into credibility, dependability, and confirmability. The idea is to provide some logic to the subjective nature of the inquiry. This logic is often related to methodological rigor and coherence, the researcher’s responsiveness to verification of facts, and accountability through a transparent methodology. Critics, however, have argued that forced attempts to demonstrate reliability can be counterproductive and a departure from the essence of qualitative research. On this view, it is the researcher’s creativity and originality that produce the richness and meaningfulness of the analysis, and repeatability is neither desired nor possible.

Methods of assessing reliability

Researchers often collect data through different techniques. One way of assessing reliability is to take the same measure twice at different time points through a longitudinal design (test-retest reliability). A second way is to have individuals complete two similar scales that measure the same construct in a cross-sectional design (parallel test reliability). A third way is to ask different raters to rate a text, an interaction, or some characteristic related to the variable and to measure the agreement between the raters (inter-rater reliability). The last method assesses the consistency of the items within the instrument used to measure a variable (internal consistency). In all of these methods, reliability is generally measured through a correlation coefficient. Unlike other correlations, however, this correlation cannot take a negative value, so it ranges from 0 to 1. An acceptable level of reliability yields a correlation coefficient (also called a reliability coefficient) higher than 0.7 or 0.8. Let’s look at each of these methods in detail.

Test-retest reliability

One of the easiest ways of assessing the reliability of an empirical measure is to administer the measure to the same person at two different points in time. It is a test of the stability of a measure over time. Researchers can then simply correlate the scores from the two administrations. If the measure is reliable, the scores will have a positive and reasonably high correlation. If the researcher administers the test more than twice, s/he can simply take the mean of the pairwise correlations.
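As a rough illustration, here is a minimal sketch in Python with hypothetical scores: test-retest reliability is simply the correlation between the two administrations.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for eight people, measured at two time points
time1 = np.array([12, 15, 9, 20, 18, 14, 11, 17])
time2 = np.array([13, 14, 10, 19, 17, 15, 10, 18])

r, _ = pearsonr(time1, time2)
print(f"Test-retest reliability: r = {r:.2f}")

# With more than two administrations, one rough summary (as the text suggests)
# is the mean of the pairwise correlations between administrations.
```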

Figure: Test-Retest Reliability

Assumptions

There are certain assumptions when you use this type of reliability. The first is that you use this test only when you expect the true scores to remain stable across time, particularly through the time interval between the test and the retest. For example, intelligence and certain personality traits are generally considered to remain stable over time, making them appropriate for test-retest reliability, whereas moods or preferences are expected to change or vary over time. In such cases, test-retest is not a valid technique for testing the reliability of a measure. Second, the error variance of the first test should be equal to that of the second test. That is, the precision of the test, or of what is being measured, should not vary over time.

Advantages of test-retest reliability

  • It is the easiest way to estimate reliability.
  • Only a single form of the test is required; no parallel version needs to be developed.

Disadvantages of test-retest reliability

  • Time interval – if the time interval between the two tests is too long, the true score might change.
  • Memory (carry-over effect) – if the time interval is too short, individuals might remember their responses from the first administration and reproduce them, thus overestimating reliability.
  • Reactivity – sometimes the very process of taking the first test can induce a change in the true score, referred to as reactivity, which can result in underestimation of reliability.
  • It is often difficult to get multiple responses from the same people, and doing so can be resource- and time-consuming or sometimes even impractical.
  • A lower correlation coefficient can also reflect a change in the true score rather than lower reliability, so it might not be possible to separate the reliability of the measure from the stability of the construct.

Parallel test reliability or Parallel form reliability

To overcome some of the disadvantages of test-retest reliability, researchers use parallel test reliability. Here the researcher administers a similar or “equal” test that uses different items to the same people, generally at different time points. These similar or “equal” tests are designed to measure the same construct. Ideally, the tests should be similar in content, level of difficulty, and type of scale. In theory, they should yield identical true scores and have independent errors and identical error variances. Once the scores from the two tests are obtained, they can be correlated with each other to estimate reliability. This method is an improvement over the test-retest method because it takes care of the memory problem. However, there is still a chance that the true score might change, so this method of assessing reliability is appropriate for measures that are stable over time. The obvious difficulty is constructing two similar tests that measure the same construct.
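A minimal sketch with hypothetical scores on two forms (A and B) of the same test: the parallel-forms assumptions imply roughly equal means and variances across forms, and the correlation between the forms serves as the reliability estimate.

```python
import numpy as np

# Hypothetical scores for eight people on two parallel forms
form_a = np.array([22, 30, 18, 25, 27, 20, 24, 29])
form_b = np.array([23, 28, 19, 26, 26, 21, 23, 30])

# Parallel forms should have approximately equal means and variances ...
print(f"Means:     A = {form_a.mean():.1f}, B = {form_b.mean():.1f}")
print(f"Variances: A = {form_a.var(ddof=1):.1f}, B = {form_b.var(ddof=1):.1f}")

# ... and their correlation is the parallel-forms reliability estimate.
r = np.corrcoef(form_a, form_b)[0, 1]
print(f"Parallel-forms reliability: r = {r:.2f}")
```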

Figure: Parallel Test Reliability

Advantages of parallel test reliability

  • Solves the memory problem.

Disadvantages of parallel test reliability

  • A change in the true score is still hard to detect.
  • Substantial problems can occur if the tests are not parallel.
  • It is very difficult to create a similar test.
  • Cost, impracticality, and resources needed can double in developing parallel tests.

Within-test consistency/reliability or Internal consistency

In practice, arranging to administer a test twice to the same group of people can be very challenging. So researchers came up with a within-test consistency method that needs to be administered on just one occasion. Additionally, scales that measure complex constructs need multiple items. For example, the Center for Epidemiologic Studies Depression Scale (CES-D) is made up of 20 items (https://cesd-r.com/cesdr/). Depression is a complex concept measured through nine related concepts: sadness (dysphoria), loss of interest (anhedonia), appetite, sleep, thinking/concentration, guilt (worthlessness), tiredness (fatigue), movement (agitation), and suicidal ideation. It is assumed that each related concept measures a part of the total construct of depression, in which case the scores of each part should correlate with one another. For example, for an individual who scores high on the overall scale, we would expect a positive correlation between the scores of items that measure sadness and those that measure loss of interest.

Hence the basic premise behind this type of reliability assessment is that the test is reliable and accurate if an individual’s scores on the individual items are consistent across the test. That is, if a respondent answers the first few questions in a particular way, the later part of the instrument should reflect the same direction, and the scores will be consistent with the previous answers. Multiple reliability estimates are available for this method, but the most commonly used are the split-half method and Cronbach’s alpha (also called coefficient alpha).

Split-half method

Here the test is divided randomly into two halves and a score is computed for each half. The halves are considered approximations of parallel tests. The two scores are then evaluated for agreement with a correlation. Since this correlation is based on half of the items rather than the full-length test, the Spearman-Brown formula is used to correct it, as shown below:

Reliability of full-length test = 2 × (split-half correlation) / (1 + split-half correlation)

So if the split-half correlation came out to be 0.7 the full-length test correlation would be

R = (2 x 0.7) / (1+0.7) = 1.4/1.7 = 0.82

The obvious drawback of this method is that there are many different ways in which a test can be split. For example, a 10-item scale has 126 possible split-half combinations, and each split can result in a different reliability estimate. This lack of a unique reliability coefficient is a major disadvantage of the method.
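The sketch below (with simulated item scores, not real data) draws a random split, computes the correlation between the half scores, and applies the Spearman-Brown correction; repeating the random split shows how the estimate varies from split to split.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 200, 10

# Simulate correlated item scores: a shared latent factor plus item-level noise
latent = rng.normal(size=(n_people, 1))
items = latent + rng.normal(scale=1.0, size=(n_people, n_items))

def split_half_reliability(items, rng):
    idx = rng.permutation(items.shape[1])              # random split of the items
    half1 = items[:, idx[: items.shape[1] // 2]].sum(axis=1)
    half2 = items[:, idx[items.shape[1] // 2 :]].sum(axis=1)
    r = np.corrcoef(half1, half2)[0, 1]                # half-length correlation
    return 2 * r / (1 + r)                             # Spearman-Brown correction

for _ in range(3):
    print(f"Split-half reliability (corrected): {split_half_reliability(items, rng):.2f}")
```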

Cronbach’s alpha

Lee Cronbach improved this estimate by taking the average of the correlations produced by all possible split-half combinations. An added advantage is that it requires fewer assumptions about the statistical properties of the individual items than the split-half method. Additionally, Novick and Lewis (1967) proved that Cronbach’s alpha is a conservative estimate of reliability; that is, the reliability of a test can never be lower than alpha. Coefficient alpha also informs the researcher of how much the score would vary if different items were used. This was a neat solution; no wonder it is the most commonly used method for checking within-test reliability.

Within-test reliability is especially important when the construct is complex and the researcher wants to ensure that all appropriate items are included on the test to adequately capture it. It is also important if only a part of the scale is administered to each participant. For example, a professor might use computer-assisted testing that assigns items/questions randomly to students. S/he must make sure that course grades reflect real individual differences and not inconsistencies in testing.

Another thing to keep in mind is that the sample should be representative of the population; otherwise, a homogeneous sample will misrepresent the variability and systematically over- or underestimate the reliability of the scale. Some researchers recommend a sample size of more than 200 to obtain a generalizable reliability estimate.

Coefficient alpha is calculated as

α = (N × rmean) / (1 + (N − 1) × rmean)

where N is the number of items on the scale and rmean is the average inter-item correlation. Alpha is influenced by the number of items on the scale and the size of the inter-item correlations. One way to increase alpha, thus, is to increase the number of items on the scale, add items with better inter-item correlations, or delete items with poor inter-item correlations. Adding more items might increase alpha, but one has to make sure that the added items measure the same construct; if not, validity can be severely affected. That is, having a highly reliable instrument does not guarantee its validity. Additionally, each added item has progressively less impact on alpha, and it also makes the test longer. If the added items have poor inter-item correlations, they can reduce alpha. So researchers need to weigh the pros and cons of adding more items. Here is the acceptability range for alpha:

  • You want to aim for .9 or higher
  • Generally, anything below .7 is considered poor
  • Below .6 is not considered acceptably reliable
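As an illustration of the standardized alpha formula above, here is a minimal sketch that computes alpha from the mean inter-item correlation of a simulated (people × items) score matrix; the data and item counts are arbitrary.

```python
import numpy as np

def standardized_alpha(items):
    """Standardized Cronbach's alpha: N * rmean / (1 + (N - 1) * rmean)."""
    n_items = items.shape[1]
    corr = np.corrcoef(items, rowvar=False)              # item-by-item correlation matrix
    r_mean = corr[np.triu_indices(n_items, k=1)].mean()  # mean off-diagonal correlation
    return n_items * r_mean / (1 + (n_items - 1) * r_mean)

rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))
items = latent + rng.normal(scale=1.0, size=(300, 8))    # 8 related items, 300 people

print(f"Cronbach's alpha (standardized): {standardized_alpha(items):.2f}")
```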

Advantages of within-test consistency

  • The test needs to be administered just once.
  • The split-half reliability coefficient is easy to calculate (coefficient alpha less so).
  • Cronbach’s alpha is the most commonly used measure of reliability.
  • Cronbach’s alpha provides a conservative estimate of reliability.

Disadvantages of within-test consistency

  • Cronbach’s alpha is harder to calculate.

Inter-rater reliability

Inter-rater reliability is the degree of agreement between two observers (raters) who have independently observed and recorded behaviors or a phenomenon at the same time. For example, observers might record episodes of violent behavior among children, the quality of submitted manuscripts, or physicians’ diagnoses of patients. Once the scores are recorded, they are compared for similarities and differences. Inter-rater reliability is computed by correlating the two sets of scores or by comparing the agreement between them; it can also be estimated through analysis of variance (ANOVA). Depending on whether the scores are nominal or interval/ratio, appropriate statistics can be computed.

For nominal variables, Cohen’s kappa is the best known and most frequently used measure of chance-corrected agreement. I have discussed earlier how to calculate the kappa statistic when the data are nominal. A kappa value of 0.5 is generally considered acceptable. Cohen’s kappa is the preferred method when the same raters rate every case; if the raters differ across cases while everything else stays the same, Fleiss’ kappa should be used.

It is also important that the raters are consistent in classifying a behavior or phenomenon, even if they do not share a common interpretation of the rating scale. Researchers use consistency analysis for continuous data. The most common estimates are correlation coefficients (Pearson when the data are normally distributed, Spearman when they are not), Cronbach’s alpha, and intraclass correlations (ICC). For continuous (interval/ratio) data, Pearson and Spearman correlation coefficients are well suited, but if the data are ordinal, the intraclass correlation is better suited. The minimum acceptable value for the Pearson and Spearman coefficients is 0.7, while for the ICC it is 0.6. Depending on the type of data, the researcher’s focus or goals, the number of raters, and the resources available, researchers can choose the best method to compute inter-rater reliability for their study.
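For the nominal case, here is a minimal sketch of Cohen’s kappa with hypothetical ratings from two raters: the observed agreement is corrected for the agreement expected by chance.

```python
import numpy as np

# Hypothetical nominal ratings ("yes"/"no") from two raters on ten cases
rater1 = np.array(["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "no"])
rater2 = np.array(["yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes", "no"])

categories = np.unique(np.concatenate([rater1, rater2]))

p_observed = np.mean(rater1 == rater2)          # proportion of cases where raters agree
p_expected = sum(                               # agreement expected by chance, from the marginals
    np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories
)

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"Cohen's kappa = {kappa:.2f}")
```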

Figure: Inter-rater Reliability

Advantages of Inter-rater reliability

  • Scores from two raters can negate any bias that an individual rater might add to the scores.

Disadvantages of Inter-rater reliability

  • It requires raters to be trained and to reconcile their differences.

Implications of low reliability

Attenuation

Measures with low reliability tend to underestimate the relationship between two constructs; this effect is called attenuation. That is, the magnitude of the correlation observed between two unreliably measured constructs will be smaller than the correlation between the same constructs measured with perfect reliability. This can happen because of measurement error or because of range restriction in the sample (the sample is not representative of the population). Attenuated correlations can be corrected by dividing the observed correlation by the square root of the product of the two variables’ reliabilities.
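A small worked sketch of this correction, with hypothetical numbers: an observed correlation of .40 between two measures whose reliabilities are .70 and .80.

```python
import math

observed_r = 0.40
reliability_x, reliability_y = 0.70, 0.80

# Correction for attenuation: divide by the square root of the product of reliabilities
corrected_r = observed_r / math.sqrt(reliability_x * reliability_y)
print(f"Correlation corrected for attenuation: {corrected_r:.2f}")  # ~ 0.53
```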

Regression towards the mean

When participants are selected on the basis of extreme scores, their scores tend to move towards the mean when they are tested again. This is called regression towards the mean. That is, on average, high scores on the first test will be lower on the second test, and low scores on the first test will improve on the second assessment. This happens when the sample is not random or when there is a selection bias, and it can make the reliability coefficient appear smaller. For example, suppose a mental health clinic refers all clinically depressed patients to a new treatment program aimed at improving well-being and then finds lower levels of depression after the program; part of that improvement is regression towards the mean. Why? Because the study started with people who were already experiencing symptoms of depression, and depression is not their normal condition: over time, people tend to drift back towards their normal mean and feel better. Studies with lower reliability will produce greater shifts of this kind, which can be a real problem in longitudinal studies.
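A minimal simulation sketch (hypothetical data) of the effect: people selected for extreme scores on a noisy measure at time 1 score closer to the mean at time 2, even though their true scores have not changed.

```python
import numpy as np

rng = np.random.default_rng(7)
true_scores = rng.normal(50, 10, size=10_000)
time1 = true_scores + rng.normal(0, 8, size=10_000)   # noisy measure at time 1
time2 = true_scores + rng.normal(0, 8, size=10_000)   # same measure at time 2

extreme = time1 > np.percentile(time1, 95)            # select only extreme time-1 scorers
print(f"Selected group, time 1 mean: {time1[extreme].mean():.1f}")
print(f"Selected group, time 2 mean: {time2[extreme].mean():.1f}")  # closer to 50
```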

Figure: Regression towards the mean

Bibliography

Barchard, K. A. (2010). Internal consistency reliability. In N. J. Salkind (Ed.), Encyclopedia of research design (pp. 615-619). Thousand Oaks, California: Sage.

Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. Beverly Hills, Calif.: Sage Publications.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.

Galton, F. (1886). Regression toward mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.

Lord, F. (1957). Do tests of the same length have the same standard errors of measurement? Educational and Psychological Measurement, 17, 510–521.

Matthews, B., & Ross, L. (2010). Research methods: A practical guide for the social sciences. Harlow, UK: Pearson Education Limited.

Multon, K. D. (2010). Test-retest reliability. In N. J. Salkind (Ed.), Encyclopedia of research design (pp. 1495-1498). Thousand Oaks, California: Sage.

Novick, M. R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite measurements. Psychometrika, 32, 1-13.

Rogers, W. M. (2010). Parallel forms reliability. In N. J. Salkind (Ed.), Encyclopedia of research design (pp. 995-997). Thousand Oaks, California: Sage.

Cite this article (APA)

Trivedi, C. (2020, December 20). Assessing reliability in research methods. Conceptshacked. Retrieved from https://conceptshacked.com/assessing-reliability/

Chitvan Trivedi

Chitvan is an applied social scientist with a broad set of methodological and conceptual skills. He has over ten years of experience in conducting qualitative, quantitative, and mixed methods research. Before starting this blog, he taught at a liberal arts college for five years. He has a Ph.D. in Social Ecology from the University of California, Irvine. He also holds Masters degrees in Computer Networks and Business Administration.
