Reliability coefficients

Reliability coefficients measure the consistency of a measurement scale. That is, they answer the question: how consistent is the scale or instrument when multiple observations are taken and the true score is unchanged? The reliability coefficient quantifies the precision of the instrument (its accuracy over repeated trials) and, therefore, the trustworthiness of the scores.

Put simply, a reliability coefficient is the ratio of true score to true score plus error.

More theoretically, it is the ratio of true score variance to observed score variance. In simple terms, it is a statistic used to express quantitatively the extent to which two measures, scores, or variables are related and the direction of that relationship.
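In standard classical test theory notation, with true score variance $\sigma^2_T$, error variance $\sigma^2_E$, and observed score variance $\sigma^2_X = \sigma^2_T + \sigma^2_E$, this ratio can be written as:

$$ r_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E} $$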

Kappa

The best measure of inter-rater reliability available for nominal data is the kappa statistic. That is, when you want to assess inter-rater reliability for nominal ratings, you use Cohen's kappa. Kappa is a chance-corrected measure of agreement between two independent raters on a nominal variable. Inter-rater reliability is the degree of agreement between two observers who have independently observed and recorded behaviors at the same time. Kappa ranges from -1 to 1, with 1 indicating perfect agreement, 0 indicating the level of agreement expected by chance, and negative values indicating that the observed agreement is worse than what you would expect by chance.
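In its usual form, kappa compares the observed proportion of agreement, $p_o$, with the proportion of agreement expected by chance, $p_e$:

$$ \kappa = \frac{p_o - p_e}{1 - p_e} $$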

For example, if we are interested in diagnosing psychiatric disorders in patients, we might want two physicians to examine each patient so that the diagnosis is more reliable. Suppose the analysis looks like this:

[Table: Kappa statistics example — the two physicians' diagnoses cross-tabulated]

The observed agreement between the two physicians is calculated as the sum of the diagonal cells divided by the total number of patients.

Absolute agreement: [(23+32+21)/212] *100 = 35.8%.

Generally, this is considered a crude measure of reliability, although an absolute agreement of 70% (0.7) or higher is usually regarded as adequate. A better estimate is the kappa coefficient, because absolute agreement includes chance agreement between the raters. For example, anxiety diagnoses represent 77/212 = 0.3632, or 36.32%, of physician 2's ratings and 41/212 = 0.193, or 19.3%, of physician 1's ratings. Thus, the chance agreement on an anxiety diagnosis is 0.3632 × 0.193 ≈ 0.070, or about 7.0%, which corresponds to roughly 14.9 cases. Repeating this for the other diagnoses and summing gives 66.22 cases of agreement expected by chance, and we can then calculate

Kappa = (76-66.22)/(212-66.22) = 0.0671 or 6.71%

We can see that the agreement between the two raters is largely due to chance, as the chance-corrected agreement (kappa) is very low at 0.0671, or 6.71%. A kappa of at least 0.5 (50%) is generally considered acceptable.
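As a quick check, the count form of the calculation above can be reproduced in a few lines of Python; the function below is only an illustrative sketch, and the numbers (76 observed agreements, 66.22 agreements expected by chance, 212 patients) come from the worked example.

```python
def kappa_from_counts(observed_agree, expected_agree, n_total):
    """Chance-corrected agreement (kappa) computed from agreement counts."""
    return (observed_agree - expected_agree) / (n_total - expected_agree)

# Values from the worked example: 76 observed agreements,
# 66.22 agreements expected by chance, 212 patients in total.
print(round(kappa_from_counts(76, 66.22, 212), 4))  # 0.0671
```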

Intra-class correlation (ICC)

The best measure of inter-rater reliability available for ordinal and interval data is the intra-class correlation coefficient (ICC). It is interpreted as the proportion of variance in the ratings that is due to variation in the phenomenon being rated. The coefficient ranges from 0 to 1, with 1 being highly reliable and 0 being unreliable; any value above 0.6 is considered acceptable. Different forms of ICC are used under different circumstances.
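As noted, several forms of ICC exist. As one concrete illustration (not necessarily the specific form intended above), here is a sketch of the one-way random-effects ICC(1,1) computed from a subjects-by-raters matrix of ratings; the data and function name are hypothetical.

```python
import numpy as np

def icc_oneway(ratings):
    """One-way random-effects ICC(1,1) for an (n_subjects x k_raters) matrix."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    subject_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    # Between-subjects and within-subjects mean squares
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical ratings: 5 subjects each rated by 3 raters
ratings = [[9, 7, 8],
           [6, 5, 6],
           [8, 7, 7],
           [4, 3, 5],
           [10, 9, 9]]
print(round(icc_oneway(ratings), 3))
```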

Pearson (r)

Pearson's r is the most commonly used measure of bivariate correlation. It describes the degree to which a linear relationship exists between two continuous variables. It is often used in testing theories, checking the reliability of instruments, evaluating validity evidence (predictive and concurrent), evaluating the strength of intervention programs, and other descriptive and inferential work. It provides a measure of both the direction and the strength of the relationship between two variables, and when squared (r²) it gives the proportion of variance the two variables share. However, caution is warranted: Pearson's r may not be appropriate if there are outliers (extreme values) in the data, particularly when the sample size is small, or when there is range restriction (the sample is not representative of the population). One also needs to remember that correlation does not imply causation. Pearson's r can also be used as a measure of effect size. It ranges from -1 to 1 and is best suited to interval and ratio scales.
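As a minimal sketch, Pearson's r (and the shared variance r²) can be computed with NumPy; the paired scores below are made up purely for illustration.

```python
import numpy as np

# Hypothetical paired measurements (e.g., scores from two administrations of a scale)
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.5, 3.0, 6.0, 7.5, 10.0])

r = np.corrcoef(x, y)[0, 1]           # Pearson correlation coefficient
print(round(r, 3), round(r ** 2, 3))  # direction/strength, and shared variance
```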

Cronbach’s alpha

Cronbach's alpha, a measure of internal consistency for interval or ratio data, is used as a reliability measure for composite scales; it is based on how the test results of multiple respondents correlate across the items of the scale. It shows how the multiple items of a scale are correlated with each other. That is, the researcher can examine how each item of the scale correlates with the total score on the scale; if an item does not correlate well, it can be dropped, and doing so should improve the reliability of the scale. Alpha was designed as a generalization of split-half reliability and is equivalent to the average of all possible split-half coefficients. It ranges from 0 to 1, where values closer to 1 indicate higher internal consistency, or higher reliability. A generally acceptable value is 0.7 or higher, indicating that the items on the scale are measuring the same thing, although researchers typically prefer values above 0.8.

However, the value of alpha is also influenced by the number of items on the scale, so it is possible to obtain a high alpha even though the scale might not be reliable. In addition, the scale should be homogeneous, that is, it should measure one dimension or construct; if the scale has multiple dimensions or constructs, a separate Cronbach's alpha should be reported for each dimension.

Cronbach's alpha is also a superior method for testing reliability compared to test-retest and parallel-forms reliability, as the latter two can be affected by history threats, and it overcomes other drawbacks of those approaches. A high value of alpha does not, however, imply high validity, particularly when items correlate well with each other but do not measure what they are supposed to measure. And because alpha is positively affected by the number of items on a scale, adding more similar questions can lead to participant fatigue or frustration, which can ultimately produce higher measurement error and lower reliability. Reverse coding can help alleviate some of these burdens.
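The usual computational formula for alpha, $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma^2_i}{\sigma^2_{\text{total}}}\right)$ for a scale with $k$ items, can be sketched as follows; the item scores below are hypothetical.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_respondents x k_items) matrix of item scores."""
    items = np.asarray(item_scores, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses from 5 people to a 4-item Likert-type scale
scores = [[4, 5, 4, 4],
          [2, 3, 2, 3],
          [5, 5, 4, 5],
          [3, 3, 3, 2],
          [4, 4, 5, 4]]
print(round(cronbach_alpha(scores), 3))
```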

Reliability coefficient summary

General assumption

A general assumption behind all of these statistics is that they are sensitive to the distribution of the data: the more the data deviate from a normal distribution, the more these coefficients are attenuated.

