DAPA Measurement Toolkit

Statistical assessment of reliability and validity


Assessments of the reliability and validity of a method are multi-faceted, and cannot be proven by a statistical test alone. However, statistical analysis can inform the choice of method by quantifying reliability and validity in relation to the ‘analytical goal’ (the intended practical requirements of the measurement tool) [2]. Validity and reliability can be assessed in either relative or absolute terms, as described in Table C.7.1. Statistical techniques used to assess either the relationship (relative) or the agreement (absolute) between two sets of data in studies of reliability or validity are described in the following sections.

Table C.7.1 Relative and absolute forms of reliability and validity.

| | Relative | Absolute |
|---|---|---|
| Reliability | The degree to which individuals maintain their position in a sample with replicate measurements using the same method | The agreement between replicate measures of the same phenomenon using the same method and units |
| Validity | The degree to which two methods, irrespective of units, rank individuals in the same order | The agreement between two methods measuring the same phenomenon with the same units |

Correlation coefficients describe the relationship between two variables. The two variables can come from replicate measures using the same method (reliability), or from different methods (validity). A correlation coefficient ignores the units of two variables. It measures the relationship between two variables rather than the agreement between them, and is therefore commonly used to assess relative reliability or validity. A more positive correlation coefficient (closer to 1) is interpreted as greater validity or reliability.

Tests of significance for correlation coefficients should be interpreted with caution in the context of reliability and validity. It would be highly unexpected to find that two measures of the same phenomenon are not related to some degree. Furthermore, two sets of measurements can be strongly correlated yet show very little agreement, for example when one method consistently gives higher values than the other. Use of correlation coefficients may therefore fail to detect systematic bias [9]. However, depending on the research question, absolute validity or reliability may not be required. For example, a study of a lifestyle-disease association may only need accurate ranking of individuals within a population [6]. In such a study, a method with relative reliability and validity demonstrated by correlation may be suitable. Examples of correlation coefficients used to assess reliability and validity include:

  • Pearson correlation coefficient, which describes the direction and strength of the linear relationship between two continuous measurements, under the assumption that both variables are normally distributed. Pearson’s r can be affected by sample homogeneity, which makes distinguishing between individuals more difficult. A high correlation does not rule out unacceptable measurement error in some circumstances.
  • Spearman’s rho, which describes the degree to which two measurements rank individuals in the same order. While its statistical power is lower than that of the Pearson correlation coefficient, this method is less affected by outliers and does not make assumptions about the distribution of the data.
  • Kendall’s tau, another rank-based measure, commonly used for discrete (ordinal) variables, where tied ranks are frequent.
  • Intraclass correlation coefficient (ICC), described in more detail below.
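
The contrast between relationship and agreement can be illustrated with a minimal sketch in plain Python. The data and function names below are hypothetical and chosen only for illustration: method B always reads twice method A plus 10, so all three coefficients are perfect even though the two methods never agree. The Kendall function is the simple tau-a form, which makes no correction for ties.

```python
import math

def pearson(x, y):
    """Pearson's r: covariance scaled by the two standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)

# Hypothetical data: method B = 2 * method A + 10 (large systematic error)
method_a = [10.0, 12.0, 15.0, 20.0, 26.0]
method_b = [2 * a + 10 for a in method_a]

print("Pearson r:    ", pearson(method_a, method_b))
print("Spearman rho: ", spearman(method_a, method_b))
print("Kendall tau-a:", kendall_tau_a(method_a, method_b))
mean_diff = sum(b - a for a, b in zip(method_a, method_b)) / len(method_a)
print("Mean difference (B - A):", mean_diff)
```

In this constructed case the ranking of individuals is identical under both methods, so the methods would have relative validity, while the large mean difference shows that absolute agreement is poor.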

Unlike the correlation coefficients discussed above, which assess the relationship between two variables, the ICC assesses the agreement. The variables can either be from replicate measures using the same method (reliability), or from two (or more) methods measuring the same phenomenon (validity). Since the ICC assesses agreement between data, it is typically used to determine the absolute validity or reliability.

An advantage of the ICC is that it can be used to assess agreement between more than two sets of data (e.g. from three different observers of the same phenomenon, or multiple days of measurement). The ICC can be calculated in different ways and these can provide different estimates of reliability or validity. ICC estimates closer to 1 represent greater reliability or validity. The ICC can be affected by sample homogeneity: if a study population is highly homogeneous, it is more difficult to distinguish between individuals and the ICC falls further below 1. Conversely, if a study population is very heterogeneous, the ICC can be misleadingly close to 1. Therefore, a high ICC does not rule out unacceptable measurement error in some circumstances.
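
As a rough sketch of one of the several ICC forms, the code below computes a one-way random-effects ICC, often written ICC(1,1), from the between-subject and within-subject mean squares. The choice of this form, the function name and the data are assumptions made for illustration. Both hypothetical samples have exactly the same measurement error (each replicate is 1 unit either side of the subject's mean), and differ only in how spread out the subjects are.

```python
def icc_oneway(data):
    """One-way random-effects ICC(1,1).

    `data` is a list of per-subject lists, each holding k replicate values.
    """
    n = len(data)       # number of subjects
    k = len(data[0])    # replicates per subject
    grand_mean = sum(sum(row) for row in data) / (n * k)
    subject_means = [sum(row) / k for row in data]
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (value - m) ** 2 for row, m in zip(data, subject_means) for value in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical replicate measurements with identical error (+/- 1 unit)
homogeneous = [[49, 51], [50, 52], [51, 53], [52, 54]]
heterogeneous = [[19, 21], [39, 41], [59, 61], [79, 81]]

print("Homogeneous sample ICC:  ", icc_oneway(homogeneous))
print("Heterogeneous sample ICC:", icc_oneway(heterogeneous))
```

The measurement error is the same in both samples, yet the ICC is far higher in the heterogeneous one, which is why an ICC should be read alongside the spread of the study population.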

Linear regression methods aim to determine the linear relationship between two sets of variables. In the assessment of reliability, two variables come from replicate measures of the same phenomenon using the same method. For validity, they are from two different methods measuring the same phenomenon.

General principle

Two variables can be displayed as a scatterplot: one on the x-axis against one (from either replicate measures or another method) on the y-axis. Figure C.7.1 is an example of a scatterplot and regression equation.

The relationship between the two sets of paired scores is shown by the regression line with the equation y = mx + c, where m is the slope and c is the y intercept. The slope represents the average increase in y when x increases by one unit; the intercept is the value of y when x = 0.

  • The correlation coefficient, r, describes the closeness of the data to the regression line, i.e. the linear association between the two measurements
  • r² provides a measure of how much of the variability in one set of measurements is explained by the variability in the other set of measurements
  • 1 - r² is the proportion that remains unexplained by the relationship
Figure C.7.1 Example of scatterplots and regression equations. As a reference, energy expenditure from indirect calorimetry is plotted on x-axis. On the left, energy expenditure estimated from combined accelerometer and heart rate (AHR) is presented. On the right, energy expenditure from pedometer is presented.
Source: [10].
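
The quantities described above (slope m, intercept c, r² and 1 - r²) can be obtained with an ordinary least-squares fit. The following is a minimal sketch in plain Python; the function name and the paired values are hypothetical and are not taken from Figure C.7.1.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = m*x + c; returns (m, c, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    m = sxy / sxx                      # slope
    c = mean_y - m * mean_x            # intercept: y value when x = 0
    r_squared = sxy ** 2 / (sxx * syy)  # share of variability explained
    return m, c, r_squared

# Hypothetical paired measurements (reference on x, test method on y)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

m, c, r_squared = fit_line(x, y)
print(f"y = {m:.3f}x + {c:.3f}")
print(f"r squared = {r_squared:.4f}, unexplained = {1 - r_squared:.4f}")
```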

Systematic error

Suppose that the validity of the variable on the y-axis is of interest. Its systematic error relative to the reference variable on the x-axis can be described using the intercept and slope of the regression line:

  • The intercept ‘c’ provides a measure of the fixed systematic error between the two variables, i.e. one method provides values that are different to those from the other by a fixed amount. A value of 0 for c indicates no fixed error. Confidence intervals (e.g. 95%) can be used to examine whether c ≠ 0 and thus determine whether fixed error is statistically significant.
  • The slope, m, provides a measure of the proportional error between the two variables, i.e. one method provides data that are different to those from the other by an amount that is proportional to the level of the measurement. A value of 1 for m indicates no proportional error. Confidence intervals (e.g. 95%) can be used to examine whether m ≠ 1 and thus determine whether proportional error is present.
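
The two checks above can be sketched as follows. The data are constructed, not real: the hypothetical test method is built with a fixed error of 5 units and a proportional error of 20%, plus a small repeating disturbance. The function name is illustrative, and the multiplier 1.96 is a large-sample normal approximation; for small samples a t-distribution quantile with n - 2 degrees of freedom would be more appropriate.

```python
import math

def regression_with_ci(x, y, z=1.96):
    """OLS fit with approximate 95% CIs for slope and intercept.

    Returns (estimate, lower, upper) tuples. `z` = 1.96 assumes a
    large sample; this is an approximation, not an exact t interval.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    m = sxy / sxx
    c = mean_y - m * mean_x
    # Residual variance around the fitted line
    sse = sum((b - (m * a + c)) ** 2 for a, b in zip(x, y))
    s2 = sse / (n - 2)
    se_m = math.sqrt(s2 / sxx)
    se_c = math.sqrt(s2 * (1 / n + mean_x ** 2 / sxx))
    slope = (m, m - z * se_m, m + z * se_m)
    intercept = (c, c - z * se_c, c + z * se_c)
    return slope, intercept

# Constructed data: y = 5 + 1.2 * x plus a small repeating disturbance
disturbance = [0.8, -0.5, 0.3, -0.6]
reference = [float(v) for v in range(10, 50)]
test_method = [5 + 1.2 * v + disturbance[i % 4] for i, v in enumerate(reference)]

slope, intercept = regression_with_ci(reference, test_method)
print("Slope (estimate, lower, upper):    ", slope)
print("Intercept (estimate, lower, upper):", intercept)
print("Proportional error indicated:", not (slope[1] <= 1 <= slope[2]))
print("Fixed error indicated:       ", not (intercept[1] <= 0 <= intercept[2]))
```

Here the interval for the slope lies entirely above 1 and the interval for the intercept lies entirely above 0, so both types of systematic error would be flagged.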

Random error

Random error is inherent in any measurement. Its influence on the regression estimates depends on whether the error is present in the variable on the y-axis or in the variable on the x-axis. If the random error is present in the variable on the y-axis, the estimates of the slope and intercept become less precise, but the estimates themselves remain unbiased. By contrast, if the error is present in the variable on the x-axis, the slope is attenuated or ‘diluted’ toward the null. Thus, the effect of random error depends on which variable is placed on which axis when the regression line is fitted.
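
A small simulation can make this asymmetry concrete. The sketch below assumes two methods that agree perfectly in truth (slope 1), then adds the same amount of random error either to y or to x. All values, the seed and the function name are arbitrary choices for illustration. With true values spread uniformly over 0 to 100 and an error standard deviation of 20, theory suggests the slope with error in x should shrink to roughly 0.68 of its true value.

```python
import random

def ols_slope(x, y):
    """Slope of the ordinary least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(42)  # fixed seed so the sketch is reproducible
true_x = [rng.uniform(0, 100) for _ in range(2000)]
true_y = list(true_x)    # perfect agreement: true slope is 1

# Same random error (SD = 20) added to one variable at a time
noisy_y = [v + rng.gauss(0, 20) for v in true_y]
noisy_x = [v + rng.gauss(0, 20) for v in true_x]

slope_error_in_y = ols_slope(true_x, noisy_y)  # imprecise but unbiased
slope_error_in_x = ols_slope(noisy_x, true_y)  # attenuated toward zero

print("Slope with random error in y:", round(slope_error_in_y, 3))
print("Slope with random error in x:", round(slope_error_in_x, 3))
```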

Uncertainty when using regression

Uncertainty in regression estimates increases when [11]:

  • The data are not evenly distributed over the investigation range
  • The number of data points is low
  • The samples employed are not independent
  • The relationship between the data is not linear
  • The magnitude of variability relative to the data range is high

Errors of one variable against another (reference) can be described by the root mean square error (RMSE), i.e. the square root of the mean of the squared differences between methods. Because the differences are squared before averaging, this statistic weights larger errors more heavily. A lower RMSE indicates greater agreement at the individual level.
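
As a brief sketch with hypothetical values, the RMSE can be computed alongside the mean absolute error to show how a single large error dominates the former. The function names and data are illustrative only.

```python
import math

def rmse(reference, estimate):
    """Root mean square error between paired values."""
    squared = [(e - r) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))

def mean_absolute_error(reference, estimate):
    """Mean of the absolute paired differences, shown for comparison."""
    return sum(abs(e - r) for r, e in zip(reference, estimate)) / len(reference)

# Hypothetical data: three small errors (1 unit) and one large error (10 units)
reference = [10.0, 20.0, 30.0, 40.0]
estimate = [11.0, 19.0, 31.0, 50.0]

print("RMSE:", rmse(reference, estimate))
print("Mean absolute error:", mean_absolute_error(reference, estimate))
```

The single 10-unit error contributes 100 of the 103 total squared units, so the RMSE sits well above the mean absolute error.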

The agreement between continuous data in the same units from replicate measures using the same method (absolute reliability), or from different methods (absolute validity), can be quantified with a Bland-Altman plot. The plot shows the difference between each pair of measures (A-B) on the y-axis, against the mean of each pair ((A+B)/2) on the x-axis (Figure C.7.2).

Assessing validity in this way provides:

  • Quantification of the difference between methods rather than how related they are, meaning that systematic differences between methods can be detected (e.g. between new method and gold-standard)
  • Examination of whether the agreement between values is related to the magnitude of the measurement. In Figure C.7.2, the larger the mean values, the larger the variability of the difference between the two measures, i.e. the differences are heteroscedastic. Two methods may demonstrate good agreement at the low end of the scale but not when the true value is high.
  • The magnitude and direction (i.e. positive or negative) of systematic bias is represented by the mean of all differences between paired values (solid horizontal line in Figure C.7.2)
  • Random error is estimated by the standard deviation of these differences (dashed horizontal lines in Figure C.7.2)
Figure C.7.2 Agreement analysis for vitamin E intake estimated with semi-quantitative food frequency questionnaire (SFFQ) and the weighed record (WR). Solid line = mean difference or systematic bias; dashed lines = plus and minus two standard deviations.
Source: [1].
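
The numbers behind such a plot can be sketched in a few lines. The paired values below are hypothetical and are not the data from Figure C.7.2. The limits use plus and minus two standard deviations to match the figure caption; a multiplier of 1.96 is also commonly used.

```python
import statistics

def bland_altman(a, b, multiplier=2.0):
    """Return plot coordinates, bias, SD of differences and limits of agreement."""
    differences = [x - y for x, y in zip(a, b)]     # y-axis: A - B
    means = [(x + y) / 2 for x, y in zip(a, b)]     # x-axis: (A + B) / 2
    bias = statistics.mean(differences)             # systematic bias
    sd = statistics.stdev(differences)              # random error (sample SD)
    limits = (bias - multiplier * sd, bias + multiplier * sd)
    return means, differences, bias, sd, limits

# Hypothetical paired measurements from methods A and B (same units)
method_a = [10.0, 12.0, 15.0, 20.0, 26.0]
method_b = [9.0, 12.0, 13.0, 17.0, 22.0]

means, differences, bias, sd, limits = bland_altman(method_a, method_b)
print("Mean difference (bias):", bias)
print("SD of differences:", round(sd, 3))
print("Limits of agreement:", tuple(round(v, 3) for v in limits))
```

In this made-up example the differences also grow with the pair means, the same pattern of heteroscedasticity described for Figure C.7.2; plotting `differences` against `means` would make that visible.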

Cohen’s kappa (K) is a measure of agreement between two categorical variables. It is commonly used to measure inter-rater reliability, but can also be used to compare different methods of categorical assessment (e.g. in a validation study for a new method). It improves upon percent agreement between scores by taking into consideration the agreement we would expect due to chance (i.e. if the two measurements were unrelated).

It is calculated as the amount by which observed agreement exceeds chance agreement, divided by the maximum possible amount by which agreement could exceed chance. A K of 1 indicates perfect agreement, whereas a K of 0 indicates agreement no better than that expected due to chance.

  • Fleiss’ kappa is an adaptation of Cohen’s kappa for more than two sets of observations.
  • When categories are ordered, the weighted kappa can be used. This technique assigns different weights to the degree of disagreement, for example the disagreement between categories 1 and 5 is greater than the disagreement between categories 1 and 2.
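
The calculation described above, and the effect of weighting, can be sketched as follows. The function names and ratings are hypothetical. The weighted version uses linear weights (disagreement proportional to the distance between categories), which is one common choice; quadratic weights are another.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Unweighted kappa: (observed - chance) / (1 - chance)."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    count1, count2 = Counter(rater1), Counter(rater2)
    # Chance agreement from the marginal frequencies of each category
    chance = sum(count1[c] * count2[c] for c in set(count1) | set(count2)) / n ** 2
    return (observed - chance) / (1 - chance)

def weighted_kappa(rater1, rater2, categories):
    """Linear-weighted kappa; `categories` lists the levels in order."""
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater1)
    count1, count2 = Counter(rater1), Counter(rater2)
    # Mean observed disagreement, weighted by distance between categories
    observed = sum(abs(index[a] - index[b]) for a, b in zip(rater1, rater2)) / n
    # Disagreement expected by chance from the marginal frequencies
    chance = sum(
        abs(i - j) * count1[a] * count2[b]
        for a, i in index.items()
        for b, j in index.items()
    ) / n ** 2
    return 1 - observed / chance

# Hypothetical ratings on an ordered 1-5 scale
rater1 = [1, 2, 3, 4, 5]
near_miss = [2, 2, 3, 4, 5]   # one disagreement, one category apart
far_miss = [5, 2, 3, 4, 5]    # one disagreement, four categories apart
scale = [1, 2, 3, 4, 5]

print("Unweighted:", cohen_kappa(rater1, near_miss), cohen_kappa(rater1, far_miss))
print("Weighted:  ", weighted_kappa(rater1, near_miss, scale),
      weighted_kappa(rater1, far_miss, scale))
```

In this constructed example the unweighted kappa cannot tell the two cases apart, whereas the weighted kappa penalises the disagreement between categories 1 and 5 more than the disagreement between categories 1 and 2.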


  1. Andersen LF, Lande B, Trygg K, Hay G. Validation of a semi-quantitative food-frequency questionnaire used among 2-year-old Norwegian children. Public Health Nutr. 2004;7(6):757-64.
  2. Atkinson G, Nevill AM. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Med. 1998;26(4):217-38.
  3. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143(1):29-36.
  4. Hopkins WG. Bias in Bland-Altman but not regression validity analyses. Sportscience. 2004;8:42-6.
  5. Ludbrook J. Linear regression analysis for comparing two measurers or methods of measurement: but which regression? Clin Exp Pharmacol Physiol. 2010;37(7):692-9.
  6. Masson LF, McNeill G, Tomany JO, Simpson JA, Peace HS, Wei L, et al. Statistical approaches for assessing the relative validity of a food-frequency questionnaire: use of correlation coefficients and the kappa statistic. Public Health Nutr. 2003;6(3):313-21.
  7. McHugh ML. Interrater reliability: the kappa statistic. Biochemia Medica. 2012;22(3):276-82.
  8. Muller R, Buttner P. A critical discussion of intraclass correlation coefficients. Stat Med. 1994;13(23-24):2465-76.
  9. Schmidt ME, Steindorf K. Statistical methods for the validation of questionnaires--discrepancy between theory and practice. Methods Inf Med. 2006;45(4):409-13.
  10. Thompson D, Batterham AM, Bock S, Robson C, Stokes K. Assessment of low-to-moderate intensity physical activity thermogenesis in young adults using synchronized heart rate and accelerometry with branched-equation modeling. J Nutr. 2006;136(4):1037-42.
  11. Twomey PJ, Kroll MH. How to use linear regression and correlation in quantitative method comparison studies. Int J Clin Pract. 2008;62(4):529-38.