(Note: British English spelling - ‘harmonisation’; North American English spelling - ‘harmonization’)
There are many definitions of data harmonisation, but a good working definition is provided by Maelstrom Research.
Harmonisation involves achieving or improving comparability of similar measures collected by separate studies or databases for different individuals. Some research programs foster prospective implementation of harmonised measures to collect data across studies, while others turn their efforts to retrospective harmonisation and co-analysis of existing datasets.
In summary, harmonisation seeks to bring together various types, levels and sources of data, which represent measurement of the same latent construct(s), in such a way that they can be made compatible and comparable (see Figure C.8.1 for an example on wine consumption).
Harmonisation differs from standardisation in that it does not impose a single methodology or norm, but rather seeks to find ways of integrating or making "an agreeable effect" from information gathered through disparate methodologies (Harris et al., 2012).
Figure C.8.1 Example of harmonisation of common format wine consumption variable using data collected in different ways.
When the purpose of an investigation requires bringing data from multiple sources together in one analysis (e.g. meta-analysis), a preceding step to the main analysis involves converting the variable(s) to a common format which should adhere to the principle of inferential equivalence (see Figure C.8.2).
Figure C.8.2 Principle of inferential equivalence for harmonised common format variables. The inferences about the latent truth are equivalent regardless of the method originally used to collect the data. It is important to note that the ‘method’ refers to initial measurement by a tool/instrument, plus any stages of inference involved in post-processing and derivation of estimates; flexibility and variation during these steps may also influence the degree to which data can be harmonised and be deemed inferentially equivalent.
Researchers may, for example, conduct an analysis of multiple cohorts in order to understand the genetic, lifestyle, social and physical environmental factors associated with disease by bringing together data from diverse countries and regions [2]. As there are significant financial, technical and time burdens associated with developing and maintaining large population health studies which have matured to generate disease events, researchers are making use of data sets consisting of information from multiple studies.
Analyses using data from multiple studies are constrained by the quality and compatibility of their data [3]. Variables are often assessed using different methods, limiting their compatibility. In longitudinal studies, there can be changes in methods of assessment between phases of data collection. Standardisation and harmonisation are related approaches which facilitate analyses by improving the compatibility of data; however, there are important differences between the two.
Standardisation
Standardisation refers to the implementation of uniform processes for prospective collection, storage and transformation of data [4]. Standardisation implies that precisely the same methods, protocols and standard operating procedures are used in every study or study phase contributing to the analyses [1]. (Note: in statistics, ‘standardisation’ has a different meaning: rescaling a variable by its standard deviation, e.g. when computing a z-score.)
Using standardised methods across multiple studies greatly facilitates analyses of datasets from separate cohorts. However, imposing identical procedures is also very challenging due to varying:
Even if rigorous standards are implemented, some level of human error and variation between studies is inevitable [4].
Harmonisation
Harmonisation is a more flexible approach that is more realistic than standardisation in a collaborative context [3]. Harmonisation refers to the practices that improve the comparability of variables from separate studies, permitting the pooling of data collected in different ways, and reducing study heterogeneity [5].
The harmonisation process involves deriving target variables formatted in a specified way from existing data collected using methods which are diverse across studies. Data can be recoded, transformed or combined with additional information to achieve harmonisation, but the process requires compatibility of both the methods used and the pre-existing data. The degree of similarity required is not absolute or easily defined; it varies according to the target variable to be derived and the scientific context (i.e. the research question). What is important is that the data are ‘inferentially equivalent’, i.e. conclusions about the latent true values from the derived target variable are valid regardless of the method by which the data were originally collected.
For example, when harmonising a ‘total daily energy intake’ target variable, there would be little scope to harmonise data from a study with only fruit and vegetable intakes. In contrast, if ‘total daily physical activity’ were the target variable, data from diverse methods such as questionnaires and accelerometers could conceivably be harmonised.
Harmonisation includes practices that enable the pooling of data from multiple cohorts/biobanks at a level of precision that is scientifically adequate, yet accommodates the heterogeneity of those studies. The key challenge of harmonisation is to increase sample size by combining an adequate number of studies, whilst limiting inclusion to those that are satisfactorily harmonised [3]. Compared to standardisation, advantages of this more flexible approach include the potential to include a broader range of studies with greater variety of information, and the ability to use existing data which could lead to more rapid scientific impact [6]. However, this work is challenging and time-consuming, and requires access to measurement expertise and resources. Resources such as the InterConnect and Maelstrom Research registries therefore aim to capture and share the algorithms and processes used during harmonisation, so that others may utilise this information for future work.
Prospective vs. retrospective harmonisation
Prospective harmonisation
Ideally, researchers would agree in advance on a series of practices to collect data in such a way as to directly enable pooled analysis [4]. This prospective harmonisation does not necessarily denote complete standardisation of methods: a degree of plurality is accepted where necessary, but this would be planned and justified before the data are collected. A prospective harmonisation approach provides comparable output across methods of inference, despite differences in measurement, without the need for further harmonisation steps. However, it involves significant planning, as well as adherence to those plans across studies.
Retrospective harmonisation
In contrast, retrospective harmonisation occurs after the data have been collected; this is the most common scenario. The quantity and quality of data that can be pooled is limited by the pre-existing differences between study methods and protocols [3]. The retrospective harmonisation process involves steps which can be summarised as follows:
The target variable is the desired common format to be derived using harmonisation from the existing raw data in the different studies. There may be several variables in any given analysis that require harmonisation, and multiple target variables must therefore be defined. The definition of any single target variable should include its unit. Examples of target variables include:
The target variable should be suitable for the purposes of answering a research question but is also dependent upon the methods used and data available from the various studies. Some studies may already report the target variable in the desired units with no requirement for modification or transformation. The target variable and its units may need to be reconsidered when assessing harmonisation potential; this is a balance between what is desirable for the purposes of answering the research question, and what is feasible considering the data available. Please refer to case study 2 on the derivation of leisure-time physical activity target variables for the InterConnect project for more information.
As indicated above, the aim of harmonisation is to produce a target variable using data from different studies in such a way that the data can be considered inferentially equivalent. Since the level of harmonisation potential is determined by the methods used (and the resulting data), it is essential to scrutinise the methods used across different studies to establish whether metrics can be harmonised and how this may be achieved.
The first step in the harmonisation process is, therefore, to acquire relevant meta-data information from studies, such as:
Methods and method components
Each method has a number of components, including the instrument used to make the initial measurement and how it is administered, plus any data storage, processing and derivation stages. The use of additional information such as energy cost tables or nutritional databases also forms part of the method. These components vary between methods used in different studies; however, depending upon the target variable, this variation may not always impact inferential equivalence.
It is therefore useful to document the methods used and assess which components impact inferential equivalence. For example, if the target variable for a study was daily physical activity energy expenditure, then variation in the domains captured by two different questionnaires (e.g. leisure-time activity vs occupational and travel-related activity) would have greater impact on compatibility than variation in administration mode (e.g. electronic vs. pen and paper).
When assessing harmonisation potential, the components should be examined in detail. For example, when questionnaires are used, assessment items relating to the target variable should be identified and compared. As shown in Figure C.8.3, specific items relating to the target variable of interest are highlighted alongside the units and categories used. This information can be used to assess not only whether the items relate to the variable of interest (e.g. are the activities queried relevant to the research question?), but also whether the existing data can be transformed to the common format.
STUDY | ORIGINAL QUESTION | DATA FORMAT
Online survey of pregnant women | In total, how much of the following do you do at present? • Jogging • Aerobic • Ante-natal exercises • Keep fit exercises • Yoga • Squash • Tennis/badminton • Swimming • Brisk walking • Weight training • Cycling • Other exercises | Categorical: >7 hrs/week, 2-6 hrs/week, <1 hr/week, 0 hrs/week
Interviews of older adults | In your spare time, how much time in the past week did you spent on: • walking for fun? • riding a bicycle? • playing sports (for example: tennis, handball, gymnastics, fitness, skating, and swimming)? • doing any other physical exercise in your spare time, for example working in the garden or doing odd jobs around the house (do not include household activities)? For each question: At what pace do you usually do this? • relaxed pace • average pace • brisk pace | Continuous: Hours per week of light (relaxed), moderate and vigorous intensity physical activity
Postal survey of general population | Nowadays, at least one hour per week, do you engage in any regular activity like brisk walking, gardening, housework, jogging, cycling, etc. intense enough to work up a sweat? | Binary: Yes/No
Figure C.8.3 Overview of questionnaire items of leisure-time physical activity from three different studies which can potentially be used to derive a harmonised target variable.
Harmonisation using simple unit conversion
Data are sometimes collected using methods which are sufficiently harmonised but expressed in units which are not directly compatible. If the mathematical relationship between two units is known, then a conversion factor can be used to harmonise the data to the same units. This process does not introduce any uncertainty and, all other things being equal, the result is fully inferentially equivalent.
One example of this approach is the use of different units for rate of energy turnover, such as kilocalories (kcal) per day, kilojoules (kJ) per day, or watts (W). The relationship between kcal and kJ is known: 1 kcal equals 4.184 kJ. One watt is one joule per second and there are 86400 seconds per day, so 1 W corresponds to 86.4 kJ per day. Units can therefore be harmonised using conversion factors as required.
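The conversions described above can be sketched in a few lines of Python. This is a minimal illustration only; the function and constant names are hypothetical and not taken from any harmonisation toolkit.

```python
# Conversion factors stated in the text.
KJ_PER_KCAL = 4.184        # 1 kcal = 4.184 kJ
SECONDS_PER_DAY = 86_400   # 24 h * 60 min * 60 s


def kcal_per_day_to_kj_per_day(kcal_per_day: float) -> float:
    """Convert a rate of energy turnover from kcal/day to kJ/day."""
    return kcal_per_day * KJ_PER_KCAL


def kj_per_day_to_watts(kj_per_day: float) -> float:
    """Convert kJ/day to watts (1 W = 1 J/s)."""
    joules_per_day = kj_per_day * 1000.0
    return joules_per_day / SECONDS_PER_DAY


def watts_to_kcal_per_day(watts: float) -> float:
    """Convert watts back to kcal/day."""
    kj_per_day = watts * SECONDS_PER_DAY / 1000.0
    return kj_per_day / KJ_PER_KCAL
```

Because each step is a fixed multiplication, the conversion can be reversed without loss, which is why no additional uncertainty is introduced.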
A more complex example may involve energy expenditure data which have been adjusted for body mass (kcal/kg/day), or not (kcal/day). If individual-level data of both energy and body mass are available, it is possible to convert adjusted data to unadjusted data, or vice versa, depending on the analysis to be conducted. If individual-level data are not available, assumptions are necessary on homogeneity of these variables within the strata to be analysed.
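When individual-level energy expenditure and body mass are both available, the conversion between adjusted and unadjusted values is a per-person multiplication or division. The sketch below assumes one value per participant; the function names are hypothetical.

```python
def to_unadjusted(kcal_per_kg_per_day: float, body_mass_kg: float) -> float:
    """Convert body-mass-adjusted expenditure (kcal/kg/day) to kcal/day."""
    if body_mass_kg <= 0:
        raise ValueError("body mass must be positive")
    return kcal_per_kg_per_day * body_mass_kg


def to_adjusted(kcal_per_day: float, body_mass_kg: float) -> float:
    """Convert unadjusted expenditure (kcal/day) to kcal/kg/day."""
    if body_mass_kg <= 0:
        raise ValueError("body mass must be positive")
    return kcal_per_day / body_mass_kg
```

Applying these functions to group means rather than individual records would rely on the homogeneity assumption mentioned above.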
Simple conversion should occur in tandem with proper assessment of harmonisation potential (see above). Variables representing the same exposure (e.g. total energy intake) in the same units (e.g. kcal/day) may not be inferentially equivalent due to differences in the methods used. Where differences between methods are too great, further harmonisation using algorithms or validation data is required.
Harmonisation by collapsing to least common denominator
The harmonisation process often requires more complex recoding, modification or transformation of existing data in order to achieve a common format. There is therefore a degree of inference involved. Separate processing rules must be formulated to transform the variables from each study into the common target variable format. These rules, or algorithms, depend upon the data available in each study; the following dimensions may be available to different degrees across studies according to the methods used:
The various dimensions of the variable of interest can be combined or modified to produce the target variable. For examples of processing rules for deriving harmonised target variables, please see the three case studies:
Depending on the type of data available in each study (e.g. continuous, categorical, ordinal, interval) assumptions will likely be needed in order to derive the target variable. For example, in Figure C.8.3 (above), responses of 2-6 hours per week in Study 1 could be recoded as 4 hours per week (240 mins/week) to yield a target variable of weekly activity duration in mins/week as available in Study 2; to derive another target variable of activity energy expenditure (e.g. in MET * minutes per week), duration information will need to be combined with:
External information or normative data can be used to inform and support the assumptions made when developing harmonisation algorithms, such as:
Caution is advised when using additional information such as this, as the degree of generalisability to the population may vary by participating study; making assumptions explicit allows better evaluation of inferential equivalence.
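One way to keep such assumptions explicit is to hold them in a single lookup table that can be reviewed and changed. The sketch below recodes the Study 1 categories from Figure C.8.3 to minutes per week. Only the value of 4 hours for the 2-6 hours category comes from the text; the other values are illustrative assumptions, and the names are hypothetical.

```python
# Assumed representative duration (hours/week) for each response category.
ASSUMED_HOURS_PER_WEEK = {
    ">7 hrs/week": 7.0,   # assumption: open-ended category set to its lower bound
    "2-6 hrs/week": 4.0,  # midpoint, as in the example in the text
    "<1 hr/week": 0.5,    # assumption: midpoint between 0 and 1 hour
    "0 hrs/week": 0.0,
}


def category_to_minutes_per_week(category: str) -> float:
    """Recode a categorical response to an assumed duration in mins/week."""
    try:
        hours = ASSUMED_HOURS_PER_WEEK[category]
    except KeyError:
        raise ValueError(f"unknown category: {category!r}") from None
    return hours * 60.0
```

Changing a single entry in the table and re-running the analysis is a simple way to check how sensitive results are to a given assumption.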
Some algorithms may result in the loss of more granular data (see Figure C.8.4 for an example of potential variation in granularity of data). If one study provides data in binary format (e.g. low/high), and another provides data in a continuous metric, then a potential harmonisation approach is to reduce the more granular data to the binary format (a ‘reductionist approach’). For more detail on this issue, please see case study 3 on simulation of harmonisation of physical activity exposure using validation data.
Figure C.8.4 Potential differences in granularity of data to be harmonised in three participating studies.
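As a rough illustration of the reductionist approach, the three data formats in Figure C.8.3 could each be collapsed to a binary target variable ('active for at least one hour per week'). The rules below are assumptions made for this sketch, not the algorithms used in the case studies: the threshold is borrowed from the Study 3 question, and only moderate and vigorous hours are counted for Study 2.

```python
def study1_to_binary(category: str) -> bool:
    """Categorical response: categories above 1 hr/week count as active."""
    return category in {">7 hrs/week", "2-6 hrs/week"}


def study2_to_binary(moderate_hours: float, vigorous_hours: float,
                     threshold_hours: float = 1.0) -> bool:
    """Continuous response: assumed rule using moderate plus vigorous hours."""
    return (moderate_hours + vigorous_hours) >= threshold_hours


def study3_to_binary(response: str) -> bool:
    """Binary response: already in the target format."""
    return response.strip().lower() == "yes"
```

The continuous information in Study 2 is discarded by this rule, which is the loss of granularity described above.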
Harmonisation using validation data
The use of collapsing algorithms leads to loss of information when richer, more detailed data are reduced to the less granular level of another variable in order to achieve harmonisation (e.g. coding a continuous variable as ordinal categories of low, medium and high exposure). This loss in information generally weakens statistical power to detect associations.
An alternative approach can preserve the more detailed information from some participating studies and enable harmonisation of less granular data from others. The approach is based on the relationships between the estimates in the included studies and the unobservable, or latent, true values. These relationships can be estimated in method comparison (validation) studies which use a criterion measure alongside the method in question, as shown in Figure C.8.5.
Figure C.8.5 Hypothetical relationships between estimates from three participating studies with the latent truth. Data from a suitable criterion method best estimate the latent true values of the target variable. If the relationship (mapping relationship 1) between estimates from a criterion method and estimates from Method A is known from a validation study, then it may be possible to transform data from Method A so that they are harmonised with data from the criterion method.
The above approach relies upon the existence of a suitable criterion for the target variable, and the availability of validation data for a given ‘Method X’ against that criterion. Ideally this validation work would be conducted in a population similar to that which is providing the data being harmonised.
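A minimal sketch of this idea is given below. It assumes that the relationship between a method and the criterion is adequately described by a straight line, which is an assumption of this example rather than something the approach requires; the functions are hypothetical and use ordinary least squares on paired validation data.

```python
from typing import List, Sequence, Tuple


def fit_linear_mapping(method_values: Sequence[float],
                       criterion_values: Sequence[float]) -> Tuple[float, float]:
    """Fit criterion = intercept + slope * method by ordinary least squares."""
    n = len(method_values)
    if n != len(criterion_values) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(method_values) / n
    mean_y = sum(criterion_values) / n
    sxx = sum((x - mean_x) ** 2 for x in method_values)
    if sxx == 0:
        raise ValueError("method values must vary")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(method_values, criterion_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def apply_mapping(values: Sequence[float],
                  intercept: float, slope: float) -> List[float]:
    """Transform study data onto the criterion scale using a fitted mapping."""
    return [intercept + slope * v for v in values]
```

The mapping would be fitted in the validation study and then applied to the data of the participating study that used the same method. The sketch ignores the uncertainty in the fitted coefficients, which a full analysis would need to carry forward.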
Sourcing applicable validation data for multiple studies across heterogeneous populations may be challenging; for some methods (e.g. Method B in Figure C.8.5), no validation data against the criterion are available (mapping relationship 2). In this scenario, it may be possible to map via a third method (Method C) if two additional sets of validation data are available, namely criterion validity of Method C (mapping relationship 3) and convergent validity between Method B and C (mapping relationship 4).
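If both available mappings are assumed to be linear (an assumption made only for this sketch), mapping Method B to the criterion via Method C amounts to composing two straight lines. The function name is hypothetical.

```python
from typing import Tuple

Linear = Tuple[float, float]  # (intercept, slope)


def compose_linear(b_to_c: Linear, c_to_criterion: Linear) -> Linear:
    """Combine B -> C and C -> criterion into a single B -> criterion mapping.

    If c = a1 + b1 * x and y = a2 + b2 * c, then
    y = (a2 + b2 * a1) + (b2 * b1) * x.
    """
    a1, b1 = b_to_c
    a2, b2 = c_to_criterion
    return (a2 + b2 * a1, b2 * b1)
```

Each mapping carries its own estimation error, so the combined mapping is less certain than a direct validation of Method B against the criterion would be.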
When validation data are available, the transformation using validation data consists of:
Subsequent association analysis should then ideally provide:
For a worked example of this harmonisation approach, please see case study 3 on simulation of harmonisation of physical activity exposure using validation data.