Data on diet, physical activity, and anthropometry are not always suitable for use in their raw state. Data processing refers to the series of steps performed to derive variables from the raw measurement.
Each of the diet, physical activity, and anthropometry domains involves unique data processing steps, for example, processing food consumption to nutrient intakes, heart rate to energy expenditure, and x-ray absorption to fat mass, as discussed on relevant pages. This section describes the following more general topics requiring data processing:
All are often implemented or addressed after standard data processing is undertaken. All considerations are common across assessments of diet, physical activity, and anthropometry. All should be considered as options depending on the research aim, the study design, availability of data, and biological or statistical plausibility.
Involves identification of invalid or incorrect records within the dataset, for example:
In some cases, it may be possible to correct data or contact the participant to clarify details.
An outlier is an extreme observation that appears to deviate markedly from other observations in a defined sample. Outliers can have a large effect on statistical analysis, depending on how many there are and how extreme they are. Some statistics, such as the mean and least squares regression, are particularly vulnerable to the effects of outliers [1].
Identifying outliers depends upon the underlying distribution of the data, and should therefore begin with inspecting this distribution. Outliers can be identified visually using plots such as a scatterplot, histogram, or box plot. Statistical thresholds can also be used, for example, data points three or more standard deviations from the mean can be flagged.
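As a minimal sketch of the statistical threshold described above, the following flags values lying three or more standard deviations from the mean. The function name `flag_outliers` and the energy intake values are hypothetical and chosen only for illustration; the rule assumes a roughly symmetric distribution, so the distribution should still be inspected first.

```python
import statistics

def flag_outliers(values, n_sd=3.0):
    """Return one boolean per value: True if it lies n_sd or more
    sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [abs(v - mean) >= n_sd * sd for v in values]

# Hypothetical daily energy intakes (kcal/day); the last value is extreme.
energy_kcal = [1800, 2100, 1950, 2300, 2000, 2200, 1900, 2050, 2150, 1850,
               2250, 1750, 2400, 1700, 2100, 1950, 2000, 2050, 1900, 9500]

flags = flag_outliers(energy_kcal)
flagged = [v for v, is_out in zip(energy_kcal, flags) if is_out]
print(flagged)
```

Flagging only marks values for investigation; it does not decide whether they are errors or legitimate extreme cases.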
An outlier may indicate an error in the data but may also be a legitimate extreme case sampled from the population by chance. Outliers must therefore be carefully investigated to understand their cause(s) and to decide what should be done about them. Decisions must be biologically or statistically plausible. The aim is to produce results that are not affected by a few outliers. Methods that can be used include truncation, winsorization, or transformation of the data.
Truncation
Truncation removes values above or below an absolute threshold (e.g. in kcal/day) or a relative threshold (e.g. mean ± three standard deviations).
The drawback of this method is that potentially valuable data are lost; as such the sample may not be fully representative of the population of interest. The choice to truncate data and the threshold to use should be considered carefully.
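Truncation with absolute thresholds might be sketched as follows. The function name `truncate` and the limits of 500 and 5000 kcal/day are assumptions made for this example, not recommended cut-offs; the number of removed records is reported because those data are lost to the analysis.

```python
def truncate(values, lower=None, upper=None):
    """Keep only values within [lower, upper]; a bound of None is not applied."""
    return [
        v for v in values
        if (lower is None or v >= lower) and (upper is None or v <= upper)
    ]

# Hypothetical daily energy intakes (kcal/day).
energy_kcal = [1800, 2100, 350, 9500, 2300]

# Illustrative absolute thresholds only.
kept = truncate(energy_kcal, lower=500, upper=5000)
n_removed = len(energy_kcal) - len(kept)
print(kept, n_removed)
```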
Winsorization
Winsorization involves recoding extreme values to the nearest ‘reasonable’ value (either a minimum or a maximum). For example, when using the International Physical Activity Questionnaire (IPAQ) [2] in a population where a sedentary lifestyle is of concern, values for walking exceeding three hours per day may appear to be outliers. Those outlying values can be winsorized to three hours per day, permitting a maximum of 21 hours of walking per week.
Winsorization can reduce the effect of outliers without removing individual cases from the dataset. Winsorization maintains the relative ordering of scores, with the highest or lowest scores remaining, thus minimising harm to statistical inference.
The choice of the highest or lowest ‘reasonable’ value is important and should be justified. For example, cut points may reflect their clinical meaning, pilot data, biological plausibility, or statistical plausibility. It can also be based on percentiles.
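The walking example above can be sketched as follows, with daily walking time capped at 180 minutes (three hours). The function name `winsorize` and the data are hypothetical; only the three-hour cap comes from the example in the text.

```python
def winsorize(values, lower=None, upper=None):
    """Recode values below `lower` to `lower` and above `upper` to `upper`.
    No records are removed and the ordering of values is preserved."""
    recoded = []
    for v in values:
        if lower is not None and v < lower:
            v = lower
        if upper is not None and v > upper:
            v = upper
        recoded.append(v)
    return recoded

# Hypothetical reported walking time (minutes per day).
walking_min_per_day = [0, 30, 45, 60, 240, 600]

capped = winsorize(walking_min_per_day, upper=180)
max_hours_per_week = max(capped) * 7 / 60
print(capped, max_hours_per_week)
```

Unlike truncation, every participant remains in the dataset; the two highest values become tied at the cap.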
Diet, physical activity, and anthropometric data often have skewed distributions, mostly being naturally truncated at zero and having no upper limit. Transformation of data may be required for statistical analysis and modelling, and may also reduce the effect of outliers.
Transformation has the advantages of keeping all values in the dataset and keeping the relative ranking of scores. Common types of transformation include:
Box-Cox transformation is a type of power transformation used to bring the distribution of a variable close to normal. A variable x is transformed by the function $(x^k - 1)/k$, where k is selected so that the transformed variable has skewness equal to zero. This transformation is useful for obtaining an approximately normally distributed variable, but the transformed variable loses its interpretability. In practice, the variable can be standardised after Box-Cox transformation so that 1 unit represents 1 standard deviation.
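A minimal sketch of this procedure, using only the Python standard library, is given below. It assumes strictly positive data, treats k = 0 as the natural logarithm (the limiting case of the formula), and searches for k by bisection over the interval [-2, 2]; the search interval, the function names, and the data are assumptions made for illustration. Statistical packages often choose k by maximum likelihood rather than by zero skewness, which can give a different value.

```python
import math
import statistics

def box_cox(values, k):
    """Apply (x**k - 1) / k to each value; k == 0 is treated as log(x)."""
    if k == 0:
        return [math.log(x) for x in values]
    # expm1 keeps precision when k is close to zero.
    return [math.expm1(k * math.log(x)) / k for x in values]

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5

def find_k(values, lo=-2.0, hi=2.0, iterations=100):
    """Bisection search for the k giving zero skewness after transformation."""
    s_lo = skewness(box_cox(values, lo))
    s_hi = skewness(box_cox(values, hi))
    if s_lo * s_hi > 0:
        raise ValueError("no sign change in skewness within the search interval")
    for _ in range(iterations):
        mid = (lo + hi) / 2
        s_mid = skewness(box_cox(values, mid))
        if s_lo * s_mid <= 0:
            hi = mid
        else:
            lo, s_lo = mid, s_mid
    return (lo + hi) / 2

# Hypothetical right-skewed, strictly positive data (sorted for readability).
x = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20]
k = find_k(x)
transformed = box_cox(x, k)

# Standardise so that 1 unit represents 1 standard deviation.
mean_t = statistics.fmean(transformed)
sd_t = statistics.stdev(transformed)
standardised = [(t - mean_t) / sd_t for t in transformed]
```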
Transformation always alters the meaning of one unit of the target variable, but this does not necessarily make the variable difficult to interpret. For example, if a variable is transformed with a common logarithm (log10), a one-unit difference in the transformed variable corresponds to a 10-fold difference in the original variable (i.e. $\log_{10} 10 = 1$, $\log_{10} 100 = 2$, $\log_{10} 1000 = 3$, etc.).
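The interpretation of a log10-transformed variable can be illustrated with a short sketch. The helper name `fold_difference` and the values are hypothetical; note that the logarithm is undefined at zero, so zero values need separate handling that is not shown here.

```python
import math

def fold_difference(log10_a, log10_b):
    """Convert a difference on the log10 scale back to a ratio on the original scale."""
    return 10 ** (log10_b - log10_a)

# Hypothetical strictly positive values on the original scale.
original = [10, 100, 1000]
logged = [math.log10(v) for v in original]

# A one-unit difference on the log10 scale is a 10-fold difference originally.
ratio = fold_difference(logged[0], logged[1])
print(logged, ratio)
```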
Categorisation
Depending on the aim of the study, categorisation of a study population can be readily applied. This reduces any effects of the shape of distribution and of outliers. Categorisation can be based on:
Categorisation facilitates statistical analysis and interpretation. However, categorising a continuous variable always discards the differences between individuals within the same category. For instance, if BMI is treated as a categorical variable as above, individuals with BMI = 30 kg/m² and with BMI = 40 kg/m² are treated as the same.
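The loss of within-category information can be seen in a small sketch. The cut-offs of 18.5, 25, and 30 kg/m² are the commonly used WHO adult BMI thresholds and are an assumption here, as are the category labels and the function name `bmi_category`.

```python
import bisect

# Assumed cut-offs (kg/m²): commonly used WHO adult BMI thresholds.
BMI_CUTOFFS = [18.5, 25.0, 30.0]
BMI_LABELS = ["underweight", "normal weight", "overweight", "obese"]

def bmi_category(bmi):
    """Map a continuous BMI value to a category label.
    A value equal to a cut-off falls into the higher category."""
    return BMI_LABELS[bisect.bisect_right(BMI_CUTOFFS, bmi)]

# Hypothetical BMI values (kg/m²).
bmi_values = [17.0, 22.4, 27.9, 30.0, 40.0]
categories = [bmi_category(b) for b in bmi_values]
print(categories)
```

The last two individuals differ by 10 kg/m² but receive the same label, which is the trade-off described above.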
Missing data represent a potential source of bias in population health research. Research that is carefully designed and conducted can minimise the potential for missing data. It may be useful to consider whether the method to be used is likely to result in missing data in the population of interest. For example,
Consideration of the subgroups more likely to return missing data, and collection of additional information describing these groups, can in turn be used to address missing data that do occur.
Types of missing data
Unfortunately, missing data often occur. The mechanism by which this happens must be accounted for when the missing information is treated. In the statistics and epidemiology literature [3], missing data are categorised as:
Dealing with missing data
The type of missing data may affect the chosen method for dealing with missing data [3,4]. Broadly, the principal options are:
Imputation can be performed on a case-by-case basis. For example: