17 September 2016

Do not control a variable belonging to a sub-sample

In public health research, multivariable models are often used to control for confounders while investigating the effect of an exposure on an outcome. To use a given model, certain assumptions must be met. A model cannot be valid if its assumptions are seriously violated. Some assumptions are checked before fitting the model. For example, one wouldn’t choose a linear regression model for a nominal dependent variable because the assumption of “a continuous dependent variable” is violated. Other assumptions are checked post hoc (for example, normality of residuals in linear regression models). These assumptions are model-specific.

There is also a prerequisite that must be checked for all models but that is seldom taught to students and not clearly addressed in books. A variable being controlled in a multivariable model should be one that has been measured on all study units, so that every study unit has a meaningful value for it. A control variable should not be one that is measured only on a sub-sample. For example, in a study among mothers, a variable about the outcome of the previous pregnancy applies only to mothers who had at least one pregnancy in the past. It is not uncommon to find researchers who enter such variables into multivariable models and end up with models “behaving wildly”. This is because all study units to which the variable doesn’t apply are excluded from the analysis, which is then limited to a sub-sample. As a result, estimates can be biased, precision is lost, as shown by overly wide confidence intervals (sometimes the lower and upper bounds cannot be estimated at all), and model goodness-of-fit statistics may not be computable.
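The shrinkage described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the six records are invented, and the variable names (prev_outcome, exposed, and so on) and the helper functions complete_cases and coverage are hypothetical. It mimics the complete-case (listwise) deletion that many statistical packages apply by default when a model variable has no value for some study units.

```python
# Invented records for a study among mothers (illustration only).
# prev_outcome is None for mothers with no previous pregnancy,
# because the question does not apply to them.
mothers = [
    {"id": 1, "age": 19, "exposed": 1, "outcome": 0, "prev_outcome": None},
    {"id": 2, "age": 24, "exposed": 0, "outcome": 0, "prev_outcome": "live birth"},
    {"id": 3, "age": 31, "exposed": 1, "outcome": 1, "prev_outcome": "stillbirth"},
    {"id": 4, "age": 22, "exposed": 0, "outcome": 1, "prev_outcome": None},
    {"id": 5, "age": 35, "exposed": 1, "outcome": 0, "prev_outcome": "live birth"},
    {"id": 6, "age": 20, "exposed": 1, "outcome": 1, "prev_outcome": None},
]


def complete_cases(rows, variables):
    """Keep only rows that have a non-missing value for every listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def coverage(rows, variables):
    """Return, for each variable, the fraction of rows with a non-missing value."""
    n = len(rows)
    return {v: sum(r[v] is not None for r in rows) / n for v in variables}


# Variables a researcher might enter into the multivariable model.
model_vars = ["outcome", "exposed", "age", "prev_outcome"]

analysed = complete_cases(mothers, model_vars)
cov = coverage(mothers, model_vars)

# Variables that do not apply to (or were not measured on) every study unit.
partial = [v for v, c in cov.items() if c < 1.0]

print(f"Study units enrolled: {len(mothers)}")
print(f"Study units left in the model: {len(analysed)}")
print(f"Variables not covering the full sample: {partial}")
```

Running a coverage check of this kind before fitting the model makes the problem visible early: here only half of the enrolled mothers would remain in the analysis once prev_outcome is entered as a control variable.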
          
A variable belonging to a sub-sample can be controlled only if one is purposely analysing that sub-sample in order to make inferences about the corresponding sub-population. In that case, the analysis should be planned from the outset, including ensuring that the sample size is adequate for sub-population inference.

In conclusion, when working with multivariable models, it is necessary to make sure that all control variables (and exposure variables, for that matter) apply to all study units, unless one is analysing a sub-sample for sub-population inference. Otherwise, results could lack both validity and precision.
