Abstract:
This book presents methods of statistical inference from
multivariate datasets with missing values where missingness
may occur on any or all of the variables. Such datasets arise
frequently in statistical practice, but the tools for effectively
dealing with them are not readily available to data analysts. It
is our goal to provide these tools, along with the knowledge of
how to use them.
When faced with missing values, practitioners frequently
resort to ad hoc methods of case deletion or imputation to
force the incomplete dataset into a rectangular complete-data
format. Many statistical software packages, for example,
automatically omit from a linear regression analysis any case
that has a missing value for any variable. Imputation is a
generic term for filling in missing data with plausible values.
In a multivariate dataset, each missing value may be replaced
by the observed mean for that variable, or, in a slightly less
naive approach, by some sort of predicted value from a
regression model. Almost invariably, after the dataset has been
altered by one of these methods, no additional provision for
missing data is made in the subsequent analysis. The research
usually proceeds as if the omitted cases had never really been
observed, or as if the imputed values were real data.
When the incomplete cases comprise only a small fraction
of all cases (say, five percent or less), then case deletion may
be a perfectly reasonable solution to the missing-data problem.
In multivariate settings where missing values occur on more
than one variable, however, the incomplete cases are often a
substantial portion of the entire dataset. If so, deleting them
may be inefficient, causing large amounts of information to be
discarded. Moreover, omitting them from the analysis will
tend to introduce bias, to the extent that the incompletely
observed cases differ systematically from the completely
observed ones. The completely observed cases that remain
will be unrepresentative of the population for which the
inference is usually intended: the population of all cases,
rather than the population of cases with no missing data.
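The bias from case deletion can be illustrated with a small simulation. The
sketch below is not drawn from the book: the function name
case_deletion_demo, the bivariate normal model, and the missingness rule
(y is missing 80% of the time when x > 0 and 10% of the time otherwise) are
all assumptions chosen for illustration.

```python
import random
import statistics


def case_deletion_demo(n=20000, rho=0.5, seed=1):
    """Compare the full-data mean of y with the complete-case mean.

    Hypothetical setup: x and y are standard normal with correlation rho,
    and y is more likely to be missing when x is large.
    """
    rng = random.Random(seed)
    full_y, complete_y = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + (1 - rho**2) ** 0.5 * rng.gauss(0, 1)
        full_y.append(y)
        # Assumed missingness rule: 80% missing when x > 0, else 10%.
        p_missing = 0.8 if x > 0 else 0.1
        if rng.random() >= p_missing:
            complete_y.append(y)  # this case survives case deletion
    return {
        "full_mean": statistics.fmean(full_y),
        "complete_case_mean": statistics.fmean(complete_y),
        "fraction_retained": len(complete_y) / n,
    }


if __name__ == "__main__":
    for name, value in case_deletion_demo().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the retained cases over-represent small values of x,
so the complete-case mean of y falls below the mean over all cases, and a
large share of the cases is discarded as well.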
Ad hoc methods of imputation are no less problematic.
Imputing averages on a variable-by-variable basis preserves
the observed sample means, but it distorts the covariance
structure, biasing estimated variances and covariances toward
zero. Imputing predicted values from regression models, on
the other hand, tends to inflate observed correlations, biasing
them away from zero. When the pattern of missingness is
complex, devising an ad hoc imputation scheme that preserves
important aspects of the joint distribution of the variables can
be a daunting task. Moreover, even if the joint distribution of
all variables could be adequately preserved, it may be a
serious mistake to treat the imputed data as if they were real.
Standard errors, p-values and other measures of uncertainty
calculated by standard complete-data methods could be
misleading, because they fail to reflect any uncertainty due to
missing data.
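The distortions described above can be seen in a small simulation. This is a
minimal sketch rather than a method from the book: the names pearson and
compare_imputations, the bivariate normal model, and the 40% rate of
missingness on y (completely at random, with x always observed) are
assumptions made for illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def compare_imputations(n=5000, rho=0.5, miss_rate=0.4, seed=0):
    """Contrast mean imputation and regression imputation of y."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * xi + (1 - rho**2) ** 0.5 * rng.gauss(0, 1) for xi in x]
    # Assumption: y is missing completely at random; x is always observed.
    observed = [rng.random() >= miss_rate for _ in range(n)]
    x_obs = [xi for xi, o in zip(x, observed) if o]
    y_obs = [yi for yi, o in zip(y, observed) if o]

    # Mean imputation: fill each missing y with the observed mean of y.
    y_bar = statistics.fmean(y_obs)
    y_mean_imp = [yi if o else y_bar for yi, o in zip(y, observed)]

    # Regression imputation: least-squares fit of y on x among the
    # observed cases, then fill each missing y with its fitted value.
    x_bar = statistics.fmean(x_obs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x_obs, y_obs))
    sxx = sum((xi - x_bar) ** 2 for xi in x_obs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    y_reg_imp = [
        yi if o else intercept + slope * xi
        for xi, yi, o in zip(x, y, observed)
    ]

    return {
        "var_observed": statistics.variance(y_obs),
        "var_mean_imputed": statistics.variance(y_mean_imp),
        "corr_observed": pearson(x_obs, y_obs),
        "corr_mean_imputed": pearson(x, y_mean_imp),
        "corr_regression_imputed": pearson(x, y_reg_imp),
    }


if __name__ == "__main__":
    for name, value in compare_imputations().items():
        print(f"{name}: {value:.3f}")
```

In this setup, mean imputation leaves the mean of y unchanged but shrinks its
variance and its correlation with x, while regression imputation places the
filled-in points exactly on the fitted line and so pushes the correlation
above the value seen among the observed cases.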
This book presents a unified approach to the analysis of
incomplete multivariate data. We will consider datasets for
which the variables are continuous, categorical, or both. This
approach allows one to analyze the data by virtually any
technique that would be appropriate if the data were complete.
This is accomplished not by simply modifying the data in an
ad hoc fashion to make them appear complete, but by
principled methods that account for the missing values, and
the uncertainty they introduce, at each step of the analysis in a
formal way. These methods tend to be computationally
intensive, requiring more computer time than ad hoc
alternatives. However, they do not require a heavy investment
of analyst time, and can be applied to a wide variety of
problems more or less routinely without special efforts to
develop new technology unique to each problem. This book is
written from an applied perspective, attempting to bring
together theory, computational methods, data examples and
practical advice in a single source.