Tabular utility

Produces tables from observed and synthesised data and calculates utility measures to compare them with their expectation if the synthesising model is correct.

It can also be used with synthetic data not created by syn(), but then an additional parameter, cont.na, might need to be provided.

Usage

# S3 method for synds
utility.tab(object, data, vars = NULL, ngroups = 5,
            useNA = TRUE, max.table = 1e6,
            print.tables = length(vars) < 4,
            print.stats = c("pMSE", "S_pMSE", "df"),
            print.zdiff = FALSE, print.flag = TRUE,
            digits = 4, k.syn = FALSE, ...)

# S3 method for data.frame
utility.tab(object, data, vars = NULL, cont.na = NULL,
            ngroups = 5, useNA = TRUE, max.table = 1e6,
            print.tables = length(vars) < 4,
            print.stats = c("pMSE", "S_pMSE", "df"),
            print.zdiff = FALSE, print.flag = TRUE,
            digits = 4, k.syn = FALSE, ...)

# S3 method for list
utility.tab(object, data, vars = NULL, cont.na = NULL,
            ngroups = 5, useNA = TRUE, max.table = 1e6,
            print.tables = length(vars) < 4,
            print.stats = c("pMSE", "S_pMSE", "df"),
            print.zdiff = FALSE, print.flag = TRUE,
            digits = 4, k.syn = FALSE, ...)


# S3 method for utility.tab
print(x, print.tables = NULL,
      print.zdiff = NULL, print.stats = NULL,
      digits = NULL, ...)

Arguments

object: an object of class synds, which stands for 'synthesised data set'. It is typically created by function syn() or syn.strata() and includes object$m, the number of synthesised data sets, and object$syn, which is the synthesised data set if m = 1 or a list of m such data sets otherwise. Alternatively, when the data were not synthesised with syn(), it can be a data frame containing a synthetic data set or a list of such data frames, all created from the same original data with the same variables and the same method.
data: the original (observed) data set.
vars: a single string or a vector of strings with the names of variables to be used to form the table.
cont.na: a named list of codes for missing values for continuous variables if different from the R missing data code NA. The names of the list elements must correspond to the names of the variables for which the missing data codes need to be specified.
max.table: a maximum table size. You could try increasing the default value, but memory problems are likely.
ngroups: if numerical (non-factor) variables are included, they will be classified into this number of groups to form tables. Classification is performed using the classIntervals() function with n = ngroups. By default, style = "quantile" is used to get appropriate groups for skewed data. Problems for variables with a small number of unique values are handled by keeping only the unique values of the breaks. Other arguments of classIntervals() can, however, be specified in the call to utility.tab().
useNA: determines if NA values are to be included in tables.
print.tables: a logical value that determines if tables of observed and synthesised data are to be printed. By default tables are printed if they have up to three dimensions.
print.stats: a single string or a vector of strings that determines which utility measures to print. Must be a selection from: "VW", "FT", "JSD", "SPECKS", "WMabsDD", "U", "G", "pMSE", "PO50", "MabsDD", "dBhatt", "S_VW", "S_FT", "S_JSD", "S_WMabsDD", "S_G", "S_pMSE", "df", "dfG". If print.stats = "all", all of these will be printed. For more information see the Details section below.
print.zdiff: a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.
print.flag: a logical value that determines if messages are to be printed during computation.
digits: an integer indicating the number of decimal places for printing statistics, tab.zdiff and mean results for m > 1.
k.syn: a logical indicator as to whether the sample size itself has been synthesised. The default value is FALSE, which will apply to synthetic data created by synthpop.
...: additional parameters; can be passed to classIntervals() function.
x: an object of class utility.tab.
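
The grouping described for ngroups can be sketched in base R as follows. This is only an approximation of the behaviour described above: it does not call classIntervals(), the data are hypothetical, and the quantile definition used by quantile() is an assumption rather than something the package is known to share.

```r
# Hypothetical skewed variable with many tied zeros (illustrative data only)
set.seed(1)
x <- c(rep(0, 100), rexp(150))

ngroups <- 5

# Quantile break points; the ties at zero make the first two breaks identical
raw_brks <- quantile(x, probs = seq(0, 1, length.out = ngroups + 1))

# Keep only the unique break values, as the text describes for such variables
brks <- unique(raw_brks)

# Assign each value to a group; include.lowest keeps the minimum in group 1
grp <- cut(x, breaks = brks, include.lowest = TRUE)
table(grp)
```

With these data the six requested break points collapse to five unique ones, so the variable ends up in four groups rather than five.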

Details

Forms tables of observed and synthesised values for the variables specified in vars. Several utility measures are calculated from the cells of the tables, as described below. Details of all of these measures can be found in Raab et al. (2021).

If the synthesising model is correct, the measures VW, FT, G and JSD should have chi-square distributions with df degrees of freedom for large samples. Standardised versions of each measure are available (e.g. S_VW for VW, where S_VW = VW/df) that will have an expected value of 1 if the synthesising model is correct.

Four other measures are calculated by treating the table as a prediction model: the propensity score mean-squared error pMSE; two statistics from a comparison of the propensity scores for the synthetic and original data, namely the Kolmogorov-Smirnov statistic SPECKS and the Wilcoxon rank-sum statistic U; and PO50, the percentage of observations correctly predicted in the combined tables over 50%, where an observation counts as correctly predicted when the majority of observations in its grouping belong to the same category (real or synthetic) as the observation itself. The first of these, pMSE, is identical to VW except for a constant. No expected values are computed for the last three of these measures, but they can be obtained by replication from utility.gen().

Three further measures are calculated from the tables. Two are mean absolute differences in distributions: MabsDD, the average absolute difference in the proportions of original and synthetic data over all the cells in the table, and WMabsDD, a weighted version in which the weights are proportional to the inverse of the variance of the absolute differences, so that the measure can be standardised by its expected value, df. The third is the Bhattacharyya distance dBhatt, derived from the overlap of the histograms of the original and synthetic data sets.
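
The bookkeeping behind some of these quantities can be illustrated with a small base R sketch. The cell counts and the VW value below are made up, the object names are hypothetical, and the unweighted mean absolute difference takes the wording "average over all the cells" literally; the package's own scaling may differ.

```r
# Hypothetical cell counts for a 2 x 3 table (illustrative numbers only)
tab_obs <- matrix(c(30, 20, 25, 15, 10, 0), nrow = 2)
tab_syn <- matrix(c(28, 22, 27, 13, 10, 0), nrow = 2)

# Cells with any observed or synthesised count contribute to the statistics
used   <- (tab_obs + tab_syn) > 0
nempty <- sum(!used)

# Degrees of freedom as described under Value: contributing cells minus one
# when k.syn is FALSE, or the number of contributing cells when k.syn is TRUE
k.syn <- FALSE
df <- sum(used) - as.integer(!k.syn)

# Cell proportions and their mean absolute difference over contributing cells
# (assumption: a plain mean; the package may scale this measure differently)
p_obs <- tab_obs / sum(tab_obs)
p_syn <- tab_syn / sum(tab_syn)
mabsdd_sketch <- mean(abs(p_syn - p_obs)[used])

# Standardising a chi-square-type measure by df (the VW value is made up)
VW_example   <- 6.2
S_VW_example <- VW_example / df
```

A standardised value well above 1, as in this made-up case, is what one would read as evidence of lack of fit under the description above.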

Value

An object of class utility.tab which is a list with the following components:

m: number of synthetic data sets in object, i.e. object$m.
VW: a vector with object$m values for the Voas-Williamson utility measure; linearly related to pMSE.
FT: a vector with object$m values for the Freeman-Tukey utility measure.
JSD: a vector with object$m values for the Jensen-Shannon divergence for comparing the tables.
SPECKS: a vector with object$m values for the Kolmogorov-Smirnov statistic for comparing the propensity scores for the original and synthetic data.
WMabsDD: a vector with object$m values of the weighted mean absolute difference in distributions for original and synthetic data.
U: a vector with object$m values of the Wilcoxon statistic comparing the propensity scores for the original and synthetic data.
G: a vector with object$m values for the adjusted likelihood ratio utility measure.
pMSE: a vector with object$m values of the propensity score mean-squared error; linearly related to VW.
PO50: a vector with object$m values of the percentage over 50% of observations correctly predicted from the propensity scores; linearly related to SPECKS and MabsDD.
MabsDD: a vector with object$m values of the mean absolute difference in distributions for original and synthetic data; linearly related to SPECKS and PO50.
dBhatt: a vector with object$m values of the Bhattacharyya distances between the synthetic and original data; linearly related to the square root of FT.
S_VW: VW/df.
S_FT: FT/df.
S_JSD: JSD/df.
S_WMabsDD: WMabsDD/df.
S_G: G/df.
S_pMSE: standardised measure from pMSE, identical to S_VW.
df: a vector of degrees of freedom for the chi-square tests, equal to the number of cells in the tables with any observed or synthesised counts minus one when k.syn == FALSE, or to the number of such cells when k.syn == TRUE.
dfG: degrees of freedom used in standardising G.
nempty: a vector of length object$m with number of cells not contributing to the statistics.
tab.obs: a table from the observed data.
tab.syn: a table or a list of m tables from the synthetic data.
tab.zdiff: a table or a list of m tables of Z statistics for differences between observed and synthesised cells of the tables. Large absolute values indicate a large contribution to lack-of-fit.
digits: an integer indicating the number of decimal places for printing statistics, tab.zdiff and mean results for m > 1.
print.tables: a logical value that determines if tables of observed and synthesised data are to be printed.
print.stats: a single string or a vector of strings with utility measures to be printed out.
print.zdiff: a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.
n: number of observations in the original data set.
k.syn: a logical indicator as to whether the sample size itself has been synthesised.
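
As a sketch of how the returned components might be inspected, the following uses a mock list that mimics a few components for m = 1. All values, dimension names and the threshold of 2 are hypothetical; a real object would come from utility.tab().

```r
# Mock of a few components of a utility.tab result for m = 1
# (hypothetical values; a real object comes from utility.tab())
u_mock <- list(
  m = 1,
  S_pMSE = 1.8,
  df = 5,
  tab.zdiff = matrix(c(0.4, -2.6, 1.1, 3.2, -0.3, -0.9), nrow = 2,
                     dimnames = list(sex = c("F", "M"),
                                     agegrp = c("young", "mid", "old")))
)

# Cells whose |Z| exceeds an arbitrary threshold of 2; columns of the index
# matrix are addressed by position because their names follow the dimnames
big <- which(abs(u_mock$tab.zdiff) > 2, arr.ind = TRUE)

flagged <- data.frame(
  sex    = rownames(u_mock$tab.zdiff)[big[, 1]],
  agegrp = colnames(u_mock$tab.zdiff)[big[, 2]],
  z      = u_mock$tab.zdiff[big]
)

# Largest contributions to lack-of-fit first
flagged[order(-abs(flagged$z)), ]
```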

References

Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.

Raab, G.M., Nowok, B. and Dibben, C. (2021). Assessing, visualizing and improving the utility of synthetic data. Available from https://arxiv.org/abs/2109.12717.

Read, T.R.C. and Cressie, N.A.C. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.

Voas, D. and Williamson, P. (2001). Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.

Examples
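
A minimal sketch of typical calls, assuming the synthpop package is installed and using its bundled SD2011 survey data. The choice of variables, m = 2, the seed and the missing-value code -8 for nofriend are illustrative assumptions, not prescribed settings.

```r
if (requireNamespace("synthpop", quietly = TRUE)) {
  library(synthpop)

  # Observed data: a small subset of the SD2011 example data
  ods <- SD2011[1:1000, c("sex", "age", "marital", "nofriend")]

  # Two synthetic data sets; -8 is treated as a missing-value code
  s1 <- syn(ods, m = 2, cont.na = list(nofriend = -8),
            seed = 123, print.flag = FALSE)

  # synds method: two-way table of marital status by sex
  u <- utility.tab(s1, ods, vars = c("marital", "sex"),
                   print.stats = c("pMSE", "S_pMSE", "df"))

  # Reprint the same object, this time with the tables of Z scores
  print(u, print.zdiff = TRUE)

  # Data frame method: pass one synthetic data frame directly and supply
  # the missing-value code for the continuous variable through cont.na
  u_df <- utility.tab(s1$syn[[1]], ods, vars = c("nofriend", "sex"),
                      cont.na = list(nofriend = -8))
}
```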