Distributional comparison of synthesised and observed data
Distributional comparison of a synthesised data set with the original (observed) data set using propensity scores.
This function can also be used with synthetic data NOT created by
 syn(), but then the additional parameters not.synthesised
 and cont.na might need to be provided.
Usage
# S3 method for synds
utility.gen(object, data,
            method = "cart", maxorder = 1, k.syn = FALSE, tree.method = "rpart",
            max.params = 400, print.stats = c("pMSE", "S_pMSE"), resamp.method = NULL,
            nperms = 50, cp = 1e-3, minbucket = 5, mincriterion = 0, vars = NULL,
            aggregate = FALSE, maxit = 200, ngroups = NULL, print.flag = TRUE,
            print.every = 10, digits = 6, print.zscores = FALSE, zthresh = 1.6,
            print.ind.results = FALSE, print.variable.importance = FALSE, ...)
# S3 method for data.frame
utility.gen(object, data, not.synthesised = NULL, cont.na = NULL,
            method = "cart", maxorder = 1, k.syn = FALSE, tree.method = "rpart",
            max.params = 400, print.stats = c("pMSE", "S_pMSE"), resamp.method = NULL,
            nperms = 50, cp = 1e-3, minbucket = 5, mincriterion = 0, vars = NULL,
            aggregate = FALSE, maxit = 200, ngroups = NULL, print.flag = TRUE,
            print.every = 10, digits = 6, print.zscores = FALSE, zthresh = 1.6,
            print.ind.results = FALSE, print.variable.importance = FALSE, ...)
# S3 method for list
utility.gen(object, data, not.synthesised = NULL, cont.na = NULL,
            method = "cart", maxorder = 1, k.syn = FALSE, tree.method = "rpart",
            max.params = 400, print.stats = c("pMSE", "S_pMSE"), resamp.method = NULL,
            nperms = 50, cp = 1e-3, minbucket = 5, mincriterion = 0, vars = NULL,
            aggregate = FALSE, maxit = 200, ngroups = NULL, print.flag = TRUE,
            print.every = 10, digits = 6, print.zscores = FALSE, zthresh = 1.6,
            print.ind.results = FALSE, print.variable.importance = FALSE, ...)
# S3 method for utility.gen
print(x, digits = NULL, zthresh = NULL,
               print.zscores = NULL, print.stats = NULL,
               print.ind.results = NULL, print.variable.importance = NULL, ...)
Arguments
- object
 an object of class synds, which stands for 'synthesised data set'. It is typically created by the function syn() and it includes object$m synthesised data set(s) as object$syn. This is a single data set when object$m = 1 or a list of length object$m when object$m > 1. Alternatively, when data are synthesised without using syn(), it can be a data frame with a synthetic data set or a list of data frames with synthetic data sets, all created from the same original data with the same variables and the same method.
- data
 the original (observed) data set.
- not.synthesised
 a vector of variable names for any variables that have been left unchanged in the synthetic data. Not required if object is of class synds.
- cont.na
 a named list of codes for missing values for continuous variables if different from the R missing data code NA. The names of the list elements must correspond to the names of the variables for which the missing data codes need to be specified. Not required if object is of class synds.
- method
 a single string specifying the method for modeling the propensity scores. It can be "logit" or "cart".
- maxorder
 maximum order of interactions to be considered in the "logit" method. For a model without interactions, 0 should be provided.
- k.syn
 a logical indicator as to whether the sample size itself has been synthesised.
- tree.method
 implementation of the "cart" method that is used when method = "cart". It can be "rpart" or "ctree".
- max.params
 the maximum number of parameters for a "logit" model, above which the user is alerted to possible fitting failure.
- print.stats
 statistics to be printed. Must be a selection from "pMSE", "SPECKS", "PO50", "S_pMSE", "S_SPECKS", "S_PO50". If print.stats = "all", all of the measures mentioned above will be printed.
- resamp.method
 method used for resampling estimates of standardized measures. It can be "perm", "pairs" or "none". Defaults to "pairs" if print.stats includes "S_SPECKS" or "S_PO50" or if synthesis is incomplete; otherwise defaults to "perm" if method is "cart", or to NULL (no resampling needed) if method is "logit". "none" can be used to get results without standardized measures, e.g. in simulations.
- nperms
 number of permutations for the permutation test to obtain the null distribution of the utility measure when resamp.method = "perm".
- cp
 complexity parameter for classification with tree.method "rpart". Small values grow bigger trees.
- minbucket
 minimum number of observations allowed in a leaf for classification when method = "cart".
- mincriterion
 criterion between 0 and 1 to use to control tree.method = "ctree" when the tree will not be allowed to split further. A value of 0.95 would be equivalent to a 5% significance test. Here we set it to 0 to effectively disable this test and grow large trees.
- vars
 variables to be included in the utility comparison. It can be a character vector of names of variables or an integer vector of their column indices. If none are specified all the variables in the synthesised data will be included.
- aggregate
 logical flag as to whether the data should be aggregated by collapsing identical rows before computation. This can lead to much faster computation when all the variables are categorical. Only works for method = "logit".
- maxit
 maximum iterations to use when method = "logit". If the model does not converge in this number a warning will suggest increasing it.
- ngroups
 target number of groups for categorisation of each numeric variable: final number may differ if there are many repeated values. If NULL (default) variables are not categorised into groups.
- print.flag
 TRUE/FALSE to indicate if any messages should be printed during calculations. Change to FALSE for simulations.
- print.every
 controls the printing of progress of resampling when resamp.method is not NULL. When print.every = 0 no progress is reported, otherwise the resample number is printed every print.every.
- ...
 additional parameters passed to glm, rpart, or ctree.
- x
 an object of class utility.gen.
- digits
 number of digits to print in the default output values.
- zthresh
 threshold value used to suppress the printing of z-scores whose absolute value is below it, for method = "logit". If set to NA all z-scores are printed.
- print.zscores
 logical value as to whether z-scores for coefficients of the logit model should be printed.
- print.ind.results
 logical value as to whether utility score results from individual syntheses should be printed.
- print.variable.importance
 logical value as to whether the variable importance measure should be printed when tree.method = "rpart".
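The aggregate argument relies on the fact that a logistic model fitted to collapsed rows with frequency weights gives the same estimates as one fitted to the full data. The base R sketch below illustrates that idea on made-up categorical data. It is not the package's internal code, and the names d, agg and wt are chosen here for illustration only.

```r
# Illustration only: collapsing identical rows and fitting with weights.
set.seed(4)
n <- 200
d <- data.frame(
  a = factor(sample(c("x", "y"), 2 * n, replace = TRUE)),
  b = factor(sample(c("p", "q", "r"), 2 * n, replace = TRUE)),
  t = rep(0:1, each = n)             # 0 = observed, 1 = synthetic
)

# Collapse identical rows, keeping the number of times each one occurs.
d$wt <- 1
agg <- aggregate(wt ~ a + b + t, data = d, FUN = sum)

# Same model on the full data and on the collapsed data with weights.
fit_full <- glm(t ~ a + b, data = d, family = binomial)
fit_agg  <- glm(t ~ a + b, data = agg, family = binomial, weights = wt)

nrow(agg)                            # far fewer rows than nrow(d)
all.equal(coef(fit_full), coef(fit_agg), tolerance = 1e-6)
```

With only categorical variables the number of distinct rows is bounded by the number of cells, which is why the saving can be large. With continuous variables most rows are unique and little is gained.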
Details
This function follows the method for evaluating the utility of masked data as given in Snoke et al. (2018) and originally proposed by Woo et al. (2009). The original and synthetic data are combined into one dataset and propensity scores, as detailed in Rosenbaum and Rubin (1983), are calculated to estimate the probability of membership in the synthetic data set. The utility measure is based on the mean squared difference between these probabilities and the probability expected if the data did not distinguish the synthetic data from the original.
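The calculation described above can be sketched in a few lines of base R. This is a minimal illustration on simulated stand-in data, assuming a main-effects logistic model. The object names (obs, syn, combined, c_exp) are made up for the sketch and it is not the code used inside utility.gen().

```r
# Minimal sketch of a propensity-score pMSE (illustration only).
set.seed(1)
n <- 500
obs <- data.frame(x1 = rnorm(n), x2 = rnorm(n))              # stand-in observed data
syn <- data.frame(x1 = rnorm(n), x2 = rnorm(n, mean = 0.5))  # stand-in synthetic data

# Stack the two data sets and add an indicator t (1 = synthetic, 0 = observed).
combined <- rbind(cbind(obs, t = 0), cbind(syn, t = 1))

# Model the probability of membership in the synthetic data set.
fit <- glm(t ~ x1 + x2, data = combined, family = binomial)
pscore <- fitted(fit)

# Probability expected if the data did not distinguish synthetic from observed.
c_exp <- mean(combined$t)

# Mean squared difference between the propensity scores and that value.
pMSE <- mean((pscore - c_exp)^2)
pMSE
```

A value near zero means the model cannot tell the two sources apart. The shifted x2 in the stand-in synthetic data is what pushes the value away from zero here.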
If k.syn = FALSE the expected probability is just the proportion of
  synthetic data in the combined data set, which is 0.5 when the original and
  synthetic data have the same number of records. Setting k.syn = TRUE
  indicates that the number of observations in the synthetic data was
  synthesised and not fixed by the synthesiser. In this case the expected
  probability will be 0.5 in all cases and the model to discriminate
  between observed and synthetic will include an intercept term. This will
  usually only apply when the standalone version of this function,
  utility.gen.sa(), is used.
Propensity scores can be modeled by logistic regression, method = "logit",
  or by two different implementations of classification and regression trees,
  method = "cart". For logistic regression the predictors are all variables
  in the data and their interactions up to order maxorder. The default of
  1 gives all main effects and first order interactions. For logistic
  regression the null distribution of the propensity score is derived and is
  used to calculate ratios and standardised values.
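As a rough illustration of the ratio for the logit case, the sketch below assumes the large-sample result reported in Snoke et al. (2018): for a model with k estimated coefficients (including the intercept), N combined records and a synthetic proportion c, the null expectation of pMSE is (k - 1)(1 - c)^2 c / N. It also reads "first order interactions" as two-way interaction terms. Both points are assumptions of this sketch, and the data are simulated.

```r
# Sketch of the pMSE ratio for a logit model (illustration only).
set.seed(2)
n <- 400
obs <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
syn <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
combined <- rbind(cbind(obs, t = 0), cbind(syn, t = 1))

# (x1 + x2 + x3)^2 expands to main effects plus all two-way interactions.
fit <- glm(t ~ (x1 + x2 + x3)^2, data = combined, family = binomial)

N     <- nrow(combined)
c_exp <- mean(combined$t)
k     <- sum(!is.na(coef(fit)))      # non-aliased coefficients

pMSE     <- mean((fitted(fit) - c_exp)^2)
pMSE_exp <- (k - 1) * (1 - c_exp)^2 * c_exp / N   # assumed null expectation
S_pMSE   <- pMSE / pMSE_exp                        # ratio to the null expectation
c(pMSE = pMSE, pMSE_exp = pMSE_exp, S_pMSE = S_pMSE)
```

Because both stand-in data sets are drawn from the same distribution, the ratio should be of the order of 1. Values well above 1 would point to a detectable difference.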
For method = "cart" the expectation and variance of the null
  distribution are calculated from a permutation test. Our recent work
  indicates that this method can sometimes give misleading results.
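The permutation idea can be sketched as follows: shuffle the observed/synthetic indicator so that it carries no information, refit the propensity model, and repeat to build up a null distribution. To keep the sketch free of extra packages it uses glm() as a stand-in for the tree model, and pmse_for() is a hypothetical helper written for this illustration. The package's own resampling may differ in detail.

```r
# Sketch of a permutation null distribution for pMSE (illustration only).
set.seed(3)
n <- 300
obs <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
syn <- data.frame(x1 = rnorm(n, mean = 0.3), x2 = rnorm(n))
combined <- rbind(cbind(obs, t = 0), cbind(syn, t = 1))

# Hypothetical helper: fit the propensity model for indicator t, return pMSE.
pmse_for <- function(t, X) {
  d <- cbind(X, t = t)
  p <- fitted(glm(t ~ ., data = d, family = binomial))
  mean((p - mean(t))^2)
}

X <- combined[, c("x1", "x2")]
pMSE_obs <- pmse_for(combined$t, X)

# Null distribution: permute the indicator and recompute the measure.
nperms <- 50
pMSE_null <- replicate(nperms, pmse_for(sample(combined$t), X))

S_pMSE <- pMSE_obs / mean(pMSE_null)                    # ratio to null mean
z      <- (pMSE_obs - mean(pMSE_null)) / sd(pMSE_null)  # standardised value
c(S_pMSE = S_pMSE, z = z)
```

If a model rarely or never finds anything to split on in the permuted data, the null values collapse towards zero and the ratio becomes unstable, which is one way such a procedure can mislead.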
If missing values exist, indicator variables are added and included in the
  model as recommended by Rosenbaum and Rubin (1984). For categorical variables,
  NA is treated as a new category.
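One way to prepare data along these lines is sketched below in base R. The data frame, the variable names and the choice of 0 as the fill-in value are all made up for the illustration and are not taken from the package.

```r
# Sketch of the missing-value handling described above (illustration only).
d <- data.frame(
  income = c(120, NA, 95, 210, NA),
  region = factor(c("north", "south", NA, "south", "north"))
)

# Numeric variable: add a 0/1 missingness indicator, then replace NA by a
# constant (0 here, an arbitrary choice) so that the record is kept.
d$income_miss <- as.integer(is.na(d$income))
d$income[is.na(d$income)] <- 0

# Categorical variable: make NA an explicit level of the factor.
d$region <- addNA(d$region)

d
levels(d$region)
```

With the indicator in the model, the fill-in constant only affects the coefficient of the indicator, so records with missing values still contribute to the comparison.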
Value
An object of class utility.gen, which is a list including the utility
  measures and their expected null values for each synthetic set, with the following
  components:
- call
 the call that produced the result.
- m
 number of synthetic data sets in object.
- method
 method used to fit propensity score.
- tree.method
 cart function used to fit propensity score when method = "cart".
- resamp.method
 type of resampling used to get pMSEExp and pval.
- maxorder
 see above.
- vars
 see above.
- nfix
 see above.
- aggregate
 see above.
- maxit
 see above.
- ngroups
 see above.
- df
 degrees of freedom for the chi-squared test for logit models derived from the number of non-aliased coefficients in the logistic model, minus 1 for k.syn = FALSE.
- mincriterion
 see above.
- nperms
 see above.
- incomplete
 TRUE/FALSE indicator if any of the variables being compared are not synthesised.
- pMSE
 propensity score mean square error from the utility model or a vector of these values if object$m > 1.
- S_pMSE
 ratio(s) of pMSE to its null expectation.
- PO50
 percentage of records, over and above 50%, for which the model correctly predicts whether a record is real or synthetic, for each synthetic data set.
- S_PO50
 ratio(s) of PO50 to its null expectation.
- SPECKS
 Kolmogorov-Smirnov statistic to compare the propensity scores for the original and synthetic records.
- S_SPECKS
 ratio(s) of SPECKS to its null expectation.
- print.stats
 see above.
- fit
 the fitted model for the propensity score or a list of fitted models of length m if m > 0.
- nosplits
 for resampling methods and cart models, a list of the number of times from the total each resampled cart model failed to select any splits to classify the indicator. Indicates that this method is not working correctly and results should not be used but a logit model selected instead.
- digits
 see above.
- print.ind.results
 see above.
- print.zscores
 see above.
- zthresh
 see above.
- print.variable.importance
 see above.
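The PO50 and SPECKS components listed above can both be computed from a vector of propensity scores. The sketch below shows one plausible reading of their definitions on simulated data with equal numbers of observed and synthetic records. The 0.5 cut-off and the object names are assumptions of the sketch, and the package may treat ties and unequal sizes differently.

```r
# Sketch of PO50 and SPECKS from propensity scores (illustration only).
set.seed(5)
n <- 500
obs <- data.frame(x = rnorm(n))
syn <- data.frame(x = rnorm(n, mean = 0.4))
combined <- rbind(cbind(obs, t = 0), cbind(syn, t = 1))
p <- fitted(glm(t ~ x, data = combined, family = binomial))

# PO50: percentage of records classified correctly at a 0.5 cut-off, minus 50.
pred_syn <- p > 0.5
PO50 <- 100 * mean(pred_syn == (combined$t == 1)) - 50

# SPECKS: largest gap between the empirical distribution functions of the
# propensity scores of the observed and the synthetic records.
F_obs <- ecdf(p[combined$t == 0])
F_syn <- ecdf(p[combined$t == 1])
SPECKS <- max(abs(F_obs(p) - F_syn(p)))

c(PO50 = PO50, SPECKS = SPECKS)
```

Both measures are close to zero when the two sets of scores overlap completely, and they grow as the model separates observed from synthetic records.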
References
Woo, M-J., Reiter, J.P., Oganian, A. and Karr, A.F. (2009). Global measures of data utility for microdata masked for disclosure limitation. Journal of Privacy and Confidentiality, 1(1), 111-124.
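Rosenbaum, P.R. and Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.
Rosenbaum, P.R. and Rubin, D.B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387), 516-524.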
Snoke, J., Raab, G.M., Nowok, B., Dibben, C. and Slavkovic, A. (2018). General and specific utility measures for synthetic data. Journal of the Royal Statistical Society: Series A, 181, Part 3, 663-688.
Examples
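if (FALSE) {
  library(synthpop)  # provides syn(), utility.gen() and the SD2011 data

  # observed data: first 1000 records, five variables
  ods <- SD2011[1:1000, c("age", "bmi", "depress", "alcabuse", "nofriend")]
  # five syntheses; -8 is passed as the missing-data code for nofriend
  s1 <- syn(ods, m = 5, method = "parametric",
            cont.na = list(nofriend = -8))
  ### synthetic data provided as a 'synds' object
  u1 <- utility.gen(s1, ods)
  print(u1, print.zscores = TRUE, zthresh = 1, digits = 6)
  # numeric variables categorised into a target of 3 groups
  u2 <- utility.gen(s1, ods, ngroups = 3, print.flag = FALSE)
  print(u2, print.zscores = TRUE)
  # CART propensity model with nperms set to 20
  u3 <- utility.gen(s1, ods, method = "cart", nperms = 20)
  print(u3, print.variable.importance = TRUE)
  ### synthetic data provided as 'list'
  utility.gen(s1$syn, ods, cont.na = list(nofriend = -8))
  }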