Tabular utility
utility.tab.Rd
Produces tables from observed and synthesised data and calculates utility measures to compare them with their expectation if the synthesising model is correct.
It can also be used with synthetic data NOT created by syn(), but then an additional parameter cont.na might need to be provided.
Usage
# S3 method for synds
utility.tab(object, data, vars = NULL, ngroups = 5,
useNA = TRUE, max.table = 1e6,
print.tables = length(vars) < 4,
print.stats = c("pMSE", "S_pMSE", "df"),
print.zdiff = FALSE, print.flag = TRUE,
digits = 4, k.syn = FALSE, ...)
# S3 method for data.frame
utility.tab(object, data, vars = NULL, cont.na = NULL,
ngroups = 5, useNA = TRUE, max.table = 1e6,
print.tables = length(vars) < 4,
print.stats = c("pMSE", "S_pMSE", "df"),
print.zdiff = FALSE, print.flag = TRUE,
digits = 4, k.syn = FALSE, ...)
# S3 method for list
utility.tab(object, data, vars = NULL, cont.na = NULL,
ngroups = 5, useNA = TRUE, max.table = 1e6,
print.tables = length(vars) < 4,
print.stats = c("pMSE", "S_pMSE", "df"),
print.zdiff = FALSE, print.flag = TRUE,
digits = 4, k.syn = FALSE, ...)
# S3 method for utility.tab
print(x, print.tables = NULL,
print.zdiff = NULL, print.stats = NULL,
digits = NULL, ...)
Arguments
- object
an object of class synds, which stands for 'synthesised data set'. It is typically created by function syn() or syn.strata() and it includes object$m, the number of synthesised data set(s), as well as object$syn, the synthesised data set if m = 1, or a list of m such data sets. Alternatively, when data are synthesised not using syn(), it can be a data frame with a synthetic data set or a list of data frames with synthetic data sets, all created from the same original data with the same variables and the same method.
- data
the original (observed) data set.
- vars
a single string or a vector of strings with the names of variables to be used to form the table.
- cont.na
a named list of codes for missing values for continuous variables if different from the R missing data code NA. The names of the list elements must correspond to the names of the variables for which the missing data codes need to be specified.
- max.table
a maximum table size. You could try increasing the default value, but memory problems are likely.
- ngroups
if numerical (non-factor) variables are included, they will be classified into this number of groups to form tables. Classification is performed using the classIntervals() function for n = ngroups. By default, style = "quantile" to get appropriate groups for skewed data. Problems for variables with a small number of unique values are handled by selecting only unique values of breaks. Arguments of classIntervals() may, however, be specified in the call to utility.tab().
- useNA
determines if NA values are to be included in tables.
- print.tables
a logical value that determines if tables of observed and synthesised data are to be printed. By default tables are printed if they have up to three dimensions.
- print.stats
a single string or a vector of strings that determines which utility measures to print. Must be a selection from: "VW", "FT", "JSD", "SPECKS", "WMabsDD", "U", "G", "pMSE", "PO50", "MabsDD", "dBhatt", "S_VW", "S_FT", "S_JSD", "S_WMabsDD", "S_G", "S_pMSE", "df", "dfG". If print.stats = "all", all of these will be printed. For more information see the details section below.
- print.zdiff
a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.
- print.flag
a logical value that determines if messages are to be printed during computation.
- digits
an integer indicating the number of decimal places for printing statistics, tab.zdiff and mean results for m > 1.
- k.syn
a logical indicator as to whether the sample size itself has been synthesised. The default value is FALSE, which will apply to synthetic data created by synthpop.
- ...
additional parameters; can be passed to classIntervals() function.
- x
an object of class utility.tab.
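The grouping described for ngroups and cont.na can be approximated in base R. The sketch below is not the package's code: group_numeric() is a hypothetical helper, it uses quantile() and cut() in place of classIntervals(), and it assumes left-closed intervals and a separate category for each missing-value code, as the tables in the Examples section suggest.

```r
# Hypothetical helper, not part of synthpop: quantile-based grouping of a
# numeric variable, with missing-value codes (as in cont.na) kept separate.
group_numeric <- function(x, ngroups = 5, na_code = NULL) {
  is_code <- x %in% na_code                     # all FALSE when na_code is NULL
  valid <- x[!is_code & !is.na(x)]
  # Quantile breaks; duplicated breaks are dropped, so a variable with few
  # unique values ends up with fewer than `ngroups` groups.
  breaks <- unique(quantile(valid, probs = seq(0, 1, length.out = ngroups + 1)))
  grouped <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)
  out <- as.character(grouped)
  out[is_code] <- as.character(x[is_code])      # each code forms its own category
  factor(out, levels = c(levels(grouped), as.character(na_code)))
}

x <- c(rep(0, 6), 1, 2, 3, 10, -8, -8)          # toy data; -8 is a missing code
g <- group_numeric(x, ngroups = 5, na_code = -8)
table(g)
```

With these toy data, five groups were requested but only three distinct intervals remain after duplicate breaks are removed, plus one category for the code -8. The sketch does not handle a variable with a single unique value, for which cut() would fail.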
Details
Forms tables of observed and synthesised values for the variables specified in vars. Several utility measures are calculated from the cells of the tables, as described below. Details of all of these measures can be found in Raab et al. (2021). If the synthesising model is correct, the measures VW, FT, G and JSD should have chi-square distributions with df degrees of freedom for large samples. Standardised versions of each measure are available (e.g. S_VW for VW, where S_VW = VW/df) that will have an expected value of 1 if the synthesising model is correct.
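As a small illustration of the standardisation, the sketch below takes VW, S_VW and df from the first three rows of the Examples output on this page and checks that S_VW is VW divided by df. The chi-square tail probabilities are an added illustration of how the large-sample distribution might be used, not something utility.tab() is documented to print.

```r
# Values copied from the first three rows of the Examples output on this page.
VW   <- c(13.2321, 8.7500, 28.6949)
S_VW <- c(1.2029, 0.7955, 2.6086)
df   <- 11

# Standardised measure: ratio to the degrees of freedom (expected value 1).
round(VW / df, 4)

# Upper-tail chi-square probabilities; small values suggest lack of fit.
p_value <- pchisq(VW, df = df, lower.tail = FALSE)
round(p_value, 3)
```

Ratios well above 1, such as the third synthesis here, correspond to small tail probabilities.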
Four other measures are calculated by considering the table as a prediction model: the propensity score mean-squared error pMSE and, from a comparison of propensity scores for the synthetic and original data, the Kolmogorov-Smirnov statistic SPECKS, the Wilcoxon rank-sum statistic U, and PO50, the percentage of observations correctly predicted in the combined tables over 50%, where the majority of observations in each grouping are in agreement with the category (real or synthetic) of the observation. The first of these, pMSE, is identical to VW except for a constant. No expected values are computed for the last three of these measures, but they can be obtained by replication from utility.gen().
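The statement that pMSE and VW differ only by a constant can be checked with a short derivation. It rests on two assumptions that are not spelled out here: that the tabular propensity score of a cell with observed count o_i and synthetic count s_i is s_i/(o_i + s_i), and that VW has the form shown in the second line. Both are consistent with the Examples output on this page, where the observed and synthetic data have equal size (so the reference propensity is 1/2) and N is the combined number of records, but neither is taken from the package source.

```latex
\begin{align*}
\mathrm{pMSE}
  &= \frac{1}{N}\sum_i (o_i + s_i)\left(\frac{s_i}{o_i + s_i} - \frac{1}{2}\right)^2
   = \frac{1}{N}\sum_i \frac{(s_i - o_i)^2}{4\,(o_i + s_i)}, \\
\mathrm{VW}
  &= \sum_i \frac{(s_i - o_i)^2}{(s_i + o_i)/2}
   = 2\sum_i \frac{(s_i - o_i)^2}{o_i + s_i}, \\
\mathrm{pMSE} &= \frac{\mathrm{VW}}{8N}.
\end{align*}
```

For the first synthesis in the Examples output, N = 2000 and VW = 13.2321 give pMSE of about 0.0008, in line with the printed value.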
Three further measures are calculated from the tables. The first is the mean absolute difference in distributions, MabsDD, the average absolute difference in the proportions of original and synthetic data over all the cells in the table. The second is a weighted version of this measure, WMabsDD, where the weights are proportional to the inverse of the variance of the absolute differences so that this measure can be standardised by its expected value, df. The third is the Bhattacharyya distance dBhatt, derived from the overlap of the histograms of the original and synthetic data sets.
Value
An object of class utility.tab, which is a list with the following components:
- m
number of synthetic data sets in object, i.e. object$m.
- VW
a vector with object$m values for the Voas-Williamson utility measure; linearly related to pMSE.
- FT
a vector with object$m values for the Freeman-Tukey utility measure.
- JSD
a vector with object$m values for the Jensen-Shannon divergence for comparing the tables.
- SPECKS
a vector with object$m values for the Kolmogorov-Smirnov statistic for comparing the propensity scores for the original and synthetic data.
- WMabsDD
a vector with object$m values of the weighted mean absolute difference in distributions for original and synthetic data.
- U
a vector with object$m values of the Wilcoxon statistic comparing the propensity scores for the original and synthetic data.
- G
a vector with object$m values for the adjusted likelihood ratio utility measure.
- pMSE
a vector with object$m values of the propensity score mean-squared error; linearly related to VW.
- PO50
a vector with object$m values of the percentage over 50% of observations correctly predicted from the propensity scores; linearly related to SPECKS and MabsDD.
- MabsDD
a vector with object$m values of the mean absolute difference in distributions for original and synthetic data; linearly related to SPECKS and PO50.
- dBhatt
a vector with object$m values of the Bhattacharyya distances between the synthetic and original data; linearly related to the square root of FT.
- S_VW
VW/df.
- S_FT
FT/df.
- S_JSD
JSD/df.
- S_WMabsDD
WMabsDD/df.
- S_G
G/dfG.
- S_pMSE
standardised measure from pMSE, identical to S_VW.
- df
a vector of degrees of freedom for the chi-square tests, equal to the number of cells in the tables with any observed or synthesised counts minus one when k.syn == FALSE, or equal to the number of cells when k.syn == TRUE.
- dfG
degrees of freedom used in standardising G.
- nempty
a vector of length object$m with number of cells not contributing to the statistics.
- tab.obs
a table from the observed data.
- tab.syn
a table or a list of m tables from the synthetic data.
- tab.zdiff
a table or a list of m tables of Z statistics for differences between observed and synthesised cells of the tables. Large absolute values indicate a large contribution to lack-of-fit.
- digits
an integer indicating the number of decimal places for printing statistics, tab.zdiff and mean results for m > 1.
- print.tables
a logical value that determines if tables of observed and synthesised data are to be printed.
- print.stats
a single string or a vector of strings with utility measures to be printed out.
- print.zdiff
a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.
- n
number of observations in the original dataset.
- k.syn
a logical indicator as to whether the sample size itself has been synthesised.
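The linear relationships noted for PO50, MabsDD and SPECKS can be seen directly in the Examples output on this page. The sketch below copies three rows of that output; the constants it finds (50 and 2) are read off this particular output, where observed and synthetic data have equal size, and are not guaranteed for other settings.

```r
# Values copied from the first three rows of the Examples output on this page.
SPECKS <- c(0.022, 0.024, 0.066)
PO50   <- c(1.10, 1.20, 3.30)
MabsDD <- c(0.044, 0.048, 0.132)

# Constant ratios indicate linear relationships through the origin.
PO50 / SPECKS
MabsDD / SPECKS
```

Because the three measures are proportional to one another here, printing more than one of them adds no information when comparing syntheses.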
References
Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.
Raab, G.M., Nowok, B. and Dibben, C. (2021). Assessing, visualizing and improving the utility of synthetic data. Available from https://arxiv.org/abs/2109.12717.
Read, T.R.C. and Cressie, N.A.C. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.
Voas, D. and Williamson, P. (2001). Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.
Examples
ods <- SD2011[1:1000, c("sex", "age", "marital", "nofriend")]
s1 <- syn(ods, m = 10, cont.na = list(nofriend = -8))
#>
#> Synthesis number 1
#> --------------------
#> sex age marital nofriend
#>
#> Synthesis number 2
#> --------------------
#> sex age marital nofriend
#>
#> Synthesis number 3
#> --------------------
#> sex age marital nofriend
#>
#> Synthesis number 4
#> --------------------
#> sex age marital nofriend
#>
#> Synthesis number 5
#> --------------------
#> sex age marital nofriend
#>
#> Synthesis number 6
#> --------------------
#> sex age marital nofriend
#>
#> Synthesis number 7
#> --------------------
#> sex age marital nofriend
#>
#> Synthesis number 8
#> --------------------
#> sex age marital nofriend
#>
#> Synthesis number 9
#> --------------------
#> sex age marital nofriend
#>
#> Synthesis number 10
#> --------------------
#> sex age marital nofriend
utility.tab(s1, ods, vars = c("marital", "sex"), print.stats = "all")
#>
#> Observed:
#> ($tab.obs)
#> sex
#> marital MALE FEMALE
#> SINGLE 156 109
#> MARRIED 268 313
#> WIDOWED 10 93
#> DIVORCED 11 30
#> LEGALLY SEPARATED 2 0
#> DE FACTO SEPARATED 0 5
#> <NA> 1 2
#>
#> Mean of 10 synthetic tables ($tab.syn):
#> sex
#> marital MALE FEMALE
#> SINGLE 150.0 113.6
#> MARRIED 274.6 307.5
#> WIDOWED 10.1 91.8
#> DIVORCED 12.3 31.1
#> LEGALLY SEPARATED 1.2 0.0
#> DE FACTO SEPARATED 0.0 4.4
#> <NA> 1.3 2.1
#>
#> Selected utility measures from 10 syntheses:
#> VW FT JSD SPECKS WMabsDD U G pMSE PO50 MabsDD
#> 1 13.2321 17.5233 0.0027 0.022 11.4785 515641.0 8.6658 0.0008 1.10 0.044
#> 2 8.7500 12.7740 0.0019 0.024 9.9198 517981.5 4.6827 0.0005 1.20 0.048
#> 3 28.6949 32.8633 0.0055 0.066 20.6471 541808.0 25.7377 0.0018 3.30 0.132
#> 4 8.7084 10.8429 0.0017 0.026 9.7026 518120.5 6.4673 0.0005 1.30 0.052
#> 5 15.4531 15.7516 0.0028 0.050 13.9184 529536.0 16.6441 0.0010 2.50 0.100
#> 6 12.5744 14.6264 0.0024 0.033 10.9798 524711.5 10.8317 0.0008 1.65 0.066
#> 7 7.3763 7.8225 0.0014 0.012 8.1789 510351.5 8.7406 0.0005 0.60 0.024
#> 8 7.0810 9.1092 0.0014 0.025 10.1135 517469.0 5.0488 0.0004 1.25 0.050
#> 9 10.3883 10.4533 0.0019 0.040 11.9292 525658.0 10.2727 0.0006 2.00 0.080
#> 10 17.2808 21.3630 0.0034 0.039 14.3949 528451.5 13.6157 0.0011 1.95 0.078
#> dBhatt S_VW S_FT S_JSD S_WMabsDD S_G S_pMSE df dfG
#> 1 0.0468 1.2029 1.5930 1.4161 1.0435 0.8666 1.2029 11 10
#> 2 0.0400 0.7955 1.1613 0.9755 0.9018 0.4683 0.7955 11 10
#> 3 0.0641 2.6086 2.9876 2.8715 1.8770 2.5738 2.6086 11 10
#> 4 0.0368 0.7917 0.9857 0.9053 0.8821 0.6467 0.7917 11 10
#> 5 0.0444 1.4048 1.4320 1.4804 1.2653 1.5131 1.4048 11 11
#> 6 0.0428 1.1431 1.3297 1.2660 0.9982 1.0832 1.1431 11 10
#> 7 0.0313 0.6706 0.7111 0.7249 0.7435 0.7946 0.6706 11 11
#> 8 0.0337 0.6437 0.8281 0.7448 0.9194 0.5049 0.6437 11 10
#> 9 0.0361 0.9444 0.9503 0.9869 1.0845 0.9339 0.9444 11 11
#> 10 0.0517 1.5710 1.9421 1.7862 1.3086 1.3616 1.5710 11 10
s2 <- syn(ods, m = 1, cont.na = list(nofriend = -8))
#>
#> Synthesis
#> -----------
#> sex age marital nofriend
u2 <- utility.tab(s2, ods, vars = c("marital", "age", "sex"), ngroups = 3)
print(u2, print.tables = TRUE, print.zdiff = TRUE)
#>
#> Observed:
#> ($tab.obs)
#> , , sex = MALE
#>
#> age
#> marital [16,36) [36,56) [56,92]
#> SINGLE 125 24 7
#> MARRIED 30 121 117
#> WIDOWED 1 2 7
#> DIVORCED 0 4 7
#> LEGALLY SEPARATED 0 1 1
#> DE FACTO SEPARATED 0 0 0
#> <NA> 0 0 1
#>
#> , , sex = FEMALE
#>
#> age
#> marital [16,36) [36,56) [56,92]
#> SINGLE 97 7 5
#> MARRIED 64 141 108
#> WIDOWED 1 11 81
#> DIVORCED 2 19 9
#> LEGALLY SEPARATED 0 0 0
#> DE FACTO SEPARATED 1 3 1
#> <NA> 1 1 0
#>
#>
#> Synthesised:
#> ($tab.syn)
#> , , sex = MALE
#>
#> age
#> marital [16,36) [36,56) [56,92]
#> SINGLE 117 21 8
#> MARRIED 31 119 125
#> WIDOWED 1 1 8
#> DIVORCED 0 3 4
#> LEGALLY SEPARATED 0 1 1
#> DE FACTO SEPARATED 0 0 0
#> <NA> 0 0 2
#>
#> , , sex = FEMALE
#>
#> age
#> marital [16,36) [36,56) [56,92]
#> SINGLE 102 8 9
#> MARRIED 48 147 109
#> WIDOWED 3 9 81
#> DIVORCED 5 17 13
#> LEGALLY SEPARATED 0 0 0
#> DE FACTO SEPARATED 1 3 0
#> <NA> 3 0 0
#>
#>
#> Table of z-scores for differences:
#> ($tab.zdiff)
#> , , sex = MALE
#>
#> age
#> marital [16,36) [36,56) [56,92]
#> SINGLE -0.7273 -0.6325 0.3651
#> MARRIED 0.1811 -0.1826 0.7273
#> WIDOWED 0.0000 -0.8165 0.3651
#> DIVORCED -0.5345 -1.2792
#> LEGALLY SEPARATED 0.0000 0.0000
#> DE FACTO SEPARATED
#> <NA> 0.8165
#>
#> , , sex = FEMALE
#>
#> age
#> marital [16,36) [36,56) [56,92]
#> SINGLE 0.5013 0.3651 1.5119
#> MARRIED -2.1381 0.5000 0.0960
#> WIDOWED 1.4142 -0.6325 0.0000
#> DIVORCED 1.6036 -0.4714 1.2060
#> LEGALLY SEPARATED
#> DE FACTO SEPARATED 0.0000 0.0000 -1.4142
#> <NA> 1.4142 -1.4142
#>
#>
#> Selected utility measures:
#> pMSE S_pMSE df
#> 1 0.0016 0.8398 30
### synthetic data provided as 'data.frame'
utility.tab(s2$syn, ods, vars = c("marital", "nofriend"), ngroups = 3,
print.tables = TRUE, cont.na = list(nofriend = -8), digits = 4)
#>
#> Observed adjusted to match the size of the synthetic data:
#> ($tab.obs)
#> nofriend
#> marital [0,4) [4,8) [8,99] -8
#> SINGLE 78 92 94 1
#> MARRIED 193 185 198 5
#> WIDOWED 46 24 31 2
#> DIVORCED 14 15 12 0
#> LEGALLY SEPARATED 2 0 0 0
#> DE FACTO SEPARATED 3 1 1 0
#> <NA> 0 2 1 0
#>
#> Synthesised:
#> ($tab.syn)
#> nofriend
#> marital [0,4) [4,8) [8,99] -8
#> SINGLE 77 107 80 1
#> MARRIED 188 191 196 4
#> WIDOWED 38 30 34 1
#> DIVORCED 12 13 16 1
#> LEGALLY SEPARATED 1 1 0 0
#> DE FACTO SEPARATED 2 2 0 0
#> <NA> 2 1 2 0
#>
#> Selected utility measures:
#> pMSE S_pMSE df
#> 1 0.0015 1.0303 23
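The statistics printed for the last example above can be reproduced from its two tables in base R. This is a reconstruction under stated assumptions rather than the package's code: the expected count per cell is taken as the mean of the observed and synthetic counts, the z-score as the difference divided by the square root of that expectation (a form that also matches the z-scores printed in the second example), VW as the sum of squared z-scores, and pMSE as VW/(8N) for equal-sized data.

```r
# Counts copied from the last example above (marital by nofriend), row by row.
obs <- matrix(c(78, 92, 94, 1,   193, 185, 198, 5,   46, 24, 31, 2,
                14, 15, 12, 0,     2,   0,   0, 0,    3,  1,  1, 0,
                 0,  2,  1, 0), ncol = 4, byrow = TRUE)
syn <- matrix(c(77, 107, 80, 1,  188, 191, 196, 4,   38, 30, 34, 1,
                12,  13, 16, 1,    1,   1,   0, 0,    2,  2,  0, 0,
                 2,   1,  2, 0), ncol = 4, byrow = TRUE)

used   <- (obs + syn) > 0            # cells with any observed or synthetic count
df     <- sum(used) - 1              # k.syn = FALSE case
expect <- (obs + syn) / 2            # assumed expected count per cell
z      <- ((syn - obs) / sqrt(expect))[used]   # assumed z-score form
VW     <- sum(z^2)                   # assumed Voas-Williamson form
pMSE   <- VW / (8 * sum(obs + syn))  # equal-size relation, assumed

round(c(pMSE = pMSE, S_pMSE = VW / df, df = df), 4)
```

Four of the 28 cells are empty in both tables, which leaves 24 contributing cells and df = 23. The rounded results (0.0015, 1.0303 and 23) agree with the values printed above.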