Produces tables from observed and synthesised data and calculates utility measures to compare them with their expectation if the synthesising model is correct.

It can also be used with synthetic data not created by syn(), in which case the additional parameter cont.na may need to be provided.

Usage

# S3 method for synds
utility.tab(object, data, vars = NULL, ngroups = 5,
            useNA = TRUE, max.table = 1e6,
            print.tables = length(vars) < 4,
            print.stats = c("pMSE", "S_pMSE", "df"),
            print.zdiff = FALSE, print.flag = TRUE,
            digits = 4, k.syn = FALSE, ...)

# S3 method for data.frame
utility.tab(object, data, vars = NULL, cont.na = NULL,
            ngroups = 5, useNA = TRUE, max.table = 1e6,
            print.tables = length(vars) < 4,
            print.stats = c("pMSE", "S_pMSE", "df"),
            print.zdiff = FALSE, print.flag = TRUE,
            digits = 4, k.syn = FALSE, ...)

# S3 method for list
utility.tab(object, data, vars = NULL, cont.na = NULL,
            ngroups = 5, useNA = TRUE, max.table = 1e6,
            print.tables = length(vars) < 4,
            print.stats = c("pMSE", "S_pMSE", "df"),
            print.zdiff = FALSE, print.flag = TRUE,
            digits = 4, k.syn = FALSE, ...)


# S3 method for utility.tab
print(x, print.tables = NULL,
      print.zdiff = NULL, print.stats = NULL,
      digits = NULL, ...)

Arguments

object

an object of class synds, which stands for 'synthesised data set'. It is typically created by the function syn() or syn.strata() and includes object$m, the number of synthesised data sets, and object$syn, which is the synthesised data set if m = 1 or a list of m such data sets otherwise. Alternatively, when data are synthesised without syn(), it can be a data frame containing a synthetic data set or a list of data frames with synthetic data sets, all created from the same original data with the same variables and the same method.

data

the original (observed) data set.

vars

a single string or a vector of strings with the names of variables to be used to form the table.

cont.na

a named list of codes for missing values for continuous variables if different from the R missing data code NA. The names of the list elements must correspond to the names of the variables for which the missing data codes need to be specified.

max.table

a maximum table size. You could try increasing the default value, but memory problems are likely.

ngroups

if numerical (non-factor) variables are included, they are classified into this number of groups to form tables. Classification is performed by the classIntervals() function with n = ngroups. By default, style = "quantile" is used to get appropriate groups for skewed data. Problems with variables that have a small number of unique values are handled by keeping only the unique break values. Other arguments of classIntervals() may, however, be specified in the call to utility.tab().

useNA

determines if NA values are to be included in tables.

print.tables

a logical value that determines if tables of observed and synthesised data are to be printed. By default tables are printed if they have up to three dimensions.

print.stats

a single string or a vector of strings that determines which utility measures to print. Must be a selection from: "VW", "FT", "JSD", "SPECKS", "WMabsDD", "U", "G", "pMSE", "PO50", "MabsDD", "dBhatt", "S_VW", "S_FT", "S_JSD", "S_WMabsDD", "S_G", "S_pMSE", "df", "dfG". If print.stats = "all", all of these will be printed. For more information see the details section below.

print.zdiff

a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.

print.flag

a logical value that determines if messages are to be printed during computation.

digits

an integer indicating the number of decimal places for printing statistics, tab.zdiff and mean results for m > 1.

k.syn

a logical indicator as to whether the sample size itself has been synthesised. The default value is FALSE, which will apply to synthetic data created by synthpop.

...

additional parameters; can be passed to classIntervals() function.

x

an object of class utility.tab.
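
The grouping described for ngroups can be pictured with a minimal base R sketch. It is an approximation under stated assumptions, not the code that utility.tab() runs: quantile() and cut() stand in for classIntervals(), which may compute quantile breaks differently, and the vector x is made up for illustration.

```r
# Hypothetical skewed variable with many tied small values
x <- c(0, 0, 0, 1, 1, 2, 3, 5, 8, 40)
ngroups <- 5

# Quantile breaks for ngroups groups; ties can produce duplicated breaks
probs <- seq(0, 1, length.out = ngroups + 1)
breaks <- unique(quantile(x, probs = probs, na.rm = TRUE))

# Keeping only unique breaks gives fewer groups than requested
# instead of an error from cut()
groups <- cut(x, breaks = breaks, include.lowest = TRUE, right = FALSE)
table(groups, useNA = "ifany")
```

Here five groups are requested but only four are formed, because two of the quantile breaks coincide at 0.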

Details

Forms tables of observed and synthesised values for the variables specified in vars. Several utility measures are calculated from the cells of the tables, as described below. Details of all of these measures can be found in Raab et al. (2021).

If the synthesising model is correct, the measures VW, FT, G and JSD should have chi-square distributions with df degrees of freedom for large samples. Standardised versions of each measure are available (e.g. S_VW for VW, where S_VW = VW/df) that will have an expected value of 1 if the synthesising model is correct.

Four other measures are calculated by considering the table as a prediction model: the propensity score mean-squared error pMSE; the Kolmogorov-Smirnov statistic SPECKS and the Wilcoxon rank-sum statistic U, both from a comparison of propensity scores for the synthetic and original data; and PO50, the percentage over 50% of observations correctly predicted in the combined tables, where the majority of observations in each grouping agree with the category (real or synthetic) of the observation. The first of these, pMSE, is identical to VW except for a constant. No expected values are computed for the last three of these measures, but they can be obtained by replication from utility.gen().

Three further measures are calculated from the tables. The first is MabsDD, the mean absolute difference in distributions: the average absolute difference in the proportions of original and synthetic data over all the cells in the table. The second, WMabsDD, is a weighted version of this measure in which the weights are proportional to the inverse of the variance of the absolute differences, so that the measure can be standardised by its expected value, df. The third is dBhatt, the Bhattacharyya distance derived from the overlap of the histograms of the original and synthetic data sets.
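
As a worked illustration, the base R sketch below recomputes the z-scores, VW, df and S_VW from the cell counts printed for the three-way table (marital, age, sex) in the Examples section. It is not the package's internal code. The z-score formula is inferred from that printed output and assumes observed and synthetic data sets of equal size. The constant linking pMSE to VW is likewise an assumption, chosen to agree with the printed pMSE.

```r
# Cell counts copied from the printed tables (males first, then females;
# rows of marital status, three age groups each)
obs <- c(125, 24, 7, 30, 121, 117, 1, 2, 7, 0, 4, 7, 0, 1, 1, 0, 0, 0, 0, 0, 1,
         97, 7, 5, 64, 141, 108, 1, 11, 81, 2, 19, 9, 0, 0, 0, 1, 3, 1, 1, 1, 0)
syn <- c(117, 21, 8, 31, 119, 125, 1, 1, 8, 0, 3, 4, 0, 1, 1, 0, 0, 0, 0, 0, 2,
         102, 8, 9, 48, 147, 109, 3, 9, 81, 5, 17, 13, 0, 0, 0, 1, 3, 0, 3, 0, 0)

# Only cells with any observed or synthesised counts contribute
used <- (obs + syn) > 0

# z-score per cell (formula inferred from the printed tab.zdiff)
z <- (syn[used] - obs[used]) / sqrt((obs[used] + syn[used]) / 2)

VW   <- sum(z^2)       # Voas Williamson measure
df   <- sum(used) - 1  # k.syn = FALSE: non-empty cells minus one
S_VW <- VW / df        # expected value 1 under a correct model

# Upper-tail chi-square probability for VW under a correct model
p_value <- pchisq(VW, df = df, lower.tail = FALSE)

# Assumed constant linking pMSE to VW; cc is the synthetic share of records
N  <- sum(obs) + sum(syn)
cc <- sum(syn) / N
pMSE <- VW * cc * (1 - cc)^2 / N

round(c(VW = VW, df = df, S_VW = S_VW, pMSE = pMSE), 4)
```

Under these assumptions the sketch gives df = 30, S_VW = 0.8398 and pMSE = 0.0016, matching the printed utility measures for that example.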

Value

An object of class utility.tab which is a list with the following components:

m

number of synthetic data sets in object, i.e. object$m.

VW

a vector with object$m values for the Voas Williamson utility measure; linearly related to pMSE.

FT

a vector with object$m values for the Freeman-Tukey utility measure.

JSD

a vector with object$m values for the Jensen-Shannon divergence for comparing the tables.

SPECKS

a vector with object$m values for the Kolmogorov-Smirnov statistic for comparing the propensity scores for the original and synthetic data.

WMabsDD

a vector with object$m values of the weighted mean absolute difference in distributions for original and synthetic data.

U

a vector with object$m values of the Wilcoxon statistic comparing the propensity scores for the original and synthetic data.

G

a vector with object$m values for the adjusted likelihood ratio utility measure.

pMSE

a vector with object$m values of the propensity score mean-squared error; linearly related to VW.

PO50

a vector with object$m values of the percentage over 50% of observations correctly predicted from the propensity scores; linearly related to SPECKS and MabsDD.

MabsDD

a vector with object$m values of the mean absolute difference in distributions for original and synthetic data; linearly related to SPECKS and PO50.

dBhatt

a vector with object$m values of the Bhattacharyya distance between the synthetic and original data; linearly related to the square root of FT.

S_VW

VW/df.

S_FT

FT/df.

S_JSD

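standardised measure from JSD. JSD is rescaled before division by df, so S_JSD is not the raw ratio JSD/df (compare the JSD and S_JSD columns in the example output).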

S_WMabsDD

WMabsDD/df.

S_G

G/dfG.

S_pMSE

standardised measure from pMSE, identical to S_VW.

df

a vector of degrees of freedom for the chi-square tests, equal to the number of cells in the tables with any observed or synthesised counts minus one when k.syn == FALSE, or to the number of such cells when k.syn == TRUE.

dfG

degrees of freedom used in standardising G.

nempty

a vector of length object$m with number of cells not contributing to the statistics.

tab.obs

a table from the observed data.

tab.syn

a table or a list of m tables from the synthetic data.

tab.zdiff

a table or a list of m tables of Z statistics for differences between observed and synthesised cells of the tables. Large absolute values indicate a large contribution to lack-of-fit.

digits

an integer indicating the number of decimal places for printing statistics, tab.zdiff and mean results for m > 1.

print.tables

a logical value that determines if tables of observed and synthesised data are to be printed.

print.stats

a single string or a vector of strings with utility measures to be printed out.

print.zdiff

a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.

n

number of observations in the original data set.

k.syn

a logical indicator as to whether the sample size itself has been synthesised.

References

Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.

Raab, G.M., Nowok, B. and Dibben, C. (2021). Assessing, visualizing and improving the utility of synthetic data. Available from https://arxiv.org/abs/2109.12717.

Read, T.R.C. and Cressie, N.A.C. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.

Voas, D. and Williamson, P. (2001). Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.

Examples

ods <- SD2011[1:1000, c("sex", "age", "marital", "nofriend")]

s1 <- syn(ods, m = 10, cont.na = list(nofriend = -8))
#> 
#> Synthesis number 1
#> --------------------
#>  sex age marital nofriend
#> 
#> Synthesis number 2
#> --------------------
#>  sex age marital nofriend
#> 
#> Synthesis number 3
#> --------------------
#>  sex age marital nofriend
#> 
#> Synthesis number 4
#> --------------------
#>  sex age marital nofriend
#> 
#> Synthesis number 5
#> --------------------
#>  sex age marital nofriend
#> 
#> Synthesis number 6
#> --------------------
#>  sex age marital nofriend
#> 
#> Synthesis number 7
#> --------------------
#>  sex age marital nofriend
#> 
#> Synthesis number 8
#> --------------------
#>  sex age marital nofriend
#> 
#> Synthesis number 9
#> --------------------
#>  sex age marital nofriend
#> 
#> Synthesis number 10
#> --------------------
#>  sex age marital nofriend
utility.tab(s1, ods, vars = c("marital", "sex"), print.stats = "all")
#> 
#> Observed:
#> ($tab.obs)
#>                     sex
#> marital              MALE FEMALE
#>   SINGLE              156    109
#>   MARRIED             268    313
#>   WIDOWED              10     93
#>   DIVORCED             11     30
#>   LEGALLY SEPARATED     2      0
#>   DE FACTO SEPARATED    0      5
#>   <NA>                  1      2
#> 
#> Mean of 10 synthetic tables ($tab.syn):
#>                     sex
#> marital               MALE FEMALE
#>   SINGLE             150.0  113.6
#>   MARRIED            274.6  307.5
#>   WIDOWED             10.1   91.8
#>   DIVORCED            12.3   31.1
#>   LEGALLY SEPARATED    1.2    0.0
#>   DE FACTO SEPARATED   0.0    4.4
#>   <NA>                 1.3    2.1
#> 
#> Selected utility measures from 10 syntheses:
#>         VW      FT    JSD SPECKS WMabsDD        U       G   pMSE PO50 MabsDD
#> 1  13.2321 17.5233 0.0027  0.022 11.4785 515641.0  8.6658 0.0008 1.10  0.044
#> 2   8.7500 12.7740 0.0019  0.024  9.9198 517981.5  4.6827 0.0005 1.20  0.048
#> 3  28.6949 32.8633 0.0055  0.066 20.6471 541808.0 25.7377 0.0018 3.30  0.132
#> 4   8.7084 10.8429 0.0017  0.026  9.7026 518120.5  6.4673 0.0005 1.30  0.052
#> 5  15.4531 15.7516 0.0028  0.050 13.9184 529536.0 16.6441 0.0010 2.50  0.100
#> 6  12.5744 14.6264 0.0024  0.033 10.9798 524711.5 10.8317 0.0008 1.65  0.066
#> 7   7.3763  7.8225 0.0014  0.012  8.1789 510351.5  8.7406 0.0005 0.60  0.024
#> 8   7.0810  9.1092 0.0014  0.025 10.1135 517469.0  5.0488 0.0004 1.25  0.050
#> 9  10.3883 10.4533 0.0019  0.040 11.9292 525658.0 10.2727 0.0006 2.00  0.080
#> 10 17.2808 21.3630 0.0034  0.039 14.3949 528451.5 13.6157 0.0011 1.95  0.078
#>    dBhatt   S_VW   S_FT  S_JSD S_WMabsDD    S_G S_pMSE df dfG
#> 1  0.0468 1.2029 1.5930 1.4161    1.0435 0.8666 1.2029 11  10
#> 2  0.0400 0.7955 1.1613 0.9755    0.9018 0.4683 0.7955 11  10
#> 3  0.0641 2.6086 2.9876 2.8715    1.8770 2.5738 2.6086 11  10
#> 4  0.0368 0.7917 0.9857 0.9053    0.8821 0.6467 0.7917 11  10
#> 5  0.0444 1.4048 1.4320 1.4804    1.2653 1.5131 1.4048 11  11
#> 6  0.0428 1.1431 1.3297 1.2660    0.9982 1.0832 1.1431 11  10
#> 7  0.0313 0.6706 0.7111 0.7249    0.7435 0.7946 0.6706 11  11
#> 8  0.0337 0.6437 0.8281 0.7448    0.9194 0.5049 0.6437 11  10
#> 9  0.0361 0.9444 0.9503 0.9869    1.0845 0.9339 0.9444 11  11
#> 10 0.0517 1.5710 1.9421 1.7862    1.3086 1.3616 1.5710 11  10

s2 <- syn(ods, m = 1, cont.na = list(nofriend = -8))
#> 
#> Synthesis
#> -----------
#>  sex age marital nofriend
u2 <- utility.tab(s2, ods, vars = c("marital", "age", "sex"), ngroups = 3)
print(u2, print.tables = TRUE, print.zdiff = TRUE)
#> 
#> Observed:
#> ($tab.obs)
#> , , sex = MALE
#> 
#>                     age
#> marital              [16,36) [36,56) [56,92]
#>   SINGLE                 125      24       7
#>   MARRIED                 30     121     117
#>   WIDOWED                  1       2       7
#>   DIVORCED                 0       4       7
#>   LEGALLY SEPARATED        0       1       1
#>   DE FACTO SEPARATED       0       0       0
#>   <NA>                     0       0       1
#> 
#> , , sex = FEMALE
#> 
#>                     age
#> marital              [16,36) [36,56) [56,92]
#>   SINGLE                  97       7       5
#>   MARRIED                 64     141     108
#>   WIDOWED                  1      11      81
#>   DIVORCED                 2      19       9
#>   LEGALLY SEPARATED        0       0       0
#>   DE FACTO SEPARATED       1       3       1
#>   <NA>                     1       1       0
#> 
#> 
#> Synthesised: 
#> ($tab.syn)
#> , , sex = MALE
#> 
#>                     age
#> marital              [16,36) [36,56) [56,92]
#>   SINGLE                 117      21       8
#>   MARRIED                 31     119     125
#>   WIDOWED                  1       1       8
#>   DIVORCED                 0       3       4
#>   LEGALLY SEPARATED        0       1       1
#>   DE FACTO SEPARATED       0       0       0
#>   <NA>                     0       0       2
#> 
#> , , sex = FEMALE
#> 
#>                     age
#> marital              [16,36) [36,56) [56,92]
#>   SINGLE                 102       8       9
#>   MARRIED                 48     147     109
#>   WIDOWED                  3       9      81
#>   DIVORCED                 5      17      13
#>   LEGALLY SEPARATED        0       0       0
#>   DE FACTO SEPARATED       1       3       0
#>   <NA>                     3       0       0
#> 
#> 
#> Table of z-scores for differences: 
#> ($tab.zdiff)
#> , , sex = MALE
#> 
#>                     age
#> marital              [16,36) [36,56) [56,92]
#>   SINGLE             -0.7273 -0.6325  0.3651
#>   MARRIED             0.1811 -0.1826  0.7273
#>   WIDOWED             0.0000 -0.8165  0.3651
#>   DIVORCED                   -0.5345 -1.2792
#>   LEGALLY SEPARATED           0.0000  0.0000
#>   DE FACTO SEPARATED                        
#>   <NA>                                0.8165
#> 
#> , , sex = FEMALE
#> 
#>                     age
#> marital              [16,36) [36,56) [56,92]
#>   SINGLE              0.5013  0.3651  1.5119
#>   MARRIED            -2.1381  0.5000  0.0960
#>   WIDOWED             1.4142 -0.6325  0.0000
#>   DIVORCED            1.6036 -0.4714  1.2060
#>   LEGALLY SEPARATED                         
#>   DE FACTO SEPARATED  0.0000  0.0000 -1.4142
#>   <NA>                1.4142 -1.4142        
#> 
#> 
#> Selected utility measures:
#>     pMSE S_pMSE df
#> 1 0.0016 0.8398 30

### synthetic data provided as 'data.frame'
utility.tab(s2$syn, ods, vars = c("marital", "nofriend"), ngroups = 3,
            print.tables = TRUE, cont.na = list(nofriend = -8), digits = 4)
#> 
#> Observed adjusted to match the size of the synthetic data:
#> ($tab.obs)
#>                     nofriend
#> marital              [0,4) [4,8) [8,99]  -8
#>   SINGLE                78    92     94   1
#>   MARRIED              193   185    198   5
#>   WIDOWED               46    24     31   2
#>   DIVORCED              14    15     12   0
#>   LEGALLY SEPARATED      2     0      0   0
#>   DE FACTO SEPARATED     3     1      1   0
#>   <NA>                   0     2      1   0
#> 
#> Synthesised: 
#> ($tab.syn)
#>                     nofriend
#> marital              [0,4) [4,8) [8,99]  -8
#>   SINGLE                77   107     80   1
#>   MARRIED              188   191    196   4
#>   WIDOWED               38    30     34   1
#>   DIVORCED              12    13     16   1
#>   LEGALLY SEPARATED      1     1      0   0
#>   DE FACTO SEPARATED     2     2      0   0
#>   <NA>                   2     1      2   0
#> 
#> Selected utility measures:
#>     pMSE S_pMSE df
#> 1 0.0015 1.0303 23