Compare univariate distributions of synthesised and observed data
compare.synds.Rd
Compare synthesised data set with the original (observed) data set
using percent frequency tables and histograms. When more than one
synthetic data set has been generated (object$m > 1), pooled synthetic
data are used for comparison by default.
This function can also be used with synthetic data NOT created by
syn(), but then an additional parameter, cont.na, might need to be
provided.
Usage
# S3 method for synds
compare(object, data, vars = NULL,
        msel = NULL, stat = "percents", breaks = 20,
        nrow = 2, ncol = 2, rel.size.x = 1,
        utility.stats = c("pMSE", "S_pMSE", "df"),
        utility.for.plot = "S_pMSE",
        cols = c("#1A3C5A","#4187BF"),
        plot = TRUE, table = FALSE, ...)
# S3 method for data.frame
compare(object, data, vars = NULL, cont.na = NULL,
        msel = NULL, stat = "percents", breaks = 20,
        nrow = 2, ncol = 2, rel.size.x = 1,
        utility.stats = c("pMSE", "S_pMSE", "df"),
        utility.for.plot = "S_pMSE",
        cols = c("#1A3C5A","#4187BF"),
        plot = TRUE, table = FALSE, ...)
# S3 method for list
compare(object, data, vars = NULL, cont.na = NULL,
        msel = NULL, stat = "percents", breaks = 20,
        nrow = 2, ncol = 2, rel.size.x = 1,
        utility.stats = c("pMSE", "S_pMSE", "df"),
        utility.for.plot = "S_pMSE",
        cols = c("#1A3C5A","#4187BF"),
        plot = TRUE, table = FALSE, ...)
# S3 method for compare.synds
print(x, ...)
Arguments
- object
An object of class synds, which stands for 'synthesised data set'. It is typically created by the function syn() and includes object$m synthesised data set(s) as object$syn. Alternatively, when data are synthesised without using syn(), it can be a data frame with a synthetic data set or a list of data frames with synthetic data sets, all created from the same original data with the same variables and the same method.
- data
An original (observed) data set.
- vars
Variables to be compared. If vars is NULL (the default), all synthesised variables are compared.
- msel
Index or indices of synthetic data copies for which a comparison is to be made. If NULL, pooled synthetic data copies are compared with the original data.
- stat
Determines whether tables and plots present percentages (stat = "percents"), the default, or counts (stat = "counts"). If m > 1 and msel = NULL, average counts for synthetic data are derived.
- breaks
The number of cells for the histogram.
- nrow
The number of rows for the plotting area.
- ncol
The number of columns for the plotting area.
- rel.size.x
A number representing the relative size of x-axis labels.
- utility.stats
A single string or a vector of strings that determines which utility measures to print. Must be a selection from: VW, FT, JSD, SPECKS, WMabsDD, U, G, pMSE, PO50, MabsDD, dBhatt, S_VW, S_FT, S_JSD, S_WMabsDD, S_G, S_pMSE, df. If utility.stats = "all", all of these will be printed. For more information see the details section for utility.tab.
- utility.for.plot
A single string that determines which utility measure to print in facet labels of the plot. Set to NULL to print variable names only.
- cols
Bar colors.
- plot
A logical value with default set to TRUE indicating whether plots should be produced.
- table
A logical value with default set to FALSE indicating whether tables should be printed.
- ...
Additional parameters.
- cont.na
A named list of codes for missing values for continuous variables if different from the R missing data code NA. The names of the list elements must correspond to the names of the variables for which the missing data codes need to be specified.
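The interaction between msel and stat described above can be illustrated with a small base R sketch. It is not the package's internal code: the toy data and the names lev, obs, syn_list, pooled, pct, cnt and cnt_copy2 are hypothetical, and it only assumes that pooling means stacking all synthetic copies, so that pooled counts divided by m give average counts per copy.

```r
# Hypothetical toy data: one observed factor and m = 3 synthetic copies of it.
set.seed(1)
lev <- c("low", "mid", "high")
obs <- factor(sample(lev, 200, replace = TRUE), levels = lev)
syn_list <- lapply(1:3, function(i) {
  factor(sample(lev, 200, replace = TRUE), levels = lev)
})
m <- length(syn_list)

# Pool all synthetic copies (analogous to msel = NULL).
pooled <- factor(unlist(lapply(syn_list, as.character)), levels = lev)

# Percentages (analogous to stat = "percents"): each row sums to 100.
pct <- rbind(
  observed  = 100 * prop.table(table(obs)),
  synthetic = 100 * prop.table(table(pooled))
)

# Counts (analogous to stat = "counts"): pooled counts divided by m
# give the average count per synthetic copy.
cnt <- rbind(
  observed  = table(obs),
  synthetic = table(pooled) / m
)

# Counts for a single copy (analogous to msel = 2).
cnt_copy2 <- table(syn_list[[2]])

print(round(pct, 1))
print(cnt)
```

Averaging keeps the synthetic counts on the same scale as the observed counts when every copy has the same number of rows as the original data, which is the assumption made in this sketch.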
Value
An object of class compare.synds, which is a list including a list of
comparative frequency tables (tables) and a ggplot object (plots) with
bar charts/histograms. If multiple plots are produced, they and their
corresponding frequency tables are stored as a list.
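A minimal sketch of working with the returned object, assuming the synthpop package is installed and its SD2011 data set is available. The seed value and the name res are arbitrary choices made for this illustration, and the exact structure of the components may differ between package versions.

```r
library(synthpop)

ods <- SD2011[, c("sex", "age", "edu", "marital", "ls", "income")]
s1 <- syn(ods, cont.na = list(income = -8), seed = 2024)

# Assigning the result keeps the object without auto-printing it.
res <- compare(s1, ods, vars = "ls")

class(res)   # class of the returned object
names(res)   # components stored in the list

# Comparative frequency table(s) and the ggplot object described above.
res$tables
res$plots
```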
Details
Missing data categories for numeric variables are plotted on the same plot
as non-missing values. They are indicated by a "miss." suffix.
Numeric variables with fewer than 6 distinct values are changed to factors in order to make plots more readable.
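The two rules above can be sketched in base R. This is only an illustration of the idea, not the code used inside compare(): the helpers split_missing() and as_plot_variable() and the toy vectors are hypothetical.

```r
# Hypothetical helper: separate missing values (NA or a user-supplied code,
# as passed through cont.na) from the non-missing values of a numeric variable.
split_missing <- function(x, na_code = NA) {
  miss <- is.na(x)
  if (!is.na(na_code)) miss <- miss | x %in% na_code
  list(values = x[!miss], n_missing = sum(miss))
}

# Hypothetical helper: numeric variables with fewer than 6 distinct
# non-missing values are converted to factors.
as_plot_variable <- function(x) {
  if (is.numeric(x) && length(unique(x[!is.na(x)])) < 6) factor(x) else x
}

income   <- c(1200, -8, 900, NA, 1500, -8)
children <- c(0, 1, 2, 1, 0, 3)
age      <- c(21, 34, 45, 52, 67, 73, 18)

inc <- split_missing(income, na_code = -8)
inc$values                         # 1200 900 1500
inc$n_missing                      # 3

class(as_plot_variable(children))  # "factor" (4 distinct values)
class(as_plot_variable(age))       # "numeric" (7 distinct values)
```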
References
Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. DOI: 10.18637/jss.v074.i11
Examples
if (FALSE) {
  ods <- SD2011[, c("sex", "age", "edu", "marital", "ls", "income")]
  s1 <- syn(ods, cont.na = list(income = -8))

  # synthetic data provided as a 'synds' object
  compare(s1, ods, vars = "ls")
  compare(s1, ods,
    vars = "income", stat = "counts",
    table = TRUE, breaks = 10
  )

  # synthetic data provided as 'data.frame'
  compare(s1$syn, ods, vars = "ls")
  compare(s1$syn, ods,
    vars = "income", cont.na = list(income = -8),
    stat = "counts", table = TRUE, breaks = 10
  )
}