Generating synthetic data sets
Generates synthetic version(s) of a data set. Function syn.strata()
performs stratified synthesis.
Usage
syn(data, method = "cart", visit.sequence = (1:ncol(data)),
predictor.matrix = NULL,
m = 1, k = nrow(data), proper = FALSE,
minnumlevels = 1, maxfaclevels = 60,
rules = NULL, rvalues = NULL,
cont.na = NULL, semicont = NULL,
smoothing = NULL, event = NULL, denom = NULL,
drop.not.used = FALSE, drop.pred.only = FALSE,
default.method = c("normrank", "logreg", "polyreg", "polr"),
numtocat = NULL, catgroups = rep(5, length(numtocat)),
models = FALSE, print.flag = TRUE, seed = "sample", ...)
syn.strata(data, strata = NULL,
minstratumsize = 10 + 10 * length(visit.sequence),
tab.strataobs = TRUE, tab.stratasyn = FALSE,
method = "cart", visit.sequence = (1:ncol(data)),
predictor.matrix = NULL,
m = 1, k = nrow(data), proper = FALSE,
minnumlevels = 1, maxfaclevels = 60,
rules = NULL, rvalues = NULL,
cont.na = NULL, semicont = NULL,
smoothing = NULL, event = NULL, denom = NULL,
drop.not.used = FALSE, drop.pred.only = FALSE,
default.method = c("normrank", "logreg", "polyreg", "polr"),
numtocat = NULL, catgroups = rep(5,length(numtocat)),
models = FALSE, print.flag = TRUE, seed = "sample", ...)
# S3 method for synds
print(x, ...)
Arguments
- data
a data frame or a matrix (n x p) containing the original data. Observations are in rows and variables are in columns.
- method
a single string or a vector of strings of length ncol(data) specifying the synthesising method to be used for each variable in the data. The order of variables is exactly the same as in data. If specified as a single string, the same method is used for all variables in a visit sequence unless a data type or a position in a visit sequence requires a different method. If method is set to "parametric", the default synthesising methods specified by the default.method argument are applied. Variables that are transformations of other variables can be synthesised using a passive method that is specified as a string starting with ~ (see syn.passive). Variables that need not be synthesised have the empty method "". By default all variables are synthesised using the "cart" method, which is an rpart implementation of a CART model (see syn.cart). See Details for more information on method.
- visit.sequence
a character vector of names of variables or an integer vector of their column indices specifying the order of synthesis. The default sequence 1:ncol(data) implies that column variables are synthesised from left to right. See Details for more information.
- predictor.matrix
a square matrix of size ncol(data) specifying the set of column predictors to be used for each target variable in the row. Each entry has value 0 or 1. A value of 1 means that the column variable is used as a predictor for the row variable. The order of variables is exactly the same as in data. By default all variables that are earlier in the visit sequence are used as predictors. For the default visit sequence (1:ncol(data)) the default predictor.matrix will have values of 1 in the lower triangle. See Details for more information.
- m
number of synthetic copies of the original (observed) data to be generated. The default is m = 1.
- k
the size of the synthetic data set (k x p), which can be smaller or greater than the size of the original data set (n x p). The default is nrow(data), which means that the number of individuals in the synthesised data is the same as in the original (observed) data (k = n).
- proper
a logical value with default set to FALSE. If TRUE, proper synthesis is conducted.
- minnumlevels
a minimum number of values a numeric variable should exceed to be treated as numeric during the synthesis. Numeric variables with only minnumlevels or fewer distinct values are changed into factors. If set to 1 (default), numeric variables are left unchanged unless they have only one non-missing value.
- maxfaclevels
a maximum number of factor levels that can be handled. It can be increased to allow the synthesis to run but too large a value may cause computational problems, especially for parametric methods.
- rules
a named list of rules for restricted values. Restricted values are those that are determined explicitly by values of other variables. The names of the list elements must correspond to the names of the variables for which the rules need to be specified.
- rvalues
a named list of the values corresponding to the rules specified by rules.
- cont.na
a named list of codes for missing values for continuous variables if different from the R missing data code NA. The names of the list elements must correspond to the names of the variables for which the missing data codes need to be specified.
- semicont
a named list of values at which semi-continuous variables have spikes. The names of the list elements must correspond to the names of the semi-continuous variables.
- smoothing
a single string specifying a smoothing method for all numeric variables in the data or a named list specifying a smoothing method to be used for selected variables. Available methods include "spline" (recommended), "rmean", "density", and "". Smoothing can only be applied to continuous variables synthesised using the sample, ctree, cart, rf, bag, ranger, normrank, pmm or nested method. The names of the list elements must correspond to the names of the variables whose values are to be smoothed. Smoothing is applied to the synthesised values. For more details see syn.smooth.
- event
a named list specifying for survival data the names of corresponding event indicators. The names of the list elements must correspond to the names of the survival variables.
- denom
a named list specifying for variables to be modelled using binomial regression the names of corresponding denominator variables. The names of the list elements must correspond to the names of the variables to be modelled using binomial regression.
- drop.not.used
a logical value. If TRUE, variables not used in synthesis are not saved in the synthesised data and are not included in the corresponding synthesis parameters. The default shown in Usage is FALSE.
- drop.pred.only
a logical value. If TRUE, variables not synthesised and used as predictors only are not saved in the synthesised data. The default shown in Usage is FALSE.
- default.method
a vector of four strings containing the default parametric synthesising methods for numerical variables, factors with two levels, unordered factors with more than two levels and ordered factors with more than two levels respectively. They are used when method is set to "parametric" or when there is an inconsistency between variable type and provided method.
- numtocat
a vector of numbers or names to indicate columns of data that are to be divided into groups to allow the grouped variables to be synthesised as factors. The target number of groups for each variable is specified by catgroups. After the grouped variables have been synthesised the numeric variables are synthesised from them by the method syn.nested and are placed in the same position in the synthetic data as in the original. The grouped variables are not stored in the synthetic data. If you want to keep the categorised values you should change the relevant variables in data with the function numtocat.syn() before running syn().
- catgroups
an integer or a vector of integers of the same length as numtocat giving the target number of groups into which each of the numeric variables is to be categorised. The function group_var from the classInt package performs the categorisation.
- models
if TRUE, parameters of models fitted to the original data and used to generate the synthetic values are stored.
- print.flag
if TRUE (default), synthesising history and information messages will be printed at the console. For silent computation use print.flag = FALSE.
- seed
an integer to be used as an argument for set.seed(). If no integer is provided, the default "sample" will generate one and it will be stored. To prevent generating an integer set seed to NA.
- ...
additional arguments to be passed to synthesising functions. See section 'Details' below for more information.
- strata
a numeric vector with strata identifiers or a string vector with names of stratifying variable(s).
- minstratumsize
minimum size of each stratum.
- tab.strataobs
a logical value indicating whether a frequency table of the number of observations in strata in the original data set should be printed.
- tab.stratasyn
a logical value indicating whether a frequency table of the number of observations in strata in the synthetic data set(s) should be printed.
- x
an object of class synds; a result of a call to syn().
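As an informal illustration of how the rules and rvalues arguments relate, the base R sketch below evaluates a rule string on a toy data frame and overwrites the matching rows with the restricted value. The toy data and the helper name apply_rule are hypothetical, and the sketch is only one reading of the argument descriptions, not the code that syn() runs internally.

```r
# Hypothetical helper (not part of synthpop): evaluate a rule given as a
# string within a data frame and set the target variable to the restricted
# value wherever the rule holds.
apply_rule <- function(data, target, rule, rvalue) {
  hit <- eval(parse(text = rule), envir = data)
  hit <- !is.na(hit) & hit          # treat NA comparisons as "rule not met"
  data[[target]][hit] <- rvalue
  data
}

# Toy data, invented for this illustration only
toy <- data.frame(
  sex     = c("MALE", "FEMALE", "MALE", "MALE"),
  age     = c(16, 17, 40, NA),
  marital = c("MARRIED", "MARRIED", "MARRIED", "SINGLE"),
  stringsAsFactors = FALSE
)

fixed <- apply_rule(toy, "marital", "age < 18 & sex == 'MALE'", "SINGLE")
print(fixed$marital)
```

Only the first row satisfies the rule, so only its marital value is replaced; the names of the rules and rvalues lists play the role of target here.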
Details
Only variables that are in visit.sequence with a corresponding non-empty method are synthesised. The only exceptions are event indicators: they are synthesised along with the corresponding time-to-event variables and should not be included in visit.sequence. All other variables (not in visit.sequence, or in visit.sequence with a corresponding blank method) can be used as predictors. Including them in visit.sequence generates a default predictor.matrix reflecting the order of variables in the visit.sequence; otherwise predictor.matrix has to be adjusted accordingly. All predictors of the variables that are not in visit.sequence, or are in visit.sequence but with a blank method, are removed from predictor.matrix.
Variables to be synthesised that are not synthesised yet cannot be used as predictors. Also, all variables used in passive synthesis or in restricted values rules (rules) have to be synthesised before the variables they apply to.
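The relationship between the visit sequence and the default predictor matrix can be sketched in base R. The helper default_predictor_matrix below is hypothetical (it is not a synthpop function) and only mirrors the rule stated above: each visited variable is predicted by the variables visited before it, and unvisited variables get empty rows and columns.

```r
# Hypothetical helper (not part of synthpop): build a 0/1 matrix in which
# row i marks the predictors (columns) of variable i, namely the variables
# that come earlier in the visit sequence.
default_predictor_matrix <- function(var_names, visit_sequence) {
  p <- length(var_names)
  pm <- matrix(0L, nrow = p, ncol = p,
               dimnames = list(var_names, var_names))
  for (i in seq_along(visit_sequence)) {
    if (i > 1) {
      target  <- visit_sequence[i]
      earlier <- visit_sequence[seq_len(i - 1)]
      pm[target, earlier] <- 1L
    }
  }
  pm
}

vars <- c("sex", "age", "marital", "income", "ls", "smoke")

# Default left-to-right sequence: ones fill the lower triangle
pm_default <- default_predictor_matrix(vars, seq_along(vars))

# Custom sequence sex, age, ls, marital: marital is predicted by ls
pm_custom <- default_predictor_matrix(vars, c(1, 2, 5, 3))
print(pm_custom)
```

With the custom sequence, the entry in row marital and column ls is 1. This is the entry that a user would set to 0 to stop ls being used as a predictor of marital.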
A mismatch between data type and synthesising method stops execution and prints an error message, but numeric variables with no more than minnumlevels distinct values are changed into factors and their methods are changed automatically, if necessary, to methods for categorical variables. Methods for variables not in a visit sequence will be changed into blank.
The built-in elementary synthesising methods defined by conditional distributions include:
- ctree, cart
classification and regression trees (CART), see syn.cart
- bagging, random forests, ranger
methods using ensembles of CART trees, see syn.bag, syn.rf, and syn.ranger
- survctree
classification and regression trees (CART) for duration time data (parametric methods for survival data are not implemented yet), see syn.survctree
- norm
normal linear regression, see syn.norm
- normrank
normal linear regression preserving the marginal distribution, see syn.normrank
- lognorm, sqrtnorm, cubertnorm
normal linear regression after natural logarithmic, square root and cube root transformation of a dependent variable respectively, see syn.lognorm
- logreg
logistic regression, see syn.logreg
- polyreg
unordered polytomous regression, see syn.polyreg
- polr
ordered polytomous regression, see syn.polr
- pmm
predictive mean matching, see syn.pmm
- sample
random sample from the observed data, see syn.sample
- passive
function of other synthesised data, see syn.passive
- nested
bootstrap sample within each category of the original grouping variable, see syn.nested
- satcat
bootstrap sample within each category of the crosstabulation of all the predictor variables, see syn.satcat
The following methods synthesise a group of variables together. These variables must always be together at the start of the visit sequence:
- catall
fit a saturated log-linear model, see syn.catall
- ipf
fit a log-linear model, defined by its margins, by iterative proportional fitting, see syn.ipf
The functions corresponding to these methods are called syn.method, where method is a string with the name of a synthesising method. For instance, the function corresponding to the ctree method is called syn.ctree. A new synthesising method can be introduced by writing a function named syn.newmethod and then specifying the method parameter of the syn() function as "newmethod".
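A minimal sketch of such a user-defined method follows. The method name margnorm is hypothetical, and the interface is an assumption modelled on the built-in methods: arguments y (observed values), x (observed predictors) and xp (synthesised predictors), with a returned list holding the synthetic values in res and a description of the fitted model in fit. Check the help for a built-in method such as syn.norm before relying on this signature.

```r
# Hypothetical custom method "margnorm": draws synthetic values from a normal
# distribution fitted to the observed values and ignores the predictors.
# Assumed interface (modelled on built-in methods, not confirmed here):
#   y  - observed values of the variable to synthesise
#   x  - observed predictors
#   xp - synthesised predictors, one row per synthetic record
# Assumed return value: list(res = synthetic values, fit = fitted model info)
syn.margnorm <- function(y, x, xp, ...) {
  mu    <- mean(y, na.rm = TRUE)
  sigma <- sd(y, na.rm = TRUE)
  res   <- rnorm(NROW(xp), mean = mu, sd = sigma)
  list(res = res, fit = c(mean = mu, sd = sigma))
}

# Stand-alone check on toy inputs (no synthpop needed)
set.seed(1)
y   <- c(10, 12, 9, 15, 11)
x   <- data.frame(z = 1:5)
xp  <- data.frame(z = c(2, 4, 1))
out <- syn.margnorm(y, x, xp)
print(length(out$res))
```

Under these assumptions the method would then be requested by passing "margnorm" in the method argument of syn() for the relevant variable.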
In order to use "nested" sampling, the method parameter of the syn function has to be specified as "nested.varname", where "varname" is the name of the grouped (less detailed) variable, the only one used in nested synthesis. A variable synthesised using the "nested" method is excluded from synthesising other variables except when used for the "nested" method.
Additional parameters can be passed to synthesising methods as part of the dots argument. They have to be named using the period-separated method and parameter name (method.parameter). For instance, in order to set minbucket (the minimum number of observations in any terminal node of a CART model) for the ctree synthesising method, ctree.minbucket has to be specified. The parameters are method-specific and will be used for all variables to be synthesised using that method. See the help for syn.method for further details about the allowed parameters for a specific method.
Value
An object of class synds, which stands for 'synthesised data set'. The summary function (summary.synds) can be used to obtain a summary of the synthesised variables. The object is a list with the following components:
- call
the original call to syn().
- m
number of synthetic versions of the original (observed) data.
- syn
a data frame (for m = 1) or a list of m data frames (for m > 1) with synthetic data set(s).
- method
a vector of synthesising methods applied to each variable in the saved synthesised data.
- visit.sequence
a vector of column indices of the visiting sequence. The indices refer to the columns in the saved synthesised data.
- predictor.matrix
a matrix specifying the set of predictors used for each variable in the saved synthesised data.
- smoothing
a vector specifying smoothing methods applied to each variable in the saved synthesised data.
- event
a vector of integers specifying for survival data the column indices for corresponding event indicators. The indices refer to the columns in the saved synthesised data.
- denom
a vector of integers specifying for variables modelled using binomial regression the column indices for corresponding denominator variables. The indices refer to the columns in the saved synthesised data.
- proper
a logical value indicating whether proper synthesis was conducted.
- n
the number of cases in the original data.
- k
the number of cases in the synthesised data.
- rules
a list of rules for restricted values applied to the synthetic data.
- rvalues
a list of the values corresponding to the rules specified by rules.
- cont.na
a list of codes for missing values for continuous variables.
- semicont
a list of values for semi-continuous variables at which they have spikes.
- drop.not.used
a logical value indicating whether variables not used in synthesis are saved in the synthesised data and corresponding synthesis parameters.
- drop.pred.only
a logical value indicating whether variables not synthesised and used as predictors only are saved in the synthesised data.
- models
if models = TRUE, a named list of estimates of models fitted to the original data and used to generate the synthetic values, as returned in the $fit component of each method (e.g. syn.cart()). The list is ordered by the variables' position in the data, and any models used to predict missing values are appended to the list.
- seed
an integer used as a set.seed() argument.
- var.lab
a vector of variable labels for data imported from SPSS using read.obs().
- val.lab
a list of value labels for factors for data imported from SPSS using read.obs().
- obs.vars
a vector of all variable names in the observed data set.
When syn.strata() is used there are two additional components:
- strata.syn
a factor variable or a list of factor variables containing stratum values for all observation units in syn.
- strata.lab
a character vector of strata labels.
Note also that when syn.strata is used most values of the items are matrices with each row corresponding to a stratum or lists with one element per stratum.
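The components listed above can be inspected with the usual list accessors. The sketch below assumes the synthpop package and its SD2011 data set are installed. It is wrapped in a requireNamespace() guard so it does nothing otherwise, and the subset of rows and variables is chosen arbitrarily for illustration.

```r
# Runs only when synthpop is available; otherwise the block is skipped.
if (requireNamespace("synthpop", quietly = TRUE)) {
  library(synthpop)

  # Small arbitrary subset of the SD2011 data shipped with synthpop
  ods <- SD2011[1:200, c("sex", "age", "marital")]

  # Two synthetic copies, fixed seed, no console messages
  s <- syn(ods, m = 2, seed = 2024, print.flag = FALSE)

  print(class(s))          # class of the returned object
  print(s$m)               # number of synthetic data sets
  print(length(s$syn))     # for m > 1, syn is a list of m data frames
  print(dim(s$syn[[1]]))   # k rows (default nrow(ods)) by saved variables
  print(s$method)          # method applied to each saved variable
}
```

With m = 1 the syn component is a single data frame rather than a list, so code that handles both cases needs to check s$m before indexing.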
References
Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11
Examples
### selection of variables
vars <- c("sex","age","marital","income","ls","smoke")
ods <- SD2011[1:1000, vars]
### default synthesis
s1 <- syn(ods)
#>
#> Synthesis
#> -----------
#> sex age marital income ls smoke
s1
#> Call:
#> ($call) syn(data = ods)
#>
#> Number of synthesised data sets:
#> ($m) 1
#>
#> First rows of synthesised data set:
#> ($syn)
#> sex age marital income ls smoke
#> 1 MALE 49 MARRIED 1620 PLEASED NO
#> 2 MALE 41 SINGLE 2000 PLEASED NO
#> 3 FEMALE 20 SINGLE 640 MIXED NO
#> 4 MALE 75 MARRIED 1796 PLEASED NO
#> 5 MALE 23 SINGLE -8 MOSTLY SATISFIED NO
#> 6 MALE 19 SINGLE NA MIXED NO
#> ...
#>
#> Synthesising methods:
#> ($method)
#> sex age marital income ls smoke
#> "sample" "cart" "cart" "cart" "cart" "cart"
#>
#> Order of synthesis:
#> ($visit.sequence)
#> sex age marital income ls smoke
#> 1 2 3 4 5 6
#>
#> Matrix of predictors:
#> ($predictor.matrix)
#> sex age marital income ls smoke
#> sex 0 0 0 0 0 0
#> age 1 0 0 0 0 0
#> marital 1 1 0 0 0 0
#> income 1 1 1 0 0 0
#> ls 1 1 1 1 0 0
#> smoke 1 1 1 1 1 0
### synthesis with default parametric methods
s2 <- syn(ods, method = "parametric", seed = 123)
#>
#> Synthesis
#> -----------
#> sex age marital income ls smoke
s2$method
#> sex age marital income ls smoke
#> "sample" "normrank" "polyreg" "normrank" "polyreg" "polyreg"
### multiple synthesis of selected variables with customised methods
s3 <- syn(ods, visit.sequence = c(2, 1, 4, 5), m = 2,
method = c("logreg","sample","","normrank","ctree",""),
ctree.minbucket = 10)
#>
#> Variable(s): marital, smoke not synthesised or used in prediction.
#> CAUTION: The synthesised data will contain the variable(s) unchanged.
#>
#>
#> Synthesis number 1
#> --------------------
#> age sex income ls
#>
#> Synthesis number 2
#> --------------------
#> age sex income ls
summary(s3)
#> Synthetic object with 2 syntheses using methods:
#> sex age marital income ls smoke
#> "logreg" "sample" "" "normrank" "ctree" ""
#>
#> Summary (average) for all synthetic data sets:
#> sex age marital income
#> MALE :410 Min. :16.00 SINGLE :265 Min. : -8.0
#> FEMALE:590 1st Qu.:30.00 MARRIED :581 1st Qu.: 700.0
#> Median :47.00 WIDOWED :103 Median : 1195.0
#> Mean :46.86 DIVORCED : 41 Mean : 1400.5
#> 3rd Qu.:61.50 LEGALLY SEPARATED : 2 3rd Qu.: 1800.0
#> Max. :92.00 DE FACTO SEPARATED: 5 Max. :15000.0
#> NA's : 3 NA's : 135.5
#> ls smoke
#> DELIGHTED : 38.5 YES :253
#> PLEASED :368.5 NO :745
#> MOSTLY SATISFIED :329.0 NA's: 2
#> MIXED :187.5
#> MOSTLY DISSATISFIED: 57.5
#> UNHAPPY : 10.0
#> TERRIBLE : 9.0
summary(s3, msel = 1:2)
#> Synthetic object with 2 syntheses using methods:
#> sex age marital income ls smoke
#> "logreg" "sample" "" "normrank" "ctree" ""
#>
#> Summary for synthetic data set 1:
#> sex age marital income
#> MALE :429 Min. :16.00 SINGLE :265 Min. : -8
#> FEMALE:571 1st Qu.:30.00 MARRIED :581 1st Qu.: 700
#> Median :46.00 WIDOWED :103 Median : 1168
#> Mean :46.29 DIVORCED : 41 Mean : 1391
#> 3rd Qu.:61.00 LEGALLY SEPARATED : 2 3rd Qu.: 1800
#> Max. :92.00 DE FACTO SEPARATED: 5 Max. :15000
#> NA's : 3 NA's :136
#> ls smoke
#> DELIGHTED : 40 YES :253
#> PLEASED :364 NO :745
#> MOSTLY SATISFIED :325 NA's: 2
#> MIXED :198
#> MOSTLY DISSATISFIED: 58
#> UNHAPPY : 5
#> TERRIBLE : 10
#>
#> Summary for synthetic data set 2:
#> sex age marital income
#> MALE :391 Min. :16.00 SINGLE :265 Min. : -8
#> FEMALE:609 1st Qu.:30.00 MARRIED :581 1st Qu.: 700
#> Median :48.00 WIDOWED :103 Median : 1222
#> Mean :47.42 DIVORCED : 41 Mean : 1410
#> 3rd Qu.:62.00 LEGALLY SEPARATED : 2 3rd Qu.: 1800
#> Max. :92.00 DE FACTO SEPARATED: 5 Max. :15000
#> NA's : 3 NA's :135
#> ls smoke
#> DELIGHTED : 37 YES :253
#> PLEASED :373 NO :745
#> MOSTLY SATISFIED :333 NA's: 2
#> MIXED :177
#> MOSTLY DISSATISFIED: 57
#> UNHAPPY : 15
#> TERRIBLE : 8
### adjustment to the default predictor matrix
s4.ini <- syn(data = ods, visit.sequence = c(1, 2, 5, 3),
m = 0, drop.not.used = FALSE)
#>
#> Variable(s): income, smoke not synthesised or used in prediction.
#> CAUTION: The synthesised data will contain the variable(s) unchanged.
#>
pM.cor <- s4.ini$predictor.matrix
pM.cor["marital","ls"] <- 0
s4 <- syn(data = ods, visit.sequence = c(1, 2, 5, 3),
predictor.matrix = pM.cor)
#>
#> Variable(s): income, smoke not synthesised or used in prediction.
#> CAUTION: The synthesised data will contain the variable(s) unchanged.
#>
#>
#> Synthesis
#> -----------
#> sex age ls marital
### handling missing values in continuous variables
s5 <- syn(ods, cont.na = list(income = c(NA, -8)))
#>
#> Synthesis
#> -----------
#> sex age marital income ls smoke
### rules for restricted values - marital status of males under 18 should be 'single'
s6 <- syn(ods, rules = list(marital = "age < 18 & sex == 'MALE'"),
rvalues = list(marital = 'SINGLE'), method = "parametric", seed = 123)
#>
#> Synthesis
#> -----------
#> sex age marital income ls smoke
with(s6$syn, table(marital[age < 18 & sex == 'MALE']))
#>
#> SINGLE MARRIED WIDOWED DIVORCED
#> 11 0 0 0
#> LEGALLY SEPARATED DE FACTO SEPARATED
#> 0 0
### results for default parametric synthesis without the rule
with(s2$syn, table(marital[age < 18 & sex == 'MALE']))
#>
#> SINGLE MARRIED WIDOWED DIVORCED
#> 10 1 0 0
#> LEGALLY SEPARATED DE FACTO SEPARATED
#> 0 0
### synthesis with ipf for all variables
s7 <- syn(ods[, 1:3], method = "ipf", numtocat = "age")
#> **************************************************************
#> The numeric variable(s): age
#> will been synthesised as grouped variables and their numeric
#> values generated from boostrap samples within categories.
#> **************************************************************
#>
#> Synthesis
#> -----------
#> All 3 variables in the data synthesised together by method 'ipf'
#>
#> ['ipf' converged in 19 iterations]
#>
### alternatively group the numeric variable before synthesis to save
### the grouped data rather than the numeric in the synthetic data set
ods.cat <- numtocat.syn(ods, numtocat = "age", catgroups = 10)$data
#> Variable(s) age grouped into categories.
s8 <- syn(ods.cat[, 1:3], method = "ipf")
#>
#> Synthesis
#> -----------
#> All 3 variables in the data synthesised together by method 'ipf'
#>
#> ['ipf' converged in 19 iterations]
#>
### stratified synthesis
s9 <- syn.strata(ods, strata = "sex")
#> Number of observations in strata (original data):
#> MALE FEMALE
#> 448 552
#>
#> m = 1, strata = MALE
#> -----------------------------------------------------
#> Sample(s) of size 436 will be generated from original data of size 448.
#>
#> Variable sex has only one value so its method has been changed to "constant".
#> Variable sex removed as predictor because only one value.
#>
#> Method "cart" is not valid for a variable without predictors (age)
#> Method has been changed to "sample"
#>
#>
#> Synthesis
#> -----------
#> sex age marital income ls smoke
#>
#> m = 1, strata = FEMALE
#> -----------------------------------------------------
#> Sample(s) of size 564 will be generated from original data of size 552.
#>
#> Variable sex has only one value so its method has been changed to "constant".
#> Variable sex removed as predictor because only one value.
#>
#> Method "cart" is not valid for a variable without predictors (age)
#> Method has been changed to "sample"
#>
#>
#> Synthesis
#> -----------
#> sex age marital income ls smoke