Synthesis of a group of categorical variables from a saturated model
syn.catall.Rd
A saturated model is fitted to a table produced by cross-tabulating all the variables.
Usage
syn.catall(x, k, proper = FALSE, priorn = 1, structzero = NULL,
maxtable = 1e8, epsilon = 0, rand = TRUE, ...)
Arguments
- x
a data frame (n x p) of the set of original variables.
- k
the number of rows in each synthetic data set; defaults to n.
- proper
if proper = TRUE, x is replaced with a bootstrap sample before synthesis, thus effectively sampling from the posterior distribution of the model, given the data.
- priorn
the sum of the parameters of the Dirichlet prior, which can be thought of as a pseudo-count giving the number of observations that inform prior knowledge about the parameters.
- structzero
a named list of lists that defines which cells in the table are structural zeros. These cells remain zero in the synthetic data because their prior is left at zero. Each element of the structzero list is itself a list that describes a set of cells in the table defined by a combination of two or more variables. The name of each such element must consist of those variable names separated by an underscore, e.g. sex_edu. The length of each such element is determined by the number of variables, and each component gives the variable levels (numeric or labels) that define the structural zero cells (see the Examples section).
- maxtable
the number of cells in the cross-tabulation of all the variables that will trigger a severe warning.
- epsilon
measures the scale of the Laplace noise to be added under differential privacy (DP).
- rand
for DP versions, determines whether multinomial noise is added to the DP counts. If it is set to FALSE, the DP-adjusted counts are simply rounded to whole numbers in a manner that preserves the desired sample size (k).
- ...
additional parameters.
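The interplay of epsilon, rand and k can be pictured with a small base R sketch. It is illustrative only and not the package's implementation: the helper rlaplace, the toy counts, the noise scale of 1 / epsilon and the largest-remainder rounding rule are all assumptions made for this example.

```r
set.seed(42)
counts  <- c(40, 25, 0, 35)   # hypothetical cell counts
epsilon <- 1                  # assumed here: Laplace noise scale = 1 / epsilon
k       <- 100                # desired synthetic sample size

# Hypothetical helper: a Laplace(0, scale) draw is the difference of two
# independent exponential draws with rate 1 / scale.
rlaplace <- function(m, scale) {
  rexp(m, rate = 1 / scale) - rexp(m, rate = 1 / scale)
}

# Add noise, truncate negative counts at zero, rescale to total k.
noisy  <- pmax(counts + rlaplace(length(counts), scale = 1 / epsilon), 0)
target <- k * noisy / sum(noisy)

# Analogue of rand = FALSE: largest-remainder rounding keeps the total at k.
base    <- floor(target)
short   <- k - sum(base)
bump    <- order(target - base, decreasing = TRUE)[seq_len(short)]
rounded <- base
rounded[bump] <- rounded[bump] + 1

# Analogue of rand = TRUE: multinomial draw around the noisy proportions.
sampled <- as.vector(rmultinom(1, size = k, prob = noisy / sum(noisy)))
```

Both rounded and sampled sum to k; only the second adds multinomial sampling variation on top of the Laplace noise.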
Details
When used in the syn function, the group of categorical variables
with method = "catall" must all be together at the start of the
visit.sequence. Subsequent variables in visit.sequence are then
synthesised conditional on the synthesised values of the grouped variables.
A saturated model is fitted to a table produced by cross-tabulating all the
variables. Prior probabilities for the proportions in each cell of the table
are specified from the parameters of a Dirichlet distribution with the same
parameter for every cell in the table that is not a structural zero (see above).
The sum of these parameters is priorn, so that each one is \(priorn/N\),
where \(N\) is the number of cells in the table that are not structural zeros.
The default priorn = 1 can be thought of as equivalent to the knowledge
that 1 observation would be equally likely to be in any cell that is not
a structural zero. The posterior expectation, given the observed counts,
for the probability of being in a cell with observed count \(n_i\)
is thus \((n_i + priorn/N) / (n + priorn)\), where \(n\) is the total
number of observations. The synthetic data are generated
from a multinomial distribution with parameters given by these probabilities.
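The calculation described above can be traced with a short base R sketch. It is a minimal illustration on a hypothetical toy data frame, not the package's code; here n denotes the total number of observations and N the number of cells, and the names toy, post and synth are invented for the example.

```r
# Toy data: two categorical variables (hypothetical values).
toy <- data.frame(
  sex = factor(c("F", "F", "M", "M", "M", "F")),
  edu = factor(c("low", "high", "low", "low", "high", "high"))
)

priorn <- 1            # total prior pseudo-count
tab    <- table(toy)   # saturated model: the full cross-tabulation
N      <- length(tab)  # number of cells (no structural zeros here)
n      <- sum(tab)     # total number of observations

# Posterior expectation of each cell probability under the Dirichlet prior.
post <- (tab + priorn / N) / (n + priorn)

# Draw multinomial cell counts for k synthetic rows, then expand to rows.
set.seed(1)
k      <- 10
counts <- as.vector(rmultinom(1, size = k, prob = as.vector(post)))
cells  <- expand.grid(dimnames(tab), stringsAsFactors = TRUE)
synth  <- cells[rep(seq_len(N), times = counts), , drop = FALSE]
rownames(synth) <- NULL
```

Because every cell receives a positive prior weight, cells with an observed count of zero can still appear in synth, although with small probability.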
Unlike syn.satcat, which fits saturated conditional models,
the synthesised data can include any combination of variable levels, except
those defined as structural zeros in structzero.
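The difference between a structural zero and an ordinary empty cell can be sketched in base R as follows. This is a hedged illustration with hypothetical counts and level names, not the package's code: the structural cell gets a prior of zero and therefore a posterior probability of zero, while an empty but possible cell keeps a small positive probability.

```r
# Toy 2 x 3 table of observed counts (hypothetical values).
tab <- matrix(c(5, 0, 3, 2, 0, 4), nrow = 2,
              dimnames = list(size   = c("city", "rural"),
                              region = c("A", "B", "C")))

# Assume region C has no cities: that cell is a structural zero.
# The empty cell (rural, A) is merely unobserved.
structural <- matrix(FALSE, nrow = 2, ncol = 3, dimnames = dimnames(tab))
structural["city", "C"] <- TRUE

priorn <- 1
N      <- sum(!structural)                  # cells that are not structural zeros
prior  <- ifelse(structural, 0, priorn / N) # zero prior on structural cells
post   <- (tab + prior) / (sum(tab) + priorn)

# Multinomial draw: the structural cell can never receive a count.
set.seed(1)
counts <- matrix(rmultinom(1, size = 1000, prob = as.vector(post)),
                 nrow = 2, dimnames = dimnames(tab))
```

With these values, post["city", "C"] is exactly zero, whereas post["rural", "A"] is small but positive.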
NOTE that when the function is called by setting elements of method in
syn() to "catall", the parameters priorn, structzero, maxtable, epsilon,
and rand must be supplied to syn with the prefix catall., e.g. catall.priorn.
Value
A list with two components:
- res
a data frame of dimension k x p containing the synthesised data.
- fit
the cross-tabulation of all the original variables used.
Examples
library(synthpop)  # provides syn() and the SD2011 example data
ods <- SD2011[, c(1, 4, 5, 6, 2, 10, 11)]
table(ods[, c("placesize", "region")])
#>                          region
#> placesize                Dolnoslaskie Kujawsko-pomorskie Lodzkie Lubelskie
#>   URBAN 500,000 AND OVER           60                  0      94         0
#>   URBAN 200,000-500,000            30                 15       0         0
#>   URBAN 100,000-200,000            64                 28      66        41
#>   URBAN 20,000-100,000              0                 69       0        46
#>   URBAN BELOW 20,000               57                 45      31        28
#>   RURAL AREAS                     108                156     167       186
#>                          region
#> placesize                Lubuskie Malopolskie Mazowieckie Opolskie Podkarpackie
#>   URBAN 500,000 AND OVER        0          64         126        0            0
#>   URBAN 200,000-500,000        33          10          10       19           17
#>   URBAN 100,000-200,000        13          40          78       37           47
#>   URBAN 20,000-100,000          0           0          23        0            0
#>   URBAN BELOW 20,000           42          37          72       24           37
#>   RURAL AREAS                  65         220         261       73          212
#>                          region
#> placesize                Podlaskie Pomorskie Slaskie Swietokrzyskie
#>   URBAN 500,000 AND OVER         0         0       0              0
#>   URBAN 200,000-500,000          0         0     120              0
#>   URBAN 100,000-200,000         31        64     121             38
#>   URBAN 20,000-100,000          45        76      71             35
#>   URBAN BELOW 20,000            25        35      28             30
#>   RURAL AREAS                   92       131     160            127
#>                          region
#> placesize                Warminsko-mazurskie Wielkopolskie Zachodnio-pomorskie
#>   URBAN 500,000 AND OVER                   0            48                   0
#>   URBAN 200,000-500,000                   42            10                  21
#>   URBAN 100,000-200,000                   43            88                  44
#>   URBAN 20,000-100,000                     0             0                  42
#>   URBAN BELOW 20,000                      45            53                  53
#>   RURAL AREAS                            129           214                  88
# Each `placesize_region` sublist:
# for each relevant level of `placesize` defined in the first element,
# the second element defines regions (variable `region`) that do not
# have places of that size.
struct.zero <- list(
  placesize_region = list(placesize = "URBAN 500,000 AND OVER",
                          region = c(2, 4, 5, 8:13, 16)),
  placesize_region = list(placesize = "URBAN 200,000-500,000",
                          region = c(3, 4, 10:11, 13)),
  placesize_region = list(placesize = "URBAN 20,000-100,000",
                          region = c(1, 3, 5, 6, 8, 9, 14:15)))
syncatall <- syn(ods, method = c(rep("catall", 4), "ctree", "normrank", "ctree"),
                 catall.priorn = 2, catall.structzero = struct.zero)
#>
#> Synthesis
#> -----------
#> First 4 variables (sex, placesize, region, edu) synthesised together by method 'catall'
#> Error in sampler.syn(p, data, m, syn, visit.sequence, rules, rvalues, event, proper, print.flag, k, pred.not.syn, models, numtocat, ...): object 'struct.zero' not found
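Level indices such as those in struct.zero can be read off a cross-tabulation rather than typed by hand. The sketch below shows one possible way in base R, using a hypothetical toy table in place of SD2011; the helper zero_cols and the object candidate are invented for this illustration and are not part of synthpop. An observed zero is only a candidate: whether it is truly structural remains a judgement for the analyst.

```r
# Toy cross-tabulation (hypothetical counts, not taken from SD2011).
tab <- matrix(c(12, 0, 7,
                 0, 5, 9), nrow = 2, byrow = TRUE,
              dimnames = list(placesize = c("URBAN", "RURAL"),
                              region    = c("North", "South", "West")))

# Hypothetical helper: for one level of the row variable, return the indices
# of the column levels whose observed count is zero.
zero_cols <- function(tab, row_level) {
  unname(which(tab[row_level, ] == 0))
}

# Candidate structzero element in the format described under Arguments.
candidate <- list(
  placesize_region = list(placesize = "URBAN",
                          region    = zero_cols(tab, "URBAN"))
)
```

Here zero_cols(tab, "URBAN") returns the index of the South column, the only region with no URBAN observations in the toy table.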