Synthesis of a group of categorical variables by iterative proportional fitting
A fit to the table is obtained from the log-linear fit that matches the numbers in the margins specified by the margin parameters.
Usage
syn.ipf(x, k, proper = FALSE, priorn = 1, structzero = NULL,
        gmargins = "twoway", othmargins = NULL, tol = 1e-3,
        max.its = 5000, maxtable = 1e8, print.its = FALSE,
        epsilon = 0, rand = TRUE, ...)
Arguments
- x
 a data frame of the set of original data to be synthesised.
- k
 the number of rows in each synthetic data set; defaults to n.
- proper
 if proper = TRUE, x is replaced with a bootstrap sample before synthesis, thus effectively sampling from the posterior distribution of the model, given the data.
- priorn
 the sum of the parameters of the Dirichlet prior which can be thought of as a pseudo-count giving the number of observations that inform prior knowledge about the parameters.
- structzero
 a named list of lists that defines which cells in the table are structural zeros and will remain as zeros in the synthetic data, by leaving their prior as zeros. Each element of the structzero list is a list that describes a set of cells in the table defined by a combination of two or more variables, and the name of each such element must consist of those variable names separated by an underscore, e.g. sex_edu. The length of each such element is determined by the number of variables and each component gives the variable levels (numeric or labels) that define the structural zero cells (see an example below).
- gmargins
 a single character string that defines a group of margins. At present the options are "oneway" and "twoway", which create, respectively, all 1-way and 2-way margins from the table.
- othmargins
 a list of margins that will be fitted. If gmargins is not NULL, othmargins will be added to them.
- tol
 stopping criterion for Ipfp.
- max.its
 maximum number of iterations allowed for Ipfp.
- maxtable
 the number of cells in the cross-tabulation of all the variables that will trigger a severe warning.
- print.its
 if TRUE, the iterations from Ipfp will be printed on the console. Otherwise only a message as to whether the iterations have converged will be given at the end of the fitting.
- epsilon
 value of epsilon, the overall differential privacy (DP) parameter. DP is implemented by dividing the privacy budget equally over all the margins used to fit the data.
- rand
 when epsilon > 0 and DP synthetic data are created, this determines whether the data are generated as Poisson counts from the expected fitted counts in the cells of the DP-adjusted data.
- ...
 additional parameters.
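
The margin and convergence arguments can be illustrated with a minimal base R sketch. It is not the code path used by syn.ipf (which relies on Ipfp); it uses stats::loglin, which also fits log-linear models by iterative proportional fitting. The toy table, its dimension names and the objects tab, margins, fit and max_dev are assumptions made up for illustration. The list of all 2-way margins is analogous in spirit to gmargins = "twoway", and eps and iter play roles similar to tol and max.its.

```r
# Toy 2 x 2 x 2 table of counts (made-up numbers, for illustration only)
tab <- array(c(10, 4, 6, 12, 8, 9, 3, 7), dim = c(2, 2, 2),
             dimnames = list(a = c("a1", "a2"),
                             b = c("b1", "b2"),
                             c = c("c1", "c2")))

# All 2-way margins of a 3-variable table: c(1, 2), c(1, 3), c(2, 3)
margins <- combn(3, 2, simplify = FALSE)

# Fit the log-linear model defined by these margins using IPF
fit <- loglin(tab, margin = margins, fit = TRUE,
              eps = 1e-3, iter = 5000, print = FALSE)

# Largest absolute difference between observed and fitted 2-way margins
max_dev <- max(sapply(margins, function(m) {
  max(abs(apply(tab, m, sum) - apply(fit$fit, m, sum)))
}))
max_dev
```

The fitted table fit$fit reproduces the requested margins up to the tolerance, while the 3-way interaction is left to the model rather than copied from the data.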
Details
When used in the syn function, the group of variables with
method = "ipf" must all be together at the start of the visit sequence.
This function is designed for categorical variables, but it can also be used for
numerical variables if they are categorised by specifying them in the
numtocat parameter of the main function syn. Subsequent variables
in visit.sequence are then synthesised conditional on the synthesised
values of the grouped variables. A fit to the table is obtained from the
log-linear fit that matches the numbers in the margins specified by the margin
parameters. Prior probabilities for the proportions in each cell of the table
are given by a Dirichlet distribution with the same parameter for every cell
in the table that is not a structural zero. The sum of these parameters is
priorn. The default priorn = 1 can be thought of as equivalent
to the knowledge that 1 observation would be equally likely to
fall in any cell of the table. The synthetic data are generated from a multinomial
distribution with parameters given by the expected posterior probabilities for
each cell of the table. If the maximum likelihood estimate from the log-linear
fit to cell \(c_i\) is \(p_i\) and the table has \(N\) cells that are not
structural zeros then the expectation of the posterior probability
for this cell is \((p_i + priorn/N^2) / (1 + priorn / N)\) or
equivalently \((N * p_i + priorn/N) / (N + priorn)\).
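
The role of the prior and of structural zeros can be sketched in base R using the standard Dirichlet-multinomial update. This is a hedged illustration of the idea, not necessarily the exact computation performed by syn.ipf; the fitted counts, the structural-zero pattern and all object names below are hypothetical.

```r
set.seed(1)

# Hypothetical fitted cell counts for a table with 6 cells
fitted_counts <- c(12, 0, 5, 3, 0, 10)
# Cell 2 is declared a structural zero; cell 5 is only a sampling zero
struct_zero <- c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)

priorn <- 1                      # sum of the Dirichlet prior parameters
N <- sum(!struct_zero)           # cells that are not structural zeros
alpha <- ifelse(struct_zero, 0, priorn / N)  # same prior for each such cell

# Posterior expectation of the cell probabilities (conjugate update)
post_mean <- (fitted_counts + alpha) / (sum(fitted_counts) + priorn)

# Draw k synthetic observations from a multinomial with these probabilities
k <- 30
syn_counts <- as.vector(rmultinom(1, size = k, prob = post_mean))
syn_counts
```

The structural-zero cell keeps probability zero and can never be sampled, whereas the empty cell that is not a structural zero receives a small positive probability from the prior.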
Unlike syn.satcat, which fits saturated models from their conditional
distributions, x can include any combination of variables, including
those not present in the original data, except those defined by structzero.
NOTE that when the function is called by setting elements of
method in syn to "ipf", the parameters priorn,
structzero, gmargins, othmargins, tol,
max.its, maxtable, print.its, epsilon,
and rand must be supplied to syn with the prefix ipf., e.g. ipf.priorn.
Value
A list with two components:
- res
 a data frame with k rows containing the synthesised data.
- fit
 a list made up of two lists: the margins fitted and the original data for each margin.
Examples
library(synthpop)  # provides syn() and the SD2011 data set
ods <- SD2011[, c(1, 4, 5, 6, 2, 10, 11)]
table(ods[, c("placesize", "region")])
#>                         region
#> placesize                Dolnoslaskie Kujawsko-pomorskie Lodzkie Lubelskie
#>   URBAN 500,000 AND OVER           60                  0      94         0
#>   URBAN 200,000-500,000            30                 15       0         0
#>   URBAN 100,000-200,000            64                 28      66        41
#>   URBAN 20,000-100,000              0                 69       0        46
#>   URBAN BELOW 20,000               57                 45      31        28
#>   RURAL AREAS                     108                156     167       186
#>                         region
#> placesize                Lubuskie Malopolskie Mazowieckie Opolskie Podkarpackie
#>   URBAN 500,000 AND OVER        0          64         126        0            0
#>   URBAN 200,000-500,000        33          10          10       19           17
#>   URBAN 100,000-200,000        13          40          78       37           47
#>   URBAN 20,000-100,000          0           0          23        0            0
#>   URBAN BELOW 20,000           42          37          72       24           37
#>   RURAL AREAS                  65         220         261       73          212
#>                         region
#> placesize                Podlaskie Pomorskie Slaskie Swietokrzyskie
#>   URBAN 500,000 AND OVER         0         0       0              0
#>   URBAN 200,000-500,000          0         0     120              0
#>   URBAN 100,000-200,000         31        64     121             38
#>   URBAN 20,000-100,000          45        76      71             35
#>   URBAN BELOW 20,000            25        35      28             30
#>   RURAL AREAS                   92       131     160            127
#>                         region
#> placesize                Warminsko-mazurskie Wielkopolskie Zachodnio-pomorskie
#>   URBAN 500,000 AND OVER                   0            48                   0
#>   URBAN 200,000-500,000                   42            10                  21
#>   URBAN 100,000-200,000                   43            88                  44
#>   URBAN 20,000-100,000                     0             0                  42
#>   URBAN BELOW 20,000                      45            53                  53
#>   RURAL AREAS                            129           214                  88
# Each `placesize_region` sublist:
# for each relevant level of `placesize` defined in the first element,
# the second element defines regions (variable `region`) that do not
# have places of that size.
struct.zero <- list(
  placesize_region = list(placesize = "URBAN 500,000 AND OVER",
                          region = c(2, 4, 5, 8:14, 16)),
  placesize_region = list(placesize = "URBAN 200,000-500,000",
                          region = c(3, 4, 10:11, 13)),
  placesize_region = list(placesize = "URBAN 20,000-100,000",
                          region = c(1, 3, 5, 6, 8, 9, 14:15)))
synipf <- syn(ods, method = c(rep("ipf", 4), "ctree", "normrank", "ctree"),
              ipf.gmargins = "twoway", ipf.othmargins = list(c(1, 2, 3)),
              ipf.priorn = 2, ipf.structzero = struct.zero)
#> 
#> Synthesis
#> -----------
#> First 4 variables (sex, placesize, region, edu) synthesised together by method 'ipf'
#> Error in sampler.syn(p, data, m, syn, visit.sequence, rules, rvalues,     event, proper, print.flag, k, pred.not.syn, models, numtocat,     ...): object 'struct.zero' not found