Synthesis with classification and regression trees (CART)
Generates univariate synthetic data using classification and regression trees (with or without bootstrap).
Usage
syn.ctree(y, x, xp, smoothing = "", proper = FALSE,
          minbucket = 5, mincriterion = 0.9, ...)
syn.cart(y, x, xp, smoothing = "", proper = FALSE,
         minbucket = 5, cp = 1e-08, ...)
Arguments
- y
 an original data vector of length n.
- x
 a matrix (n x p) of original covariates.
- xp
 a matrix (k x p) of synthesised covariates.
- smoothing
 smoothing method for numeric variables. See syn.smooth.
- proper
 for proper synthesis (proper = TRUE) a CART model is fitted to a bootstrapped sample of the original data.
- minbucket
 the minimum number of observations in any terminal node. See rpart.control and ctree_control for details.
- cp
 complexity parameter. Any split that does not decrease the overall lack of fit by a factor of cp is not attempted. Small values of cp will grow large trees. See rpart.control for details.
- mincriterion
 the value of 1 - p-value of the test that must be exceeded for a split to be retained. Small values of mincriterion will grow large trees. See ctree_control for details.
- ...
 additional parameters passed to ctree_control for syn.ctree and rpart.control for syn.cart.
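The effect of cp and minbucket on tree size can be checked directly with rpart. The following is a minimal sketch on simulated data, not part of synthpop: the data frame d and the helper n_leaves are hypothetical names introduced only for this illustration.

```r
library(rpart)

set.seed(2)

# Hypothetical toy data with a smooth signal in x1 and a linear one in x2.
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- sin(6 * d$x1) + d$x2 + rnorm(n, sd = 0.3)

# Hypothetical helper: count the terminal nodes of a fitted rpart tree.
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# Same data, three settings of the control parameters.
fit_small_cp   <- rpart(y ~ ., data = d, method = "anova",
                        minbucket = 5, cp = 1e-08)
fit_large_cp   <- rpart(y ~ ., data = d, method = "anova",
                        minbucket = 5, cp = 0.05)
fit_big_bucket <- rpart(y ~ ., data = d, method = "anova",
                        minbucket = 50, cp = 1e-08)

leaves <- c(small_cp   = n_leaves(fit_small_cp),
            large_cp   = n_leaves(fit_large_cp),
            big_bucket = n_leaves(fit_big_bucket))
print(leaves)
```

A larger tree has smaller terminal nodes, so each synthetic value is drawn from fewer donors. Raising minbucket caps the number of leaves at n / minbucket and guarantees at least minbucket donors per leaf.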
Details
The procedure for synthesis by a CART model is as follows:
Fit a classification or regression tree by binary recursive partitioning.
For each xp find the terminal node.
Randomly draw a donor from the members of the node and take the observed value of y from that draw as the synthetic value.
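The three steps above can be sketched with rpart alone. This is an illustration under stated assumptions, not the code of syn.cart: the toy data, the object names (fit_nodes, leaf_obs, leaf_new) and the device of overwriting yval so that predict() returns a leaf index are all choices made for this example.

```r
library(rpart)

set.seed(1)

# Hypothetical toy data: y depends on x1 through a step function.
n  <- 200
x  <- data.frame(x1 = runif(n), x2 = runif(n))
y  <- ifelse(x$x1 > 0.5, 10, 0) + rnorm(n)
xp <- data.frame(x1 = runif(50), x2 = runif(50))  # stands in for synthesised covariates

# Step 1: fit a regression tree by binary recursive partitioning.
fit <- rpart(y ~ ., data = cbind(y = y, x), method = "anova",
             minbucket = 5, cp = 1e-08)

# Terminal node of every original observation (row index into fit$frame).
leaf_obs <- fit$where

# Step 2: find the terminal node of each row of xp.
# Overwrite each node's fitted value with its row index in fit$frame,
# so that predict() returns the leaf index instead of the leaf mean.
fit_nodes <- fit
fit_nodes$frame$yval <- seq_len(nrow(fit_nodes$frame))
leaf_new <- predict(fit_nodes, newdata = xp)

# Step 3: for each new row, draw one donor from the same leaf and copy its y.
res <- vapply(leaf_new, function(node) {
  donors <- y[leaf_obs == node]
  donors[sample.int(length(donors), 1)]  # index-based draw, safe for a single donor
}, numeric(1))

head(res)
```

Every value in res is an observed value of y, which is why smoothing or a larger minbucket may be wanted for continuous variables.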
syn.ctree uses the ctree function from the
  party package and syn.cart uses the rpart
  function from the rpart package. They differ, among other things,
  in how a splitting variable is selected and in the stopping rule for the
  splitting process.
Gaussian kernel smoothing can be applied to continuous variables
  by setting the smoothing parameter to "density". It is recommended
  as a tool to decrease the disclosure risk. Increasing minbucket
  is another means of data protection.
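To show why smoothing reduces disclosure risk, the sketch below perturbs donor values with Gaussian noise whose standard deviation is the bandwidth chosen by density(). It is a generic base R illustration and only an assumption about the idea behind smoothing = "density"; see syn.smooth for what synthpop actually does. The names y_obs, y_syn and y_smooth, and the clamping to the observed range, are hypothetical choices for this example.

```r
set.seed(3)

# Hypothetical observed values and donor draws; the draws repeat observed values exactly.
y_obs <- round(rlnorm(300, meanlog = 3, sdlog = 0.5))
y_syn <- sample(y_obs, 100, replace = TRUE)

# Bandwidth of a Gaussian kernel density estimate (the kernel's standard deviation).
bw <- density(y_syn)$bw

# Perturb each synthetic value with Gaussian noise of that standard deviation.
y_smooth <- rnorm(length(y_syn), mean = y_syn, sd = bw)

# Keep smoothed values inside the observed range (an assumption of this sketch).
y_smooth <- pmin(pmax(y_smooth, min(y_obs)), max(y_obs))

# Share of synthetic values that still coincide with an observed value.
mean(y_smooth %in% y_obs)
```

Before smoothing, every synthetic value is a copy of some observed value. After smoothing, exact copies become rare, so an unusual observed value is less likely to be released unchanged.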
CART models were suggested for generation of synthetic data by Reiter (2005) and then evaluated by Drechsler and Reiter (2011).
Value
A list with two components:
- res
 a vector of length k with synthetic values of y.
- fit
 the fitted model which is an object of class rpart.object or ctree.object that can be printed or plotted.
References
Reiter, J.P. (2005). Using CART to generate partially synthetic, public use microdata. Journal of Official Statistics, 21(3), 441–462.
Drechsler, J. and Reiter, J.P. (2011). An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Computational Statistics and Data Analysis, 55(12), 3232–3243.