Changelog • synthpop

synthpop 1.8-2

New features

This version brings a maintenance uplift to incorporate tidyverse standards. (#31) The added functionality lowers the barrier to entry for contributors and also aims to reduce the workload of package maintainers when reviewing contributions.

Add a devcontainer config to allow contributing through GitHub Codespaces. The devcontainer provides VS Code with a sensible selection of extensions.
Upgrade NEWS to NEWS.md.
Add a README.Rmd file. (#30)
Migrate hand-written NAMESPACE and .Rd docs to Roxygen docstrings.
Add Roxygen for documentation.
Add a test suite.
Add pre-commit hooks which help to uphold coding standards.
Add continuous testing through GitHub Actions.
Add continuous documentation through GitHub Actions and pkgdown. The resulting pkgdown website is deployed to GitHub Pages. (#23)

Applying tidyverse code formatting and migrating reference documentation brings many line changes to the codebase, but does not alter the package API or functionality.

synthpop 1.8-0

CRAN release: 2022-08-31

New features

An option in compare.synds() to add values of a selected utility measure to the names of plot facets (utility.to.plot argument).
Synthesis with “catall” and “ipf” has new features that permit the creation of differentially private (DP) synthetic data. Parameters are epsilon and rand.
New smoothing methods (smoothing argument): “spline” and “rmean”. Now you can also provide a single string specifying a smoothing method, which will then be used to smooth all numeric variables in the data. You can still provide a named list to smooth selected variables only.
New parameters low, high, n.breaks, and breaks for utility.tables() that allow creating a two-colour binned gradient and setting colours for the low and high ends of the gradient.

Changes

Optional parameters can be set for syn.norm(), syn.lognorm(), syn.sqrtnorm(), syn.cubertnorm(), syn.normrank(), syn.pmm().
syn.cubertnorm() now allows negative values.
In multi.compare() if msel = NULL, pooled synthetic data copies are compared with the original data.
Parameter incomplete added to summary.fit.synds() function.
In syn.ipf() default value of tol parameter changed to 1e-3 in order to speed up convergence.
Sublist elements of structural zero list must be named.
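
The note above about syn.cubertnorm() accepting negative values can be illustrated with a small base R sketch. This is not the package's code: the helper names signed_cuberoot() and cube() are hypothetical, and the sketch only shows one common way to make a cube-root transformation well defined for negative inputs.

```r
# Hypothetical helper (not part of synthpop): a cube root that keeps the sign,
# so it is defined for negative, zero and positive inputs.
signed_cuberoot <- function(x) sign(x) * abs(x)^(1 / 3)

# Inverse transformation back to the original scale.
cube <- function(z) z^3

y <- c(-27, -1, 0, 8, 64)
z <- signed_cuberoot(y)  # transformed values, e.g. for fitting a normal model
back <- cube(z)          # recovers y up to floating-point error

# For comparison, the naive power form is not usable for negative inputs in R.
naive <- (-27)^(1 / 3)   # NaN
```

A transformation written with `x^(1/3)` alone fails for negative values, which is why a sign-preserving form is needed before negative data can be handled.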

Bug fixes

In multi.compare() default colour and fill scales restored for ggplot() to allow for more than 9 levels/synthetic data sets to be plotted.
Automatic detection of incomplete synthesis.
Formatting of labels for plotting numeric variables in compare.synds() that could create duplicated values.
Subsetting of columns in data frames that are also tibbles.
Handling of special characters in variable names.
Labeling of multi.compare() plots.

synthpop 1.7-0

CRAN release: 2021-11-17

New features

New function utility.tables() summarises one-way, two-way or three-way tables of utility by utility.tab(). Plots and tables of utility measures can be produced.
Functions utility.gen(), utility.tab(), and compare.synds() can be used to assess the utility of synthetic data sets that were created NOT using synthpop. They have to be provided as a data.frame or a list.
utility.gen() and utility.tab() have a parameter k.syn to indicate that the sample size itself has been synthesised. The default value is FALSE, which applies to synthetic data created by synthpop.
The following additional statistics for each synthesis are calculated in utility.tab(): “G” - the adjusted likelihood ratio chi-squared statistic for comparing tables with original and synthesised values, “JSD” - the Jensen-Shannon distance between the tables with original and synthesised values.
These additional statistics are computed by utility.tab() and utility.gen(): “PO50” - the percentage over 50% of each synthetic data set where the model used to predict real or synthetic is correct; “SPECKS” - the Kolmogorov-Smirnov distance comparing the propensity scores for the synthetic and original data.
Parameter print.flag is added to utility.gen() and utility.tab() to suppress, if desired, output for simulations.
compare.synds() produces a table, tab.res, that gives the one-way utility statistics computed by utility.tab() with parameter tables = “oneway”.
New parameters plot and table in compare.synds() function to indicate, respectively, whether plots should be produced and tables printed. By default plots are printed and tables are suppressed.
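
The “PO50” and “SPECKS” measures described above can be sketched in base R on simulated data. This is an illustration under stated assumptions, not the code used by utility.gen() or utility.tab(): the data frames orig and synt, the logistic propensity model and the 0.5 cut-off are all assumptions made for the example, and both data sets have the same number of rows.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for an original and a synthetic data set (assumption).
orig <- data.frame(x = rnorm(n), y = rnorm(n))
synt <- data.frame(x = rnorm(n, mean = 0.3), y = rnorm(n))

# Stack the two data sets and mark which rows are synthetic.
combined <- rbind(orig, synt)
combined$is_syn <- rep(c(0, 1), each = n)

# Propensity score: estimated probability that a row is synthetic.
fit <- glm(is_syn ~ x + y, data = combined, family = binomial)
p <- fitted(fit)

# SPECKS: Kolmogorov-Smirnov distance between the propensity scores
# of the synthetic and the original rows.
specks <- unname(ks.test(p[combined$is_syn == 1], p[combined$is_syn == 0])$statistic)

# PO50: percentage of rows classified correctly, minus 50.
pct_correct <- 100 * mean((p > 0.5) == (combined$is_syn == 1))
po50 <- pct_correct - 50
```

Under this reading, both measures are close to zero when the model cannot tell synthetic rows from original ones, and grow as the two data sets become easier to separate.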

Changes

Default minnumlevels changed from -1 to 1. This permits numeric variables with a single distinct value as well as missing values to be synthesised correctly.
In utility.gen() failure to converge is changed from a warning to a failure and message changed.
Improved error messages in syn() to signal small samples or factor variables with too many levels.
Matrix of predictors x in logreg.syn() is centred and xp made to match to improve computation (thanks to Heloise Gauvin for noting this problem).
Improved error messages and help file for syn.survctree().
In syn() variables changed to factors from character are changed back to character in synthetic data.
In syn() variables changed to factor from numeric because of only a few levels, are changed back to numeric in synthetic data.
In utility.tab() default for digits changed from 2 to 3.
Several changes to utility.gen(), including calculating p-values for resampling methods empirically and the calculation of the percentage correctly predicted by the model. Also changes in the saved object and in the print method.
Print method for utility.gen() changed to allow user to change settings when called after the objects are created.
In compare.synds() and multi.compare() numeric variables with fewer than 6 distinct values are changed to factors in order to make plots more readable.
In multi.compare() if barplot.position = “dodge”, source of data (original/synthetic) is mapped to an aesthetic fill.

Bug fixes

Problems with checks when a variable has a class attribute that is of length > 1 corrected.
Bug in survctree.syn() corrected.
Bug in utility.tab() corrected: classInt() could return duplicate values for breaks when style = "quantile", which is now the default.
A synthetic version of a logical variable synthesised using function syn.cart() is also logical.
Passive functionality has been improved to allow checking if the passive method would give the right answer in the original data and produce a failure if it fails.
Synthesis of variables with assigned labels (note that synthetic values will not have labels).

synthpop 1.6-0

CRAN release: 2020-09-04

New features

Incomplete synthesis is detected automatically from the details of the methods used to synthesise a data set. Functions glm.synds(), lm.synds(), polr.synds() and multinomial.synds() return another component incomplete of their result. The function summary.fit.synds() no longer has the parameter incomplete. In the previous version when a user fitted a model where some of the variables were not synthesised and m = 1, summary.fit.synds() would stop with an error. In the current version calculations are carried out as if all variables had been synthesised (incomplete is set to FALSE) but a warning is printed.
New function polr.synds() to fit ordered logistic models to synthetic data using the polr() function from the package MASS. The compare.fit.synds() function can be used to compare the results of polr.synds() with those based on the original data.
New synthesising function syn.ranger() which uses a fast implementation of random forests (contributed by Caspar van Lissa).

Changes

The vignette on inference has been updated.
synthpop website address added in a DESCRIPTION file.
“cart” default method is explicitly defined in the syn() function.
Correction of a Value part in the help files for syn.methods.
Warning when numeric variables with five or fewer levels are not changed to factors.

Bug fixes

Method “polr” (ordered logistic regression) is not replaced by “polyreg” (unordered logistic regression) if not necessary. A bug due to passing of an unnecessary parameter smoothing removed.
In cart.syn() when a numeric variable is synthesised and the synthetic values of the explanatory variables cannot be classified to a final node of a tree they are randomly assigned to one of the final children nodes.
In glm.synds() when used with family = gaussian the variances are rescaled with the residual variance estimate.
Numeric variables with missing values which are not synthesised but used as predictors for other variables are kept unchanged.

synthpop 1.5-1

CRAN release: 2019-03-20

Bug fixes

Not running an example that mostly fails due to too many small categories in the synthesised data set.

synthpop 1.5-0

CRAN release: 2018-08-16

New features

New methods “catall” and “ipf” that synthesise a group of variables together at the start of the synthesis.
New parameters numtocat and catgroups for syn(). Variables in numtocat are converted into categorical variables with breaks determined by their distribution and catgroups gives the target number of groups for each variable. The data with the categorical versions are then synthesised and finally the synthesised variables in numtocat are created from bootstrap samples within the categories. This feature was developed for the second stage of “ipf” and “catall” but can be used with any method. Variables in numtocat must have a method suitable for categorical data.
New function numtocat.syn() can be used to group numeric variables into categories. It can be used before synthesis to create a new data frame that can then be synthesised. It should be used instead of the numtocat parameter of syn() if you want to keep the categorical variables in the synthetic data.
The function nested.syn() can now be used with continuous variables nested within categories. It also allows smoothing of the non-missing data.
New function codebook.syn() that can be used to check features of data before synthesis.
New parameter to compare.synds() stat with possible values “percents” or “counts” allows tables and plots to display counts instead of percentages in groups.

Changes

Some improvements to utility.tab():
- default for print.tables is to print if up to a 3-way table;
- a new parameter useNA to include or exclude NA values from tables;
- a new parameter print.stats to allow a choice of what statistics to be printed and the default is to print only the Voas-Williamson statistic with a simpler format.
Format of printed output from utility.gen() has been changed to emphasise the ratio to the expected as the most important measure.
mincriterion for syn.ctree() changed to 0.9.
If the syn() parameter models = TRUE, models are stored for the variables in the visit sequence and for missing values in the continuous variables.

Bug fixes

utility.tab() and utility.gen() for synthetic data generated from syn.strata().
Smoothing for “sample” method.

synthpop 1.4-4

CRAN release: 2018-06-27

Bug fixes

Compatibility with the new version of ggplot2 (2.3.0).
Check in multi.compare() for missing argument var.
Check in sdc() for data type of variables to be smoothed.
Size of $syn when k < n and all synthesising methods without predictors.

synthpop 1.4-3

CRAN release: 2018-03-19

Changes

In padModel.syn() the default for newly defined synthesising methods is NOT to create dummy variables from factors before fitting models.

Bug fixes

Extraction of variable name from a transformed dependent variable in glm.synds(), lm.synds() and multinom.synds().
In syn.polyreg() numerical variables used as predictors for the multinom() function are scaled so as to have a range (0,1) to improve convergence. An extra warning is included if all the predicted values are in one category, which can result when the model fails to iterate due to sparse data.
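
The rescaling of numeric predictors to the range (0,1) mentioned above can be sketched as a min-max transformation in base R. The helper name rescale01() is hypothetical and the code is an assumption about what such a scaling typically looks like, not the code inside syn.polyreg().

```r
# Hypothetical helper (not part of synthpop): min-max scaling to [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(x * 0)  # constant predictor: map to 0 and keep any NA in place
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

income <- c(10, 250, 1000, NA, 40)
scaled <- rescale01(income)
```

Putting predictors with very different magnitudes on a common scale tends to help iterative fitting routines such as multinom() converge.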

synthpop 1.4-1

CRAN release: 2018-01-05

New features

New version of utility.gen() and print.utility.gen(). These now implement the methods described in Snoke et al. and include the following:
- appropriate method of getting the null distribution is selected based on synthesis details;
- warning messages if CART models fail to split;
- allow a seed value to be set and stored for resampling methods;
- printing Z scores from logit models using the average fit over different syntheses.
New function multinom.synds() to fit multinomial models to synthetic data using the multinom() function from the package nnet.

Changes

Default cp parameter in syn.cart() changed to 1e-8.
Default print.tables parameter in utility.tab() changed to TRUE.
All syn() result components that are lists, e.g. cont.na, have their elements named.

Bug fixes

Call to classIntervals() in utility.tab() uses style = “fisher” as default to avoid problems for variables with a small number of unique values.
utility.tab() computes breaks for grouping from the combined data rather than just from the observed. This avoids problems when synthetic values are outside the range of the observed ones.
In compare.fit.synds() the quantity real.varcov needs to be multiplied by sigma^2 when fitting.function is “lm”.

synthpop 1.4-0

CRAN release: 2017-11-20

New features

Vignette on inference from fitted models (we are grateful to Joerg Drechsler for his comments).
New function syn.satcat() allows saturated categorical models to be fitted from all possible interactions of the predictor variables.

Changes

utility.tab() returns p-values for utility measures rather than standardised versions of the measures.
smooth.vars parameter added to sdc() function which allows smoothing of numeric variables in the synthesised dataset.
If models = TRUE in syn(), for logreg and polyreg coefficients of the fitted model are returned.
Output from print.summary.fit.synds() is labelled differently according to whether population.inference is TRUE or FALSE (see vignette on inference). Also it now includes p-values and stars as are shown for lm() and glm().
Warning messages are given in summary.fit.synds() when inference is attempted from an invalid model where synthesis has not conditioned on any variables that have been left unsynthesised.
The msel parameter of print.summary.fit.synds() prints a table of estimates rather than a listing of each fit in detail.
Several changes in compare.fit.synds() are described in detail in the new vignette on inference (‘Inference from fitted models in synthpop’). These include extensions of the lack-of-fit tests and confidence interval overlap measures for different options.

Bug fixes

replicated.uniques() for a single variable.
syn.cart() for logical variables without missings (thanks to bug report by Ruben Arslan).
syn() can be used without loading the package (thanks to reported issue by Ruben Arslan).
Missing data factor level as a stratum in syn.strata().
seed for syn.strata().
print.fit.synds() for syn.strata() object.

synthpop 1.3-2

CRAN release: 2017-07-10

Changes

For consistency reasons tab.utility() changed to utility.tab() and utility.synds() to utility.gen().
utility.gen() under major revision and temporarily unavailable.
Revision of utility.tab() function: a measure of fit proposed by Voas and Williamson and one proposed by Freeman and Tukey are calculated; continuous variables are categorised using classIntervals() function.
Revision of compare.fit.synds() function: new lack-of-fit measures and mean values for existing analysis-specific utility measures (confidence interval overlap and absolute standardized difference between coefficients); return.result parameter replaced by print.coef with slightly different functionality - analysis-specific utility measures are always printed but you can choose whether or not to print model estimates.
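
The two table-based measures of fit named above can be sketched in base R. The formulas below are the commonly cited forms of the Voas-Williamson and Freeman-Tukey statistics and are given here as an assumption about what utility.tab() computes, not as a copy of its code. The function names and the cell counts are hypothetical, and both tables are assumed to have the same total.

```r
# Hypothetical helpers (not part of synthpop). obs and syn are vectors of
# cell counts from the same table layout with equal totals.
voas_williamson <- function(obs, syn) {
  keep <- (obs + syn) > 0  # skip cells that are empty in both tables
  sum((syn[keep] - obs[keep])^2 / ((syn[keep] + obs[keep]) / 2))
}

freeman_tukey <- function(obs, syn) {
  4 * sum((sqrt(syn) - sqrt(obs))^2)
}

obs <- c(10, 20, 30, 40)  # counts from the observed data (made up)
syn <- c(12, 18, 33, 37)  # counts from the synthetic data (made up)

vw <- voas_williamson(obs, syn)
ft <- freeman_tukey(obs, syn)
```

Both statistics are zero when the two tables are identical and increase as the synthetic counts move away from the observed ones.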

Bug fixes

compare.synds() for integer variables without missing values (thanks to bug report by Joerg Drechsler).

synthpop 1.3-1

CRAN release: 2016-11-23

Changes

Update of the vignette.

Bug fixes

compare.synds() for variables with NA values in observed but not in synthetic data returns correct value (0) for NA category in synthetic data.
Invalid times argument corrected (lists of numbers coerced to numbers).

synthpop 1.3-0

CRAN release: 2016-10-24

New features

Storing results of CART models when models set to TRUE.
Function syn.strata() for stratified synthesis.
Function multi.compare() for multivariate comparison of synthesised and observed data.
Synthesising method “nested” for a variable nested within another variable.
Tabular utility function tab.utility() for comparing contingency tables from observed and synthesized data.
Parameter uniques.exclude for the sdc() function, which can be used to remove some variables from the identification of uniques.
Function replicated.uniques() returns a number of unique individuals in the original data set ($no.uniques).

Changes

Synthetic values of collinear variables are derived based on the one that is synthesised first and their method is set to “collinear”. They do not have to be removed prior to synthesis.
Synthesising method for constant variables is set to “constant” and the variables are not removed from the synthesised data set when drop.not.used = TRUE.
Default synthesising method changed to “cart”.
Default minnumlevels changed to -1 (during synthesis numeric variables are not changed to factors regardless of the number of distinct values).
Coefficient estimates and their confidence intervals are plotted in the same order as they are presented in a tabular form.
No message on the seed value used (it is stored in the result object).
Formula of the model to be fitted using glm.synds() or lm.synds() can be specified outside the function.
Message for sdc() on number of replicated uniques also when it is equal to zero.
Maximum number of iterations for a multinomial model used in polyreg and polr method increased to 1000 (maxit parameter). Message if the limit is reached.
write.syn() saves complete synds object into a file synobject_filename.RData.
Error on exceeding maxfaclevels is not generated if method for the factor is set to “sample” or “nested”.
For constant variables method is changed to “constant”.
Year format for variables ymarr and ysepdiv in SD2011 dataset changed from yy to yyyy.

Bug fixes

Types and placement of special signs that are allowed in rules have been extended and include e.g. initial and closing round bracket.
compare.synds() provides output for logical variables.
Synthesis of logical variables with missing values.
Message about a change of method for a variable without predictors.
Check for filetype in write.syn()

synthpop 1.2-1

CRAN release: 2016-03-15

Bug fixes

No calling var(x) on a factor x (in checks).
No contrasts attribute for factors synthesised using parametric method.
Misspelled vector name (nlevels) replaced with a correct one (nlevel).

synthpop 1.2-0

CRAN release: 2016-02-03

New features

A new function utility.synds() for distributional comparison of synthesised data with the original (observed) data using propensity scores.
New measures for comparing model estimates based on synthesised and observed data implemented in compare.fit.synds() function: standardized differences in coefficient values (coef.diff) and confidence interval overlap (ci.overlap).
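
The two comparison measures named above can be sketched in base R. The definitions used here (difference in coefficients divided by the standard error from the observed data, and the average proportion of each interval covered by the intersection) are a common formulation and are an assumption about what compare.fit.synds() reports. The function names and numbers are hypothetical.

```r
# Hypothetical helpers (not part of synthpop).

# Standardized difference between a synthetic and an observed coefficient.
std_coef_diff <- function(b_obs, se_obs, b_syn) {
  (b_syn - b_obs) / se_obs
}

# Confidence interval overlap: average share of each interval covered by
# the intersection. Equals 1 for identical intervals, negative if disjoint.
ci_overlap <- function(lo_obs, up_obs, lo_syn, up_syn) {
  inter <- pmin(up_obs, up_syn) - pmax(lo_obs, lo_syn)
  0.5 * (inter / (up_obs - lo_obs) + inter / (up_syn - lo_syn))
}

# Made-up estimates for one coefficient.
b_obs <- 1.00; se_obs <- 0.10
b_syn <- 1.05; se_syn <- 0.12

z <- qnorm(0.975)
diff_std <- std_coef_diff(b_obs, se_obs, b_syn)
overlap <- ci_overlap(b_obs - z * se_obs, b_obs + z * se_obs,
                      b_syn - z * se_syn, b_syn + z * se_syn)
```

A standardized difference near zero and an overlap near one both indicate that the model fitted to synthetic data tells much the same story as the model fitted to the observed data.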

Changes

No dependency on coefplot package.
Default for drop.not.used changed to FALSE.

synthpop 1.1-1

CRAN release: 2015-07-08

Changes

Both variable names and their column indices can be used in visit.sequence.
Arguments rules, rvalues, cont.na, semicont, smoothing, event, denom are specified as named lists, e.g. rules = list(marital = “age < 18”) and do not have to be specified for all variables.
Optional arguments can be passed to synthesising functions by specifying funname.argname arguments, e.g. ctree.minbucket = 5; they are function-specific; minbucket removed from arguments.
Smoothing is possible for numeric variables when synthesised with the method “sample”.
compare() is a generic function with two methods (for class synds and fit.synds); it replaced two separate functions.
New argument return.plot for compare() method for class fit.synds.
New argument msel for compare() method for class synds, which allows comparison for pooled or selected data set(s). Results for multiple synthetic data sets can be plotted on the same graph.
New argument nrow for compare() method for class synds; nrow and ncol determine number of plots per screen.
Argument plot.na for compare() method for class synds is no longer required and missing data categories for numeric variables are plotted on the same plot as non-missing values.
Argument object of lm.synds() and glm.synds() functions changed to data.
print() method for class fit.synds gives by default combined coefficient estimates only.
summary() method for class fit.synds gives combined coefficient estimates and their standard errors.
summary() method for class synds with multiple synthetic data sets provides by default summaries that are calculated by averaging summary values for all synthetic data copies.
Argument obs.data of compare.fit.synds() function changed to data.
Method surv.ctree and cart.bboot changed to survctree and cartbboot.
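
The named-list arguments and the funname.argname convention described above might be used as in the following sketch. It assumes the synthpop package and its SD2011 data set are installed, so the call is wrapped in a guard. The choice of variables, the value “SINGLE” for rvalues, the seed and the use of cart.minbucket (by analogy with the ctree.minbucket example above, for the “cart” method) are assumptions for illustration.

```r
if (requireNamespace("synthpop", quietly = TRUE)) {
  library(synthpop)

  # A small subset of the example data shipped with the package.
  ods <- SD2011[, c("sex", "age", "marital")]

  s <- syn(
    ods,
    rules = list(marital = "age < 18"),  # named list: only marital has a rule
    rvalues = list(marital = "SINGLE"),  # value imposed where the rule holds
    cart.minbucket = 5,                  # passed on to the "cart" method
    seed = 2015,
    print.flag = FALSE
  )

  # Marital status of synthetic individuals under 18.
  print(table(s$syn$marital[s$syn$age < 18], useNA = "ifany"))
}
```

Only the variables that need a rule appear in the lists; the remaining variables are synthesised with the default settings.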

Bug fixes

denom and event for variables with missing data.
maxfaclevels can be increased.
Continuous variables with missing data when zero is a non-missing value.
Synthesis of a single variable (with or without auxiliary predictors) now works.

synthpop 1.1-0

CRAN release: 2015-02-21

New features

Function sdc() for statistical disclosure control of the synthesised data set(s); function replicated.uniques() to determine which unique units in the synthesised data set(s) replicate unique units in the original data set.
Function read.obs() to import original data sets from external files.
Function write.syn() to export synthetic data sets to external files and create a text file with information about the synthesis.
syn() has a new semicont parameter that allows defining spike(s) for semi-continuous variables in order to synthesise them separately.
lognorm, sqrtnorm and cubertnorm methods for synthesis by linear regression after natural logarithm, square root or cube root transformation of a dependent variable.
seed argument for syn() function.

Changes

Revised output of summary.fit.synds() and compare.fit.synds(); standard errors of Z scores corrected (se(Z.syn)) (thanks to Joerg Drechsler).
Figures for compare.fit.synds() and compare.synds() functions plotted using ggplot2 functions.
period.separated or alllowercase naming convention has been adopted and parameter names populationInference, visitSequence, predictorMatrix, contNA, defaultMethod, printFlag and nlevelmax have been changed to population.inference, visit.sequence, predictor.matrix, cont.na, default.method, print.flag and minnumlevels respectively.
Default for drop.pred.only changed to FALSE.

Bug fixes

Rounding procedure (thanks to bug report by Joerg Drechsler).
Warning about extra disregarded argument family in compare.fit.synds().