Inference from synthetic data
summary.fit.synds.Rd
Combines the results of models fitted to each of the m synthetic data sets.
Arguments
- object
 an object of class fit.synds created by fitting a model to a synthesised data set using the function glm.synds, lm.synds, multinom.synds or polr.synds.
- population.inference
 a logical value indicating whether inference should be made to population quantities. If FALSE, inference is made to the results that would be expected from an analysis of the original data. This option should be selected if the synthetic data are being used for exploratory analysis, but the final published results will be obtained by running code on the original confidential data. If population.inference = TRUE, the results allow population inference to be made from the synthetic data. In both cases the inference depends on the synthesising model being correct, but this can be checked by running the same analysis on the real data, see compare.fit.synds.
- msel
 index or indices of the synthetic data sets (1,...,m) for which summaries of fitted models are to be produced. If NULL (default), only the summary of combined estimates is produced.
- real.varcov
 the estimated variance-covariance matrix of the fit of the model to the original data. This parameter is used in the function compare.fit.synds, which has the original data as one of its parameters.
- incomplete
 a logical value indicating whether population inference for incomplete synthesis is to be used. If this is left at its NULL value, it will be determined by whether the dependent variable has been synthesised. See also the incomplete component described under Value.
- ...
 additional parameters.
- x
 an object of class summary.fit.synds.
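As a hedged illustration of how these arguments might be combined, the sketch below assumes the synthpop package is installed and uses its bundled SD2011 data. The object names (ods, s, f, s_pop), the seed and the reduced model formula are choices made for this sketch only.

```r
library(synthpop)

# Small illustrative subset of the SD2011 survey data shipped with synthpop
ods <- SD2011[1:1000, c("sex", "age", "smoke")]

# Three syntheses; print.flag = FALSE suppresses the progress messages
s <- syn(ods, m = 3, seed = 1, print.flag = FALSE)
f <- glm.synds(smoke ~ sex + age, data = s, family = "binomial")

# Default: combined estimates only, with inference to the results
# that would be expected from the original data
summary(f)

# Population inference, plus summaries for syntheses 1 and 2
s_pop <- summary(f, population.inference = TRUE, msel = 1:2)
s_pop
```

Leaving msel at NULL keeps the printed output to the combined estimates, which is usually enough unless individual syntheses need to be inspected.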
Details
The mean of the estimates from each of the m synthetic data sets yields asymptotically unbiased estimates of the coefficients if the observed data conform to the distribution used for synthesis. The standard errors are estimated differently depending on whether inference is made for the results that we would expect to obtain from the observed data or for the parameters of the population from which we assume the observed data are sampled. The standard errors also differ according to whether the synthetic data were produced using simple or proper synthesis (for details see Raab et al. (2017)).
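The combining rules described above can be sketched in base R. This is a minimal illustration, not the synthpop source: combine_synds is a hypothetical helper, and the variance formulas are a reading of Raab et al. (2017) and Reiter (2003) that happens to reproduce the standard-error ratios printed in the Examples on this page (sqrt(1 + 1/m) for simple and sqrt(1 + 2/m) for proper synthesis when k = n). The toy inputs reuse one printed coefficient and standard error for every synthesis, which real per-synthesis fits would not do.

```r
# Hypothetical helper, not part of synthpop.
#   coefs: m x p matrix of coefficient estimates (one row per synthesis)
#   vars:  m x p matrix of squared standard errors
#   n, k:  number of cases in the original and in each synthetic data set
combine_synds <- function(coefs, vars, n, k, proper = FALSE,
                          population.inference = FALSE,
                          incomplete = FALSE) {
  m   <- nrow(coefs)
  est <- colMeans(coefs)          # combined estimate: mean over syntheses
  v   <- colMeans(vars) * k / n   # mean variance rescaled to sample size n
  if (!population.inference) {
    total <- v                              # results expected from original data
  } else if (incomplete) {
    stopifnot(m > 1)                        # assumption: this sketch refuses m = 1
    total <- apply(coefs, 2, var) / m + v   # Reiter (2003), partially synthetic
  } else if (!proper) {
    total <- v * (1 + n / (k * m))          # simple synthesis
  } else {
    total <- v * (1 + (n / k + 1) / m)      # proper synthesis
  }
  cbind(estimate = est, se = sqrt(total))
}

# Toy input: m = 5 identical "syntheses" of a single coefficient
m     <- 5
coefs <- matrix(1.0576149, nrow = m, ncol = 1,
                dimnames = list(NULL, "(Intercept)"))
vars  <- matrix(0.5987047^2, nrow = m, ncol = 1)

simple_pop <- combine_synds(coefs, vars, n = 1000, k = 1000,
                            population.inference = TRUE)
proper_pop <- combine_synds(coefs, matrix(180.8090534^2, nrow = m, ncol = 1),
                            n = 1000, k = 1000, proper = TRUE,
                            population.inference = TRUE)
simple_pop
proper_pop
```

With these inputs the standard errors agree with the se.Beta.syn values printed for the intercept in the simple and proper examples, to within rounding of the printed figures.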
Value
An object of class summary.fit.synds, which is a list with the following components:
- call
 the original call to glm.synds or lm.synds.
- proper
 a logical value indicating whether synthetic data were generated using proper synthesis.
- population.inference
 a logical value indicating whether inference is made to population coefficients or to the results that would be expected from an analysis of the original data (see above).
- incomplete
 a logical value indicating whether the dependent variable in the model was not synthesised. It is derived in the synthpop implementation of the fitting functions (lm.synds, glm.synds, multinom.synds and polr.synds) and saved with the fitted object. When TRUE, inference with population.inference = TRUE uses the method proposed by Reiter (2003) for what he terms partially synthetic data. This method requires multiple syntheses (m > 1). If m = 1, incomplete = TRUE and population.inference = TRUE, the results will still be calculated and returned with a warning. This will usually give standard errors that are larger than they should be. This method can be forced by setting incomplete = TRUE as a parameter, because it can also be used for complete synthesis.
- fitting.function
 function used to fit the model.
- m
 the number of synthetic versions of the original (observed) data.
- coefficients
 a matrix with combined estimates. If inference is required to the results that would be obtained from an analysis of the original data (population.inference = FALSE), the coefficients are given by xpct(Beta), the standard errors by xpct(se.Beta) and the corresponding Z-statistic by xpct(z). If the synthetic data are to be used to make inferences to population quantities (population.inference = TRUE), the coefficients are given by Beta.syn, their standard errors by se.Beta.syn and the Z-statistic by z.syn (see vignette on inference for more details).
- n
 the number of cases in the original data.
- k
 the number of cases in the synthesised data. Note that if k and n are not equal and population.inference = FALSE (the default), then the standard errors produced will estimate what would be expected from an analysis of the original data set of size n.
- analyses
 summary.glm or summary.lm object respectively or a list of m such objects.
- msel
 index or indices of synthetic data copies for which summaries of fitted models are produced. If NULL, only a summary of combined estimates is produced.
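To show how the coefficients component might be used downstream, the base R sketch below rebuilds two rows of such a matrix from the population-inference output printed in the Examples (simple synthesis) and derives the Z-statistic, a two-sided p-value and a 95% interval. The interval uses a normal approximation, which is an assumption of this sketch rather than something the function returns; the object name coefs is also illustrative.

```r
# Two rows copied from the printed Beta.syn / se.Beta.syn columns
coefs <- rbind(
  sexFEMALE = c(Beta.syn = 0.2483542, se.Beta.syn = 0.1709649),
  age       = c(Beta.syn = 0.0101285, se.Beta.syn = 0.0049761)
)

# Z-statistic and two-sided p-value under a standard normal reference
z <- coefs[, "Beta.syn"] / coefs[, "se.Beta.syn"]
p <- 2 * pnorm(-abs(z))

# 95% interval under the same normal approximation (assumption of this sketch)
half_width <- qnorm(0.975) * coefs[, "se.Beta.syn"]
ci <- cbind(lower = coefs[, "Beta.syn"] - half_width,
            upper = coefs[, "Beta.syn"] + half_width)

round(cbind(coefs, z.syn = z, p = p, ci), 5)
```

The recomputed z.syn and p-values should agree with the printed columns up to rounding, which gives a quick check that the matrix is being read correctly.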
References
Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11
Raab, G.M., Nowok, B. and Dibben, C. (2017). Practical data synthesis for large samples. Journal of Privacy and Confidentiality, 7(3), 67-97. Available at: https://journalprivacyconfidentiality.org/index.php/jpc/article/view/407
Reiter, J.P. (2003). Inference for partially synthetic, public use microdata sets. Survey Methodology, 29, 181-188.
Examples
library(synthpop)
ods <- SD2011[1:1000, c("sex", "age", "edu", "ls", "smoke")]
### simple synthesis
s1 <- syn(ods, m = 5)
#> 
#> Synthesis number 1
#> --------------------
#>  sex age edu ls smoke
#> 
#> Synthesis number 2
#> --------------------
#>  sex age edu ls smoke
#> 
#> Synthesis number 3
#> --------------------
#>  sex age edu ls smoke
#> 
#> Synthesis number 4
#> --------------------
#>  sex age edu ls smoke
#> 
#> Synthesis number 5
#> --------------------
#>  sex age edu ls smoke
f1 <- glm.synds(smoke ~ sex + age + edu + ls, data = s1, family = "binomial")
summary(f1)
#> Fit to synthetic data set with 5 syntheses. Inference to coefficients
#> and standard errors that would be obtained from the original data.
#> 
#> Call:
#> glm.synds(formula = smoke ~ sex + age + edu + ls, family = "binomial", 
#>     data = s1)
#> 
#> Combined estimates:
#>                              xpct(Beta) xpct(se.Beta) xpct(z) Pr(>|xpct(z)|)  
#> (Intercept)                   1.0576149     0.5987047  1.7665        0.07731 .
#> sexFEMALE                     0.2483542     0.1560689  1.5913        0.11154  
#> age                           0.0101285     0.0045425  2.2297        0.02577 *
#> eduVOCATIONAL/GRAMMAR        -0.2018552     0.2369669 -0.8518        0.39431  
#> eduSECONDARY                  0.4805722     0.2536052  1.8950        0.05810 .
#> eduPOST-SECONDARY OR HIGHER   0.5380069     0.2934553  1.8334        0.06675 .
#> lsPLEASED                    -0.5532934     0.5227991 -1.0583        0.28991  
#> lsMOSTLY SATISFIED           -0.6650514     0.5259344 -1.2645        0.20605  
#> lsMIXED                      -0.6194228     0.5463481 -1.1338        0.25690  
#> lsMOSTLY DISSATISFIED        -1.3999122     0.5891359 -2.3762        0.01749 *
#> lsUNHAPPY                     1.6131194   242.2154785  0.0067        0.99469  
#> lsTERRIBLE                   -4.4127075   196.8215439 -0.0224        0.98211  
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(f1, population.inference = TRUE)
#> Fit to synthetic data set with 5 syntheses. Inference to population
#> coefficients when all variables in the model are synthesised.
#> 
#> Call:
#> glm.synds(formula = smoke ~ sex + age + edu + ls, family = "binomial", 
#>     data = s1)
#> 
#> Combined estimates:
#>                                Beta.syn se.Beta.syn   z.syn Pr(>|z.syn|)  
#> (Intercept)                   1.0576149   0.6558482  1.6126      0.10683  
#> sexFEMALE                     0.2483542   0.1709649  1.4527      0.14632  
#> age                           0.0101285   0.0049761  2.0354      0.04181 *
#> eduVOCATIONAL/GRAMMAR        -0.2018552   0.2595842 -0.7776      0.43680  
#> eduSECONDARY                  0.4805722   0.2778106  1.7299      0.08366 .
#> eduPOST-SECONDARY OR HIGHER   0.5380069   0.3214642  1.6736      0.09421 .
#> lsPLEASED                    -0.5532934   0.5726977 -0.9661      0.33399  
#> lsMOSTLY SATISFIED           -0.6650514   0.5761322 -1.1543      0.24836  
#> lsMIXED                      -0.6194228   0.5984944 -1.0350      0.30068  
#> lsMOSTLY DISSATISFIED        -1.3999122   0.6453660 -2.1692      0.03007 *
#> lsUNHAPPY                     1.6131194 265.3337627  0.0061      0.99515  
#> lsTERRIBLE                   -4.4127075 215.6071988 -0.0205      0.98367  
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
### proper synthesis
s2 <- syn(ods, m = 5, method = "parametric", proper = TRUE)
#> 
#> Synthesis number 1
#> --------------------
#>  sex age edu ls smoke
#> 
#> Synthesis number 2
#> --------------------
#>  sex age edu ls smoke
#> 
#> Synthesis number 3
#> --------------------
#>  sex age edu ls smoke
#> 
#> Synthesis number 4
#> --------------------
#>  sex age edu ls smoke
#> Warning: NaNs produced
#> 
#> 
#> Synthesis number 5
#> --------------------
#>  sex age edu ls smoke
f2 <- glm.synds(smoke ~ sex + age + edu + ls, data = s2, family = "binomial")
summary(f2)
#> Fit to synthetic data set with 5 syntheses. Inference to coefficients
#> and standard errors that would be obtained from the original data.
#> 
#> Call:
#> glm.synds(formula = smoke ~ sex + age + edu + ls, family = "binomial", 
#>     data = s2)
#> 
#> Combined estimates:
#>                              xpct(Beta) xpct(se.Beta) xpct(z) Pr(>|xpct(z)|)   
#> (Intercept)                   4.0247714   180.8090534  0.0223       0.982241   
#> sexFEMALE                     0.3772038     0.1618108  2.3311       0.019746 * 
#> age                           0.0133479     0.0047414  2.8152       0.004875 **
#> eduVOCATIONAL/GRAMMAR        -0.2580407     0.2499416 -1.0324       0.301883   
#> eduSECONDARY                  0.5116834     0.2720637  1.8807       0.060006 . 
#> eduPOST-SECONDARY OR HIGHER   0.3528590     0.2996747  1.1775       0.239006   
#> lsPLEASED                    -3.4429469   180.8087728 -0.0190       0.984808   
#> lsMOSTLY SATISFIED           -3.7626884   180.8087802 -0.0208       0.983397   
#> lsMIXED                      -4.1819517   180.8088312 -0.0231       0.981547   
#> lsMOSTLY DISSATISFIED        -4.0879885   180.8090302 -0.0226       0.981962   
#> lsUNHAPPY                     7.2638173   678.5999376  0.0107       0.991460   
#> lsTERRIBLE                   -7.3409068   326.4755793 -0.0225       0.982061   
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(f2, population.inference = TRUE)
#> Fit to synthetic data set with 5 syntheses. Inference to population
#> coefficients when all variables in the model are synthesised.
#> 
#> Call:
#> glm.synds(formula = smoke ~ sex + age + edu + ls, family = "binomial", 
#>     data = s2)
#> 
#> Combined estimates:
#>                                Beta.syn se.Beta.syn   z.syn Pr(>|z.syn|)  
#> (Intercept)                   4.0247714 213.9361570  0.0188      0.98499  
#> sexFEMALE                     0.3772038   0.1914571  1.9702      0.04882 *
#> age                           0.0133479   0.0056101  2.3793      0.01735 *
#> eduVOCATIONAL/GRAMMAR        -0.2580407   0.2957349 -0.8725      0.38291  
#> eduSECONDARY                  0.5116834   0.3219101  1.5895      0.11194  
#> eduPOST-SECONDARY OR HIGHER   0.3528590   0.3545799  0.9951      0.31966  
#> lsPLEASED                    -3.4429469 213.9358251 -0.0161      0.98716  
#> lsMOSTLY SATISFIED           -3.7626884 213.9358339 -0.0176      0.98597  
#> lsMIXED                      -4.1819517 213.9358942 -0.0195      0.98440  
#> lsMOSTLY DISSATISFIED        -4.0879885 213.9361296 -0.0191      0.98475  
#> lsUNHAPPY                     7.2638173 802.9302744  0.0090      0.99278  
#> lsTERRIBLE                   -7.3409068 386.2911149 -0.0190      0.98484  
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1