Department seminar
Bei Jiang
Link to join seminar: Hosted on Zoom
Synthetic data generation: balancing between data utility and privacy preservation
There is a growing expectation that data collected by government-funded studies should be openly available to ensure research reproducibility, but open sharing also heightens concerns about data privacy. Synthetic data generation is emerging rapidly as a practical way to protect privacy while sharing research data. The idea is not new. In the context of surveys, Rubin (1993) proposed the first model-based synthetic method within the framework of multiple imputation (MI), which replaces the original data with imputed values drawn from posterior predictive distributions. However, information loss due to incorrectly specified imputation models can weaken or invalidate the inferences obtained from the synthetic datasets. In this talk, we will discuss a new data-augmented MI (DA-MI) synthetic framework that remedies this issue and introduces a tuning mechanism for balancing data utility against privacy protection in the final synthetic datasets. Our numerical investigations demonstrate that the DA-MI synthetic framework facilitates sharing of useful research data while protecting participants' identities.