A presentation by Christian Arnold (Cardiff University). This event is part of the Digital Governance Research Colloquium. Prior registration is not required.
Social scientists routinely use data that contains sensitive information, e.g. micro-level data about individuals. However, researchers face a dilemma: while data has to be publicly available to make research reproducible, information about individuals needs to be protected. Synthetic copies of the original data can address this concern because, ideally, they contain all relevant statistical characteristics without disclosing private information. But generating synthetic data that preserves statistical relationships, including possibly undiscovered ones, and protects privacy at the same time is challenging. This paper surveys, first, the available mechanisms for generating synthetic data and, second, the status quo in controlling the amount of private information that is released. We experimentally evaluate the trade-offs of these methods for typical data challenges that social scientists face, such as discrete variables, structural zeros, multi-collinearity or complex dependencies. We then summarise the consequences for core metrics relevant to social scientists, specifically regression coefficients, marginal distributions and correlation structures. Our findings suggest that while some initial challenges have been met, the generating algorithms still need to improve the usefulness of the resulting synthetic data. We hope to encourage interdisciplinary work between computer scientists and social scientists to develop more powerful algorithms in the future.
Christian Arnold's research focuses on institutions in governance. What drives and determines the rules of political decision-making? Using data-driven methods from statistics and machine learning, he works at the intersection of social science and computer science. Prior to joining Cardiff, he held a position at Oxford University and worked as a Data Scientist in industry. He graduated with a PhD in Political Science from the University of Mannheim.
This new research colloquium is a joint initiative of the Hertie School's Digital Governance Centre and the Data Science Lab. It brings together the Hertie School's research community in the areas of digital governance, digital government and data science. It offers a forum for debating current research on key questions of digital governance, digital government and AI with an interdisciplinary audience.