The UKRI CDT in ART-AI’s ‘Synthetic Data for Privacy, Security and Augmentation’ event took place on the 11th and 12th January 2022.
Each day featured three to four talks by researchers and professional users of synthetic data, followed by challenge-led discussions on privacy, security and augmentation. The event was a success: it prompted discussion of the opportunities and challenges in using synthetic data for AI and identified common interests amongst attendees.
One participant commented “I enjoyed having more of a balance of technical vs social science content, instead of predominant social science content”, whilst another said “I have many new ideas on how synthetic data can be used beyond the facilitation of data sharing.”
As a result of this workshop we are hoping to produce a green paper on “Synthetic Data: Privacy, Security and Augmentation”. If you would like to be involved, or would like more information, please e-mail: [email protected].
For videos of the speakers’ presentations, please click on the relevant talk in the programme below.
Main Image by Professor Peter Hall, University of Bath
Tuesday 11th January
13:00 Welcome and Introduction Alan Hunter, University of Bath, ART-AI Theme Lead – Partnerships and International Relations and Academic Supervisor
16:10 Challenge-led discussions [Themes: Privacy, Security and Augmentation] – Thales, StatCan
- Thales – How do we generate underwater sounds (ships, biological, natural) to augment databases for a classification algorithm when there is so little labelled data?
- StatCan – How can synthetic data support bilateral and multilateral data sharing without unintended revelation?
17:10 Closing remarks Eamonn O’Neill, ART-AI Centre Director & Academic Supervisor
17:20 End of day 1
Wednesday 12th January
09:15 Welcome and Introduction Julian Padget, University of Bath, ART-AI Theme Lead – Innovations in AI Technologies & Academic Supervisor
11:50 Challenge-led discussions [Themes: Privacy, Security and Augmentation] – Airbus, Dstl, Xi Chen
- Airbus – What are the challenges of ‘Making Synthetic Data Trustable’?
- Dstl – How do we stay ahead of the security game in a ML world?
- Xi Chen – Synthetic data for AI model training: advantages & disadvantages
12:50 Closing remarks Eamonn O’Neill, ART-AI Centre Director & Academic Supervisor
13:00 End of day 2
Abstracts and Bios
Abstract ‘A Statistics Canada Perspective of Synthetic Data’
Statistics Canada supports several different access solutions that allow researchers to have access to the key information that they need. The ‘Five Safes Framework’ is used to ensure that these access solutions are safe and that the information being accessed does no harm to the privacy of Canadians. Open data is one solution where researchers can have access to microdata in an uncontrolled working environment. Public-use files and synthetic dummy files have been made available for decades. More recent developments in synthetic data aim to have information that maintains the analytic utility of the original data without disclosing personal information. This workshop will highlight how synthetic data of high analytic value fits into the access landscape, some of the challenges in creating safe yet useful open data, as well as some of the methods and tools available for those looking to produce safe open data sets.
Steven Thomas is the section chief in charge of the Centre for Confidentiality and Access at Statistics Canada. He has a degree in statistics from Memorial University of Newfoundland and has worked at Statistics Canada since 1997. The confidentiality and access group is responsible for developing statistical disclosure control strategies that support the safe release of aggregated tabular outputs for both economic and social statistics. The group is also responsible for developing and supporting the release of safe open data sets including the standard Public Use Microdata Files as well as synthetic datasets.
Abstract ‘Fake It Till You Make It: Face analysis in the wild using synthetic data alone’
We show that it is possible to perform face-related computer vision in the wild using synthetic data alone. The community has long enjoyed the benefits of synthesizing training data with graphics, but the domain gap between real and synthetic data has remained a problem, especially for human faces. We show that it is possible to synthesize data with minimal domain gap so that models trained on synthetic data generalize to real in-the-wild datasets. Using synthetic data, we train machine learning systems for landmark localization and face parsing, showing that synthetic data can both match real data in accuracy and open up new approaches where manual labelling would be impossible.
Erroll is a Principal Scientist at Microsoft’s Mixed-Reality and AI Lab in Cambridge, UK. There, he has worked on hand tracking for HoloLens 2, avatars for Microsoft Mesh, synthetic data for face tracking, and Holoportation. Before that, he did his PhD at the University of Cambridge, on Gaze Estimation with Graphics.
Abstract ‘Coupling rendering and generative adversarial networks for artificial sonar image generation’
Generative adversarial networks, or GANs, are a popular method for creating realistic but synthetic data. One pitfall of these methods is the inability to create fine-controlled scene content. To circumvent this issue, we demonstrate how simulation and GANs can be used in tandem to create realistic-looking but synthetic data with fine control over the scene content. We show results for a seabed remote sensing task in which we use a low-fidelity but off-the-shelf optical simulator and an unstructured database of sonar images to seed a GAN in making realistic imagery of the seafloor. We show several facets of our results compared to real sonar imagery and direct the audience to current research thrusts in this area.
Isaac D. Gerg received the B.S. degree (Hons.) in computer engineering and the M.S. degree in electrical engineering from The Pennsylvania State University (PSU) in 2004 and 2008, respectively. He is currently pursuing a PhD degree in electrical engineering with the Information Processing and Algorithms Laboratory, PSU. He is also with the Pennsylvania State University Applied Research Laboratory (PSU-ARL), where his research interests include remote sensing and machine learning. Isaac was the co-recipient of the PSU-ARL Engineering Award of Excellence for his work on synthetic aperture sonar beamforming.
‘How to bulletproof privacy protection with synthetic data’
Alexandra Ebert is an ethical AI & privacy expert and serves as Chief Trust Officer at MOSTLY AI. She hosts the Data Democratization Podcast and chairs the IEEE IC expert group on Synthetic Data. Alexandra is engaged in public policy discussions in the emerging field of synthetic data & responsible AI and a regular speaker at international AI, privacy & digital banking conferences. Besides being an advocate for privacy protection, Alexandra is deeply passionate about ensuring the fair and responsible use of AI algorithms. She co-authored an ICLR paper and a popular blog series on fair synthetic data & fairness in AI, which was featured by Forbes and other leading business magazines. Moreover, she serves as AI expert for the #humanAIze initiative, which aims to make AI more inclusive and accessible to everyone.
‘Practical Differentially Private Generative Modelling’
Nicolas Grislain is Chief Science Officer at Sarus Technologies. He graduated from École Normale Supérieure de Lyon in Mathematics and Computer Science. Nicolas started his career in economics and finance modelling at the French Treasury and then at Société Générale. In 2012 he co-founded his first company, AlephD, where he also led Research and Development. AlephD was acquired by Yahoo in 2016. In 2020 he co-founded Sarus Technologies with the same founding team as AlephD. Nicolas has a strong taste for, and experience in, AI, data science, technical team management and entrepreneurship. He is skilled in applied mathematics, software development, DevOps, quantitative finance and economics.
‘The Nuts and Bolts of Building Out a Radiology/Pathology/Genomics Synthetic Dataset At Scale’
Hugh Lyshkow is Co-Founder and CEO of DesAcc, a company that has been building software solutions for better utilising healthcare data for over 28 years. Starting as a neuroscientist at the Kyoto University School of Medicine, Hugh realised that driving healthcare innovation meant understanding how medical imaging and health data are stored, including their file formats and storage methods. He believes that it is only by being able to access Radiology, Pathology and Genomics data at Petabyte scale that true innovations in Precision Medicine will be made, unlocking data in order to inspire cures.
Abstract ‘Moving towards practical user-friendly synthesis: Scalable synthetic data methods for large confidential administrative databases using saturated count models’
When releasing data, respondents’ privacy is protected through statistical disclosure control (SDC). Over the past three decades, the use of synthetic data (Rubin, 1993; Little, 1993) for SDC has continually developed. Methods have adapted to account for different data types, but mainly within the domain of survey data sets.
Administrative databases are being increasingly considered as a way to fuel researchers’ demand for data. Such databases are inherently confidential: respondents provide their information for administrative purposes, not for statistical analyses. There are characteristics of administrative databases which present challenges for synthetic data generation. First, these databases tend to be comprised of categorical variables, some with many categories, which, when converted from microdata into frequency tables, give rise to large, sparse tables. Also, these tables can have structural zeros, that is, unobservable combinations of levels.
The categorical nature of administrative databases allows the synthesis to be undertaken at the tabular level rather than at the individual level. We show that fitting saturated models not only allows administrative databases to be synthesized quickly, but also allows risk and utility to be formalised in a manner inherently unfeasible in other techniques. The flexibility afforded by multi-parameter count distributions, such as the negative binomial, can be utilised to protect respondents’ privacy. The synthesis distribution’s parameters can be adjusted to achieve, analytically, certain criteria post-synthesis, for example, the probability that a count of one is synthesized to one, which is equivalent to saying that a unique in the original data is also a unique in the synthetic data. Finally, we give an empirical example, synthesising a database that can be viewed as a surrogate to the English School Census. (This data set was constructed using information from various public sources, primarily 2011 census tables.)
Robin is an ONS Senior Lecturer in Statistics. Prior to his appointment at Cardiff University he was a Lecturer at Lancaster University and, prior to that, a Lecturer at the University of Southampton.
Robin’s main research areas deal with problems arising from missing data and data confidentiality. He also has interests in Bayesian methods more generally. He enjoys working collaboratively, both with colleagues in academia and with non-academic partners. His previous collaborations have included working with the Office for National Statistics and the National Health Service Blood and Transplant as well as the Institute of Employment Research in Germany.
Robin is an active member of the Royal Statistical Society (RSS). He was Chair of the Medical Section from January 2017 to January 2020 and is currently an RSS Council member. In 2020 Robin also taught a course for the African Institute for Mathematical Sciences that was supported by the RSS.