Privacy Issues Related to Data Provenance of Databases Containing End-user Data

Starting Date: June 2018
Prerequisites: Second Year
Will results be assigned to University:

Privacy issues related to stored data about end-users are well understood and well studied. However, a large body of data can also be collected that does not describe the user directly but instead describes the user's data: this is known as data provenance.

Data provenance is the practice of recording the history of data, from its inception through the various stages of the data lifecycle. It provides a detailed picture of how a data item was collected, where it was stored and how it was used. Such information can be useful for data auditing and for understanding whether an organisation is following its own stated data privacy policy.
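As a rough illustration of what a provenance record might hold, the sketch below models one event per action on a data item and reconstructs the item's history in time order. The class name, field names and sample values are hypothetical and are not taken from any particular provenance framework.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class ProvenanceEvent:
    # All field names here are assumptions made for illustration.
    item_id: str         # identifier of the data item
    action: str          # e.g. "collected", "stored", "used"
    location: str        # form, table or report involved
    actor: str           # process or role that performed the action
    timestamp: datetime  # when the action happened


def history(log: List[ProvenanceEvent], item_id: str) -> List[ProvenanceEvent]:
    """Return the events recorded for one data item, oldest first."""
    return sorted((e for e in log if e.item_id == item_id),
                  key=lambda e: e.timestamp)


def utc(day: int) -> datetime:
    """Helper for building synthetic timestamps in June 2018."""
    return datetime(2018, 6, day, tzinfo=timezone.utc)


# Synthetic log, deliberately out of order.
log = [
    ProvenanceEvent("item-1", "used", "billing_report", "analytics_job", utc(3)),
    ProvenanceEvent("item-1", "collected", "signup_form", "web_frontend", utc(1)),
    ProvenanceEvent("item-1", "stored", "users_table", "db_writer", utc(2)),
    ProvenanceEvent("item-2", "collected", "signup_form", "web_frontend", utc(2)),
]

trail = [e.action for e in history(log, "item-1")]
```

Here `trail` lists the lifecycle stages of `item-1` in order, which is the kind of view an auditor would compare against the stated privacy policy.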

This project will deploy a database with a traditional data provenance framework and then populate it with synthetic data. The aim is to understand whether data provenance records can violate the privacy policies of an organisation and weaken the privacy afforded to end-users. Furthermore, if a third party has access only to the data provenance records, can they violate the privacy requirements of an individual, as stipulated by the data governance policies of the respective organisation?
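The question about a third party can be made concrete with a small sketch. It assumes a hypothetical schema with one user table and one provenance table, and uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table names, column names, actor labels and synthetic rows are all invented for illustration.

```python
import sqlite3

# In-memory database standing in for the project's MySQL deployment.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_data (
    user_id   INTEGER PRIMARY KEY,
    name      TEXT,
    diagnosis TEXT
);
CREATE TABLE provenance (
    event_id    INTEGER PRIMARY KEY,
    user_id     INTEGER,
    action      TEXT,
    actor       TEXT,
    occurred_at TEXT
);
""")

# Synthetic user records; the third party never sees this table.
conn.executemany("INSERT INTO user_data VALUES (?, ?, ?)", [
    (1, "Synthetic User A", "condition-x"),
    (2, "Synthetic User B", None),
])

# Synthetic provenance records; this is all the third party sees.
conn.executemany(
    "INSERT INTO provenance (user_id, action, actor, occurred_at) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "collected", "registration_desk", "2018-06-01"),
        (1, "read", "specialist_clinic", "2018-06-02"),
        (1, "read", "specialist_clinic", "2018-06-09"),
        (2, "collected", "registration_desk", "2018-06-01"),
    ],
)


def users_seen_by(connection, actor):
    """List user ids whose records a given actor touched, using provenance only."""
    rows = connection.execute(
        "SELECT DISTINCT user_id FROM provenance WHERE actor = ? ORDER BY user_id",
        (actor,),
    ).fetchall()
    return [row[0] for row in rows]


exposed = users_seen_by(conn, "specialist_clinic")
```

In this toy case the query never reads `user_data`, yet `exposed` singles out the user whose record was repeatedly read by the specialist clinic, which hints at a sensitive attribute. Whether such inferences breach a real governance policy is what the project would investigate.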

The student should have an interest in, and a willingness to learn, basic data provenance, and should have prior knowledge of basic MySQL. Ideally, the student would be familiar with user privacy requirements and have a firm grasp of the C, C++, Java or C# programming language. Good time-management and strong writing skills are important. We would use git and LaTeX to write up the results; prior experience of these tools would be helpful but is not required.

It is intended that, once the implementation is working, it can be used for practical trials using synthetic data. We anticipate that a conference paper may be submitted for publication based on the implementation and subsequent trials; the author of the code would be a co-author of this paper.