Building Data Provenance from Database Log Files

Starting Date: June 2017
Duration: 8-10 weeks
Time commitment: 20h/week
Prerequisites: Second Year

Databases are an integral part of any organisation's operations. They act as storage repositories for the large sets of data that the organisation relies upon to operate efficiently. One commonly deployed open-source database application is MySQL. It collects and stores a large set of log files related to the data operations performed on the databases it hosts; together these form a rich record of user and data activity. The log files can be used to build a full or partial data provenance record.

Data provenance is the field of recording the history of data, from its inception through the various stages of the data lifecycle. Data provenance provides a detailed picture of how a data item was collected, where it was stored and how it was used. Such information can be useful for data auditing and for understanding whether the organisation is following its own stated data privacy policy.

This project will build two MySQL databases populated with synthetic data and subjected to synthetic activity for a pre-defined duration. The first database will collect only the standard log files related to database operations, whereas the second will also deploy a provenance collection framework that collects the relevant provenance records. Subsequently, the database log files will be analysed and converted into a provenance record that is as close as possible to the one collected by the deployed provenance collection framework. This work aims to show whether provenance records can be built as rigorously from database log files as from a dedicated provenance collection framework.
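As a rough illustration of the log-to-provenance conversion step, the Python sketch below turns a few general query log lines into simple provenance records. It is only a sketch under stated assumptions: the line layout (timestamp, thread id, command, argument), the sample log lines, the `ProvenanceRecord` fields and the activity vocabulary are all hypothetical choices made for this example, not the project's actual design or the output format of any particular provenance framework.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Assumed layout of one general query log entry (an assumption for this sketch):
#   <ISO timestamp> <thread id> <command> <argument>
LOG_LINE = re.compile(
    r"^(?P<time>\S+)\s+(?P<thread>\d+)\s+"
    r"(?P<command>Query|Connect|Quit)\s*(?P<argument>.*)$"
)

# Hypothetical mapping from the leading SQL keyword to a coarse activity label.
ACTIVITY = {"INSERT": "create", "SELECT": "read",
            "UPDATE": "modify", "DELETE": "delete"}

# Very rough table-name extraction; real SQL would need a proper parser.
TABLE = re.compile(r"\b(?:FROM|INTO|UPDATE)\s+`?(\w+)`?", re.IGNORECASE)


@dataclass
class ProvenanceRecord:
    """Hypothetical minimal provenance record derived from one log entry."""
    timestamp: str
    thread_id: int
    activity: str
    table: Optional[str]
    statement: str


def parse_line(line: str) -> Optional[ProvenanceRecord]:
    """Return a record for a data-touching query line, otherwise None."""
    match = LOG_LINE.match(line.strip())
    if match is None or match["command"] != "Query":
        return None
    statement = match["argument"].strip()
    if not statement:
        return None
    keyword = statement.split(None, 1)[0].upper()
    activity = ACTIVITY.get(keyword)
    if activity is None:
        return None  # e.g. SET or SHOW statements are ignored here
    table_match = TABLE.search(statement)
    return ProvenanceRecord(
        timestamp=match["time"],
        thread_id=int(match["thread"]),
        activity=activity,
        table=table_match.group(1) if table_match else None,
        statement=statement,
    )


def build_provenance(lines: Iterable[str]) -> List[ProvenanceRecord]:
    """Convert log lines into provenance records, skipping unparsed lines."""
    return [rec for rec in map(parse_line, lines) if rec is not None]


# Synthetic sample lines written for this example, not taken from a real server.
SAMPLE = [
    "2017-06-01T10:15:29.000000Z\t   12 Connect\troot@localhost on test",
    "2017-06-01T10:15:30.000000Z\t   12 Query\tINSERT INTO users (name) VALUES ('alice')",
    "2017-06-01T10:15:31.000000Z\t   12 Query\tSELECT name FROM users WHERE id = 1",
    "2017-06-01T10:15:32.000000Z\t   12 Query\tUPDATE users SET name = 'bob' WHERE id = 1",
    "2017-06-01T10:15:33.000000Z\t   12 Quit\t",
]

if __name__ == "__main__":
    for record in build_provenance(SAMPLE):
        print(record.timestamp, record.activity, record.table)
```

Comparing records like these against the output of a dedicated provenance collection framework would show what the log-based approach misses, for example who initiated a session or which rows a statement actually touched.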

The student should have an interest in, and a willingness to learn, basic MySQL; prior knowledge of basic MySQL log management would be ideal. Ideally, the student would also be familiar with the Linux environment, have a firm grasp of the C or C++ programming language, and enjoy experimenting with operating systems, low-level system calls, etc. The student should be a responsible, self-motivated self-starter with good time-keeping and strong writing skills. We will use Git and LaTeX to write up the results; prior experience of these would be helpful but is not required.

It is intended that, once the prototype is built, a controlled lab trial will be conducted. We anticipate that a conference paper based on the implementation and subsequent trials may be submitted for publication; the author of the code would be a co-author of this paper.

As part of the project, you will work with an experienced and dedicated team of researchers who encourage innovative thinking and encourage students to take ownership of their work. You will be given the necessary support throughout the project period, with regular meetings, blackboard sessions, and guidance on how to carry out research effectively. This project is part of a much larger EPSRC-funded project, so you would have an opportunity to work on and contribute to a research project with real-world significance and impact. In a previous year’s project, a student was a co-inventor on the patent application generated by the respective UROP project and also a co-author of the related research paper.