[Taken] Building Data Provenance from Database Log Files

This project is already taken.

Starting Date: June 2017
Duration: 3 Months
Time commitment: 20h/week
Prerequisites: Second year

Databases are an integral part of any organisation's operations. They act as storage repositories for the large sets of data that the organisation relies upon to operate efficiently. One commonly deployed open-source database application is MySQL. It collects and stores a large set of log files (a rich record of user and data activity) related to the data operations performed on the databases it hosts. These log files can be used to build full or partial data provenance.

Data provenance is the practice of recording the history of data, from its inception through the various stages of the data lifecycle. It provides a detailed picture of how a data item was collected, where it was stored and how it was used. Such information can be useful for data auditing and for understanding whether the organisation is following its own stated data privacy policy.
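To make the idea concrete, here is a minimal Python sketch of how a single log entry might be turned into a simple provenance record. It assumes lines shaped roughly like MySQL's general query log (timestamp, connection id, command, argument); the exact layout depends on the MySQL version and configuration, so the pattern should be checked against real logs. The names ProvenanceRecord, parse_line and SAMPLE_LOG, the sample lines and the choice of record fields are all hypothetical and are not part of the project specification.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed line shape: "<timestamp> <connection id> <command> <argument>".
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<thread>\d+)\s+(?P<command>\w+)\s+(?P<argument>.*)$"
)
# Very rough guess at the table a statement touches (illustrative only).
TABLE = re.compile(r"\b(?:INTO|FROM|UPDATE)\s+`?(?P<table>\w+)`?", re.IGNORECASE)


@dataclass(frozen=True)
class ProvenanceRecord:
    timestamp: str         # when the data item was touched
    session: int           # which connection performed the operation
    operation: str         # how it was used (INSERT, SELECT, ...)
    table: Optional[str]   # where it is stored, if it could be guessed


def parse_line(line: str) -> Optional[ProvenanceRecord]:
    """Return a record for 'Query' entries, or None for anything else."""
    match = LOG_LINE.match(line.strip())
    if match is None or match["command"] != "Query":
        return None
    statement = match["argument"].strip()
    if not statement:
        return None
    operation = statement.split(None, 1)[0].upper()
    table = TABLE.search(statement)
    return ProvenanceRecord(
        timestamp=match["ts"],
        session=int(match["thread"]),
        operation=operation,
        table=table["table"] if table else None,
    )


# Hypothetical sample lines, written by hand for this sketch.
SAMPLE_LOG = [
    "2017-06-01T10:15:30.123456Z\t    8 Connect\tapp@localhost on shop",
    "2017-06-01T10:15:31.000100Z\t    8 Query\tINSERT INTO customers (name) VALUES ('Ann')",
    "2017-06-01T10:15:32.000200Z\t    8 Query\tSELECT name FROM customers WHERE id = 1",
    "2017-06-01T10:15:33.000300Z\t    8 Quit\t",
]

if __name__ == "__main__":
    for entry in SAMPLE_LOG:
        record = parse_line(entry)
        if record is not None:
            print(record)
```

A regular expression is used here only to keep the sketch short; a real implementation would likely need a proper SQL parser and would have to handle multi-line statements.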

This project will build two sets of MySQL databases with synthetic data and activities for a pre-defined duration. One database will collect only the standard log files related to database operations, whereas the second will also deploy a provenance collection framework that collects relevant provenance records. Subsequently, the database log files will be analysed and converted into a provenance record that is as close as possible to the one collected by the deployed provenance collection framework. This work aims to show whether provenance records can be built as rigorously from the database log files alone.
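One possible way to quantify how close the log-derived records come to the framework's records is a simple set comparison. The Python sketch below is only an illustration: the project does not specify a comparison metric, so the use of recall and precision, the Event fields and the compare function are all assumptions made for this example.

```python
from typing import Dict, Iterable, NamedTuple


class Event(NamedTuple):
    """Hypothetical minimal provenance event shared by both sources."""
    session: int
    operation: str
    table: str


def compare(reference: Iterable[Event], derived: Iterable[Event]) -> Dict[str, float]:
    """Compare log-derived events against the framework's reference events."""
    ref, der = set(reference), set(derived)
    matched = ref & der
    return {
        "matched": len(matched),
        "missing": len(ref - der),    # in the framework output, not recovered from logs
        "spurious": len(der - ref),   # produced from logs, absent from the framework output
        "recall": len(matched) / len(ref) if ref else 1.0,
        "precision": len(matched) / len(der) if der else 1.0,
    }


if __name__ == "__main__":
    # Hand-written example data, not results from any experiment.
    framework = [
        Event(8, "INSERT", "customers"),
        Event(8, "SELECT", "customers"),
        Event(9, "UPDATE", "orders"),
        Event(9, "DELETE", "orders"),
    ]
    from_logs = [
        Event(8, "INSERT", "customers"),
        Event(8, "SELECT", "customers"),
        Event(9, "UPDATE", "orders"),
        Event(9, "SELECT", "orders"),
    ]
    print(compare(framework, from_logs))
```

Treating records as sets ignores ordering and repeated operations; a stricter comparison would also need timestamps or sequence numbers.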

The student should have an interest in and willingness to learn basic MySQL, and ideally would have prior knowledge of basic MySQL log management. Ideally, the student would also be familiar with the Linux environment, have a firm grasp of the C or C++ programming language, and enjoy experimenting with operating systems, low-level system calls, etc. The student should be a responsible, self-motivated self-starter with good time-keeping and strong writing skills. We will use git and LaTeX to write up the results; prior experience of these would be helpful but is not required.

It is intended that once the prototype is built, a controlled lab trial will be conducted, and we anticipate a conference paper being submitted for publication based on the implementation and subsequent trials; the author of the code would be a co-author of this paper.