Databases are an integral part of any organisation's operations. They act as storage repositories for the large sets of data that the organisation relies upon to operate efficiently. One of the commonly deployed open-source database applications is MySQL. It collects and stores a large set of log files (a rich record of user and data activity) related to the data operations performed on the databases it hosts. These log files can be used to build a full or partial data provenance record.
This project will build two MySQL databases populated with synthetic data and subjected to synthetic activity for a pre-defined duration. The first database will collect only the standard log files related to database operations. The second database will additionally deploy a provenance collection framework that collects the relevant provenance records. Subsequently, the database log files will be analysed and converted into a provenance record that is as close as possible to the one collected by the deployed provenance collection framework. This work aims to show whether provenance records can be built as rigorously from the database log files alone.
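To make the log-to-provenance conversion step more concrete, the following is a minimal Python sketch, not the project's actual method. It assumes log lines shaped like the MySQL general query log (timestamp, thread id, command, argument); the exact layout depends on the MySQL version and configuration. The names ProvenanceRecord and log_to_provenance, and the agent/activity/entity fields, are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Assumed line shape: "<timestamp> <thread id> <command> <argument>".
# Only three command types are handled in this sketch.
LINE_RE = re.compile(
    r"^(?P<time>\S+)\s+(?P<thread>\d+)\s+"
    r"(?P<command>Connect|Query|Quit)\s*(?P<argument>.*)$"
)

# Very rough detection of data-modifying statements and their target table.
WRITE_RE = re.compile(
    r"^\s*(INSERT\s+INTO|UPDATE|DELETE\s+FROM)\s+`?(\w+)`?", re.IGNORECASE
)


@dataclass
class ProvenanceRecord:
    """Hypothetical record: who (agent) did what (activity) to which table (entity)."""
    time: str
    agent: str
    activity: str
    entity: str


def log_to_provenance(lines):
    """Turn general-log-style lines into a list of ProvenanceRecord objects."""
    agents = {}   # thread id -> account seen on the Connect line
    records = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # header lines or anything this sketch does not understand
        thread = match["thread"]
        if match["command"] == "Connect":
            # The argument is assumed to start with "user@host".
            agents[thread] = match["argument"].split(" ", 1)[0]
        elif match["command"] == "Query":
            write = WRITE_RE.match(match["argument"])
            if write is not None:
                activity = write.group(1).split()[0].upper()
                records.append(
                    ProvenanceRecord(
                        time=match["time"],
                        agent=agents.get(thread, "unknown"),
                        activity=activity,
                        entity=write.group(2),
                    )
                )
    return records


if __name__ == "__main__":
    # Synthetic sample lines, invented for this example.
    sample = [
        "2024-01-15T10:23:45.123456Z\t    8 Connect\tapp@localhost on shop using TCP/IP",
        "2024-01-15T10:23:46.000001Z\t    8 Query\tINSERT INTO orders VALUES (1, 'book')",
        "2024-01-15T10:23:47.000002Z\t    8 Query\tSELECT * FROM orders",
    ]
    for record in log_to_provenance(sample):
        print(record)
```

Read-only statements such as SELECT are skipped here purely to keep the sketch short; whether reads belong in the provenance record is one of the design questions the project would need to settle when comparing against the output of the provenance collection framework.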
The student should have an interest in, and a willingness to learn, basic MySQL; prior knowledge of basic MySQL log management would be ideal. Ideally, the student would also be familiar with the Linux environment, have a firm grasp of the C or C++ programming language, and enjoy experimenting with operating systems, low-level system calls, etc. The student should also be a self-motivated, responsible self-starter with good time-keeping and strong writing skills. We will use git and LaTeX to write up the results; prior experience of these would be helpful but is not required.
It is intended that once the prototype is built, a controlled lab trial will be conducted, and we anticipate submitting a conference paper for publication based on the implementation and subsequent trials; the author of the code would be a co-author of this paper.