Data Provenance for Multi-Database Servers Enterprise Architecture

Starting Date: June 2018
Duration: 8-10 weeks
Time commitment: 20h/week
Prerequisites: Second Year

Enterprise architecture is increasingly based on multiple databases that split the enterprise data among themselves and store it on separate database servers. Such a scheme enables effective load balancing and management of enterprise data. However, splitting data over multiple databases makes it challenging to build a unified data provenance view of the data collected, managed and used by an organisation.

Data provenance refers to records of the inputs, entities, systems, and processes that influence data of interest, providing a historical record of the data and its origins. To provide a holistic view of data provenance in an enterprise system, the provenance records of the activities carried out on a client workstation are important.

If an enterprise handles EU residents’ personal data, it has to provide mechanisms that allow an individual to exercise the rights that are afforded by the GDPR. The ability to search, discover, review and delete data is a critical component of GDPR compliance. But if that data is stored in multiple systems, and potentially shared with multiple partners, these tasks become dramatically more complex, requiring the technological ability to find and address all affected data promptly. All of these rights require a new level of enterprise-wide data mapping, data governance, data architecture and system management.

The aim is to collect data provenance records from multiple databases and build a unified provenance record. For this, multiple database servers will be deployed to host a split database, populated with synthetic data over a period of time. Provenance for the activities generated during this period would be collected individually from each database server and then merged into the final provenance repository.
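To make the collect-then-merge idea concrete, the following is a minimal Python sketch, not the project's actual design. The record fields, the server names, the entity identifiers and the functions `merge_provenance` and `history_of` are all hypothetical, and the sketch assumes that the servers' clocks are synchronised well enough for timestamps to be compared across servers.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ProvenanceRecord:
    """One hypothetical provenance entry captured on a single database server."""
    timestamp: float  # when the activity happened (seconds since the epoch)
    server: str       # which database server recorded the activity
    activity: str     # e.g. "INSERT", "UPDATE", "DELETE"
    entity: str       # identifier of the affected data item


def _sort_key(record: ProvenanceRecord):
    # Order by time; break ties by server name so the result is deterministic.
    return (record.timestamp, record.server)


def merge_provenance(*streams: Iterable[ProvenanceRecord]) -> List[ProvenanceRecord]:
    """Merge per-server provenance streams into one unified timeline."""
    # Sort each stream first, because heapq.merge requires sorted inputs.
    ordered = [sorted(stream, key=_sort_key) for stream in streams]
    return list(heapq.merge(*ordered, key=_sort_key))


def history_of(entity: str, unified: Iterable[ProvenanceRecord]) -> List[ProvenanceRecord]:
    """Return every record in the unified timeline that touched one entity."""
    return [record for record in unified if record.entity == entity]


if __name__ == "__main__":
    # Synthetic records standing in for what two servers might report.
    mysql_log = [
        ProvenanceRecord(1.0, "mysql-1", "INSERT", "customer:42"),
        ProvenanceRecord(4.0, "mysql-1", "UPDATE", "customer:42"),
    ]
    mongo_log = [
        ProvenanceRecord(2.0, "mongo-1", "INSERT", "order:7"),
        ProvenanceRecord(3.0, "mongo-1", "UPDATE", "customer:42"),
    ]
    for record in history_of("customer:42", merge_provenance(mysql_log, mongo_log)):
        print(record)
```

A per-entity view such as `history_of` is the kind of query a unified repository would need to answer when locating all data related to one individual. A real system would also have to deal with clock skew and with differences in how each database exposes its activity logs, which this sketch ignores.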

The student should have an interest in, and a willingness to learn, basic data provenance, and should have prior knowledge of basic MySQL and/or MongoDB (a NoSQL database). Ideally, the student would be familiar with the C, C# and/or Java programming languages, and have good time-management and strong writing skills. We would use Git and LaTeX to write up the results; prior experience of these tools would be helpful but not required. Even if you do not have all of the skills listed above but consider yourself dedicated, passionate, hardworking and willing to learn new skills, we would like to hear from you.

It is intended that, once the implementation is working, it can be used for practical trials. We anticipate that a conference paper based on the implementation and subsequent trials may be submitted for publication; the student would be a co-author of this paper.

As part of the project, you will work with an experienced and dedicated team of researchers who encourage innovative thinking and students taking ownership. You will be given the necessary support throughout the project period with regular meetings, blackboard sessions, and guidance on how to carry out research effectively. This project is part of a much larger EPSRC-funded project, so you would have an opportunity to work on and contribute to a research project with real-world significance and impact. In a previous year’s project, a student was a co-inventor on the patent application generated from the respective UROP project and also a co-author on the related research paper.