Metagenomics pipeline (available)

Starting Date: 1 June 2019
Duration: 10 Weeks
Time commitment: Full time
Prerequisites: experience with Python programming or workflow software (desirable), experience of querying web databases using RESTful interfaces.

Metagenomics is the genomic sampling of environments (e.g. soil, sea
water or the human microbiome) that contain an unknown range of
different species (usually bacteria and viruses). It is an area of
intense research interest because it is expected to be of enormous use
in fields such as the detection of environments under stress[1] and
clinical applications[2].

The analysis of such data sets requires workflows in which each step
runs an individual piece of bioinformatics software or queries a
bioinformatics data set. Building and running these workflows normally
requires people with a strong background in scripting or workflow
software. A workflow that can be used by biologists is required.

A pipeline will be developed in collaboration with Prof. Kim Watson at
the University of Reading. The pipeline should have a flexible user
interface that allows the user to easily combine complete DNA sequences
(ultimately converted to protein amino acid sequences) from
metagenomics databases and to produce tailored subsets of novel
sequences that have been filtered for structure-function properties. It
would be advantageous to be able to view the relationships between
these sequences at the level of sequence homology (phylogeny). This
could help identify novel but related sequences whose new or
alternative properties could be exploited in the biotech sector.
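
As an illustration of the DNA-to-protein conversion step mentioned
above, the following is a minimal sketch in Python. It assumes the
standard genetic code and a single forward reading frame starting at
the first base; real metagenomic data may need alternative genetic
codes, reverse-strand frames and proper gene prediction. The names
CODON_TABLE and translate are hypothetical and not part of any existing
pipeline.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (assumption: the sequences use the standard code).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in its first forward reading frame.

    Stop codons become '*', codons containing ambiguous bases become
    'X', and any trailing partial codon is ignored.
    """
    dna = dna.upper()
    usable_length = len(dna) - len(dna) % 3
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, usable_length, 3)
    )


protein = translate("ATGGCCTAA")
print(protein)
```

In practice an established library would probably be preferable to a
hand-written table; the sketch only shows where this step sits in the
pipeline.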

Ultimately, we would like to be able to create a set of metagenomics
sequence databases that have been tailored/screened for a specific
protein family, cellular localisation and function. We would then like
to screen these tailored databases for further specific properties, for
example whether the sequences exhibit thermostable or thermolabile
properties, or other properties related to a particular reaction
(hydrolysis, isomerisation, etc.) and/or substrate/product
(carbohydrates, steroids, etc.).
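
The two-stage screening described above could be sketched in Python as
follows. This is only an illustration under stated assumptions: the
ProteinRecord class, its fields, the example annotations and both
function names are hypothetical, and in a real pipeline the annotations
would come from database queries or prediction tools rather than being
typed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinRecord:
    # Hypothetical record layout; all field names are assumptions.
    seq_id: str
    family: str
    localisation: str
    function: str
    properties: frozenset = frozenset()


def build_tailored_set(records, family, localisation, function):
    """Stage 1: keep records matching family, localisation and function."""
    return [
        r for r in records
        if r.family == family
        and r.localisation == localisation
        and r.function == function
    ]


def screen_by_properties(records, required):
    """Stage 2: keep records that carry every required property."""
    required = frozenset(required)
    return [r for r in records if required <= r.properties]


# Made-up example data, for illustration only.
records = [
    ProteinRecord("seq1", "glycoside hydrolase", "extracellular",
                  "hydrolysis", frozenset({"thermostable", "carbohydrate"})),
    ProteinRecord("seq2", "glycoside hydrolase", "extracellular",
                  "hydrolysis", frozenset({"thermolabile", "carbohydrate"})),
    ProteinRecord("seq3", "isomerase", "cytoplasmic",
                  "isomerisation", frozenset({"thermostable", "steroid"})),
]

tailored = build_tailored_set(
    records, "glycoside hydrolase", "extracellular", "hydrolysis")
thermostable = screen_by_properties(tailored, {"thermostable"})
print([r.seq_id for r in thermostable])
```

Keeping the two stages as separate functions means a tailored set can
be built once and then screened repeatedly for different properties.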

The tool can be developed in Python or using workflow software such as
WDL[3].

[1] C. L. Hemme et al., ‘Comparative metagenomics reveals impact of
contaminants on groundwater microbiomes’, Front. Microbiol., vol. 6,
p. 1205, 2015.

[2] C. Y. Chiu and S. A. Miller, ‘Clinical metagenomics’, Nat. Rev.
Genet., p. 1, Mar. 2019.

[3] WDL Index.
https://software.broadinstitute.org/wdl/documentation/.