Minimising gene sets through querying ontologies in published literature (completed)

Starting Date: July 4 2016
Duration: 4 weeks
Time commitment: Full time
Prerequisites: Confident Python programmer

The aim of this project is to build a tool that identifies sets of genes that are highly likely to be associated with a specific biological process.

Molecular biology is now generating colossal amounts of data. In particular, a variety of technologies can scan entire genomes or transcriptomes and estimate the likelihood that each gene or transcript is associated with a particular biological process. However, such results are very noisy and suffer from a “large p, small n” effect: the ratio of the number of degrees of freedom, p, to the number of independent observations, n, is much greater than 1, which leads to very high false positive rates. On the other hand, the very extensive body of biomedical literature could be used to filter spurious genes out of the data before analysis. Tools such as GOCAT mine the literature to relate Gene Ontology terms to specific search terms representing biological processes. This project will leverage such tools to isolate relevant gene lists.
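
As a rough illustration of the two ideas above, the following minimal Python sketch first estimates how many false positives to expect when many genes are tested at once, and then filters a candidate list against a literature-derived gene set. All numbers, gene lists and function names here are hypothetical and chosen only for illustration. The literature-derived set is hard-coded as a stand-in for whatever a tool such as GOCAT would return, and no real GOCAT interface is used or implied.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if no tested gene is truly associated."""
    return n_tests * alpha


def filter_by_literature(candidates, literature_genes):
    """Keep candidates that also appear in the literature-derived set.

    Matching ignores case, and the order of the candidate list is preserved.
    """
    allowed = {gene.upper() for gene in literature_genes}
    return [gene for gene in candidates if gene.upper() in allowed]


# Hypothetical screen: 20,000 genes, each tested at a 5% significance level.
n_genes = 20000
alpha = 0.05
print("Expected false positives:", expected_false_positives(n_genes, alpha))

# Hypothetical candidate list from a noisy genome-wide screen.
candidates = ["TP53", "BRCA1", "GAPDH", "ACTB", "ATM"]

# Hypothetical stand-in for genes that a literature-mining step linked
# to the biological process of interest.
literature_genes = {"TP53", "BRCA1", "ATM", "CHEK2"}

shortlist = filter_by_literature(candidates, literature_genes)
print("Shortlist:", shortlist)
```

Restricting the analysis to the shortlist reduces the number of hypotheses tested, which is one simple way in which prior knowledge from the literature could lower the false positive burden.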