July 4 2016
Confident Python programmer
The aim of this project is to build a tool that identifies sets of genes that should be prioritised as likely to be associated with a specific biological process.
Molecular biology is now generating colossal amounts of data. In particular, a variety of technologies can scan entire genomes or transcriptomes and estimate how likely each gene is to be associated with a particular biological process. However, such results are very noisy and suffer from a “large p, small n” effect: the ratio of the number of possible degrees of freedom, p, to the number of independent observations, n, is much greater than 1, so the results suffer from very high false positive rates. On the other hand, the very extensive body of biomedical literature could be used to filter spurious genes out of the data before analysis. Tools such as GOCAT mine the literature to relate Gene Ontology terms to specific search terms representing biological processes. This project will leverage such tools to isolate relevant gene lists.
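To make the “large p, small n” problem concrete, here is a minimal Python sketch, not the project's actual method. It simulates a scan of 5000 genes on 10 samples in which no gene has any real effect, then counts how many still pass a 0.05 significance threshold (on the order of 5% of the genes, by construction). It then keeps only the hits that fall in a literature-supported set. The function names, the simulated data and the `supported` set are all hypothetical; in the project, that set would come from a literature-mining tool such as GOCAT, whose interface is not assumed here.

```python
import random
from statistics import NormalDist, mean


def null_scan(n_genes, n_samples, alpha=0.05, seed=0):
    """Return indices of genes called significant when there is no real signal.

    Every gene is pure Gaussian noise with unit variance (assumed known),
    so every hit returned here is a false positive.
    """
    rng = random.Random(seed)
    half = n_samples // 2
    # Standard error of the difference between two group means.
    se = (2.0 / half) ** 0.5
    std_normal = NormalDist()
    hits = []
    for gene in range(n_genes):
        case = [rng.gauss(0.0, 1.0) for _ in range(half)]
        control = [rng.gauss(0.0, 1.0) for _ in range(half)]
        z = (mean(case) - mean(control)) / se
        # Two-sided p-value from a z-test.
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            hits.append(gene)
    return hits


def literature_filter(hits, supported_genes):
    """Keep only the hits that also have literature support."""
    return [gene for gene in hits if gene in supported_genes]


hits = null_scan(n_genes=5000, n_samples=10)
# Hypothetical stand-in for a literature-derived gene set (100 genes).
supported = set(range(0, 5000, 50))
kept = literature_filter(hits, supported)
print(f"false positives before filtering: {len(hits)}")
print(f"remaining after literature filter: {len(kept)}")
```

Because the supported set here is arbitrary, the filter only shows the mechanics of shrinking the candidate list. Its value in practice depends on the literature-derived set being enriched for genes that are truly involved in the process.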