"Prediction of regulatory SNPs affecting transcription factor binding using functional genomics data"
Genome-wide association studies (GWAS) have provided insight into human phenotypes by identifying variants statistically associated with disease. However, it is desirable to extend these studies beyond association to an understanding of biological impact. Determining the function of these variants remains a major challenge, especially for single-nucleotide polymorphisms (SNPs) in non-coding regions of the genome, where most of them fall.
To functionally annotate SNPs, computational tools have been developed that use functional genomics data, particularly data on regulatory elements and transcription factor (TF) binding. These tools provide powerful ways to narrow a large list of candidates down to the causative SNPs leading to human disease. One example is RegulomeDB. However, the heuristic scoring system in RegulomeDB tends to be biased against rare variants because it relies on expression quantitative trait loci (eQTL) data. In addition, because RegulomeDB combines functional genomic annotations from multiple cell types, its results can be misleading for studies of disease in a specific cell type.
Here, I will present a prediction model that improves on the current scoring system in RegulomeDB by applying machine learning methods. We generated our training data by calling allele-specific TF binding SNPs from ChIP-seq data. The final model uses functional genomics data together with sequence context to predict function more accurately. We show the feasibility of this method with training results in a lymphoblastoid cell line (GM12878). Furthermore, we are collecting training data from other cell types, and we expect our new scoring method to predict novel regulatory SNPs that disrupt TF binding in a variety of cell types.
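As an illustration of what calling allele-specific TF binding SNPs from ChIP-seq data can involve, the following is a minimal sketch that assumes an exact two-sided binomial test on reference versus alternate read counts at heterozygous sites. The abstract does not specify the calling method, so the test itself, the function names, the thresholds (`alpha`, `min_reads`), and the example SNP identifiers are all hypothetical. A real pipeline would typically also correct for reference mapping bias and multiple testing.

```python
from math import comb


def binomial_two_sided_p(ref_reads: int, alt_reads: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic imbalance.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one under the null of balanced binding (p = 0.5).
    Intended for modest read depths (roughly n < 1000).
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


def call_allele_specific(counts, alpha=0.01, min_reads=10):
    """Return SNP IDs whose read counts show significant allelic imbalance.

    counts: dict mapping SNP ID -> (ref_reads, alt_reads) at a heterozygous site.
    alpha and min_reads are illustrative thresholds, not values from the source.
    """
    called = []
    for snp_id, (ref_reads, alt_reads) in counts.items():
        if ref_reads + alt_reads < min_reads:
            continue  # too few reads to test reliably
        if binomial_two_sided_p(ref_reads, alt_reads) < alpha:
            called.append(snp_id)
    return called


# Hypothetical read counts for three heterozygous SNPs under a ChIP-seq peak.
example_counts = {"snp_a": (20, 2), "snp_b": (11, 9), "snp_c": (3, 0)}
called_snps = call_allele_specific(example_counts)
```

In this sketch, SNPs passing the test would serve as positive training examples, while well-covered SNPs with balanced counts could serve as negatives; that labeling scheme is likewise an assumption.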