Genome-wide association studies (GWAS) have provided insights into human phenotypes by identifying genetic variants statistically associated with diseases and complex traits. However, the functional consequences of these variants remain unknown in many cases, especially for those in non-coding regions of the human genome.
My dissertation focuses on single nucleotide polymorphisms (SNPs), the most common type of genetic variation. I define regulatory SNPs as those SNPs that can alter transcription factor binding affinities within the DNA sequences of regulatory elements. These changes affect downstream gene expression and play a role in disease progression and trait development. Characterizing regulatory variants genome-wide is particularly challenging because the gene regulatory network is dynamic across cell types and environmental conditions. In addition to the DNA sequence context, the gene regulatory network relies on epigenetic factors such as chromatin accessibility, histone modifications, and chromatin looping.
In this dissertation, I applied computational approaches to predict regulatory variants by incorporating sequence information and functional genomics annotations from various high-throughput assays. In Chapter 2, I developed a computational tool, SURF, to prioritize regulatory variants within promoters and enhancers with clinical relevance. SURF achieved the best performance in the CAGI5 “Regulation Saturation” challenge.
In Chapter 3, I extended SURF to TURF, a computational tool that predicts tissue-specific functions of regulatory variants and provides more robust predictions across genome-wide non-coding regions. By leveraging tissue-specific genomic annotations of tissues from the same organ, I also calculated TURF organ-specific scores covering most organs in the ENCODE project. Many GWAS traits showed enrichment of regulatory variants prioritized by TURF scores in their relevant organs, which indicates that these variants are likely to be involved in trait development and can be a valuable resource for future studies.
In Chapter 4, to enable quick annotation of non-coding variants for the scientific community, I designed major updates to an online tool, RegulomeDB. RegulomeDB returns the evidence from diverse functional genomics assays that overlaps the query variant’s position, displayed with interactive charts and a genome browser view. To further provide functional hypotheses for putative regulatory variants, I finally explored a pipeline that assigns their target genes using evidence from eQTL studies and Hi-C experiments.
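
As a rough illustration of the position-overlap lookup described above, the following Python sketch returns the annotation intervals that contain a query variant position. It is not RegulomeDB's implementation: the `Annotation` record, the `overlapping_evidence` function, the 0-based half-open coordinate convention, and the example intervals are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record for one functional genomics interval."""
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # 0-based, exclusive (assumed convention)
    assay: str  # assay label, e.g. "DNase-seq"


def overlapping_evidence(chrom: str, pos: int,
                         annotations: List[Annotation]) -> List[Annotation]:
    """Return the annotations whose interval contains the 0-based position `pos`."""
    return [a for a in annotations
            if a.chrom == chrom and a.start <= pos < a.end]


# Made-up intervals, for illustration only (not real experimental data).
EXAMPLE_ANNOTATIONS = [
    Annotation("chr1", 100, 200, "DNase-seq"),
    Annotation("chr1", 140, 160, "ChIP-seq"),
    Annotation("chr1", 200, 300, "Histone ChIP-seq"),
    Annotation("chr2", 100, 200, "DNase-seq"),
]

if __name__ == "__main__":
    for hit in overlapping_evidence("chr1", 150, EXAMPLE_ANNOTATIONS):
        print(hit.assay, hit.start, hit.end)
```

A linear scan keeps the sketch short. A production service querying many assays would presumably rely on an indexed structure such as an interval tree, but that design choice is an assumption here and is not taken from the dissertation.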