Adviser: Maureen Sartor
Abstract
High-throughput omics experiments produce vast amounts of data that must be placed in context to be useful. This is true of transcriptomics assays, epigenomics assays such as those measuring transcription factor binding and histone modifications (e.g. ChIP-seq) or those measuring DNA methylation (e.g. WGBS and RRBS), as well as metabolomics assays quantifying small molecules (e.g. LC-MS). The field of transcriptomics, having been developed earlier than epigenomics and metabolomics, benefits from a larger and more mature set of interpretive tools. This dissertation develops software tools for interpreting epigenomics and metabolomics data.
First, we developed Broad-Enrich, a gene set enrichment tool designed for histone modification (HM) ChIP-seq data and other broad genomic regions. We employ a logistic regression model with a smoothing spline to account for the relationship between a gene's coverage and its length. We demonstrate that Broad-Enrich achieves the correct Type I error rate across 55 ENCODE HM datasets, that Broad-Enrich returns more biologically relevant results than other approaches, and that the correct choice of gene locus definition can improve the strength of enrichments.
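As a rough illustration of the model described above, the following Python sketch regresses gene set membership on the proportion of each gene locus covered by peaks while adjusting for locus length. It is a simplified sketch under stated assumptions, not the Broad-Enrich implementation: a quadratic term in log10 locus length stands in for the smoothing spline, the fit uses plain gradient ascent, and the function names (design_row, fit_logistic) and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def design_row(coverage, locus_length, center=4.0):
    """Features: intercept, coverage proportion, and a quadratic in
    centered log10 locus length (an assumed stand-in for the spline)."""
    x = math.log10(locus_length) - center
    return [1.0, coverage, x, x * x]

def fit_logistic(rows, labels, steps=5000, rate=0.5):
    """Fit logistic regression coefficients by gradient ascent on the
    mean log-likelihood."""
    n = len(rows)
    beta = [0.0] * len(rows[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, y in zip(rows, labels):
            p = sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j, v in enumerate(row):
                grad[j] += (y - p) * v
        beta = [b + rate * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data: (coverage proportion, locus length in bp, in gene set?)
TOY_GENES = [
    (coverage, length, in_set)
    for length in (1000, 10000, 100000)
    for coverage, in_set in ((0.1, 0), (0.3, 1), (0.5, 0), (0.7, 1))
]

rows = [design_row(c, l) for c, l, _ in TOY_GENES]
labels = [y for _, _, y in TOY_GENES]
beta = fit_logistic(rows, labels)
# beta[1] is the coverage coefficient: a positive value suggests that
# genes in the set tend to have more of their locus covered by peaks.
print("coverage coefficient:", beta[1])
```

In this sketch, enrichment of a gene set would be judged from the sign and significance of the coverage coefficient after the length adjustment; a significance test is omitted here for brevity.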
Second, we developed ConceptMetab, an interactive web-based tool for mapping and exploring the relationships among biologically defined metabolite sets derived from Gene Ontology, KEGG, and Medical Subject Headings, with relationships based on statistical tests for association. We demonstrate the utility of ConceptMetab with multiple scenarios, showing that it can be used to identify known and potentially novel relationships among metabolic pathways, cellular processes, phenotypes, and diseases, and that it provides an intuitive interface for linking compounds to their molecular functions and higher-level biological effects.
Third, we developed annotatr, a tool for annotating genomic regions with genomic annotations. The annotatr package reports all intersections of regions and annotations, giving a better understanding of the genomic context of the regions. A variety of functions are implemented to easily plot covariate data associated with the regions across the annotations, and across annotation intersections, providing insight into how characteristics of the regions differ across the genome.
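The idea of reporting every intersection, rather than assigning each region a single prioritized annotation, can be sketched in a few lines of Python. This is only an illustration of the concept and not the annotatr API (annotatr is an R package); the function name annotate_regions, the tuple layout, and the half-open coordinate convention are assumptions made for this example.

```python
from collections import Counter

def annotate_regions(regions, annotations):
    """Return one (region_name, annotation_type) pair for every overlap.

    Both inputs are lists of (chrom, start, end, label) tuples using
    half-open coordinates [start, end); this layout is an assumption.
    """
    hits = []
    for chrom, start, end, name in regions:
        for a_chrom, a_start, a_end, a_type in annotations:
            # Two half-open intervals overlap when each starts before
            # the other ends.
            if chrom == a_chrom and start < a_end and a_start < end:
                hits.append((name, a_type))
    return hits

# Hypothetical toy inputs.
regions = [
    ("chr1", 100, 200, "region_a"),
    ("chr1", 500, 600, "region_b"),
    ("chr2", 100, 200, "region_c"),
]
annotations = [
    ("chr1", 50, 150, "promoter"),
    ("chr1", 120, 300, "exon"),
    ("chr1", 550, 700, "intron"),
]

hits = annotate_regions(regions, annotations)
# Tally how many region-annotation intersections fall in each type.
counts = Counter(a_type for _, a_type in hits)
print(hits)
print(counts)
```

Here region_a is reported against both the promoter and the exon, which is the kind of multi-annotation context that a single-best-annotation approach would discard. The nested loop is quadratic and is meant only for clarity; interval trees or sorted sweeps would be used for genome-scale data.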
Fourth, we developed mint, a pipeline for analyzing, integrating, and annotating DNA methylation (5mC) and hydroxymethylation (5hmC) data. Current gold-standard methods for measuring 5mC also capture 5hmC signal, which can confound biological conclusions. The mint pipeline aims to separate the signals in silico to discern the effects of each epigenetic mark in the experiment under consideration. The pipeline supports group comparisons for general designs with covariate information, and data are integrated based on overlapping 5mC and 5hmC signal. Genomic annotations and summary visualizations are output at various checkpoints to facilitate interpretation.
In sum, this body of work establishes tools enabling the interpretation of epigenomics and metabolomics data via functional enrichment, genomic annotation, data integration, and visualization.