Hongjiu Zhang
19

Ph.D. Program

Software Engineer

Microsoft, Inc.

zhanghj@umich.edu

Chair

Yuanfang Guan

Dissertation Title

Computational Methods for Resolving Heterogeneity in Biological Data

Research Interest

The complexity in biological data reflects the heterogeneous nature of biological processes. Computational methods need to preserve as much information regarding the biological process of interest as possible. In this work, we explore three specific tasks in resolving biological heterogeneity. The first task is to infer heterogeneous phylogenetic relationships from molecular data. The common likelihood models for phylogenetic inference often make strong assumptions about the evolutionary process across different lineages and different mutation sites. We instead use a convolutional neural network to infer phylogenies, allowing the model to describe more heterogeneous evolutionary processes. The model outperforms commonly used algorithms on diverse simulation datasets. The second task is to infer the clonal composition and phylogeny from bulk DNA sequencing data of tumour samples. Estimating clonal information from bulk data often involves resolving mixture models. Unfortunately, simpler models are often unable to capture complex genetic alteration events in tumour cells, while more sophisticated models incur heavy computational burdens and converge with difficulty. We address this challenge through density-hinted optimization with post hoc adjustment. The model makes conservative predictions but yields better accuracy in assessing co-clustering relationships among the somatic mutations. The third task is to estimate the abundance of splicing transcripts from full-length single-cell RNA sequencing data. Transcript inference from RNA sequencing data needs a plethora of reads for accurate abundance estimation, yet single-cell sequencing yields far fewer reads than bulk sequencing. To recover transcripts from full-length single-cell RNA sequencing data, we pool reads from similar cells to help assign transcripts without disrupting the cluster structures. These methods describe complex biological processes with minimal runtime overhead.
Taking these methods as examples, we briefly discuss the rationale and some general principles behind their design.

Current Placement