Fan Zhang

19

Ph.D. Program
Bioinformatics Scientist
Illumina, Inc.

Chair

Dissertation Title

Leveraging Genetic Variants for Rapid and Robust Upstream Analysis of Massive Sequence Data

Research Interest

The rapidly increasing throughput of sequencing technologies allows us to sequence genomes, transcriptomes, and epigenomes at an unprecedented scale. Robust, efficient, and accurate computational methods for analyzing sequence reads are crucial for successful large-scale studies. In this dissertation, I leverage genetic variants to address specific computational and statistical challenges in quality assessment of sequence reads, ancestry-agnostic estimation of DNA sample contamination, and deconvolution of genetically multiplexed scRNA-seq data.
In Chapter 2, I describe rapid and accurate algorithms that produce comprehensive quality metrics directly from raw sequence reads without requiring full sequence alignment. To produce a comprehensive set of quality metrics, such as GC bias metrics, insert size distribution, contamination rates, and genetic ancestry, existing quality assessment methods usually require full sequence alignment, which is the most time-consuming step. By eliminating this requirement, my methods achieve turnaround times orders of magnitude faster than the widely used 1000 Genomes QC pipeline. The results show that the quality metrics estimated by my methods are highly concordant with those from full-alignment-based methods.
In Chapter 3, I present a robust statistical method that accurately estimates DNA contamination while remaining agnostic to the genetic ancestry of the intended or contaminating samples. Through experiments with in-silico contaminated and real sequence datasets, I demonstrate that existing methods may fail to screen out highly contaminated samples at a stringent contamination threshold because their estimates are biased when the genetic ancestry is misspecified. Conversely, in the presence of contamination, genetic ancestry estimates can be substantially biased if contamination is ignored. My method integrates genetic ancestry and DNA contamination into a mixture model by leveraging individual-specific allele frequencies projected from reference genotypes onto principal component coordinates. I show that my method robustly corrects for the bias in estimates of both contamination rate and genetic ancestry under various contamination scenarios.
In Chapter 4, I enable genetic multiplexing of single-cell RNA-seq (scRNA-seq) experiments without external genotyping by developing a genotyping-free scRNA-seq deconvolution method, freemuxlet. Genetic multiplexing of scRNA-seq (mux-seq) harnesses natural genetic variation to cost-effectively sequence single-cell transcriptomes across multiple samples in a single library preparation while dramatically reducing batch effects. However, the existing statistical method that enables mux-seq, demuxlet, requires external genotypes to be collected a priori, which limits its application when high-quality genotypes are difficult to obtain, as in model organisms or cancer cells. Furthermore, the additional steps needed to obtain, process, and impute the external genotypes become a substantial bottleneck for analyzing the data with rapid turnaround. Freemuxlet defines the distance between a pair of cell barcodes as a Bayes factor (BF) that quantifies the statistical confidence between possible hypotheses about the genetic identity of each cell barcode. An iterative multi-class clustering procedure guided by BF distances simultaneously estimates the consensus genotypes of each individual, detects multiplets, and deconvolutes the sample provenance of singlets. I apply freemuxlet to real datasets and demonstrate high concordance of the estimated droplet identities with those from other methods (cell hashing, demuxlet). I further demonstrate that freemuxlet can enable mux-seq on cancer cell line mixtures, where demuxlet could not, owing to the difficulty of accurate genotyping. My results suggest that freemuxlet can deconvolute mux-seq experiments as accurately as methods that use external information, facilitating a broader range of applications of population-scale single-cell sequencing.

Current Placement

Illumina, Inc.