"VerifyBamID 2.0: More accurate detection and estimation of sample contamination from DNA sequence data"
Detecting and estimating DNA sample contamination has become an important step to ensure high-quality sequence reads and reliable downstream analysis. Existing methods rely on external allele frequency information for accurate estimation of contamination levels. Correctly specifying population allele frequencies in the early stages of sequence analysis is cumbersome, and sometimes impossible, for large-scale sequencing centers that simultaneously process samples from multiple studies across diverse populations. Incorrectly specified allele frequencies may result in substantial bias in estimated contamination levels, leading to incomplete sample quality control and increased genotyping errors, which is particularly problematic in deeply sequenced genomes and exomes.
Through experiments with in-silico contaminated and/or real sequence data, we demonstrate that existing methods fail to screen highly contaminated samples (e.g. 10%) at a stringent contamination threshold (e.g. 3%) because of the bias introduced when genetic ancestry is misspecified. Conversely, in the presence of contamination, genetic ancestry estimates can also be substantially biased if contamination is ignored.
We propose a robust statistical method that accurately estimates DNA contamination agnostic to the genetic ancestry of the intended or contaminating samples. Our method integrates the estimation of genetic ancestry and DNA contamination in a unified likelihood framework by leveraging individual-specific allele frequencies projected from reference genotypes onto principal component coordinates. Evaluating our method on a real dataset, we show that it robustly corrects for the bias in both contamination level estimates and genetic ancestry estimates under different scenarios of contamination.
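
To make the idea of a unified likelihood concrete, the following is a minimal illustrative sketch, not the VerifyBamID implementation. The abstract does not spell out the likelihood, so everything here is an assumption: a two-sample mixture of genotypes under Hardy-Weinberg equilibrium, a linear model mapping principal component coordinates to an individual-specific allele frequency, a fixed per-base error rate, and a plain grid search in place of a numerical optimizer. All function and parameter names are hypothetical.

```python
import math
from itertools import product


def individual_allele_freq(mean_af, loadings, pcs, lo=0.001, hi=0.999):
    """Assumed linear model: AF = mean + sum_k loading_k * pc_k, clipped to [lo, hi]."""
    af = mean_af + sum(l * p for l, p in zip(loadings, pcs))
    return min(max(af, lo), hi)


def site_log_likelihood(n_ref, n_alt, af, alpha, err=0.01):
    """Log-likelihood of read counts at one biallelic site.

    Sums over the unknown genotypes of the intended sample (g1) and the
    contaminating sample (g2), both drawn from Hardy-Weinberg proportions
    at the same allele frequency (an assumption of this sketch).
    Suitable for modest depths only; very deep sites would need log-space sums.
    """
    priors = [(1 - af) ** 2, 2 * af * (1 - af), af ** 2]
    total = 0.0
    for g1, g2 in product(range(3), repeat=2):
        # Expected alternate-allele fraction in the mixed read pool.
        p_true = (1 - alpha) * g1 / 2 + alpha * g2 / 2
        # Symmetric base-error model (assumed).
        p_obs = p_true * (1 - err) + (1 - p_true) * err
        total += priors[g1] * priors[g2] * p_obs ** n_alt * (1 - p_obs) ** n_ref
    return math.log(total)


def joint_log_likelihood(sites, pcs, alpha):
    """Sum site log-likelihoods; each site is (n_ref, n_alt, mean_af, loadings)."""
    return sum(
        site_log_likelihood(
            n_ref, n_alt, individual_allele_freq(mean_af, loadings, pcs), alpha
        )
        for n_ref, n_alt, mean_af, loadings in sites
    )


def grid_fit(sites, pc_grid, alpha_grid):
    """Jointly pick the (pcs, alpha) grid point with the highest likelihood."""
    best = None
    for pcs, alpha in product(pc_grid, alpha_grid):
        ll = joint_log_likelihood(sites, pcs, alpha)
        if best is None or ll > best[2]:
            best = (pcs, alpha, ll)
    return best


# Toy usage: 16 synthetic sites at depth 100 with one PC and zero loadings.
alt_counts = [0, 5, 5, 10, 45, 45, 50, 50, 50, 50, 55, 55, 90, 95, 95, 100]
sites = [(100 - a, a, 0.5, (0.0,)) for a in alt_counts]
alpha_grid = [i / 100 for i in range(51)]
pcs_hat, alpha_hat, log_lik = grid_fit(sites, [(0.0,)], alpha_grid)
```

Because ancestry coordinates and the contamination fraction enter the same likelihood, a search over both at once avoids fixing one to a possibly wrong value while estimating the other, which is the source of bias the abstract describes. A grid search is used here only for readability; a real analysis would use many more sites and a proper optimizer.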