"GWAS Approaches for Complex Diseases in EHR-linked Biobanks"
Large, longitudinal, population-based studies with EHR-linked biobanks and genetic data present many opportunities for genetic discovery. In addition to the study of quantitative traits such as blood lipids, biobanks have now reached adequate size to be well-powered for the study of complex diseases such as type 2 diabetes (T2D) or coronary artery disease (CAD).
A given cohort is typically divided into cases and controls for genome-wide association studies (GWAS) of complex diseases. Such cohorts inherently contain unaffected relatives of disease cases (proxy-cases), who are placed in the control group but carry genetic liability for disease. Previous work, including the kin-cohort method (Wacholder et al.), GWAS of parental age at death (Joshi et al.), and GWAS by proxy (GWAX; Liu et al.), demonstrates the utility of phenotyped, but not genotyped, relatives of genotyped subjects. We extend this concept to model genetic liability in large cohorts for which cases, proxy-cases, and controls are available. Through simulations we demonstrate that removing proxy-cases from the controls increases power to detect true associations, and that modeling proxy-cases alongside cases and controls increases power further. These trends hold in the Norwegian Nord-Trøndelag Health Study (HUNT) when we use a linear mixed model to test the coefficient of relationship to a case as a semi-continuous trait (F=1 for cases, F=0.5 for proxy-cases, and F=0 for controls). We also use sex, age, and family history of disease to calculate the posterior mean liability of disease, which we analyze as a quantitative trait in GWAS. We are evaluating these approaches in the UK Biobank (UKBB) and continuing methods development to appropriately define and model liability for disease in biobank samples.
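The three phenotype definitions compared above (proxy-cases left in the controls, proxy-cases removed, and the semi-continuous coding F=1/0.5/0) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name, the design labels, and the boolean inputs are hypothetical. It also assumes that proxy-case status is derived from a flag for having an affected relative.

```python
from typing import List, Optional


def code_phenotype(is_case: bool, has_affected_relative: bool,
                   design: str) -> Optional[float]:
    """Return the GWAS phenotype for one subject, or None if excluded.

    design (hypothetical labels):
      "standard"        - cases vs. all non-cases (proxy-cases stay in controls)
      "exclude_proxy"   - proxy-cases are dropped from the control group
      "semi_continuous" - F=1 cases, F=0.5 proxy-cases, F=0 controls
    """
    # A proxy-case is an unaffected subject with an affected relative.
    is_proxy = (not is_case) and has_affected_relative

    if design == "standard":
        return 1.0 if is_case else 0.0
    if design == "exclude_proxy":
        if is_proxy:
            return None  # removed from the analysis
        return 1.0 if is_case else 0.0
    if design == "semi_continuous":
        if is_case:
            return 1.0
        return 0.5 if is_proxy else 0.0
    raise ValueError(f"unknown design: {design}")


# Toy cohort: (is_case, has_affected_relative)
cohort = [(True, False), (False, True), (False, False), (True, True)]

phenotypes: List[Optional[float]] = [
    code_phenotype(c, r, "semi_continuous") for c, r in cohort
]
```

Under the "exclude_proxy" design the second subject in the toy cohort would be dropped. Under "semi_continuous" the same subject is retained with an intermediate value, which is what lets a linear mixed model use the information carried by proxy-cases.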
We also report analyses performed as part of a CAD GWAS meta-analysis with the international CARDIoGRAMplusC4D Million Hearts Consortium. In 2015, the CARDIoGRAMplusC4D Consortium tested 9.4 million variants for association with CAD in 60,801 cases and 123,504 controls and found 46 loci at genome-wide significance (p-value < 5e-8). By combining these samples with UKBB, HUNT, and other large studies, we have increased our power to detect true signals at variants with low minor allele frequency (MAF). Within UKBB, we demonstrate that a logistic mixed model performs better than logistic regression for testing these rarer variants. Researchers should take this into account when selecting statistical approaches for GWAS in biobanks with even minimal population structure.