A Novel One-Sample Mendelian Randomization Approach for Count-Type Outcomes That Is Robust to Correlated and Uncorrelated Pleiotropic Effects
We propose two novel one-sample Mendelian randomization (MR) approaches to causal inference from count-type health outcomes, tailored to both equidispersion and overdispersion conditions. Selecting valid single-nucleotide polymorphisms (SNPs) as instrumental variables (IVs) poses a key challenge for MR approaches, as it requires meeting the necessary IV assumptions. To make the proposed approaches robust to violations of these assumptions, we incorporate a process for removing invalid SNPs. In simulations, our proposed approaches demonstrate robustness to the violations, delivering valid estimates as well as interpretable type-I errors and statistical power. This increases the practical applicability of the models. We applied the proposed approaches to evaluate the causal effect of fetal hemoglobin (HbF) on vaso-occlusive crisis and acute chest syndrome (ACS) events in patients with sickle cell disease (SCD) and revealed a causal relation between HbF and ACS events in these patients. We also developed a user-friendly Shiny web application to facilitate researchers' exploration of causal relations.
Fine-Mapping the Results From Genome-Wide Association Studies of Primary Biliary Cholangitis Using Susie and h2-D2
The main goal of fine-mapping is the identification of relevant genetic variants that have a causal effect on some trait of interest, such as the presence of a disease. From a statistical point of view, fine-mapping can be seen as a variable selection problem. Fine-mapping methods are often challenging to apply because of the presence of linkage disequilibrium (LD), that is, regions of the genome where the variants interrogated have high correlation. Several methods have been proposed to address this issue. Here we explore the 'Sum of Single Effects' (SuSiE) method, applied to real data (summary statistics) from a genome-wide meta-analysis of the autoimmune liver disease primary biliary cholangitis (PBC). Fine-mapping in this data set was previously performed using the FINEMAP program; we compare these previous results with those obtained from SuSiE, which provides an arguably more convenient and principled way of generating 'credible sets', that is, sets of predictors that are correlated with the response variable. This allows us to appropriately acknowledge the uncertainty when selecting the causal effects for the trait. We focus on the results from SuSiE-RSS, which fits the SuSiE model to summary statistics, such as z-scores, along with a correlation matrix. We also compare the SuSiE results to those obtained using a more recently developed method, h2-D2, which uses the same inputs. Overall, we find the results from SuSiE-RSS and, to a lesser extent, h2-D2, to be quite concordant with those previously obtained using FINEMAP. The resulting genes and biological pathways implicated are therefore also similar to those previously obtained, providing valuable confirmation of these previously reported results. Detailed examination of the credible sets identified suggests that, although for the majority of the loci (33 out of 56) the results from SuSiE-RSS seem most plausible, there are some loci (5 out of 56 loci) where the results from h2-D2 seem more compelling.
Computer simulations suggest that, overall, SuSiE-RSS generally has slightly higher power, better precision, and better ability to identify the true number of causal variants in a region than h2-D2, although there are some scenarios where the power of h2-D2 is higher. Thus, in real data analysis, the use of complementary approaches such as both SuSiE and h2-D2 is potentially warranted.
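The idea of a credible set can be illustrated with a minimal sketch. It assumes a single causal signal in a region and a vector of per-variant probabilities that sum to one; the function name, the example probabilities, and the 0.85 coverage level are all hypothetical, and this is not the SuSiE or h2-D2 algorithm itself, only the final "smallest set reaching a target coverage" step.

```python
def credible_set(probs, coverage=0.95):
    """Return (indices, covered_probability) for the smallest set of variants
    whose summed probabilities reach `coverage`.

    Assumes `probs` are per-variant probabilities for ONE causal signal.
    """
    # Rank variants from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += probs[i]
        # Small tolerance guards against floating-point round-off.
        if total >= coverage - 1e-12:
            break
    return chosen, total


# Hypothetical probabilities for five variants in one region (sum to 1).
example_probs = [0.125, 0.5, 0.25, 0.0625, 0.0625]
indices, covered = credible_set(example_probs, coverage=0.85)
print(indices, covered)
```

A smaller returned set indicates that the data localize the signal well; a large set reflects the uncertainty induced by high LD among the candidate variants.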
Genetic Associations of Persistent Opioid Use After Surgery Point to OPRM1 but Not Other Opioid-Related Loci as the Main Driver of Opioid Use Disorder
Persistent opioid use after surgery is a common morbidity outcome associated with subsequent opioid use disorder, overdose, and death. While phenotypic associations have been described, genetic associations remain unidentified. Here, we conducted the largest genetic study of persistent opioid use after surgery, comprising ~40,000 non-Hispanic, European-ancestry Michigan Genomics Initiative participants (3198 cases and 36,321 surgically exposed controls). Our study primarily focused on the reproducibility and reliability of 72 genetic studies of opioid use disorder phenotypes. Nominal associations (p < 0.05) occurred at 12 of 80 unique (r < 0.8) signals from these studies. Six occurred in OPRM1 (most significant: rs79704991-T, OR = 1.17, p = 8.7 × 10), with two surviving multiple testing correction. Other associations were rs640561-LRRIQ3 (p = 0.015), rs4680-COMT (p = 0.016), rs9478495 (p = 0.017, intergenic), rs10886472-GRK5 (p = 0.028), rs9291211-SLC30A9/BEND4 (p = 0.043), and rs112068658-KCNN1 (p = 0.048). Two highly referenced genes, OPRD1 and DRD2/ANKK1, had no signals in MGI. Associations at previously identified OPRM1 variants suggest common biology between persistent opioid use and opioid use disorder, further demonstrating connections between opioid dependence and addiction phenotypes. Lack of significant associations at other variants challenges previous studies' reliability.
Integrative Multi-Omics Approach for Improving Causal Gene Identification
Transcriptome-wide association studies (TWAS) have been widely used to identify thousands of likely causal genes for diseases and complex traits using predicted expression models. However, most existing TWAS methods rely on gene expression alone and overlook other regulatory mechanisms of gene expression, including DNA methylation and splicing, that contribute to the genetic basis of these complex traits and diseases. Here we introduce a multi-omics method that integrates gene expression, DNA methylation, and splicing data to improve the identification of genes associated with traits of interest. Through simulations and by analyzing genome-wide association study (GWAS) summary statistics for 24 complex traits, we show that our integrated method, which leverages these complementary omics biomarkers, achieves higher statistical power and improves the accuracy of likely causal gene identification in blood tissues over individual omics methods. Finally, we apply our integrated model to a lung cancer GWAS data set, demonstrating the integrated model's improved identification of prioritized genes for lung cancer risk.
GWASBrewer: An R Package for Simulating Realistic GWAS Summary Statistics
Many statistical genetics analysis methods make use of GWAS summary statistics. Best statistical practice requires evaluating these methods in realistic simulation experiments. However, simulating summary statistics by first simulating individual genotype and phenotype data is extremely computationally demanding. This high cost may force researchers to conduct overly simplistic simulations that fail to accurately measure method performance. Alternatively, summary statistics can be simulated directly from their theoretical distribution. Although this is a common need among statistical genetics researchers, no software packages exist for comprehensive GWAS summary statistic simulation. We present GWASBrewer, an open source R package for direct simulation of GWAS summary statistics. We show that statistics simulated by GWASBrewer have the same distribution as statistics generated from individual level data, and can be produced at a fraction of the computational expense. Additionally, GWASBrewer can simulate standard error estimates, something that is typically not done when sampling summary statistics directly. GWASBrewer is highly flexible, allowing the user to simulate data for multiple traits connected by causal effects and with complex distributions of effect sizes. We demonstrate example uses of GWASBrewer for evaluating Mendelian randomization, polygenic risk score, and heritability estimation methods.
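Direct simulation from the theoretical distribution can be sketched as follows. This is a deliberately simplified illustration, not the GWASBrewer API: it assumes independent variants (no LD), standardized genotypes and phenotype, small effects, and a single trait, so that each estimate is approximately normal around the true effect with standard error 1/sqrt(N). The function name and all parameter values are hypothetical.

```python
import math
import random


def simulate_summary_stats(true_effects, n_samples, seed=0):
    """Draw GWAS effect estimates directly, without individual-level data.

    Assumption: beta_hat_j ~ Normal(beta_j, 1 / n_samples) for standardized,
    independent variants with small effects.
    """
    rng = random.Random(seed)
    se = 1.0 / math.sqrt(n_samples)  # approximate standard error
    stats = []
    for beta in true_effects:
        beta_hat = rng.gauss(beta, se)
        stats.append({"beta_hat": beta_hat, "se": se, "z": beta_hat / se})
    return stats


# 1000 null variants plus 5 hypothetical causal variants.
effects = [0.0] * 1000 + [0.05] * 5
summary = simulate_summary_stats(effects, n_samples=10_000, seed=42)
print(summary[-1])
```

The cost is linear in the number of variants and independent of the sample size, which is the source of the computational saving relative to simulating genotypes and phenotypes for every individual.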
Hierarchical joint analysis of marginal summary statistics-Part II: High-dimensional instrumental analysis of omics data
Instrumental variable (IV) analysis has been widely applied in epidemiology to infer causal relationships using observational data. Genetic variants can also be viewed as valid IVs in Mendelian randomization and transcriptome-wide association studies. However, most multivariate IV approaches cannot scale to high-throughput experimental data. Here, we leverage the flexibility of our previous work, a hierarchical model that jointly analyzes marginal summary statistics (hJAM), to a scalable framework (SHA-JAM) that can be applied to a large number of intermediates and a large number of correlated genetic variants, situations often encountered in modern experiments leveraging omic technologies. SHA-JAM aims to estimate the conditional effect for high-dimensional risk factors on an outcome by incorporating estimates from association analyses of single-nucleotide polymorphism (SNP)-intermediate or SNP-gene expression as prior information in a hierarchical model. Results from extensive simulation studies demonstrate that SHA-JAM yields a higher area under the receiver operating characteristics curve (AUC), a lower mean-squared error of the estimates, and a much faster computation speed, compared to an existing approach for similar analyses. In two applied examples for prostate cancer, we investigated metabolite and transcriptome associations, respectively, using summary statistics from a GWAS for prostate cancer with more than 140,000 men and high dimensional publicly available summary data for metabolites and transcriptomes.
Statistics to prioritize rare variants in family-based sequencing studies with disease subtypes
Family-based sequencing studies are increasingly used to find rare genetic variants of high risk for disease traits with familial clustering. In some studies, families with multiple disease subtypes are collected and the exomes of affected relatives are sequenced for shared rare variants (RVs). Since different families can harbor different causal variants and each family harbors many RVs, tests to detect causal variants can have low power in this study design. Our goal is rather to prioritize shared variants for further investigation by, for example, pathway analyses or functional studies. The transmission-disequilibrium test prioritizes variants based on departures from Mendelian transmission in parent-child trios. Extending this idea to families, we propose methods to prioritize RVs shared in affected relatives with two disease subtypes, with one subtype more heritable than the other. Global approaches condition on a variant being observed in the study and assume a known probability of carrying a causal variant. In contrast, local approaches condition on a variant being observed in specific families to eliminate the carrier probability. Our simulation results indicate that global approaches are robust to misspecification of the carrier probability and prioritize more effectively than local approaches even when the carrier probability is misspecified.
Estimating Causal Effects on a Disease Progression Trait Using Bivariate Mendelian Randomisation
Genome-wide association studies (GWAS) have provided large numbers of genetic markers that can be used as instrumental variables in a Mendelian Randomisation (MR) analysis to assess the causal effect of a risk factor on an outcome. An extension of MR analysis, multivariable MR, has been proposed to handle multiple risk factors. However, adjusting or stratifying the outcome on a variable that is associated with it may induce collider bias. For an outcome that represents progression of a disease, conditioning by selecting only the cases may cause a biased MR estimation of the causal effect of the risk factor of interest on the progression outcome. Recently, we developed instrument effect regression and corrected weighted least squares (CWLS) to adjust for collider bias in observational associations. In this paper, we highlight the importance of adjusting for collider bias in MR with a risk factor of interest and disease progression as the outcome. A generalised version of the instrument effect regression and CWLS adjustment is proposed based on a multivariable MR model. We highlight the assumptions required for this approach and demonstrate its utility for bias reduction. We give an illustrative application to the effect of smoking initiation and smoking cessation on Crohn's disease prognosis, finding no evidence to support a causal effect.
Proteome-wide association study using cis and trans variants and applied to blood cell and lipid-related traits in the Women's Health Initiative study
In most Proteome-Wide Association Studies (PWAS), variants near the protein-coding gene (±1 Mb), also known as cis single nucleotide polymorphisms (SNPs), are used to predict protein levels, which are then tested for association with phenotypes. However, proteins can be regulated through variants outside of the cis region. An intermediate GWAS step to identify protein quantitative trait loci (pQTL) allows for the inclusion of trans SNPs outside the cis region in protein-level prediction models. Here, we assess the prediction of 540 proteins in 1002 individuals from the Women's Health Initiative (WHI), split equally into a GWAS set, an elastic net training set, and a testing set. We compared the testing r between measured and predicted protein levels using this proposed approach, to the testing r using only cis SNPs. The two methods usually resulted in similar testing r, but some proteins showed a significant increase in testing r with our method. For example, for cartilage acidic protein 1, the testing r increased from 0.101 to 0.351. We also demonstrate reproducible findings for predicted protein association with lipid and blood cell traits in WHI participants without proteomics data and in UK Biobank utilizing our PWAS weights.
Comparing Ancestry Standardization Approaches for a Transancestry Colorectal Cancer Polygenic Risk Score
Colorectal cancer (CRC) is a complex disease with monogenic, polygenic and environmental risk factors. Polygenic risk scores (PRSs) aim to identify high polygenic risk individuals. Due to differences in genetic background, PRS distributions vary by ancestry, necessitating standardization. We compared four post-hoc methods using the All of Us Research Program Whole Genome Sequence data for a transancestry CRC PRS. We contrasted results from linear models that differed in (A) the training data, either the entire data set or an ancestrally diverse subset, and (B) the covariates, either principal components of ancestry or admixture. Standardization with the training subset also adjusted the variance. All methods performed similarly within ancestry, OR (95% C.I.) per s.d. change in PRS: African 1.5 (1.02, 2.08), Admixed American 2.2 (1.27, 3.85), European 1.6 (1.43, 1.89), and Middle Eastern 1.1 (0.71, 1.63). Using admixture and an ancestrally diverse training set provided distributions closest to standard Normal. Training a model on ancestrally diverse participants, adjusting both the mean and variance using admixture as covariates, created standard Normal z-scores, which can be used to identify patients at high polygenic risk. These scores can be incorporated into comprehensive risk calculation including other known risk factors, allowing for more precise risk estimates.
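Adjusting both the mean and the variance of a PRS for ancestry can be sketched in two regression steps. The sketch below is an assumption-laden illustration rather than the study's pipeline: it uses a single hypothetical ancestry covariate (for example, one admixture proportion), simple least squares for the mean, and a second least-squares fit of the squared residuals for the variance. All names and simulated values are hypothetical.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def standardize_prs(prs, ancestry, floor=1e-8):
    """Return z-scores with ancestry-dependent mean and variance removed."""
    # Step 1: model the PRS mean as a linear function of ancestry.
    a_mean, b_mean = fit_line(ancestry, prs)
    resid = [p - (a_mean + b_mean * c) for p, c in zip(prs, ancestry)]
    # Step 2: model the residual variance as a linear function of ancestry.
    a_var, b_var = fit_line(ancestry, [r * r for r in resid])
    z = []
    for r, c in zip(resid, ancestry):
        var = max(a_var + b_var * c, floor)  # guard against non-positive fits
        z.append(r / math.sqrt(var))
    return z


def simulate(n=5000, seed=0):
    """Hypothetical data: PRS mean and variance both drift with ancestry."""
    rng = random.Random(seed)
    ancestry = [rng.random() for _ in range(n)]
    prs = [rng.gauss(2.0 * c, math.sqrt(1.0 + 3.0 * c)) for c in ancestry]
    return prs, ancestry


prs, ancestry = simulate()
z_scores = standardize_prs(prs, ancestry)
print(sum(z_scores) / len(z_scores))
```

Adjusting only the mean would leave individuals from higher-variance ancestry groups over-represented in the tails, which is why the variance step matters when a fixed z-score threshold is used to flag high polygenic risk.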
Exploring and Accounting for Genetically Driven Effect Heterogeneity in Mendelian Randomization
Mendelian randomization (MR) is a framework to estimate the causal effect of a modifiable health exposure, drug target or pharmaceutical intervention on a downstream outcome by using genetic variants as instrumental variables. A crucial assumption allowing estimation of the average causal effect in MR, termed homogeneity, is that the causal effect does not vary across levels of any instrument used in the analysis. In contrast, the science of pharmacogenetics seeks to actively uncover and exploit genetically driven effect heterogeneity for the purposes of precision medicine. In this study, we consider a recently proposed method for performing pharmacogenetic analysis on observational data, the Triangulation WIthin a STudy (TWIST) framework, and explore how it can be combined with traditional MR approaches to properly characterise average causal effects and genetically driven effect heterogeneity. We propose two new methods which not only estimate the genetically driven effect heterogeneity but also enable the estimation of a causal effect in the genetic group with and without the risk allele separately. Both methods utilise homogeneity-respecting and homogeneity-violating genetic variants and rely on a different set of assumptions. Using data from the ALSPAC study, we apply our new methods to estimate the causal effect of smoking before and during pregnancy on offspring birth weight in mothers whose genetics mean they find it (relatively) easier or harder to quit smoking.
Predicting Lung Cancer in Korean Never-Smokers With Polygenic Risk Scores
In the last few decades, genome-wide association studies (GWAS) with more than 10,000 subjects have identified several loci associated with lung cancer, and these loci have been used to develop novel risk prediction tools for cancer. The present study aimed to establish a lung cancer prediction model for Korean never-smokers using polygenic risk scores (PRSs); PRSs were calculated using a pruning-thresholding-based approach based on 11 genome-wide significant single nucleotide polymorphisms (SNPs). Overall, the odds ratios tended to increase with increasing PRS, with the odds ratio of the top 5% PRSs being 1.71 (95% confidence interval: 1.31-2.23) using the 40%-60% percentile group as the reference, and the area under the curve (AUC) of the prediction model being 0.76 (95% confidence interval: 0.747-0.774). The receiver operating characteristic (ROC) curves of the prediction model with and without PRSs as covariates were compared using DeLong's test, and a significant difference was observed. Our results suggest that PRSs can be valuable tools for predicting the risk of lung cancer.
Ethical, Legal, and Social Implications of Gene-Environment Interaction Research
Many complex disorders are impacted by the interplay of genetic and environmental factors. In gene-environment interactions (GxE), an individual's genetic and epigenetic makeup impacts the response to environmental exposures. Understanding GxE can impact health at the individual, community, and population levels. The rapid expansion of GxE research in biomedical studies for complex diseases raises many unique ethical, legal, and social implications (ELSIs) that have not been extensively explored and addressed. This review article builds on discussions originating from a workshop held by the National Institute of Environmental Health Sciences (NIEHS) and the National Human Genome Research Institute (NHGRI) in January 2022, entitled: "Ethical, Legal, and Social Implications of Gene-Environment Interaction Research." We expand upon multiple key themes to inform broad recommendations and general guidance for addressing some of the most unique and challenging ELSI in GxE research. Key takeaways include strategies and approaches for establishing sustainable community partnerships, incorporating social determinants of health and environmental justice considerations into GxE research, effectively communicating and translating GxE findings, and addressing privacy and discrimination concerns in all GxE research going forward. Additional guidelines, resources, approaches, training, and capacity building are required to further support innovative GxE research and multidisciplinary GxE research teams.
PSAP-Genomic-Regions: A Method Leveraging Population Data to Prioritize Coding and Non-Coding Variants in Whole Genome Sequencing for Rare Disease Diagnosis
The introduction of Next-Generation Sequencing technologies in the clinics has improved rare disease diagnosis. Nonetheless, for very heterogeneous or very rare diseases, more than half of cases still lack molecular diagnosis. Novel strategies are needed to prioritize variants within a single individual. The Population Sampling Probability (PSAP) method was developed to meet this aim but only for coding variants in exome data. Here, we propose an extension of the PSAP method to the non-coding genome called PSAP-genomic-regions. In this extension, instead of considering genes as testing units (PSAP-genes strategy), we use genomic regions defined over the whole genome that pinpoint potential functional constraints. We conceived an evaluation protocol for our method using artificially generated disease exomes and genomes, by inserting coding and non-coding pathogenic ClinVar variants in large data sets of exomes and genomes from the general population. PSAP-genomic-regions significantly improves the ranking of these variants compared to using a pathogenicity score alone. Using PSAP-genomic-regions, more than 50% of non-coding ClinVar variants were among the top 10 variants of the genome. On real sequencing data from six patients with Cerebral Small Vessel Disease and nine patients with male infertility, all causal variants were ranked in the top 100 variants with PSAP-genomic-regions. By revisiting the testing units used in the PSAP method to include non-coding variants, we have developed PSAP-genomic-regions, an efficient whole-genome prioritization tool which offers promising results for the diagnosis of unresolved rare diseases.
Enhancing Gene Expression Predictions Using Deep Learning and Functional Annotations
Transcriptome-wide association studies (TWAS) aim to uncover genotype-phenotype relationships through a two-stage procedure: predicting gene expression from genotypes using an expression quantitative trait locus (eQTL) data set, then testing the predicted expression for trait associations. Accurate gene expression prediction in stage 1 is crucial, as it directly impacts the power to identify associations in stage 2. Currently, the first stage of such studies is primarily conducted using linear models like elastic net regression, which fail to capture the nonlinear relationships inherent in biological systems. Deep learning methods have the potential to model such nonlinear effects, but have yet to demonstrably outperform linear methods at this task. To address this gap, we propose a new deep learning architecture to predict gene expression from genotypic variation across individuals. Our method utilizes a learnable input scaling layer in conjunction with a convolutional encoder to capture nonlinear effects and higher-order interactions without compromising on interpretability. We further augment this approach to allow for parameter sharing across multiple networks, enabling us to utilize prior information for individual variants in the form of functional annotations. Evaluations on real-world genomic data show that our method consistently outperforms elastic net regression across a large set of heritable genes. Furthermore, our model statistically significantly improved predictive performance by leveraging functional annotations, whereas elastic net regression failed to show equivalent gains when using the same information, suggesting that our method can capture nonlinear functional information beyond the capability of linear models.
Powerful Rare-Variant Association Analysis of Secondary Phenotypes
Most genome-wide association studies are based on case-control designs, which provide abundant resources for secondary phenotype analyses. However, such studies suffer from biased sampling of primary phenotypes, and the traditional statistical methods can lead to seriously distorted analysis results when they are applied to secondary phenotypes without accounting for the biased sampling mechanism. To our knowledge, there are no statistical methods specifically tailored for rare variant association analysis with secondary phenotypes. In this article, we propose two novel joint test statistics for identifying secondary-phenotype-associated rare variants based on prospective likelihood and retrospective likelihood, respectively. We also exploit the assumption of gene-environment independence in retrospective likelihood to improve the statistical power and adopt a two-step strategy to balance statistical power and robustness. Simulations and a real-data application are conducted to demonstrate the superior performance of our proposed methods.
A Mixed-Effect Kernel Machine Regression Model for Integrative Analysis of Alpha Diversity in Microbiome Studies
Increasing evidence suggests that human microbiota plays a crucial role in many diseases. Alpha diversity, a commonly used summary statistic that captures the richness and/or evenness of the microbial community, has been associated with many clinical conditions. However, individual studies that assess the association between alpha diversity and clinical conditions often provide inconsistent results due to insufficient sample size, heterogeneous study populations and technical variability. In practice, meta-analysis tools have been applied to integrate data from multiple studies. However, these methods do not consider the heterogeneity caused by sequencing protocols, and the contribution of each study to the final model depends mainly on its sample size (or variance estimate). To combine studies with distinct sequencing protocols, a robust statistical framework for integrative analysis of microbiome datasets is needed. Here, we propose a mixed-effect kernel machine regression model to assess the association of alpha diversity with a phenotype of interest. Our approach readily incorporates the study-specific characteristics (including sequencing protocols) to allow for flexible modeling of microbiome effect via a kernel similarity matrix. Within the proposed framework, we provide three hypothesis testing approaches to answer different questions that are of interest to researchers. We evaluate the model performance through extensive simulations based on two distinct data generation mechanisms. We also apply our framework to data from HIV reanalysis consortium to investigate gut dysbiosis in HIV infection.
Exploring pleiotropy in Mendelian randomisation analyses: What are genetic variants associated with 'cigarette smoking initiation' really capturing?
Genetic variants used as instruments for exposures in Mendelian randomisation (MR) analyses may have horizontal pleiotropic effects (i.e., influence outcomes via pathways other than through the exposure), which can undermine the validity of results. We examined the extent of this using smoking behaviours as an example. We first ran a phenome-wide association study in UK Biobank, using a smoking initiation genetic instrument. From the most strongly associated phenotypes, we selected those we considered could either plausibly or not plausibly be caused by smoking. We examined associations between genetic instruments for smoking initiation, smoking heaviness and lifetime smoking and these phenotypes in UK Biobank and the Avon Longitudinal Study of Parents and Children (ALSPAC). We conducted negative control analyses among never smokers, including children. We found evidence that smoking-related genetic instruments were associated with phenotypes not plausibly caused by smoking in UK Biobank and (to a lesser extent) ALSPAC. We observed associations with phenotypes among never smokers. Our results demonstrate that smoking-related genetic risk scores are associated with unexpected phenotypes that are less plausibly downstream of smoking. This may reflect horizontal pleiotropy in these genetic risk scores, and we would encourage researchers to exercise caution when using these and genetic risk scores for other complex behavioural exposures. We outline approaches that could be taken to consider this and overcome issues caused by potential horizontal pleiotropy. For example, in genetically informed causal inference analyses (e.g., MR), it is important to consider negative control outcomes and triangulation approaches to avoid arriving at incorrect conclusions.
Using clustering of genetic variants in Mendelian randomization to interrogate the causal pathways underlying multimorbidity from a common risk factor
Mendelian randomization (MR) is an epidemiological approach that utilizes genetic variants as instrumental variables to estimate the causal effect of an exposure on a health outcome. This paper investigates an MR scenario in which genetic variants aggregate into clusters that identify heterogeneous causal effects. Such variant clusters are likely to emerge if they affect the exposure and outcome via distinct biological pathways. In the multi-outcome MR framework, where a shared exposure causally impacts several disease outcomes simultaneously, these variant clusters can provide insights into the common disease-causing mechanisms underpinning the co-occurrence of multiple long-term conditions, a phenomenon known as multimorbidity. To identify such variant clusters, we adapt the general method of agglomerative hierarchical clustering to the multi-sample summary-data MR setup, enabling cluster detection based on variant-specific ratio estimates. In particular, we tailor the method for multi-outcome MR to aid in elucidating the causal pathways through which a common risk factor contributes to multiple morbidities. We show in simulations that our "MR-AHC" method detects clusters with high accuracy, outperforming the existing methods. We apply the method to investigate the causal effects of high body fat percentage on type 2 diabetes and osteoarthritis, uncovering interconnected cellular processes underlying this multimorbid disease pair.
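Clustering on variant-specific ratio estimates can be sketched for a single outcome as follows. This is a toy illustration under stated assumptions, not the MR-AHC procedure: it ignores the standard errors of the estimates, merges clusters by the distance between their mean ratio estimates, and stops at a fixed gap. The stopping threshold `max_gap`, the function names, and the example association values are all hypothetical.

```python
def ratio_estimates(beta_exposure, beta_outcome):
    """Variant-specific causal estimates: outcome effect / exposure effect."""
    return [by / bx for bx, by in zip(beta_exposure, beta_outcome)]


def agglomerate(values, max_gap):
    """Bottom-up clustering of scalar estimates.

    Starts with one cluster per variant and repeatedly merges the two
    clusters whose means are closest, until the closest pair is more than
    `max_gap` apart. Returns clusters as sorted lists of variant indices.
    """
    clusters = [[i] for i in range(len(values))]

    def centroid(cluster):
        return sum(values[i] for i in cluster) / len(cluster)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = abs(centroid(clusters[a]) - centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_gap:
            break  # remaining clusters are well separated
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical SNP-exposure and SNP-outcome associations for five variants.
beta_x = [0.10, 0.20, 0.10, 0.20, 0.10]
beta_y = [0.021, 0.038, 0.079, 0.162, 0.020]
ratios = ratio_estimates(beta_x, beta_y)
print(agglomerate(ratios, max_gap=0.1))
```

In this toy input, three variants imply a causal effect near 0.2 and two imply an effect near 0.8, so the variants separate into two groups that would correspond to distinct pathways in the sense described above.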
Polygenic hazard score models for the prediction of Alzheimer's free survival using the lasso for Cox's proportional hazards model
The prediction of the susceptibility of an individual to a certain disease is an important and timely research area. An established technique is to estimate the risk of an individual with the help of an integrated risk model, that is, a polygenic risk score with added epidemiological covariates. However, integrated risk models do not capture any time dependence, and may provide a point estimate of the relative risk with respect to a reference population. The aim of this work is twofold. First, we explore and advocate the idea of predicting the time-dependent hazard and survival (defined as disease-free time) of an individual for the onset of a disease. This provides a practitioner with a much more differentiated view of absolute survival as a function of time. Second, to compute the time-dependent risk of an individual, we use published methodology to fit a Cox's proportional hazards model to data from a genetic SNP study of time to Alzheimer's disease (AD) onset, using the lasso to incorporate further epidemiological variables such as sex, APOE (apolipoprotein E, a genetic risk factor for AD) status, 10 leading principal components, and selected genomic loci. We apply the lasso for Cox's proportional hazards to a data set of 6792 AD patients (composed of 4102 cases and 2690 controls) and 87 covariates. We demonstrate that fitting a lasso model for Cox's proportional hazards allows one to obtain more accurate survival curves than with state-of-the-art (likelihood-based) methods. Moreover, the methodology allows one to obtain personalized survival curves for a patient, thus giving a much more differentiated view of the expected progression of a disease than the view offered by integrated risk models. The runtime to compute personalized survival curves is under a minute for the entire data set of AD patients, thus enabling the method to handle datasets with 60,000-100,000 subjects in less than 1 h.
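Once a Cox proportional hazards model has been fitted, a personalized survival curve follows from the standard relation S(t | x) = S0(t)^exp(x'b), where S0 is the baseline survival function and b the fitted coefficients. The sketch below shows only this final step; the baseline values, coefficients, and covariates are hypothetical, and the lasso fitting itself (which sets many coefficients to exactly zero) is not reproduced here.

```python
import math


def personalized_survival(baseline_survival, coefficients, covariates):
    """Survival curve for one individual under a Cox proportional hazards model.

    baseline_survival: S0(t) evaluated on a grid of time points.
    coefficients:      fitted log hazard ratios (zeros for unselected covariates).
    covariates:        the individual's covariate values, in the same order.
    """
    linear_predictor = sum(b * x for b, x in zip(coefficients, covariates))
    hazard_ratio = math.exp(linear_predictor)
    # Proportional hazards: S(t | x) = S0(t) ** exp(x'b).
    return [s0 ** hazard_ratio for s0 in baseline_survival]


# Hypothetical baseline curve and a carrier of one risk factor that doubles the hazard.
baseline = [1.0, 0.9, 0.8]
curve = personalized_survival(baseline, [math.log(2.0), 0.0], [1, 5])
print(curve)
```

Because the second coefficient is zero, the second covariate has no influence on the curve, which mirrors how lasso-selected models ignore covariates that were shrunk out of the fit.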
Use of genetic correlations to examine selection bias
Observational studies are rarely representative of their target population because there are known and unknown factors that affect an individual's choice to participate (the selection mechanism). Selection can cause bias in a given analysis if the outcome is related to selection (conditional on the other variables in the model). Detecting and adjusting for selection bias in practice typically requires access to data on nonselected individuals. Here, we propose methods to detect selection bias in genetic studies by comparing correlations among genetic variants in the selected sample to those expected under no selection. We examine the use of four hypothesis tests to identify induced associations between genetic variants in the selected sample. We evaluate these approaches in Monte Carlo simulations. Finally, we use these approaches in an applied example using data from the UK Biobank (UKBB). The proposed tests suggested an association between alcohol consumption and selection into UKBB. Hence, UKBB analyses with alcohol consumption as the exposure or outcome may be biased by this selection.
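The mechanism behind these tests can be illustrated by simulation: two variants that are independent in the population become correlated once the sample is restricted to selected individuals, if both variants influence selection. The sketch below is a toy example with hypothetical parameters (allele frequency, selection rule, sample size), not one of the four hypothesis tests evaluated in the study.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def genotype(rng, maf):
    """Allele count (0, 1 or 2) under Hardy-Weinberg equilibrium."""
    return int(rng.random() < maf) + int(rng.random() < maf)


def simulate_selection(n=20000, maf=0.3, seed=0):
    """Return (correlation in full population, correlation in selected sample)."""
    rng = random.Random(seed)
    g1 = [genotype(rng, maf) for _ in range(n)]
    g2 = [genotype(rng, maf) for _ in range(n)]  # independent of g1
    # Hypothetical selection rule: both variants raise the chance of taking part.
    selected = [a + b + rng.gauss(0.0, 0.5) > 1.0 for a, b in zip(g1, g2)]
    sel_g1 = [a for a, s in zip(g1, selected) if s]
    sel_g2 = [b for b, s in zip(g2, selected) if s]
    return pearson(g1, g2), pearson(sel_g1, sel_g2)


full_r, selected_r = simulate_selection()
print(full_r, selected_r)
```

Under this rule the selected-sample correlation is negative, because among participants a low allele count at one variant tends to be compensated by a high count at the other. Comparing observed correlations with those expected under no selection is the signal that the proposed tests look for.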