The impact of coarsening an exposure on partial identifiability in instrumental variable settings
In instrumental variable (IV) settings, such as imperfect randomized trials and observational studies with Mendelian randomization, one may encounter a continuous exposure, the causal effect of which is not of true interest. Instead, scientific interest may lie in a coarsened version of this exposure. Although there is a lengthy literature on the impact of coarsening an exposure, with several works focusing specifically on IV settings, all methods proposed in this literature require parametric assumptions. Instead, just as in the standard IV setting, one can consider partial identification via bounds that make no parametric assumptions. This was first pointed out in Alexander Balke's PhD dissertation. We extend and clarify his work and derive novel bounds in several settings, including for a three-level IV, which is most likely the case in Mendelian randomization. We demonstrate our findings in two real data examples, a randomized trial for peanut allergy in infants and a Mendelian randomization setting investigating the effect of homocysteine on cardiovascular disease.
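For readers unfamiliar with nonparametric IV bounds, the sketch below illustrates the classical linear-programming construction for a binary instrument, binary exposure, and binary outcome; the coarsened-exposure and three-level-IV bounds of this paper are not reproduced. The observed probabilities in `p_obs` are hypothetical.

```python
# Hedged sketch: bounds on the average treatment effect via linear programming
# over latent response types, for binary Z (instrument), X (exposure), Y (outcome).
import itertools
import numpy as np
from scipy.optimize import linprog

def ate_bounds(p_obs):
    """p_obs[z][(x, y)] = P(X = x, Y = y | Z = z), assumed known."""
    # Response types: (X(0), X(1), Y(0), Y(1)), 16 joint probabilities in total.
    types = list(itertools.product([0, 1], repeat=4))
    # Match every observed cell P(X = x, Y = y | Z = z) to the consistent types.
    A_eq, b_eq = [], []
    for z in (0, 1):
        for x in (0, 1):
            for y in (0, 1):
                A_eq.append([float(t[z] == x and t[2 + x] == y) for t in types])
                b_eq.append(p_obs[z][(x, y)])
    # The average treatment effect is linear in the type probabilities.
    c = np.array([t[3] - t[2] for t in types], dtype=float)
    lower = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * len(types))
    upper = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * len(types))
    return lower.fun, -upper.fun

# Hypothetical observed conditional probabilities (each z-slice sums to 1).
p_obs = {
    0: {(0, 0): 0.45, (0, 1): 0.25, (1, 0): 0.10, (1, 1): 0.20},
    1: {(0, 0): 0.20, (0, 1): 0.10, (1, 0): 0.20, (1, 1): 0.50},
}
print(ate_bounds(p_obs))
```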
Fast standard error estimation for joint models of longitudinal and time-to-event data based on stochastic EM algorithms
Maximum likelihood inference can often become computationally intensive when performing joint modeling of longitudinal and time-to-event data, due to the intractable integrals in the joint likelihood function. The computational challenges escalate further when modeling HIV-1 viral load data, owing to the nonlinear trajectories and the presence of left-censored data resulting from the assay's lower limit of quantification. In this paper, for a joint model comprising a nonlinear mixed-effects model and a Cox proportional hazards model, we develop a computationally efficient stochastic EM (StEM) algorithm for parameter estimation. Furthermore, we propose a novel technique for fast standard error estimation, which directly estimates standard errors from the results of StEM iterations and is broadly applicable to various joint modeling settings, such as those containing generalized linear mixed-effects models, parametric survival models, or joint models with more than two submodels. We evaluate the performance of the proposed methods through simulation studies and apply them to HIV-1 viral load data from six AIDS Clinical Trials Group studies to characterize viral rebound trajectories following the interruption of antiretroviral therapy (ART), accounting for the informative duration of off-ART periods.
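As a toy illustration of the StEM idea (not the joint model or the standard-error technique described above), the sketch below runs a stochastic EM for a normal mean and variance with left-censoring at a lower limit of quantification: censored values are imputed by simulation in the E-step, complete-data maximum likelihood updates form the M-step, and post-burn-in iterates are averaged.

```python
# Hedged sketch of stochastic EM (StEM) for left-censored normal data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu_true, sigma_true, lloq = 1.0, 1.5, 0.0
y = rng.normal(mu_true, sigma_true, size=500)
censored = y < lloq
y_obs = np.where(censored, lloq, y)   # values below the limit are only known to be below it

mu, sigma = y_obs.mean(), y_obs.std()
draws = []
for _ in range(500):
    # Simulation E-step: draw censored values from the normal truncated above at lloq.
    b = (lloq - mu) / sigma
    imputed = stats.truncnorm.rvs(-np.inf, b, loc=mu, scale=sigma,
                                  size=int(censored.sum()), random_state=rng)
    y_complete = y_obs.copy()
    y_complete[censored] = imputed
    # M-step: complete-data maximum likelihood updates.
    mu, sigma = y_complete.mean(), y_complete.std()
    draws.append((mu, sigma))

burn = 100
print("StEM estimate (mu, sigma):", np.mean(draws[burn:], axis=0))
```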
Recoverability of causal effects under presence of missing data: a longitudinal case study
Missing data in multiple variables is a common issue. We investigate the applicability of the framework of graphical models for handling missing data to a complex longitudinal pharmacological study of children with HIV treated with an efavirenz-based regimen as part of the CHAPAS-3 trial. Specifically, we examine whether the causal effects of interest, defined through static interventions on multiple continuous variables, can be recovered (estimated consistently) from the available data only. So far, no general algorithms are available to decide on recoverability, and decisions have to be made on a case-by-case basis. We emphasize the sensitivity of recoverability to even the smallest changes in the graph structure, and present recoverability results for three plausible missingness-directed acyclic graphs (m-DAGs) in the CHAPAS-3 study, informed by clinical knowledge. Furthermore, we propose the concept of a "closed missingness mechanism": if missing data are generated by this mechanism, an available case analysis is admissible for consistent estimation of any statistical or causal estimand, even if data are missing not at random (MNAR). Both simulations and theoretical considerations demonstrate how, in the assumed MNAR setting of our study, a complete or available case analysis can be superior to multiple imputation, and how estimation results vary depending on the assumed missingness DAG. Our analyses demonstrate an innovative application of missingness DAGs to complex longitudinal real-world data, while highlighting the sensitivity of the results to the assumed causal model.
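The following small simulation (a generic illustration under assumed mechanisms, not the CHAPAS-3 m-DAGs) shows one way a complete-case analysis can be consistent under MNAR while a MAR-style imputation is not: here the exposure is missing with probability depending on its own value, and missingness does not depend on the outcome given the exposure.

```python
# Hedged sketch: complete-case vs naive imputation under a self-masking MNAR mechanism.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
X = rng.normal(size=n)
Y = 1.0 * X + rng.normal(size=n)                      # true slope of Y on X is 1
miss = rng.random(n) < 1.0 / (1.0 + np.exp(-2 * X))   # MNAR: P(missing) depends on X itself
obs = ~miss

def slope(x, y):
    c = np.cov(x, y)
    return c[0, 1] / c[0, 0]

print("complete-case slope:", slope(X[obs], Y[obs]))  # approximately 1
# Naive MAR-style single imputation of X from Y, fitted on the complete cases.
b = slope(Y[obs], X[obs])
a = X[obs].mean() - b * Y[obs].mean()
X_imp = np.where(obs, X, a + b * Y)
print("imputed-data slope:", slope(X_imp, Y))         # biased away from 1
```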
A scalable two-stage Bayesian approach accounting for exposure measurement error in environmental epidemiology
Accounting for exposure measurement errors has been recognized as a crucial problem in environmental epidemiology for over two decades. Bayesian hierarchical models offer a coherent probabilistic framework for evaluating associations between environmental exposures and health effects, taking into account exposure measurement errors introduced by uncertainty in the estimated exposure as well as spatial misalignment between the exposure and health outcome data. While two-stage Bayesian analyses are often regarded as a good alternative to fully Bayesian analyses when joint estimation is not feasible, there has been minimal research on how to properly propagate uncertainty from the first-stage exposure model to the second-stage health model, especially in the case of a large number of participant locations along with spatially correlated exposures. We propose a scalable two-stage Bayesian approach, called the sparse multivariate normal (sparse MVN) prior approach, based on the Vecchia approximation, for assessing associations between exposures and health outcomes in environmental epidemiology. We compare its performance with existing approaches through simulation. Our sparse MVN prior approach shows performance comparable to the fully Bayesian approach, which is a gold standard but is impossible to implement in some cases. We investigate the association between source-specific and pollutant-specific (nitrogen dioxide [NO2]) exposures and the birth weight of full-term infants born in 2012 in Harris County, Texas, using several approaches, including the newly developed method.
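The core computational device, a Vecchia-type sparse approximation to a Gaussian process precision matrix, can be sketched as below; the ordering, neighbor sets, and exponential covariance used here are illustrative assumptions rather than the paper's exact specification.

```python
# Hedged sketch: Vecchia approximation yielding a (sparse in practice) precision matrix.
import numpy as np

def exp_cov(coords, range_=0.3, var=1.0):
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return var * np.exp(-d / range_)

def vecchia_precision(coords, m=10):
    n = coords.shape[0]
    C = exp_cov(coords)
    B = np.zeros((n, n))   # regression coefficients on previously ordered neighbors
    d = np.zeros(n)        # conditional variances
    d[0] = C[0, 0]
    for i in range(1, n):
        prev = np.arange(i)
        dist = np.linalg.norm(coords[prev] - coords[i], axis=1)
        nb = prev[np.argsort(dist)[:m]]            # at most m nearest previous points
        b = np.linalg.solve(C[np.ix_(nb, nb)], C[nb, i])
        B[i, nb] = b
        d[i] = C[i, i] - C[i, nb] @ b
    IB = np.eye(n) - B
    return IB.T @ np.diag(1.0 / d) @ IB            # dense here only for clarity

coords = np.random.default_rng(1).uniform(size=(200, 2))
Q = vecchia_precision(coords, m=10)                # approximate precision of the GP at coords
```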
Speeding up interval estimation for R2-based mediation effect of high-dimensional mediators via cross-fitting
Mediation analysis is a useful tool in investigating how molecular phenotypes such as gene expression mediate the effect of exposure on health outcomes. However, commonly used mean-based total mediation effect measures may suffer from cancellation of component-wise mediation effects in opposite directions in the presence of high-dimensional omics mediators. To overcome this limitation, we recently proposed a variance-based R-squared total mediation effect measure that relies on the computationally intensive nonparametric bootstrap for confidence interval estimation. In the work described herein, we formulated a more efficient two-stage, cross-fitted estimation procedure for the R2 measure. To avoid potential bias, we performed iterative Sure Independence Screening (iSIS) in two subsamples to exclude the non-mediators, followed by ordinary least squares regressions for the variance estimation. We then constructed confidence intervals based on the newly derived closed-form asymptotic distribution of the R2 measure. Extensive simulation studies demonstrated that this proposed procedure is much more computationally efficient than the resampling-based method, with comparable coverage probability. Furthermore, when applied to the Framingham Heart Study, the proposed method replicated the established finding of gene expression mediating age-related variation in systolic blood pressure and identified the role of gene expression profiles in the relationship between sex and high-density lipoprotein cholesterol level. The proposed estimation procedure is implemented in R package CFR2M.
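A bare-bones version of the cross-fitting pattern is sketched below: mediators are screened on one subsample and the low-dimensional regression is fit on the other, then the roles are swapped and the two estimates are averaged. Marginal correlation screening stands in for iSIS, and the reported quantity (share of outcome variance captured by the selected mediators) is only a placeholder for the paper's R2 mediation measure; all data are simulated.

```python
# Hedged sketch of two-subsample cross-fitting with screening and OLS.
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 400, 1000, 5
X = rng.normal(size=n)                          # exposure
M = rng.normal(size=(n, p))
M[:, :k] += 0.5 * X[:, None]                    # first k columns act as mediators
Y = M[:, :k].sum(axis=1) + rng.normal(size=n)   # outcome

def estimate(train, test, n_keep=20):
    # Screen mediators on `train` by marginal correlation with the outcome (stand-in for iSIS).
    cors = np.abs([np.corrcoef(M[train, j], Y[train])[0, 1] for j in range(p)])
    keep = np.argsort(cors)[-n_keep:]
    # Fit OLS of Y on the selected mediators on `test`; report a placeholder variance share.
    Z = np.column_stack([np.ones(test.size), M[np.ix_(test, keep)]])
    beta, *_ = np.linalg.lstsq(Z, Y[test], rcond=None)
    return np.var(Z @ beta) / np.var(Y[test])

idx = rng.permutation(n)
half1, half2 = idx[: n // 2], idx[n // 2:]
print(0.5 * (estimate(half1, half2) + estimate(half2, half1)))
```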
Shared parameter modeling of longitudinal data allowing for possibly informative visiting process and terminal event
Joint modeling of longitudinal and time-to-event data, particularly through shared parameter models (SPMs), is a common approach for handling longitudinal marker data with an informative terminal event. A critical but often neglected assumption in this context is that the visiting/observation process is noninformative, depending solely on past marker values and visit times. When this assumption fails, the visiting process becomes informative, potentially resulting in biased SPM estimates. Existing methods generally rely on a conditional independence assumption, positing that the marker model, visiting process, and time-to-event model are independent given shared or correlated random effects. Moreover, they are typically built on an intensity-based visiting process using calendar time. This study introduces a unified approach for jointly modeling a normally distributed marker, the visiting process, and time-to-event data in the form of competing risks. Our model conditions on the history of observed marker values, prior visit times, the marker's random effects, and possibly a frailty term independent of the random effects. While our approach aligns with the shared-parameter framework, it does not presume conditional independence between the processes. Additionally, the visiting process can be defined on either a gap time scale, via proportional hazards models, or a calendar time scale, via proportional intensity models. Through extensive simulation studies, we assess the performance of our proposed methodology. We demonstrate that disregarding an informative visiting process can yield substantially biased marker estimates. However, misspecification of the visiting process can also lead to biased estimates. The gap time formulation exhibits greater robustness than the intensity-based model when the visiting process is misspecified. In general, enriching the visiting process with prior visit history enhances performance. We further apply our methodology to real longitudinal HIV data, where visit frequency varies substantially among individuals.
Functional quantile principal component analysis
This paper introduces functional quantile principal component analysis (FQPCA), a dimensionality reduction technique that extends the concept of functional principal component analysis (FPCA) to the examination of participant-specific quantile curves. Our approach borrows strength across participants to estimate patterns in quantiles, and uses participant-level data to estimate loadings on those patterns. As a result, FQPCA is able to capture shifts in the scale and distribution of data that affect participant-level quantile curves, and is also a robust methodology suitable for dealing with outliers, heteroscedastic data, or skewed data. The need for such methodology is exemplified by physical activity data collected using wearable devices. Participants often differ in the timing and intensity of physical activity behaviors, and capturing information beyond the participant-level expected value curves produced by FPCA is necessary for a robust quantification of diurnal patterns of activity. We illustrate our methods using accelerometer data from the National Health and Nutrition Examination Survey, and produce participant-level 10%, 50%, and 90% quantile curves over 24 h of activity. The proposed methodology is supported by simulation results, and is available as an R package.
Selection processes, transportability, and failure time analysis in life history studies
In life history analysis of data from cohort studies, it is important to address the process by which participants are identified and selected. Many health studies select or enrol individuals based on whether they have experienced certain health-related events, for example, disease diagnosis or some complication from disease. Standard methods of analysis rely on assumptions concerning the independence of selection and a person's prospective life history process, given their prior history. Violations of such assumptions are common, however, and can bias estimation of process features. This has implications for the internal and external validity of cohort studies, and for the transportability of results to a population. In this paper, we study failure time analysis by proposing a joint model for the cohort selection process and the failure process of interest. This allows us to address both independence assumptions and the transportability of study results. It is shown that transportability cannot be guaranteed in the absence of auxiliary information on the population. Conditions that produce dependent selection and types of auxiliary data are discussed and illustrated in numerical studies. The proposed framework is applied to a study of the risk of psoriatic arthritis in persons with psoriasis.
Pooling controls from nested case-control studies with the proportional risks model
The standard approach to regression modeling for cause-specific hazards with prospective competing risks data specifies separate models for each failure type. An alternative proposed by Lunn and McNeil (1995) assumes the cause-specific hazards are proportional across causes. This may be more efficient than the standard approach, and allows the comparison of covariate effects across causes. In this paper, we extend Lunn and McNeil (1995) to nested case-control studies, accommodating scenarios with additional matching and non-proportionality. We also consider the case where data for different causes are obtained from different studies conducted in the same cohort. It is demonstrated that while only modest gains in efficiency are possible in full cohort analyses, substantial gains may be attained in nested case-control analyses for failure types that are relatively rare. Extensive simulation studies are conducted and real data analyses are provided using the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) study.
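The underlying Lunn-McNeil construction can be sketched as a data-duplication step: with K causes, each subject contributes K records, one per cause, with cause dummies and optional cause-by-covariate interactions; a single Cox fit on the expanded data then encodes proportional cause-specific hazards. Column names below are illustrative, and the expanded data would be passed to any Cox software.

```python
# Hedged sketch of the Lunn-McNeil data duplication for competing risks.
import pandas as pd

def lunn_mcneil_expand(df, causes=(1, 2), cause_col="cause", covariates=("age", "sex")):
    rows = []
    for _, r in df.iterrows():
        for k in causes:
            rec = {"id": r["id"], "time": r["time"],
                   "event": int(r[cause_col] == k)}      # 1 only for the cause that occurred
            for j in causes[1:]:
                rec[f"cause{j}"] = int(k == j)           # cause dummies (first cause is reference)
                for c in covariates:
                    rec[f"{c}_x_cause{j}"] = r[c] * int(k == j)  # optional non-proportionality terms
            for c in covariates:
                rec[c] = r[c]
            rows.append(rec)
    return pd.DataFrame(rows)

cohort = pd.DataFrame({"id": [1, 2, 3], "time": [5.2, 3.1, 8.0],
                       "cause": [1, 2, 0],               # 0 indicates censoring
                       "age": [61, 58, 65], "sex": [0, 1, 1]})
expanded = lunn_mcneil_expand(cohort)                    # two records per subject
```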
Dynamic and concordance-assisted learning for risk stratification with application to Alzheimer's disease
Dynamic prediction models capable of retaining accuracy by evolving over time could play a significant role in monitoring disease progression in clinical practice. In biomedical studies with long-term follow-up, participants are often monitored through periodic clinical visits with repeat measurements until the occurrence of the event of interest (e.g. disease onset) or the study end. Acknowledging the dynamic nature of disease risk and the clinical information contained in the longitudinal markers, we propose an innovative concordance-assisted learning algorithm to derive a real-time risk stratification score. The proposed approach bypasses the need to fit regression models, such as joint models of the longitudinal markers and time-to-event outcome, and hence enjoys the desirable property of model robustness. Simulation studies confirmed that the proposed method has satisfactory performance in dynamically monitoring the risk of developing disease and differentiating high-risk and low-risk populations over time. We apply the proposed method to the Alzheimer's Disease Neuroimaging Initiative data and develop a dynamic risk score of Alzheimer's disease for patients with mild cognitive impairment using multiple longitudinal markers and baseline prognostic factors.
A Bayesian pharmacokinetics integrated phase I-II design to optimize dose-schedule regimes
The schedule of administering a drug has a profound impact on the toxicity and efficacy profiles of the drug through changing its pharmacokinetics (PK). PK is an innate and indispensable component of dose-schedule optimization. Motivated by this, we propose a Bayesian PK integrated dose-schedule finding (PKIDS) design to identify the optimal dose-schedule regime by integrating PK, toxicity, and efficacy data. Based on the causal pathway that dose and schedule affect PK, which in turn affects efficacy and toxicity, we jointly model the three endpoints by first specifying a Bayesian hierarchical model for the marginal distribution of the longitudinal dose-concentration process. Conditional on the drug concentration in plasma, we jointly model toxicity and efficacy as a function of the concentration. We quantify the risk-benefit of regimes using utility, continuously updating the estimates of PK, toxicity, and efficacy based on interim data, and make adaptive decisions to assign new patients to appropriate dose-schedule regimes via adaptive randomization. The simulation study shows that the PKIDS design has desirable operating characteristics.
HMM for discovering decision-making dynamics using reinforcement learning experiments
Major depressive disorder (MDD), a leading cause of years of life lived with disability, presents challenges in diagnosis and treatment due to its complex and heterogeneous nature. Emerging evidence indicates that reward processing abnormalities may serve as a behavioral marker for MDD. To measure reward processing, patients perform computer-based behavioral tasks in the laboratory that involve making choices or responding to stimuli associated with different outcomes, such as gains or losses. Reinforcement learning (RL) models are fitted to extract parameters that measure various aspects of reward processing (e.g. reward sensitivity) to characterize how patients make decisions in behavioral tasks. Recent findings suggest the inadequacy of characterizing reward learning solely based on a single RL model; instead, there may be switching of decision-making processes between multiple strategies. An important scientific question is how the dynamics of strategies in decision-making affect the reward learning ability of individuals with MDD. Motivated by the probabilistic reward task within the Establishing Moderators and Biosignatures of Antidepressant Response in Clinical Care (EMBARC) study, we propose a novel RL-HMM (hidden Markov model) framework for analyzing reward-based decision-making. Our model accommodates decision-making strategy switching between two distinct approaches under an HMM: subjects making decisions based on the RL model or opting for random choices. We account for a continuous RL state space and allow time-varying transition probabilities in the HMM. We introduce a computationally efficient expectation-maximization (EM) algorithm for parameter estimation and use a nonparametric bootstrap for inference. Extensive simulation studies validate the finite-sample performance of our method. We apply our approach to the EMBARC study to show that MDD patients are less engaged in RL compared to healthy controls, and that engagement is associated with brain activities in the negative affect circuitry during an emotional conflict task.
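A stripped-down version of the two-strategy construction is sketched below: a two-state HMM whose hidden state indicates whether the subject follows a softmax RL rule (with Q-values updated by a learning rate) or chooses at random, evaluated by a forward filtering recursion for fixed parameters. This is an illustration of the idea only; the continuous state space, time-varying transitions, and EM algorithm of the paper are not reproduced, and all inputs are simulated.

```python
# Hedged sketch: likelihood of a choice sequence under an RL-vs-random two-state HMM.
import numpy as np

def rl_hmm_loglik(choices, rewards, alpha=0.2, beta=3.0,
                  trans=((0.95, 0.05), (0.05, 0.95)), init=(0.5, 0.5)):
    """choices in {0, 1} for a two-armed task; rewards observed after each choice."""
    trans = np.asarray(trans)
    Q = np.zeros(2)                             # RL action values
    log_pred = np.log(np.asarray(init))         # predicted hidden-state distribution
    loglik = 0.0
    for c, r in zip(choices, rewards):
        p_rl = np.exp(beta * Q) / np.exp(beta * Q).sum()   # softmax choice probabilities
        emis = np.array([p_rl[c], 0.5])         # state 0: RL strategy, state 1: random choice
        log_joint = log_pred + np.log(emis)
        step = np.logaddexp(*log_joint)         # likelihood contribution of this trial
        loglik += step
        log_filt = log_joint - step             # filtered state distribution
        log_pred = np.log(np.exp(log_filt) @ trans)
        Q[c] += alpha * (r - Q[c])              # RL update from the observed reward
    return loglik

rng = np.random.default_rng(3)
choices = rng.integers(0, 2, size=100)
rewards = rng.binomial(1, np.where(choices == 1, 0.7, 0.3))
print(rl_hmm_loglik(choices, rewards))
```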
On the Addams family of discrete frailty distributions for modeling multivariate case I interval-censored data
Random effect models for time-to-event data, also known as frailty models, provide a conceptually appealing way of quantifying association between survival times and of representing heterogeneities resulting from factors which may be difficult or impossible to measure. In the literature, the random effect is usually assumed to have a continuous distribution. However, in some areas of application, discrete frailty distributions may be more appropriate. The present paper is about the implementation and interpretation of the Addams family of discrete frailty distributions. We propose methods of estimation for this family of distributions in the context of shared frailty models for the hazard rates of case I interval-censored data. Our optimization framework allows for stratification of random effect distributions by covariates. We highlight interpretational advantages of the Addams family of discrete frailty distributions and the K-point distribution as compared to other frailty distributions. A unique feature of the Addams family and the K-point distribution is that the support of the frailty distribution depends on its parameters. This feature is best exploited by imposing a model on the distributional parameters, resulting in a model with non-homogeneous covariate effects that can be analyzed using standard measures such as the hazard ratio. Our methods are illustrated with applications to multivariate case I interval-censored infection data.
Adaptive Gaussian Markov random fields for child mortality estimation
The under-5 mortality rate (U5MR), a critical health indicator, is typically estimated from household surveys in lower- and middle-income countries. Spatio-temporal disaggregation of household survey data can lead to highly variable estimates of U5MR, necessitating the use of smoothing models that borrow information across space and time. The assumptions of common smoothing models may be unrealistic when certain time periods or regions are expected to have shocks in mortality relative to their neighbors, which can lead to oversmoothing of U5MR estimates. In this paper, we develop a spatial and temporal smoothing approach based on Gaussian Markov random field models that incorporate knowledge of these expected shocks in mortality. We demonstrate in a simulation study the potential for these models to improve upon alternatives that do not incorporate knowledge of expected shocks. We apply these models to estimate U5MR in Rwanda at the national level from 1985 to 2019, a time period which includes the Rwandan civil war and genocide.
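The basic device can be illustrated with a first-order random-walk GMRF whose increment precisions are downweighted across periods flagged as expected shocks, so the prior smooths less aggressively there. The parameterization below is an assumption for illustration, not the paper's model.

```python
# Hedged sketch: adaptive RW1 precision matrix with reduced precision across shock periods.
import numpy as np

def adaptive_rw1_precision(n_times, shock_idx, tau=100.0, shock_scale=0.1):
    """Precision of an RW1 prior; the increment into period t+1 gets precision
    tau * shock_scale when t+1 is a flagged shock period, and tau otherwise."""
    Q = np.zeros((n_times, n_times))
    for t in range(n_times - 1):
        prec = tau * (shock_scale if (t + 1) in shock_idx else 1.0)
        # add the 2x2 block of the squared increment (x_{t+1} - x_t)
        Q[t, t] += prec
        Q[t + 1, t + 1] += prec
        Q[t, t + 1] -= prec
        Q[t + 1, t] -= prec
    return Q

# Example: 35 annual periods with an expected mortality shock around periods 9-11.
Q = adaptive_rw1_precision(35, shock_idx={9, 10, 11})
```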
Stochastic EM algorithm for partially observed stochastic epidemics with individual heterogeneity
We develop a stochastic epidemic model progressing over dynamic networks, where infection rates are heterogeneous and may vary with individual-level covariates. The joint dynamics are modeled as a continuous-time Markov chain, such that disease transmission is constrained by the contact network structure and network evolution is in turn influenced by individual disease statuses. To accommodate partial epidemic observations commonly seen in real-world data, we propose a stochastic EM algorithm for inference, introducing key innovations that include efficient conditional samplers for imputing missing infection and recovery times which respect the dynamic contact network. Experiments on both synthetic and real datasets demonstrate that our inference method can accurately and efficiently recover model parameters and provide valuable insight in the presence of unobserved disease episodes in epidemic data.
Regression and alignment for functional data and network topology
In the brain, functional connections form a network whose topological organization can be described by graph-theoretic network diagnostics. These include characterizations of the community structure, such as modularity and participation coefficient, which have been shown to change over the course of childhood and adolescence. To investigate whether such changes in the functional network are associated with changes in cognitive performance during development, network studies often rely on an arbitrary choice of preprocessing parameters, in particular the proportional threshold of network edges. Because the choice of parameter can impact the value of the network diagnostic, and therefore downstream conclusions, we propose to circumvent that choice by conceptualizing the network diagnostic as a function of the parameter. As opposed to a single value, a network diagnostic curve describes the connectome topology at multiple scales, from the sparsest group of the strongest edges to the entire edge set. To relate these curves to executive function and other covariates, we use scalar-on-function regression, which is more flexible than previous functional data-based models used in network neuroscience. We then consider how systematic differences between networks can manifest in misalignment of diagnostic curves, and consequently propose a supervised curve alignment method that incorporates auxiliary information from other variables. Our algorithm performs both functional regression and alignment via an iterative, penalized, and nonlinear likelihood optimization. The illustrated method has the potential to improve the interpretability and generalizability of neuroscience studies where the goal is to study heterogeneity among a mixture of function- and scalar-valued measures.
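To make the "diagnostic as a function of the threshold" idea concrete, the sketch below computes a simple diagnostic curve over proportional thresholds, retaining the strongest edges first; the fraction of nodes in the largest connected component stands in for the community-structure diagnostics (modularity, participation coefficient) used in the paper, and the connectivity matrix is simulated.

```python
# Hedged sketch: a network diagnostic evaluated as a curve over proportional thresholds.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def diagnostic_curve(W, thresholds=np.linspace(0.05, 1.0, 20)):
    """W: symmetric connectivity matrix; returns one diagnostic value per threshold."""
    n = W.shape[0]
    rows, cols = np.triu_indices(n, k=1)
    order = np.argsort(W[rows, cols])[::-1]          # strongest edges first
    curve = []
    for p in thresholds:
        keep = order[: int(np.ceil(p * order.size))]
        A = np.zeros_like(W)
        A[rows[keep], cols[keep]] = 1.0
        A = A + A.T
        _, labels = connected_components(csr_matrix(A), directed=False)
        curve.append(np.bincount(labels).max() / n)  # size of the largest component
    return np.asarray(curve)

rng = np.random.default_rng(4)
W = rng.random((60, 60))
W = (W + W.T) / 2
print(diagnostic_curve(W))
```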
Exposure proximal immune correlates analysis
Immune response decays over time, and vaccine-induced protection often wanes. Understanding how vaccine efficacy changes over time is critical to guiding the development and application of vaccines in preventing infectious diseases. The objective of this article is to develop statistical methods that assess the effect of decaying immune responses on the risk of disease and on vaccine efficacy, within the context of Cox regression with sparse sampling of immune responses, in a baseline-naive population. We aim to further disentangle the various aspects of the time-varying vaccine effect, whether direct on disease or mediated through immune responses. Based on time-to-event data from a vaccine efficacy trial and sparse sampling of longitudinal immune responses, we propose a weighted estimated induced likelihood approach that models the longitudinal immune response trajectory and the time to event separately. This approach assesses the effects of the decaying immune response, the peak immune response, and/or the waning vaccine effect on the risk of disease. The proposed method is applicable not only to standard randomized trial designs but also to augmented vaccine trial designs that re-vaccinate uninfected placebo recipients at the end of the standard trial period. We conducted simulation studies to evaluate the performance of our method and applied the method to analyze immune correlates from a phase III SARS-CoV-2 vaccine trial.
Incorporating prior information in gene expression network-based cancer heterogeneity analysis
Cancer is molecularly heterogeneous, with seemingly similar patients having different molecular landscapes and accordingly different clinical behaviors. In recent studies, gene expression networks have been shown to be more effective/informative for cancer heterogeneity analysis than some simpler measures. Gene interconnections can be classified as "direct" and "indirect," where the latter can be caused by shared genomic regulators (such as transcription factors, microRNAs, and other regulatory molecules) and other mechanisms. It has been suggested that incorporating the regulators of gene expression in network analysis and focusing on the direct interconnections can lead to a deeper understanding of the more essential gene interconnections. Such analysis can be seriously challenged by the large number of parameters (jointly caused by network analysis, the incorporation of regulators, and heterogeneity) and by often weak signals. To effectively tackle this problem, we propose incorporating prior information contained in the published literature. A key challenge is that such prior information can be partial or even wrong. We develop a two-step procedure that can flexibly accommodate different levels of prior information quality. Simulation demonstrates the effectiveness of the proposed approach and its superiority over relevant competitors. In the analysis of a breast cancer dataset, findings different from those of the alternatives are made, and the identified sample subgroups have important clinical differences.
Estimating causal effects for binary outcomes using per-decision inverse probability weighting
Micro-randomized trials are commonly conducted to optimize mobile health interventions such as push notifications for behavior change. In analyzing such trials, causal excursion effects are often of primary interest, and their estimation typically involves inverse probability weighting (IPW). However, in a micro-randomized trial, additional treatments can often occur during the time window over which an outcome is defined, and this can greatly inflate the variance of the causal effect estimator because IPW would involve a product of numerous weights. To reduce variance and improve estimation efficiency, we propose two new estimators using a modified version of IPW, which we call "per-decision IPW." The second estimator further improves efficiency using the projection idea from semiparametric efficiency theory. These estimators are applicable when the outcome is binary and can be expressed as the maximum of a series of sub-outcomes defined over sub-intervals of time. We establish the estimators' consistency and asymptotic normality. Through simulation studies and real data applications, we demonstrate a substantial efficiency improvement of the proposed estimators over existing ones. The new estimators can be used to improve the precision of primary and secondary analyses for micro-randomized trials with binary outcomes.
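The weighting contrast can be illustrated with the generic per-decision importance-weighting idea from off-policy evaluation: the full-product weight multiplies the probability ratios for every decision in the window, while the per-decision weight for the k-th sub-outcome multiplies ratios only up to decision k. The randomization probabilities and treatments below are hypothetical, and the paper's estimators for the binary max-of-sub-outcomes case involve structure not shown here.

```python
# Hedged sketch: full-product vs per-decision weights for a window of decision points.
import numpy as np

def full_product_weight(a, p, target=1):
    """a: observed treatments in the window; p: known randomization probabilities."""
    ratio = np.where(a == target, 1.0 / p, 0.0)   # indicator of target regime over propensity
    return np.prod(ratio)

def per_decision_weights(a, p, target=1):
    ratio = np.where(a == target, 1.0 / p, 0.0)
    return np.cumprod(ratio)                      # weight for sub-outcome k uses decisions 1..k only

a = np.array([1, 1, 0])            # observed treatments over a 3-decision window
p = np.array([0.5, 0.4, 0.6])      # randomization probabilities at each decision
print(full_product_weight(a, p))   # single weight applied to the whole window outcome
print(per_decision_weights(a, p))  # one weight per sub-outcome, generally less variable
```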
Direct estimation and inference of higher-level correlations from lower-level measurements with applications in gene-pathway and proteomics studies
This paper tackles the challenge of estimating correlations between higher-level biological variables (e.g. proteins and gene pathways) when only lower-level measurements are directly observed (e.g. peptides and individual genes). Existing methods typically aggregate lower-level data into higher-level variables and then estimate correlations based on the aggregated data. However, different data aggregation methods can yield varying correlation estimates, as they target different higher-level quantities. Our solution is a latent factor model that directly estimates these higher-level correlations from lower-level data without the need for data aggregation. We further introduce a shrinkage estimator to ensure positive definiteness and improve the accuracy of the estimated correlation matrix. We also establish the asymptotic normality of our estimator, enabling efficient computation of P-values for the identification of significant correlations. The effectiveness of our approach is demonstrated through comprehensive simulations and the analysis of proteomics and gene expression datasets. We develop the R package highcor for implementing our method.
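As a generic illustration of the positive-definiteness repair (the paper's shrinkage estimator is tailored to its latent factor model and is not reproduced), the sketch below shrinks an estimated correlation matrix linearly toward the identity just enough to lift its smallest eigenvalue.

```python
# Hedged sketch: minimal linear shrinkage toward the identity to restore positive definiteness.
import numpy as np

def shrink_to_pd(R, eps=1e-6):
    """Return a positive-definite correlation matrix close to the estimate R."""
    R = (R + R.T) / 2
    lam_min = np.linalg.eigvalsh(R).min()
    if lam_min < eps:
        w = (eps - lam_min) / (1.0 - lam_min)   # smallest weight lifting the minimum eigenvalue to eps
        R = (1 - w) * R + w * np.eye(R.shape[0])
    d = np.sqrt(np.diag(R))
    return R / np.outer(d, d)                   # re-normalize to unit diagonal

R_hat = np.array([[1.0, 0.9, 0.9],
                  [0.9, 1.0, -0.9],
                  [0.9, -0.9, 1.0]])            # an indefinite "correlation" estimate
print(np.linalg.eigvalsh(shrink_to_pd(R_hat)))  # all eigenvalues now positive
```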