STATISTICAL MODELLING

Canonical correlation analysis in high dimensions with structured regularization
Tuzhilina E, Tozzi L and Hastie T
Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. A regularized modification of canonical correlation analysis (RCCA), which imposes an ℓ2 penalty on the CCA coefficients, is widely used in applications with high-dimensional data. One limitation of such regularization is that it ignores any data structure, treating all the features equally, which can be ill-suited for some applications. In this article we introduce several approaches to regularizing CCA that take the underlying data structure into account. In particular, the proposed group regularized canonical correlation analysis (GRCCA) is useful when the variables are correlated in groups. We illustrate some computational strategies to avoid excessive computations with regularized CCA in high dimensions. We demonstrate the application of these methods in our motivating application from neuroscience, as well as in a small simulation example.
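The core computation in RCCA reduces to a ridge-regularized generalized eigenproblem. Below is a minimal Python sketch, under the usual assumptions (column-centred data, ℓ2 penalties on both covariance blocks), for extracting the leading canonical pair; the group-regularized variant (GRCCA) and the high-dimensional computational shortcuts discussed in the article are not reproduced here.

```python
import numpy as np
from scipy.linalg import eigh

def rcca_leading_pair(X, Y, lam_x=1.0, lam_y=1.0):
    """Leading canonical pair for ridge-regularized CCA (illustrative sketch).

    X (n x p) and Y (n x q) are column-centred data matrices; lam_x and lam_y
    are the ridge penalties added to the two covariance blocks.
    """
    n = X.shape[0]
    Sxx = X.T @ X / n + lam_x * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + lam_y * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    # Generalized eigenproblem for the X-side coefficients:
    #   Sxy Syy^{-1} Syx a = rho^2 Sxx a
    M = Sxy @ np.linalg.solve(Syy, Sxy.T)
    vals, vecs = eigh(M, Sxx)
    a = vecs[:, -1]                          # eigenvector of the largest eigenvalue
    b = np.linalg.solve(Syy, Sxy.T @ a)      # corresponding Y-side coefficients
    b /= np.sqrt(b @ Syy @ b)                # normalize so that b' Syy b = 1
    rho = np.sqrt(max(vals[-1], 0.0))        # regularized canonical correlation
    return a, b, rho
```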
Streamlined variational inference for higher level group-specific curve models
Menictas M, Nolan TH, Simpson DG and Wand MP
A two-level group-specific curve model is such that the mean response of each member of a group is a separate smooth function of a predictor of interest. The three-level extension is such that one grouping variable is nested within another, and higher level extensions are analogous. Streamlined variational inference for higher level group-specific curve models is a challenging problem. We confront it by systematically working through the two-level and then the three-level case, making use of the higher level sparse matrix infrastructure laid down in Nolan and Wand (2019). A motivation is the analysis of data from ultrasound technology, for which three-level group-specific curve models are appropriate. Whilst extension to more than three levels is not covered explicitly, the pattern established by our systematic approach sheds light on what is required for even higher level group-specific curve models.
Joint modelling of longitudinal and survival data in the presence of competing risks with applications to prostate cancer data
Sheikh MT, Ibrahim JG, Gelfond JA, Sun W and Chen MH
This research is motivated by data from the large Selenium and Vitamin E Cancer Prevention Trial (SELECT). Prostate-specific antigens (PSAs) were collected longitudinally, and the survival endpoint was the time to low-grade cancer or the time to high-grade cancer (competing risks). In this article, the goal is to model the longitudinal PSA data and the time to prostate cancer (PC) due to low- or high-grade disease. We consider low-grade and high-grade disease as two competing causes of developing PC. A joint model for simultaneously analysing longitudinal and time-to-event data in the presence of multiple causes of failure (or competing risks) is proposed within the Bayesian framework. The proposed model allows for handling the missing causes of failure in the SELECT data and for implementing an efficient Markov chain Monte Carlo sampling algorithm to sample from the posterior distribution via a novel reparameterization technique. Bayesian criteria, ΔDIC and ΔWAIC, are introduced to quantify the gain in fit in the survival sub-model due to the inclusion of longitudinal data. A simulation study is conducted to examine the empirical performance of the posterior estimates as well as ΔDIC and ΔWAIC, and a detailed analysis of the SELECT data is also carried out to further demonstrate the proposed methodology.
Assessing Importance of Biomarkers: a Bayesian Joint Modeling Approach of Longitudinal and Survival Data with Semicompeting Risks
Zhang F, Chen MH, Cong XJ and Chen Q
Longitudinal biomarkers such as patient-reported outcomes (PROs) and quality of life (QOL) are routinely collected in cancer clinical trials and other studies. Joint modeling of PRO/QOL and survival data can provide a comparative assessment of patient-reported changes in specific symptoms or global measures that correspond to changes in survival. Motivated by a head and neck cancer clinical trial, we develop a class of trajectory-based models for longitudinal and survival data with disease progression. Specifically, we propose a class of mixed effects regression models for the longitudinal measures, a cure rate model for the disease progression time, and a Cox proportional hazards model with time-varying covariates for the overall survival time, accounting for disease progression and treatment switching. Under the semi-competing risks framework, disease progression is the nonterminal event, the occurrence of which is subject to the terminal event of death. The properties of the proposed models are examined in detail. Within the Bayesian paradigm, we derive decompositions of the deviance information criterion (DIC) and the logarithm of the pseudo marginal likelihood (LPML) to assess the fit of the longitudinal component of the model and the fit of each survival component separately. We further develop ΔDIC as well as ΔLPML to determine the importance and contribution of the longitudinal data to the model fit of the progression and survival data.
A Bayesian transition model for missing longitudinal binary outcomes and an application to a smoking cessation study
Li L, Lee JH, Sutton SK, Simmons VN and Brandon TH
Smoking cessation intervention studies often produce data on smoking status at discrete follow-up assessments, frequently with differing amounts of missing data at each assessment. Smoking status in these studies is a dynamic process, with individuals transitioning from smoking to abstinent, as well as from abstinent to smoking, at different times during the intervention. Directly assessing transitions provides an opportunity to answer important questions like 'Does the proposed intervention help smokers remain abstinent or quit smoking more effectively than other interventions?' In this article, we model changes in smoking status and examine how interventions and other covariates affect the transitions. We propose a Bayesian approach for fitting the transition model to the observed data and impute missing outcomes based on a logistic model, which accounts for both missing at random (MAR) and missing not at random (MNAR) mechanisms. The proposed Bayesian approach treats missing data as additional unknown quantities and samples them from their posterior distributions. The performance of the proposed method is investigated through simulation studies and illustrated by data from a randomized controlled trial of smoking cessation interventions. Finally, posterior predictive checking and log pseudo marginal likelihood (LPML) are used to assess model assumptions and perform model comparisons, respectively.
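At its core, a first-order transition model of this kind regresses current smoking status on the previous status and covariates through a logistic link. The sketch below is a hypothetical, minimal version of that likelihood for a single subject; the Bayesian treatment of missing outcomes under MAR and MNAR mechanisms is not reproduced.

```python
import numpy as np

def transition_loglik(beta, alpha, y, X):
    """Log-likelihood of a first-order transition model for binary smoking status:
        logit P(y_t = 1 | y_{t-1}, x_t) = x_t' beta + alpha * y_{t-1}.

    y is a (T,) binary status vector for one subject and X is (T, p);
    y[0] is conditioned on (treated as fixed) for simplicity.
    """
    eta = X[1:] @ beta + alpha * y[:-1]
    p = 1.0 / (1.0 + np.exp(-eta))
    return np.sum(y[1:] * np.log(p) + (1.0 - y[1:]) * np.log1p(-p))
```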
Bayesian Causal Mediation Analysis with Multiple Ordered Mediators
Gao T and Albert JM
Causal mediation analysis provides investigators with insight into how a treatment or exposure can affect an outcome of interest through one or more mediators on the causal pathway. When multiple mediators on the pathway are causally ordered, identification of mediation effects on certain causal pathways requires a sensitivity parameter to be specified. A mixed model-based approach is proposed within the Bayesian framework to connect potential outcomes at different treatment levels and to identify the natural direct and indirect effects on all causal pathways without relying on a sensitivity parameter. The proposed method is illustrated in a linear setting for the mediators and outcome, with mediator-treatment interactions. Sensitivity analysis is performed for the prior choices in the Bayesian models. The proposed Bayesian method is applied to an adolescent dental health study to examine how socioeconomic status can affect dental caries through a sequence of causally ordered mediators, namely dental visits and the oral hygiene index.
Identifying Dynamical Time Series Model Parameters from Equilibrium Samples, with Application to Gene Regulatory Networks
Young WC, Yeung KY and Raftery AE
Gene regulatory network reconstruction is an essential task of genomics in order to further our understanding of how genes interact dynamically with each other. The most readily available data, however, are from steady state observations. These data are not as informative about the relational dynamics between genes as knockout or over-expression experiments, which attempt to control the expression of individual genes. We develop a new framework for network inference using samples from the equilibrium distribution of a vector autoregressive (VAR) time-series model which can be applied to steady state gene expression data. We explore the theoretical aspects of our method and apply the method to synthetic gene expression data generated using GeneNetWeaver.
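The equilibrium (stationary) distribution that this framework exploits is characterized by a discrete Lyapunov equation: for a VAR(1) model x_t = A x_{t-1} + e_t with e_t ~ N(0, Q), the stationary covariance Sigma solves Sigma = A Sigma A' + Q. A small illustrative sketch (with a hypothetical network matrix A; the inference framework itself is not reproduced):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(0)
p = 5
A = 0.4 * rng.standard_normal((p, p)) / np.sqrt(p)   # hypothetical (stable) network matrix
Q = np.eye(p)                                         # innovation covariance

# Stationary covariance of x_t = A x_{t-1} + e_t solves Sigma = A Sigma A' + Q
Sigma = solve_discrete_lyapunov(A, Q)

# Steady-state observations can then be simulated directly from the equilibrium distribution
X_eq = rng.multivariate_normal(np.zeros(p), Sigma, size=200)
```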
Rejoinder to statistical contributions to bioinformatics: Design, modelling, structure learning and integration
Morris JS and Baladandayuthapani V
We thank the discussants for their kind comments and their insightful analysis and discussion, which have substantially added to the contribution of this issue. Overall, it seems the discussants have affirmed many of our primary points, and have also raised a number of other relevant and important issues that we did not emphasize in the paper. Several common threads emerged from these discussions, including the importance of software development, appropriate dissemination, and close collaboration with biomedical scientists and technology experts in order to ensure our work is relevant and impactful. Each discussant also mentioned other areas of bioinformatics that have been impacted by statistical researchers and that we did not highlight in the original paper. In response, we first summarize and discuss these general themes, then respond to the specific comments of each discussant, and finally discuss the additional areas of bioinformatics impacted by statisticians that were mentioned by the reviewers.
Comparison and Contrast of Two General Functional Regression Modeling Frameworks
Morris JS
In this article, Greven and Scheipl describe an impressively general framework for performing functional regression that builds upon the generalized additive modeling framework. Over the past several years, my collaborators and I have also been developing a general framework for functional regression, functional mixed models, which shares many similarities with this framework but also has many differences. In this discussion, I compare and contrast these two frameworks to illuminate the characteristics of each, highlighting their respective strengths and weaknesses, and providing recommendations regarding the settings in which each approach might be preferable.
Discussion of 'Regularized Regression for Categorical Data'
Shepherd BE and Liu Q
Using a latent variable model with non-constant factor loadings to examine PM2.5 constituents related to secondary inorganic aerosols
Zhang Z, O'Neill MS and Sánchez BN
Factor analysis is a commonly used method of modelling correlated multivariate exposure data. Typically, the measurement model is assumed to have constant factor loadings. However, from our preliminary analyses of the Environmental Protection Agency's (EPA's) PM2.5 fine speciation data, we have observed that the factor loadings for four constituents change considerably in stratified analyses. Since invariance of factor loadings is a prerequisite for valid comparison of the underlying latent variables, we propose a factor model that includes non-constant factor loadings, which change over time and space, using P-splines penalized with the generalized cross-validation (GCV) criterion. The model is implemented using the Expectation-Maximization (EM) algorithm, and we select the multiple spline smoothing parameters by minimizing the GCV criterion with Newton's method during each iteration of the EM algorithm. The algorithm is applied to a one-factor model that includes four constituents. Through bootstrap confidence bands, we find that the factor loading for total nitrate changes across seasons and geographic regions.
Longitudinal Functional Models with Structured Penalties
Kundu MG, Harezlak J and Randolph TW
This article addresses estimation in regression models for longitudinally collected functional covariates (time-varying predictor curves) with a longitudinal scalar outcome. The framework consists of estimating a time-varying coefficient function that is modeled as a linear combination of time-invariant functions with time-varying coefficients. The model uses extrinsic information to inform the structure of the penalty, while the estimation procedure exploits the equivalence between penalized least squares estimation and a linear mixed model representation. The approach is empirically evaluated with several simulations and applied to analyze neurocognitive impairment in HIV patients and its association with longitudinally collected magnetic resonance spectroscopy (MRS) curves.
Cox Regression Models with Functional Covariates for Survival Data
Gellar JE, Colantuoni E, Needham DM and Crainiceanu CM
We extend the Cox proportional hazards model to the case in which the exposure is a densely sampled functional process measured at baseline. The fundamental idea is to combine penalized signal regression with methods developed for mixed effects proportional hazards models. The model is fit by maximizing the penalized partial likelihood, with smoothing parameters estimated by a likelihood-based criterion such as AIC or EPIC. The model may be extended to allow for multiple functional predictors, time-varying coefficients, and missing or unequally spaced data. The methods were inspired by and applied to a study of the association between time to death after hospital discharge and daily measures of disease severity collected in the intensive care unit, among survivors of acute respiratory distress syndrome.
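A bare-bones version of the idea is to expand each baseline exposure curve onto a basis, feed the resulting scores into a penalized Cox fit, and read the coefficient function back off the basis. The sketch below uses simulated data, a crude polynomial basis, and the ridge penalizer of the lifelines package (an assumed external dependency) as a stand-in for the roughness penalty; the AIC/EPIC smoothing-parameter selection and the mixed-model machinery of the article are not reproduced.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n, m, K = 200, 50, 8                                   # subjects, grid points, basis functions
tgrid = np.linspace(0.0, 1.0, m)

# Simulated densely sampled baseline exposure curves X_i(t)
curves = rng.standard_normal((n, m)).cumsum(axis=1) / np.sqrt(m)

# Crude polynomial basis (a penalized spline basis would be used in practice)
basis = np.vander(tgrid, K, increasing=True)           # m x K
scores = curves @ basis / m                            # z_ik approximates the integral of X_i(t) B_k(t) dt

# Purely illustrative survival times with independent censoring
lin_pred = 0.5 * scores @ rng.standard_normal(K)
T = rng.exponential(np.exp(-lin_pred))
C = rng.exponential(2.0, n)
df = pd.DataFrame(scores, columns=[f"z{k}" for k in range(K)])
df["time"] = np.minimum(T, C)
df["event"] = (T <= C).astype(int)

# Ridge-penalized Cox fit on the basis scores
cph = CoxPHFitter(penalizer=0.5)
cph.fit(df, duration_col="time", event_col="event")

# Estimated coefficient function on the time grid
beta_hat = basis @ cph.params_[[f"z{k}" for k in range(K)]].values
```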
Robust estimation of marginal regression parameters in clustered data
Datta S and Beck JD
We develop robust methods for analyzing clustered data where estimation of marginal regression parameters is of interest. Inverse cluster size reweighting is incorporated into the objective function to be minimized in order to handle informative cluster size. The performance of the resulting estimators is studied by simulation. Large sample inference and variance estimation are carried out. The methodology is illustrated using a periodontal disease dataset.
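The key device is weighting each observation by the reciprocal of its cluster size, so that every cluster contributes equally to the estimating equation regardless of how many members it has. A minimal sketch with an ordinary least-squares objective follows (the article's robust objective and large-sample variance estimation are not reproduced).

```python
import numpy as np

def icw_marginal_fit(X, y, cluster):
    """Marginal regression coefficients with inverse-cluster-size weights (sketch).

    Each observation in cluster i receives weight 1/n_i, so clusters of
    different sizes contribute equally (handles informative cluster size).
    """
    cluster = np.asarray(cluster)
    ids, counts = np.unique(cluster, return_counts=True)
    size = dict(zip(ids, counts))
    w = np.array([1.0 / size[c] for c in cluster])
    XtWX = X.T @ (w[:, None] * X)
    XtWy = X.T @ (w * y)
    return np.linalg.solve(XtWX, XtWy)
```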
Quasi-periodic spatiotemporal models of brain activation in single-trial MEG experiments
Ventrucci M, Bowman AW, Miller C and Gross J
Magneto-encephalography (MEG) is an imaging technique which measures neuronal activity in the brain. Even when a subject is in a resting state, MEG data show characteristic spatial and temporal patterns, resulting from electrical current at specific locations in the brain. The key pattern of interest is a 'dipole', consisting of two adjacent regions of high and low activation which oscillate over time in an out-of-phase manner. Standard approaches are based on averages over large numbers of trials in order to reduce noise. In contrast, this article addresses the issue of dipole modelling for single trial data, as this is of interest in application areas. There is also clear evidence that the frequency of this oscillation in single trials generally changes over time and so exhibits quasi-periodic rather than periodic behaviour. A framework for the modelling of dipoles is proposed through estimation of a spatiotemporal smooth function constructed as a parametric function of space and a smooth function of time. Quasi-periodic behaviour is expressed in phase functions which are allowed to evolve smoothly over time. The model is fitted in two stages. First, the spatial location of the dipole is identified and the smooth signals characterizing the amplitude functions for each separate pole are estimated. Second, the phase and frequency of the amplitude signals are estimated as smooth functions. The model is applied to data from a real MEG experiment focusing on motor and visual brain processes. In contrast to existing standard approaches, the model allows the variability across trials and subjects to be identified. The nature of this variability is informative about the resting state of the brain.
Applications of a Kullback-Leibler Divergence for Comparing Non-nested Models
Wang CP and Jo B
Wang and Ghosh (2011) proposed a Kullback-Leibler divergence (KLD) which is asymptotically equivalent to the KLD by Goutis and Robert (1998) when the reference model (in comparison with a competing fitted model) is correctly specified and when certain regularity conditions hold. While the properties of the KLD by Wang and Ghosh (2011) have been investigated in the Bayesian framework, this paper further explores the properties of this KLD in the frequentist framework using four application examples, each fitted by two competing non-nested models.
Bayesian latent structure models with space-time-dependent covariates
Cai B, Lawson AB, Hossain MM and Choi J
Spatio-temporal data require flexible regression models that can accommodate the dependence of responses on space- and time-dependent covariates. In this paper, we describe a semiparametric space-time model from a Bayesian perspective. Nonlinear time dependence of covariates and the interactions among the covariates are constructed by local linear and piecewise linear models, allowing for more flexible orientation and position of the covariate plane by using time-varying basis functions. Space-varying covariate linkage coefficients are also incorporated to allow the spatial structure to vary across geographical locations. The formulation accommodates uncertainty in the number and locations of the piecewise basis functions used to characterize the global effects, as well as spatially structured and unstructured random effects in relation to covariates. The proposed approach relies on variable selection-type mixture priors for the uncertainty in the number and locations of basis functions and in the space-varying linkage coefficients. A simulation example is presented to evaluate the performance of the proposed approach against competing models, and a real data example is used for illustration.
Random-covariances and mixed-effects models for imputing multivariate multilevel continuous data
Yucel RM
Principled techniques for incomplete-data problems are increasingly part of mainstream statistical practice. Among the many techniques proposed so far, inference by multiple imputation (MI) has emerged as one of the most popular. While many strategies leading to inference by MI are available in cross-sectional settings, the same richness does not exist in multilevel applications. The limited methods available for multilevel applications rely on multivariate adaptations of mixed-effects models. This approach preserves the mean structure across clusters and incorporates distinct variance components into the imputation process. In this paper, I add to these methods by considering a random covariance structure and developing computational algorithms. The attraction of this new imputation modeling strategy is that it correctly reflects the mean and variance structure of the joint distribution of the data and allows the covariances to differ across clusters. Using Markov chain Monte Carlo techniques, a predictive distribution of missing data given observed data is simulated, leading to the creation of multiple imputations. To circumvent the large sample size requirement needed to support independent covariance estimates for the level-1 error term, I consider distributional impositions mimicking random-effects distributions assigned a priori. These techniques are illustrated in an example exploring relationships between victimization and individual- and contextual-level factors that raise the risk of violent crime.
A Bayesian model for repeated measures zero-inflated count data with application to outpatient psychiatric service use
Neelon BH, O'Malley AJ and Normand SL
In applications involving count data, it is common to encounter an excess number of zeros. In the study of outpatient service utilization, for example, the number of utilization days will take on integer values, with many subjects having no utilization (zero values). Mixed-distribution models, such as the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB), are often used to fit such data. A more general class of mixture models, called hurdle models, can be used to model zero-deflation as well as zero-inflation. Several authors have proposed frequentist approaches to fitting zero-inflated models for repeated measures. We describe a practical Bayesian approach which incorporates prior information, has optimal small-sample properties, and allows for tractable inference. The approach can be easily implemented using standard Bayesian software. A study of psychiatric outpatient service use illustrates the methods.
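For reference, the zero-inflated Poisson mixes a point mass at zero with a Poisson component; the sketch below gives its log probability mass function, which is the basic building block of the likelihood (the repeated-measures structure and the Bayesian fitting described in the article are not reproduced).

```python
import numpy as np
from scipy.stats import poisson

def zip_logpmf(y, lam, pi):
    """Zero-inflated Poisson log pmf: with probability pi the count is a
    structural zero; otherwise it is drawn from Poisson(lam)."""
    y = np.asarray(y)
    return np.where(
        y == 0,
        np.log(pi + (1.0 - pi) * np.exp(-lam)),
        np.log1p(-pi) + poisson.logpmf(y, lam),
    )
```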
Nuclear penalized multinomial regression with an application to predicting at bat outcomes in baseball
Powers S, Hastie T and Tibshirani R
We propose the nuclear norm penalty as an alternative to the ridge penalty for regularized multinomial regression. This convex relaxation of reduced-rank multinomial regression has the advantage of leveraging underlying structure among the response categories to make better predictions. We apply our method, nuclear penalized multinomial regression (NPMR), to Major League Baseball play-by-play data to predict outcome probabilities based on batter-pitcher matchups. The interpretation of the results meshes well with subject-area expertise and also suggests a novel understanding of what differentiates players.
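The computational heart of such an approach is the proximal map of the nuclear norm, which soft-thresholds the singular values of the coefficient matrix inside a gradient scheme. The following is a minimal proximal-gradient sketch (intercepts and step-size tuning omitted); it illustrates the penalty, not the authors' implementation.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def nuclear_prox(B, tau):
    """Proximal map of tau * ||B||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def npmr_fit(X, Y, lam=1.0, step=1e-3, iters=500):
    """Multinomial regression with a nuclear-norm penalty via proximal gradient.

    X: n x p design matrix; Y: n x K one-hot matrix of response categories.
    """
    n, p = X.shape
    B = np.zeros((p, Y.shape[1]))
    for _ in range(iters):
        P = softmax(X @ B)
        grad = X.T @ (P - Y) / n    # gradient of the average multinomial negative log-likelihood
        B = nuclear_prox(B - step * grad, step * lam)
    return B
```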
Statistical Contributions to Bioinformatics: Design, Modeling, Structure Learning, and Integration
Morris JS and Baladandayuthapani V
The advent of high-throughput multi-platform genomics technologies providing whole-genome molecular summaries of biological samples has revolutionized biomedical research. These technologies yield highly structured big data, whose analysis poses significant quantitative challenges. The field of bioinformatics has emerged to deal with these challenges and comprises many quantitative and biological scientists working together to effectively process these data and extract the treasure trove of information they contain. Statisticians, with their deep understanding of variability and uncertainty quantification, play a key role in these efforts. In this article, we attempt to summarize some of the key contributions of statisticians to bioinformatics, focusing on four areas: (1) experimental design and reproducibility, (2) preprocessing and feature extraction, (3) unified modeling, and (4) structure learning and integration. In each of these areas, we highlight some key contributions and try to elucidate the key statistical principles underlying these methods and approaches. Our goals are to demonstrate major ways in which statisticians have contributed to bioinformatics, to encourage statisticians to get involved early in methods development as new technologies emerge, and to stimulate future methodological work based on the statistical principles elucidated in this article and utilizing all available information to uncover new biological insights.