JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES C-APPLIED STATISTICS

Measuring the impact of new risk factors within survival models
Heller G and Devlin SM
Survival is poor for patients with metastatic cancer, and it is vital to examine new biomarkers that can improve patient prognostication and identify those who would benefit from more aggressive therapy. In metastatic prostate cancer, two new assays have become available: one that quantifies the number of cancer cells circulating in the peripheral blood, and the other a marker of the aggressiveness of the disease. It is critical to determine the magnitude of the effect of these biomarkers on the discrimination of a model-based risk score. To do so, analysts frequently consider the discrimination of two separate survival models: one that includes both the new and standard factors and a second that includes the standard factors alone. However, this analysis is ultimately incorrect for many of the scale-transformation models ubiquitous in survival analysis, as the reduced model is misspecified if the full model is specified correctly. To circumvent this issue, we developed a projection-based approach to estimate the impact of the two prostate cancer biomarkers. The results indicate that the new biomarkers can influence model discrimination and justify their inclusion in the risk model; however, the hunt remains for an applicable model to risk-stratify patients with metastatic prostate cancer.
tdCoxSNN: Time-dependent Cox survival neural network for continuous-time dynamic prediction
Zeng L, Zhang J, Chen W and Ding Y
The aim of dynamic prediction is to provide individualized risk predictions over time, which are updated as new data become available. In pursuit of constructing a dynamic prediction model for a progressive eye disorder, age-related macular degeneration (AMD), we propose a time-dependent Cox survival neural network (tdCoxSNN) to predict its progression using longitudinal fundus images. tdCoxSNN builds upon the time-dependent Cox model by utilizing a neural network to capture the nonlinear effect of time-dependent covariates on the survival outcome. Moreover, by concurrently integrating a convolutional neural network with the survival network, tdCoxSNN can directly take longitudinal images as input. We evaluate and compare our proposed method with joint modelling and landmarking approaches through extensive simulations. We applied the proposed approach to two real datasets. One is a large AMD study, the Age-Related Eye Disease Study, in which more than 50,000 fundus images were captured over a period of 12 years for more than 4,000 participants. The other is a public dataset on primary biliary cirrhosis, in which multiple laboratory tests were longitudinally collected to predict the time-to-liver transplant. Our approach demonstrates commendable predictive performance in both simulation studies and the analysis of the two real datasets.
Walking fingerprinting
Koffman L, Crainiceanu C and Leroux A
We consider the problem of predicting an individual's identity from accelerometry data collected during walking. In a previous paper, we transformed the accelerometry time series into an image by constructing the joint distribution of the acceleration and lagged acceleration for a vector of lags. Predictors derived by partitioning this image into grid cells were used in logistic regression to predict individuals. Here, we (a) implement machine learning methods for prediction using the grid cell-derived predictors; (b) derive inferential methods to screen for the most predictive grid cells while adjusting for correlation and multiple comparisons; and (c) develop a novel multivariate functional regression model that avoids partitioning the predictor space. Prediction methods are compared on two open-source accelerometry data sets collected from: (a) 32 individuals walking on a km path; and (b) six repetitions of walking on a 20 m path on two occasions at least 1 week apart for 153 study participants. In the 32-individual study, all methods achieve at least 95% rank-1 accuracy, while in the 153-individual study, accuracy varies from 41% to 98%, depending on the method and prediction task. Methods provide insights into why some individuals are easier to predict than others.
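The image construction described in this abstract can be illustrated with a minimal sketch. It assumes a one-dimensional acceleration magnitude signal, and the function name `lag_image_features`, the grid size, the value range and the lags are all hypothetical choices made for illustration; the authors' actual preprocessing and grid definition are not given in the abstract.

```python
import numpy as np

def lag_image_features(signal, lags, n_cells=4, value_range=(0.0, 3.0)):
    """Grid-cell proportions of (acceleration, lagged acceleration) pairs.

    For each lag, the pairs (a[t], a[t + lag]) are binned into an
    n_cells x n_cells grid; the flattened grids are concatenated.
    All parameter defaults are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=float)
    edges = np.linspace(value_range[0], value_range[1], n_cells + 1)
    features = []
    for lag in lags:  # each lag must be a positive integer
        x = signal[:-lag]  # acceleration at time t
        y = signal[lag:]   # acceleration at time t + lag
        counts, _, _ = np.histogram2d(x, y, bins=[edges, edges])
        total = counts.sum()
        # Proportions make recordings of different lengths comparable.
        features.append(counts.ravel() / total if total > 0 else counts.ravel())
    return np.concatenate(features)

# Toy periodic signal standing in for walking data (synthetic, not real data).
rng = np.random.default_rng(0)
t = np.arange(0, 10, 0.01)
toy = 1.0 + 0.5 * np.sin(2 * np.pi * 1.8 * t) + 0.05 * rng.standard_normal(t.size)
feats = lag_image_features(toy, lags=[15, 30, 45])
print(feats.shape)
```

Each entry of the returned vector is the share of pairs falling in one grid cell for one lag, which is the kind of predictor that could then be passed to a classifier.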
Joint modelling of survival and backwards recurrence outcomes: an analysis of factors associated with fertility treatment in the U.S.
Guo S, Zhang J and McLain AC
The motivation for this paper is to determine factors associated with time-to-fertility treatment (TTFT) among women currently attempting pregnancy in a cross-sectional sample. Challenges arise due to dependence between time-to-pregnancy (TTP) and TTFT. We propose appending a marginal accelerated failure time model to identify risk factors of TTFT with a model for TTP where fertility treatment is included as a time-varying treatment to account for their dependence. The latter requires extending backwards recurrence survival methods to incorporate time-varying covariates with time-varying coefficients. Since backwards recurrence survival methods are a function of mean survival, computational difficulties arise in formulating mean survival when fertility treatment is unobserved, i.e. when TTFT is censored. We address these challenges by developing computationally friendly forms for the double expectation of TTP and TTFT. The performance is validated via comprehensive simulation studies. We apply our approach to the National Survey of Family Growth and explore factors related to prolonged TTFT in the U.S.
Estimating spatially varying health effects of wildland fire smoke using mobile health data
Wu L, Gao C, Yang S, Reich BJ and Rappold AG
Wildland fire smoke exposures are an increasing threat to public health, highlighting the need for studying the effects of protective behaviours on reducing adverse health outcomes. Emerging smartphone applications provide unprecedented opportunities to deliver health risk communication messages to a large number of individuals in real time and subsequently study their effectiveness, but also pose methodological challenges. Smoke Sense, a citizen science project, provides an interactive smartphone app platform for participants to engage with information about air quality, and ways to record their own health symptoms and actions taken to reduce smoke exposure. We propose a doubly robust estimator of the structural nested mean model that accounts for spatially and time-varying effects via a local estimating equation approach with geographical kernel weighting. Moreover, our analytical framework also handles informative missingness by inverse probability weighting of estimating functions. We evaluate the method using extensive simulation studies and apply it to Smoke Sense data to increase the knowledge base about the relationship between health preventive measures and health-related outcomes. Our results show that the effects of protective behaviours vary over space and time, and that protective behaviours have more significant effects on reducing health symptoms in the Southwest than in the Northwest region of the U.S.
Non-parametric Bayesian approach to multiple treatment comparisons in network meta-analysis with application to comparisons of anti-depressants
Barrientos AF, Page GL and Lin L
Network meta-analysis is a powerful tool to synthesize evidence from independent studies and compare multiple treatments simultaneously. A critical task of performing a network meta-analysis is to offer ranks of all available treatment options for a specific disease outcome. Frequently, the estimated treatment rankings are accompanied by a large amount of uncertainty, suffer from multiplicity issues, and rarely permit possible ties of treatments with similar performance. These issues make interpreting rankings problematic as they are often treated as absolute metrics. To address these shortcomings, we formulate a ranking strategy that adapts to scenarios with high-order uncertainty by producing more conservative results. This improves the interpretability while simultaneously accounting for multiple comparisons. To admit ties between treatment effects in cases where differences between treatment effects are negligible, we also develop a Bayesian non-parametric approach for network meta-analysis. The approach capitalizes on the induced clustering mechanism of Bayesian non-parametric methods, producing a positive probability that two treatment effects are equal. We demonstrate the utility of the procedure through numerical experiments and a network meta-analysis designed to study antidepressant treatments.
Inverse set estimation and inversion of simultaneous confidence intervals
Ren J, Telschow FJE and Schwartzman A
Motivated by the questions of risk assessment in climatology (temperature change in North America) and medicine (impact of statin usage and coronavirus disease 2019 on hospitalized patients), we address the problem of estimating the set in the domain of a function whose image equals a predefined subset of the real line. Existing methods require strict assumptions. We generalize the estimation of such sets to dense and nondense domains with protection against inflated Type I error in exploratory data analysis. This is achieved by proving that confidence sets of multiple upper, lower, or interval sets can be simultaneously constructed with the desired confidence nonasymptotically through inverting simultaneous confidence intervals. A nonparametric bootstrap algorithm and code are provided.
Population-level task-evoked functional connectivity via Fourier analysis
Meng K and Eloyan A
Functional magnetic resonance imaging (fMRI) is a noninvasive and in-vivo imaging technique essential for measuring brain activity. Functional connectivity is used to study associations between brain regions, either while study subjects perform tasks or during periods of rest. In this paper, we propose a rigorous definition of task-evoked functional connectivity at the population level (ptFC). Importantly, our proposed ptFC is interpretable in the context of task-fMRI studies. An algorithm for estimating the ptFC is provided. We present the performance of the proposed algorithm compared to existing functional connectivity frameworks using simulations. Lastly, we apply the proposed algorithm to estimate the ptFC in a motor-task study from the Human Connectome Project.
Unsupervised Bayesian classification for models with scalar and functional covariates
Garcia NL, Rodrigues-Motta M, Migon HS, Petkova E, Tarpey T, Ogden RT, Giordano JO and Perez MM
We consider unsupervised classification by means of a latent multinomial variable which categorizes a scalar response into one of the L components of a mixture model which incorporates scalar and functional covariates. This process can be thought of as a hierarchical model with the first level modelling a scalar response according to a mixture of parametric distributions and the second level modelling the mixture probabilities by means of a generalized linear model with functional and scalar covariates. The traditional approach of treating functional covariates as vectors not only suffers from the curse of dimensionality, since functional covariates can be measured at very small intervals leading to a highly parametrized model, but also does not take into account the nature of the data. We use basis expansions to reduce the dimensionality and a Bayesian approach for estimating the parameters while providing predictions of the latent classification vector. The method is motivated by two data examples that are not easily handled by existing methods. The first example concerns identifying placebo responders in a clinical trial (normal mixture model) and the second concerns predicting illness in milking cows (zero-inflated mixture of the Poisson model).
Bayesian semi-parametric inference for clustered recurrent events with zero inflation and a terminal event
Tian X, Ciarleglio M, Cai J, Greene EJ, Esserman D, Li F and Zhao Y
Recurrent events are common in clinical studies and are often subject to terminal events. In pragmatic trials, participants are often nested in clinics and can be susceptible or structurally unsusceptible to the recurrent events. We develop a Bayesian shared random effects model to accommodate this complex data structure. To achieve robustness, we use Dirichlet processes to model the residual of the accelerated failure time model for the survival process as well as the cluster-specific shared frailty distribution, and we provide an efficient sampling algorithm for posterior inference. Our method is applied to a recent cluster randomized trial on fall injury prevention.
Revisiting the effects of maternal education on adolescents' academic performance: Doubly robust estimation in a network-based observational study
McNealis V, Moodie EEM and Dean N
In many contexts, particularly when study subjects are adolescents, peer effects can invalidate typical statistical requirements in the data. For instance, it is plausible that a student's academic performance is influenced both by their own mother's educational level and by that of their peers. Since the underlying social network is measured, the Add Health study provides a unique opportunity to examine the impact of maternal college education on adolescent school performance, both direct and indirect. However, causal inference on populations embedded in social networks poses technical challenges, since the typical no-interference assumption no longer holds. While inverse probability-of-treatment weighted (IPW) estimators have been developed for this setting, they are often highly unstable. Motivated by the question of maternal education, we propose doubly robust (DR) estimators combining models for treatment and outcome that are consistent and asymptotically normal if either model is correctly specified. We present empirical results that illustrate the DR property and the efficiency gain of DR over IPW estimators even when the treatment model is misspecified. Contrary to previous studies, our robust analysis does not provide evidence of an indirect effect of maternal education on academic performance within adolescents' social circles in Add Health.
Testing unit root non-stationarity in the presence of missing data in univariate time series of mobile health studies
Fowler C, Cai X, Baker JT, Onnela JP and Valeri L
The use of digital devices to collect data in mobile health studies introduces a novel application of time series methods, with the constraint of potential data missing at random or missing not at random (MNAR). In time-series analysis, testing for stationarity is an important preliminary step to inform appropriate subsequent analyses. The Dickey-Fuller test evaluates the null hypothesis of unit root non-stationarity, under no missing data. Beyond recommendations under data missing completely at random for complete case analysis or last observation carry forward imputation, researchers have not extended unit root non-stationarity testing to more complex missing data mechanisms. Multiple imputation with chained equations, Kalman smoothing imputation, and linear interpolation have also been used for time-series data; however, such methods impose constraints on the autocorrelation structure and impact unit root testing. We propose maximum likelihood estimation and multiple imputation using state space model approaches to adapt the augmented Dickey-Fuller test to a context with missing data. We further develop sensitivity analyses to examine the impact of MNAR data. We evaluate the performance of existing and proposed methods across missing mechanisms in extensive simulations and in their application to a multi-year smartphone study of bipolar patients.
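For readers unfamiliar with the test named in this abstract, the following is a minimal sketch of the basic (non-augmented) Dickey-Fuller statistic on complete data, assuming a regression with a constant and no lagged differences. It is not the authors' missing-data method; the function name `dickey_fuller_stat` and the simulated series are illustrative assumptions.

```python
import numpy as np

def dickey_fuller_stat(y):
    """t-statistic for gamma in: diff(y)[t] = alpha + gamma * y[t-1] + e[t].

    Under the null hypothesis of a unit root, gamma = 0. The statistic is
    compared with Dickey-Fuller critical values, not Student's t quantiles.
    """
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(dy.size), y[:-1]])  # intercept and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (dy.size - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)            # OLS covariance of beta
    return beta[1] / np.sqrt(cov[1, 1])

# Synthetic series: a random walk (unit root) and a stationary AR(1) process.
rng = np.random.default_rng(1)
noise = rng.standard_normal(500)
random_walk = np.cumsum(noise)
stationary = np.empty(500)
stationary[0] = noise[0]
for i in range(1, 500):
    stationary[i] = 0.5 * stationary[i - 1] + noise[i]

print(dickey_fuller_stat(random_walk), dickey_fuller_stat(stationary))
```

A strongly negative statistic, as expected for the stationary series, is evidence against the unit root null. The sketch requires every observation to be present, which is exactly the restriction the abstract addresses.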
Variable selection for individualised treatment rules with discrete outcomes
Bian Z, Moodie EEM, Shortreed SM, Lambert SD and Bhatnagar S
An individualised treatment rule (ITR) is a decision rule that aims to improve individuals' health outcomes by recommending treatments according to subject-specific information. In observational studies, collected data may contain many variables that are irrelevant to treatment decisions. Including all variables in an ITR could yield low efficiency and a complicated treatment rule that is difficult to implement. Thus, selecting variables to improve the treatment rule is crucial. We propose a doubly robust variable selection method for ITRs, and show that it compares favourably with competing approaches. We illustrate the proposed method on data from an adaptive, web-based stress management tool.
A pseudo-response approach to constructing confidence intervals for the subset of patients expected to benefit from a new treatment
Liu W, Zhang Z, Hu Z, Xu P and Cohen CJ
In precision medicine, there is much interest in estimating the expected-to-benefit (EB) subset, i.e. the subset of patients who are expected to benefit from a new treatment based on a collection of baseline characteristics. There are many statistical methods for estimating the EB subset, most of which produce a 'point estimate' without a confidence statement to address uncertainty. Confidence intervals for the EB subset have been defined only recently, and their construction is a new area for methodological research. This article proposes a pseudo-response approach to EB subset estimation and confidence interval construction. Compared to existing methods, the pseudo-response approach allows us to focus on modelling a conditional treatment effect function (as opposed to the conditional mean outcome given treatment and baseline covariates) and is able to incorporate information from baseline covariates that are not involved in defining the EB subset. Simulation results show that incorporating such covariates can improve estimation efficiency and reduce the size of the confidence interval for the EB subset. The methodology is applied to a randomized clinical trial comparing two drugs for treating HIV infection.
Bayesian kernel machine regression for count data: modelling the association between social vulnerability and COVID-19 deaths in South Carolina
Mutiso F, Li H, Pearce JL, Benjamin-Neelon SE, Mueller NT and Neelon B
The COVID-19 pandemic created an unprecedented global health crisis. Recent studies suggest that socially vulnerable communities were disproportionately impacted, although findings are mixed. To quantify social vulnerability in the US, many studies rely on the Social Vulnerability Index (SVI), a county-level measure comprising 15 census variables. Typically, the SVI is modelled in an additive manner, which may obscure non-linear or interactive associations, further contributing to inconsistent findings. As a more robust alternative, we propose a negative binomial Bayesian kernel machine regression (BKMR) model to investigate dynamic associations between social vulnerability and COVID-19 death rates, thus extending BKMR to the count data setting. The model produces a 'vulnerability effect' that quantifies the impact of vulnerability on COVID-19 death rates in each county. The method can also identify the relative importance of various SVI variables and make future predictions as county vulnerability profiles evolve. To capture spatio-temporal heterogeneity, the model incorporates spatial effects, county-level covariates, and smooth temporal functions. For Bayesian computation, we propose a tractable data-augmented Gibbs sampler. We conduct a simulation study to highlight the approach and apply the method to a study of COVID-19 deaths in the US state of South Carolina during the 2021 calendar year.
Statistical inference for complete and incomplete mobility trajectories under the flight-pause model
Jurek M, Calder CA and Zigler C
We formulate a statistical flight-pause model (FPM) for human mobility, represented by a collection of random objects, called motions, appropriate for mobile phone tracking (MPT) data. We develop the statistical machinery for parameter inference and trajectory imputation under various forms of missing data. We show that common assumptions about the missing data mechanism for MPT are not valid for the mechanism governing the random motions underlying the FPM, representing an understudied missing data phenomenon. We demonstrate the consequences of missing data and our proposed adjustments in both simulations and real data, outlining implications for MPT data collection and design.
Models and methods for analysing clustered recurrent hospitalisations in the presence of COVID-19 effects
Ding X, He K and Kalbfleisch JD
Recurrent events such as hospitalisations are outcomes that can be used to monitor dialysis facilities' quality of care. However, current methods are not adequate to analyse data from many facilities with multiple hospitalisations, especially when adjustments are needed for multiple time scales. It is also controversial whether direct or indirect standardisation should be used in comparing facilities. This study is motivated by the need of the Centers for Medicare and Medicaid Services to evaluate US dialysis facilities using Medicare claims, which involve almost 8,000 facilities and over 500,000 dialysis patients. This scope is challenging for current statistical software's computational power. We propose a method that has a flexible baseline rate function and is computationally efficient. Additionally, the proposed method shares advantages of both indirect and direct standardisation. The method is evaluated under a range of simulation settings and demonstrates substantially improved computational efficiency over the existing package. Finally, we illustrate the method with an important application to monitoring dialysis facilities in the U.S., while making time-dependent adjustments for the effects of COVID-19.
Multivariate longitudinal analysis for the association between brain atrophy and cognitive impairment in prodromal Huntington's disease subjects
Zheng C, Tong L and Zhang Y
Cognitive impairment has been widely accepted as a disease progression measure prior to the onset of Huntington's disease. We propose a sophisticated measurement error correction method that can handle potentially correlated measurement errors in longitudinally collected exposures and multiple outcomes. The asymptotic theory for the proposed method is developed. A simulation study demonstrates the satisfactory performance of the proposed two-stage fitting method and shows that the independent working correlation structure outperforms other alternatives. We conduct a comprehensive longitudinal analysis to assess how brain striatal atrophy affects impairment in various cognitive domains for Huntington's disease.
Automated calibration for stability selection in penalised regression and graphical models
Bodinier B, Filippi S, Nøst TH, Chiquet J and Chadeau-Hyam M
Stability selection represents an attractive approach to identify sparse sets of features jointly associated with an outcome in high-dimensional contexts. We introduce an automated calibration procedure that maximises an in-house stability score and accommodates data with an a priori known block structure (e.g. multi-OMIC data). It applies to least absolute shrinkage and selection operator (LASSO) penalised regression and graphical models. Simulations show our approach outperforms non-stability-based and stability selection approaches using the original calibration. Application to multi-block graphical LASSO on real (epigenetic and transcriptomic) data from the Norwegian Women and Cancer study reveals a central, credible, and novel cross-OMIC role of LRRN3 in the biological response to smoking. Proposed approaches were implemented in the R package sharp.
A novel agreement statistic using data on uncertainty in ratings
Zee J, Mariani L, Barisoni L, Mahajan P and Gillespie B
Many existing methods for estimating agreement correct for chance agreement by adjusting the observed proportion agreement by the probability of chance agreement based on different assumptions. These assumptions may not always be appropriate, as demonstrated by pathologists' ratings of kidney biopsy descriptors. We propose a novel agreement statistic that accounts for the empirical probability of chance agreement, estimated by collecting additional data on rater uncertainty for each rating. A standard error estimator for the proposed statistic is derived. Simulation studies show that in most cases, our proposed statistic is unbiased in estimating the probability of agreement after removing chance agreement.
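The kind of chance correction this abstract refers to can be illustrated with Cohen's kappa, a standard existing statistic in which the chance term is computed from the two raters' marginal frequencies under an independence assumption. This is a sketch of that conventional approach only, not of the proposed statistic, whose estimator is not given in the abstract; the function name and the ratings are made up for illustration.

```python
from collections import Counter

def chance_corrected_agreement(ratings_1, ratings_2):
    """Cohen's kappa: (p_obs - p_chance) / (1 - p_chance).

    p_chance assumes the two raters rate independently with their own
    marginal category frequencies. Undefined when p_chance equals 1.
    """
    n = len(ratings_1)
    p_obs = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
    counts_1, counts_2 = Counter(ratings_1), Counter(ratings_2)
    p_chance = sum(counts_1[k] * counts_2[k] for k in counts_1) / n**2
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical ratings of eight items by two raters.
rater_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
rater_b = ["yes", "no", "no", "no", "yes", "yes", "yes", "no"]
kappa = chance_corrected_agreement(rater_a, rater_b)
print(kappa)  # observed agreement 0.75, chance agreement 0.5, kappa 0.5
```

The independence assumption behind `p_chance` is the sort of assumption the abstract argues may not hold, which motivates estimating chance agreement empirically from rater uncertainty instead.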
Estimating a brain network predictive of stress and genotype with supervised autoencoders
Talbot A, Dunson D, Dzirasa K and Carlson D
Targeted brain stimulation has the potential to treat mental illnesses. We develop an approach to help design protocols by identifying relevant multi-region electrical dynamics. Our approach models these dynamics as a superposition of latent networks, where the latent variables predict a relevant outcome. We use supervised autoencoders (SAEs) to improve predictive performance in this context, describe the conditions where SAEs improve predictions, and provide modelling constraints to ensure biological relevance. We experimentally validate our approach by finding a network associated with stress that aligns with a previous stimulation protocol and characterizing a genotype associated with bipolar disorder.