Maximum likelihood estimation for semiparametric regression models with interval-censored multistate data
Interval-censored multistate data arise in many studies of chronic diseases, where the health status of a subject can be characterized by a finite number of disease states and the transition between any two states is only known to occur over a broad time interval. We relate potentially time-dependent covariates to multistate processes through semiparametric proportional intensity models with random effects. We study nonparametric maximum likelihood estimation under general interval censoring and develop a stable expectation-maximization algorithm. We show that the resulting parameter estimators are consistent and that the finite-dimensional components are asymptotically normal with a covariance matrix that attains the semiparametric efficiency bound and can be consistently estimated through profile likelihood. In addition, we demonstrate through extensive simulation studies that the proposed numerical and inferential procedures perform well in realistic settings. Finally, we provide an application to a major epidemiologic cohort study.
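As a concrete reading of this model class (the notation below is ours, not necessarily the authors'), the intensity of a transition from state k to state l for a subject with covariates Z and random effect b might take the form

$$\lambda_{kl}(t \mid Z, b) = \lambda_{0,kl}(t)\,\exp\{\beta_{kl}^{\top} Z(t) + b\},$$

where each baseline intensity \lambda_{0,kl} is left completely unspecified, Z(t) may be time-dependent, and the shared random effect b induces dependence among a subject's transitions.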
Phylogenetic association analysis with conditional rank correlation
Phylogenetic association analysis plays a crucial role in investigating the correlation between microbial compositions and specific outcomes of interest in microbiome studies. However, existing methods for testing such associations have limitations related to the assumption of a linear association in high-dimensional settings and the handling of confounding effects. Hence, there is a need for methods capable of characterizing complex associations, including nonmonotonic relationships. This article introduces a novel phylogenetic association analysis framework and associated tests to address these challenges by employing conditional rank correlation as a measure of association. The proposed tests account for confounders in a fully nonparametric manner, ensuring robustness against outliers and the ability to detect diverse dependencies. The proposed framework aggregates conditional rank correlations for subtrees using weighted sum and maximum approaches to capture both dense and sparse signals. The test statistics are calibrated through a nearest-neighbour bootstrap method, which is straightforward to implement and can accommodate additional datasets when they are available. The practical advantages of the proposed framework are demonstrated through numerical experiments using both simulated and real microbiome datasets.
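A minimal sketch of how such a calibration might look, assuming leaf abundances have already been aggregated into per-subtree vectors and confounders form an (n, d) array; Kendall's tau as a stand-in for the conditional rank correlation, the neighbour count, and the neighbour-swap resampling are all our simplifications, not the authors' procedure:

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.neighbors import NearestNeighbors

def nn_bootstrap_pvalues(subtree_abund, y, confounders, n_boot=999, seed=0):
    """Calibrate sum- and max-type association statistics by resampling
    outcomes among confounder-nearest-neighbours, which approximately
    preserves the null of no conditional association."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=5).fit(confounders)
    idx = nn.kneighbors(confounders, return_distance=False)  # col 0 = self

    def stats(y_vec):
        taus = np.array([kendalltau(a, y_vec)[0] for a in subtree_abund])
        return np.abs(taus).sum(), np.abs(taus).max()  # dense / sparse signals

    s_obs = np.array(stats(y))
    exceed = np.zeros(2)
    for _ in range(n_boot):
        # replace each outcome by that of a random confounder-neighbour
        y_star = y[idx[np.arange(len(y)), rng.integers(1, 5, size=len(y))]]
        exceed += np.array(stats(y_star)) >= s_obs
    return (exceed + 1) / (n_boot + 1)  # p-values for (sum, max) statistics
```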
Likelihood-based inference under nonconvex boundary constraints
Likelihood-based inference under nonconvex constraints on model parameters has become increasingly common in biomedical research. In this paper, we establish large-sample properties of the maximum likelihood estimator when the true parameter value lies at the boundary of a nonconvex parameter space. We further derive the asymptotic distribution of the likelihood ratio test statistic under nonconvex constraints on model parameters. A general Monte Carlo procedure for generating the limiting distribution is provided. The theoretical results are demonstrated by five examples: Anderson's stereotype logistic regression model, genetic association studies, gene-environment interaction tests, cost-constrained linear regression and fairness-constrained linear regression.
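To illustrate the generic Monte Carlo recipe in the simplest boundary case (a convex, one-dimensional cone; the paper's procedure covers far more general nonconvex cones), one can draw from the Gaussian limit, project onto the tangent cone, and read off the limiting law of the likelihood ratio statistic:

```python
import numpy as np

# Toy Monte Carlo for the limiting distribution of the likelihood ratio
# statistic when the true parameter lies on the boundary of {theta >= 0}.
# Here the tangent cone is the half-line, so the projection is max(z, 0)
# and the limit is the familiar 0.5*chi2_0 + 0.5*chi2_1 mixture.
rng = np.random.default_rng(1)
z = rng.standard_normal(200_000)
proj = np.maximum(z, 0.0)        # projection onto the cone {h >= 0}
lrt = z**2 - (z - proj)**2       # draws from the limiting LRT law

print(np.mean(lrt == 0.0))       # ~0.5 (point mass at zero)
print(np.quantile(lrt, 0.95))    # ~2.71, the mixture's 95% critical value
```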
Statistical summaries of unlabelled evolutionary trees
Rooted and ranked phylogenetic trees are mathematical objects that are useful in modelling hierarchical data and evolutionary relationships with applications to many fields such as evolutionary biology and genetic epidemiology. Bayesian phylogenetic inference usually explores the posterior distribution of trees via Markov chain Monte Carlo methods. However, assessing uncertainty and summarizing distributions remains challenging for these types of structures. While labelled phylogenetic trees have been extensively studied, relatively little literature exists on unlabelled trees, which are increasingly useful, for example when one seeks to summarize samples of trees obtained with different methods, or from different samples and environments, and wishes to assess the stability and generalizability of these summaries. In our paper, we exploit recently proposed distance metrics of unlabelled ranked binary trees and unlabelled ranked genealogies, that is, trees equipped with branch lengths, to define the Fréchet mean, variance and interquartile sets as summaries of these tree distributions. We provide an efficient combinatorial optimization algorithm for computing the Fréchet mean of a sample or of distributions on unlabelled ranked tree shapes and unlabelled ranked genealogies. We show the applicability of our summary statistics for studying popular tree distributions and for comparing the SARS-CoV-2 evolutionary trees across different locations during the COVID-19 epidemic in 2020. Our current implementations are publicly available at https://github.com/RSamyak/fmatrix.
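In the abstract's terms, the Fréchet mean of a sample is the minimizer of the sum of squared distances. A brute-force sketch over a finite candidate set conveys the definition (the paper's contribution is an efficient combinatorial algorithm; `dist` stands in for one of the unlabelled ranked-tree metrics):

```python
def frechet_mean(sample, candidates, dist):
    """Return the candidate tree minimizing the Frechet functional,
    i.e., the sum of squared distances to the sampled trees."""
    return min(candidates, key=lambda c: sum(dist(c, t) ** 2 for t in sample))

def frechet_variance(sample, mean, dist):
    """Average squared distance to the Frechet mean."""
    return sum(dist(mean, t) ** 2 for t in sample) / len(sample)
```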
A mark-specific quantile regression model
Quantile regression has become a widely used tool for analysing competing risks data. However, quantile regression methods for competing risks data with a continuous mark remain scarce. The mark variable is an extension of the cause of failure in a classical competing risks model, where the discrete cause of failure is replaced by a continuous mark observed only at uncensored failure times. An example of a continuous mark variable is the genetic distance that measures dissimilarity between the infecting virus and the virus contained in the vaccine construct. In this article, we propose a novel mark-specific quantile regression model. The proposed estimation method borrows strength from data in a neighbourhood of a mark and is based on an induced smoothed estimating equation, which is very different from the existing methods for competing risks data with discrete causes. The asymptotic properties of the resulting estimators are established across the mark and quantile continuums. In addition, a mark-specific quantile-type vaccine efficacy is proposed and its statistical inference procedures are developed. Simulation studies are conducted to evaluate the finite-sample performance of the proposed estimation and hypothesis testing procedures. An application to the first HIV vaccine efficacy trial is provided.
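Schematically (our notation; censoring adjustments omitted), a kernel-weighted, induced-smoothed estimating equation for the quantile coefficient β(τ, v) at mark v could read

$$\sum_{i=1}^{n} K_h(V_i - v)\, Z_i \left[ \Phi\!\left\{ \frac{Z_i^{\top}\beta(\tau, v) - \log T_i}{r_n} \right\} - \tau \right] = 0,$$

where K_h localizes estimation in a neighbourhood of the observed marks V_i around v, and the normal distribution function Φ replaces the indicator I(log T_i ≤ Z_i^⊤β), yielding a smooth, differentiable equation whose solution behaves like the unsmoothed one as the smoothing parameter r_n → 0.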
A linear adjustment-based approach to posterior drift in transfer learning
We present new models and methods for the posterior drift problem where the regression function in the target domain is modelled as a linear adjustment, on an appropriate scale, of that in the source domain, and study the theoretical properties of our proposed estimators in the binary classification problem. The core idea of our model inherits the simplicity and the usefulness of generalized linear models and accelerated failure time models from the classical statistics literature. Our approach is shown to be flexible and applicable in a variety of statistical settings, and can be adopted for transfer learning problems in various domains including epidemiology, genetics and biomedicine. As concrete applications, we illustrate the power of our approach (i) through mortality prediction for British Asians by borrowing strength from similar data from the larger pool of British Caucasians, using the UK Biobank data, and (ii) in overcoming a spurious correlation present in the source domain of the Waterbirds dataset.
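A minimal sketch of the linear-adjustment idea for binary classification, under our reading of the model: on the logit scale the target regression function is an affine transformation of the source one, so the adjustment can be fitted by a one-feature logistic regression on labelled target data (the function names below are ours):

```python
import numpy as np
from scipy.special import logit
from sklearn.linear_model import LogisticRegression

def fit_adjustment(source_probs, y_target):
    """Fit target P(Y=1|x) = expit(a + b * logit(source P(Y=1|x)))."""
    eta = logit(np.clip(source_probs, 1e-6, 1 - 1e-6)).reshape(-1, 1)
    return LogisticRegression().fit(eta, y_target)   # learns (a, b)

def predict_target(model, source_probs):
    """Apply the fitted adjustment to source-model probabilities."""
    eta = logit(np.clip(source_probs, 1e-6, 1 - 1e-6)).reshape(-1, 1)
    return model.predict_proba(eta)[:, 1]
```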
Efficient estimation under data fusion
We aim to make inferences about a smooth, finite-dimensional parameter by fusing data from multiple sources together. Previous works have studied the estimation of a variety of parameters in similar data fusion settings, including the average treatment effect and the average reward under a policy, with the majority of them merging one historical data source containing covariates, actions and rewards with a second data source containing only the same covariates. In this work, we consider the general case where one or more data sources align with each part of the distribution of the target population, for example, the conditional distribution of the reward given actions and covariates. We describe potential gains in efficiency that can arise from fusing these data sources together in a single analysis, which we characterize by a reduction in the semiparametric efficiency bound. We also provide a general means to construct estimators that achieve these bounds. In numerical simulations, we illustrate marked improvements in efficiency from using our proposed estimators rather than their natural alternatives. Finally, we illustrate the magnitude of efficiency gains that can be realized in vaccine immunogenicity studies by fusing data from two HIV vaccine trials.
Discussion of 'Statistical inference for streamed longitudinal data'
Spectral adjustment for spatial confounding
Adjusting for an unmeasured confounder is generally an intractable problem, but in the spatial setting it may be possible under certain conditions. We derive necessary conditions on the coherence between the exposure and the unmeasured confounder that ensure the effect of exposure is estimable. We specify our model and assumptions in the spectral domain to allow for different degrees of confounding at different spatial resolutions. One assumption that ensures identifiability is that confounding present at global scales dissipates at local scales. We show that this assumption in the spectral domain is equivalent to adjusting for global-scale confounding in the spatial domain by adding a spatially smoothed version of the exposure to the mean of the response variable. Within this general framework, we propose a sequence of confounder adjustment methods that range from parametric adjustments based on the Matérn coherence function to more robust semiparametric methods that use smoothing splines. These ideas are applied to areal and geostatistical data for both simulated and real datasets.
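A toy illustration of the adjustment described above, assuming a one-dimensional transect and a moving-average smoother in place of the paper's Matérn or spline constructions: include both the exposure and its spatially smoothed copy in the mean model, so that global-scale, potentially confounded variation is absorbed by the smoothed term.

```python
import numpy as np

def smooth(x, k=25):
    """Simple moving-average stand-in for a spatial smoother."""
    return np.convolve(x, np.ones(k) / k, mode="same")

def adjusted_exposure_effect(y, x):
    """Regress y on [x, smooth(x), 1]; the coefficient on x is the
    local-scale exposure effect after global-scale adjustment."""
    X = np.column_stack([x, smooth(x), np.ones_like(x)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[0]
```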
Assessing time-varying causal effect moderation in the presence of cluster-level treatment effect heterogeneity and interference
The micro-randomized trial (MRT) is a sequential randomized experimental design for empirically evaluating the effectiveness of mobile health (mHealth) intervention components that may be delivered at hundreds or thousands of decision points. MRTs have motivated a new class of causal estimands, termed "causal excursion effects", for which semiparametric inference can be conducted via a weighted, centered least squares criterion (Boruvka et al., 2018). Existing methods assume between-subject independence and non-interference, yet deviations from these assumptions often occur. In this paper, causal excursion effects are revisited under potential cluster-level treatment effect heterogeneity and interference, where the treatment effect of interest may depend on cluster-level moderators. The utility of the proposed methods is shown by analyzing data from a multi-institution cohort of first-year medical residents in the United States.
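For orientation, the weighted, centered least squares criterion of Boruvka et al. (2018) has the schematic form

$$\sum_{i=1}^{n}\sum_{t=1}^{T} W_{i,t}\Big[ Y_{i,t+1} - g(H_{i,t})^{\top}\alpha - \big\{A_{i,t} - \tilde{p}_t(S_{i,t})\big\}\, f(S_{i,t})^{\top}\beta \Big]^2,$$

where H_{i,t} is the observed history, S_{i,t} the chosen moderators, A_{i,t} the randomized treatment centered at a probability \tilde{p}_t, W_{i,t} a ratio-of-probabilities weight, and β indexes the causal excursion effect; the present paper revisits this criterion under cluster-level heterogeneity and interference.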
Marginal proportional hazards models for multivariate interval-censored data
Multivariate interval-censored data arise when there are multiple types of events or clusters of study subjects, such that the event times are potentially correlated and each event is only known to occur over a particular time interval. We formulate the effects of potentially time-varying covariates on the multivariate event times through marginal proportional hazards models, while leaving the dependence structures of the related event times unspecified. We construct the nonparametric pseudolikelihood under the working assumption that all event times are independent, and we provide a simple and stable EM-type algorithm. The resulting nonparametric maximum pseudolikelihood estimators of the regression parameters are shown to be consistent and asymptotically normal, with a limiting covariance matrix that can be consistently estimated by a sandwich estimator under arbitrary dependence structures for the related event times. We evaluate the performance of the proposed methods through extensive simulation studies and present an application to data from the Atherosclerosis Risk in Communities Study.
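Schematically (our notation), under the working independence assumption and simplifying to time-invariant covariates with censoring interval (L_{ik}, R_{ik}] for event type k of subject i, such a pseudolikelihood takes the form

$$pl(\beta, \{\Lambda_k\}) = \prod_{i=1}^{n}\prod_{k=1}^{K}\Big[\exp\big\{-\Lambda_k(L_{ik})\,e^{\beta^{\top}Z_{ik}}\big\} - \exp\big\{-\Lambda_k(R_{ik})\,e^{\beta^{\top}Z_{ik}}\big\}\Big],$$

where \Lambda_k is the cumulative baseline hazard for the kth event type and R_{ik} = ∞ recovers right censoring; with time-varying covariates, \Lambda_k(t)e^{\beta^{\top}Z} is replaced by the integral \int_0^t e^{\beta^{\top}Z(s)}\,d\Lambda_k(s).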
Multi-stage optimal dynamic treatment regimes for survival outcomes with dependent censoring
We propose a reinforcement learning method for estimating an optimal dynamic treatment regime for survival outcomes with dependent censoring. The estimator allows the failure time to be conditionally independent of censoring and dependent on the treatment decision times, supports a flexible number of treatment arms and treatment stages, and can maximize either the mean survival time or the survival probability at a certain time-point. The estimator is constructed using generalized random survival forests and can have polynomial rates of convergence. Simulations and analysis of the Atherosclerosis Risk in Communities study data suggest that the new estimator brings higher expected outcomes than existing methods in various settings.
Gradient-based sparse principal component analysis with extensions to online learning
Sparse principal component analysis is an important technique for simultaneous dimensionality reduction and variable selection with high-dimensional data. In this work we combine the unique geometric structure of the sparse principal component analysis problem with recent advances in convex optimization to develop novel gradient-based sparse principal component analysis algorithms. These algorithms enjoy the same global convergence guarantee as the original alternating direction method of multipliers, and can be more efficiently implemented with the rich toolbox developed for gradient methods from the deep learning literature. Most notably, these gradient-based algorithms can be combined with stochastic gradient descent methods to produce efficient online sparse principal component analysis algorithms with provable numerical and statistical performance guarantees. The practical performance and usefulness of the new algorithms are demonstrated in various simulation studies. As an application, we show how the scalability and statistical accuracy of our method enable us to find interesting functional gene groups in high-dimensional RNA sequencing data.
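As a flavour of the gradient-based approach (a simplified single-component sketch, not the authors' exact algorithm), one can run proximal gradient ascent on the penalized variance objective max over the unit ball of x'Sx - lam*||x||_1, alternating a gradient step, soft-thresholding, and projection; replacing the full covariance S with minibatch estimates gives an online variant.

```python
import numpy as np

def sparse_pc1(S, lam=0.1, iters=500):
    """Proximal gradient sketch for one sparse principal component of the
    covariance matrix S: gradient ascent on x'Sx, soft-thresholding for
    the l1 penalty, projection onto the unit ball."""
    x = np.linalg.eigh(S)[1][:, -1]          # warm start: leading eigenvector
    step = 1.0 / np.linalg.norm(S, 2)        # step size from spectral norm
    for _ in range(iters):
        g = x + step * (S @ x)                                    # ascent step
        g = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # shrink
        x = g / max(np.linalg.norm(g), 1.0)                       # project
    return x
```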
Sample-constrained partial identification with application to selection bias
Many partial identification problems can be characterized by the optimal value of a function over a set, where both the function and the set need to be estimated from empirical data. Despite some progress for convex problems, statistical inference in this general setting remains to be developed. To address this, we derive an asymptotically valid confidence interval for the optimal value through an appropriate relaxation of the estimated set. We then apply this general result to the problem of selection bias in population-based cohort studies. We show that existing sensitivity analyses, which are often conservative and difficult to implement, can be formulated in our framework and made significantly more informative via auxiliary information on the population. We conduct a simulation study to evaluate the finite-sample performance of our inference procedure, and conclude with a substantive motivating example on the causal effect of education on income in the highly selected UK Biobank cohort. We demonstrate that our method can produce informative bounds using plausible population-level auxiliary constraints. We implement this method in an accompanying R package.
A multiplicative structural nested mean model for zero-inflated outcomes
Zero-inflated nonnegative outcomes are common in many applications. In this work, motivated by freemium mobile game data, we propose a class of multiplicative structural nested mean models for zero-inflated nonnegative outcomes which flexibly describes the joint effect of a sequence of treatments in the presence of time-varying confounders. The proposed estimator solves a doubly robust estimating equation, where the nuisance functions, namely the propensity score and conditional outcome means given confounders, are estimated parametrically or nonparametrically. To improve the accuracy, we leverage the characteristic of zero-inflated outcomes by estimating the conditional means in two parts, that is, separately modelling the probability of having positive outcomes given confounders, and the mean outcome conditional on its being positive and given the confounders. We show that the proposed estimator is consistent and asymptotically normal as either the sample size or the follow-up time goes to infinity. Moreover, the typical sandwich formula can be used to estimate the variance of treatment effect estimators consistently, without accounting for the variation due to estimating nuisance functions. Simulation studies and an application to a freemium mobile game dataset are presented to demonstrate the empirical performance of the proposed method and support our theoretical findings.
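The two-part construction of the nuisance conditional mean is easy to sketch (the learner choices below are illustrative, not the authors' prescriptions): estimate P(Y > 0 | X) and E[Y | Y > 0, X] separately and multiply.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

def two_part_mean(X, y):
    """Return a function estimating E[Y | X] for a zero-inflated outcome
    as P(Y > 0 | X) * E[Y | Y > 0, X]."""
    pos = y > 0
    p_model = LogisticRegression(max_iter=1000).fit(X, pos)
    m_model = GradientBoostingRegressor().fit(X[pos], y[pos])
    return lambda X_new: (p_model.predict_proba(X_new)[:, 1]
                          * m_model.predict(X_new))
```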
Data integration: exploiting ratios of parameter estimates from a reduced external model
We consider the situation of estimating the parameters in a generalized linear prediction model from an internal dataset, where the outcome variable Y is binary and there are two sets of covariates, X and B. We have information from an external study that provides parameter estimates for a generalized linear model of Y on X. We propose a method that makes limited assumptions about the similarity of the distributions in the two study populations. The method involves orthogonalizing the B variables and then borrowing information about the ratio of the coefficients from the external model. The method is justified based on a new result relating the parameters in a generalized linear model to the parameters in a generalized linear model with omitted covariates. The method is applicable if the regression coefficients in the Y given X model are similar in the two populations, up to an unknown scalar constant. This type of transportability between populations can be checked from the available data. The asymptotic variance of the proposed method is derived. The method is evaluated in a simulation study and shown to gain efficiency compared with a simple analysis of the internal dataset, and to be robust compared with an alternative method of incorporating external information.
Testing generalized linear models with high-dimensional nuisance parameter
Generalized linear models often involve a high-dimensional nuisance parameter, as seen in applications such as testing gene-environment or gene-gene interactions. In these scenarios, it is essential to test the significance of a high-dimensional sub-vector of the model's coefficients. Although some existing methods can tackle this problem, they often rely on the bootstrap to approximate the asymptotic distribution of the test statistic and are thus computationally expensive. Here, we propose a computationally efficient test with a closed-form limiting distribution, which allows the parameter being tested to be either sparse or dense. We show that under certain regularity conditions the type I error of the proposed method is asymptotically correct, and we establish its power under high-dimensional alternatives. Extensive simulations demonstrate the good performance of the proposed test and its robustness when certain sparsity assumptions are violated. We also apply the proposed method to Chinese famine sample data to show its performance when testing the significance of gene-environment interactions.
Functional hybrid factor regression model for handling heterogeneity in imaging studies
This paper develops a functional hybrid factor regression modelling framework to handle the heterogeneity of many large-scale imaging studies, such as the Alzheimer's disease neuroimaging initiative study. Despite the numerous successes of such imaging studies, heterogeneity may be caused by differences in study environment, population, design, protocols or other hidden factors, and it has posed major challenges for the integrative analysis of imaging data collected from multiple centres or studies. We propose both estimation and inference procedures for estimating unknown parameters and detecting unknown factors under our new model. The asymptotic properties of both procedures are systematically investigated. The finite-sample performance of our proposed procedures is assessed by using Monte Carlo simulations and a real data example on hippocampal surface data from the Alzheimer's disease study.
Graphical Gaussian process models for highly multivariate spatial data
For multivariate spatial Gaussian process (GP) models, customary specifications of cross-covariance functions do not exploit relational inter-variable graphs to ensure process-level conditional independence among the variables. This is undesirable, especially for highly multivariate settings, where popular cross-covariance functions such as the multivariate Matérn suffer from a "curse of dimensionality" as the number of parameters and floating point operations scale up in quadratic and cubic order, respectively, in the number of variables. We propose a class of multivariate "Graphical Gaussian Processes" using a general construction called "stitching" that crafts cross-covariance functions from graphs and ensures process-level conditional independence among variables. For the Matérn family of functions, stitching yields a multivariate GP whose univariate components are Matérn GPs, and conforms to process-level conditional independence as specified by the graphical model. For highly multivariate settings and decomposable graphical models, stitching offers massive computational gains and parameter dimension reduction. We demonstrate the utility of the graphical Matérn GP to jointly model highly multivariate spatial data using simulation examples and an application to air-pollution modelling.
Significance testing for canonical correlation analysis in high dimensions
We consider the problem of testing for the presence of linear relationships between large sets of random variables based on a post-selection inference approach to canonical correlation analysis. The challenge is to adjust for the selection of subsets of variables having linear combinations with maximal sample correlation. To this end, we construct a stabilized one-step estimator of the Euclidean norm of the canonical correlations maximized over subsets of variables of pre-specified cardinality. This estimator is shown to be consistent for its target parameter and asymptotically normal, provided the dimensions of the variables do not grow too quickly with sample size. We also develop a greedy search algorithm to accurately compute the estimator, leading to a computationally tractable omnibus test for the global null hypothesis that there are no linear relationships between any subsets of variables having the pre-specified cardinality. We further develop a confidence interval that takes the variable selection into account.
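A bare-bones version of such a greedy search (subsetting only the X-side for brevity, and omitting the one-step correction and post-selection calibration that are the paper's main contributions):

```python
import numpy as np

def greedy_cca_subset(X, Y, k):
    """Greedily grow a size-k subset of X-columns maximizing the leading
    sample canonical correlation with Y."""
    def leading_rho(cols):
        qx, _ = np.linalg.qr(X[:, cols] - X[:, cols].mean(0))
        qy, _ = np.linalg.qr(Y - Y.mean(0))
        # singular values of qx'qy are the sample canonical correlations
        return np.linalg.svd(qx.T @ qy, compute_uv=False)[0]
    chosen = []
    for _ in range(k):
        rest = [j for j in range(X.shape[1]) if j not in chosen]
        chosen.append(max(rest, key=lambda j: leading_rho(chosen + [j])))
    return chosen, leading_rho(chosen)
```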
A proximal distance algorithm for likelihood-based sparse covariance estimation
This paper addresses the task of estimating a covariance matrix under a patternless sparsity assumption. In contrast to existing approaches based on thresholding or shrinkage penalties, we propose a likelihood-based method that regularizes the distance from the covariance estimate to a symmetric sparsity set. This formulation avoids unwanted shrinkage induced by more common norm penalties, and enables optimization of the resulting nonconvex objective by solving a sequence of smooth, unconstrained subproblems. These subproblems are generated and solved via the proximal distance version of the majorization-minimization principle. The resulting algorithm executes rapidly, gracefully handles settings where the number of parameters exceeds the number of cases, yields a positive-definite solution, and enjoys desirable convergence properties. Empirically, we demonstrate that our approach outperforms competing methods across several metrics, for a suite of simulated experiments. Its merits are illustrated on international migration data and a case study on flow cytometry. Our findings suggest that the marginal and conditional dependency networks for the cell signalling data are more similar than previously concluded.
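To convey the shape of the method (a heavily simplified toy with our own projection rule and step sizes; the paper's majorization-minimization scheme has safeguards and convergence guarantees that this sketch lacks), one can take gradient steps on the Gaussian negative log-likelihood plus a squared distance-to-sparsity-set penalty, slowly increasing the penalty weight:

```python
import numpy as np

def project_sparse(M, k):
    """Project onto symmetric matrices with at most k nonzero off-diagonal
    entries (counting both triangles): keep the k largest in magnitude."""
    off = np.abs(M) * (1.0 - np.eye(len(M)))
    if k < off.size:
        cutoff = np.sort(off, axis=None)[::-1][k]
        M = np.where((off > cutoff) | np.eye(len(M), dtype=bool), M, 0.0)
    return M

def proximal_distance_cov(S, k, rho=1.0, lr=0.01, iters=300):
    """Toy proximal-distance iteration: gradient steps on
    logdet(Sigma) + tr(S @ inv(Sigma)) + (rho/2)*||Sigma - P(Sigma)||_F^2,
    annealing rho upward to enforce the sparsity set."""
    Sigma = S.copy()
    for _ in range(iters):
        inv = np.linalg.inv(Sigma)
        grad = inv - inv @ S @ inv + rho * (Sigma - project_sparse(Sigma, k))
        step = Sigma - lr * grad
        Sigma = (step + step.T) / 2      # keep the iterate symmetric
        rho *= 1.02                      # tighten the constraint
    return project_sparse(Sigma, k)
```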