LIFETIME DATA ANALYSIS

Optimal survival analyses with prevalent and incident patients
Hartman N
Period-prevalent cohorts are often used for their cost-saving potential in epidemiological studies of survival outcomes. Under this design, prevalent patients allow for evaluations of long-term survival outcomes without the need for long follow-up, whereas incident patients allow for evaluations of short-term survival outcomes without the issue of left-truncation. In most period-prevalent survival analyses from the existing literature, patients have been recruited to achieve an overall sample size, with little attention given to the relative frequencies of prevalent and incident patients and their statistical implications. Furthermore, there are no existing methods available to rigorously quantify the impact of these relative frequencies on estimation and inference and incorporate this information into study design strategies. To address these gaps, we develop an approach to identify the optimal mix of prevalent and incident patients that maximizes precision over the entire estimated survival curve, subject to a flexible weighting scheme. In addition, we prove that inference based on the weighted log-rank test or Cox proportional hazards model is most powerful with an entirely prevalent or incident cohort, and we derive theoretical formulas to determine the optimal choice. Simulations confirm the validity of the proposed optimization criteria and show that substantial efficiency gains can be achieved by recruiting the optimal mix of prevalent and incident patients. The proposed methods are applied to assess waitlist outcomes among kidney transplant candidates.
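As a hedged illustration of the design question (not the authors' analytic optimality criteria), one can compare precision across prevalent/incident mixes by simulation; lifelines' entry_col handles the left truncation of prevalent patients. All distributions and parameter values below are illustrative.

```python
# Toy simulation: precision of a Cox log hazard ratio under different mixes of
# prevalent (left-truncated) and incident patients in a fixed-size cohort.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def draw_patient(rng, prevalent, beta=0.5):
    while True:
        x = rng.binomial(1, 0.5)                    # binary covariate
        t = rng.exponential(np.exp(-beta * x))      # exponential PH event time
        entry = rng.exponential(0.5) if prevalent else 0.0
        if t > entry:                               # prevalent: recruited only if alive at entry
            cens = entry + rng.exponential(2.0)     # censoring after study entry
            return entry, min(t, cens), int(t <= cens), x

def simulate_cohort(rng, n, frac_prevalent):
    rows = [draw_patient(rng, rng.random() < frac_prevalent) for _ in range(n)]
    return pd.DataFrame(rows, columns=["entry", "time", "event", "x"])

rng = np.random.default_rng(0)
for frac in (0.0, 0.5, 1.0):
    se = [CoxPHFitter()
          .fit(simulate_cohort(rng, 500, frac), duration_col="time",
               event_col="event", entry_col="entry")
          .standard_errors_["x"] for _ in range(100)]
    print(f"prevalent fraction {frac:.1f}: mean SE(log HR) = {np.mean(se):.3f}")
```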
Right-censored models by the expectile method
Ciuperca G
Based on the expectile loss function and the adaptive LASSO penalty, this paper proposes and studies estimation methods for the accelerated failure time (AFT) model. In this approach, the survival function of the censoring variable is estimated by the Kaplan-Meier estimator. The AFT model parameters are first estimated by the expectile method and afterwards, when the number of explanatory variables may be large, by the adaptive LASSO expectile method, which directly carries out automatic variable selection. We also obtain the convergence rate and asymptotic normality of the two estimators, and we show the sparsity property of the censored adaptive LASSO expectile estimator. A numerical study using Monte Carlo simulations confirms the theoretical results and demonstrates the competitive performance of the two proposed estimators. The usefulness of these estimators is illustrated by applying them to three survival data sets.
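For intuition, here is a minimal sketch of the two building blocks, assuming uncensored responses; the paper additionally reweights observations using the Kaplan-Meier estimator of the censoring survival function, which is not reproduced here.

```python
# Expectile loss and an adaptive-LASSO-penalized objective for a log-scale AFT fit.
import numpy as np

def expectile_loss(u, tau):
    # rho_tau(u) = |tau - 1{u < 0}| * u^2; tau = 0.5 recovers squared-error loss
    u = np.asarray(u, dtype=float)
    return np.mean(np.where(u < 0, 1 - tau, tau) * u**2)

def penalized_objective(beta, X, logT, tau, lam, beta_init):
    # adaptive LASSO: penalty weights are inverse absolute values of an initial
    # (nonzero) estimate, so large initial coefficients are penalized less
    fit = expectile_loss(logT - X @ beta, tau)
    return fit + lam * np.sum(np.abs(beta) / np.abs(beta_init))
```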
Conditional modeling of recurrent event data with terminal event
Fang W, Zhou J and Xie M
Recurrent event data with a terminal event arise in follow-up studies. The current literature has primarily focused on the effect of covariates on the recurrent event process, using marginal estimating equation approaches or joint modeling approaches via frailties. In this article, we propose a conditional model for recurrent event data with a terminal event, which provides an intuitive interpretation of the effect of the terminal event: at early times, the rate of recurrent events is nearly independent of the terminal event, but the dependence becomes stronger as time approaches the terminal event time. A two-stage likelihood-based approach is proposed to estimate the parameters of interest. Asymptotic properties of the estimators are established. The finite-sample behavior of the proposed method is examined through simulation studies. A real data set on colorectal cancer is analyzed with the proposed method for illustration.
Two-stage pseudo maximum likelihood estimation of semiparametric copula-based regression models for semi-competing risks data
Arachchige SJ, Chen X and Zhou QM
We propose a two-stage estimation procedure for a copula-based model with semi-competing risks data, where the non-terminal event is subject to dependent censoring by the terminal event, and both events are subject to independent censoring. With a copula-based model, the marginal survival functions of the individual event times are specified by semiparametric transformation models, and the dependence between the bivariate event times is specified by a parametric copula function. In the first stage of the estimation procedure, the parameters associated with the margin of the terminal event are estimated using only the corresponding observed outcomes; in the second stage, the marginal parameters for the non-terminal event time and the copula parameter are estimated together by maximizing a pseudo-likelihood function based on the joint distribution of the bivariate event times. We derive the asymptotic properties of the proposed estimator and provide an analytic variance estimator for inference. Through simulation studies, we show that our approach leads to consistent estimates with less computational cost and more robustness than the one-stage procedure developed in Chen YH (Lifetime Data Anal 18:36-57, 2012), where all parameters are estimated simultaneously. In addition, our approach demonstrates more desirable finite-sample performance than another existing two-stage estimation method proposed in Zhu H et al. (Commun Stat Theory Methods 51(22):7830-7845, 2021). An R package, PMLE4SCR, is developed to implement the proposed method.
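To fix ideas, the following is a heavily simplified sketch of the two-stage logic, assuming fully observed bivariate event times, Weibull margins, and a Clayton copula; the paper instead uses semiparametric transformation margins, handles dependent and independent censoring, and supplies analytic variances.

```python
# Two-stage pseudo-ML sketch: fit the terminal margin first, then maximize the
# copula pseudo-likelihood over the other margin and the copula parameter.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

def clayton_log_density(u, v, theta):
    # log c(u, v; theta) for the Clayton copula, theta > 0
    return (np.log1p(theta) - (theta + 1) * (np.log(u) + np.log(v))
            - (2 * theta + 1) / theta * np.log(u**-theta + v**-theta - 1))

def two_stage(t1, t2):
    # Stage 1: terminal-event margin alone (Weibull ML, location fixed at 0)
    c2, _, s2 = weibull_min.fit(t2, floc=0)
    u2 = weibull_min.sf(t2, c2, scale=s2)           # survival probabilities
    # Stage 2: non-terminal margin and copula parameter, stage-1 margin held fixed
    def neg_pl(par):
        c1, s1, theta = par
        if min(c1, s1, theta) <= 0:
            return np.inf
        u1 = weibull_min.sf(t1, c1, scale=s1)
        ll = (weibull_min.logpdf(t1, c1, scale=s1)
              + clayton_log_density(u1, u2, theta))  # joint density via the copula
        return -np.sum(ll)
    return minimize(neg_pl, x0=[1.0, 1.0, 1.0], method="Nelder-Mead")
```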
Evaluating time-to-event surrogates for time-to-event true endpoints: an information-theoretic approach based on causal inference
Stijven F, Molenberghs G, Van Keilegom I, Van der Elst W and Alonso A
Putative surrogate endpoints must undergo a rigorous statistical evaluation before they can be used in clinical trials. Numerous frameworks have been introduced for this purpose. In this study, we extend the scope of the information-theoretic causal-inference approach to encompass scenarios where both outcomes are time-to-event endpoints, using the flexibility provided by D-vine copulas. We evaluate the quality of the putative surrogate using the individual causal association (ICA), a measure based on the mutual information between the individual causal treatment effects. However, despite its appealing mathematical properties, the ICA may be ill-defined for composite endpoints. Therefore, we also propose an alternative rank-based metric for assessing the ICA. Due to the fundamental problem of causal inference, the joint distribution of all potential outcomes is only partially identifiable and, consequently, the ICA cannot be estimated without strong unverifiable assumptions. This is addressed by a formal sensitivity analysis that is summarized by the so-called intervals of ignorance and uncertainty. The frequentist properties of these intervals are discussed in detail. Finally, the proposed methods are illustrated with an analysis of pooled data from two advanced colorectal cancer trials. The newly developed techniques have been implemented in the R package Surrogate.
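A toy numerical sketch of the quantity involved, under strong assumptions the paper does not make: if the individual causal treatment effects on surrogate and true endpoint were bivariate normal with correlation rho, their mutual information has a closed form, and 1 - exp(-2*MI) is one common normalization of it to [0, 1]. The rank-based alternative is illustrated here with a Spearman correlation of the simulated effects; this is not the paper's D-vine-based estimator.

```python
# Mutual information and a rank-based summary for bivariate normal "effects".
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
rho = 0.8
delta = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=10_000)
mi = -0.5 * np.log(1 - rho**2)                      # mutual information in nats
print("MI:", mi, "normalized:", 1 - np.exp(-2 * mi))  # equals rho^2 in this case
rho_s, _ = spearmanr(delta[:, 0], delta[:, 1])
print("Spearman rho of simulated effects:", rho_s)
```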
Nonparametric estimation of the cumulative incidence function for doubly-truncated and interval-censored competing risks data
Shen PS
Interval sampling is widely used in the collection of disease registry data, which typically report incident cases during a certain time period. Such a sampling scheme induces doubly truncated data if the failure time is observed exactly, and doubly truncated and interval-censored (DTIC) data if the failure time is known only to lie within an interval. In this article, we consider nonparametric estimation of the cumulative incidence function (CIF) using doubly truncated and interval-censored competing risks (DTIC-C) data obtained under the interval sampling scheme. Using the approach of Shen (Stat Methods Med Res 31:1157-1170, 2022b), we first obtain the nonparametric maximum likelihood estimator (NPMLE) of the distribution function of the failure time, ignoring failure types. Using this NPMLE, we propose nonparametric estimators of the CIF with DTIC-C data and establish the consistency of the proposed estimators. Simulation studies show that the proposed estimators perform well for finite sample sizes.
A class of semiparametric models for bivariate survival data
Dos Reis Miranda Filho W and Demarqui FN
We propose a new class of bivariate survival models based on the family of Archimedean copulas with margins modeled by the Yang and Prentice (YP) model. The Ali-Mikhail-Haq (AMH), Clayton, Frank, Gumbel-Hougaard (GH), and Joe copulas are employed to accommodate the dependency among marginal distributions. Baseline distributions are modeled semiparametrically by the Piecewise Exponential (PE) distribution and the Bernstein polynomials (BP). Inference procedures for the proposed class of models are based on the maximum likelihood (ML) approach. The new class of models possesses some attractive features: i) the ability to take into account survival data with crossing survival curves; ii) the inclusion of the well-known proportional hazards (PH) and proportional odds (PO) models as particular cases; iii) greater flexibility provided by the semiparametric modeling of the marginal baseline distributions; iv) the availability of closed-form expressions for the likelihood functions, leading to more straightforward inferential procedures. The properties of the proposed class are numerically investigated through an extensive simulation study. Finally, we demonstrate the versatility of our new class of models through the analysis of survival data involving patients diagnosed with ovarian cancer.
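As a concrete illustration of one of the semiparametric baseline choices, here is a hedged sketch (function and argument names illustrative, not from the paper) of the piecewise exponential survival function, whose closed form is one source of the tractable likelihoods mentioned above; the grid is assumed to cover the observed times.

```python
# Piecewise exponential baseline: constant hazard within each interval of a grid.
import numpy as np

def pe_survival(t, grid, rates):
    # grid: breakpoints 0 = a_0 < a_1 < ... < a_J; rates: hazard on each interval
    t = np.atleast_1d(np.asarray(t, dtype=float))
    edges = np.asarray(grid, dtype=float)
    # time spent in each interval, clipped to the interval lengths
    exposure = np.clip(t[:, None] - edges[:-1][None, :], 0.0,
                       np.diff(edges)[None, :])
    return np.exp(-(exposure * np.asarray(rates)[None, :]).sum(axis=1))

# e.g. S(1.5) with hazard 0.2 on [0,1) and 0.5 on [1,2):
print(pe_survival(1.5, [0.0, 1.0, 2.0], [0.2, 0.5]))  # exp(-(0.2 + 0.25))
```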
Proportional rates model for recurrent event data with intermittent gaps and a terminal event
Jin J, Song X, Sun L and Su PF
Recurrent events are common in medical practice and epidemiologic studies, where a subject may experience a particular event repeatedly over time. In long-term observation of recurrent events, a terminal event such as death may also be present in the data. Meanwhile, some subjects may withdraw from a study for a period of time for various reasons and then resume, possibly more than once. The period between a subject's leaving and returning to the study is called an intermittent gap. A naive method ignores the gaps and treats the events as ordinary recurrent events, which can yield misleading estimates. In this article, we consider a semiparametric proportional rates model for recurrent event data with intermittent gaps and a terminal event. An estimation procedure is developed for the model parameters, and the asymptotic properties of the resulting estimators are established. Simulation studies demonstrate that the proposed estimators perform satisfactorily compared to the naive method that ignores gaps. A diabetes study further shows the utility of the proposed method.
A global kernel estimator for partially linear varying coefficient additive hazards models
Ng HM and Wong KY
We study kernel-based estimation methods for partially linear varying coefficient additive hazards models, where the effects of one type of covariates can be modified by another. Existing kernel estimation methods for varying coefficient models often use a "local" approach, where only a small local neighborhood of subjects is used for estimating the varying coefficient functions. Such a local approach, however, is generally inefficient, as information about the non-varying nuisance parameter from subjects outside the neighborhood is discarded. In this paper, we develop a "global" kernel estimator that simultaneously estimates the varying coefficients over the entire domains of the functions, leveraging the non-varying nature of the nuisance parameter. We establish the consistency and asymptotic normality of the proposed estimators. The theoretical developments are substantially more challenging than those of the local methods, as the dimension of the global estimator increases with the sample size. We conduct extensive simulation studies to demonstrate the feasibility and superior performance of the proposed methods compared with existing local methods and provide an application to a motivating cancer genomic study.
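To make the contrast concrete, here is an illustrative sketch (names hypothetical) of the "local" kernel weighting the paper improves upon: estimating a varying coefficient at a point t0 uses only subjects near t0, whereas the global estimator estimates the whole coefficient function jointly and shares the non-varying nuisance parameter across all neighborhoods.

```python
# Epanechnikov kernel weights for a "local" estimator at t0: subjects with
# |t - t0| > h receive zero weight, so most data are discarded at any one point.
import numpy as np

def epanechnikov_weights(t, t0, h):
    u = (np.asarray(t, dtype=float) - t0) / h
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2) / h, 0.0)

w = epanechnikov_weights(np.linspace(0, 10, 11), t0=5.0, h=2.0)
print(w)   # only subjects with t in (3, 7) contribute
```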
Nested case-control sampling without replacement
Shin YE and Saegusa T
The nested case-control (NCC) design is a cost-effective outcome-dependent design in epidemiology that collects all cases and a fixed number of controls at each time of case diagnosis from a large cohort. Because of its inefficiency relative to full cohort studies, previous research has developed various estimation methodologies, but changes to the design through the formulation of risk sets have been considered only with a view to potential bias in partial likelihood estimation. In this paper, we study a modified design that excludes previously selected controls from risk sets, with a view to improving efficiency as well as addressing bias. To this end, we extend the inverse probability weighting method of Samuelsen, which was shown to outperform the partial likelihood estimator in the standard setting. We develop its asymptotic theory and a variance estimator for both the regression coefficients and the cumulative baseline hazard function that accounts for the complex features of the modified sampling design. In addition to the good finite-sample performance of the variance estimator, simulation studies show that the modified design with the proposed estimator is more efficient than the standard design. Examples are provided using data from the NIH-AARP Diet and Health Study.
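For context, here is a hedged sketch of Samuelsen-type inclusion probabilities under the standard NCC design (ties in event times ignored; function name illustrative). The modified design studied in the paper, which removes previously selected controls from later risk sets, changes these probabilities.

```python
# Samuelsen-type inclusion probabilities for standard NCC sampling: a non-case's
# probability of ever being sampled is one minus the product, over the case times
# at which it was at risk, of the chance of not being drawn as one of m controls.
import numpy as np

def samuelsen_inclusion_prob(entry, exit, event, m):
    entry, exit, event = map(np.asarray, (entry, exit, event))
    log_not = np.zeros(len(exit))                   # log P(never sampled as control)
    for t in exit[event == 1]:
        # candidates for the m controls: at risk at t, excluding the index case
        pool = (entry < t) & (exit >= t) & ~((exit == t) & (event == 1))
        k = pool.sum()
        if k > m:
            log_not[pool] += np.log1p(-m / k)
        elif k > 0:
            log_not[pool] = -np.inf                 # every candidate is sampled
    p = 1 - np.exp(log_not)
    p[event == 1] = 1.0                             # cases enter with certainty
    return p                                        # IPW weights are 1 / p
```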
Spatiotemporal multilevel joint modeling of longitudinal and survival outcomes in end-stage kidney disease
Kürüm E, Nguyen DV, Qian Q, Banerjee S, Rhee CM and Şentürk D
Individuals with end-stage kidney disease (ESKD) on dialysis experience high mortality and excessive burden of hospitalizations over time relative to comparable Medicare patient cohorts without kidney failure. A key interest in this population is to understand the time-dynamic effects of multilevel risk factors that contribute to the correlated outcomes of longitudinal hospitalization and mortality. For this we utilize multilevel data from the United States Renal Data System (USRDS), a national database that includes nearly all patients with ESKD, where repeated measurements/hospitalizations over time are nested in patients and patients are nested within (health service) regions across the contiguous U.S. We develop a novel spatiotemporal multilevel joint model (STM-JM) that accounts for the aforementioned hierarchical structure of the data while considering the spatiotemporal variations in both outcomes across regions. The proposed STM-JM includes time-varying effects of multilevel (patient- and region-level) risk factors on hospitalization trajectories and mortality and incorporates spatial correlations across the spatial regions via a multivariate conditional autoregressive correlation structure. Efficient estimation and inference are performed via a Bayesian framework, where multilevel varying coefficient functions are targeted via thin-plate splines. The finite sample performance of the proposed method is assessed through simulation studies. An application of the proposed method to the USRDS data highlights significant time-varying effects of patient- and region-level risk factors on hospitalization and mortality and identifies specific time periods on dialysis and spatial locations across the U.S. with elevated hospitalization and mortality risks.
Copula-based analysis of dependent current status data with semiparametric linear transformation model
Yu H, Zhang R and Zhang L
This paper discusses regression analysis of current status data with dependent censoring, a problem that often occurs in many areas such as cross-sectional studies, epidemiological investigations and tumorigenicity experiments. Copula model-based methods are commonly employed to tackle this issue. However, these methods often face challenges in terms of model and parameter identification. The primary aim of this paper is to propose a copula-based analysis for dependent current status data, where the association parameter is left unspecified. Our method is based on a general class of semiparametric linear transformation models and parametric copulas. We demonstrate that the proposed semiparametric model is identifiable under certain regularity conditions from the distribution of the observed data. For inference, we develop a sieve maximum likelihood estimation method, using Bernstein polynomials to approximate the nonparametric functions involved. The asymptotic consistency and normality of the proposed estimators are established. Finally, to demonstrate the effectiveness and practical applicability of our method, we conduct an extensive simulation study and apply the proposed method to a real data example.
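As a small illustration of the sieve ingredient (an illustrative helper, not the paper's estimator), a Bernstein polynomial approximation to a function on [0, T] is a linear combination of binomial-type basis functions; it is nondecreasing whenever its coefficients are ordered, which is what makes it convenient for approximating monotone nonparametric components.

```python
# Bernstein polynomial approximation on [0, T]; monotone if coefs are ordered.
import numpy as np
from scipy.stats import binom

def bernstein(t, coefs, T):
    # B(t) = sum_k coefs[k] * C(K,k) (t/T)^k (1 - t/T)^(K-k)
    K = len(coefs) - 1
    x = np.clip(np.asarray(t, dtype=float) / T, 0.0, 1.0)
    basis = np.stack([binom.pmf(k, K, x) for k in range(K + 1)], axis=-1)
    return basis @ np.asarray(coefs)

print(bernstein(0.5, [0.0, 0.2, 0.7, 1.0], T=1.0))
```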
Partial-linear single-index transformation models with censored data
Lee M, Troxel AB and Liu M
In studies with time-to-event outcomes, multiple, inter-correlated, and time-varying covariates are commonly observed. It is of great interest to model their joint effects by allowing a flexible functional form and to delineate their relative contributions to survival risk. The class of semiparametric transformation (ST) models offers flexible specifications of the intensity function and provides a general framework to accommodate nonlinear covariate effects. In this paper, we propose a partial-linear single-index (PLSI) transformation model that reduces the dimensionality of multiple covariates to a single index and provides interpretable estimates of the covariate effects. We develop an iterative algorithm using the regression spline technique to model the nonparametric single-index function for possibly nonlinear joint effects, followed by nonparametric maximum likelihood estimation. We also propose a nonparametric testing procedure to formally examine the linearity of covariate effects. We conduct Monte Carlo simulation studies to compare the PLSI transformation model with the standard ST model, and we apply it to NYU Langone Health de-identified electronic health record data on the mortality of hospitalized COVID-19 patients and to a Veterans Administration lung cancer trial.
Unifying mortality forecasting model: an investigation of the COM-Poisson distribution in the GAS model for improved projections
Rakhmawan SA, Mahmood T, Abbas N and Riaz M
Forecasting mortality rates is crucial for evaluating life insurance company solvency, especially amid disruptions caused by phenomena like COVID-19. The Lee-Carter model is commonly employed in mortality modelling; however, extensions that can encompass count data with diverse distributions, such as the Generalized Autoregressive Score (GAS) model utilizing the COM-Poisson distribution, show potential for enhancing time-to-event forecasting accuracy. Using mortality data from 29 countries, this research evaluates various distributions and finds that the COM-Poisson model surpasses the Poisson, binomial, and negative binomial distributions in forecasting mortality rates. The one-step forecasting capability of the GAS model offers distinct advantages, while the COM-Poisson distribution provides added flexibility by encompassing other distributions, such as the Poisson and geometric, as special cases. Ultimately, the study concludes that the COM-Poisson GAS model is an effective instrument for examining time series data on mortality rates, particularly in the presence of time-varying parameters and non-conventional data distributions.
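For reference, a minimal sketch of the Conway-Maxwell-Poisson probability mass function: its normalizing constant has no closed form and is approximated here by truncating the series (truncation point an assumption of this sketch). Setting nu = 1 recovers the Poisson; nu < 1 gives over-dispersion and nu > 1 under-dispersion.

```python
# COM-Poisson log-pmf via a truncated normalizing constant:
# P(Y = y) = lambda^y / (y!)^nu / Z(lambda, nu).
import numpy as np
from scipy.special import gammaln

def com_poisson_logpmf(y, lam, nu, max_y=500):
    ys = np.arange(max_y + 1)
    log_terms = ys * np.log(lam) - nu * gammaln(ys + 1)
    log_z = np.logaddexp.reduce(log_terms)          # log of the truncated series
    return y * np.log(lam) - nu * gammaln(np.asarray(y) + 1) - log_z

# sanity check: nu = 1 matches the Poisson(2) pmf
print(np.exp(com_poisson_logpmf(np.arange(4), lam=2.0, nu=1.0)))
```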
A flexible time-varying coefficient rate model for panel count data
Sun D, Guo Y, Li Y, Sun J and Tu W
Panel count regression is often required in recurrent event studies, where the interest is in modeling the event rate. Existing rate models are unable to handle time-varying covariate effects due to theoretical and computational difficulties. Mean models provide a viable alternative but are subject to the constraints of the monotonicity assumption, which tends to be violated when covariates fluctuate over time. In this paper, we present a new semiparametric rate model for panel count data along with related theoretical results. For model fitting, we present an efficient EM algorithm with three different methods for variance estimation. The algorithm allows us to sidestep the challenges of numerical integration and the difficulties of the iterative convex minorant algorithm. We show that the estimators are consistent and asymptotically normally distributed. Simulation studies confirm excellent finite-sample performance. To illustrate, we analyze data from a real clinical study of behavioral risk factors for sexually transmitted infections.
Regression analysis of doubly censored failure time data with ancillary information
Du M, Gao X and Chen L
Doubly censored failure time data occur in many areas; in this situation, the failure time of interest usually represents the elapsed time between two related events, such as an infection and the resulting disease onset. Although many methods have been proposed for regression analysis of such data, most of them condition on the occurrence time of the initial event and ignore the relationship between the two events, or the ancillary information contained in the initial event. To address this, a new sieve maximum likelihood approach is proposed that makes use of the ancillary information; in this method, a logistic model and the Cox proportional hazards model are employed to model the initial event and the failure time of interest, respectively. A simulation study suggests that the proposed method works well in practice and, as expected, is more efficient than existing methods. The approach is applied to an AIDS study that motivated this investigation.
Competing risks and multivariate outcomes in epidemiological and clinical trial research
Prentice RL
Data analysis methods for the study of treatments or exposures in relation to a clinical outcome in the presence of competing risks have a long history, often with inference targets that are hypothetical, thereby requiring strong assumptions for identifiability with available data. Here data analysis methods are considered that are based on single and higher dimensional marginal hazard rates, quantities that are identifiable under standard independent censoring assumptions. These lead naturally to joint survival function estimators for outcomes of interest, including competing risk outcomes, and provide the basis for addressing a variety of data analysis questions. These methods will be illustrated using simulations and Women's Health Initiative cohort and clinical trial data sets, and additional research needs will be described.
A constrained maximum likelihood approach to developing well-calibrated models for predicting binary outcomes
Cao Y, Ma W, Zhao G, McCarthy AM and Chen J
The added value of candidate predictors for risk modeling is routinely evaluated by comparing the performance of models with and without the candidate predictors. Such a comparison is most meaningful when the risks estimated by the two models are both unbiased in the target population. Very often, data for candidate predictors are sourced from nonrepresentative convenience samples. Updating the base model using the study data without acknowledging the discrepancy between the underlying distribution of the study data and that of the target population can lead to biased risk estimates and therefore an unfair evaluation of candidate predictors. To address this issue, assuming access to a well-calibrated base model, we propose a semiparametric method for model fitting that enforces good calibration. The central idea is to calibrate the fitted model against the base model by enforcing suitable constraints when maximizing the likelihood function. This approach enables an unbiased assessment of the model improvement offered by candidate predictors without requiring a representative sample from the target population, thus overcoming a significant practical challenge. We study the theoretical properties of the model parameter estimates and demonstrate improvement in model calibration via extensive simulation studies. Finally, we apply the proposed method to data extracted from the Penn Medicine Biobank to assess the added value of breast density for breast cancer risk assessment among Caucasian women.
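A minimal sketch of the mechanics, assuming a logistic model and a single mean-calibration equality constraint (the paper's constraints are more refined than this): maximize the study-data likelihood subject to the fitted model's average predicted risk agreeing with the base model's.

```python
# Constrained ML: logistic log-likelihood maximized under a calibration constraint.
import numpy as np
from scipy.optimize import minimize

def fit_constrained(X, y, base_risk):
    # X: design matrix including candidate predictors; base_risk: base-model risks
    def negloglik(beta):
        eta = X @ beta
        return np.sum(np.logaddexp(0, eta)) - y @ eta   # -sum[y*eta - log(1+e^eta)]
    def calib_gap(beta):
        p = 1 / (1 + np.exp(-(X @ beta)))
        return np.mean(p) - np.mean(base_risk)          # mean calibration constraint
    cons = {"type": "eq", "fun": calib_gap}
    return minimize(negloglik, np.zeros(X.shape[1]),
                    constraints=[cons], method="SLSQP")
```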
Risk projection for time-to-event outcome from population-based case-control studies leveraging summary statistics from the target population
Zheng J and Hsu L
Risk stratification based on prediction models has become increasingly important in preventing and managing chronic diseases. However, due to cost and time limitations, not every population has the resources to collect detailed individual-level information on enough people to develop risk prediction models. A more practical approach is to use prediction models developed from existing studies and calibrate them with relevant summary-level information from the target population. Many existing studies were conducted under the population-based case-control design. Gail et al. (J Natl Cancer Inst 81:1879-1886, 1989) proposed combining the odds ratio estimates obtained from case-control data with the disease incidence rates of the target population to obtain the baseline hazard function, and thereby the pure risk of developing disease. However, this approach requires that the risk factor distribution of the cases in the case-control studies be the same as in the target population; if this is violated, risk estimation may be biased. In this article, we propose two novel weighted estimating equation approaches that calibrate the baseline risk by leveraging summary information on (some) risk factors, in addition to disease-free probabilities, from the target population. We establish the consistency and asymptotic normality of the proposed estimators. Extensive simulation studies and an application to colorectal cancer studies demonstrate that the proposed estimators perform well in reducing bias in finite samples.
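For orientation, a minimal sketch of the classical composition step the paper builds on (names illustrative, not the paper's estimators): deflate the population incidence rate by one minus the attributable risk, with 1 - AR estimated from the cases' relative risks, to obtain a baseline hazard for pure-risk projection.

```python
# Classical Gail-type baseline hazard: population incidence times (1 - AR),
# where 1 - AR is estimated as the mean inverse relative risk among cases.
import numpy as np

def baseline_hazard(pop_incidence, case_relative_risks):
    one_minus_ar = np.mean(1.0 / np.asarray(case_relative_risks, dtype=float))
    return pop_incidence * one_minus_ar

# e.g. incidence 0.002/year, three cases with relative risks 1, 2, and 4
print(baseline_hazard(0.002, [1.0, 2.0, 4.0]))
```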
Measurement error models with zero inflation and multiple sources of zeros, with applications to hard zeros
Bhadra A, Wei R, Keogh R, Kipnis V, Midthune D, Buckman DW, Su Y, Chowdhury AR and Carroll RJ
We consider measurement error models for two variables observed repeatedly and subject to measurement error. One variable is continuous, while the other variable is a mixture of continuous and zero measurements. This second variable has two sources of zeros. The first source is episodic zeros, wherein some of the measurements for an individual may be zero and others positive. The second source is hard zeros, i.e., some individuals will always report zero. An example is the consumption of alcohol from alcoholic beverages: some individuals consume alcoholic beverages episodically, while others never consume alcoholic beverages. However, with a small number of repeat measurements from individuals, it is not possible to determine those who are episodic zeros and those who are hard zeros. We develop a new measurement error model for this problem, and use Bayesian methods to fit it. Simulations and data analyses are used to illustrate our methods. Extensions to parametric models and survival analysis are discussed briefly.
Special issue dedicated to Mitchell H. Gail, M.D., Ph.D.
Ting Lee ML