COMMUNICATIONS IN STATISTICS-THEORY AND METHODS

Confidence Interval Estimation of the Youden index and corresponding cut-point for a combination of biomarkers under normality
Attwood K and Tian L
In prognostic/diagnostic medical research, the goal is often to identify a biomarker that differentiates between patients with and without a condition, or between patients who will have a good or poor response to a given treatment. The statistical literature is abundant with methods for evaluating single biomarkers for these purposes. However, in practice, a single biomarker rarely captures all aspects of a disease process; it is therefore often the case that using a combination of biomarkers improves discriminatory ability. A variety of methods have been developed for combining biomarkers based on the maximization of some global measure or cost function. These methods usually create a score based on a linear combination of the biomarkers, to which the standard single-biomarker methodologies (such as the Youden index) are applied. However, these single-biomarker methodologies do not account for the multivariable nature of the combined biomarker score. In this article we present generalized inference and bootstrap approaches to estimating confidence intervals for the Youden index and the corresponding cut-point for a combined biomarker. These methods account for the inherent dependencies and provide accurate and efficient estimates. A simulation study and a real-world example utilizing data from a Duchenne muscular dystrophy study are also presented.
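As a sketch of the quantity being estimated: under the normality assumption the abstract invokes, the Youden index J = max_c {sensitivity(c) + specificity(c) - 1} and its cut-point can be found by a simple grid search. This is a minimal illustration, not the authors' interval-estimation method, and the parameter values below are hypothetical.

```python
from math import erf, sqrt

import numpy as np

def norm_cdf(x, mu, sigma):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def youden_normal(mu0, s0, mu1, s1):
    """Youden index J = max_c [spec(c) + sens(c) - 1] for a normally
    distributed score: controls ~ N(mu0, s0^2), cases ~ N(mu1, s1^2)."""
    grid = np.linspace(min(mu0, mu1) - 4 * max(s0, s1),
                       max(mu0, mu1) + 4 * max(s0, s1), 20001)
    spec = np.array([norm_cdf(c, mu0, s0) for c in grid])        # P(S <= c | control)
    sens = 1.0 - np.array([norm_cdf(c, mu1, s1) for c in grid])  # P(S >  c | case)
    j = spec + sens - 1.0
    k = int(np.argmax(j))
    return j[k], grid[k]

J, cut = youden_normal(0.0, 1.0, 2.0, 1.0)
# With equal variances the optimal cut-point is the midpoint of the two means
```

With equal variances the grid search recovers the known closed-form cut-point (mu0 + mu1) / 2.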
Covariate adjustment via propensity scores for recurrent events in the presence of dependent censoring
Cho Y and Ghosh D
Dependent censoring is common in many medical studies, especially when there are multiple occurrences of the event of interest. Ghosh and Lin (2003) and Hsieh, Ding and Wang (2011) proposed estimation procedures using an artificial censoring technique. However, if covariates are not bounded, these methods can cause excessive artificial censoring. In this paper, we propose estimation procedures for the treatment effect based on a novel application of propensity scores. Simulation studies show that the proposed method exhibits good finite-sample properties. The techniques are illustrated with an application to an HIV dataset.
Power for balanced linear mixed models with complex missing data processes
Josey KP, Ringham BM, Barón AE, Schenkman M, Sauder KA, Muller KE, Dabelea D and Glueck DH
When designing repeated measures studies, both the amount and the pattern of missing outcome data can affect power. The chance that an observation is missing may vary across measurements, and missingness may be correlated across measurements. For example, in a physiotherapy study of patients with Parkinson's disease, increasing intermittent dropout over time yielded missing measurements of physical function. In this example, we assume data are missing completely at random, since the chance that a data point was missing appears to be unrelated to either outcomes or covariates. For data missing completely at random, we propose noncentral power approximations for the Wald test for balanced linear mixed models with Gaussian responses. The power approximations are based on moments of missing data summary statistics. The moments were derived assuming a conditional linear missingness process. The approach provides approximate power for both complete-case analyses, which include independent sampling units where all measurements are present, and observed-case analyses, which include all independent sampling units with at least one measurement. Monte Carlo simulations demonstrate the accuracy of the method in small samples. We illustrate the utility of the method by computing power for proposed replications of the Parkinson's study.
Semiparametric copula-based regression modeling of semi-competing risks data
Zhu H, Lan Y, Ning J and Shen Y
Semi-competing risks data often arise in medical studies where the terminal event (e.g., death) censors the non-terminal event (e.g., cancer recurrence), but the non-terminal event does not prevent the subsequent occurrence of the terminal event. This article considers regression modeling of semi-competing risks data to assess the covariate effects on the respective non-terminal and terminal event times. We propose a copula-based framework for semi-competing risks regression with time-varying coefficients, where the dependence between the non-terminal and terminal event times is characterized by a copula and the time-varying covariate effects are imposed on two marginal regression models. We develop a two-stage inferential procedure for estimating the association parameter in the copula model and time-varying regression parameters. We evaluate the finite sample performance of the proposed method through simulation studies and illustrate the method through an application to Surveillance, Epidemiology, and End Results-Medicare data for elderly women diagnosed with early-stage breast cancer and initially treated with breast-conserving surgery.
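The abstract does not fix a particular copula. As a minimal sketch of copula-based dependence between a non-terminal and a terminal event time, the following samples from a Clayton copula (an assumed choice for illustration) by the conditional-distribution method; for Clayton, Kendall's tau equals theta / (theta + 2).

```python
import numpy as np

def sample_clayton(n, theta, rng):
    """Draw n pairs (U, V) from a Clayton copula with parameter theta > 0
    using the conditional-distribution method: invert the conditional CDF
    of V given U = u at a uniform draw w."""
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    v = (u**(-theta) * (w**(-theta / (1.0 + theta)) - 1.0) + 1.0)**(-1.0 / theta)
    return u, v

rng = np.random.default_rng(0)
u, v = sample_clayton(50000, theta=2.0, rng=rng)
# theta = 2 corresponds to Kendall's tau = 0.5; marginals are uniform on (0, 1)
```

Marginal event times would then be obtained by applying inverse marginal survival functions to U and V.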
Sample Size Calculation for Count Outcomes in Cluster Randomization Trials with Varying Cluster Sizes
Wang J, Zhang S and Ahn C
In many cluster randomization studies, cluster sizes are not fixed and may be highly variable. For such studies, sample size estimation assuming a constant cluster size may lead to under-powered studies. Sample size formulas have been developed to incorporate the variability in cluster size for clinical trials with continuous and binary outcomes. Count outcomes also frequently occur in cluster randomized studies. In this paper, we derive a closed-form sample size formula for count outcomes that accounts for the variability in cluster size. We compare the performance of the proposed method with that of the average cluster size method through simulation. The simulation study shows that the proposed method performs better, with empirical power and type I error rates closer to the nominal levels.
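For intuition about why cluster-size variability matters, here is the standard design-effect inflation used for continuous and binary outcomes, in which the coefficient of variation (CV) of cluster size enters the formula. The paper's count-outcome formula differs; this is only the familiar analogue.

```python
def design_effect(m_bar, icc, cv=0.0):
    """Variance inflation factor for cluster randomization with mean cluster
    size m_bar, intracluster correlation icc, and coefficient of variation cv
    of the cluster sizes; cv = 0 recovers the equal-cluster-size design effect
    1 + (m_bar - 1) * icc."""
    return 1.0 + ((cv**2 + 1.0) * m_bar - 1.0) * icc

de_equal = design_effect(m_bar=20, icc=0.05)          # equal cluster sizes
de_vary  = design_effect(m_bar=20, icc=0.05, cv=0.8)  # highly variable sizes
# Ignoring the variability (cv = 0) understates the required sample size
```

Here the required sample size would be inflated by 2.59 rather than 1.95, so assuming a constant cluster size under-powers the study.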
Competing risks regression for clustered data with covariate-dependent censoring
Khanal M, Kim S, Fang X and Ahn KW
Competing risks data in clinical trials or observational studies often involve cluster effects, such as center effects or matched-pairs designs. The proportional subdistribution hazards (PSH) model is one of the most widely used methods for competing risks data analysis. However, the current literature on the PSH model for clustered competing risks data is limited to covariate-independent censoring and the unstratified model. In practice, competing risks data often involve covariate-dependent censoring and a non-PSH structure. We therefore propose a marginal stratified PSH model with covariate-adjusted censoring weights for clustered competing risks data. We use a marginal stratified proportional hazards model to estimate the survival probability of censoring, taking clusters and the non-proportional hazards structure into account. Our simulation results show that, in the presence of covariate-dependent censoring, the parameter estimates of the proposed method are unbiased with approximately 95% coverage rates. We apply the proposed method to stem cell transplant data of leukemia patients to evaluate the clinical implications of donor-recipient HLA matching on chronic graft-versus-host disease.
A robust Spearman correlation coefficient permutation test
Yu H and Hutson AD
In this work, we show that the Spearman correlation coefficient test implemented in most statistical software is theoretically incorrect and performs poorly when the bivariate normality assumption is not met or the sample size is small. There is a common misconception that these tests are robust to deviations from bivariate normality. However, we found that, under certain scenarios, violation of the bivariate normality assumption has severe effects on type I error control. To address this issue, we developed a robust permutation test of the null hypothesis of zero correlation based on an appropriately studentized statistic. We show that the test is asymptotically valid in general settings, as demonstrated by a comprehensive set of simulation studies in which the proposed test exhibits robust type I error control even when the sample size is small. We also demonstrate the application of the test in two real-world examples.
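A minimal sketch of a permutation test for Spearman's correlation, using the plain rank correlation rather than the authors' studentized statistic, with hypothetical simulated data:

```python
import numpy as np

def spearman_rho(x, y):
    # Pearson correlation of the ranks = Spearman's rho (no ties assumed)
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def perm_test(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for H0: no association,
    permuting y relative to x and comparing |rho| to the observed value."""
    rng = np.random.default_rng(seed)
    obs = abs(spearman_rho(x, y))
    hits = sum(abs(spearman_rho(x, rng.permutation(y))) >= obs
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = x + rng.normal(scale=0.5, size=30)   # strongly associated pair
```

Under strong association the permutation p-value is near its minimum attainable value 1 / (n_perm + 1).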
Log-epsilon-skew normal: A generalization of the log-normal distribution
Hutson AD, Mashtare TL and Mudholkar GS
The log-normal distribution is widely used to model non-negative data in many areas of applied research. In this paper, we introduce and study a family of distributions supported on the non-negative reals, termed the log-epsilon-skew normal (LESN), which includes the log-normal distributions as a special case. It is related to the epsilon-skew normal distribution developed in Mudholkar and Hutson (2000) in the way the log-normal is related to the normal distribution. We study its main properties, hazard function, moments, and skewness and kurtosis coefficients, and discuss maximum likelihood estimation of the model parameters. We summarize the results of a simulation study examining the behavior of the maximum likelihood estimates, and we illustrate maximum likelihood estimation of the LESN distribution parameters on two real-world data sets.
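Assuming the half-normal mixture representation of the epsilon-skew normal of Mudholkar and Hutson (2000), with mass (1 + eps)/2 to the left of the mode, the LESN can be sampled by exponentiation. This is a sketch of the construction, not the authors' code; with eps = 0 it reduces to the log-normal.

```python
import numpy as np

def sample_lesn(n, theta, sigma, eps, rng):
    """Sample from the log-epsilon-skew normal: Y = exp(X), where X follows
    the epsilon-skew normal ESN(theta, sigma, eps). ESN mixes two scaled
    half-normals: with probability (1 + eps)/2 a draw below theta with scale
    sigma*(1 + eps), otherwise a draw above theta with scale sigma*(1 - eps)."""
    z = np.abs(rng.normal(size=n))
    left = rng.uniform(size=n) < (1.0 + eps) / 2.0
    x = np.where(left, theta - sigma * (1.0 + eps) * z,
                       theta + sigma * (1.0 - eps) * z)
    return np.exp(x)

rng = np.random.default_rng(0)
y  = sample_lesn(100000, theta=0.0, sigma=1.0, eps=0.0, rng=rng)  # log-normal case
y2 = sample_lesn(100000, theta=0.0, sigma=1.0, eps=0.5, rng=rng)  # P(Y < 1) = 0.75
```

The skewness parameter eps shifts probability mass across exp(theta), which is the median in the symmetric case.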
D-optimal designs for two-variable logistic regression model with restricted design space
Zhai Y, Wang C, Lin HY and Fang Z
The problem of constructing locally D-optimal designs for the two-variable logistic model with no interaction has been studied extensively in the literature. In Kabera, Haines, and Ndlovu (2015), the model is restricted to positive slopes and a negative intercept, under the assumptions that the probability of response increases with dose for both drugs and that the probability of response is less than 0.5 at the zero dose level of both drugs. The design space mainly discussed there is [0, ∞) × [0, ∞), while finite rectangular design spaces are considered only in scenarios where the results for the unbounded design space remain appropriate. In this paper, we relax these restrictions and discuss rectangular design spaces for the model in cases where those D-optimal designs cannot be obtained. The results can be extended, by translation and reflection in the first quadrant, to models in which the drugs have negative or opposite effects, or to models with a positive intercept.
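A sketch of the local D-optimality criterion for the two-variable logistic model with no interaction: candidate designs are scored by the log-determinant of the Fisher information evaluated at assumed local parameter values. The design points and the parameter vector below are hypothetical.

```python
import numpy as np

def info_matrix(points, weights, beta):
    """Fisher information for the logistic model logit(p) = b0 + b1*x1 + b2*x2
    at design points (x1, x2) with design weights summing to one."""
    M = np.zeros((3, 3))
    for (x1, x2), w in zip(points, weights):
        f = np.array([1.0, x1, x2])
        p = 1.0 / (1.0 + np.exp(-(beta @ f)))
        M += w * p * (1.0 - p) * np.outer(f, f)
    return M

def log_det_crit(points, weights, beta):
    # D-optimality: maximize log det of the information matrix
    sign, logdet = np.linalg.slogdet(info_matrix(points, weights, beta))
    return logdet if sign > 0 else -np.inf

beta = np.array([-1.0, 1.0, 1.0])          # negative intercept, positive slopes
spread = [(0, 0), (2, 0), (0, 2), (2, 2)]  # well-separated support points
clump  = [(1, 1), (1, 1.1), (0.9, 1), (1, 0.9)]  # nearly coincident points
w = [0.25] * 4
```

Well-separated support points make all three parameters estimable and dominate a clumped design under the criterion.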
A fortune cookie problem: A test for nominal data whether two samples are from the same population of equally likely elements
Gou J, Ruth K, Basickes S and Litwin S
This article considers a way to test the hypothesis that two collections of objects are drawn from the same uniform distribution over such objects. The exact p-value is calculated from the distribution of the observed overlaps. In addition, an interval estimate of the number of distinct objects, when all objects are equally likely, is given.
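The paper computes the overlap distribution exactly; as an illustration only, the same tail probability can be approximated by Monte Carlo. The sample sizes and population size N below are hypothetical.

```python
import random

def overlap_count(s1, s2):
    """Number of distinct objects appearing in both samples."""
    return len(set(s1) & set(s2))

def mc_pvalue(n1, n2, N, observed, n_sim=20000, seed=0):
    """Monte Carlo tail probability P(overlap >= observed) when both samples
    are drawn uniformly (with replacement) from the same N equally likely
    objects; the paper computes this distribution exactly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        a = [rng.randrange(N) for _ in range(n1)]
        b = [rng.randrange(N) for _ in range(n2)]
        if overlap_count(a, b) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)
```

For two samples of 10 from 1000 equally likely objects, the expected overlap is about 0.1, so observing 5 shared objects is strong evidence against a population that large.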
A few theoretical results for Laplace and arctan penalized ordinary least squares linear regression estimators
John M and Vettam S
Two new nonconvex penalty functions, Laplace and arctan, were recently introduced in the literature to obtain sparse models for high-dimensional statistical problems. In this paper, we study the theoretical properties of Laplace- and arctan-penalized ordinary least squares linear regression models. We first illustrate the near-unbiasedness of the nonzero regression weights obtained by the new penalty functions in the orthonormal design case. In the general design case, we present theoretical results in two asymptotic settings: (a) the number of features p fixed while the sample size n tends to infinity, and (b) both p and n tending to infinity. The theoretical results shed light on the differences between the solutions based on the new penalty functions and those based on existing convex and nonconvex Bridge penalty functions. Our theory also shows that both the Laplace and arctan penalties satisfy the oracle property. Finally, we present results from a brief simulation study illustrating the performance of the Laplace and arctan penalties based on the gradient descent optimization algorithm.
A Hybrid Method for Density Power Divergence Minimization with Application to Robust Univariate Location and Scale Estimation
Anum AT and Pokojovy M
We develop a new globally convergent optimization method for solving a constrained minimization problem underlying the minimum density power divergence estimator for univariate Gaussian data in the presence of outliers. Our hybrid procedure combines the classical Newton's method with a gradient descent iteration equipped with a step control mechanism based on Armijo's rule to ensure global convergence. Extensive simulations comparing the resulting estimation procedure with a prominent robust competitor, the Minimum Covariance Determinant (MCD) estimator, across a wide range of breakdown point values suggest improved efficiency of our method. An application to estimation and inference for a real-world dataset is also given.
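A minimal one-dimensional sketch of the hybrid idea, with Newton steps safeguarded by Armijo backtracking; the objective below is a toy nonconvex function, not the density power divergence.

```python
def hybrid_minimize(f, grad, hess, x0, tol=1e-10, max_iter=200,
                    c1=1e-4, shrink=0.5):
    """Minimize a smooth scalar function: take Newton's step when the local
    curvature is positive, otherwise fall back to the gradient direction;
    either way, backtrack until Armijo's sufficient-decrease rule holds."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            break
        h = hess(x)
        step = -g / h if h > 0 else -g      # Newton step when curvature is usable
        t = 1.0
        # Armijo rule: f(x + t*step) <= f(x) + c1 * t * grad(x) * step
        while f(x + t * step) > f(x) + c1 * t * g * step:
            t *= shrink
            if t < 1e-12:
                break
        x = x + t * step
    return x

# Toy nonconvex objective with two local minima
f = lambda x: x**4 - 3 * x**2 + x
g = lambda x: 4 * x**3 - 6 * x + 1
h = lambda x: 12 * x**2 - 6
xmin = hybrid_minimize(f, g, h, x0=2.0)
```

The step control keeps iterates from overshooting where the Newton model is poor, which is the mechanism behind the global convergence claim.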
Estimating Transition Intensity Rate on Interval-censored Data Using Semi-parametric with EM Algorithm Approach
Qian C, Srivastava DK, Pan J, Hudson MM and Rai SN
Phase IV clinical trials are designed to monitor long-term side effects of medical treatment. For instance, childhood cancer survivors treated with chest radiation and/or anthracycline are often at risk of developing cardiotoxicity during adulthood. Often the primary focus of a study is estimating the cumulative incidence of a particular outcome of interest, such as cardiotoxicity. However, it is challenging to evaluate patients continuously, and this information is usually collected through cross-sectional surveys while following patients longitudinally. This leads to interval-censored data, since the exact time of onset of the toxicity is unknown. Rai et al. computed the transition intensity rate using a parametric model and estimated the parameters using a maximum likelihood approach in an illness-death model. However, such an approach may not be suitable if the underlying parametric assumptions do not hold. This manuscript proposes a semi-parametric model, with a logit relationship for the transition intensities in the two groups, to estimate the transition intensity rates within the context of an illness-death model. The parameters are estimated using an EM algorithm with profile likelihood. Results from the simulation studies suggest that the proposed approach is easy to implement and yields results comparable to the parametric model.
Exact group sequential designs for two-arm experiments with Poisson distributed outcome variables
Grayling MJ, Wason JMS and Mander AP
We describe and compare two methods for the group sequential design of two-arm experiments with Poisson distributed data, which are based on a normal approximation and exact calculations respectively. A framework to determine near-optimal stopping boundaries is also presented. Using this framework, for a considered example, we demonstrate that a group sequential design could reduce the expected sample size under the null hypothesis by as much as 44% compared to a fixed sample approach. We conclude with a discussion of the advantages and disadvantages of the two presented procedures.
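For intuition about where the expected sample size savings come from, here is a simulation sketch of a two-stage design using a normal-approximation statistic for the Poisson rate difference. The stopping boundaries below are illustrative, not the near-optimal boundaries from the paper's framework.

```python
import numpy as np

def two_stage_ess_h0(lam, n_per_stage, e1, f1, n_sim=20000, seed=0):
    """Expected per-arm sample size under H0 (equal Poisson rates) for a
    two-stage design: after stage 1, stop for efficacy if |Z| >= e1 or for
    futility if |Z| < f1; otherwise continue to stage 2.
    Z = (X - Y) / sqrt(X + Y) is the normal-approximation statistic for the
    difference of Poisson totals X and Y."""
    rng = np.random.default_rng(seed)
    x = rng.poisson(lam, size=(n_sim, n_per_stage)).sum(axis=1)
    y = rng.poisson(lam, size=(n_sim, n_per_stage)).sum(axis=1)
    z1 = (x - y) / np.sqrt(np.maximum(x + y, 1))
    stop = (np.abs(z1) >= e1) | (np.abs(z1) < f1)
    n_used = np.where(stop, n_per_stage, 2 * n_per_stage)
    return float(n_used.mean())

ess = two_stage_ess_h0(lam=1.0, n_per_stage=50, e1=2.8, f1=1.0)
# Under H0 most trials stop at stage 1, so the ESS is well below the fixed 100
```

With these boundaries roughly two-thirds of null trials stop early, illustrating how a group sequential design cuts the expected sample size relative to a fixed-sample design.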
Inference for sparse linear regression based on the leave-one-covariate-out solution path
Cao X, Gregory K and Wang D
We propose a new measure of variable importance in high-dimensional regression based on the change in the LASSO solution path when one covariate is left out. The proposed procedure provides a novel way to calculate variable importance and to conduct variable screening. In addition, our procedure allows for the construction of p-values for testing whether each coefficient is equal to zero, as well as for testing hypotheses involving multiple regression coefficients simultaneously; bootstrap techniques are used to construct the null distribution. For low-dimensional linear models, our method can achieve higher power than the t-test. Extensive simulations are provided to show the effectiveness of our method. In the high-dimensional setting, our proposed solution-path-based test achieves greater power than some other recently developed high-dimensional inference methods. We extend our method to logistic regression and demonstrate in simulation that our leave-one-covariate-out solution path tests can provide accurate p-values.
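A fixed-lambda simplification of the leave-one-covariate-out idea: the paper works with the whole LASSO solution path, whereas the sketch below compares the full-data fit to the fit with one column removed at a single penalty level. The solver and the importance measure are illustrative assumptions.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    """Lasso via cyclic coordinate descent with soft-thresholding
    (minimizes 0.5*||y - X b||^2 + lam*||b||_1)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]      # partial residual
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return beta

def loco_importance(X, y, lam):
    """Importance of covariate j: distance between the fitted values of the
    full-data lasso and the lasso refit with column j removed."""
    p = X.shape[1]
    full = X @ lasso_cd(X, y, lam)
    out = np.empty(p)
    for j in range(p):
        keep = [k for k in range(p) if k != j]
        out[j] = np.linalg.norm(full - X[:, keep] @ lasso_cd(X[:, keep], y, lam))
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)   # only covariate 0 matters
imp = loco_importance(X, y, lam=5.0)
```

Leaving out the truly active covariate changes the fit drastically, while leaving out a noise covariate barely moves it, which is the screening signal the paper exploits.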
Efficient inferences for linear transformation models with doubly censored data
Choi S and Huang X
Doubly-censored data, which consist of exact and case-1 interval-censored observations, often arise in medical studies, such as HIV/AIDS clinical trials. This article considers nonparametric maximum likelihood estimation (NPMLE) of semiparametric transformation models that encompass the proportional hazards and proportional odds models when data are subject to double censoring. The maximum likelihood estimator is obtained by directly maximizing a nonparametric likelihood concerning a regression parameter and a nuisance function parameter, which facilitates efficient and reliable computation. Statistical inferences can be conveniently made from the inverse of the observed information matrix. The estimator is shown to be consistent and asymptotically normal. The limiting variances for the estimators can be consistently estimated. Simulation studies demonstrate that the NPMLE works well even under a heavy censoring scheme and substantially outperforms methods based on estimating functions in terms of efficiency. The method is illustrated through an application to a data set from an AIDS clinical trial.
Estimating Time-Varying Treatment Switching Effect Using Accelerated Failure Time Model with Application to Vascular Access for Hemodialysis
Chu FI and Wang Y
Vascular access for hemodialysis is of paramount importance. Although studies have found that central venous catheters (CVC) are often associated with poor outcomes and that switching to arteriovenous fistulas (AVF) or arteriovenous grafts (AVG) is beneficial, it has not been fully elucidated how the effect of switching access on outcomes changes over time and whether the effect depends on the switching time. In this paper we propose to relate the observed survival time for patients without an access change and the counterfactual time for patients with an access change using an AFT model with time-varying effects. The flexibility of the AFT model allows us to account for the baseline effect and the prognostic effect of covariates at the access change while estimating the effect of the access change. The effect of access change over time is modeled nonparametrically using a cubic spline function. Simulation studies show excellent performance. Our methods are applied to investigate the effect of vascular access change over time in dialysis patients. It is concluded that the benefit of switching from CVC to AVG depends on the time of switching: the sooner, the better.
A Class of Additive Transformation Models for Recurrent Gap Times
Chen L, Feng Y and Sun J
The gap time between recurrent events is often of primary interest in many fields, such as medical studies (Cook and Lawless 2007; Kang, Sun, and Zhao 2015; Schaubel and Cai 2004). In this paper, we discuss regression analysis of gap times arising from a general class of additive transformation models. For this problem, we propose two estimation procedures, the modified within-cluster resampling (MWCR) method and the weighted risk-set (WRS) method, and the proposed estimators are shown to be consistent and asymptotically normal. In particular, the estimators have closed forms and can be easily computed, and the methods have the advantage of leaving the correlation among gap times arbitrary. A simulation study assessing the finite-sample performance of the presented methods suggests that they work well in practical situations. The methods are also applied to a set of real data from a chronic granulomatous disease (CGD) clinical trial.
A note on semiparametric efficient generalization of causal effects from randomized trials to target populations
Li F, Hong H and Stuart EA
When effect modifiers influence the decision to participate in randomized trials, generalizing causal effect estimates to an external target population requires the knowledge of two scores - the propensity score for receiving treatment and the sampling score for trial participation. While the former score is known due to randomization, the latter score is usually unknown and estimated from data. Under unconfounded trial participation, we characterize the asymptotic efficiency bounds for estimating two causal estimands - the population average treatment effect and the average treatment effect among the non-participants - and examine the role of the scores. We also study semiparametric efficient estimators that directly balance the weighted trial sample toward the target population, and illustrate their operating characteristics via simulations.
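A sketch of inverse-probability-of-sampling weighting for the population average treatment effect, with the sampling score assumed known rather than estimated; the semiparametric efficient estimators in the paper additionally balance the weighted trial sample toward the target population. The data-generating setup below is hypothetical.

```python
import numpy as np

def ipsw_pate(y, a, s_score):
    """Hajek-type inverse-probability-of-sampling-weighted estimate of the
    population average treatment effect: trial observations are weighted by
    1 / (sampling score), within each randomized arm."""
    w = 1.0 / s_score
    t = a == 1
    mu1 = np.sum(w[t] * y[t]) / np.sum(w[t])
    mu0 = np.sum(w[~t] * y[~t]) / np.sum(w[~t])
    return mu1 - mu0

rng = np.random.default_rng(0)
x = rng.normal(size=200000)                 # effect modifier in the population
s = 1.0 / (1.0 + np.exp(-x))                # sampling score: P(trial | x)
in_trial = rng.uniform(size=x.size) < s
xt, st = x[in_trial], s[in_trial]
a = (rng.uniform(size=xt.size) < 0.5).astype(int)     # 1:1 randomization
y = a * xt + rng.normal(scale=0.5, size=xt.size)      # treatment effect = x
naive = y[a == 1].mean() - y[a == 0].mean()           # trial-only estimate
pate = ipsw_pate(y, a, st)                            # weighted estimate
```

Because trial participation selects on the effect modifier, the naive trial contrast is biased for the population effect (here zero), while the sampling-score weighting removes the bias.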
Estimation of Multi-state Models with Missing Covariate Values Based on Observed Data Likelihood
Lou W, Abner EL, Wan L, Fardo DW, Lipton R, Katz M and Kryscio RJ
Continuous-time multi-state models are commonly used to study diseases with multiple stages. Potential risk factors associated with the disease are added to the transition intensities of the model as covariates, but missing covariate measurements arise frequently in practice. We propose a likelihood-based method that deals efficiently with a missing covariate in these models. Our simulation study showed that the method performs well for both 'missing completely at random' and 'missing at random' mechanisms. We also applied our method to a real dataset, the Einstein Aging Study.
Bayesian Mediation Analysis for Time-to-Event Outcome: Investigating Racial Disparity in Breast Cancer Survival
Yu Q, Cao W, Mercante D and Li B
Mediation analysis is conducted to make inferences on the effects of mediators that intervene in the relationship between an exposure variable and an outcome. Bayesian mediation analysis (BMA) naturally accommodates the hierarchical structure of the effects from the exposure variable to the mediators and then to the outcome. We propose three BMA methods for survival outcomes, in which mediation effects are measured in terms of the hazard rate, the survival time, or the log of the survival time, respectively. In addition, we allow setting a limited survival time in the time-to-event analysis. The methods are validated by comparing their estimation precision across different scenarios through simulations. All three methods give effective estimates. Finally, the methods are applied to the Surveillance, Epidemiology, and End Results Program (SEER) supported special studies to explore the racial disparity in breast cancer survival. The included variables completely explained the observed racial disparities. We provide visual aids to help with the interpretation of results.