Using biomarkers to allocate patients in a response-adaptive clinical trial
In this paper, we discuss a response-adaptive randomization method and explain why it should be preferred to a randomized controlled trial with equal, fixed randomization in clinical trials for rare diseases. The developed method uses a patient's biomarkers to alter the allocation probability for each treatment, in order to increase the benefit to the trial population. The method starts with an initial burn-in period in which a small number of patients are allocated to each treatment with equal probability. We then use a regression method to predict, from the next patient's biomarkers and the information from the previous patients, which treatment gives the best expected outcome. This estimated best treatment is assigned to the next patient with high probability. A completed clinical trial of the effect of catumaxomab on the survival of cancer patients is used to demonstrate the method and its differences from a controlled trial with equal allocation. Different regression procedures are investigated and compared to a randomized controlled trial using efficacy and ethical measures.
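As an illustration, the following is a minimal sketch of the allocation idea only: after a burn-in with equal allocation, fit a working regression per arm to the accumulated biomarker and outcome data, and assign the next patient to the arm with the better predicted outcome with high probability. The linear working model, burn-in size, and allocation probability of 0.9 are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2024)

def allocate_next(x_new, X, y, trt, burn_in=20, p_best=0.9):
    """Choose an arm (0 or 1) for a new patient with biomarker vector x_new.

    X, y, trt hold the biomarkers, outcomes, and arms of previously treated
    patients; larger y is taken to mean a better outcome (illustrative).
    """
    # Burn-in (or too little data on an arm): allocate with equal probability
    if len(y) < burn_in or min(np.sum(trt == 0), np.sum(trt == 1)) < 2:
        return int(rng.integers(0, 2))
    preds = []
    for arm in (0, 1):                          # one working regression per arm
        fit = LinearRegression().fit(X[trt == arm], y[trt == arm])
        preds.append(fit.predict(x_new.reshape(1, -1))[0])
    best = int(np.argmax(preds))                # estimated best treatment
    return best if rng.random() < p_best else 1 - best
```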
A semi-parametric bootstrap-based best linear unbiased estimator of location under symmetry
In this note we provide a novel semi-parametric best linear unbiased estimator (BLUE) of location, and its corresponding variance estimator, under the assumption that the random variable is generated from a symmetric location-scale family of distributions. The approach proceeds in two stages and is based on the exact bootstrap estimate of the covariance matrix of the order statistics. We generalize the approach to add a robustness component, deriving a trimmed BLUE of location under a semi-parametric symmetry assumption.
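To fix ideas, here is a hedged sketch of the generalized least squares (Lloyd-type) form such a location estimator takes: writing \(X_{()} = (X_{(1)},\dots,X_{(n)})^{\top}\) for the vector of order statistics, \(\mathbf{1}\) for a vector of ones, and \(\hat\Sigma\) for the exact bootstrap estimate of the covariance matrix of the order statistics,
\[
\hat{\mu} \;=\; \frac{\mathbf{1}^{\top}\hat{\Sigma}^{-1}X_{()}}{\mathbf{1}^{\top}\hat{\Sigma}^{-1}\mathbf{1}},
\qquad
\widehat{\operatorname{Var}}(\hat{\mu}) \;=\; \frac{1}{\mathbf{1}^{\top}\hat{\Sigma}^{-1}\mathbf{1}}.
\]
This is only the familiar BLUE template; the note's two-stage construction, its use of the symmetry assumption, and the trimmed variant may differ in detail.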
Comparing the performance of statistical methods that generalize effect estimates from randomized controlled trials to much larger target populations
Policymakers use results from randomized controlled trials to inform decisions about whether to implement treatments in target populations. Various methods that use baseline data available in both the trial and the target population, including inverse probability weighting, outcome modeling, and Targeted Maximum Likelihood Estimation, have been proposed to generalize the trial treatment effect estimate to the target population. Often the target population is substantially larger than the trial sample, which can cause estimation challenges. We conduct simulations to compare the performance of these methods in this setting. We vary the size of the target population, the proportion of the target population selected into the trial, and the complexity of the true selection and outcome models. All methods performed poorly when the trial size was only 2% of the target population size or the target population included only 1,000 units. When the target population or the proportion of units selected into the trial was larger, some methods, such as outcome modeling using Bayesian Additive Regression Trees, performed well. We caution against generalizing with these existing approaches when the target population is much larger than the trial sample and advocate that future research strive to improve methods for generalizing to large target populations.
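As a concrete illustration of one of the compared approaches, the sketch below shows inverse probability (of trial participation) weighting: a logistic selection model is fitted on the stacked trial and target-population data, and trial units are weighted by the inverse of their estimated participation probability. The logistic selection model, variable names, and the treatment of the target data as non-overlapping with the trial are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_generalized_ate(X_trial, treat, y, X_target):
    """Inverse-probability-of-selection-weighted estimate of the average
    treatment effect in the target population (illustrative sketch)."""
    # Stack covariates; S = 1 for trial units, 0 for target-population units
    X = np.vstack([X_trial, X_target])
    S = np.concatenate([np.ones(len(X_trial)), np.zeros(len(X_target))])
    sel = LogisticRegression(max_iter=1000).fit(X, S)
    p = sel.predict_proba(X_trial)[:, 1]          # P(selected | covariates)
    w = 1.0 / p                                   # selection weights for trial units
    w1, w0 = w * (treat == 1), w * (treat == 0)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)
```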
A New Non-Linear Conjugate Gradient Algorithm for Destructive Cure Rate Model and a Simulation Study: Illustration with Negative Binomial Competing Risks
In this paper, we propose a new estimation methodology based on a projected non-linear conjugate gradient (PNCG) algorithm with an efficient line search technique. We develop a general PNCG algorithm for a survival model incorporating a cured proportion under a competing risks setup, where the initial competing risks are exposed to elimination after an initial treatment (known as destruction). In the literature, the expectation maximization (EM) algorithm has been widely used to estimate the parameters of such a model. Through an extensive Monte Carlo simulation study, we compare the performance of the proposed PNCG algorithm with that of the EM algorithm and show the advantages of the proposed method. Through simulation, we also show the advantages of the proposed methodology over other optimization algorithms (including other conjugate gradient type methods) readily available as R software packages. For these comparisons, we assume the initial number of competing risks to follow a negative binomial distribution, although the general algorithm allows one to work with any competing risks distribution. Finally, we apply the proposed algorithm to analyze a well-known melanoma data set.
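For orientation, the following is a generic sketch of a projected nonlinear conjugate gradient iteration with a backtracking (Armijo-type) line search and a Polak-Ribiere-plus update, applied to an arbitrary smooth objective with box constraints; it is not the authors' algorithm, line search, or the destructive cure rate likelihood, and the objective and bounds are placeholders.

```python
import numpy as np

def pncg(f, grad, x0, lower, upper, max_iter=500, tol=1e-8):
    """Generic projected nonlinear conjugate gradient with a backtracking
    (Armijo) line search, minimizing f subject to lower <= x <= upper."""
    proj = lambda x: np.clip(x, lower, upper)
    x = proj(np.asarray(x0, dtype=float))
    g = grad(x)
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        # Backtracking line search along the projected path
        alpha, fx = 1.0, f(x)
        while f(proj(x + alpha * d)) > fx + 1e-4 * alpha * np.dot(g, d) and alpha > 1e-12:
            alpha *= 0.5
        x_new = proj(x + alpha * d)
        g_new = grad(x_new)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        # Polak-Ribiere-plus conjugate direction with automatic restart
        beta = max(0.0, np.dot(g_new, g_new - g) / np.dot(g, g))
        d = -g_new + beta * d
        if np.dot(d, g_new) >= 0:        # not a descent direction: restart
            d = -g_new
        x, g = x_new, g_new
    return x
```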
A note on the estimation and inference with quadratic inference functions for correlated outcomes
The quadratic inference function approach is a popular method for the analysis of correlated data. The quadratic inference function is formulated from multiple sets of score equations (or extended score equations) that over-identify the regression parameters of interest, and it improves efficiency over the generalized estimating equations under correlation misspecification. In this note, we provide an alternative solution to the quadratic inference function by separately solving each set of score equations and combining the solutions. We show that an optimally weighted combination of estimators obtained separately from the distinct sets of score equations is asymptotically equivalent to the estimator obtained via the quadratic inference function. We further establish results on inference for the optimally weighted estimator and extend these insights to the general setting with over-identified estimating equations. A simulation study confirms the analytical insights and connections in finite samples.
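As a hedged illustration of the weighting involved (a standard GLS-type pooling, not necessarily the note's exact derivation): stacking the separate solutions \(\hat\theta = (\hat\theta_1^{\top},\dots,\hat\theta_K^{\top})^{\top}\) with joint asymptotic covariance matrix \(V\) and \(D = (I_p,\dots,I_p)^{\top}\), the optimally weighted combination is
\[
\hat{\theta}_{\mathrm{opt}} \;=\; \big(D^{\top} V^{-1} D\big)^{-1} D^{\top} V^{-1} \hat{\theta},
\]
the minimum-variance linear combination whose weight matrices sum to the identity; in practice \(V\) is replaced by a consistent estimate.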
Analysis of combined probability and nonprobability samples: A simulation evaluation and application to a teen smoking behavior survey
In scientific studies with low-prevalence outcomes, probability sampling may be supplemented by nonprobability sampling to boost the sample size of a desired subpopulation while remaining representative of the entire study population. To use both probability and nonprobability samples appropriately, several methods have been proposed in the literature to generate pseudo-weights, including ad-hoc weights, inclusion probability adjusted weights, and propensity score adjusted weights. We empirically compare various weighting strategies via an extensive simulation study in which probability and nonprobability samples are combined. Weight normalization and raking adjustment are also considered. Our simulation results suggest that the unity weight method (with weight normalization) and the inclusion probability adjusted weight method yield very good overall performance. This work is motivated by the Buckeye Teen Health Study, which examines risk factors for the initiation of smoking among teenage males in Ohio. To address the low response rate in the initial probability sample and the low prevalence of smokers in the target population, a small convenience sample was collected as a supplement. The proposed method yields estimates very close to those from the analysis using only the probability sample and enjoys the additional benefit of being able to track more teens with risky behaviors through follow-ups.
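The sketch below illustrates only the basic shape of the unity-weight strategy with weight normalization: nonprobability units receive pseudo-weight one, and the combined weights are rescaled. The normalization target shown (preserving the probability sample's weight total) is an assumption for illustration; the paper's normalization and subsequent raking adjustment may differ.

```python
import numpy as np

def unity_pseudo_weights(design_w_prob, n_nonprob):
    """Unity pseudo-weights for the nonprobability sample, normalized so the
    combined weights preserve the probability sample's weight total
    (illustrative normalization; details may differ from the paper)."""
    w_prob = np.asarray(design_w_prob, dtype=float)   # survey design weights
    w_nonprob = np.ones(n_nonprob)                    # unity pseudo-weights
    w_all = np.concatenate([w_prob, w_nonprob])
    w_all *= w_prob.sum() / w_all.sum()               # weight normalization
    return w_all

def weighted_mean(y_combined, w_combined):
    """Weighted estimate of a population mean from the combined sample."""
    return np.sum(w_combined * y_combined) / np.sum(w_combined)
```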
A Bayesian Model for Spatial Partly Interval-Censored Data
Partly interval-censored data often occur in cancer clinical trials and have been analyzed as right-censored data. Patients' geographic information is sometimes also available and can be useful in testing treatment effects and predicting survivorship. We propose a Bayesian semiparametric method for analyzing partly interval-censored data with areal spatial information under the proportional hazards model. A simulation study is conducted to compare the performance of the proposed method with the main method currently available in the literature and with the traditional Cox proportional hazards model for right-censored data. The method is illustrated with a leukemia survival data set and a dental health data set. The proposed method will be especially useful for analyzing progression-free survival in multi-regional cancer clinical trials.
Group Feature Screening via the F Statistic
Feature screening is crucial in the analysis of ultrahigh dimensional data, where the number of variables (features) is of exponential order in the number of observations. In many ultrahigh dimensional data sets, variables are naturally grouped, which gives a good rationale for developing a screening method that uses the joint effect of multiple variables. In this article, we propose a group screening procedure based on the F-test statistic. The proposed method is a direct extension of the original sure independence screening procedure to the case where the group information is known, for example, from prior knowledge. Under certain regularity conditions, we prove that the proposed group screening procedure possesses the sure screening property, selecting all effective groups with probability approaching one at an exponential rate. We use simulations to demonstrate the advantages of the proposed method and show its application in a genome-wide association study. We conclude that the grouping method is very useful in the analysis of ultrahigh dimensional data, as the F-test-based procedure can detect true signals with the desired properties.
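The screening step can be sketched as follows: for each predefined group, regress the response on that group's variables alone, compute the overall F statistic for the group, rank the groups by the statistic, and retain the top ones. The group definitions and retention rule below are illustrative, not the article's exact thresholding.

```python
import numpy as np

def group_f_statistic(X_group, y):
    """Overall F statistic from regressing y on one group of variables."""
    n, p = X_group.shape
    Xd = np.column_stack([np.ones(n), X_group])       # add intercept
    beta, _, _, _ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return ((tss - rss) / p) / (rss / (n - p - 1))

def screen_groups(X, y, groups, keep):
    """Rank groups (dict of name -> column indices) by their F statistics
    and keep the names of the top `keep` groups."""
    stats = {g: group_f_statistic(X[:, idx], y) for g, idx in groups.items()}
    return sorted(stats, key=stats.get, reverse=True)[:keep]
```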
Diagnostics for a two-stage joint survival model
A two-stage joint survival model is used to analyse time-to-event outcomes that may be associated with biomarkers that are repeatedly collected over time. The two-stage joint survival model has limited model-checking tools and is usually assessed with standard diagnostic tools for survival models; these diagnostic tools can be improved. The time-varying covariates in a two-stage joint survival model might contain outlying observations or subjects. In this study we used the variance shift outlier model (VSOM) to detect and down-weight outliers in the first stage of the two-stage joint survival model. This entails fitting a VSOM at the observation level and a VSOM at the subject level, and then fitting a combined VSOM for the identified outliers. The fitted values extracted from the combined VSOM were then used as a time-varying covariate in the extended Cox model. We illustrate this methodology on a dataset from a multi-centre randomised clinical trial. In this trial, the combined VSOM approach fitted the data better than the standard extended Cox model. We note that the combined VSOM achieves the better fit because outliers are down-weighted.
Robust Estimation of Heterogeneous Treatment Effects: An Algorithm-based Approach
Heterogeneous treatment effect estimation is an essential element of tailoring treatment to the characteristics of individual patients. Most existing methods are not sufficiently robust against data irregularities. To enhance the robustness of the existing methods, we recently put forward a general estimating equation that unifies many existing learners. However, the performance of model-based learners depends heavily on the correctness of the underlying treatment effect model. This paper addresses this vulnerability by converting treatment effect estimation into a weighted supervised learning problem. We combine the general estimating equation with supervised learning algorithms, such as the gradient boosting machine, random forest, and artificial neural network, with appropriate modifications. This extension retains the estimators' robustness while enhancing their flexibility and scalability. Simulations show that the algorithm-based estimation methods outperform their model-based counterparts in the presence of nonlinearity and non-additivity. We developed an R package for public access to the proposed methods. To illustrate the methods, we present a real data example comparing the blood pressure-lowering effects of two classes of antihypertensive agents.
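As a concrete, simplified instance of casting treatment effect estimation as a supervised learning problem, the sketch below uses the well-known inverse-probability-weighted transformed outcome, whose conditional mean equals the treatment effect function under unconfoundedness, and fits it with a gradient boosting machine. This illustrates the general idea only; it is not the authors' estimating equation or weighting, and the logistic propensity model is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

def fit_cate_transformed_outcome(X, treat, y):
    """Estimate the conditional average treatment effect tau(x) by regressing
    the IPW-transformed outcome on covariates (illustrative sketch)."""
    ps = LogisticRegression(max_iter=1000).fit(X, treat).predict_proba(X)[:, 1]
    ps = np.clip(ps, 0.01, 0.99)                 # guard against extreme weights
    y_star = treat * y / ps - (1 - treat) * y / (1 - ps)
    # E[y_star | X] = tau(X), so any flexible regressor can be plugged in
    return GradientBoostingRegressor().fit(X, y_star)

# cate_model.predict(X_new) then returns estimated individual treatment effects
```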
Causal inference with a mediated proportional hazards regression model
The natural direct and indirect effects in causal mediation analysis with survival data and a single mediator are addressed by VanderWeele (2011) [1], who derived an approach for (1) an accelerated failure time regression model in general cases and (2) a proportional hazards regression model when the time-to-event outcome is rare. When the outcome is not rare, VanderWeele (2011) [1] did not derive a simple closed-form expression for the log natural direct and log natural indirect effects under the proportional hazards regression model, because the baseline cumulative hazard function does not approach zero. We develop two approaches that extend VanderWeele's approach and do not require the rare-outcome assumption. We obtain the natural direct and indirect effects at specific time points through numerical integration after calculating the cumulative baseline hazard by (1) applying the Breslow method in the Cox proportional hazards regression model to estimate the unspecified cumulative baseline hazard, or (2) assuming a piecewise constant baseline hazard model, yielding a parametric model, to estimate the baseline hazard and cumulative baseline hazard. We conduct simulation studies to compare our two approaches with other methods and illustrate both approaches by applying them to data from the ASsessment, Serial Evaluation, and Subsequent Sequelae in Acute Kidney Injury (ASSESS-AKI) Consortium.
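The first approach uses the standard Breslow form of the cumulative baseline hazard: with event times \(t_i\), event indicators \(\delta_i\), risk sets \(R(t_i)\), and fitted Cox coefficients \(\hat\beta\),
\[
\hat{\Lambda}_0(t) \;=\; \sum_{i:\, t_i \le t} \frac{\delta_i}{\sum_{j \in R(t_i)} \exp\!\big(x_j^{\top}\hat{\beta}\big)},
\]
which is then carried into the numerical integration that yields the natural direct and indirect effects at the chosen time points.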
A weighted Jackknife approach utilizing linear model based-estimators for clustered data
A small number of clusters combined with cluster-level heterogeneity poses a great challenge for data analysis. We previously published a weighted Jackknife approach to address this issue, using weighted cluster means as the basic estimators. The current study proposes a new version of the weighted delete-one-cluster Jackknife analytic framework, which employs Ordinary Least Squares or Generalized Least Squares estimators as the fundamental building blocks. Algorithms for computing the estimated variances of the study estimators are also derived. Wald test statistics can then be obtained, and the statistical comparison of the outcome means between two conditions is carried out using a cluster permutation procedure. Simulation studies show that the proposed framework produces estimates with higher precision and improved power for statistical hypothesis testing compared to other methods.
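For reference, a minimal sketch of the delete-one-cluster jackknife applied to an OLS coefficient vector is given below; the proposed framework additionally weights the deleted-cluster estimates and supports GLS, which is omitted here for brevity.

```python
import numpy as np

def ols(X, y):
    Xd = np.column_stack([np.ones(len(y)), X])      # add intercept
    return np.linalg.lstsq(Xd, y, rcond=None)[0]

def delete_one_cluster_jackknife(X, y, cluster):
    """Delete-one-cluster jackknife variance for OLS coefficients
    (unweighted sketch; the weighted version in the paper differs)."""
    beta_full = ols(X, y)
    groups = np.unique(cluster)
    G = len(groups)
    betas = np.array([ols(X[cluster != g], y[cluster != g]) for g in groups])
    var = (G - 1) / G * np.sum((betas - betas.mean(axis=0)) ** 2, axis=0)
    return beta_full, var          # full-sample estimate and jackknife variance
```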
The generalized sigmoidal quantile function
In this note we introduce a new smooth nonparametric quantile function estimator, termed the sigmoidal quantile function estimator, based on a newly defined generalized expectile function. We also introduce a hybrid quantile function estimator, which combines the optimal properties of the classic kernel quantile function estimator with our new generalized sigmoidal quantile function estimator. The generalized sigmoidal quantile function can estimate quantiles beyond the range of the data, which is important for certain applications with smaller sample sizes. We illustrate this extrapolation property by using it to improve standard smoothed bootstrap resampling methods.
Distribution Theory Following Blinded and Unblinded Sample Size Re-estimation under Parametric Models
Asymptotic distribution theory for maximum likelihood estimators under fixed alternative hypotheses is reported in the literature, even though the power of any realistic test converges to one under fixed alternatives. Under fixed alternatives, authors have established that nuisance parameter estimates are inconsistent when sample size re-estimation (SSR) follows blinded randomization, and these results have helped to inhibit the use of SSR. In this paper, we argue that local alternatives should be used instead of fixed alternatives. Motivated by Gould and Shih (1998), we treat the unavailable treatment assignments in blinded experiments as missing data and rely on single imputation from marginal distributions to fill in the missing data. With local alternatives, it is sufficient to proceed only with the first step of the EM algorithm, mimicking imputation under the null hypothesis. We then show that blinded and unblinded estimates of the nuisance parameter are consistent and that re-estimated sample sizes converge to their locally asymptotically optimal values. This theoretical finding is confirmed through Monte Carlo simulation studies, and practical utility is illustrated through a multiple logistic regression example. We conclude that, for hypothesis testing with a predetermined minimally clinically relevant local effect size, blinded and unblinded SSR procedures lead to similar sample sizes and power.
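For context, the usual normal-theory re-estimation step on which such procedures build (stated here as a standard formula, not the paper's derivation): with a minimally clinically relevant effect size \(\delta\), two-sided level \(\alpha\), target power \(1-\beta\), and a blinded or unblinded nuisance variance estimate \(\hat\sigma^2\), the re-estimated per-arm sample size is
\[
n \;=\; \frac{2\,\big(z_{1-\alpha/2} + z_{1-\beta}\big)^{2}\,\hat{\sigma}^{2}}{\delta^{2}}.
\]
The paper's results concern the large-sample behaviour of \(\hat\sigma^2\), and hence of this \(n\), under local alternatives.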
Sensitivity analysis for assumptions of general mediation analysis
Mediation analysis is widely used to identify significant mediators and to estimate the mediation (direct and indirect) effects in causal pathways between an exposure variable and a response variable. In mediation analysis, the mediation effect refers to the effect transmitted by a mediator intervening in the relationship between an exposure variable and a response variable. Traditional mediation analysis methods, such as the difference-in-coefficients method, the product-of-coefficients method, and the counterfactual framework method, all require several key assumptions, so the estimation of mediation effects can be biased when one or more assumptions are violated. In addition to the traditional mediation analysis methods, Yu et al. proposed a general mediation analysis method that can use general predictive models to estimate mediation effects for any type of exposure variable(s), mediator(s), and outcome(s). However, whether this method relies on the assumptions required by the traditional mediation analysis methods is unknown. In this paper, we perform a series of simulation studies to investigate the impact of violating these assumptions on the estimation of mediation effects using Yu et al.'s mediation analysis method. We use the corresponding R package for all estimations. We find that three assumptions for traditional mediation analysis methods are also essential for Yu et al.'s method. This paper provides a pipeline for using simulations to evaluate the impact of the assumptions underlying the general mediation analysis.
Parallel tempering strategies for model-based landmark detection on shapes
In the field of shape analysis, landmarks are defined as a low-dimensional, representative set of important features of an object's shape that can be used to identify regions of interest along its outline. An important problem is to infer the number and arrangement of landmarks, given a set of shapes drawn from a population. One proposed approach defines a posterior distribution over landmark locations by associating each landmark configuration with a linear reconstruction of the shape. In practice, sampling from the resulting posterior density is challenging with standard Markov chain Monte Carlo (MCMC) methods because multiple configurations of landmarks can describe a complex shape similarly well, manifesting in a multi-modal posterior with well-separated modes. Standard MCMC methods traverse multi-modal posteriors poorly and, even when multiple modes are identified, the relative amount of time spent in each one can be misleading. We apply recent advances in the parallel tempering literature to the problem of landmark detection, providing implementation guidance that generalizes to other applications within shape analysis. Proposal adaptation is used during burn-in to ensure efficient traversal of the parameter space while maintaining computational efficiency. We demonstrate the algorithm on simulated data and on common shapes obtained from computer vision scenes.
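For readers unfamiliar with the machinery, here is a generic parallel tempering sketch on a toy multi-modal target: each chain runs random-walk Metropolis on a tempered version of the log-density, and adjacent chains periodically propose state swaps. The temperature ladder, proposal scales, and bimodal target are illustrative, and the proposal adaptation described in the paper is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    """Toy bimodal target: mixture of two well-separated normals."""
    return np.logaddexp(-0.5 * (x + 4.0) ** 2, -0.5 * (x - 4.0) ** 2)

def parallel_tempering(n_iter=5000, betas=(1.0, 0.5, 0.25, 0.1), step=1.0):
    K = len(betas)
    x = np.zeros(K)                            # one state per temperature
    samples = []
    for _ in range(n_iter):
        # Within-chain random-walk Metropolis at each inverse temperature
        for k in range(K):
            prop = x[k] + step / betas[k] * rng.normal()
            if np.log(rng.random()) < betas[k] * (log_target(prop) - log_target(x[k])):
                x[k] = prop
        # Propose a swap between a random pair of adjacent chains
        k = rng.integers(0, K - 1)
        log_acc = (betas[k] - betas[k + 1]) * (log_target(x[k + 1]) - log_target(x[k]))
        if np.log(rng.random()) < log_acc:
            x[k], x[k + 1] = x[k + 1], x[k]
        samples.append(x[0])                   # keep only the cold chain
    return np.array(samples)
```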
Optimal Personalized Treatment Selection with Multivariate Outcome Measures in a Multiple Treatment Case
In this work we propose a novel method for individualized treatment selection when there are correlated multiple treatment responses. For settings with two or more treatments, we compare suitable indexes based on the outcome variables for each treatment, conditional on patient-specific scores constructed from the collected covariate measurements. Our method covers any number of treatments and outcome variables and can be applied to a broad set of models. The proposed method uses a rank aggregation technique that takes into account possible correlations among the ranked lists to estimate an ordering of treatments based on treatment performance measures such as the smooth conditional mean. The method has the flexibility to incorporate patient and clinician preferences into the optimal treatment decision on an individual basis. A simulation study demonstrates the performance of the proposed method in finite samples. We also present analyses of HIV clinical trial data to show the applicability of the proposed procedure to real data.
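A deliberately simplified sketch of the aggregation step: rank the treatments separately by each outcome-specific performance index evaluated at a patient's covariate value, then combine the ranked lists. An unweighted Borda-style mean rank is shown below, whereas the paper uses a weighted rank aggregation that accounts for correlation among the lists.

```python
import numpy as np

def borda_aggregate(score_matrix, larger_is_better=True):
    """Aggregate treatment rankings across outcomes by mean rank.

    score_matrix: array of shape (n_outcomes, n_treatments) holding a
    patient-specific performance index (e.g., a smoothed conditional mean)
    for each treatment under each outcome.
    Returns treatment indices ordered from best to worst.
    """
    s = np.asarray(score_matrix, dtype=float)
    if larger_is_better:
        s = -s
    ranks = s.argsort(axis=1).argsort(axis=1) + 1     # rank within each outcome
    return np.argsort(ranks.mean(axis=0))             # best treatment first

# Example: 3 outcomes, 4 treatments for one patient (hypothetical values)
scores = np.array([[0.8, 0.6, 0.7, 0.5],
                   [0.4, 0.9, 0.6, 0.3],
                   [0.7, 0.7, 0.8, 0.2]])
print(borda_aggregate(scores))    # ordering of treatments, best first
```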
Multi-Response Based Personalized Treatment Selection with Data from Crossover Designs for Multiple Treatments
In this work we propose a novel method for treatment selection based on individual covariate information when the treatment response is multivariate and data are available from a crossover design. Our method covers any number of treatments and it can be applied for a broad set of models. The proposed method uses a rank aggregation technique to estimate an ordering of treatments based on ranked lists of treatment performance measures such as smooth conditional means and conditional probability of a response for one treatment dominating others. An empirical study demonstrates the performance of the proposed method in finite samples.
Robust RNA-seq data analysis using an integrated method of ROC curve and Kolmogorov-Smirnov test
Dichotomizing a continuous biomarker is a common approach in clinical settings for convenience of application. Analytically, results from using a dichotomized biomarker are often more reliable and resistant to outliers, bimodal distributions, and other unknown distributions. There are two commonly used methods for selecting the best cut-off value for dichotomizing a continuous biomarker: the maximally selected chi-square statistic and the ROC curve, specifically the Youden index. In this paper, we explain why, in many situations, it is inappropriate to use the former. Using the Maximum Absolute Youden Index (MAYI), we demonstrate that integrating the MAYI with the Kolmogorov-Smirnov test is not only a robust non-parametric method but also provides a more meaningful p-value for selecting the cut-off value than a Mann-Whitney test. In addition, the method can be applied directly in clinical settings.
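A minimal sketch of the two pieces being integrated: the Maximum Absolute Youden Index locates the cut-off where |sensitivity + specificity - 1| is largest, which is also the point of maximum separation between the two groups' empirical distribution functions, and the two-sample Kolmogorov-Smirnov test supplies a p-value for that separation. Inputs are assumed to be a numeric marker and a 0/1 outcome array; this is an illustration, not the paper's full procedure.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_curve

def mayi_cutoff(marker, outcome):
    """Cut-off maximizing the absolute Youden index |TPR - FPR|, together
    with the two-sample Kolmogorov-Smirnov test between outcome groups."""
    fpr, tpr, thresholds = roc_curve(outcome, marker)
    j = np.abs(tpr - fpr)                      # absolute Youden index
    best = np.argmax(j)
    ks_stat, p_value = ks_2samp(marker[outcome == 1], marker[outcome == 0])
    return thresholds[best], j[best], ks_stat, p_value
```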
Marginal probabilities and point estimation for conditionally specified logistic regression
Conditionally specified logistic regression (CSLR) models binary response variables. We show that marginal probabilities can be derived for a CSLR model. We also extend the CSLR model by allowing third-order interactions. We apply two versions of CSLR to simulated data and a set of real data, and compare the results with those from other modeling methods.
Adjusted curves for clustered survival and competing risks data
Observational studies with right-censored data often involve clustering due to matched pairs or a study-center effect. In such data, there may be an imbalance in patient characteristics between treatment groups, in which case Kaplan-Meier curves or unadjusted cumulative incidence curves can be misleading and may not represent the average patient on a given treatment arm. Adjusted curves are desirable to appropriately display survival or cumulative incidence in this case. We propose methods for estimating adjusted survival and cumulative incidence probabilities for clustered right-censored data. For the competing risks outcome, we allow both covariate-independent and covariate-dependent censoring. We develop an R package that implements the proposed methods and provides the estimates of adjusted survival and cumulative incidence probabilities along with their standard errors. Our simulation results show that the adjusted survival and cumulative incidence estimates from the proposed method are unbiased with approximately 95% coverage rates. We apply the proposed method to stem cell transplant data for leukemia patients.
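To convey the direct-adjustment idea behind adjusted curves, the sketch below fits a Cox model with treatment and covariates, predicts each patient's survival curve with treatment set to each arm in turn, and averages the predicted curves. It deliberately ignores the clustering and the variance estimation that are the paper's actual contribution, and it uses the lifelines package and generic column names purely for illustration.

```python
import pandas as pd
from lifelines import CoxPHFitter

def direct_adjusted_survival(df, duration_col="time", event_col="status",
                             treatment_col="treatment"):
    """Covariate-adjusted survival curve per treatment arm, obtained by
    averaging model-based predictions over the sample (clustering ignored)."""
    cph = CoxPHFitter()
    cph.fit(df, duration_col=duration_col, event_col=event_col)
    covariates = df.drop(columns=[duration_col, event_col])
    curves = {}
    for arm in sorted(df[treatment_col].unique()):
        X = covariates.copy()
        X[treatment_col] = arm                    # set everyone to this arm
        surv = cph.predict_survival_function(X)   # one column per patient
        curves[arm] = surv.mean(axis=1)           # average over patients
    return pd.DataFrame(curves)
```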