Real Effect or Bias? Good Practices for Evaluating the Robustness of Evidence From Comparative Observational Studies Through Quantitative Sensitivity Analysis for Unmeasured Confounding
The assumption of "no unmeasured confounders" is critical for causal inference but unverifiable, yet quantitative sensitivity analyses to assess the robustness of real-world evidence remain under-utilized. This lack of use is likely due in part to the complexity of implementation and to the specific, often restrictive data requirements of each method. With the advent of broadly applicable methods that do not require identification of a specific unmeasured confounder, along with publicly available code for implementation, roadblocks to broader use of sensitivity analyses are decreasing. To spur greater application, we offer good practice guidance for addressing the potential for unmeasured confounding at both the design and analysis stages, including framing questions and an analytic toolbox for researchers. The questions at the design stage guide the researcher through steps for evaluating the potential robustness of the design while encouraging the gathering of additional data to reduce uncertainty due to potential confounding. At the analysis stage, the questions guide quantification of the robustness of the observed result, giving researchers a clearer indication of the strength of their conclusions. We demonstrate the application of this guidance using simulated data based on an observational fibromyalgia study, applying multiple methods from our analytic toolbox for illustration.
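One widely used sensitivity analysis of this broadly applicable kind is the E-value of VanderWeele and Ding; the sketch below (not necessarily a method used in the guidance, and with hypothetical numbers) computes it for an observed risk ratio and its confidence limit in base R.

    # E-value: the minimum strength of association, on the risk-ratio scale,
    # that an unmeasured confounder would need with both treatment and outcome
    # to fully explain away the observed association.
    e_value <- function(rr) {
      rr <- ifelse(rr < 1, 1 / rr, rr)   # work with the inverse when RR < 1
      rr + sqrt(rr * (rr - 1))
    }

    est <- 1.8   # hypothetical observed risk ratio
    lcl <- 1.2   # hypothetical lower confidence limit (closer to the null)
    c(evalue_estimate = e_value(est),    # robustness of the point estimate
      evalue_ci       = e_value(lcl))    # robustness of the confidence limit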
Flexible Spline Models for Blinded Sample Size Reestimation in Event-Driven Clinical Trials
In event-driven trials, the target power under a given treatment effect is maintained as long as the required number of events is observed, so misspecification of the survival function in the planning phase does not result in a loss of power. However, the trial might take longer than planned if the event rate is lower than assumed. Blinded sample size reestimation (BSSR) uses non-comparative interim data to adjust the sample size if some planning assumptions are wrong. In an event-driven trial, the sample size may be adjusted to maintain the chance of obtaining the required number of events within the planned time frame. For the purpose of BSSR, the survival function is estimated from the interim data and often needs to be extrapolated. Current practice is to fit standard parametric models, which, however, may not always be suitable. Here we propose a flexible spline-based BSSR method; specifically, we carry out the extrapolation based on the Royston-Parmar spline model. To compare the proposed procedure with parametric approaches, we carried out a simulation study. Whereas parametric approaches could seriously over- or underestimate the expected number of events, the proposed flexible approach avoided such undesirable behavior. The same pattern was observed in an application to a secondary progressive multiple sclerosis trial. Overall, when planning assumptions are wrong, this more robust, flexible BSSR method can help event-driven designs adjust recruitment numbers more accurately and finish on time.
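As a minimal sketch of the extrapolation step only (Royston-Parmar models are implemented in the R package flexsurv; the data, knot choice, and simplifications below are hypothetical and the full BSSR procedure involves more than this), one could fit the spline model to blinded interim data and project the expected number of events:

    library(survival)
    library(flexsurv)

    # Hypothetical blinded (pooled) interim data: follow-up time and event indicator
    set.seed(1)
    n_int <- 300
    t_ev  <- rexp(n_int, rate = 0.05)
    t_cen <- runif(n_int, 0, 24)                      # administrative censoring at interim
    dat   <- data.frame(t = pmin(t_ev, t_cen), d = as.numeric(t_ev <= t_cen))

    # Royston-Parmar spline model (1 internal knot, hazard scale)
    fit <- flexsurvspline(Surv(t, d) ~ 1, data = dat, k = 1, scale = "hazard")

    # Extrapolated survival at the planned analysis time, e.g., month 36
    S36 <- summary(fit, t = 36, type = "survival")[[1]]$est

    # Expected events among N enrolled patients by month 36,
    # ignoring staggered entry and dropout for simplicity
    N <- 500
    N * (1 - S36)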
Confidence Intervals for the Risk Difference Between Secondary and Primary Infection Based on the Method of Variance Estimates Recovery
The risk difference (RD) between the rate of secondary infection, given primary infection, and the rate of primary infection is a useful measure of the change in infection rates from primary to secondary infection, and it plays an important role in pharmacology and epidemiology. The method of variance estimates recovery (MOVER) is used to construct confidence intervals (CIs) for the RD. Seven types of CIs for a binomial proportion are introduced to obtain MOVER-based CIs for the RD. Simulation studies show that the Agresti-Coull CI, the score CI with continuity correction, the Clopper-Pearson CI, and the Bayesian credibility CI are conservative. The Jeffreys CI, Wilson score CI, and arcsine CI perform satisfactorily; they are suitable for various practical application scenarios as they provide accurate and reliable results. Three real datasets are used to illustrate that the recommended CIs are competitive with, or even better than, other methods.
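To illustrate the general MOVER construction (shown here for the simpler case of a difference between two independent proportions with Wilson score limits, using hypothetical counts; the paper's conditional secondary-versus-primary setting requires additional care), a minimal base-R sketch:

    # Wilson score limits for a single binomial proportion
    wilson <- function(x, n, conf = 0.95) {
      z <- qnorm(1 - (1 - conf) / 2)
      p <- x / n
      centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
      half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
      c(lower = centre - half, upper = centre + half)
    }

    # MOVER confidence interval for p1 - p2 recovered from the single-proportion limits
    mover_rd <- function(x1, n1, x2, n2, conf = 0.95) {
      p1 <- x1 / n1; p2 <- x2 / n2
      w1 <- wilson(x1, n1, conf); w2 <- wilson(x2, n2, conf)
      L <- (p1 - p2) - sqrt((p1 - w1["lower"])^2 + (w2["upper"] - p2)^2)
      U <- (p1 - p2) + sqrt((w1["upper"] - p1)^2 + (p2 - w2["lower"])^2)
      c(estimate = p1 - p2, lower = unname(L), upper = unname(U))
    }

    mover_rd(x1 = 12, n1 = 40, x2 = 30, n2 = 40)   # hypothetical counts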
A Phase I Dose-Finding Design Incorporating Intra-Patient Dose Escalation
Conventional Phase I trial designs assign a single dose to each patient, necessitating a minimum number of patients per dose to reliably identify the maximum tolerated dose (MTD). However, in many clinical trials, such as those involving pediatric patients or patients with rare cancers, recruiting an adequate number of patients can be challenging, limiting the applicability of standard trial designs. To address this challenge, we propose a new Phase I dose-finding design, denoted IP-CRM, that integrates intra-patient dose escalation with the continual reassessment method (CRM). In the IP-CRM design, intra-patient dose escalation is allowed, guided by both individual patients' toxicity outcomes and accumulated data across patients, and the starting dose for each cohort of patients is adaptively updated. We further extend the IP-CRM design to address carryover effects and/or intra-patient correlations. Because intra-patient dose escalation allows each patient to contribute multiple data points at different doses, the IP-CRM design can determine the MTD with a considerably reduced sample size compared to standard Phase I dose-finding designs. Simulation studies show that our IP-CRM design can efficiently reduce the sample size while enhancing the probability of identifying the MTD compared with standard CRM designs and the 3 + 3 design.
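For context, a minimal sketch of the standard one-parameter power-model CRM update that IP-CRM builds on (hypothetical skeleton, prior, and data; not the IP-CRM extension itself), using numerical integration in base R:

    # One-parameter power-model CRM: p_j(a) = skeleton_j ^ exp(a), prior a ~ N(0, var = 1.34)
    skeleton <- c(0.05, 0.12, 0.25, 0.40)   # hypothetical prior DLT-rate guesses
    target   <- 0.25
    n_tox    <- c(0, 1, 2, 0)               # hypothetical DLTs observed per dose
    n_pat    <- c(3, 6, 6, 0)               # hypothetical patients treated per dose

    lik   <- function(a) { p <- skeleton^exp(a); prod(p^n_tox * (1 - p)^(n_pat - n_tox)) }
    prior <- function(a) dnorm(a, mean = 0, sd = sqrt(1.34))

    # Posterior mean DLT rate at each dose via numerical integration
    norm_const <- integrate(function(a) sapply(a, lik) * prior(a), -10, 10)$value
    post_p <- sapply(seq_along(skeleton), function(j) {
      integrate(function(a) sapply(a, function(ai) skeleton[j]^exp(ai) * lik(ai)) * prior(a),
                -10, 10)$value / norm_const
    })

    # Recommend the dose whose estimated DLT rate is closest to the target
    which.min(abs(post_p - target))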
A Likelihood Perspective on Dose-Finding Study Designs in Oncology
Dose-finding studies in oncology often include an up-and-down dose transition rule that assigns a dose to each cohort of patients based on accumulating data on dose-limiting toxicity (DLT) events. In making a dose transition decision, the key scientific question is whether the true DLT rate of the current dose exceeds the target DLT rate, and the statistical question is how to evaluate the evidence in the available DLT data with respect to that scientific question. This article introduces generalized likelihood ratios (GLRs) that can be used to measure statistical evidence and support dose transition decisions. Applying this approach to a single-dose likelihood leads to a GLR-based interval design with three parameters: the target DLT rate and two GLR cut-points representing the levels of evidence required for dose escalation and de-escalation. This design gives a likelihood interpretation to each existing interval design and provides a unified framework for comparing different interval designs in terms of how much evidence is required for escalation and de-escalation. A GLR-based comparison of commonly used interval designs reveals important differences and motivates alternative designs that reduce over-treatment while maintaining the accuracy of maximum tolerated dose (MTD) estimation. The GLR-based approach can also be applied to a joint likelihood based on a nonparametric (e.g., isotonic regression) model or a parametric model. Simulation results indicate that the isotonic GLR performs similarly to the single-dose GLR, but the GLR based on a parsimonious model can improve MTD estimation when the underlying model is correct.
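To make the single-dose construction concrete, a small sketch (with a hypothetical target and cut-points, and not necessarily the authors' exact parameterization) of a generalized likelihood ratio comparing "DLT rate at or below target" against "DLT rate above target":

    # Generalized likelihood ratio at a single dose: evidence that the true DLT
    # rate p is at or below the target versus above the target.
    glr_single_dose <- function(x, n, target) {
      loglik <- function(p) dbinom(x, n, p, log = TRUE)
      phat   <- x / n
      ll_low  <- loglik(min(phat, target))   # restricted MLE under p <= target
      ll_high <- loglik(max(phat, target))   # restricted MLE under p >= target
      exp(ll_low - ll_high)                  # larger values favour p <= target
    }

    glr_single_dose(x = 1, n = 6, target = 0.30)   # hypothetical cohort data
    # An interval rule of this kind would escalate when the GLR exceeds one
    # cut-point and de-escalate when it falls below another, e.g., 8 and 1/8.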
WATCH: A Workflow to Assess Treatment Effect Heterogeneity in Drug Development for Clinical Trial Sponsors
This article proposes a Workflow for Assessing Treatment effeCt Heterogeneity (WATCH) in clinical drug development, targeted at clinical trial sponsors. WATCH is designed to address the challenges of investigating treatment effect heterogeneity (TEH) in randomized clinical trials, where sample size and multiplicity limit the reliability of findings. The proposed workflow includes four steps: analysis planning, initial data analysis and analysis dataset creation, TEH exploration, and multidisciplinary assessment. The workflow offers a general overview of how treatment effects vary by baseline covariates in the observed data and guides the interpretation of the findings in light of external evidence and the best scientific understanding. The workflow is exploratory rather than inferential/confirmatory in nature, but it should be preplanned before database lock and the start of analysis. Its focus is on providing a general overview rather than a single specific finding or subgroup with a differential effect.
A Bayesian Hybrid Design With Borrowing From Historical Study
In early-phase drug development of combination therapy, the primary objective is to preliminarily assess whether a novel agent contributes additive activity when combined with an established monotherapy. Because of potential feasibility issues in conducting a large randomized study, uncontrolled single-arm trials have been the mainstream approach in cancer clinical trials. However, such trials often present significant challenges in deciding whether to proceed to the next phase of development because they lack the randomized comparison of traditional two-arm trials. A hybrid design, leveraging data from a completed historical clinical study of the monotherapy, offers a valuable option to enhance study efficiency and improve informed decision-making. Compared to traditional single-arm designs, the hybrid design may significantly enhance power by borrowing external information, enabling a more robust assessment of activity. The primary challenge of the hybrid design lies in handling the information borrowing. We introduce a Bayesian dynamic power prior (DPP) framework with three components controlling the amount of dynamic borrowing. The framework offers flexible study design options with an explicit interpretation of borrowing, allowing customization according to specific needs. Furthermore, the posterior distribution in the proposed framework has a closed form, offering significant advantages in computational efficiency. The proposed framework's utility is demonstrated through simulations and a case study.
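As a minimal illustration of the closed-form borrowing idea (shown with a simple static power prior and a fixed discounting weight for a binary endpoint, with hypothetical counts; the proposed DPP determines the weight dynamically from the data):

    # Static power prior for a response rate: historical control data are
    # down-weighted by delta in [0, 1] before updating with current-trial data.
    x_hist <- 18; n_hist <- 60     # hypothetical historical control responses
    x_curr <- 10; n_curr <- 30     # hypothetical current control-arm responses
    delta  <- 0.5                  # fixed discount weight (a DPP would set this adaptively)

    a0 <- 0.5; b0 <- 0.5           # vague initial Beta prior

    # Closed-form Beta posterior after discounted historical and full current data
    a_post <- a0 + delta * x_hist + x_curr
    b_post <- b0 + delta * (n_hist - x_hist) + (n_curr - x_curr)

    c(post_mean = a_post / (a_post + b_post),
      ci_95 = qbeta(c(0.025, 0.975), a_post, b_post))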
Potential Bias Models With Bayesian Shrinkage Priors for Dynamic Borrowing of Multiple Historical Control Data
When multiple historical controls are available, it is necessary to consider conflicts between the current and historical controls as well as the relationships among the historical controls. One assumption concerning the relationship between the parameters of interest of the current and historical controls is known as "potential biases." Under this assumption, the differences between the parameter of interest of the current control and that of each historical control are defined as "potential bias parameters." We define a class of models, called "potential bias models," that encompasses several existing methods, including the commensurate prior. A potential bias model incorporates homogeneous historical controls by shrinking the potential bias parameters toward zero. For scenarios with multiple historical controls, a method using a horseshoe prior has previously been proposed; however, various other shrinkage priors are also available. In this study, we propose methods that apply spike-and-slab, Dirichlet-Laplace, and spike-and-slab lasso priors to the potential bias model. We conduct a simulation study and analyze clinical trial examples to compare the performance of the proposed and existing methods. The horseshoe prior and the three other priors make the strongest use of historical controls in the absence of heterogeneous historical controls and reduce the influence of heterogeneous historical controls when a few such controls are present. Among these four priors, the spike-and-slab prior performed best in the presence of heterogeneous historical controls.
Beyond the Fragility Index
The results of randomized clinical trials (RCTs) are frequently assessed with the fragility index (FI). Although the information provided by the FI may supplement the p value, this indicator has intrinsic weaknesses and shortcomings. In this article, we place the analysis of fragility within a broader framework so that it can reliably complement the information provided by the p value. We call this perspective the analysis of strength. We first propose a new strength index (SI), which can be adopted in normal distribution settings. This measure can be obtained for both significant and nonsignificant results and is straightforward to calculate, thus presenting compelling advantages over the FI, starting with the availability of a threshold. The case of time-to-event outcomes is also addressed. Then, moving beyond the p value, we develop the analysis of strength using likelihood ratios from Royall's statistical evidence viewpoint. A new R package is provided for performing strength calculations, and a simulation study is conducted to explore the behavior of the SI and the likelihood-based indicator empirically across different settings. The newly proposed analysis of strength is applied to the assessment of the results of three recent trials on the treatment of COVID-19.
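For reference, the classical fragility index against which the strength index is positioned can be computed by iteratively modifying outcomes until significance is lost; a minimal sketch for a two-arm binary outcome with hypothetical counts and Fisher's exact test:

    # Fragility index: the smallest number of outcome modifications in one arm
    # that turns a statistically significant result nonsignificant.
    fragility_index <- function(e1, n1, e2, n2, alpha = 0.05) {
      pval <- function(e1) fisher.test(matrix(c(e1, n1 - e1, e2, n2 - e2), nrow = 2))$p.value
      p <- pval(e1)
      if (p >= alpha) return(NA_integer_)   # only defined here for significant results
      fi <- 0
      while (p < alpha && e1 < n1) {
        e1 <- e1 + 1                        # add one event to arm 1 (taken as the arm with fewer events)
        fi <- fi + 1
        p  <- pval(e1)
      }
      fi
    }

    fragility_index(e1 = 10, n1 = 100, e2 = 25, n2 = 100)   # hypothetical trial counts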
Pre-Posterior Distributions in Drug Development and Their Properties
The topic of this article is pre-posterior distributions of success or failure. These distributions, determined before a study is run and based on all our assumptions, are what we should believe about the treatment effect if we are told only that the study has been successful, or unsuccessful. I show how the pre-posterior distributions of success and failure can be used during the planning phase of a study to investigate whether the study is able to discriminate between effective and ineffective treatments. I show how these distributions are linked to the probability of success (PoS), or failure, and how they can be determined from simulations if standard asymptotic normality assumptions are inappropriate. I show the link to the concept of the conditional introduced by Temple and Robertson in the context of the planning of multiple studies. Finally, I show that they can also be constructed regardless of whether the analysis of the study is frequentist or fully Bayesian.
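The construction lends itself to simple simulation; a minimal sketch under a normal approximation and a hypothetical design prior (not the article's worked examples) that approximates the probability of success and the pre-posterior distribution of the treatment effect given success or failure:

    # Pre-posterior distribution of the treatment effect given study success,
    # approximated by simulation under a normal design prior and a z-test.
    set.seed(42)
    n_sim <- 1e5
    n_arm <- 100; sd_y <- 1                              # hypothetical design
    se    <- sd_y * sqrt(2 / n_arm)                      # SE of the estimated difference

    delta     <- rnorm(n_sim, mean = 0.25, sd = 0.15)    # design prior on the true effect
    delta_hat <- rnorm(n_sim, mean = delta, sd = se)     # simulated trial estimate
    success   <- delta_hat / se > qnorm(0.975)           # one-sided significance at 2.5%

    mean(success)                                        # probability of success
    quantile(delta[success],  c(0.025, 0.5, 0.975))      # pre-posterior given success
    quantile(delta[!success], c(0.025, 0.5, 0.975))      # pre-posterior given failure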
A Model-Based Trial Design With a Randomization Scheme Considering Pharmacokinetics Exposure for Dose Optimization in Oncology
The primary purpose of an oncology dose-finding trial for novel anticancer agents has been shifting from determining the maximum tolerated dose to identifying an optimal dose (OD) that is both tolerable and therapeutically beneficial for subjects in subsequent clinical trials. In 2022, the FDA Oncology Center of Excellence initiated Project Optimus to reform the paradigm of dose optimization and dose selection in oncology drug development and issued a draft guidance. The guidance suggests that dose-finding trials include randomized dose-response cohorts of multiple doses and incorporate information on pharmacokinetics (PK) in addition to safety and efficacy data to select the OD. Furthermore, PK information could serve as a quick alternative to efficacy data for predicting the minimum efficacious dose and deciding dose assignment. This article proposes a model-based trial design for dose optimization with a randomization scheme based on PK outcomes in oncology. A simulation study shows that the proposed design has advantages over other designs in the percentage of correct OD selection and the average number of patients assigned to the OD across various realistic settings.
A Bayesian Dynamic Model-Based Adaptive Design for Oncology Dose Optimization in Phase I/II Clinical Trials
With the development of targeted therapy, immunotherapy, and antibody-drug conjugates (ADCs), there is growing concern over the "more is better" paradigm developed decades ago for chemotherapy, prompting the US Food and Drug Administration (FDA) to initiate Project Optimus to reform dose optimization and selection in oncology drug development. For early-phase oncology trials, given the high variability from sparse data and the rigidity of parametric model specifications, we use Bayesian dynamic models to borrow information across doses with only vague order constraints. Our proposed adaptive design simultaneously incorporates toxicity and efficacy outcomes to select the optimal dose (OD) in Phase I/II clinical trials, utilizing Bayesian model averaging to address the uncertainty of dose-response relationships and enhance the robustness of the design. Additionally, we extend the proposed design to handle delayed toxicity and efficacy outcomes. We conduct extensive simulation studies to evaluate the operating characteristics of the proposed method under various practical scenarios. The results demonstrate that the proposed designs have desirable operating characteristics. A trial example is presented to demonstrate the practical implementation of the proposed designs.
Subgroup Identification Based on Quantitative Objectives
Precision medicine is the future of drug development, and subgroup identification plays a critical role in achieving this goal. In this paper, we propose a powerful end-to-end solution, squant (available on CRAN), that explores a sequence of quantitative objectives. The method converts the original study into an artificial 1:1 randomized trial and features a flexible objective function, a stable signature with good interpretability, and embedded false discovery rate (FDR) control. We demonstrate its performance through simulation and provide a real-data example.
Bayesian Solutions for Assessing Differential Effects in Biomarker Positive and Negative Subgroups
The number of clinical trials that include a binary biomarker in design and analysis has risen with the advent of personalised medicine. This presents challenges for medical decision makers because a drug may confer a stronger effect in the biomarker positive group, and so may be approved either in this subgroup alone or in the all-comer population. We develop and evaluate Bayesian methods that can be used to assess this. All our methods are based on the same statistical model for the observed data, but we propose different prior specifications to express differing degrees of knowledge about the extent to which the treatment may be more effective in one subgroup than the other. We illustrate our methods using real examples and show how the methodology is useful when designing trials in which the size of the biomarker negative subgroup is to be determined. We conclude that our Bayesian framework is a natural tool for making decisions such as whether to recommend using the treatment in the biomarker negative subgroup, where the treatment is less likely to be efficacious, or how many biomarker positive and negative patients to include when designing a trial.
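To illustrate the kind of calculation involved (a simple normal approximation with one possible prior choice and hypothetical estimates, not the authors' exact specifications), the sketch below evaluates on a grid the posterior probability of a beneficial effect in the biomarker negative subgroup when its effect is shrunk toward the biomarker positive effect:

    # Observed treatment effect estimates (e.g., standardized differences) and standard errors
    d_pos <- 0.40; se_pos <- 0.15     # hypothetical biomarker positive subgroup
    d_neg <- 0.10; se_neg <- 0.20     # hypothetical biomarker negative subgroup

    # Prior: vague on the positive-subgroup effect; the negative-subgroup effect is
    # shrunk toward it, with tau expressing how similar the effects are believed to be
    tau <- 0.20

    grid <- expand.grid(theta_pos = seq(-1, 1.5, length.out = 400),
                        theta_neg = seq(-1, 1.5, length.out = 400))
    log_post <- with(grid,
      dnorm(d_pos, theta_pos, se_pos, log = TRUE) +
      dnorm(d_neg, theta_neg, se_neg, log = TRUE) +
      dnorm(theta_pos, 0, 10, log = TRUE) +            # vague prior on theta_pos
      dnorm(theta_neg, theta_pos, tau, log = TRUE))    # shrinkage prior on theta_neg
    w <- exp(log_post - max(log_post)); w <- w / sum(w)

    # Posterior probability of a beneficial effect in the biomarker negative subgroup
    sum(w[grid$theta_neg > 0])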
PKBOIN-12: A Bayesian Optimal Interval Phase I/II Design Incorporating Pharmacokinetics Outcomes to Find the Optimal Biological Dose
Immunotherapies and targeted therapies have gained popularity due to their promising therapeutic effects across multiple treatment areas. The focus of early phase dose-finding clinical trials has shifted from finding the maximum tolerated dose (MTD) to identifying the optimal biological dose (OBD), which aims to balance toxicity and efficacy outcomes, thus optimizing the risk-benefit trade-off. These trials often collect multiple pharmacokinetics (PK) outcomes to assess drug exposure, which has been shown to correlate with toxicity and efficacy outcomes but has not been utilized in current dose-finding designs for OBD selection. Moreover, PK outcomes are usually available within days after initial treatment, much faster than toxicity and efficacy outcomes. To bridge this gap, we introduce the innovative model-assisted PKBOIN-12 design, which enhances BOIN12 by integrating PK information into both the dose-finding algorithm and the final OBD determination process. We further extend PKBOIN-12 to TITE-PKBOIN-12 to address the challenges of late-onset toxicity and efficacy outcomes. Simulation results demonstrate that PKBOIN-12 identifies the OBD more effectively and allocates a greater number of patients to it than BOIN12. Additionally, PKBOIN-12 decreases the probability of selecting inefficacious doses as the OBD by excluding those with low drug exposure. Comprehensive simulation studies and sensitivity analyses confirm the robustness of both PKBOIN-12 and TITE-PKBOIN-12 in various scenarios.
Treatment Effect Measures Under Nonproportional Hazards
'Treatment effect measures under nonproportional hazards' by Snapinn et al. (Pharmaceutical Statistics, 22, 181-193) recently proposed some novel estimates of treatment effect for time-to-event endpoints. In this note, we clarify three points related to the proposed estimators that help to elucidate their properties. We hope that their work, and this commentary, will motivate further discussion concerning treatment effect measures that do not require the proportional hazards assumption.
Optimizing Sample Size Determinations for Phase 3 Clinical Trials in Type 2 Diabetes
An informed estimate of subject-level variance is a key determinant of an accurate sample size calculation for clinical trials. Evaluating completed adult type 2 diabetes studies submitted to the U.S. Food and Drug Administration (FDA) for the accuracy of the variance estimate made at the planning stage provides insights to inform the sample size requirements of future studies. From the FDA database of new drug applications, comprising 14,106 subjects from 26 phase 3 randomized studies reviewed between 2013 and 2017 in support of drug approvals in adult type 2 diabetes, we obtained estimates of subject-level variance for the primary endpoint: change in glycated hemoglobin (HbA1c) from baseline to 6 months. In addition, we used nine additional studies to examine the impact of clinically meaningful covariates on the residual standard deviation and on sample size re-estimation. Our analyses show that reduced sample sizes can be used without compromising the validity of efficacy results for adult type 2 diabetes drug trials. This finding has implications for future research in the adult type 2 diabetes population, including the potential to shorten recruitment periods and improve the timeliness of results, and it could be utilized in the design of future endocrinology clinical trials.
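As a small worked illustration of how the planning-stage variance assumption drives the required sample size (hypothetical numbers, not values from the submitted trials), in base R:

    # Sample size per arm to detect a 0.5% HbA1c difference with 90% power,
    # under two assumed subject-level SDs of the 6-month change from baseline
    power.t.test(delta = 0.5, sd = 1.1, sig.level = 0.05, power = 0.90)$n
    power.t.test(delta = 0.5, sd = 0.9, sig.level = 0.05, power = 0.90)$n
    # Overstating the SD (1.1% rather than 0.9%) inflates the required n per arm
    # by roughly (1.1 / 0.9)^2, i.e., about 49% more patients than needed.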
Prediction Intervals for Overdispersed Poisson Data and Their Application in Medical and Pre-Clinical Quality Control
In pre-clinical and medical quality control, it is of interest to assess the stability of the process under monitoring or to validate a current observation using historical control data. Classically, this is done by applying historical control limits (HCL), graphically displayed in control charts. In many applications, HCL are applied to count data, for example, the number of revertant colonies (Ames assay) or the number of relapses per multiple sclerosis patient. Count data may be overdispersed and heavily right-skewed, and clusters may differ in cluster size or other baseline quantities (e.g., the number of petri dishes per control group or different lengths of monitoring time per patient). Based on the quasi-Poisson assumption or the negative-binomial distribution, we propose prediction intervals for overdispersed count data to be used as HCL. Variable baseline quantities are accounted for by offsets. Furthermore, we provide a bootstrap calibration algorithm that accounts for the skewed distribution and achieves equal tail probabilities. Comprehensive Monte Carlo simulations assessing the coverage probabilities of eight different methods for HCL calculation reveal that the bootstrap-calibrated prediction intervals control the type I error best. Heuristics traditionally used in control charts (e.g., the limits of Shewhart c- or u-charts or the mean ± 2 SD) fail to control a pre-specified coverage probability. The application of HCL is demonstrated with data from the Ames assay and with numbers of relapses of multiple sclerosis patients. The proposed prediction intervals and the bootstrap calibration algorithm are publicly available via the R package predint.
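As a minimal sketch of the uncalibrated starting point only (a simple asymptotic quasi-Poisson prediction interval with offsets, using hypothetical data; the bootstrap-calibrated intervals themselves are provided in predint):

    # Hypothetical historical control data: counts per cluster and baseline offsets
    y <- c(21, 18, 25, 30, 19, 23)   # e.g., revertant colonies per control group
    n <- c(3, 3, 3, 4, 3, 3)         # e.g., number of petri dishes per group

    lambda_hat <- sum(y) / sum(n)                                           # pooled rate
    phi_hat <- sum((y - n * lambda_hat)^2 / (n * lambda_hat)) / (length(y) - 1)
    phi_hat <- max(phi_hat, 1)       # do not reward apparent underdispersion

    # Asymptotic prediction interval for a future count with offset n_star:
    # variance = overdispersed Poisson variance + variance of the estimated mean
    n_star  <- 3
    mu_star <- n_star * lambda_hat
    se_pred <- sqrt(phi_hat * mu_star + phi_hat * n_star^2 * lambda_hat / sum(n))
    mu_star + c(-1, 1) * qnorm(0.975) * se_pred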
Taylor Series Approximation for Accurate Generalized Confidence Intervals of Ratios of Log-Normal Standard Deviations for Meta-Analysis Using Means and Standard Deviations in Time Scale
With contemporary anesthetic drugs, the efficacy of general anesthesia is assured, and health-economic and clinical objectives relate instead to reductions in the variability of dosing, variability of recovery, and similar quantities. Consequently, meta-analyses in anesthesiology research would benefit from quantification of ratios of standard deviations of log-normally distributed variables (e.g., surgical duration). Generalized confidence intervals can be used once the sample means and standard deviations in the raw (time) scale for each study and group have been used to estimate the means and standard deviations of the logarithms of the times (i.e., the "log scale"). We examine matching the first two moments versus also using higher-order terms, following Higgins et al. 2008 and Friedrich et al. 2012. Monte Carlo simulations revealed that, using the first two moments, 95% confidence intervals had coverage of 92%-95%, with small bias. Using higher-order moments worsened confidence interval coverage for the log ratios, especially for coefficients of variation in the time scale of 50% and for larger sample sizes per group, resulting in 88% coverage. We recommend that, when calculating confidence intervals for ratios of standard deviations based on generalized pivotal quantities and log-normal distributions and relying on transformation of sample statistics from the time to the log scale, the first two moments be used rather than the higher-order terms.
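To sketch the two-moment matching and the generalized pivotal construction (a simplified version for two independent groups with hypothetical summary statistics; the paper's meta-analytic procedure is more involved):

    # Hypothetical time-scale summary statistics (e.g., surgical duration, minutes)
    m1 <- 120; s1 <- 60; n1 <- 40    # group 1 mean, SD, sample size
    m2 <- 110; s2 <- 45; n2 <- 35    # group 2

    # Two-moment matching: log-scale variance implied by a log-normal distribution
    sig2 <- function(m, s) log(1 + (s / m)^2)

    # Generalized pivotal quantities for each log-scale variance, treating
    # (n - 1) * sigma2_hat / sigma2 as approximately chi-squared with n - 1 df
    B <- 1e5
    gpq1 <- (n1 - 1) * sig2(m1, s1) / rchisq(B, n1 - 1)
    gpq2 <- (n2 - 1) * sig2(m2, s2) / rchisq(B, n2 - 1)

    # 95% generalized confidence interval for the ratio of log-scale SDs
    ratio <- sqrt(gpq1 / gpq2)
    quantile(ratio, c(0.025, 0.975))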
Bayesian Sample Size Calculation in Small n, Sequential Multiple Assignment Randomized Trials (snSMART)
A recent study design for clinical trials with small sample sizes is the small n, sequential, multiple assignment, randomized trial (snSMART). An snSMART design has previously been proposed to compare the efficacy of two dose levels versus placebo. In such a trial, participants are initially randomized to receive low dose, high dose, or placebo in stage 1. In stage 2, participants are re-randomized to one of the dose levels depending on their initial treatment and a dichotomous response. A Bayesian analytic approach borrowing information from both stages has been proposed and shown to improve the efficiency of estimation. In this paper, we propose two sample size determination (SSD) methods for this snSMART comparing two dose levels with placebo. Both methods adopt the average coverage criterion (ACC) approach. In the first, one-step approach, the sample size is calculated directly, taking advantage of the explicit posterior variance of the treatment effect. In the second, two-step approach, we update the sample size needed for a single-stage parallel design using a proposed adjustment factor (AF). Through simulations, we demonstrate that the sample sizes calculated using both SSD approaches provide the desired power. We also provide an applet for convenient and fast sample size calculation in this snSMART setting.
A Commensurate Prior Model With Random Effects for Survival and Competing Risk Outcomes to Accommodate Historical Controls
Clinical trials (CTs) often suffer from small sample sizes due to limited budgets and patient enrollment challenges. Using historical data in the analysis of CT data may boost statistical power and reduce the required sample size. Existing methods for borrowing information from historical data with right-censored outcomes do not consider matching between historical and CT data to reduce heterogeneity, and they address only survival outcomes, not competing risk outcomes. We therefore propose a clustering-based commensurate prior model with random effects for both survival and competing risk outcomes that effectively borrows information according to the degree of comparability between the historical and CT data. Simulation results show that the proposed method controls type I error better and has lower bias than some competing methods. We apply our method to a phase III CT comparing the effectiveness of partially matched bone marrow donated by family members versus two partially matched cord blood units for treating leukemia and lymphoma.