International Journal of Biostatistics

Hypothesis testing for detecting outlier evaluators
Xu L, Zucker DM and Wang M
In epidemiological studies, the measurements of disease outcomes are carried out by different evaluators. In this paper, we propose a two-stage procedure for detecting outlier evaluators. In the first stage, a regression model is fitted to obtain the evaluators' effects. Outlier evaluators have different effects than normal evaluators. In the second stage, stepwise hypothesis tests are performed to detect outlier evaluators. The true positive rate and true negative rate of the proposed procedure are assessed in a simulation study. We apply the proposed method to detect potential outlier audiologists among the audiologists who measured hearing threshold levels of the participants in the Audiology Assessment Arm of the Conservation of Hearing Study, which is an epidemiological study for examining risk factors of hearing loss.
Optimizing personalized treatments for targeted patient populations across multiple domains
Chen Y, Zeng D and Wang Y
Learning individualized treatment rules (ITRs) for a target patient population with mental disorders is confronted with many challenges. First, the target population may be different from the training population that provided data for learning ITRs. Ignoring differences between the training patient data and the target population can result in sub-optimal treatment strategies for the target population. Second, for mental disorders, a patient's underlying mental state is not observed but can be inferred from measures of high-dimensional combinations of symptomatology. Treatment mechanisms are unknown and can be complex, and thus treatment effect moderation can take complicated forms. To address these challenges, we propose a novel method that connects measurement models, efficient weighting schemes, and flexible neural network architecture through latent variables to tailor treatments for a target population. Patients' underlying mental states are represented by a compact set of latent state variables while preserving interpretability. Weighting schemes are designed based on lower-dimensional latent variables to efficiently balance population differences so that biases in learning the latent structure and treatment effects are mitigated. Extensive simulation studies demonstrated consistent superiority of the proposed method and the weighting approach. Applications to two real-world studies of patients with major depressive disorder have shown a broad utility of the proposed method in improving treatment outcomes in the target population.
History-restricted marginal structural model and latent class growth analysis of treatment trajectories for a time-dependent outcome
Diop A, Sirois C, Guertin JR, Schnitzer ME, Brophy JM, Blais C and Talbot D
In previous work, we introduced a framework that combines latent class growth analysis (LCGA) with marginal structural models (LCGA-MSM). LCGA-MSM first summarizes the numerous time-varying treatment patterns into a few trajectory groups and then allows for a population-level causal interpretation of the group differences. However, the LCGA-MSM framework is not suitable when the outcome is time-dependent. In this study, we propose combining a nonparametric history-restricted marginal structural model (HRMSM) with LCGA. HRMSMs can be seen as an application of standard MSMs on multiple time intervals. To the best of our knowledge, we also present the first application of HRMSMs with a time-to-event outcome. It was previously noted that HRMSMs could pose interpretation problems in survival analysis when either targeting a hazard ratio or a survival curve. We propose a causal parameter that bypasses these interpretation challenges. We consider three different estimators of the parameters: inverse probability of treatment weighting (IPTW), g-computation, and a pooled longitudinal targeted maximum likelihood estimator (pooled LTMLE). We conduct simulation studies to measure the performance of the proposed LCGA-HRMSM. For all scenarios, we obtain unbiased estimates when using either g-computation or pooled LTMLE. IPTW produced estimates with slightly larger bias in some scenarios. Overall, all approaches have good coverage of the 95 % confidence interval. We applied our approach to a population of older Quebecers composed of 57,211 statin initiators and found that a greater adherence to statins was associated with a lower combined risk of cardiovascular disease or all-cause mortality.
Hybrid classical-Bayesian approach to sample size determination for two-arm superiority clinical trials
Sambucini V
Traditional methods for Sample Size Determination (SSD) based on power analysis exploit relevant fixed values or preliminary estimates for the unknown parameters. A hybrid classical-Bayesian approach can be used to formally incorporate information or model uncertainty on unknown quantities by using prior distributions according to the Bayesian approach, while still analysing the data in a frequentist framework. In this paper, we propose a hybrid procedure for SSD in two-arm superiority trials, that takes into account the different role played by the unknown parameters involved in the statistical power. Thus, different prior distributions are used to formalize design expectations and to model information or uncertainty on preliminary estimates involved at the analysis stage. To illustrate the method, we consider binary data and derive the proposed hybrid criteria using three possible parameters of interest, i.e. the difference between proportions of successes, the logarithm of the relative risk and the logarithm of the odds ratio. Numerical examples taken from the literature are presented to show how to implement the proposed procedure.
Comments on "sensitivity of estimands in clinical trials with imperfect compliance" by Chen and Heitjan
Baker SG and Lindeman KS
Chen and Heitjan (Sensitivity of estimands in clinical trials with imperfect compliance. Int J Biostat. 2023) used linear extrapolation to estimate the population average causal effect (PACE) from the complier average causal effect (CACE) in multiple randomized trials with all-or-none compliance. For extrapolating from CACE to PACE in this setting and in the paired availability design involving different availabilities of treatment among before-and-after studies, we recommend the sensitivity analysis in Baker and Lindeman (J Causal Inference, 2013) because it is not restricted to a linear model, as it involves various random effect and trend models.
Detecting differentially expressed genes from RNA-seq data using fuzzy clustering
Ando Y and Shimokawa A
A two-group comparison test is generally performed on RNA sequencing data to detect differentially expressed genes (DEGs). However, the accuracy of this method is low due to the small sample size. To address this, we propose a method using fuzzy clustering that artificially generates data with expression patterns similar to those of DEGs to identify genes that are highly likely to be classified into the same cluster as the initial cluster data. The proposed method is advantageous in that it does not perform any test. Furthermore, a certain level of accuracy can be maintained even when the sample size is biased, and we show that such a situation may improve the accuracy of the proposed method. We compared the proposed method with the conventional method using simulations. In the simulations, we changed the sample size and difference between the expression levels of group 1 and group 2 in the DEGs to obtain the desired accuracy of the proposed method. The results show that the proposed method is superior in all cases under the conditions simulated. We also show that the effect of the difference between group 1 and group 2 on the accuracy is more prominent when the sample size is biased.
An interpretable cluster-based logistic regression model, with application to the characterization of response to therapy in severe eosinophilic asthma
Bilancia M, Nigri A, Cafarelli B and Di Bona D
Asthma is a disease characterized by chronic airway hyperresponsiveness and inflammation, with signs of variable airflow limitation and impaired lung function leading to respiratory symptoms such as shortness of breath, chest tightness and cough. Eosinophilic asthma is a distinct phenotype that affects more than half of patients diagnosed with severe asthma. It can be effectively treated with monoclonal antibodies targeting specific immunological signaling pathways that fuel the inflammation underlying the disease, particularly Interleukin-5 (IL-5), a cytokine that plays a crucial role in asthma. In this study, we propose a data analysis pipeline aimed at identifying subphenotypes of severe eosinophilic asthma in relation to response to therapy at follow-up, which could have great potential for use in routine clinical practice. Once an optimal partition of patients into subphenotypes has been determined, the labels indicating the group to which each patient has been assigned are used in a novel way. For each input variable in a specialized logistic regression model, a clusterwise effect on response to therapy is determined by an appropriate interaction term between the input variable under consideration and the cluster label. We show that the clusterwise odds ratios can be meaningfully interpreted conditional on the cluster label. In this way, we can define an effect measure for the response variable for each input variable in each of the groups identified by the clustering algorithm, which is not possible in standard logistic regression because the effect of the reference class is aliased with the overall intercept. The interpretability of the model is enforced by promoting sparsity, a goal achieved by learning interactions in a hierarchical manner using a special group-Lasso technique. In addition, valid expressions are provided for computing odds ratios in the unusual parameterization used by the sparsity-promoting algorithm. We show how to apply the proposed data analysis pipeline to the problem of sub-phenotyping asthma patients also in terms of quality of response to therapy with monoclonal antibodies.
Response to comments on 'sensitivity of estimands in clinical trials with imperfect compliance'
Chen H and Heitjan DF
Prediction-based variable selection for component-wise gradient boosting
Potts S, Bergherr E, Reinke C and Griesbach C
Model-based component-wise gradient boosting is a popular tool for data-driven variable selection. In order to improve its prediction and selection qualities even further, several modifications of the original algorithm have been developed, that mainly focus on different stopping criteria, leaving the actual variable selection mechanism untouched. We investigate different prediction-based mechanisms for the variable selection step in model-based component-wise gradient boosting. These approaches include Akaikes Information Criterion (AIC) as well as a selection rule relying on the component-wise test error computed via cross-validation. We implemented the AIC and cross-validation routines for Generalized Linear Models and evaluated them regarding their variable selection properties and predictive performance. An extensive simulation study revealed improved selection properties whereas the prediction error could be lowered in a real world application with age-standardized COVID-19 incidence rates.
Ensemble learning methods of inference for spatially stratified infectious disease systems
Peitsch J, Pokharel G and Hossain S
Individual level models are a class of mechanistic models that are widely used to infer infectious disease transmission dynamics. These models incorporate individual level covariate information accounting for population heterogeneity and are generally fitted in a Bayesian Markov chain Monte Carlo (MCMC) framework. However, Bayesian MCMC methods of inference are computationally expensive for large data sets. This issue becomes more severe when applied to infectious disease data collected from spatially heterogeneous populations, as the number of covariates increases. In addition, summary statistics over the global population may not capture the true spatio-temporal dynamics of disease transmission. In this study we propose to use ensemble learning methods to predict epidemic generating models instead of time consuming Bayesian MCMC method. We apply these methods to infer disease transmission dynamics over spatially clustered populations, considering the clusters as natural strata instead of a global population. We compare the performance of two tree-based ensemble learning techniques: random forest and gradient boosting. These methods are applied to the 2001 foot-and-mouth disease epidemic in the U.K. and evaluated using simulated data from a clustered population. It is shown that the spatially clustered data can help to predict epidemic generating models more accurately than the global data.
Kalman filter with impulse noised outliers: a robust sequential algorithm to filter data with a large number of outliers
Cloez B, Fontez B, González-García E and Sanchez I
Impulse noised outliers are data points that differ significantly from other observations. They are generally removed from the data set through local regression or the Kalman filter algorithm. However, these methods, or their generalizations, are not well suited when the number of outliers is of the same order as the number of low-noise data (often called ). In this article, we propose a new model for impulsed noise outliers. It is based on a hierarchical model and a simple linear Gaussian process as with the Kalman Filter. We present a fast forward-backward algorithm to filter and smooth sequential data and which also detects these outliers. We compare the robustness and efficiency of this algorithm with classical methods. Finally, we apply this method on a real data set from a Walk Over Weighing system admitting around 60 % of outliers. For this application, we further develop an (explicit) EM algorithm to calibrate some algorithm parameters.
Random forests for survival data: which methods work best and under what conditions?
Berkowitz M, Altman RM and Loughin TM
Few systematic comparisons of methods for constructing survival trees and forests exist in the literature. Importantly, when the goal is to predict a survival time or estimate a survival function, the optimal choice of method is unclear. We use an extensive simulation study to systematically investigate various factors that influence survival forest performance - forest construction method, censoring, sample size, distribution of the response, structure of the linear predictor, and presence of correlated or noisy covariates. In particular, we study 11 methods that have recently been proposed in the literature and identify 6 top performers. We find that all the factors that we investigate have significant impact on the methods' relative accuracy of point predictions of survival times and survival function estimates. We use our results to make recommendations for which methods to use in a given context and offer explanations for the observed differences in relative performance.
The survival function NPMLE for combined right-censored and length-biased right-censored failure time data: properties and applications
McVittie JH, Wolfson DB and Stephens DA
Many cohort studies in survival analysis have imbedded in them subcohorts consisting of incident cases and prevalent cases. Instead of analysing the data from the incident and prevalent cohorts alone, there are surely advantages to combining the data from these two subcohorts. In this paper, we discuss a survival function nonparametric maximum likelihood estimator (NPMLE) using both length-biased right-censored prevalent cohort data and right-censored incident cohort data. We establish the asymptotic properties of the survival function NPMLE and utilize the NPMLE to estimate the distribution for time spent in a Montreal area hospital.
Estimation of a decreasing mean residual life based on ranked set sampling with an application to survival analysis
Zamanzade E, Zamanzade E and Parvardeh A
The mean residual lifetime (MRL) of a unit in a population at a given time , is the average remaining lifetime among those population units still alive at the time . In some applications, it is reasonable to assume that MRL function is a decreasing function over time. Thus, one natural way to improve the estimation of MRL function is to use this assumption in estimation process. In this paper, we develop an MRL estimator in ranked set sampling (RSS) which, enjoys the monotonicity property. We prove that it is a strongly uniformly consistent estimator of true MRL function. We also show that the asymptotic distribution of the introduced estimator is the same as the empirical one, and therefore the novel estimator is obtained "free of charge", at least in an asymptotic sense. We then compare the proposed estimator with its competitors in RSS and simple random sampling (SRS) using Monte Carlo simulation. Our simulation results confirm the superiority of the proposed procedure for finite sample sizes. Finally, a real dataset from the Surveillance, Epidemiology and End Results (SEER) program of the US National Cancer Institute (NCI) is used to show that the introduced technique can provide more accurate estimates for the average remaining lifetime of patients with breast cancer.
Flexible variable selection in the presence of missing data
Williamson BD and Huang Y
In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose a nonparametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithm that achieve control of commonly used error rates. Through simulations, we show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed method to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.
Statistical models for assessing agreement for quantitative data with heterogeneous random raters and replicate measurements
Ekstrøm CT and Carstensen B
Agreement between methods for quantitative measurements are typically assessed by computing limits of agreement between pairs of methods and/or by illustration through Bland-Altman plots. We consider the situation where the observed measurement methods are considered a random sample from a population of possible methods, and discuss how the underlying linear mixed effects model can be extended to this situation. This is relevant when, for example, the methods represent raters/judges that are used to score specific individuals or items. In the case of random methods, we are not interested in estimates pertaining to the specific methods, but are instead interested in quantifying the variation between the methods actually involved making measurements, and accommodating this as an extra source of variation when generalizing to the clinical performance of a method. In the model we allow raters to have individual precision/skill and permit linked replicates (, when the numbering, labeling or ordering of the replicates within items is important). Applications involving estimation of the limits of agreement for two datasets are shown: A dataset of spatial perception among a group of students as well as a dataset on consumer preference of French chocolate. The models are implemented in the MethComp package for R [Carstensen B, Gurrin L, Ekstrøm CT, Figurski M. MethComp: functions for analysis of agreement in method comparison studies; 2013. R package version 1.22, R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2012].
MBPCA-OS: an exploratory multiblock method for variables of different measurement levels. Application to study the immune response to SARS-CoV-2 infection and vaccination
Paries M, Vigneau E, Huneau A, Lantz O and Bougeard S
Studying a large number of variables measured on the same observations and organized in blocks - denoted multiblock data - is becoming standard in several domains especially in biology. To explore the relationships between all these variables - at the block- and the variable-level - several exploratory multiblock methods were proposed. However, most of them are only designed for numeric variables. In reality, some data sets contain variables of different measurement levels (i.e., numeric, nominal, ordinal). In this article, we focus on exploratory multiblock methods that handle variables at their appropriate measurement level. Multi-Block Principal Component Analysis with Optimal Scaling (MBPCA-OS) is proposed and applied to multiblock data from the CURIE-O-SA French cohort. In this study, variables are of different measurement levels and organized in four blocks. The objective is to study the immune responses according to the SARS-CoV-2 infection and vaccination statuses, the symptoms and the participant's characteristics.
Revisiting incidence rates comparison under right censorship
Martínez-Camblor P and Díaz-Coto S
Data description is the first step for understanding the nature of the problem at hand. Usually, it is a simple task that does not require any particular assumption. However, the interpretation of the used descriptive measures can be a source of confusion and misunderstanding. The incidence rate is the quotient between the number of observed events and the sum of time that the studied population was at risk of having this event (person-time). Despite this apparently simple definition, its interpretation is not free of complexity. In this piece of research, we revisit the incidence rate estimator under right-censorship. We analyze the effect that the censoring time distribution can have on the observed results, and its relevance in the comparison of two or more incidence rates. We propose a solution for limiting the impact that the data collection process can have on the results of the hypothesis testing. We explore the finite-sample behavior of the considered estimators from Monte Carlo simulations. Two examples based on synthetic data illustrate the considered problem. The R code and data used are provided as Supplementary Material.
Bayesian second-order sensitivity of longitudinal inferences to non-ignorability: an application to antidepressant clinical trial data
Momeni Roochi E and Eftekhari Mahabadi S
Incomplete data is a prevalent complication in longitudinal studies due to individuals' drop-out before intended completion time. Currently available methods via commercial software for analyzing incomplete longitudinal data at best rely on the ignorability of the drop-outs. If the underlying missing mechanism was non-ignorable, potential bias arises in the statistical inferences. To remove the bias when the drop-out is non-ignorable, joint complete-data and drop-out models have been proposed which involve computational difficulties and untestable assumptions. Since the critical ignorability assumption is unverifiable based on the observed part of the sample, some local sensitivity indices have been proposed in the literature. Specifically, Eftekhari Mahabadi (Second-order local sensitivity to non-ignorability in Bayesian inferences. Stat Med 2018;59:55-95) proposed a second-order local sensitivity tool for Bayesian analysis of cross-sectional studies and show its better performance for handling bias compared with the first-order ones. In this paper, we aim to extend this index for the Bayesian sensitivity analysis of normal longitudinal studies with drop-outs. The index is driven based on a selection model for the drop-out mechanism and a Bayesian linear mixed-effect complete-data model. The presented formulas are calculated using the posterior estimation and draws from the simpler ignorable model. The method is illustrated via some simulation studies and sensitivity analysis of a real antidepressant clinical trial data. Overall, the numerical analysis showed that when repeated outcomes are subject to missingness, regression coefficient estimates are nearly approximated well by a linear function in the neighbourhood of MAR model, but there are a considerable amount of second-order sensitivity for the error term and random effect variances in Bayesian linear mixed-effect model framework.
Improving the mixed model for repeated measures to robustly increase precision in randomized trials
Wang B and Du Y
In randomized trials, repeated measures of the outcome are routinely collected. The mixed model for repeated measures (MMRM) leverages the information from these repeated outcome measures, and is often used for the primary analysis to estimate the average treatment effect at the primary endpoint. MMRM, however, can suffer from bias and precision loss when it models intermediate outcomes incorrectly, and hence fails to use the post-randomization information harmlessly. This paper proposes an extension of the commonly used MMRM, called IMMRM, that improves the robustness and optimizes the precision gain from covariate adjustment, stratified randomization, and adjustment for intermediate outcome measures. Under regularity conditions and missing completely at random, we prove that the IMMRM estimator for the average treatment effect is robust to arbitrary model misspecification and is asymptotically equal or more precise than the analysis of covariance (ANCOVA) estimator and the MMRM estimator. Under missing at random, IMMRM is less likely to be misspecified than MMRM, and we demonstrate via simulation studies that IMMRM continues to have less bias and smaller variance. Our results are further supported by a re-analysis of a randomized trial for the treatment of diabetes.
Testing for association between ordinal traits and genetic variants in pedigree-structured samples by collapsing and kernel methods
Chien LC
In genome-wide association studies (GWAS), logistic regression is one of the most popular analytics methods for binary traits. Multinomial regression is an extension of binary logistic regression that allows for multiple categories. However, many GWAS methods have been limited application to binary traits. These methods have improperly often been used to account for ordinal traits, which causes inappropriate type I error rates and poor statistical power. Owing to the lack of analysis methods, GWAS of ordinal traits has been known to be problematic and gaining attention. In this paper, we develop a general framework for identifying ordinal traits associated with genetic variants in pedigree-structured samples by collapsing and kernel methods. We use the local odds ratios GEE technology to account for complicated correlation structures between family members and ordered categorical traits. We use the retrospective idea to treat the genetic markers as random variables for calculating genetic correlations among markers. The proposed genetic association method can accommodate ordinal traits and allow for the covariate adjustment. We conduct simulation studies to compare the proposed tests with the existing models for analyzing the ordered categorical data under various configurations. We illustrate application of the proposed tests by simultaneously analyzing a family study and a cross-sectional study from the Genetic Analysis Workshop 19 (GAW19) data.