COMPUTATIONAL STATISTICS

A New Approach to Modeling the Cure Rate in the Presence of Interval Censored Data
Pal S, Peng Y and Aselisewine W
We consider interval-censored data with a cured subgroup that arise from longitudinal follow-up studies with a heterogeneous population in which a certain proportion of subjects is not susceptible to the event of interest. We propose a two-component mixture cure model, where the first component, describing the probability of cure, is modeled by a support vector machine-based approach and the second component, describing the survival distribution of the uncured group, is modeled by a proportional hazards structure. Our proposed model provides flexibility in capturing complex effects of covariates on the probability of cure, unlike traditional models that rely on modeling the cure probability using a generalized linear model with a known link function. For the estimation of model parameters, we develop an expectation maximization-based estimation algorithm. We conduct simulation studies and show that our proposed model performs better in capturing complex effects of covariates on the cure probability when compared to the traditional logit link-based two-component mixture cure model. This results in more accurate (smaller bias) and more precise (smaller mean square error) estimates of the cure probabilities, which in turn improves the predictive accuracy of the latent cured status. We further show that our model's ability to capture complex covariate effects also improves the estimation results corresponding to the survival distribution of the uncured. Finally, we apply the proposed model and estimation procedure to an interval-censored data set on smoking cessation.
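As background on the mixture cure structure described above, here is a minimal sketch of the population survival function in a generic notation of my own (the paper's contribution is to replace the parametric cure-probability link shown below with a support vector machine-based estimate):

```latex
S_{\mathrm{pop}}(t \mid \mathbf{x}, \mathbf{z})
  = \pi(\mathbf{z}) + \{1 - \pi(\mathbf{z})\}\, S_u(t \mid \mathbf{x}),
\qquad
S_u(t \mid \mathbf{x}) = S_0(t)^{\exp(\boldsymbol{\beta}^{\top}\mathbf{x})}
```

Here π(z) is the cure probability, S_u is the proportional hazards survival function of the uncured with baseline S_0, and the traditional specification takes π(z) = 1/{1 + exp(-γᵀz)} (logit link). The EM-based algorithm alternates between imputing the latent cured status and updating the two components.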
Joint Bayesian longitudinal models for mixed outcome types and associated model selection techniques
Seedorff N, Brown G, Scorza B and Petersen CA
Motivated by data measuring progression of leishmaniosis in a cohort of US dogs, we develop a Bayesian longitudinal model with autoregressive errors to jointly analyze ordinal and continuous outcomes. Multivariate methods can borrow strength across responses and may produce improved longitudinal forecasts of disease progression over univariate methods. We explore the performance of our proposed model under simulation and demonstrate that it has improved prediction accuracy over traditional Bayesian hierarchical models. We further identify an appropriate model selection criterion. We show that our method holds promise for use in the clinical setting, particularly when ordinal outcomes are measured alongside other variable types that may aid clinical decision making. This approach is particularly applicable when multiple, imperfect measures of disease progression are available.
Spatio-temporal clustering analysis using generalized lasso with an application to reveal the spread of Covid-19 cases in Japan
Rahardiantoro S and Sakamoto W
This study addressed the issue of determining multiple potential clusters with regularization approaches for the purpose of spatio-temporal clustering. The generalized lasso framework has the flexibility to incorporate adjacencies between objects in the penalty matrix and to detect multiple clusters. A generalized lasso model with two penalties is proposed, which can be separated into two generalized lasso models: trend filtering of the temporal effect and fused lasso of the spatial effect at each time point. To select the tuning parameters, approximate leave-one-out cross-validation (ALOCV) and generalized cross-validation (GCV) are considered. A simulation study is conducted to evaluate the proposed method against other approaches under different problems and structures of multiple clusters. The generalized lasso with ALOCV and GCV provided smaller MSE in estimating the temporal and spatial effects than the unpenalized method, ridge, lasso, and generalized ridge. In temporal effect detection, the generalized lasso with ALOCV and GCV provided relatively smaller and more stable MSE than other methods, for different structures of true risk values. In spatial effect detection, the generalized lasso with ALOCV provided higher edge detection accuracy. The simulation also suggested using a common tuning parameter over all time points in spatial clustering. Finally, the proposed method was applied to weekly Covid-19 data in Japan from March 21, 2020, to September 11, 2021, along with an interpretation of the dynamic behavior of multiple clusters.
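As a sketch of the generalized lasso framework mentioned above (notation mine, not taken from the paper), the effect estimates solve a penalized least squares problem of the form:

```latex
\hat{\theta} = \arg\min_{\theta}\;
  \tfrac{1}{2}\,\lVert y - \theta \rVert_2^2 + \lambda\,\lVert D\theta \rVert_1
```

where the rows of the penalty matrix D encode temporal differences (trend filtering of the temporal effect) and differences between spatially adjacent regions (fused lasso of the spatial effect); components of Dθ shrunk exactly to zero fuse neighbouring time points or regions, which is how multiple clusters are detected. ALOCV and GCV are then used to choose λ.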
Policy evaluation using model over-fitting: the Nordic case
Tapia A, González SL, Vergara JR, Villafuerte M and Montiel LV
The aim of this article is to better understand the effects of different public policy alternatives for handling the COVID-19 pandemic. In this work we use the susceptible-infected-recovered (SIR) model to find which of these policies have an actual impact on the dynamics of the spread. Starting with raw data on the number of deceased people in a country, we over-fit our SIR model to find the times at which the main parameters, the number of daily contacts and the probability of contagion, require adjustments. For each of these times, we go to historic records to find policies and social events that could explain these changes. This approach helps to evaluate events through the eyes of the popular epidemiological SIR model, and to find insights that are hard to recognize in a standard econometric model.
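To make the kind of SIR dynamics referred to above concrete, here is a minimal sketch with a piecewise-constant contact rate that can be re-adjusted at chosen change times; the parameter values and change points are illustrative, not the paper's estimates:

```python
import numpy as np
from scipy.integrate import solve_ivp

def sir_rhs(t, y, beta_schedule, gamma):
    """SIR right-hand side with a piecewise-constant transmission rate beta(t)."""
    S, I, R = y
    # beta_schedule: list of (start_time, beta); use the latest entry active at time t
    beta = [b for (t0, b) in beta_schedule if t >= t0][-1]
    N = S + I + R
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return [dS, dI, dR]

# Illustrative values: transmission drops after a hypothetical intervention on day 30
beta_schedule = [(0.0, 0.40), (30.0, 0.15)]
gamma = 0.10
y0 = [0.999, 0.001, 0.0]                      # initial S, I, R fractions
sol = solve_ivp(sir_rhs, (0, 180), y0, args=(beta_schedule, gamma), dense_output=True)
print(sol.y[1].max())                         # peak infected fraction under this schedule
```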
Multiple partition Markov model for B.1.1.7, B.1.351, B.1.617.2, and P.1 variants of SARS-CoV 2 virus
García JE, González-López VA and Tasca GH
With tools originating from Markov processes, we investigate the similarities and differences between genomic sequences from four variants of the SARS-CoV 2 virus: B.1.1.7 (UK), B.1.351 (South Africa), B.1.617.2 (India), and P.1 (Brazil). We treat the virus's sequences as samples of finite-memory Markov processes acting on the alphabet {a, c, g, t}. We model each sequence, revealing some heterogeneity between sequences belonging to the same variant. We identified the five most representative sequences for each variant using a robust notion of classification, see Fernández et al. (Math Methods Appl Sci 43(13):7537-7549, 10.1002/mma.5705). Using a notion derived from a metric between processes, see García et al. (Appl Stoch Models Bus Ind 34(6):868-878, 10.1002/asmb.2346), we identify four groups, each group representing a variant. This metric also detects a global proximity between the variants B.1.351 and B.1.1.7. With the selected sequences, we assemble a multiple partition model, see Cordeiro et al. (Math Methods Appl Sci 43(13):7677-7691, 10.1002/mma.6079), revealing in which states of the state space the variants differ with respect to the mechanism for choosing the next element of the sequence. Through this model, we identify that the variants differ in their transition probabilities in eleven states out of a total of 256. For these eleven states, we reveal how the transition probabilities change from variant (or group of variants) to variant (or group of variants). In other words, we indicate precisely the stochastic reasons for the discrepancies.
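As a generic illustration of the finite-memory Markov machinery involved (not the authors' multiple partition procedure itself), the transition probabilities of an order-k chain over the alphabet {a, c, g, t} can be estimated from a sequence by counting; with k = 4 the state space has 4^4 = 256 states, matching the total mentioned above:

```python
from collections import Counter, defaultdict

def transition_probs(seq, k=4):
    """Estimate P(next base | previous k bases) from a DNA sequence (order-k Markov chain)."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        state, nxt = seq[i:i + k], seq[i + k]
        counts[state][nxt] += 1
    return {s: {b: c / sum(cnt.values()) for b, c in cnt.items()}
            for s, cnt in counts.items()}

# Toy sequence for illustration; a real genome would be read from a file instead
probs = transition_probs("acgtacgttacgacgtgacctaacgt", k=2)
print(probs["ac"])   # estimated transition probabilities out of the state "ac"
```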
Multiway clustering with time-varying parameters
Cerqueti R, Mattera R and Scepi G
This paper proposes a clustering approach for multivariate time series with time-varying parameters in a multiway framework. Although clustering techniques based on time series distribution characteristics have been extensively studied, methods based on time-varying parameters have only recently been explored and are missing for multivariate time series. This paper fills the gap by proposing a multiway approach for distribution-based clustering of multivariate time series. To show the validity of the proposed clustering procedure, we provide both a simulation study and an application to real air quality time series data.
An extended approach for the generalized powered uniform distribution
Rondero-Guerrero C, González-Hernández I and Soto-Campos C
A new uniform distribution model, the generalized powered uniform distribution, is introduced in this paper; it is based on incorporating a parameter into the probability density function (pdf) associated with the power of the random variable values and includes a powered mean operator. From this new model, the shape properties of the pdf, the higher-order moments, the moment generating function, a model for simulating the distribution, and other important statistics can be derived. This approach generalizes the distribution presented by Jayakumar and Sankaran (2016). Two sets of real data, related to COVID-19 and bladder cancer, were analyzed to demonstrate the proposed model's potential. The maximum likelihood method was used to calculate the parameter estimators by applying the maxLik package in R. The results showed that this new model is more flexible and useful than other comparable models.
Bayesian variable selection using Knockoffs with applications to genomics
Yap JK and Gauran IIM
Given the costliness of HIV drug therapy research, it is important not only to maximize the true positive rate (TPR) by identifying which genetic markers are related to drug resistance, but also to minimize the false discovery rate (FDR) by reducing the number of incorrectly selected markers unrelated to drug resistance. In this study, we propose a multiple testing procedure that unifies key concepts in computational statistics, namely Model-free Knockoffs, Bayesian variable selection, and the local false discovery rate. We develop an algorithm that utilizes the augmented data-Knockoff matrix and implements the Bayesian Lasso. We then identify signals using test statistics based on Markov chain Monte Carlo outputs and the local false discovery rate. We test our proposed method against non-Bayesian methods such as Benjamini-Hochberg (BHq) and Lasso regression in terms of TPR and FDR. Using numerical studies, we show that the proposed method yields lower FDR compared to BHq and Lasso in certain cases, such as low- and equi-dimensional settings. We also discuss an application to an HIV-1 data set, with the aim of applying the method, in future work, to the analysis of genetic markers linked to drug-resistant HIV in the Philippines.
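For context, once a knockoff-type statistic W_j has been computed for each marker (in the proposed procedure these would come from the Bayesian Lasso fit on the augmented data-Knockoff matrix; the statistics below are simulated placeholders), the standard knockoff+ filter selects a data-dependent threshold controlling the FDR at level q. This is a generic sketch of that filter, not the authors' full method:

```python
import numpy as np

def knockoff_threshold(W, q=0.10):
    """Knockoff+ threshold: smallest t with (1 + #{W_j <= -t}) / max(#{W_j >= t}, 1) <= q."""
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf

rng = np.random.default_rng(0)
W = np.concatenate([rng.normal(3, 1, 20), rng.normal(0, 1, 180)])  # placeholder statistics
t = knockoff_threshold(W, q=0.10)
selected = np.where(W >= t)[0]   # indices of markers declared significant
print(len(selected))
```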
Topic based quality indexes assessment through sentiment
Ortu M, Frigau L and Contu G
This paper proposes a new methodology called TOpic modeling Based Index Assessment through Sentiment (TOBIAS). The method aims to model the effects of the topics, moods, and sentiments of the comments describing a phenomenon on its overall rating. TOBIAS is built by combining different techniques and methodologies. First, Sentiment Analysis identifies sentiments, emotions, and moods, and Topic Modeling finds the main relevant topics in the comments. Then, Partial Least Squares Path Modeling estimates how they affect an overall rating that summarizes the performance of the analyzed phenomenon. We applied TOBIAS to a real case study on the quality of university courses as evaluated by students of the University of Cagliari (Italy). We found that TOBIAS provides interpretable results on how the topics discussed by students, together with their expressed sentiments, emotions, and moods, relate to the overall rating.
A hybrid approach for the analysis of complex categorical data structures: assessment of latent distance learning perception in higher education
Iannario M, D'Enza AI and Romano R
A long tradition of analysing ordinal response data deals with parametric models, starting with the seminal approach of cumulative models. When data are collected by means of Likert-scale survey questions in which several scored items measure one or more latent traits, one of the thorny issues is how to deal with the ordered categories. A stacked ensemble (or hybrid) model is introduced here to tackle the limitations of simply summing up the items. In particular, responses to multiple items are synthesised into a single meta-item, defined via a joint data reduction approach; the meta-item is then modelled according to regression approaches for ordered polytomous variables accounting for potential scaling effects. Finally, a recursive partitioning method yielding trees provides automatic variable selection. The performance of the method is evaluated empirically using a survey on Distance Learning perception.
Results and student perspectives on a web-scraping assignment from Utah State University's data technologies course to evaluate the African activity in the statistical computing community
Fleming A, Coltrin JD, Medri J, Hilyard C, Tellez R and Symanzik J
In 2019, members of the Executive Committee of the International Association for Statistical Computing (IASC) were contacted by members of the IASC from Africa asking whether it would be feasible to establish a new regional IASC section in Africa. The establishment of a new regional section requires several steps that are outlined in the IASC Statutes at https://iasc-isi.org/statutes/. The approval likely depends on whether the proposed new regional section has the potential to conduct typical section activities, such as organizing regional conferences, workshops, and short courses. To establish whether it is feasible to add a regional section in Africa, the IASC must know whether there is currently enough high-level activity within African countries with respect to computational statistics. To answer this question, we looked at author affiliations of articles published in the journals Computational Statistics (COST) and Computational Statistics & Data Analysis (CSDA) from 2015 to 2020 and used these data as a proxy to compare the productivity of authors with an affiliation in Africa in 2019 and 2020 with that of authors with an affiliation in Latin America in 2015 and 2016. This article presents quantitative results for the questions above, provides insight on how students from Utah State University's STAT 5080/6080 "Data Technologies" course from the Fall 2019 semester used web scraping techniques in a homework assignment to gather author affiliations from COST and CSDA to answer these questions, and includes the evaluation of student feedback obtained after the end of the course.
Permutation based testing on covariance separability
Park S, Lim J, Wang X and Lee S
Separability is an attractive feature of covariance matrices or matrix variate data, which can improve and simplify many multivariate procedures. Due to its importance, testing separability has attracted much attention in the past. The procedures in the literature are of two types: likelihood ratio tests (LRT) and Rao's score tests (RST). Both are based on the normality assumption or on the large-sample asymptotic properties of the test statistics. In this paper, we develop a new approach that is very different from existing ones. We propose to reformulate the null hypothesis (the separability of a covariance matrix of interest) into many sub-hypotheses (the separability of sub-matrices of the covariance matrix), which are testable using a permutation-based procedure. We then combine the testing results of the sub-hypotheses using the Bonferroni and two-stage additive procedures. Our permutation-based procedures are inherently distribution-free and thus robust to non-normality of the data. In addition, unlike the LRT, they are applicable to situations where the sample size is smaller than the number of unknown parameters in the covariance matrix. Our numerical study and data examples show the advantages of our procedures over the existing LRT and RST.
Neural network gradient Hamiltonian Monte Carlo
Li L, Holbrook A, Shahbaba B and Baldi P
Hamiltonian Monte Carlo is a widely used algorithm for sampling from posterior distributions of complex Bayesian models. It can efficiently explore high-dimensional parameter spaces guided by simulated Hamiltonian flows. However, the algorithm requires repeated gradient calculations, and these computations become increasingly burdensome as data sets scale. We present a method to substantially reduce the computational burden by using a neural network to approximate the gradient. First, we prove that the proposed method still maintains convergence to the true distribution even though the approximate gradient no longer comes from a Hamiltonian system. Second, we conduct experiments on synthetic examples and real data to validate the proposed method.
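A minimal sketch of how a surrogate gradient slots into the leapfrog integrator of HMC; here grad_logpost would be the neural-network approximation, while a simple standard-normal target with a hand-written gradient stands in for it:

```python
import numpy as np

def hmc_step(theta, log_post, grad_logpost, eps=0.1, n_leapfrog=20, rng=np.random):
    """One HMC proposal using a (possibly approximate) gradient of the log posterior."""
    p = rng.standard_normal(theta.shape)
    theta_new, p_new = theta.copy(), p.copy()
    for _ in range(n_leapfrog):                      # leapfrog integration with surrogate gradient
        p_new += 0.5 * eps * grad_logpost(theta_new)
        theta_new += eps * p_new
        p_new += 0.5 * eps * grad_logpost(theta_new)
    # Metropolis correction uses the exact log posterior, so the true target is preserved
    log_accept = (log_post(theta_new) - 0.5 * p_new @ p_new) - (log_post(theta) - 0.5 * p @ p)
    return theta_new if np.log(rng.uniform()) < log_accept else theta

# Stand-in target: 5-dimensional standard normal; replace grad_logpost with a trained surrogate
log_post = lambda th: -0.5 * th @ th
grad_logpost = lambda th: -th
theta, draws = np.zeros(5), []
for _ in range(1000):
    theta = hmc_step(theta, log_post, grad_logpost)
    draws.append(theta.copy())
print(np.mean(draws, axis=0))                        # close to the zero mean of the target
```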
Efficient inference in state-space models through adaptive learning in online Monte Carlo expectation maximization
Henderson D and Lunter G
Expectation maximization (EM) is a technique for estimating maximum-likelihood parameters of a latent variable model from observed data by alternating between taking expectations of sufficient statistics and maximizing the expected log likelihood. For situations where the sufficient statistics are intractable, stochastic approximation EM (SAEM) is often used, which employs Monte Carlo techniques to approximate the expected log likelihood. Two common implementations of SAEM, batch EM (BEM) and online EM (OEM), are parameterized by a "learning rate", and their efficiency depends strongly on this parameter. We propose an extension of the OEM algorithm, termed Introspective Online Expectation Maximization (IOEM), which removes the need to specify this parameter by adapting the learning rate to trends in the parameter updates. We show that our algorithm matches the efficiency of the optimal BEM and OEM algorithms in multiple models, and that the efficiency of IOEM can exceed that of BEM/OEM with optimal learning rates when the model has many parameters. Finally, we use IOEM to fit two models to a financial time series. A Python implementation is available at https://github.com/luntergroup/IOEM.git.
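To make the role of the learning rate concrete, here is a toy online-EM-style stochastic-approximation update of a sufficient statistic (estimating a Gaussian mean) with the usual polynomially decaying rate; it is the decay exponent alpha that governs efficiency and that IOEM adapts automatically. This generic sketch is not taken from the linked repository:

```python
import numpy as np

def online_em_mean(data, alpha=0.6):
    """Stochastic-approximation update s_t = (1 - g_t) * s_{t-1} + g_t * y_t with g_t = t^(-alpha)."""
    s = 0.0
    for t, y in enumerate(data, start=1):
        gamma = t ** (-alpha)          # learning rate; efficiency depends strongly on alpha
        s = (1 - gamma) * s + gamma * y
    return s

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=5000)
print(online_em_mean(data, alpha=0.6))   # close to the true mean 2.0
```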
Estimating the number of clusters via a corrected clustering instability
Haslbeck JMB and Wulff DU
We improve instability-based methods for the selection of the number of clusters in cluster analysis by developing a corrected clustering distance that corrects for the unwanted influence of the distribution of cluster sizes on cluster instability. We show that our corrected instability measure outperforms current instability-based measures across the whole sequence of possible numbers of clusters k, overcoming limitations of current instability-based methods for large k. We also compare, for the first time, model-based and model-free approaches to determining cluster instability and find their performance to be comparable. We make our method available in the R package cstab.
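As background on the uncorrected instability idea that the authors build on, here is a generic model-free sketch using bootstrap pairs and k-means; the cluster-size correction proposed in the paper is not included, and the data and settings are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def instability(X, k, n_pairs=20, seed=0):
    """Average co-membership disagreement between clusterings fit on pairs of bootstrap samples."""
    rng = np.random.default_rng(seed)
    n, scores = len(X), []
    for _ in range(n_pairs):
        labels = []
        for _ in range(2):
            idx = rng.integers(0, n, n)                               # bootstrap sample
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx])
            labels.append(km.predict(X))                              # assign all original points
        same1 = labels[0][:, None] == labels[0][None, :]
        same2 = labels[1][:, None] == labels[1][None, :]
        scores.append(np.mean(same1 != same2))                        # pairwise disagreement
    return np.mean(scores)

X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [6, 6], [0, 6])])
print({k: round(instability(X, k), 3) for k in range(2, 6)})  # typically smallest near true k = 3
```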
Decomposition of the Gini index by income source for aggregated data and its applications
Shao B
The Gini index is a well-known single measure of inequality. The purpose of this article is to explore a matrix structure of the Gini index in a setting with multiple income sources. Using matrices, we analyze the decomposition of the Gini index by income source and derive an explicit formula for the factors in terms of the associated percentile levels based on aggregated data reporting. Each factor is shown to be the sum of two split-off parts of the income within a percentile bracket, which make an unequalizing and an equalizing contribution to the total inequality, respectively. We provide code and apply the methodology to several data sets, including a sample of European aggregated income reporting in 2014, for illustration. A byproduct of the Gini decomposition provides a matrix approach to the decomposition of the associated Lorenz curve in terms of the density distribution matrix and a Toeplitz matrix.
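As a small illustration of computing a Gini index from aggregated, percentile-bracket data via the Lorenz curve and the trapezoidal rule (a generic sketch; the matrix decomposition by income source in the paper goes further than this), with illustrative quintile shares:

```python
import numpy as np

def gini_from_shares(pop_shares, income_shares):
    """Gini index from aggregated data: population and income shares per bracket (poorest first)."""
    p = np.concatenate([[0.0], np.cumsum(pop_shares)])       # cumulative population proportions
    L = np.concatenate([[0.0], np.cumsum(income_shares)])    # Lorenz curve ordinates
    area_under_lorenz = np.sum((p[1:] - p[:-1]) * (L[1:] + L[:-1]) / 2)  # trapezoidal rule
    return 1.0 - 2.0 * area_under_lorenz

# Illustrative quintiles: each 20% of the population and its share of total income
pop_shares    = [0.2, 0.2, 0.2, 0.2, 0.2]
income_shares = [0.05, 0.10, 0.15, 0.25, 0.45]
print(round(gini_from_shares(pop_shares, income_shares), 3))   # 0.38 for these shares
```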
Direct statistical inference for finite Markov jump processes via the matrix exponential
Sherlock C
Given noisy, partial observations of a time-homogeneous, finite-state-space Markov chain, conceptually simple, direct statistical inference is available, in theory, via its rate matrix, or infinitesimal generator, Q, since exp(Qt) is the transition matrix over time t. However, perhaps because of inadequate tools for matrix exponentiation in programming languages commonly used amongst statisticians, or a belief that the necessary calculations are prohibitively expensive, statistical inference for continuous-time Markov chains with a large but finite state space is typically conducted via particle MCMC or other relatively complex inference schemes. When, as in many applications, Q arises from a reaction network, it is usually sparse. We describe variations on known algorithms which allow fast, robust and accurate evaluation of the product of a non-negative vector with the exponential of a large, sparse rate matrix. Our implementation uses relatively recently developed, efficient linear algebra tools that take advantage of such sparsity. We demonstrate the straightforward statistical application of the key algorithm on a model for the mixing of two alleles in a population and on the Susceptible-Infectious-Removed epidemic model.
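The core computation described above, the action of the matrix exponential of a sparse rate matrix on a vector, is available off the shelf in SciPy; a minimal sketch using a toy three-state generator (not one of the paper's models or its specific algorithms):

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import expm_multiply

# Toy rate matrix Q of a 3-state Markov jump process (each row sums to zero)
Q = csc_matrix(np.array([[-1.0,  0.7,  0.3],
                         [ 0.4, -0.9,  0.5],
                         [ 0.0,  0.2, -0.2]]))
nu = np.array([1.0, 0.0, 0.0])            # initial distribution (row vector)
t = 2.5

# Distribution at time t: nu * exp(Qt), computed without forming exp(Qt) explicitly
p_t = expm_multiply(Q.T * t, nu)
print(p_t, p_t.sum())                      # a probability vector summing to 1
```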
New classes of tests for the Weibull distribution using Stein's method in the presence of random right censoring
Bothma E, Allison JS and Visagie IJH
We develop two new classes of tests for the Weibull distribution based on Stein's method. The proposed tests are applied in the full sample case as well as in the presence of random right censoring. We investigate the finite sample performance of the new tests using a comprehensive Monte Carlo study. In both the absence and presence of censoring, it is found that the newly proposed classes of tests outperform competing tests against the majority of the distributions considered. In the cases where censoring is present we consider various censoring distributions. Some remarks on the asymptotic properties of the proposed tests are included. We present another result of independent interest; a test initially proposed for use with full samples is amended to allow for testing for the Weibull distribution in the presence of censoring. The techniques developed in the paper are illustrated using two practical examples.
Non-parametric seasonal unit root tests under periodic non-stationary volatility
Göğebakan KÇ and Eroğlu BA
This paper presents a new non-parametric seasonal unit root testing framework that is robust to periodic non-stationary volatility in the innovation variance, obtained by extending the fractional seasonal variance ratio unit root tests of Eroğlu et al. (Econ Lett 167:75-80, 2018). The setup allows for both the periodic heteroskedasticity structure of Burridge and Taylor (J Econom 104(1):91-117, 2001) and the non-stationary volatility structure of Cavaliere and Taylor (Econ Theory 24(1):43-71, 2008). We show that the limiting null distributions of the variance ratio tests depend on nuisance parameters derived from the underlying volatility process. Monte Carlo simulations show that the standard variance ratio tests can be substantially oversized in the presence of such effects. Consequently, we propose wild bootstrap implementations of the variance ratio tests. The wild bootstrap resampling schemes are shown to deliver asymptotically pivotal inference. The simulation evidence shows that the proposed bootstrap tests perform well in practice and essentially correct the size problems observed in the standard fractional seasonal variance ratio tests, even under extreme patterns of heteroskedasticity.
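To illustrate the wild bootstrap resampling scheme referred to above in its simplest form (a generic sketch, not the full fractional seasonal variance ratio procedure), residuals are re-signed by i.i.d. Rademacher multipliers so that each bootstrap series reproduces the observed pattern of heteroskedasticity:

```python
import numpy as np

def wild_bootstrap_series(residuals, n_boot=999, seed=0):
    """Generate wild bootstrap replicates eps*_t = w_t * eps_t with Rademacher weights w_t."""
    rng = np.random.default_rng(seed)
    n = len(residuals)
    w = rng.choice([-1.0, 1.0], size=(n_boot, n))    # i.i.d. Rademacher multipliers
    return w * residuals                              # each row is one bootstrap sample

# Heteroskedastic toy residuals: volatility shifts part-way through the sample
rng = np.random.default_rng(1)
eps = np.concatenate([rng.normal(0, 1, 100), rng.normal(0, 3, 100)])
boot = wild_bootstrap_series(eps)
print(boot.shape, boot[:, :100].std(), boot[:, 100:].std())  # volatility pattern is preserved
```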
Semi-supervised adapted HMMs for P2P credit scoring systems with reject inference
El Annas M, Benyacoub B and Ouzineb M
The majority of current credit-scoring models used for loan approval are built on the basis of information from accepted credit applicants whose ability to repay the loan is known. This situation generates what is called selection bias: the sample is not representative of the population of applicants, since rejected applications are excluded. This affects the validity of such models from both a statistical and an economic point of view, especially for models used in peer-to-peer lending platforms, whose rejection rates are extremely high. The method of inferring information about rejected applicants in the construction of credit scoring models is known as reject inference. This study proposes a semi-supervised learning framework based on hidden Markov models (SSHMM) as a novel method of reject inference. Real data from the Lending Club platform, the most used online lending marketplace in the United States as well as the rest of the world, are used to assess the effectiveness of our method against existing approaches. The results of this study clearly illustrate the proposed method's superiority, stability, and adaptability.
Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data
Weisser C, Gerloff C, Thielmann A, Python A, Reuter A, Kneib T and Säfken B
Topic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic model. We compare the performance of the standard LDA topic model with the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), which are specifically designed for sparse data. To compare the performance of the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that standard coherence scores that are often used for the evaluation of topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model.