STATISTICS & PROBABILITY LETTERS

On exact Bayesian credible sets for discrete parameters
Song C and Li B
We introduce a generalized Bayesian credible set that can attain any preassigned credible level, addressing a limitation of existing credible sets. The construction exploits a connection between the highest posterior density set and the Neyman-Pearson lemma.
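One way to see how an exact preassigned level can be attained for a discrete parameter is Neyman-Pearson-style randomization on the boundary value. The sketch below is our own illustration of that idea (function name and interface are ours), not necessarily the authors' construction: parameter values enter the set in decreasing order of posterior probability, and the last, partially covered value is included with just the probability needed to hit the target level exactly.

```python
import numpy as np

def randomized_hpd(posterior, level, rng):
    """Randomized highest-posterior-density credible set for a discrete
    parameter that attains `level` exactly in expectation (a sketch, not
    the paper's exact construction). `posterior` maps parameter values
    to their posterior probabilities."""
    order = sorted(posterior, key=posterior.get, reverse=True)
    included, mass = [], 0.0
    for theta in order:
        p = posterior[theta]
        if mass + p <= level:
            included.append(theta)
            mass += p
        else:
            # Neyman-Pearson-style randomization on the boundary value:
            # include it with probability (level - mass) / p.
            if rng.random() < (level - mass) / p:
                included.append(theta)
            break
    return included
```

With posterior masses (0.5, 0.3, 0.2) and level 0.7, the first value is always included and the second enters with probability 2/3, so the expected posterior mass of the set is exactly 0.7.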
Universally Consistent K-Sample Tests via Dependence Measures
Panda S, Shen C, Perry R, Zorn J, Lutz A, Priebe CE and Vogelstein JT
The K-sample testing problem involves determining whether K groups of data points are each drawn from the same distribution. Analysis of variance is arguably the most classical method for testing mean differences, and several recent methods test distributional differences more generally. In this paper, we demonstrate the existence of a transformation that allows K-sample testing to be carried out using any dependence measure. Consequently, universally consistent K-sample testing can be achieved using a universally consistent dependence measure, such as distance correlation or the Hilbert-Schmidt independence criterion. This enables a wide range of dependence measures to be easily applied to K-sample testing.
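The general recipe can be sketched as follows (a generic label-encoding version with our own helper names and a biased sample distance covariance, not the authors' exact implementation): pool the K samples, one-hot encode group membership, and permutation-test a dependence measure between the pooled data and the labels.

```python
import numpy as np

def _centered_dist(a):
    """Doubly centered pairwise Euclidean distance matrix."""
    if a.ndim == 1:
        d = np.abs(a[:, None] - a[None, :])
    else:
        d = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=2)
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

def dcov_sq(x, y):
    """Biased sample squared distance covariance."""
    return (_centered_dist(x) * _centered_dist(y)).mean()

def k_sample_pvalue(samples, n_perm=300, seed=0):
    """K-sample test via the label-encoding trick: pool the samples,
    one-hot encode group membership, and permutation-test distance
    covariance between pooled data and labels (a sketch of the recipe)."""
    rng = np.random.default_rng(seed)
    x = np.concatenate(samples)
    labels = np.repeat(np.arange(len(samples)), [len(s) for s in samples])
    y = np.eye(len(samples))[labels]          # one-hot label matrix
    stat = dcov_sq(x, y)
    perms = [dcov_sq(x, y[rng.permutation(len(y))]) for _ in range(n_perm)]
    return (1 + sum(p >= stat for p in perms)) / (1 + n_perm)
```

Permuting the label rows breaks any data-label dependence, so the permutation distribution mimics the null in which all K groups share one distribution.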
A Bayesian Spatial Scan Statistic for Multinomial Data
Self S and Nolan M
Spatial scan statistics are commonly used to detect clustering. We present a Bayesian spatial scan statistic for multinomial data. After validating our method with a simulation study, we use it to detect clusters of SARS-CoV-2 infection/immunity in South Carolina.
Proximal Causal Inference without Uniqueness Assumptions
Zhang J, Li W, Miao W and Tchetgen ET
We consider identification and inference about a counterfactual outcome mean when there is unmeasured confounding using tools from proximal causal inference. Proximal causal inference requires the existence of solutions to at least one of two integral equations. We motivate the existence of solutions to these integral equations by demonstrating that, assuming a solution to one of them exists, root-n estimability of a mean functional of that solution requires the existence of a solution to the other integral equation. Solutions to the integral equations may not be unique, which complicates estimation and inference. We construct a consistent estimator for the solution set of one of the integral equations and then adapt the theory of extremum estimators to find, from the estimated set, a consistent estimator for a uniquely defined solution. A debiased estimator is shown to be root-n consistent, regular, and semiparametrically locally efficient under additional regularity conditions.
A Bartlett-type correction for likelihood ratio tests with application to testing equality of Gaussian graphical models
Banzato E, Chiogna M, Djordjilović V and Risso D
This work defines a new correction for the likelihood ratio test for a two-sample problem within the multivariate normal context. This correction applies to decomposable graphical models, where testing equality of distributions can be decomposed into lower dimensional problems.
Identification of the outcome distribution and sensitivity analysis under weak confounder-instrument interaction
Mao L
Recently, Wang and Tchetgen Tchetgen (2018) showed that the global average treatment effect is identifiable even in the presence of unmeasured confounders so long as they do not modify the instrument's additive effect on the treatment. We use a simple and direct method to show that this no-interaction assumption allows identification of the entire outcome distribution, which leads to multiply robust estimation procedures for nonlinear functionals like the quantile and Mann-Whitney treatment effects. Similarly, we can bound these causal estimands through the outcome distribution in sensitivity analysis against confounder-instrument interaction.
Learning Sparse Deep Neural Networks with a Spike-and-Slab Prior
Sun Y, Song Q and Liang F
Deep learning has achieved great successes in many machine learning tasks. However, deep neural networks (DNNs) are often severely over-parameterized, making them computationally expensive, memory intensive, less interpretable, and miscalibrated. We study sparse DNNs under the Bayesian framework: we establish posterior consistency and structure selection consistency for Bayesian DNNs with a spike-and-slab prior, and illustrate their performance using examples on high-dimensional nonlinear variable selection, large network compression, and model calibration. Our numerical results indicate that sparsity is essential for improving the prediction accuracy and calibration of DNNs.
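As a minimal illustration of the prior itself (hyperparameter values and function name below are illustrative choices of ours, not the paper's), each weight can be drawn from a two-component Gaussian mixture: a near-zero "spike" with high probability and a diffuse "slab" otherwise, which is what drives most weights toward zero and yields a sparse network.

```python
import numpy as np

def sample_spike_slab(n, pi=0.1, sigma_spike=1e-3, sigma_slab=1.0, seed=0):
    """Draw n weights from a Gaussian spike-and-slab mixture prior:
    with probability `pi` a weight comes from the wide 'slab',
    otherwise from the near-zero 'spike'. Returns the weights and the
    slab-membership indicators (hyperparameters are illustrative)."""
    rng = np.random.default_rng(seed)
    z = rng.random(n) < pi                      # slab indicators
    w = np.where(z,
                 rng.normal(0.0, sigma_slab, n),    # slab component
                 rng.normal(0.0, sigma_spike, n))   # spike component
    return w, z
```

In a Bayesian DNN, posterior inclusion probabilities for such indicators are what make structure selection (pruning spike weights) possible.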
A goodness-of-fit test based on neural network sieve estimators
Shen X, Jiang C, Sakhanenko L and Lu Q
Neural networks have become increasingly popular in the field of machine learning and have been successfully used in many applied fields (e.g., image recognition). As more and more research is conducted on neural networks, we have gained a better understanding of their statistical properties. While many studies focus on bounding the prediction error of neural network estimators, limited research has been done on the statistical inference of neural networks. From a statistical point of view, statistical inference for neural networks is of great interest, as it could facilitate hypothesis testing in many fields (e.g., genetics, epidemiology, and medical science). In this paper, we propose a goodness-of-fit test statistic based on neural network sieve estimators. The test statistic follows an asymptotic distribution, which makes it easy to use in practice. We verify the theoretical asymptotic results via simulation studies and a real data application.
Middle censoring in the multinomial distribution with applications
Jammalamadaka SR and Bapat SR
In a multinomial set-up with k possible outcomes, we develop estimation under a "middle censoring" paradigm, as defined in Jammalamadaka and Mangalam (2003). This problem has many special features because of the inter-dependent probabilities, which we explore here.
Test-statistic correlation and data-row correlation
Zhuo B, Jiang D and Di Y
When a statistical test is repeatedly applied to rows of a data matrix, correlations among data rows will give rise to correlations among corresponding test statistics. We investigate the relationship between test-statistic correlation and data-row correlation and discuss its implications.
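The phenomenon is easy to demonstrate by simulation (the function name and settings below are our own): for a pair of data rows whose entries are bivariate normal with correlation rho, the one-sample t-statistics computed from the two rows are themselves correlated at roughly the same level.

```python
import numpy as np

def tstat_correlation(rho, n_cols=50, n_reps=2000, seed=0):
    """Monte Carlo illustration: generate pairs of data rows whose
    entries are bivariate normal with correlation `rho`, compute a
    one-sample t-statistic for each row, and return the empirical
    correlation of the two t-statistics across replications."""
    rng = np.random.default_rng(seed)
    t1, t2 = np.empty(n_reps), np.empty(n_reps)

    def t_stat(x):
        return x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))

    for r in range(n_reps):
        z = rng.standard_normal((2, n_cols))
        x1 = z[0]
        x2 = rho * z[0] + np.sqrt(1 - rho**2) * z[1]  # row correlation rho
        t1[r], t2[r] = t_stat(x1), t_stat(x2)
    return np.corrcoef(t1, t2)[0, 1]
```

Running this with rho = 0.8 gives a test-statistic correlation close to 0.8, while independent rows (rho = 0) give statistics that are nearly uncorrelated.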
Tests for regression coefficients in high dimensional partially linear models
Liu Y, Zhang S, Ma S and Zhang Q
We propose a U-statistic-based test for regression coefficients in high dimensional partially linear models. In addition, the proposed method is extended to test a subset of the coefficients. Asymptotic distributions of the test statistics are established. Simulation studies demonstrate satisfactory finite-sample performance.
On critical points of Gaussian random fields under diffeomorphic transformations
Cheng D and Schwartzman A
Let two smooth Gaussian random fields be parameterized on Riemannian manifolds and related by a diffeomorphic transformation. We study the expected number and height distribution of the critical points of one field in connection with those of the other. As an important case, when one of the fields is anisotropic, we show that its expected number of critical points is proportional to that of an isotropic field, while the height distribution remains the same.
A New Functional Representation of Broad Sense Agreement
Wei B, Dai T, Peng L, Guo Y and Manatunga A
We derive a new functional representation of the Broad Sense Agreement (BSA) index, which evaluates the agreement/alignment between a continuous measurement and an ordinal measurement. Using this result, we develop an alternative BSA estimator that can offer significant numerical advantages.
A Note on Monotonicity in Repeated Attempt Selection Models
Park S and Daniels MJ
Study designs where follow-up samples are collected through multiple attempts have been called repeated attempts designs. In this note we explore the monotonicity thought to underlie the models fit for these designs and show that it does not always hold.
Limit theorem for the Robin Hood game
Angel O, Matzavinos A and Roitershtein A
In its simplest form, the Robin Hood game is described by the following urn scheme: every day the Sheriff of Nottingham puts a fixed number of balls in an urn, and Robin then chooses a smaller fixed number of balls to remove. Robin's goal is to remove balls in such a way that none of them is left in the urn indefinitely. Consider the random time required for Robin to take out all balls put in the urn during the first n days. Our main result is a limit theorem for this time when Robin selects the balls uniformly at random: suitably normalized, it converges in law to a Fréchet distribution as the number of days goes to infinity.
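A quick Monte Carlo sketch of such an urn scheme (the per-day counts of three balls added and two removed are illustrative choices of ours, as are the function and variable names):

```python
import numpy as np

def robin_hood_time(n_days, s=3, r=2, horizon=10000, seed=0):
    """Simulate the urn scheme: each day the Sheriff adds `s` balls
    (tagged with their day of arrival) and Robin removes `r` balls
    uniformly at random. Return the first day by which every ball from
    the first `n_days` days has been removed, or None if that never
    happens within `horizon` days."""
    rng = np.random.default_rng(seed)
    urn = []                       # day tags of balls currently in the urn
    for day in range(1, horizon + 1):
        urn.extend([day] * s)
        for _ in range(min(r, len(urn))):
            urn.pop(rng.integers(len(urn)))    # uniform random removal
        if not any(tag <= n_days for tag in urn):
            return day
    return None                    # not cleared within the horizon
```

Even though the urn grows when s > r, each individual ball is eventually removed under uniform selection, so the clearing time is finite; repeating the simulation over many seeds gives an empirical picture of its heavy-tailed distribution.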
Multivariate spatial meta kriging
Guhaniyogi R and Banerjee S
This work extends earlier work on spatial meta-kriging for the analysis of large multivariate spatial datasets as commonly encountered in environmental and climate sciences. Spatial meta-kriging partitions the data into subsets, analyzes each subset using a Bayesian spatial process model, and then obtains approximate posterior inference for the entire dataset by optimally combining the individual posterior distributions from each subset. Importantly, as is often desired in spatial analysis, spatial meta-kriging offers posterior predictive inference at arbitrary locations for the outcome as well as for the residual spatial surface after accounting for spatially oriented predictors. Our current work extends the spatial meta-kriging idea to enhance the scalability of multivariate spatial Gaussian process models that use the linear model of co-regionalization (LMC) to account for correlation between multiple components. The approach is simple, intuitive, and scales multivariate spatial process models to big data effortlessly. A simulation study reveals the inferential and predictive accuracy offered by spatial meta-kriging for multivariate observations.
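The combination step can be illustrated in a toy one-dimensional Gaussian case. Assuming the subset posteriors are combined through their Wasserstein-2 barycenter (one combination rule used in the meta-kriging literature; the function below is our own toy stand-in, not the multivariate LMC machinery), the barycenter of univariate Gaussians is again Gaussian, with mean the weighted average of the subset means and standard deviation the weighted average of the subset standard deviations:

```python
import numpy as np

def gaussian_w2_barycenter(means, sds, weights=None):
    """Wasserstein-2 barycenter of one-dimensional Gaussian subset
    posteriors N(m_k, s_k^2): itself Gaussian, with mean the weighted
    average of the m_k and standard deviation the weighted average of
    the s_k. Equal weights by default."""
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    if weights is None:
        w = np.full(len(means), 1.0 / len(means))
    else:
        w = np.asarray(weights, dtype=float)
    return float(w @ means), float(w @ sds)
```

For example, the equal-weight barycenter of N(0, 1) and N(2, 9) is N(1, 4): mean (0 + 2)/2 = 1 and standard deviation (1 + 3)/2 = 2.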
Empirical likelihood ratio tests with power one
Vexler A and Zou L
In the 1970s, Professor Robbins and his coauthors extended the Ville and Wald inequality in order to derive fundamental theoretical results regarding likelihood ratio based sequential tests with power one. The law of the iterated logarithm confirms an optimality property of the power-one tests. In parallel with Robbins's decision-making procedures, we propose and examine sequential empirical likelihood ratio (ELR) tests with power one. In this setting, we develop nonparametric one- and two-sided ELR tests. It turns out that the proposed sequential ELR tests significantly outperform the classical nonparametric counterparts in many scenarios based on different underlying data distributions.
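A power-one sequential test can be sketched via Ville's inequality: under the null, a nonnegative likelihood ratio martingale exceeds 1/alpha with probability at most alpha, so rejecting at the first boundary crossing controls the type I error, while under the alternative the boundary is crossed almost surely. The simple Gaussian example below is our own parametric stand-in for the empirical likelihood setting, not the paper's procedure.

```python
import numpy as np

def sequential_lr_test(data, alpha=0.05, mu1=1.0):
    """Power-one sequential likelihood ratio test of H0: N(0,1) against
    H1: N(mu1,1). Reject at the first n where the likelihood ratio
    exceeds 1/alpha (Ville's inequality bounds the null probability of
    this event by alpha). Returns the 1-based stopping time, or None
    if the boundary is never crossed within the data stream."""
    log_lr = 0.0
    for n, x in enumerate(data, 1):
        log_lr += mu1 * x - mu1**2 / 2   # log of the N(mu1,1)/N(0,1) ratio
        if log_lr >= np.log(1.0 / alpha):
            return n
    return None
```

Under H1 the log likelihood ratio drifts upward by mu1^2/2 per observation, so the stopping time is finite with probability one (power one); under H0 it drifts downward and crossings are rare.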
Big data: Some statistical issues
Cox DR, Kartsonaki C and Keogh RH
A broad review is given of the impact of big data on various aspects of investigation. There is some but not total emphasis on issues in epidemiological research.
A practical guide to big data
Smirnova E, Ivanescu A, Bai J and Crainiceanu CM
Big Data is increasingly prevalent in science and data analysis. We provide a short tutorial for adapting to these changes and making the necessary adjustments to the academic culture to keep Biostatistics truly impactful in scientific research.
Statistical Challenges of Big Brain Network Data
Chung MK
We explore the main characteristics of big brain network data that pose unique statistical challenges. Brain networks are biologically expected to be both sparse and hierarchical. Such characteristics place specific topological constraints on the statistical approaches and models we can use effectively. We examine the limitations of current models used in the field, offer alternative approaches, and discuss new challenges.
Screening group variables in the proportional hazards model
Ahn KW, Sahr N and Kim S
We propose a method to screen group variables under the high dimensional group variable setting for the proportional hazards model. We study the sure screening property of the proposed method for independent and clustered survival data. The simulation study shows that the proposed method performs better for group variable screening than some existing procedures.