Functional Concurrent Regression Mixture Models Using Spiked Ewens-Pitman Attraction Priors
Functional concurrent, or varying-coefficient, regression models are a class of functional data analysis models in which functional covariates and outcomes are collected concurrently. Two active areas of research for this class of models are identifying influential functional covariates and clustering their relations across observations. In various applications, researchers have developed and applied methods to address these objectives separately. However, no approach currently performs both tasks simultaneously. In this paper, we propose a fully Bayesian functional concurrent regression mixture model that simultaneously performs functional variable selection and clustering for subject-specific trajectories. Our approach introduces a novel spiked Ewens-Pitman attraction prior that identifies and clusters subjects' trajectories marginally for each functional covariate while using similarities in subjects' auxiliary covariate patterns to inform clustering allocation. Using simulated data, we evaluate the clustering, variable selection, and parameter estimation performance of our approach and compare it with alternative spiked processes. We then apply our method to functional data collected in a novel, smartphone-based smoking cessation intervention study to investigate individual-level dynamic relations between smoking behaviors and potential risk factors.
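As background, the sketch below computes sequential allocation probabilities of a (non-spiked) Ewens-Pitman attraction distribution, under the parameterization usually attributed to Dahl, Day and Tsai (2017) as recalled here: the mass a Pitman-Yor-type scheme would assign to existing clusters is redistributed in proportion to pairwise similarities. It is a minimal illustration only; the function name, toy similarity matrix, and parameter values are assumptions, and the spiked extension proposed in the paper is not implemented.

```python
import numpy as np

def epa_allocation_probs(i, clusters, similarity, alpha, delta):
    """Allocation probabilities for item i under one parameterization of the
    Ewens-Pitman attraction distribution, given the items already allocated.

    clusters   : list of lists of already-allocated item indices
    similarity : (n, n) array of pairwise similarities lambda(i, j) > 0
    alpha      : mass parameter; delta : discount in [0, 1)
    Returns probabilities of joining each existing cluster, plus a new one.
    """
    allocated = [j for c in clusters for j in c]
    t = len(allocated)                      # number of items placed so far
    q = len(clusters)                       # number of existing clusters
    total_sim = sum(similarity[i, j] for j in allocated)
    probs = []
    for c in clusters:
        attraction = sum(similarity[i, j] for j in c) / total_sim
        probs.append((t - delta * q) / (alpha + t) * attraction)
    probs.append((alpha + delta * q) / (alpha + t))   # open a new cluster
    return np.array(probs)

# toy usage: three items already placed in two clusters, allocate item 3
x = np.random.default_rng(0).normal(size=4)
sim = np.exp(-np.abs(x[:, None] - x[None, :]))        # symmetric toy similarities
print(epa_allocation_probs(3, [[0, 1], [2]], sim, alpha=1.0, delta=0.2))
```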
How Trustworthy Is Your Tree? Bayesian Phylogenetic Effective Sample Size Through the Lens of Monte Carlo Error
Bayesian inference is a popular and widely-used approach to infer phylogenies (evolutionary trees). However, despite decades of widespread application, it remains difficult to judge how well a given Bayesian Markov chain Monte Carlo (MCMC) run explores the space of phylogenetic trees. In this paper, we investigate the Monte Carlo error of phylogenies, focusing on high-dimensional summaries of the posterior distribution, including variability in estimated edge/branch (known in phylogenetics as "split") probabilities and tree probabilities, and variability in the estimated summary tree. Specifically, we ask if there is any measure of effective sample size (ESS) applicable to phylogenetic trees which is capable of capturing the Monte Carlo error of these three summary measures. We find that there are some ESS measures capable of capturing the error inherent in using MCMC samples to approximate the posterior distributions on phylogenies. We term these tree ESS measures, and identify a set of three which are useful in practice for assessing the Monte Carlo error. Lastly, we present visualization tools that can improve comparisons between multiple independent MCMC runs by accounting for the Monte Carlo error present in each chain. Our results indicate that common post-MCMC workflows are insufficient to capture the inherent Monte Carlo error of the tree, and highlight the need for both within-chain mixing and between-chain convergence assessments.
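As a minimal illustration of one ingredient of such diagnostics, the sketch below treats the presence or absence of a single split across the sampled trees as a binary trace and computes a standard autocorrelation-based ESS, truncating the autocorrelation sum at the first non-positive lag. This is a generic estimator, not the specific tree ESS measures identified in the paper, and the toy trace is synthetic.

```python
import numpy as np

def ess(trace):
    """Autocorrelation-based effective sample size of a 1-d trace,
    truncating the autocorrelation sum at the first non-positive lag."""
    x = np.asarray(trace, dtype=float)
    n = len(x)
    x = x - x.mean()
    if np.allclose(x, 0.0):
        return float(n)                      # constant trace: no autocorrelation penalty
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    rho_sum = 0.0
    for k in range(1, n):
        if acf[k] <= 0:
            break
        rho_sum += acf[k]
    return n / (1.0 + 2.0 * rho_sum)

# toy usage: indicator of whether a given split appears in each sampled tree
split_trace = np.random.default_rng(1).integers(0, 2, size=2000)
print(ess(split_trace))
```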
A General Bayesian Functional Spatial Partitioning Method for Multiple Region Discovery Applied to Prostate Cancer MRI
Current protocols to estimate the number, size, and location of cancerous lesions in the prostate using multiparametric magnetic resonance imaging (mpMRI) are highly dependent on reader experience and expertise. Automatic voxel-wise cancer classifiers do not directly provide estimates of number, location, and size of cancerous lesions that are clinically important. Existing spatial partitioning methods estimate linear or piecewise-linear boundaries separating regions of local stationarity in spatially registered data and are inadequate for the application of lesion detection. Frequentist segmentation and clustering methods often require pre-specification of the number of clusters and do not quantify uncertainty. Previously, we developed a novel Bayesian functional spatial partitioning method to estimate the boundary surrounding a single cancerous lesion using data derived from mpMRI. We propose a Bayesian functional spatial partitioning method for multiple lesion detection with an unknown number of lesions. Our method utilizes functional estimation to model the smooth boundary curves surrounding each cancerous lesion. In a Reversible Jump Markov Chain Monte Carlo (RJ-MCMC) framework, we develop novel jump steps to jointly estimate and quantify uncertainty in the number of lesions, their boundaries, and the spatial parameters in each lesion. Through simulation we show that our method is robust to the shape of the lesions, number of lesions, and region-specific spatial processes. We illustrate our method through the detection of prostate cancer lesions using MRI.
Bayesian Analysis of Exponential Random Graph Models Using Stochastic Gradient Markov Chain Monte Carlo
The exponential random graph model (ERGM) is a popular model for social networks, which is known to have an intractable likelihood function. Sampling from the posterior for such a model is a long-standing problem in statistical research. We analyze the performance of the stochastic gradient Langevin dynamics (SGLD) algorithm (also known as noisy Langevin Monte Carlo) in tackling this problem, where the stochastic gradient is calculated via running a short Markov chain (the so-called inner Markov chain in this paper) at each iteration. We show that if the model size grows with the network size slowly enough, then SGLD converges to the true posterior in 2-Wasserstein distance as the network size and iteration number become large regardless of the length of the inner Markov chain performed at each iteration. Our study provides a scalable algorithm for analyzing large-scale social networks with possibly high-dimensional ERGMs.
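A minimal sketch of the kind of update described, assuming the standard ERGM identity that the log-likelihood gradient equals the observed sufficient statistics minus their expectation under the current parameter: the expectation is replaced by an average over a short inner Markov chain, supplied here by a hypothetical simulate_suff_stats function, and a zero-mean Gaussian prior provides the log-prior gradient. The step size, prior precision, and stand-in simulator are illustrative assumptions.

```python
import numpy as np

def sgld_step(theta, s_obs, simulate_suff_stats, step_size, rng,
              prior_prec=0.01, n_inner=10):
    """One stochastic gradient Langevin dynamics update for an ERGM.

    The ERGM log-likelihood gradient is s(y_obs) - E_theta[s(Y)]; the expectation
    is replaced by the mean sufficient statistics from a short inner Markov chain
    (the hypothetical `simulate_suff_stats`), and a zero-mean Gaussian prior with
    precision `prior_prec` supplies the log-prior gradient."""
    s_sim = simulate_suff_stats(theta, n_inner)        # (n_inner, p) array
    grad_loglik = s_obs - s_sim.mean(axis=0)           # noisy gradient estimate
    grad_logprior = -prior_prec * theta
    noise = rng.normal(scale=np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * (grad_loglik + grad_logprior) + noise

# toy usage with a stand-in simulator; a real one would run Gibbs/Metropolis on graphs
rng = np.random.default_rng(2)
fake_sim = lambda theta, n: rng.normal(loc=theta, size=(n, len(theta)))
theta = np.zeros(2)
for _ in range(100):
    theta = sgld_step(theta, s_obs=np.array([5.0, 1.2]),
                      simulate_suff_stats=fake_sim, step_size=1e-2, rng=rng)
print(theta)
```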
Shrinkage with shrunken shoulders: Gibbs sampling shrinkage model posteriors with guaranteed convergence rates
Use of continuous shrinkage priors - with a "spike" near zero and heavy tails towards infinity - is an increasingly popular approach to induce sparsity in parameter estimates. When the parameters are only weakly identified by the likelihood, however, the posterior may end up with tails as heavy as the prior, jeopardizing robustness of inference. A natural solution is to "shrink the shoulders" of a shrinkage prior by lightening up its tails beyond a reasonable parameter range, yielding a regularized version of the prior. We develop a regularization approach which, unlike previous proposals, preserves computationally attractive structures of the original shrinkage priors. We study theoretical properties of the Gibbs sampler on resulting posterior distributions, with emphasis on convergence rates of the Pólya-Gamma Gibbs sampler for sparse logistic regression. Our analysis shows that the proposed regularization leads to geometric ergodicity under a broad range of global-local shrinkage priors. Essentially, the only requirement is a mild boundedness condition on the prior for the local scale parameter. If the local-scale prior satisfies an additional condition near the origin, as in the case of Bayesian bridge priors, we show the sampler to be uniformly ergodic.
Reproducible Model Selection Using Bagged Posteriors
Bayesian model selection is premised on the assumption that the data are generated from one of the postulated models. However, in many applications, all of these models are incorrect (that is, there is misspecification). When the models are misspecified, two or more models can provide a nearly equally good fit to the data, in which case Bayesian model selection can be highly unstable, potentially leading to self-contradictory findings. To remedy this instability, we propose to use bagging on the posterior distribution ("BayesBag") - that is, to average the posterior model probabilities over many bootstrapped datasets. We provide theoretical results characterizing the asymptotic behavior of the posterior and the bagged posterior in the (misspecified) model selection setting. We empirically assess the BayesBag approach on synthetic and real-world data in (i) feature selection for linear regression and (ii) phylogenetic tree reconstruction. Our theory and experiments show that, when all models are misspecified, BayesBag (a) provides greater reproducibility and (b) places posterior mass on optimal models more reliably, compared to the usual Bayesian posterior; on the other hand, under correct specification, BayesBag is slightly more conservative than the usual posterior, in the sense that BayesBag posterior probabilities tend to be slightly farther from the extremes of zero and one. Overall, our results demonstrate that BayesBag provides an easy-to-use and widely applicable approach that improves upon Bayesian model selection by making it more stable and reproducible.
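A minimal sketch of the bagging step, assuming a hypothetical posterior_model_probs(data) that returns posterior model probabilities for a single dataset (for instance via analytic marginal likelihoods): BayesBag averages those probabilities over bootstrap resamples of the observations. The stand-in scoring function in the toy usage is illustrative, not a real posterior computation.

```python
import numpy as np

def bayesbag_model_probs(data, posterior_model_probs, n_boot=100, seed=0):
    """Average posterior model probabilities over bootstrap resamples of the data.

    `data` is an array of exchangeable observations (first axis indexes rows) and
    `posterior_model_probs` (hypothetical) maps a dataset to a vector of posterior
    model probabilities; the bagged probabilities are their mean over resamples."""
    rng = np.random.default_rng(seed)
    n = len(data)
    probs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        probs.append(posterior_model_probs(data[idx]))
    return np.mean(probs, axis=0)

# toy usage with a stand-in that weights two candidate means by squared error
def toy_probs(d):
    err = np.array([np.mean((d - 0.0) ** 2), np.mean((d - 0.5) ** 2)])
    w = np.exp(-0.5 * len(d) * err)
    return w / w.sum()

data = np.random.default_rng(8).normal(loc=0.4, scale=1.0, size=50)
print(bayesbag_model_probs(data, toy_probs, n_boot=200))
```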
Generalized Geographically Weighted Regression Model within a Modularized Bayesian Framework
Geographically weighted regression (GWR) models handle geographical dependence through a spatially varying coefficient model and have been widely used in applied science, but their general Bayesian extension is unclear because it involves a weighted log-likelihood which does not define a probability distribution on the data. We present a Bayesian GWR model and show that its essence is dealing with partial misspecification of the model. Current modularized Bayesian inference models accommodate partial misspecification from a single component of the model. We extend these models to handle partial misspecification in more than one component of the model, as required for our Bayesian GWR model. Information from the various spatial locations is manipulated via a geographically weighted kernel and the optimal manipulation is chosen according to a Kullback-Leibler (KL) divergence. We justify the model via an information risk minimization approach and show the consistency of the proposed estimator in terms of a geographically weighted KL divergence.
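For concreteness, the sketch below evaluates a geographically weighted log-likelihood at a focal location using a Gaussian distance kernel and a Gaussian regression likelihood; both choices, and the bandwidth, are illustrative assumptions rather than the paper's specification. The weighted sum is exactly the object that fails to be the log of a joint density for the data, which is the difficulty the modularized treatment addresses.

```python
import numpy as np

def gw_kernel_weights(coords, focal, bandwidth):
    """Gaussian geographical kernel weights for a focal location."""
    d = np.linalg.norm(coords - focal, axis=1)
    return np.exp(-0.5 * (d / bandwidth) ** 2)

def gw_loglik(beta, sigma, y, X, weights):
    """Geographically weighted Gaussian log-likelihood:
    sum_i w_i * log p(y_i | x_i, beta, sigma). Note the weighted sum is not
    itself the log of a probability density for the data."""
    resid = y - X @ beta
    logpdf = -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * (resid / sigma) ** 2
    return float(weights @ logpdf)
```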
Perfect Sampling of the Posterior in the Hierarchical Pitman-Yor Process
The predictive probabilities of the hierarchical Pitman-Yor process are approximated through Monte Carlo algorithms that exploit the Chinese Restaurant Franchise (CRF) representation. However, in order to simulate the posterior distribution of the hierarchical Pitman-Yor process, a set of auxiliary variables representing the arrangement of customers at tables of the CRF must be sampled through Markov chain Monte Carlo. This paper develops a perfect sampler for these latent variables employing ideas from the Propp-Wilson algorithm and evaluates its average running time through extensive simulations. The simulations reveal a significant dependence of running time on the parameters of the model, which exhibits sharp transitions. The algorithm is compared to simpler Gibbs sampling procedures, as well as a procedure for unbiased Monte Carlo estimation proposed by Glynn and Rhee. We illustrate its use with an example in microbial genomics studies.
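The sketch below is a generic monotone coupling-from-the-past routine in the spirit of Propp-Wilson, demonstrated on a toy reflecting random walk; it is not the paper's sampler for the Chinese Restaurant Franchise latent variables, and the update rule and state space are illustrative assumptions.

```python
import numpy as np

def monotone_cftp(update, s_min, s_max, rng, max_doublings=20):
    """Propp-Wilson coupling from the past for a monotone update rule.

    `update(x, u)` advances state x one step using random number u and is monotone
    in x for each fixed u. Chains started at the minimal and maximal states are run
    from time -T to 0 with shared randomness; if they coalesce, the common value at
    time 0 is an exact draw from the stationary distribution."""
    us = []                                   # random numbers for times -1, -2, ...
    T = 1
    for _ in range(max_doublings):
        while len(us) < T:
            us.append(rng.uniform())          # fix randomness once, reuse when T doubles
        lo, hi = s_min, s_max
        for t in range(T - 1, -1, -1):        # apply times -T, ..., -1 in order
            lo = update(lo, us[t])
            hi = update(hi, us[t])
        if lo == hi:
            return lo
        T *= 2
    raise RuntimeError("no coalescence; increase max_doublings")

# toy target: reflecting +/-1 walk on {0, ..., 10} with up-probability 0.45
K, p_up = 10, 0.45
step = lambda x, u: min(x + 1, K) if u < p_up else max(x - 1, 0)
rng = np.random.default_rng(3)
print([monotone_cftp(step, 0, K, rng) for _ in range(5)])
```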
Scalable Approximate Bayesian Computation for Growing Network Models via Extrapolated and Sampled Summaries
Approximate Bayesian computation (ABC) is a simulation-based likelihood-free method applicable to both model selection and parameter estimation. ABC parameter estimation requires the ability to forward simulate datasets from a candidate model, but because the sizes of the observed and simulated datasets usually need to match, this can be computationally expensive. Additionally, since ABC inference is based on comparisons of summary statistics computed on the observed and simulated data, using computationally expensive summary statistics can lead to further losses in efficiency. ABC has recently been applied to the family of mechanistic network models, an area that has traditionally lacked tools for inference and model choice. Mechanistic models of network growth repeatedly add nodes to a network until it reaches the size of the observed network, which may be of the order of millions of nodes. With ABC, this process can quickly become computationally prohibitive due to the resource intensive nature of network simulations and evaluation of summary statistics. We propose two methodological developments to enable the use of ABC for inference in models for large growing networks. First, to save time needed for forward simulating model realizations, we propose a procedure to extrapolate (via both least squares and Gaussian processes) summary statistics from small to large networks. Second, to reduce computation time for evaluating summary statistics, we use sample-based rather than census-based summary statistics. We show that the ABC posterior obtained through this approach, which adds two additional layers of approximation to the standard ABC, is similar to a classic ABC posterior. Although we deal with growing network models, both extrapolated summaries and sampled summaries are expected to be relevant in other ABC settings where the data are generated incrementally.
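A minimal sketch combining plain rejection ABC with a least-squares extrapolation of a single summary statistic from small simulated networks to the observed network's size; simulate_small, prior_draw, the polynomial-in-log-size extrapolation, and the tolerance are all illustrative assumptions rather than the authors' implementation (which also uses Gaussian-process extrapolation and sampled summaries).

```python
import numpy as np

def extrapolate_summary(sizes, summaries, target_size, degree=1):
    """Least-squares extrapolation of a summary statistic to a large network size,
    fitting a low-degree polynomial in log network size (an illustrative choice)."""
    coefs = np.polyfit(np.log(sizes), summaries, deg=degree)
    return np.polyval(coefs, np.log(target_size))

def rejection_abc(obs_summary, simulate_small, prior_draw, target_size,
                  sizes=(200, 400, 800), n_sims=1000, tol=0.2, seed=4):
    """Rejection ABC that forward-simulates only small networks per parameter draw
    and compares an extrapolated summary with the observed one.
    `simulate_small(theta, size)` and `prior_draw(rng)` are hypothetical user code."""
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_sims):
        theta = prior_draw(rng)
        sims = [simulate_small(theta, n) for n in sizes]
        s_hat = extrapolate_summary(np.array(sizes), np.array(sims), target_size)
        if abs(s_hat - obs_summary) < tol:
            accepted.append(theta)
    return np.array(accepted)

# toy usage: a summary that grows linearly in log size, with slope equal to theta
toy_sim = lambda theta, n: theta * np.log(n)
toy_prior = lambda rng: rng.uniform(0, 1)
post = rejection_abc(obs_summary=0.5 * np.log(1_000_000), simulate_small=toy_sim,
                     prior_draw=toy_prior, target_size=1_000_000)
print(f"accepted {len(post)} draws")
```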
Combining chains of Bayesian models with Markov melding
A challenge for practitioners of Bayesian inference is specifying a model that incorporates multiple relevant, heterogeneous data sets. It may be easier to instead specify distinct submodels for each source of data, then join the submodels together. We consider chains of submodels, where submodels directly relate to their neighbours via common quantities which may be parameters or deterministic functions thereof. We propose chained Markov melding, an extension of Markov melding, a generic method to combine chains of submodels into a joint model. One challenge we address is appropriately capturing the prior dependence between common quantities within a submodel, whilst also reconciling differences in priors for the same common quantity between two adjacent submodels. Estimating the posterior of the resulting overall joint model is also challenging, so we describe a sampler that uses the chain structure to incorporate information contained in the submodels in multiple stages, possibly in parallel. We demonstrate our methodology using two examples. The first example considers an ecological integrated population model, where multiple data sets are required to accurately estimate population immigration and reproduction rates. We also consider a joint longitudinal and time-to-event model with uncertain, submodel-derived event times. Chained Markov melding is a conceptually appealing approach to integrating submodels in these settings.
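For reference, the display below states the original Markov melding joint for a single common quantity, as recalled from Goudie et al. (2019); the chained extension described above generalizes this to common quantities linking adjacent submodels. Treat the exact form as a hedged reconstruction rather than a quotation.

```latex
% Markov melding of M submodels p_m(phi, psi_m, Y_m) sharing a common quantity phi;
% p_pool is a pooled prior, e.g. a logarithmic or linear pool of the marginals p_m(phi).
p_{\text{meld}}(\phi, \psi_1, \ldots, \psi_M, Y_1, \ldots, Y_M)
  \;\propto\; p_{\text{pool}}(\phi) \prod_{m=1}^{M} \frac{p_m(\phi, \psi_m, Y_m)}{p_m(\phi)}
```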
Robust Adaptive Incorporation of Historical Control Data in a Randomized Trial of External Cooling to Treat Septic Shock
This paper proposes a randomized controlled clinical trial design to evaluate external cooling as a means to control fever and thereby reduce mortality in patients with septic shock. The trial will include concurrent external cooling and control arms while adaptively incorporating historical control arm data. Bayesian group sequential monitoring will be done using a posterior comparative test based on the 60-day survival distribution in each concurrent arm. Posterior inference will follow from a Bayesian discrete time survival model that facilitates adaptive incorporation of the historical control data through an innovative regression framework with a multivariate spike-and-slab prior distribution on the historical bias parameters. For each interim test, the amount of information borrowed from the historical control data will be determined adaptively in a manner that reflects the degree of agreement between historical and concurrent control arm data. Guidance is provided for selecting Bayesian posterior probability group-sequential monitoring boundaries. Simulation results elucidating how the proposed method borrows strength from the historical control data are reported. In the absence of historical control arm bias, the proposed design controls the type I error rate and provides substantially larger power than reasonable comparators, whereas in the presence of bias of varying magnitude, type I error rate inflation is curbed.
Improving multilevel regression and poststratification with structured priors
A central theme in the field of survey statistics is estimating population-level quantities through data coming from potentially non-representative samples of the population. Multilevel regression and poststratification (MRP), a model-based approach, is gaining traction against the traditional weighted approach for survey estimates. MRP estimates are susceptible to bias if there is an underlying structure that the methodology does not capture. This work aims to provide a new framework for specifying structured prior distributions that lead to bias reduction in MRP estimates. We use simulation studies to explore the benefit of these prior distributions and demonstrate their efficacy on non-representative US survey data. We show that structured prior distributions offer absolute bias reduction and variance reduction for posterior MRP estimates in a large variety of data regimes.
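The poststratification step of MRP is a weighted average of cell-level model predictions by census cell counts, as in the sketch below; applying it to each posterior draw of the cell estimates propagates model uncertainty to the population estimate. The cell values in the toy usage are made up.

```python
import numpy as np

def poststratify(cell_estimates, cell_counts):
    """Population estimate from cell-level model predictions.

    `cell_estimates` holds a multilevel model's predicted outcome for each
    poststratification cell (e.g. age x sex x state) and `cell_counts` the
    corresponding population sizes from the census frame."""
    cell_estimates = np.asarray(cell_estimates, dtype=float)
    cell_counts = np.asarray(cell_counts, dtype=float)
    return float((cell_counts * cell_estimates).sum() / cell_counts.sum())

# toy usage: three poststratification cells
print(poststratify([0.42, 0.55, 0.61], [12000, 8000, 5000]))
```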
Centered Partition Processes: Informative Priors for Clustering (with Discussion)
There is a very rich literature proposing Bayesian approaches for clustering starting with a prior probability distribution on partitions. Most approaches assume exchangeability, leading to simple representations in terms of Exchangeable Partition Probability Functions (EPPF). Gibbs-type priors encompass a broad class of such cases, including Dirichlet and Pitman-Yor processes. Even though there have been some proposals to relax the exchangeability assumption, allowing covariate-dependence and partial exchangeability, limited consideration has been given to how to include concrete prior knowledge on the partition. For example, we are motivated by an epidemiological application, in which we wish to cluster birth defects into groups and we have prior knowledge of an initial clustering provided by experts. As a general approach for including such prior knowledge, we propose a Centered Partition (CP) process that modifies the EPPF to favor partitions close to an initial one. Some properties of the CP prior are described, a general algorithm for posterior computation is developed, and we illustrate the methodology through simulation examples and an application to the motivating epidemiology study of birth defects.
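A minimal sketch of one plausible reading of the centering idea: the base EPPF probability of a candidate partition is exponentially tilted towards the elicited partition, here using a Binder-type pairwise-disagreement distance. The tilting form, the choice of distance, and the numbers in the toy usage are illustrative assumptions; the paper's exact construction may differ.

```python
import numpy as np

def pairwise_disagreement(z1, z2):
    """Binder-type distance between two partitions given as label vectors: the
    number of item pairs clustered together in one partition but not the other."""
    z1, z2 = np.asarray(z1), np.asarray(z2)
    same1 = z1[:, None] == z1[None, :]
    same2 = z2[:, None] == z2[None, :]
    return int(np.sum(same1 != same2) // 2)      # each unordered pair counted once

def cp_unnormalized_weight(z, z0, base_prob, psi):
    """Unnormalized centered-partition weight: the base EPPF probability of
    partition z, exponentially tilted towards the elicited partition z0."""
    return base_prob * np.exp(-psi * pairwise_disagreement(z, z0))

# toy usage: expert partition z0 versus a candidate partition of 5 items
print(cp_unnormalized_weight([0, 0, 1, 1, 2], [0, 0, 0, 1, 1], base_prob=0.01, psi=0.5))
```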
On the Existence of Uniformly Most Powerful Bayesian Tests With Application to Non-Central Chi-Squared Tests
Uniformly most powerful Bayesian tests (UMPBTs) are an objective class of Bayesian hypothesis tests that can be considered the Bayesian counterpart of classical uniformly most powerful tests. Because the rejection regions of UMPBTs can be matched to the rejection regions of classical uniformly most powerful tests (UMPTs), UMPBTs provide a mechanism for calibrating Bayesian evidence thresholds, Bayes factors, classical significance levels and p-values. The purpose of this article is to expand the application of UMPBTs outside the class of exponential family models. Specifically, we introduce sufficient conditions for the existence of UMPBTs and propose a unified approach for their derivation. An important application of our methodology is the extension of UMPBTs to testing whether the non-centrality parameter of a chi-squared distribution is zero. The resulting tests have broad applicability, providing default alternative hypotheses to compute Bayes factors in, for example, Pearson's chi-squared test for goodness-of-fit, tests of independence in contingency tables, and likelihood ratio, score and Wald tests.
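For readers new to UMPBTs, the display below records the defining property as recalled from Johnson (2013): for a fixed evidence threshold gamma, the chosen alternative maximizes the probability that the Bayes factor against the null exceeds gamma, uniformly over the data-generating parameter. This is a hedged reconstruction, not a quotation from the article.

```latex
% H_1 defines a UMPBT(gamma) if, for every competing alternative H_2 and every theta,
\Pr_{\theta}\!\left( \mathrm{BF}_{10}(X) > \gamma \right)
  \;\ge\; \Pr_{\theta}\!\left( \mathrm{BF}_{20}(X) > \gamma \right)
```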
A Phase I-II Basket Trial Design to Optimize Dose-Schedule Regimes Based on Delayed Outcomes
This paper proposes a Bayesian adaptive basket trial design to optimize the dose-schedule regimes of an experimental agent within disease subtypes, called "baskets", for phase I-II clinical trials based on late-onset efficacy and toxicity. To characterize the association among the baskets and regimes, a Bayesian hierarchical model is assumed that includes a heterogeneity parameter, adaptively updated during the trial, that quantifies information shared across baskets. To account for late-onset outcomes when doing sequential decision making, unobserved outcomes are treated as missing values and imputed by exploiting early biomarker and low-grade toxicity information. Elicited joint utilities of efficacy and toxicity are used for decision making. Patients are randomized adaptively to regimes while accounting for baskets, with randomization probabilities proportional to the posterior probability of achieving maximum utility. Simulations are presented to assess the design's robustness and ability to identify optimal dose-schedule regimes within disease subtypes, and to compare it to a simplified design that treats the subtypes independently.
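The adaptive randomization rule described above reduces, given posterior draws of regime utilities, to the estimated posterior probability that each regime attains the maximum utility, as in the sketch below; restrictions to acceptable regimes, the basket structure, and the utility model itself are omitted, and the toy draws are made up.

```python
import numpy as np

def randomization_probs(utility_draws):
    """Adaptive randomization probabilities across regimes.

    `utility_draws` is an (n_draws, n_regimes) array of posterior draws of the
    mean utility of each dose-schedule regime; each regime receives the
    estimated posterior probability that it attains the maximum utility."""
    utility_draws = np.asarray(utility_draws)
    best = np.argmax(utility_draws, axis=1)
    counts = np.bincount(best, minlength=utility_draws.shape[1])
    return counts / counts.sum()

# toy usage: 4 posterior draws over 3 regimes
print(randomization_probs([[1.2, 0.8, 1.0], [0.9, 1.1, 1.0],
                           [1.3, 0.7, 0.9], [1.0, 1.2, 1.4]]))
```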
Flexible Bayesian Dynamic Modeling of Correlation and Covariance Matrices
Modeling correlation (and covariance) matrices can be challenging due to the positive-definiteness constraint and potential high-dimensionality. Our approach is to decompose the covariance matrix into the correlation and variance matrices and propose a novel Bayesian framework based on modeling the correlations as products of unit vectors. By specifying a wide range of distributions on a sphere (e.g. the squared-Dirichlet distribution), the proposed approach induces flexible prior distributions for covariance matrices (that go beyond the commonly used inverse-Wishart prior). For modeling real-life spatio-temporal processes with complex dependence structures, we extend our method to dynamic cases and introduce unit-vector Gaussian process priors in order to capture the evolution of correlation among components of a multivariate time series. To handle the intractability of the resulting posterior, we introduce the adaptive Δ-Spherical Hamiltonian Monte Carlo. We demonstrate the validity and flexibility of our proposed framework in a simulation study of periodic processes and an analysis of rats' local field potential activity in a complex sequence memory task.
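The core construction is easy to illustrate: writing each correlation as an inner product of unit vectors guarantees a positive semidefinite matrix with unit diagonal, and the covariance is recovered by scaling with standard deviations. The sketch below uses normalized Gaussian vectors purely as a placeholder for a distribution on the sphere; it is not the squared-Dirichlet prior or the dynamic extension in the paper.

```python
import numpy as np

def correlation_from_unit_vectors(U):
    """Correlation matrix R with R_ij = u_i . u_j for unit vectors u_i (rows of U).
    R = U U^T is positive semidefinite by construction, with ones on the diagonal."""
    return U @ U.T

rng = np.random.default_rng(5)
d = 4
U = rng.normal(size=(d, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)      # project rows onto the unit sphere
R = correlation_from_unit_vectors(U)
sigma = np.array([1.0, 2.0, 0.5, 1.5])             # standard deviations
Sigma = np.diag(sigma) @ R @ np.diag(sigma)        # covariance = scale x correlation x scale
print(np.round(R, 2))
print(np.linalg.eigvalsh(Sigma) >= -1e-10)          # positive semidefiniteness check
```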
Using Bayesian Latent Gaussian Graphical Models to Infer Symptom Associations in Verbal Autopsies
Learning dependence relationships among variables of mixed types provides insights in a variety of scientific settings and is a well-studied problem in statistics. Existing methods, however, typically rely on copious, high quality data to accurately learn associations. In this paper, we develop a method for scientific settings where learning dependence structure is essential, but data are sparse and have a high fraction of missing values. Specifically, our work is motivated by survey-based cause of death assessments known as verbal autopsies (VAs). We propose a Bayesian approach to characterize dependence relationships using a latent Gaussian graphical model that incorporates informative priors on the marginal distributions of the variables. We demonstrate such information can improve estimation of the dependence structure, especially in settings with little training data. We show that our method can be integrated into existing probabilistic cause-of-death assignment algorithms and improves model performance while recovering dependence patterns between symptoms that can inform efficient questionnaire design in future data collection.
A New Bayesian Single Index Model with or without Covariates Missing at Random
For many biomedical, environmental, and economic studies, the single index model provides a practical dimension reduction as well as a good physical interpretation of the unknown nonlinear relationship between the response and its multiple predictors. However, widespread use of existing Bayesian analyses for such models is lacking in practice due to some major impediments, including slow mixing of the Markov chain Monte Carlo (MCMC), the inability to deal with missing covariates, and a lack of theoretical justification of the rate of convergence of Bayesian estimates. We present a new Bayesian single index model with an associated MCMC algorithm that incorporates an efficient Metropolis-Hastings (MH) step for the conditional distribution of the index vector. Our method leads to a model with good interpretations and prediction, implementable Bayesian inference, fast convergence of the MCMC and a first-time extension to accommodate missing covariates. We also obtain, for the first time, the set of sufficient conditions for obtaining the optimal rate of posterior convergence of the overall regression function. We illustrate the practical advantages of our method and computational tool via reanalysis of an environmental study.
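As a reminder of the model class, the sketch below generates data from a single index model, y_i = g(x_i'beta) + eps_i with the index vector constrained to unit norm; the link function, dimensions, and noise level are placeholders, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 300, 4
beta = np.array([0.6, -0.5, 0.4, 0.2])
beta /= np.linalg.norm(beta)                 # index vector constrained to unit norm
X = rng.normal(size=(n, p))
g = lambda u: np.sin(u) + 0.5 * u            # unknown smooth link, here a placeholder
y = g(X @ beta) + rng.normal(scale=0.3, size=n)
print(X.shape, y.shape)
```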
Bayesian Network Marker Selection via the Thresholded Graph Laplacian Gaussian Prior
Selecting informative nodes over large-scale networks becomes increasingly important in many research areas. Most existing methods focus on the local network structure and incur heavy computational costs for the large-scale problem. In this work, we propose a novel prior model for Bayesian network marker selection in the generalized linear model (GLM) framework: the Thresholded Graph Laplacian Gaussian (TGLG) prior, which adopts the graph Laplacian matrix to characterize the conditional dependence between neighboring markers accounting for the global network structure. Under mild conditions, we show the proposed model enjoys the posterior consistency with a diverging number of edges and nodes in the network. We also develop a Metropolis-adjusted Langevin algorithm (MALA) for efficient posterior computation, which is scalable to large-scale networks. We illustrate the superiority of the proposed method compared with existing alternatives via extensive simulation studies and an analysis of the breast cancer gene expression dataset in The Cancer Genome Atlas (TCGA).
Hierarchical Normalized Completely Random Measures for Robust Graphical Modeling
Gaussian graphical models are useful tools for exploring network structures in multivariate normal data. In this paper we are interested in situations where data show departures from Gaussianity, therefore requiring alternative modeling distributions. The multivariate t-distribution, obtained by dividing each component of the data vector by a gamma random variable, is a straightforward generalization to accommodate deviations from normality such as heavy tails. Since different groups of variables may be contaminated to a different extent, Finegold and Drton (2014) introduced the Dirichlet t-distribution, where the divisors are clustered using a Dirichlet process. In this work, we consider a more general class of nonparametric distributions as the prior on the divisor terms, namely the class of normalized completely random measures (NormCRMs). To improve the effectiveness of the clustering, we propose modeling the dependence among the divisors through a nonparametric hierarchical structure, which allows for the sharing of parameters across the samples in the data set. This desirable feature enables us to cluster together different components of multivariate data in a parsimonious way. We demonstrate through simulations that this approach provides accurate graphical model inference, and apply it to a case study examining the dependence structure in radiomics data derived from The Cancer Imaging Archive.
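The generative mechanism described above is simple to sketch: draw a Gaussian vector and divide its components by gamma-distributed divisors that are shared within clusters. The snippet below uses a plain Chinese-restaurant scheme as a stand-in for the hierarchical normalized completely random measures of the paper, so it should be read as a caricature of the model, not its implementation.

```python
import numpy as np

def sample_clustered_divisor_t(n, mu, Sigma, nu, crp_alpha, rng):
    """Draw n vectors mu + Z / sqrt(tau), Z ~ N(0, Sigma), where each component's
    divisor tau_j is shared within clusters drawn from a Chinese-restaurant scheme
    (a simple stand-in for the nonparametric priors on divisors discussed above)."""
    d = len(mu)
    X = np.empty((n, d))
    for i in range(n):
        labels, taus = [], []
        for j in range(d):                    # cluster the d divisors of this vector
            probs = np.array([labels.count(k) for k in range(len(taus))] + [crp_alpha],
                             dtype=float)
            k = rng.choice(len(probs), p=probs / probs.sum())
            if k == len(taus):
                taus.append(rng.gamma(nu / 2.0, 2.0 / nu))   # tau ~ Gamma(nu/2, rate nu/2)
            labels.append(k)
        z = rng.multivariate_normal(np.zeros(d), Sigma)
        X[i] = mu + z / np.sqrt(np.array([taus[k] for k in labels]))
    return X

rng = np.random.default_rng(6)
X = sample_clustered_divisor_t(200, mu=np.zeros(3), Sigma=np.eye(3), nu=3.0,
                               crp_alpha=1.0, rng=rng)
print(X.shape)
```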
High-Dimensional Confounding Adjustment Using Continuous Spike and Slab Priors
In observational studies, estimation of a causal effect of a treatment on an outcome relies on proper adjustment for confounding. If the number of potential confounders (p) is larger than the number of observations (n), then direct control for all potential confounders is infeasible. Existing approaches for dimension reduction and penalization are generally aimed at predicting the outcome, and are less suited for estimation of causal effects. Under standard penalization approaches (e.g. Lasso), if a variable X_j is strongly associated with the treatment T but weakly with the outcome Y, its coefficient will be shrunk towards zero, thus leading to confounding bias. Under the assumption of a linear model for the outcome and sparsity, we propose continuous spike and slab priors on the regression coefficients beta_j corresponding to the potential confounders X_j. Specifically, we introduce a prior distribution that does not heavily shrink to zero the coefficients beta_j of the X_j's that are strongly associated with T but weakly associated with Y. We compare our proposed approach to several state-of-the-art methods proposed in the literature. Our proposed approach has the following features: 1) it reduces confounding bias in high-dimensional settings; 2) it shrinks towards zero coefficients of instrumental variables; and 3) it achieves good coverage even in small sample sizes. We apply our approach to the National Health and Nutrition Examination Survey (NHANES) data to estimate the causal effects of persistent pesticide exposure on triglyceride levels.
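A minimal sketch of a continuous spike-and-slab prior of the general flavor described: each coefficient gets a two-component normal mixture, and the mixture weight is allowed to grow with the covariate's association with the treatment, so that treatment-associated confounders are shrunk less. The spike and slab scales and the heuristic weight function are illustrative assumptions, not the authors' prior.

```python
import numpy as np
from scipy.stats import norm

def spike_slab_logpdf(beta_j, w_j, spike_sd=0.01, slab_sd=2.0):
    """Log density of a continuous spike-and-slab prior on a single coefficient:
    the mixture w_j * N(0, slab_sd^2) + (1 - w_j) * N(0, spike_sd^2)."""
    slab = w_j * norm.pdf(beta_j, scale=slab_sd)
    spike = (1.0 - w_j) * norm.pdf(beta_j, scale=spike_sd)
    return float(np.log(slab + spike))

def inclusion_weight(treatment_assoc, base_w=0.1, boost=0.8):
    """Heuristic slab weight that grows with the covariate's estimated association
    with the treatment, so potential confounders of the treatment are shrunk less."""
    return base_w + boost * (1.0 - np.exp(-abs(treatment_assoc)))

# coefficient of a covariate strongly associated with treatment, weakly with outcome
print(spike_slab_logpdf(0.3, inclusion_weight(2.5)))
```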