ENVIRONMETRICS

A dependent Bayesian Dirichlet process model for source apportionment of particle number size distribution
Baerenbold O, Meis M, Martínez-Hernández I, Euán C, Burr WS, Tremper A, Fuller G, Pirani M and Blangiardo M
The relationship between particle exposure and health risks has been well established in recent years. Particulate matter (PM) is made up of different components coming from several sources, which might have different levels of toxicity. Hence, identifying these sources is an important task in order to implement effective policies to improve air quality and population health. The problem of identifying sources of particulate pollution has already been studied in the literature. However, current methods require an a priori specification of the number of sources and do not include information on covariates in the source allocations. Here, we propose a novel Bayesian nonparametric approach to overcome these limitations. In particular, we model source contribution using a Dirichlet process as a prior for source profiles, which allows us to estimate the number of components that contribute to particle concentration rather than fixing this number beforehand. To better characterize the sources, we also include meteorological variables (wind speed and direction) as covariates within the allocation process via a flexible Gaussian kernel. We apply the model to apportion particle number size distribution measured near London Gatwick Airport (UK) in 2019. When analyzing these data, we are able to identify the most common PM sources, as well as new sources that have not been identified with the commonly used methods.
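The key idea of the Dirichlet process prior — that the number of active sources need not be fixed in advance — can be illustrated with the truncated stick-breaking construction of its mixture weights. This is a generic sketch of the construction, not code from the paper; the concentration parameter and truncation level are our own illustrative choices.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking construction of Dirichlet process weights:
    w_k = beta_k * prod_{j<k} (1 - beta_j), with beta_k ~ Beta(1, alpha)."""
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

rng = np.random.default_rng(0)
w = stick_breaking_weights(alpha=2.0, truncation=50, rng=rng)
# a small alpha concentrates mass on a few components, so only a handful
# of "sources" receive non-negligible weight a priori
n_effective = int(np.sum(w > 0.01))
```

In a source apportionment model, each weight would be paired with a draw from the base measure (a source profile), and the data determine how many components carry appreciable posterior weight.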
An illustration of model agnostic explainability methods applied to environmental data
Wikle CK, Datta A, Hari BV, Boone EL, Sahoo I, Kavila I, Castruccio S, Simmons SJ, Burr WS and Chang W
Historically, two primary criticisms statisticians have of machine learning and deep neural models are their lack of uncertainty quantification and their inability to do inference (i.e., to explain what inputs are important). Explainable AI has developed in the last few years as a sub-discipline of computer science and machine learning to mitigate these concerns (as well as concerns of fairness and transparency in deep modeling). In this article, our focus is on explaining which inputs are important in models for predicting environmental data. In particular, we focus on three general methods for explainability that are model agnostic and thus applicable across a breadth of models without internal explainability: "feature shuffling", "interpretable local surrogates", and "occlusion analysis". We describe particular implementations of each of these and illustrate their use with a variety of models, all applied to the problem of long-lead forecasting of monthly soil moisture in the North American corn belt given sea surface temperature anomalies in the Pacific Ocean.
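Of the three methods, "feature shuffling" (permutation importance) is the simplest to sketch: randomly permute one input column and measure how much the model's loss degrades. The snippet below is a minimal generic implementation with a toy black-box model of our own invention, not the authors' implementation.

```python
import numpy as np

def permutation_importance(model_fn, X, y, rng, n_repeats=10):
    """Model-agnostic 'feature shuffling': the importance of column j is
    the average increase in mean-squared error after permuting column j."""
    base_loss = np.mean((model_fn(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        losses = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the column-outcome link
            losses.append(np.mean((model_fn(Xp) - y) ** 2))
        importances[j] = np.mean(losses) - base_loss
    return importances

# toy 'black box' whose output depends on column 0 only
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0]
model_fn = lambda X: 2.0 * X[:, 0]
imp = permutation_importance(model_fn, X, y, rng)
```

Because `model_fn` ignores columns 1 and 2, their importances are exactly zero, while shuffling column 0 inflates the loss — the signature pattern the method exploits.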
Two years of COVID-19 pandemic: The Italian experience of Statgroup-19
Jona Lasinio G, Divino F, Lovison G, Mingione M, Alaimo Di Loro P, Farcomeni A and Maruotti A
The amount and poor quality of available data, together with the need for appropriate modeling of the main epidemic indicators, require specific skills. In this context, the statistician plays a key role in the process that leads to policy decisions, starting with monitoring changes and evaluating risks. The "what" and the "why" of these changes represent fundamental research questions to provide timely and effective tools to manage the evolution of the epidemic. Answers to such questions need appropriate statistical models and visualization tools. Here, we give an overview of the role played by Statgroup-19, an independent Italian research group born in March 2020. The group includes seven statisticians from different Italian universities, each with different backgrounds but with a shared interest in data analysis, statistical modeling, and biostatistics. Since the beginning of the COVID-19 pandemic the group has interacted with authorities and journalists to support policy decisions and inform the general public about the evolution of the epidemic. This collaboration has led to several scientific papers and increased visibility across various media, all made possible by the continuous interaction among the group members, who shared their unique expertise.
Continuous Model Averaging for Benchmark Dose Analysis: Averaging Over Distributional Forms
Wheeler MW, Cortinas J, Aerts M, Gift JS and Davis JA
When estimating a benchmark dose (BMD) from chemical toxicity experiments, model averaging is recommended by the National Institute for Occupational Safety and Health, World Health Organization, and European Food Safety Authority. Though numerous studies exist for model-averaged BMD estimation using dichotomous responses, fewer studies investigate it for BMD estimation using continuous responses. In this setting, model averaging a BMD poses additional problems as the assumed distribution is essential to many BMD definitions, and distributional uncertainty is underestimated when one error distribution is chosen a priori. As model averaging combines full models, there is no reason one cannot include multiple error distributions. Consequently, we define a continuous model averaging approach over distributional models and show that it is superior to single-distribution model averaging. To show the superiority of the approach, we apply the method to simulated and experimental response data.
Marginal inference for hierarchical generalized linear mixed models with patterned covariance matrices using the Laplace approximation
Ver Hoef JM, Blagg E, Dumelle M, Dixon PM, Zimmerman DL and Conn PB
We develop hierarchical models and methods in a fully parametric approach to generalized linear mixed models for any patterned covariance matrix. The Laplace approximation is used to marginally estimate covariance parameters by integrating over all fixed and latent random effects. The Laplace approximation relies on Newton-Raphson updates, which also leads to predictions for the latent random effects. We develop methodology for complete marginal inference, from estimating covariance parameters and fixed effects to making predictions for unobserved data. The marginal likelihood is developed for six distributions that are often used for binary, count, and positive continuous data, and our framework is easily extended to other distributions. We compare our methods to fully Bayesian methods, automatic differentiation, and integrated nested Laplace approximations (INLA) for bias, mean-squared (prediction) error, and interval coverage, and all methods yield very similar results. However, our methods are much faster than Bayesian methods, and more general than INLA. Examples with binary and proportional data, count data, and positive-continuous data are used to illustrate all six distributions with a variety of patterned covariance structures that include spatial models (both geostatistical and areal models), time series models, and mixtures with typical random intercepts based on grouping.
Association between air pollution and COVID-19 disease severity via Bayesian multinomial logistic regression with partially missing outcomes
Hoskovec L, Martenies S, Burket TL, Magzamen S and Wilson A
Recent ecological analyses suggest air pollution exposure may increase susceptibility to and severity of coronavirus disease 2019 (COVID-19). Individual-level studies are needed to clarify the relationship between air pollution exposure and COVID-19 outcomes. We conduct an individual-level analysis of long-term exposure to air pollution and weather on peak COVID-19 severity. We develop a Bayesian multinomial logistic regression model with a multiple imputation approach to impute partially missing health outcomes. Our approach is based on the stick-breaking representation of the multinomial distribution, which offers computational advantages, but presents challenges in interpreting regression coefficients. We propose a novel inferential approach to address these challenges. In a simulation study, we demonstrate our method's ability to impute missing outcome data and improve estimation of regression coefficients compared to a complete case analysis. In our analysis of 55,273 COVID-19 cases in Denver, Colorado, increased annual exposure to fine particulate matter in the year prior to the pandemic was associated with increased risk of severe COVID-19 outcomes. We also found COVID-19 disease severity to be associated with interactions between exposures. Our individual-level analysis fills a gap in the literature and helps to elucidate the association between long-term exposure to air pollution and COVID-19 outcomes.
A spatiotemporal analysis of NO2 concentrations during the Italian 2020 COVID-19 lockdown
Fioravanti G, Cameletti M, Martino S, Cattani G and Pisoni E
When a new environmental policy or a specific intervention is taken in order to improve air quality, it is paramount to assess and quantify, in space and time, the effectiveness of the adopted strategy. The lockdown measures taken worldwide in 2020 to reduce the spread of the SARS-CoV-2 virus can be envisioned as a policy intervention with an indirect effect on air quality. In this paper we propose a statistical spatiotemporal model as a tool for intervention analysis, able to take into account the effect of weather and other confounding factors, as well as the spatial and temporal correlation existing in the data. In particular, we focus here on the 2019/2020 relative change in nitrogen dioxide (NO2) concentrations in the north of Italy, for the period of March and April during which the lockdown measure was in force. We found that during March and April 2020 most of the studied area is characterized by negative relative changes (median values around -25%), with the exception of the first week of March and the fourth week of April (median values around -5%). As these changes cannot be attributed to a weather effect, it is likely that they are a byproduct of the lockdown measures. There are two aspects of our research that are equally interesting. First, we provide a unique statistical perspective for calculating the relative change in NO2 by jointly modeling pollutant concentration time series. Second, as an output we provide a collection of weekly continuous maps describing the spatial pattern of the NO2 2019/2020 relative changes.
A hierarchical integrative group least absolute shrinkage and selection operator for analyzing environmental mixtures
Boss J, Rix A, Chen YH, Narisetty NN, Wu Z, Ferguson KK, McElrath TF, Meeker JD and Mukherjee B
Environmental health studies are increasingly measuring multiple pollutants to characterize the joint health effects attributable to exposure mixtures. However, the underlying dose-response relationship between toxicants and health outcomes of interest may be highly nonlinear, with possible nonlinear interaction effects. Existing penalized regression methods that account for exposure interactions either cannot accommodate nonlinear interactions while maintaining strong heredity or are computationally unstable in applications with limited sample size. In this paper, we propose a general shrinkage and selection framework to identify noteworthy nonlinear main and interaction effects among a set of exposures. We design hierarchical integrative group least absolute shrinkage and selection operator (HiGLASSO) to (a) impose strong heredity constraints on two-way interaction effects (hierarchical), (b) incorporate adaptive weights without necessitating initial coefficient estimates (integrative), and (c) induce sparsity for variable selection while respecting group structure (group LASSO). We prove sparsistency of the proposed method and apply HiGLASSO to an environmental toxicants dataset from the LIFECODES birth cohort, where the investigators are interested in understanding the joint effects of 21 urinary toxicant biomarkers on urinary 8-isoprostane, a measure of oxidative stress. An implementation of HiGLASSO is available in the higlasso R package, accessible through the Comprehensive R Archive Network.
Benchmark dose risk analysis with mixed-factor quantal data in environmental risk assessment
Sans-Fuentes MA and Piegorsch WW
Benchmark analysis is a general risk estimation strategy for identifying the benchmark dose (BMD) past which the risk of exhibiting an adverse environmental response exceeds a fixed, target value of benchmark response (BMR). Estimation of BMD and of its lower confidence limit (BMDL) is well understood for the case of an adverse response to a single stimulus. In many environmental settings, however, one or more additional, secondary, qualitative factor(s) may collude to affect the adverse outcome, such that the risk changes with differential levels of the secondary factor. This paper extends the single-dose BMD paradigm to a mixed-factor setting with a secondary qualitative factor possessing two levels. With focus on quantal-response data and using a generalized linear model with a complementary-log link function, we derive expressions for BMD and BMDL. We study the operating characteristics of six different multiplicity-adjusted approaches to calculate the BMDL, using Monte Carlo evaluations. We illustrate the calculations via an example data set from environmental carcinogenicity testing.
Fast Grid Search and Bootstrap-based Inference for Continuous Two-phase Polynomial Regression Models
Son H and Fong Y
Two-phase polynomial regression models (Robison, 1964; Fuller, 1969; Gallant and Fuller, 1973; Zhan et al., 1996) are widely used in ecology, public health, and other applied fields to model nonlinear relationships. These models are characterized by the presence of threshold parameters, across which the mean functions are allowed to change. That the threshold is a parameter of the model to be estimated from the data is an essential feature of two-phase models. It distinguishes them, and more generally, multi-phase models, from the spline models and has profound implications for both computation and inference for the models. Estimation of two-phase polynomial regression models is a non-convex, non-smooth optimization problem. Grid search provides high quality solutions to the estimation problem, but is very slow when done by brute force. Building upon our previous work on estimation of piecewise linear two-phase regression models, we develop fast grid search algorithms for two-phase polynomial regression models and demonstrate their performance. Furthermore, we develop bootstrap-based pointwise and simultaneous confidence bands for mean functions. Monte Carlo studies are conducted to demonstrate the computational and statistical properties of the proposed methods. Three real datasets are used to help illustrate the application of two-phase models, with special attention on model choice.
Effects of corona virus disease-19 control measures on air quality in North China
Zheng X, Guo B, He J and Chen SX
Corona virus disease-19 (COVID-19) has substantially reduced human activities and the associated anthropogenic emissions. This study quantifies the effects of COVID-19 control measures on six major air pollutants over 68 cities in North China by a Difference in Relative-Difference method that allows estimation of the COVID-19 effects while taking account of the general annual air quality trends, temporal and meteorological variations, and the spring festival effects. Significant COVID-19 effects on all six major air pollutants are found, with NO2 having the largest decline (-39.6%), followed by PM2.5 (-30.9%), O3 (-16.3%), PM10 (-14.3%), CO (-13.9%), and the least in SO2 (-10.0%), which shows the achievability of air quality improvement by a large reduction in anthropogenic emissions. The heterogeneity of effects among the six pollutants and different regions can be partly explained by coal consumption and industrial output data.
An extended and unified modeling framework for benchmark dose estimation for both continuous and binary data
Aerts M, Wheeler MW and Abrahantes JC
Protection and safety authorities recommend the use of model averaging to determine the benchmark dose approach as a scientifically more advanced method compared with the no-observed-adverse-effect-level approach for obtaining a reference point and deriving health-based guidance values. Model averaging however highly depends on the set of candidate dose-response models, and such a set should be rich enough to ensure that a well-fitting model is included. The currently applied set of candidate models for continuous endpoints is typically limited to two models, the exponential and Hill model, and differs completely from the richer set of candidate models currently used for binary endpoints. The objective of this article is to propose a general and wide framework of dose response models, which can be applied both to continuous and binary endpoints and covers the current models for both types of endpoints. In combination with the bootstrap, this framework offers a unified approach to benchmark dose estimation. The methodology is illustrated using two data sets, one with a continuous and another with a binary endpoint.
Bayesian Nonparametric Monotone Regression
Wilson A, Tryner J, L'Orange C and Volckens J
In many applications there is interest in estimating the relation between a predictor and an outcome when the relation is known to be monotone or otherwise constrained due to the physical processes involved. We consider one such application: inferring time-resolved aerosol concentration from a low-cost differential pressure sensor. The objective is to estimate a monotone function and make inference on the scaled first derivative of the function. We propose Bayesian nonparametric monotone regression, which uses a Bernstein polynomial basis to construct the regression function and puts a Dirichlet process prior on the regression coefficients. The base measure of the Dirichlet process is a finite mixture of a mass point at zero and a truncated normal. This construction imposes monotonicity while clustering the basis functions. Clustering the basis functions reduces the parameter space and allows the estimated regression function to be linear. With the proposed approach we can make closed-form inference on the derivative of the estimated function, including full quantification of uncertainty. In a simulation study the proposed method performs similarly to other monotone regression approaches when the true function is wavy but performs better when the true function is linear. We apply the method to estimate time-resolved aerosol concentration with a newly developed portable aerosol monitor. The R package bnmr is made available to implement the method.
Probabilistic predictive principal component analysis for spatially misaligned and high-dimensional air pollution data with missing observations
Vu PT, Larson TV and Szpiro AA
Accurate predictions of pollutant concentrations at new locations are often of interest in air pollution studies on fine particulate matter (PM2.5), in which data are usually not measured at all study locations. PM2.5 is also a mixture of many different chemical components. Principal component analysis (PCA) can be incorporated to obtain lower-dimensional representative scores of such multi-pollutant data. Spatial prediction can then be used to estimate these scores at new locations. Recently developed predictive PCA modifies the traditional PCA algorithm to obtain scores with spatial structures that can be well predicted at unmeasured locations. However, these approaches require complete data, whereas multi-pollutant data tends to have complex missing patterns in practice. We propose probabilistic versions of predictive PCA which allow for flexible model-based imputation that can account for spatial information and subsequently improve the overall predictive performance.
Multivariate Air Pollution Prediction Modeling with Partial Missingness
Boaz RM, Lawson AB and Pearce JL
Missing observations from air pollution monitoring networks have posed a longstanding problem for health investigators of air pollution. Growing interest in mixtures of air pollutants has further complicated this problem, as many new challenges have arisen that require development of novel methods. The objective of this study is to develop a methodology for multivariate prediction of air pollution. We focus specifically on tackling different forms of missing data, such as: spatial (sparse sites), outcome (pollutants not measured at some sites), and temporal (varieties of interrupted time series). To address these challenges, we develop a novel multivariate fusion framework, which leverages the observed inter-pollutant correlation structure to reduce error in the simultaneous prediction of multiple air pollutants. Our joint fusion model employs predictions from the Environmental Protection Agency's Community Multiscale Air Quality (CMAQ) model along with spatio-temporal error terms. We have implemented our models on both simulated data and a case study in South Carolina for 8 pollutants over a 28-day period in June 2006. We found that our model, which uses a multivariate correlated error in a Bayesian framework, showed promising predictive accuracy particularly for gaseous pollutants.
On spatial conditional extremes for ocean storm severity
Shooter R, Ross E, Tawn J and Jonathan P
We describe a model for the conditional dependence of a spatial process measured at one or more remote locations given extreme values of the process at a conditioning location, motivated by the conditional extremes methodology of Heffernan and Tawn. Compared to alternative descriptions in terms of max-stable spatial processes, the model is advantageous because it is conceptually straightforward and admits different forms of extremal dependence (including asymptotic dependence and asymptotic independence). We use the model within a Bayesian framework to estimate the extremal dependence of ocean storm severity (quantified using significant wave height) for locations on spatial transects with approximate east-west (E-W) and north-south (N-S) orientations in the northern North Sea (NNS) and central North Sea (CNS). For significant wave height on the standard Laplace marginal scale, the conditional extremes "linear slope" parameter decays approximately exponentially with distance for all transects. Furthermore, the decay of mean dependence with distance is found to be faster in CNS than NNS. The persistence of mean dependence is greatest for the E-W transect in NNS, potentially because this transect is approximately aligned with the direction of propagation of the most severe storms in the region.
Adaptive predictive principal components for modeling multivariate air pollution
Bose M, Larson T and Szpiro AA
Air pollution monitoring locations are typically spatially misaligned with locations of participants in a cohort study, so to analyze pollution-health associations, exposures must be predicted at subject locations. For a pollution measure like PM2.5 (fine particulate matter) comprised of multiple chemical components, the predictive principal component analysis (PCA) algorithm derives a low-dimensional representation of component profiles for use in health analyses. Geographic covariates and spatial splines help determine the principal component loadings of the pollution data to give improved prediction accuracy of the principal component scores. While predictive PCA can accommodate pollution data of arbitrary dimension, it is currently limited to a small number of pre-selected geographic covariates. We propose an adaptive predictive PCA algorithm, which automatically identifies a combination of covariates that is most informative in choosing the principal component directions in the pollutant space. We show that adaptive predictive PCA improves the accuracy of multi-pollutant exposure predictions at subject locations.
Linear regression with left-censored covariates and outcome using a pseudolikelihood approach
Jones MP
Environmental toxicology studies often involve sample values that fall below a laboratory procedure's limit of quantification. Such left-censored data give rise to several problems for regression analyses. First, both covariates and outcome may be left censored. Second, the transformed toxicant levels may not be normal but mixtures of normals because of differences in personal characteristics, e.g. exposure history and demographic factors. Third, the outcome and covariates may be linear functions of left-censored variates, such as averages and differences. Fourth, some toxicant levels may be functions of other toxicant levels, resulting in a recursive system. In this paper, marginal and pseudolikelihood-based methods are proposed for estimation of the means and covariance matrix of variates found in these four settings. Next, linear regression methods are developed allowing outcomes and covariates to be linear combinations of left-censored measures. This is extended to a recursive system of modeling equations. Bootstrap standard errors and confidence intervals are used. Simulation studies demonstrate that the proposed methods are accurate for a wide range of study designs and left-censoring probabilities. The proposed methods are illustrated through the analysis of an ongoing community-based study of polychlorinated biphenyls, which motivated the proposed methodology.
Multivariate left-censored Bayesian model for predicting exposure using multiple chemical predictors
Groth C, Banerjee S, Ramachandran G, Stenzel MR and Stewart PA
Environmental health exposures to airborne chemicals often originate from chemical mixtures. Environmental health professionals may be interested in assessing exposure to one or more of the chemicals in these mixtures, but often exposure measurement data are not available, either because measurements were not collected/assessed for all exposure scenarios of interest or because some of the measurements were below the analytical methods' limits of detection (i.e. censored). In some cases, based on chemical laws, two or more components may have linear relationships with one another, whether in a single or in multiple mixtures. Although bivariate analyses can be used if the correlation is high, often correlations are low. To serve this need, this paper develops a multivariate framework for assessing exposure using relationships of the chemicals present in these mixtures. This framework accounts for censored measurements in all chemicals, allowing us to develop unbiased exposure estimates. We assessed our model's performance against simpler models at a variety of censoring levels and assessed our model's 95% coverage. We applied our model to assess vapor exposure from measurements of three chemicals in crude oil taken during the oil spill response and clean-up.
Modeling the health effects of time-varying complex environmental mixtures: Mean field variational Bayes for lagged kernel machine regression
Liu SH, Bobb JF, Claus Henn B, Schnaas L, Tellez-Rojo MM, Gennings C, Arora M, Wright RO, Coull BA and Wand MP
There is substantial interest in assessing how exposure to environmental mixtures, such as chemical mixtures, affects child health. Researchers are also interested in identifying critical time windows of susceptibility to these complex mixtures. A recently developed method, called lagged kernel machine regression (LKMR), simultaneously accounts for these research questions by estimating effects of time-varying mixture exposures, and identifying their critical exposure windows. However, LKMR inference using Markov chain Monte Carlo methods (MCMC-LKMR) is computationally burdensome and time intensive for large datasets, limiting its applicability. Therefore, we develop a mean field variational Bayesian inference procedure for lagged kernel machine regression (MFVB-LKMR). The procedure achieves computational efficiency and reasonable accuracy as compared with the corresponding MCMC estimation method. Updating parameters using MFVB may only take minutes, while the equivalent MCMC method may take many hours or several days. We apply MFVB-LKMR to PROGRESS, a prospective cohort study in Mexico. Results from a subset of PROGRESS using MFVB-LKMR provide evidence of significant positive association between second trimester cobalt levels and z-scored birthweight. This positive association is heightened by cesium exposure. MFVB-LKMR is a promising approach for computationally efficient analysis of environmental health datasets, to identify critical windows of exposure to complex mixtures.
Bayesian inference in time-varying additive hazards models with applications to disease mapping
Chernoukhov A, Hussein A, Nkurunziza S and Bandyopadhyay D
Environmental health and disease mapping studies are often concerned with the evaluation of the combined effect of various socio-demographic and behavioral factors, and environmental exposures on time-to-events of interest, such as death of individuals, organisms or plants. In such studies, estimation of the hazard function is often of interest. In addition to known explanatory variables, the hazard function may be subject to spatial/geographical variations, such that proximally located regions may experience more similar hazards than distantly located regions. A popular approach for handling this type of spatially-correlated time-to-event data is the Cox Proportional Hazards (PH) regression model with spatial frailties. However, the PH assumption poses a major practical challenge, as it entails that the effects of the various explanatory variables remain constant over time. This assumption is often unrealistic, for instance, in studies with long follow-ups where the effects of some exposures on the hazard may vary drastically over time. Our goal in this paper is to offer a flexible semiparametric additive hazards (AH) model with spatial frailties. Our proposed model allows both the frailties as well as the regression coefficients to be time-varying, thus relaxing the proportionality assumption. Our estimation framework is Bayesian, powered by carefully tailored posterior sampling strategies via Markov chain Monte Carlo techniques. We apply the model to a dataset on prostate cancer survival from the US state of Louisiana to illustrate its advantages.