JOURNAL OF AGRICULTURAL BIOLOGICAL AND ENVIRONMENTAL STATISTICS

A Bayesian Partial Membership Model for Multiple Exposures with Uncertain Group Memberships
Zavez AE, McSorley EM, Yeates AJ and Thurston SW
We present a Bayesian partial membership model that estimates the associations between an outcome, a small number of latent variables, and multiple observed exposures, where the number of latent variables is specified in advance. We assign one observed exposure as the sentinel marker for each latent variable. The model allows non-sentinel exposures to have complete membership in one latent group, or partial membership across two or more latent groups. MCMC sampling is used to determine latent group partial memberships for the non-sentinel exposures and estimate all model parameters. We compare the performance of our model to competing approaches in a simulation study and apply our model to inflammatory marker data measured in a large mother-child cohort of the Seychelles Child Development Study (SCDS). In simulations, our model estimated model parameters with little bias, adequate coverage, and tighter credible intervals compared to competing approaches. Under our partial membership model with two latent groups, SCDS inflammatory marker classifications generally aligned with the scientific literature. Incorporating additional SCDS inflammatory markers and more latent groups produced similar groupings of markers that also aligned with the literature. Associations between covariates and birth weight were similar across latent variable models and were consistent with earlier work in this SCDS cohort.
An Application of Spatio-temporal Modeling to Finite Population Abundance Prediction
Higham M, Dumelle M, Hammond C, Ver Hoef J and Wells J
Spatio-temporal models can be used to analyze data collected at various spatial locations throughout multiple time points. However, even with a finite number of spatial locations, there may be a lack of resources to collect data from every spatial location at every time point. We develop a spatio-temporal finite-population block kriging (ST-FPBK) method to predict a quantity of interest, such as a mean or total, across a finite number of spatial locations. This ST-FPBK predictor incorporates an appropriate variance reduction for sampling from a finite population. Through an application to moose surveys in the east-central region of Alaska, we show that the predictor has a substantially smaller standard error compared to a predictor from the purely spatial model that is currently used to analyze moose surveys in the region. We also show how the model can be used to forecast a prediction for abundance in a time point for which spatial locations have not yet been surveyed. A separate simulation study shows that the spatio-temporal predictor is unbiased and that prediction intervals from the ST-FPBK predictor attain appropriate coverage. For ecological monitoring surveys completed with some regularity through time, use of ST-FPBK could improve precision. We also give an R package that ecologists and resource managers could use to incorporate data from past surveys in predicting a quantity from a current survey.
A Causal Mediation Model for Longitudinal Mediators and Survival Outcomes with an Application to Animal Behavior
Zeng S, Lange EC, Archie EA, Campos FA, Alberts SC and Li F
In animal behavior studies, a common goal is to investigate the causal pathways between an exposure and outcome, and a mediator that lies in between. Causal mediation analysis provides a principled approach for such studies. Although many applications involve longitudinal data, the existing causal mediation models are not directly applicable to settings where the mediators are measured on irregular time grids. In this paper, we propose a causal mediation model that accommodates longitudinal mediators on arbitrary time grids and survival outcomes simultaneously. We take a functional data analysis perspective and view longitudinal mediators as realizations of underlying smooth stochastic processes. We define causal estimands of direct and indirect effects accordingly and provide corresponding identification assumptions. We employ a functional principal component analysis approach to estimate the mediator process and propose a Cox hazard model for the survival outcome that flexibly adjusts for the mediator process. We then derive a g-computation formula to express the causal estimands using the model coefficients. The proposed method is applied to a longitudinal data set from the Amboseli Baboon Research Project to investigate the causal relationships between early adversity, adult physiological stress responses, and survival among wild female baboons. We find that adversity experienced in early life has a significant direct effect on females' life expectancy and survival probability, but find little evidence that these effects were mediated by markers of the stress response in adulthood. We further develop a sensitivity analysis method to assess the impact of potential violations of the key assumption of sequential ignorability. Supplementary materials accompanying this paper appear online.
Estimating a Causal Exposure Response Function with a Continuous Error-Prone Exposure: A Study of Fine Particulate Matter and All-Cause Mortality
Josey KP, deSouza P, Wu X, Braun D and Nethery R
Numerous studies have examined the associations between long-term exposure to fine particulate matter (PM2.5) and adverse health outcomes. Recently, many of these studies have begun to employ high-resolution predicted PM2.5 concentrations, which are subject to measurement error. Previous approaches for exposure measurement error correction have either been applied in non-causal settings or have only considered a categorical exposure. Moreover, most procedures have failed to account for uncertainty induced by error correction when fitting an exposure-response function (ERF). To remedy these deficiencies, we develop a multiple imputation framework that combines regression calibration and Bayesian techniques to estimate a causal ERF. We demonstrate how the output of the measurement error correction steps can be seamlessly integrated into a Bayesian additive regression trees (BART) estimator of the causal ERF. We also demonstrate how locally weighted smoothing of the posterior samples from BART can be used to create a more accurate ERF estimate. Our proposed approach also properly propagates the exposure measurement error uncertainty to yield accurate standard error estimates. We assess the robustness of our proposed approach in an extensive simulation study. We then apply our methodology to estimate the effects of PM2.5 on all-cause mortality among Medicare enrollees in New England from 2000 to 2012.
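In its simplest form, regression calibration replaces the error-prone exposure with its conditional expectation given the measurement, which under a classical normal error model is a linear shrinkage toward the exposure mean. A toy sketch of this single step (illustrative only; the paper embeds it in a multiple imputation and BART pipeline, and the variances here are assumed known):

```python
import random

def ols_slope(x, y):
    # simple least-squares slope of y on x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def calibrate(w, mu_x, sigma2_x, sigma2_u):
    # E[X | W] under the classical error model W = X + U:
    # linear shrinkage of W toward mu_x by the reliability factor
    lam = sigma2_x / (sigma2_x + sigma2_u)
    return [mu_x + lam * (wi - mu_x) for wi in w]

random.seed(1)
n, beta = 20000, 2.0
x = [random.gauss(0, 1) for _ in range(n)]          # true exposure
w = [xi + random.gauss(0, 1) for xi in x]           # error-prone measurement
y = [beta * xi + random.gauss(0, 0.5) for xi in x]  # outcome

naive = ols_slope(w, y)                              # attenuated toward 0
corrected = ols_slope(calibrate(w, 0.0, 1.0, 1.0), y)  # close to beta
```

With equal exposure and error variances the reliability factor is 0.5, so the naive slope is attenuated to about half the true value while the calibrated regression recovers it.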
Population Size Estimation using Zero-truncated Poisson Regression with Measurement Error
Hwang WH, Stoklosa J and Wang CY
Population size estimation is an important research field in the biological sciences. In practice, covariates are often measured upon capture on individuals sampled from the population. However, some biological measurements, such as body weight, may vary over time within a subject's capture history. This can be treated as a population size estimation problem in the presence of covariate measurement error. We show that if the unobserved true covariate and the measurement error are both normally distributed, then a naïve estimator that does not take measurement error into account will underestimate the population size. We then develop new methods to correct for the effect of measurement errors. In particular, we present a conditional score and a nonparametric corrected score approach that are both consistent for population size estimation. Importantly, the proposed approaches do not require distributional assumptions on the true covariates; furthermore, the latter does not require normality assumptions on the measurement errors. This is highly relevant in biological applications, as the distribution of covariates is often non-normal or unknown. We investigate the finite sample performance of the new estimators via extensive simulation studies. The methods are applied to real data from a capture-recapture study.
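To fix ideas, in the covariate-free case a population size estimate follows from the zero-truncated Poisson: estimate the capture rate from the truncated sample mean, then inflate the observed count by the estimated probability of being captured at least once. A naïve sketch (no covariates or measurement error, unlike the paper's conditional-score and corrected-score estimators):

```python
import math

def ztp_lambda(xbar, tol=1e-10):
    """MLE of the Poisson rate from zero-truncated counts with sample mean
    xbar (> 1), solving lambda / (1 - exp(-lambda)) = xbar by fixed point."""
    lam = xbar
    for _ in range(10000):
        new = xbar * (1.0 - math.exp(-lam))
        if abs(new - lam) < tol:
            return new
        lam = new
    return lam

def population_size(counts):
    """Inflate the number of observed individuals by the estimated
    probability of being captured at least once."""
    n = len(counts)
    lam = ztp_lambda(sum(counts) / n)
    return n / (1.0 - math.exp(-lam))
```

For example, 100 observed capture histories with mean count 1.75 yield an estimated population size above 100, the excess accounting for never-captured individuals.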
Techniques to Improve Ecological Interpretability of Black-Box Machine Learning Models
Welchowski T, Maloney KO, Mitchell R and Schmidt M
Statistical modeling of ecological data is often faced with a large number of variables as well as possible nonlinear relationships and higher-order interaction effects. Gradient boosted trees (GBT) have been successful in addressing these issues and have shown good predictive performance in modeling nonlinear relationships, in particular in classification settings with a categorical response variable. They also tend to be robust against outliers. However, their black-box nature makes it difficult to interpret these models. We introduce several recently developed statistical tools to the environmental research community in order to advance interpretation of these black-box models. To analyze the properties of the tools, we applied gradient boosted trees to investigate the biological health of streams within the contiguous U.S., as measured by a benthic macroinvertebrate biotic index. Based on these data and a simulation study, we demonstrate the advantages and limitations of partial dependence plots (PDP), individual conditional expectation (ICE) curves, and accumulated local effects (ALE) plots in their ability to identify covariate-response relationships. Additionally, interaction effects were quantified according to interaction strength (IAS) and Friedman's H statistic. Interpretable machine learning techniques are useful tools to open the black box of gradient boosted trees in the environmental sciences. This finding is supported by our case study on the effect of impervious surface on benthic condition, which agrees with previous results in the literature. Overall, the most important variables were ecoregion, bed stability, watershed area, riparian vegetation, and catchment slope. These variables were also present in most identified interaction effects. In conclusion, graphical tools (PDP, ICE, ALE) enable visualization and easier interpretation of GBT but should be supported by analytical statistical measures. Future methodological research is needed to investigate the properties of interaction tests.
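A partial dependence plot is conceptually simple: for each value of the feature of interest, average the black-box prediction over the empirical distribution of the remaining covariates. A minimal sketch for a two-feature model (illustrative only, not tied to any particular GBT library):

```python
def partial_dependence(f, other_feature_values, grid):
    """Partial dependence of a black-box model f(x, z) on x: for each grid
    value of x, average predictions over the observed values of z."""
    n = len(other_feature_values)
    return [sum(f(x, z) for z in other_feature_values) / n for x in grid]

# for an additive model f(x, z) = x + z, the PDP in x is linear with slope 1
pdp = partial_dependence(lambda x, z: x + z, [-1.0, 0.0, 1.0], [0.0, 1.0, 2.0])
```

ICE curves keep the individual curves f(x, z_i) rather than their average, which is what lets them reveal interactions that the PDP averages away.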
A Statistical Perspective on the Challenges in Molecular Microbial Biology
Jeganathan P and Holmes SP
High throughput sequencing (HTS)-based technology enables identifying and quantifying non-culturable microbial organisms in all environments. Microbial sequences have enhanced our understanding of the human microbiome, the soil and plant environment, and the marine environment. All molecular microbial data pose statistical challenges due to contamination sequences from reagents, batch effects, unequal sampling, and undetected taxa. Technical biases and heteroscedasticity have the strongest effects, but different strains across subjects and environments also make direct differential abundance testing unwieldy. We provide an introduction to a few statistical tools that can overcome some of these difficulties and demonstrate those tools on an example. We show how standard statistical methods, such as simple hierarchical mixture and topic models, can facilitate inferences on latent microbial communities. We also review some nonparametric Bayesian approaches that combine visualization and uncertainty quantification. The intersection of molecular microbial biology and statistics is an exciting new venue. Finally, we list some of the important open problems that would benefit from more careful statistical method development.
Statistical downscaling with spatial misalignment: Application to wildland fire PM2.5 concentration forecasting
Majumder S, Guan Y, Reich BJ, O'Neill S and Rappold AG
Fine particulate matter (PM2.5) has been documented to have adverse health effects, and wildland fires are a major contributor to PM2.5 air pollution in the US. Forecasters use numerical models to predict PM2.5 concentrations to warn the public of impending health risk. Statistical methods are needed to calibrate the numerical model forecast using monitor data to reduce bias and quantify uncertainty. Typical model calibration techniques do not allow for errors due to misalignment of geographic locations. We propose a spatiotemporal downscaling methodology that uses image registration techniques to identify the spatial misalignment and accounts for and corrects the bias produced by such warping. Our model is fitted in a Bayesian framework to provide uncertainty quantification of the misalignment and other sources of error. We apply this method to different simulated data sets and show enhanced performance of the method in the presence of spatial misalignment. Finally, we apply the method to a large fire in Washington state and show that the proposed method provides more realistic uncertainty quantification than standard methods.
Modeling and Prediction of Multiple Correlated Functional Outcomes
Cao J, Soiaporn K, Carroll RJ and Ruppert D
We propose a copula-based approach for analyzing functional data with multiple correlated functional outcomes exhibiting heterogeneous shape characteristics. To accommodate the possibly large number of parameters due to having several functional outcomes, parameter estimation is performed in two steps: first, the parameters for the marginal distributions are estimated using the skew t family, and then the dependence structure both within and across outcomes is estimated using a Gaussian copula. We develop an estimation algorithm for the dependence parameters based on the Karhunen-Loève expansion and an EM algorithm that significantly reduces the dimension of the problem and is computationally efficient. We also demonstrate prediction of an unknown outcome when the other outcomes are known. We apply our methodology to diffusion tensor imaging data for multiple sclerosis (MS) patients with three outcomes and identify differences in both the marginal distributions and the dependence structure between the MS and control groups. Our proposed methodology is quite general and can be applied to other functional data with multiple outcomes in biology and other fields.
Bias Correction in Estimating Proportions by Pooled Testing
Hepworth G and Biggerstaff BJ
In the estimation of proportions by pooled testing, the MLE is biased, and several methods of correcting the bias have been presented in previous studies. We propose a new estimator based on the bias correction method introduced by Firth (Biometrika 80:27-38, 1993), which uses a modification of the score function, and we provide an easily computable, Newton-Raphson iterative formula for its computation. Our proposed estimator is almost unbiased across a range of problems, and superior to existing methods. We show that for equal pool sizes the new estimator is equivalent to the estimator proposed by Burrows (Phytopathology 77:363-365, 1987). The performance of our estimator is examined using pooled testing problems encountered in plant disease assessment and prevalence estimation of mosquito-borne viruses.
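To make the construction concrete: for n pools of size k with x testing positive, the MLE of prevalence has a closed form, and a Firth-type estimate can be obtained by maximizing the log-likelihood plus half the log Fisher information (a Jeffreys-style penalty). The sketch below is a hypothetical numerical illustration using a crude golden-section search, not the paper's Newton-Raphson formula, and the parameterization of the penalty is an assumption:

```python
import math

def pooled_mle(x, n, k):
    """MLE of prevalence p from x positive pools out of n pools of size k."""
    return 1.0 - (1.0 - x / n) ** (1.0 / k)

def pooled_firth(x, n, k):
    """Firth-type estimate: maximize log-likelihood plus half the log Fisher
    information, by golden-section search over (0, 0.5) -- adequate for the
    low prevalences typical of pooled testing."""
    def penalized(p):
        theta = 1.0 - (1.0 - p) ** k                      # P(pool positive)
        loglik = x * math.log(theta) + (n - x) * math.log(1.0 - theta)
        info = n * (k * (1.0 - p) ** (k - 1)) ** 2 / (theta * (1.0 - theta))
        return loglik + 0.5 * math.log(info)
    lo, hi = 1e-9, 0.5
    for _ in range(200):                                  # golden-section search
        m1 = lo + 0.382 * (hi - lo)
        m2 = lo + 0.618 * (hi - lo)
        if penalized(m1) < penalized(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)
```

For 3 positive pools out of 10 pools of size 5, the penalized estimate comes out below the MLE, consistent with correcting the MLE's upward bias.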
Distributional Validation of Precipitation Data Products with Spatially Varying Mixture Models
Warr LR, Heaton MJ, Christensen WF, White PA and Rupper SB
The high mountain regions of Asia contain more glacial ice than anywhere on the planet outside of the polar regions. Because of the large population living in the Indus watershed region who are reliant on melt from these glaciers for fresh water, understanding the factors that affect glacial melt along with the impacts of climate change on the region is important for managing these natural resources. While there are multiple climate data products (e.g., reanalysis and global climate models) available to study the impact of climate change on this region, each product will have a different amount of skill in projecting a given climate variable, such as precipitation. In this research, we develop a spatially varying mixture model to compare the distribution of precipitation in the High Mountain Asia region as produced by climate models with the corresponding distribution from in situ observations from the Asian Precipitation-Highly Resolved Observational Data Integration Towards Evaluation (APHRODITE) data product. Parameter estimation is carried out via a computationally efficient Markov chain Monte Carlo algorithm. Each of the estimated climate distributions from each climate data product is then validated against APHRODITE using a spatially varying Kullback-Leibler divergence measure. Supplementary materials for this article are available online at 10.1007/s13253-022-00515-0.
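The validation step rests on the Kullback-Leibler divergence; for discretized distributions it is a one-liner. A generic sketch (the paper's version is spatially varying and compares fitted mixture distributions, which this does not attempt):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions
    given as aligned probability vectors (q must be positive where p is)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

KL is zero only when the two distributions agree and is asymmetric in its arguments, which is why the direction of comparison (model versus APHRODITE) matters.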
A Spatiotemporal Analytical Outlook of the Exposure to Air Pollution and COVID-19 Mortality in the USA
Chakraborty S, Dey T, Jun Y, Lim CY, Mukherjee A and Dominici F
The world is experiencing a pandemic due to Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19. The USA is also suffering from a catastrophic death toll from COVID-19. Several studies are providing preliminary evidence that short- and long-term exposure to air pollution might increase the severity of COVID-19 outcomes, including a higher risk of death. In this study, we develop a spatiotemporal model to estimate the association between exposure to fine particulate matter PM2.5 and mortality, accounting for several social and environmental factors. More specifically, we implement a Bayesian zero-inflated negative binomial regression model with random effects that vary in time and space. Our goal is to estimate the association between air pollution and mortality accounting for the spatiotemporal variability that remained unexplained by the measured confounders. We applied our model to four regions of the USA with weekly data available for each county within each region. We analyze the data separately for each region because each region shows a different disease spread pattern. We found a positive association between long-term exposure to PM2.5 and COVID-19 mortality for all four regions, with three of the four being statistically significant. Data and code are available at our GitHub repository. Supplementary materials accompanying this paper appear online.
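For reference, the zero-inflated negative binomial likelihood underlying such a model has a simple closed-form pmf: a point mass at zero mixed with a negative binomial count distribution. A minimal sketch (size-mean parameterization assumed; this is just the observation model, not the authors' full spatiotemporal regression):

```python
import math

def zinb_pmf(y, pi0, r, mu):
    """Zero-inflated negative binomial pmf: structural zero with probability
    pi0, otherwise NB with size r and mean mu (variance mu + mu**2 / r)."""
    q = r / (r + mu)                                    # NB success probability
    log_nb = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
              + r * math.log(q) + y * math.log(1.0 - q))
    return (pi0 if y == 0 else 0.0) + (1.0 - pi0) * math.exp(log_nb)
```

The extra mass at zero captures county-weeks with no deaths beyond what the count distribution alone would predict, while the size parameter r handles overdispersion.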
Bias Correction in Estimating Proportions by Imperfect Pooled Testing
Hepworth G and Biggerstaff BJ
In the estimation of proportions by pooled testing, the MLE is biased. Hepworth and Biggerstaff (JABES, 22:602-614, 2017) proposed an estimator based on the bias correction method of Firth (Biometrika 80:27-38, 1993) and showed that it is almost unbiased across a range of pooled testing problems involving no misclassification. We now extend their work to allow for imperfect testing. We derive the estimator, provide a Newton-Raphson iterative formula for its computation and test it in situations involving equal or unequal pool sizes, drawing on problems encountered in plant disease assessment and prevalence estimation of mosquito-borne viruses. Our estimator is highly effective at reducing the bias for prevalences consistent with the pooled testing procedure employed.
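For context, with known sensitivity and specificity the uncorrected MLE still has a closed form, obtained by inverting the observed pool-positive probability. A sketch (this is the plain MLE under misclassification, not the bias-corrected estimator derived in the paper):

```python
def pooled_mle_imperfect(x, n, k, se, sp):
    """MLE of prevalence from x positive pools out of n pools of size k,
    with test sensitivity se and specificity sp (se + sp > 1 required).
    Observed positives arise as se * theta + (1 - sp) * (1 - theta),
    where theta = 1 - (1 - p)^k is the true pool-positive probability."""
    q = x / n
    theta = (q - (1.0 - sp)) / (se - (1.0 - sp))
    theta = min(max(theta, 0.0), 1.0)          # clamp to the valid range
    return 1.0 - (1.0 - theta) ** (1.0 / k)
```

With a perfect test (se = sp = 1) this reduces to the standard pooled-testing MLE; imperfect specificity inflates the observed positive fraction, and the inversion above undoes that before solving for p.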
A Case Study Competition Among Methods for Analyzing Large Spatial Data
Heaton MJ, Datta A, Finley AO, Furrer R, Guinness J, Guhaniyogi R, Gerber F, Gramacy RB, Hammerling D, Katzfuss M, Lindgren F, Nychka DW, Sun F and Zammit-Mangion A
The Gaussian process is an indispensable tool for spatial data analysts. The onset of the "big data" era, however, has led to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low-rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given locations, and each was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics. Supplementary materials regarding implementation details of the methods and code are available for this article online.
A Mixed Model for Assessing the Effect of Numerous Plant Species Interactions on Grassland Biodiversity and Ecosystem Function Relationships
McDonnell J, McKenna T, Yurkonis KA, Hennessy D, de Andrade Moral R and Brophy C
In grassland ecosystems, it is well known that increasing plant species diversity can improve ecosystem functions (i.e., ecosystem responses), for example, by increasing productivity and reducing weed invasion. Diversity-Interactions models use species proportions and their interactions as predictors in a regression framework to assess biodiversity and ecosystem function relationships. However, it can be difficult to model numerous interactions if there are many species, and interactions may be temporally variable or dependent on spatial planting patterns. We developed a new Diversity-Interactions mixed model for jointly assessing many species interactions and within-plot species planting pattern over multiple years. We model pairwise interactions using a small number of fixed parameters that incorporate spatial effects and supplement this by including all pairwise interaction variables as random effects, each constrained to have the same variance within each year. The random effects are indexed by pairs of species within plots rather than a plot-level factor as is typical in mixed models, and capture remaining variation due to pairwise species interactions parsimoniously. We apply our novel methodology to three years of weed invasion data from a 16-species grassland experiment that manipulated plant species diversity and spatial planting pattern and test its statistical properties in a simulation study. Supplementary materials for this article are available online at 10.1007/s13253-022-00505-2.
Semiparametric Mixed-Effects Ordinary Differential Equation Models with Heavy-Tailed Distributions
Liu B, Wang L, Nie Y and Cao J
Ordinary differential equation (ODE) models are popularly used to describe complex dynamical systems. When estimating ODE parameters from noisy data, the errors are commonly assumed to be Gaussian. It is known that the Gaussian distribution is not robust when abnormal data exist. In this article, we develop a hierarchical semiparametric mixed-effects ODE model for longitudinal data under the Bayesian framework. For robust inference on ODE parameters, we consider a class of heavy-tailed distributions to model the random effects of ODE parameters and the observation errors. An MCMC method is proposed to sample ODE parameters from the posterior distributions. Our proposed method is illustrated by studying a gene regulation experiment. Simulation studies show that our proposed method provides satisfactory results for semiparametric mixed-effects ODE models with finite samples. Supplementary materials accompanying this paper appear online.
Asynchronous Changepoint Estimation for Spatially Correlated Functional Time Series
Wang M, Harris T and Li B
We propose a new solution under the Bayesian framework to simultaneously estimate mean-based asynchronous changepoints in spatially correlated functional time series. Unlike previous methods that assume a shared changepoint at all spatial locations or ignore spatial correlation, our method treats changepoints as a spatial process. This allows our model to respect spatial heterogeneity and exploit spatial correlations to improve estimation. Our method is derived from the ubiquitous cumulative sum (CUSUM) statistic that dominates changepoint detection in functional time series. However, instead of directly searching for the maximum of the CUSUM-based processes, we build spatially correlated two-piece linear models with appropriate variance structure to locate all changepoints at once. The proposed linear model approach increases the robustness of our method to variability in the CUSUM process, which, combined with our spatial correlation model, improves changepoint estimation near the edges. We demonstrate through extensive simulation studies that our method outperforms existing functional changepoint estimators in terms of both estimation accuracy and uncertainty quantification, under either weak or strong spatial correlation, and weak or strong change signals. Finally, we demonstrate our method using a temperature data set and a coronavirus disease 2019 (COVID-19) study. Supplementary materials for this article are available online at 10.1007/s13253-022-00519-w.
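The CUSUM baseline that the paper improves upon is easy to state: the changepoint estimate is the argmax of the absolute cumulative-sum process. A minimal univariate sketch (the paper instead fits spatially correlated two-piece linear models to this process):

```python
def cusum_changepoint(x):
    """Estimate a single mean changepoint as the argmax of |S_t|, where
    S_t = sum_{i <= t} x_i - (t / n) * sum(x) is the CUSUM process."""
    n = len(x)
    total = sum(x)
    run, best_t, best_val = 0.0, 0, -1.0
    for t in range(1, n):
        run += x[t - 1]
        val = abs(run - t * total / n)
        if val > best_val:
            best_t, best_val = t, val
    return best_t   # change occurs between positions best_t and best_t + 1
```

On a clean mean shift the argmax lands exactly at the change; the paper's linear-model approach is motivated by the fact that this argmax is fragile under noise, especially near the series edges.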
Discussion on "Competition on Spatial Statistics for Large Datasets"
Allard D, Clarotto L, Opitz T and Romary T
We discuss the methods and results of the RESSTE team in the competition on spatial statistics for large datasets. In the first sub-competition, we implemented block approaches both for the estimation of the covariance parameters and for prediction using ordinary kriging. In the second sub-competition, a two-stage procedure was adopted. In the first stage, the marginal distribution is estimated neglecting spatial dependence, either according to the flexible Tukey g-and-h distribution or nonparametrically. In the second stage, estimation of the covariance parameters and prediction are performed using kriging. Vecchia's approximation implemented in the GpGp package proved to be very efficient. We then make some propositions for future competitions.
Computational Efficiency and Precision for Replicated-Count and Batch-Marked Hidden Population Models
Parker MRP, Cowen LLE, Cao J and Elliott LT
We address two computational issues common to open-population N-mixture models, hidden integer-valued autoregressive models, and some hidden Markov models. The first issue is computation time, which can be dramatically improved through the use of a fast Fourier transform. The second issue is tractability of the model likelihood function for large numbers of hidden states, which can be solved by improving the numerical stability of calculations. As an illustrative example, we detail the application of these methods to open-population N-mixture models. We compare computational efficiency and precision between these methods and standard methods employed by state-of-the-art ecological software, showing substantially faster computing times for N-mixture models with population size upper bounds of 500 and 1000. We also apply our methods to compute the size of a large elk population using an N-mixture model and show that while our methods converge, previous software cannot produce estimates due to numerical issues. These solutions can be applied to many ecological models to improve precision when logs of sums exist in the likelihood function and to improve computational efficiency when convolutions are present in the likelihood function. Supplementary materials for this article are available online at 10.1007/s13253-022-00509-y.
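The numerical-stability issue with likelihoods containing logs of sums is typically handled with the log-sum-exp trick: factor out the largest term so no exponential underflows. A minimal, self-contained sketch (not the authors' implementation):

```python
import math

def log_sum_exp(log_terms):
    """Numerically stable log(sum(exp(t) for t in log_terms)): subtracting
    the maximum before exponentiating prevents underflow to zero."""
    m = max(log_terms)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(t - m) for t in log_terms))
```

A naive `log(sum(exp(t)))` fails for terms like -1000 (every `exp` underflows to 0.0, so the log is undefined), whereas the stabilized version returns the exact answer; marginalizing over hundreds of hidden abundance states produces exactly such terms.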
Discussion on Competition for Spatial Statistics for Large Datasets
Flury R and Furrer R
We discuss the experiences and results of the team's participation in the comprehensive and unbiased comparison of different spatial approximations conducted in the competition on spatial statistics for large datasets. In each of the sub-competitions, we estimated parameters of the covariance model based on a likelihood function and predicted missing observations with simple kriging. We approximated the covariance model either with covariance tapering or with a compactly supported Wendland covariance function.
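Both approximations mentioned above rest on compact support: a Wendland covariance is exactly zero beyond its range, and tapering multiplies any covariance by such a function so that distant pairs drop out, yielding sparse matrices. A sketch of one common Wendland form (the smoothness variant is an assumption; the team's actual parameterization may differ):

```python
import math

def wendland(h, theta):
    """Compactly supported Wendland covariance (the phi_{3,1} form):
    C(h) = (1 - h/theta)^4 * (4 h/theta + 1) for h < theta, else exactly 0."""
    r = h / theta
    if r >= 1.0:
        return 0.0
    return (1.0 - r) ** 4 * (4.0 * r + 1.0)

def tapered_cov(h, theta, cov):
    """Covariance tapering: multiply a covariance function by a compactly
    supported taper so distant pairs contribute exactly zero."""
    return cov(h) * wendland(h, theta)

# example: taper an exponential covariance at range theta = 1
c_half = tapered_cov(0.5, 1.0, lambda h: math.exp(-h))
```

Because entries beyond the taper range are exactly zero (not merely small), sparse-matrix linear algebra can be used for the likelihood and kriging systems.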
Optimizing Pooled Testing for Estimating the Prevalence of Multiple Diseases
Warasi MS, Hungerford LL and Lahmers K
Pooled testing can enhance the efficiency of diagnosing individuals with diseases of low prevalence. Often, pooling is implemented using standard groupings (2, 5, 10, etc.). On the other hand, optimization theory can provide specific guidelines in finding the ideal pool size and pooling strategy. This article focuses on optimizing the precision of disease prevalence estimators calculated from multiplex pooled testing data. In the context of a surveillance application of animal diseases, we study the estimation efficiency (i.e., precision) and cost efficiency of the estimators with adjustments for the number of expended tests. This enables us to determine the pooling strategies that offer the highest benefits when jointly estimating the prevalence of multiple diseases, such as theileriosis and anaplasmosis. The outcomes of our work can be used in designing pooled testing protocols, not only in simple pooling scenarios but also in more complex scenarios where individual retesting is performed in order to identify positive cases. A software application using the shiny package in R is provided with this article to facilitate implementation of our methods. Supplementary materials accompanying this paper appear online. Supplementary materials for this article are available at 10.1007/s13253-022-00511-4.