JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A-STATISTICS IN SOCIETY

Causal inference over stochastic networks
Clark DA and Handcock MS
Claiming causal inferences in network settings necessitates careful consideration of the often complex dependency between outcomes for actors. Of particular importance are treatment spillover or outcome interference effects. We consider causal inference when the actors are connected via an underlying network structure. Our key contribution is a model for causality when the underlying network is endogenous, that is, where the ties between actors and the actor covariates are statistically dependent. We develop a joint model for the relational and covariate generating process that avoids restrictive separability and fixed network assumptions, as these rarely hold in realistic social settings. While our framework can be used with general models, we develop the highly expressive class of Exponential-family Random Network models (ERNM) of which Markov random fields and Exponential-family Random Graph models are special cases. We present potential outcome-based inference within a Bayesian framework and propose a modification to the exchange algorithm to allow for sampling from ERNM posteriors. We present results of a simulation study demonstrating the validity of the approach. Finally, we demonstrate the value of the framework in a case study of smoking in the context of adolescent friendship networks.
A framework for understanding selection bias in real-world healthcare data
Kundu R, Shi X, Morrison J, Barrett J and Mukherjee B
Using administrative patient-care data such as Electronic Health Records (EHR) and medical/pharmaceutical claims for population-based scientific research has become increasingly common. With vast sample sizes leading to very small standard errors, researchers need to pay more attention to potential biases in the estimates of association parameters of interest, specifically to biases that do not diminish with increasing sample size. Of these multiple sources of bias, this paper focuses on selection bias. We present an analytic framework using directed acyclic graphs for guiding applied researchers to dissect how different sources of selection bias may affect estimates of the association between a binary outcome and an exposure (continuous or categorical) of interest. We consider four easy-to-implement weighting approaches to reduce selection bias with accompanying variance formulae. We demonstrate through a simulation study when these approaches can reduce selection bias in practice in analyses of real-world data. We compare these methods using a data example where our goal is to estimate the well-known association of cancer and biological sex, using EHR from a longitudinal biorepository at the University of Michigan Healthcare system. We provide annotated R code to implement these weighted methods with associated inference.
Grace periods in comparative effectiveness studies of sustained treatments
Wanis KN, Sarvet AL, Wen L, Block JP, Rifas-Shiman SL, Robins JM and Young JG
Researchers are often interested in estimating the effect of sustained use of a treatment on a health outcome. However, adherence to strict treatment protocols can be challenging for individuals in practice and, when non-adherence is expected, estimates of the effect of sustained use may not be useful for decision making. As an alternative, more relaxed treatment protocols which allow for periods of time off treatment (i.e. grace periods) have been considered in pragmatic randomized trials and observational studies. In this article, we consider the interpretation, identification, and estimation of treatment strategies which include grace periods. We contrast grace period strategies which allow individuals the flexibility to take treatment as they would naturally do, with grace period strategies in which the investigator specifies the distribution of treatment utilization. We estimate the effect of initiation of a thiazide diuretic or an angiotensin-converting enzyme inhibitor in hypertensive individuals under various strategies which include grace periods.
Identifying dietary consumption patterns from survey data: a Bayesian nonparametric latent class model
Stephenson BJK, Wu SM and Dominici F
Dietary assessments provide snapshots of population-based dietary habits. Questions remain about how generalisable those snapshots are in national survey data, where certain subgroups are sampled disproportionately. We propose a Bayesian overfitted latent class model to derive dietary patterns, accounting for survey design and sampling variability. Compared to standard approaches, our model showed improved identifiability of the true population pattern and prevalence in simulation. We apply this model to identify the intake patterns of adults living at or below the 130% poverty income level. Five dietary patterns were identified and characterised. Reproducible code and data are made available to encourage further research.
Incorporating testing volume into estimation of effective reproduction number dynamics
Goldstein IH, Wakefield J and Minin VM
Branching-process-inspired models are widely used to estimate the effective reproduction number (a useful summary statistic describing an infectious disease outbreak) using counts of new cases. Case data is a real-time indicator of changes in the reproduction number, but is challenging to work with because cases fluctuate due to factors unrelated to the number of new infections. We develop a new model that incorporates the number of diagnostic tests as a surveillance model covariate. Using simulated data and data from the SARS-CoV-2 pandemic in California, we demonstrate that incorporating tests leads to improved performance over the state of the art.
A dynamic social relations model for clustered longitudinal dyadic data with continuous or ordinal responses
Pillinger R, Steele F, Leckie G and Jenkins J
Social relations models allow the identification of cluster, actor, partner, and relationship effects when analysing clustered dyadic data on interactions between individuals or other units of analysis. We propose an extension of this model which handles longitudinal data and incorporates dynamic structure, where the response may be continuous, binary, or ordinal. This allows the disentangling of the relationship effects from temporal fluctuation and measurement error and the investigation of whether individuals respond to their partner's behaviour at the previous observation. We motivate and illustrate the model with an application to Canadian data on pairs of individuals within families observed working together on a conflict discussion task.
Measuring Social Inclusion in Europe: a non-additive approach with the expert-preferences of public policy planners
Carrino L, Farnia L and Giove S
This paper introduces a normative, expert-informed, time-dependent index of Social Inclusion for European administrative regions in five countries, using longitudinal data from Eurostat. Our contribution is twofold: first, our indicator is based on a non-additive aggregation operator (the Choquet Integral), which allows us to model many preferences' structures and to overcome the limitations embedded in other approaches. Second, we elicit the parameters of the aggregation operator from an expert panel of Italian policymakers in Social Policy, and Economics scholars. Our results highlight that Mediterranean countries exhibit lower Inclusion levels than Northern/Central countries, and that this disparity has grown in the last decade. Our results complement and partially challenge existing evidence from data-driven aggregation methods.
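As a minimal sketch of the non-additive aggregation named in this abstract, the discrete Choquet integral can be computed as below. The indicator names, scores, and capacity values are hypothetical and are not taken from the paper; the capacity `mu` is assumed to be monotone, to equal 0 on the empty set, and to equal 1 on the full set of indicators.

```python
def choquet_integral(scores, mu):
    """Discrete Choquet integral of `scores` (dict: indicator -> value)
    with respect to a capacity `mu` (dict: frozenset of indicators -> weight)."""
    # Order indicators from the lowest to the highest score.
    ordered = sorted(scores, key=scores.get)
    total = 0.0
    previous = 0.0
    for i, indicator in enumerate(ordered):
        # Coalition of indicators scoring at least as high as the current one.
        coalition = frozenset(ordered[i:])
        # Weight each increment in score by the capacity of that coalition.
        total += (scores[indicator] - previous) * mu[coalition]
        previous = scores[indicator]
    return total


# Hypothetical capacity: each indicator alone counts for little, so an
# unbalanced profile scores below its simple average.
mu = {
    frozenset(): 0.0,
    frozenset({"employment"}): 0.3,
    frozenset({"education"}): 0.3,
    frozenset({"employment", "education"}): 1.0,
}
region = {"employment": 0.4, "education": 0.8}  # hypothetical normalised scores
index = choquet_integral(region, mu)

# An additive capacity reduces the integral to a weighted mean.
additive_mu = dict(mu)
additive_mu[frozenset({"employment"})] = 0.5
additive_mu[frozenset({"education"})] = 0.5
additive_index = choquet_integral(region, additive_mu)
```

With the hypothetical capacity above, the unbalanced region scores lower than it would under the additive capacity, which is the kind of preference structure a weighted mean cannot express.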
Estimating SARS-CoV-2 seroprevalence
Rosin SP, Shook-Sa BE, Cole SR and Hudgens MG
Governments and public health authorities use seroprevalence studies to guide responses to the COVID-19 pandemic. Seroprevalence surveys estimate the proportion of individuals who have detectable SARS-CoV-2 antibodies. However, serologic assays are prone to misclassification error, and non-probability sampling may induce selection bias. In this paper, non-parametric and parametric seroprevalence estimators are considered that address both challenges by leveraging validation data and assuming equal probabilities of sample inclusion within covariate-defined strata. Both estimators are shown to be consistent and asymptotically normal, and consistent variance estimators are derived. Simulation studies are presented comparing the estimators over a range of scenarios. The methods are used to estimate severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) seroprevalence in New York City, Belgium, and North Carolina.
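The two corrections described in this abstract can be illustrated with a standard Rogan-Gladen adjustment for misclassification combined with direct standardization over strata. This is a sketch under stated assumptions and not necessarily the authors' estimator: the function names, validation counts, stratum counts, and population shares below are all hypothetical.

```python
def rogan_gladen(raw_positive_rate, sensitivity, specificity):
    """Correct a raw test-positive proportion for assay misclassification."""
    estimate = (raw_positive_rate + specificity - 1.0) / (sensitivity + specificity - 1.0)
    # Truncate to the unit interval, since the raw correction can fall outside it.
    return min(1.0, max(0.0, estimate))


def standardized_seroprevalence(strata, sensitivity, specificity):
    """strata: list of (positives, tested, population_share) tuples.
    Assumes equal inclusion probability within each stratum."""
    # Reweight stratum-specific positive rates by known population shares.
    raw = sum(share * positives / tested for positives, tested, share in strata)
    return rogan_gladen(raw, sensitivity, specificity)


# Hypothetical validation data: 90 of 100 known positives test positive,
# and 98 of 100 known negatives test negative.
sensitivity = 90 / 100
specificity = 98 / 100

# Hypothetical survey: two strata with population shares 0.6 and 0.4.
strata = [(30, 200, 0.6), (10, 100, 0.4)]
estimate = standardized_seroprevalence(strata, sensitivity, specificity)
```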
A practical revealed preference model for separating preferences and availability effects in marriage formation
Goyal S, Handcock MS, Jackson HM, Rendall MS and Yeung FC
Many demographic problems require models for partnership formation. We consider a model for matchings within a bipartite population where individuals have utility for people based on observed and unobserved characteristics. It represents both the availability of potential partners of different types and the preferences of individuals for such people. We develop an estimator for the preference parameters based on sample survey data on partnerships and population composition. We conduct simulation studies based on the Survey of Income and Program Participation showing that the estimator recovers preference parameters that are invariant under different population availabilities and has the correct confidence coverage.
An experimental evaluation of a stopping rule aimed at maximizing cost-quality trade-offs in surveys
Wagner J, Zhang X, Elliott MR, West BT and Coffey SM
Surveys face difficult choices in managing cost-error trade-offs. Stopping rules for surveys have been proposed as a method for managing these trade-offs. A stopping rule will limit effort on a select subset of cases to reduce costs with minimal harm to quality. Previously proposed stopping rules have focused on quality with an implicit assumption that all cases have the same cost. This assumption is unlikely to be true, particularly when some cases will require more effort and, therefore, more costs than others. We propose a new rule that looks at both predicted costs and quality. This rule is tested experimentally against another rule that focuses on stopping cases that are expected to be difficult to recruit. The experiment was conducted on the 2020 data collection of the Health and Retirement Study (HRS). We test both Bayesian and non-Bayesian (maximum-likelihood or ML) versions of the rule. The Bayesian version of the prediction models uses historical data to establish prior information. The Bayesian version led to higher-quality data for roughly the same cost, while the ML version led to small reductions in quality with larger reductions in cost compared to the control rule.
Multilevel longitudinal analysis of social networks
Koskinen J and Snijders TAB
Stochastic actor-oriented models (SAOMs) are a modelling framework for analysing network dynamics using network panel data. This paper extends the SAOM to the analysis of multilevel network panels through a random coefficient model, estimated with a Bayesian approach. The proposed model allows testing theories about network dynamics, social influence, and interdependence of multiple networks. It is illustrated by a study of the dynamic interdependence of friendship networks and minor delinquency. Data were available for 126 classrooms in the first year of secondary school, of which 82 were used, containing relatively few missing data points and having not too much network turnover.
Bayesian multistate modelling of incomplete chronic disease burden data
Jackson C, Zapata-Diomedi B and Woodcock J
A widely-used model for determining the long-term health impacts of public health interventions, often called a "multistate lifetable", requires estimates of incidence, case fatality, and sometimes also remission rates, for multiple diseases by age and gender. Generally, direct data on both incidence and case fatality are not available in every disease and setting. For example, we may know population mortality and prevalence rather than case fatality and incidence. This paper presents Bayesian continuous-time multistate models for estimating transition rates between disease states based on incomplete data. This builds on previous methods by using a formal statistical model with transparent data-generating assumptions, while providing accessible software as an R package. Rates for people of different ages and areas can be related flexibly through splines or hierarchical models. Previous methods are also extended to allow age-specific trends through calendar time. The model is used to estimate case fatality for multiple diseases in the city regions of England, based on incidence, prevalence and mortality data from the Global Burden of Disease study. The estimates can be used to inform health impact models relating to those diseases and areas. Different assumptions about rates are compared, and we check the influence of different data sources.
An integrated abundance model for estimating county-level prevalence of opioid misuse in Ohio
Hepler SA, Kline DM, Bonny A, McKnight E and Waller LA
Opioid misuse is a national epidemic and a significant drug-related threat to the United States. While the scale of the problem is undeniable, estimates of the local prevalence of opioid misuse are lacking, despite their importance to policy-making and resource allocation. This is due, in part, to the challenge of directly measuring opioid misuse at a local level. In this paper, we develop a Bayesian hierarchical spatio-temporal abundance model that integrates indirect county-level data on opioid-related outcomes with state-level survey estimates on prevalence of opioid misuse to estimate the latent county-level prevalence and counts of people who misuse opioids. A simulation study shows that our integrated model accurately recovers the latent counts and prevalence. We apply our model to county-level surveillance data on opioid overdose deaths and treatment admissions from the state of Ohio. Our proposed framework can be applied to other small area estimation problems for hard-to-reach populations, which arise commonly with health conditions such as those related to illicit behaviors.
COVID-19 clinical footprint to infer about mortality
Rodríguez CE and Mena RH
Information on 4.1 million patients identified as COVID-19 positive in Mexico is used to understand the relationship between comorbidities, symptoms, hospitalisations and deaths due to the COVID-19 disease. Using the presence or absence of these variables a clinical footprint for each patient is created. The risk, expected mortality and the prediction of death outcomes, among other relevant quantities, are obtained and analysed by means of a multivariate Bernoulli distribution. The proposal considers all possible footprint combinations resulting in a robust model suitable for Bayesian inference. The analysis is carried out considering the information on the monthly COVID-19 cases, from March 2020 to the first days of January 2022. This allows one to appreciate the evolution of the mortality risk over time and the effect the strategies of the health authorities have had on it. Supporting information for this article, containing code and the dataset used for the analysis, is available online.
A Semiparametric Approach to Model-Based Sensitivity Analysis in Observational Studies
Zhang B and Tchetgen Tchetgen EJ
When drawing causal inference from observational data, there is almost always concern about unmeasured confounding. One way to tackle this is to conduct a sensitivity analysis. One widely-used sensitivity analysis framework hypothesizes the existence of a scalar unmeasured confounder U and asks how the causal conclusion would change were U measured and included in the primary analysis. Work along this line often makes various parametric assumptions on U, for the sake of mathematical and computational convenience. In this article, we further this line of research by developing a valid sensitivity analysis that leaves the distribution of U unrestricted. Compared to many existing methods in the literature, our method allows for a larger and more flexible family of models, mitigates observable implications (Franks et al., 2019), and works seamlessly with any primary analysis that models the outcome regression parametrically. We construct both pointwise confidence intervals and confidence bands that are uniformly valid over a given sensitivity parameter space, thus formally accounting for unknown sensitivity parameters. We apply our proposed method to an influential yet controversial study of the causal relationship between war experiences and political activeness using observational data from Uganda.
Estimating the Number of Persons with HIV in Jails via Web Scraping and Record Linkage
Shook-Sa BE, Hudgens MG, Kavee AL and Rosen DL
This paper presents methods to estimate the number of persons with HIV in North Carolina jails by applying finite population inferential approaches to data collected using web scraping and record linkage techniques. Administrative data are linked with web-scraped rosters of incarcerated persons in a nonrandom subset of counties. Outcome regression and calibration weighting are adapted for state-level estimation. Methods are compared in simulations and are applied to data from the US state of North Carolina. Outcome regression yielded more precise inference and allowed for county-level estimates, an important study objective, while calibration weighting exhibited double robustness under misspecification of the outcome or weight model.
When survey science met web tracking: Presenting an error framework for metered data
Bosch OJ and Revilla M
Metered data, also called web-tracking data, are generally collected from a sample of participants who willingly install or configure, onto their devices, technologies that track digital traces left when people go online (e.g., URLs visited). Since metered data allow for the observation of online behaviours unobtrusively, they have been proposed as a useful tool to understand what people do online and what impacts this might have on online and offline phenomena. It is crucial, nevertheless, to understand their limitations. Although some research has explored the potential errors of metered data, a systematic categorisation and conceptualisation of these errors are missing. Inspired by the Total Survey Error, we present a Total Error framework for digital traces collected with Meters (TEM). The TEM framework (1) describes the data generation and the analysis process for metered data and (2) documents the sources of bias and variance that may arise in each step of this process. Using a case study we also show how the TEM can be applied in real life to identify, quantify and reduce metered data errors. Results suggest that metered data might indeed be affected by the error sources identified in our framework and, to some extent, biased. This framework can help improve the quality of stand-alone metered data research projects, as well as foster the understanding of how and when survey and metered data can be combined.
When the Ends do not Justify the Means: Learning Who is Predicted to Have Harmful Indirect Effects
Rudolph KE and Díaz I
There is a growing literature on finding rules by which to assign treatment based on an individual's characteristics such that a desired outcome under the intervention is maximized. A related goal entails identifying a subpopulation of individuals predicted to have a harmful indirect effect (the effect of treatment on an outcome through mediators), perhaps even in the presence of a predicted beneficial total treatment effect. In some cases, the implications of a likely harmful indirect effect may outweigh an anticipated beneficial total treatment effect, and would motivate further discussion of whether to treat identified individuals. We build on the mediation and optimal treatment rule literatures to propose a method of identifying a subgroup for which the treatment effect through the mediator is expected to be harmful. Our approach is nonparametric, incorporates post-treatment confounders of the mediator-outcome relationship, and does not make restrictions on the distribution of baseline covariates, mediating variables, or outcomes. We apply the proposed approach to identify a subgroup of boys in the MTO housing voucher experiment who are predicted to have a harmful indirect effect of housing voucher receipt on subsequent psychiatric disorder incidence through aspects of their school and neighborhood environments.
Estimation of reproduction numbers in real time: Conceptual and statistical challenges
Pellis L, Birrell PJ, Blake J, Overton CE, Scarabel F, Stage HB, Brooks-Pollock E, Danon L, Hall I, House TA, Keeling MJ, Read JM, and De Angelis D
The reproduction number R has been a central metric of the COVID-19 pandemic response, published weekly by the UK government and regularly reported in the media. Here, we provide a formal definition and discuss the advantages and most common misconceptions around this quantity. We consider the intuition behind different formulations of R, the complexities in its estimation (including the unavoidable lags involved), and its value compared to other indicators (e.g. the growth rate) that can be directly observed from aggregate surveillance data and react more promptly to changes in epidemic trend. As models become more sophisticated, with age and/or spatial structure, formulating R becomes increasingly complicated and inevitably model-dependent. We present some models currently used in the UK pandemic response as examples. Ultimately, limitations in the available data streams, data quality and time constraints force pragmatic choices to be made on a quantity that is an average across time, space, social structure and settings. Effectively communicating these challenges is important but often difficult in an emergency.
Nearest neighbor ratio imputation with incomplete multinomial outcome in survey sampling
Gao C, Thompson KJ, Kim JK and Yang S
Nonresponse is a common problem in survey sampling. Appropriate treatment can be challenging, especially when dealing with detailed breakdowns of totals. Often, the nearest neighbor imputation method is used to handle such incomplete multinomial data. In this article, we investigate the nearest neighbor ratio imputation estimator, in which auxiliary variables are used to identify the closest donor and the vector of proportions from the donor is applied to the total of the recipient to implement ratio imputation. To estimate the asymptotic variance, we first treat the nearest neighbor ratio imputation as a special case of predictive matching imputation and apply the linearization method of Yang and Kim (2020). To account for the non-negligible sampling fractions, parametric and generalized additive models are employed to incorporate the smoothness of the imputation estimator, which results in a valid variance estimator. We apply the proposed method to estimate expenditures detail items based on empirical data from the 2018 collection of the Service Annual Survey, conducted by the United States Census Bureau. Our simulation results demonstrate the validity of our proposed estimators and also confirm that the derived variance estimators have good performance even when the sampling fraction is non-negligible.
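A minimal sketch of the imputation step described in this abstract follows, assuming a single scalar auxiliary variable and absolute distance for donor matching. The record layout, field names, and values are hypothetical, and the variance estimation discussed in the abstract is not shown.

```python
def nn_ratio_impute(recipient, donors):
    """Impute a recipient's detail items from its nearest donor.

    recipient: dict with 'x' (auxiliary variable) and 'total' (observed total).
    donors: list of dicts with 'x' and 'details' (observed breakdown of the total).
    """
    # Nearest neighbour: the donor closest to the recipient on the auxiliary variable.
    donor = min(donors, key=lambda d: abs(d["x"] - recipient["x"]))
    donor_total = sum(donor["details"])
    # The donor's vector of proportions across detail items.
    proportions = [value / donor_total for value in donor["details"]]
    # Ratio imputation: apply those proportions to the recipient's own total.
    return [p * recipient["total"] for p in proportions]


# Hypothetical respondents reporting a three-way breakdown of expenditures.
donors = [
    {"x": 10.0, "details": [20.0, 30.0, 50.0]},
    {"x": 50.0, "details": [60.0, 20.0, 20.0]},
]
# Hypothetical nonrespondent that reported a total but no breakdown.
recipient = {"x": 12.0, "total": 200.0}
imputed = nn_ratio_impute(recipient, donors)
```

Because the donor's proportions sum to one, the imputed detail items always add up to the recipient's reported total.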
Assessing epidemic curves for evidence of superspreading
Meagher J and Friel N
The expected number of secondary infections arising from each index case, referred to as the reproduction or R number, is a vital summary statistic for understanding and managing epidemic diseases. There are many methods for estimating R; however, few explicitly model heterogeneous disease reproduction, which gives rise to superspreading within the population. We propose a parsimonious discrete-time branching process model for epidemic curves that incorporates heterogeneous individual reproduction numbers. Our Bayesian approach to inference illustrates that this heterogeneity results in less certainty on estimates of the time-varying cohort reproduction number R_t. We apply these methods to a COVID-19 epidemic curve for the Republic of Ireland and find support for heterogeneous disease reproduction. Our analysis allows us to estimate the expected proportion of secondary infections attributable to the most infectious proportion of the population. For example, we estimate that the 20% most infectious index cases account for approximately 75%-98% of the expected secondary infections with 95% posterior probability. In addition, we highlight that heterogeneity is a vital consideration when estimating R_t.
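The link between heterogeneity and the share of infections attributable to the most infectious cases can be illustrated with a small simulation. This sketch assumes gamma-distributed individual reproduction numbers, a common choice in the superspreading literature that the abstract does not confirm; the mean, dispersion values, sample size, and function names are hypothetical.

```python
import random


def simulate_individual_r(n, mean_r, dispersion_k, rng):
    """Draw n individual reproduction numbers from a gamma distribution
    with shape dispersion_k and scale mean_r / dispersion_k (mean = mean_r)."""
    scale = mean_r / dispersion_k
    return [rng.gammavariate(dispersion_k, scale) for _ in range(n)]


def top_share(individual_r, top_fraction=0.2):
    """Share of expected secondary infections attributable to the most
    infectious `top_fraction` of index cases."""
    ordered = sorted(individual_r, reverse=True)
    n_top = max(1, int(round(top_fraction * len(ordered))))
    return sum(ordered[:n_top]) / sum(ordered)


rng = random.Random(2023)  # fixed seed so the sketch is reproducible
# Small dispersion parameter: strongly heterogeneous reproduction.
heterogeneous = simulate_individual_r(20000, mean_r=1.2, dispersion_k=0.1, rng=rng)
# Large dispersion parameter: nearly homogeneous reproduction.
homogeneous = simulate_individual_r(20000, mean_r=1.2, dispersion_k=10.0, rng=rng)

share_heterogeneous = top_share(heterogeneous)
share_homogeneous = top_share(homogeneous)
```

Under these assumptions, the top 20% of cases account for a much larger share of expected secondary infections when the dispersion parameter is small than when it is large.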