Sociological Methodology

Polygenic Indices (a.k.a. Polygenic Scores) in Social Science: A Guide for Interpretation and Evaluation
Burt CH
Polygenic indices (PGI), the new recommended label for polygenic scores (PGS) in social science, are genetic summary scales often used to represent an individual's liability for a disease, trait, or behavior based on the additive effects of measured genetic variants. Enthusiasm for linking genetic data with social outcomes and the inclusion of premade PGIs in social science datasets have facilitated increased uptake of PGIs in social science research, a trend that will likely continue. Yet most social scientists lack the expertise to interpret and evaluate PGIs in social science research. Here, we provide a primer on PGIs for social scientists focusing on key concepts, unique statistical genetic considerations, and best practices in calculation, estimation, reporting, and interpretation. We summarize our recommended best practices as a checklist to aid social scientists in evaluating and interpreting studies with PGIs. We conclude by discussing the similarities between PGIs and standard social science scales and unique interpretative considerations.
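The "additive effects of measured genetic variants" can be illustrated with a minimal sketch: each person's index is a weighted sum of effect-allele counts. The variant names, weights, and genotypes below are hypothetical and are not taken from the article; standardizing within the sample is shown as one common convention, not as the authors' recommendation.

```python
from statistics import mean, pstdev

# Hypothetical per-variant effect weights (for example, from GWAS summary statistics).
weights = {"rs_a": 0.02, "rs_b": -0.01, "rs_c": 0.05}

# Hypothetical genotypes: number of effect alleles (0, 1, or 2) at each variant.
genotypes = {
    "person_1": {"rs_a": 2, "rs_b": 0, "rs_c": 1},
    "person_2": {"rs_a": 0, "rs_b": 2, "rs_c": 0},
    "person_3": {"rs_a": 1, "rs_b": 1, "rs_c": 2},
}


def polygenic_index(allele_counts, weights):
    """Additive score: sum over variants of weight * effect-allele count."""
    return sum(weights[v] * allele_counts[v] for v in weights)


raw = {pid: polygenic_index(g, weights) for pid, g in genotypes.items()}

# Standardize within the sample so the index has mean 0 and SD 1.
mu = mean(raw.values())
sd = pstdev(raw.values())
standardized = {pid: (score - mu) / sd for pid, score in raw.items()}

for pid in raw:
    print(pid, round(raw[pid], 3), round(standardized[pid], 3))
```

Because the standardized index is defined relative to this sample, its values are not directly comparable across samples with different allele frequencies or weights.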
Comparing the Robustness of Simple Network Scale-Up Method Estimators
Kunke JP, Laga I, Niu X and McCormick TH
Evaluation of Respondent-Driven Sampling Prevalence Estimators Using Real-World Reported Network Degree
Avery L and Rotondi M
Respondent-driven sampling (RDS) is used to measure trait or disease prevalence in populations that are difficult to reach and often marginalized. The authors evaluated the performance of RDS estimators under varying conditions of trait prevalence, homophily, and relative activity. They used large simulated networks (N = 20,000) derived from real-world RDS degree reports and an empirical Facebook network (N = 22,470) to evaluate estimators of binary and categorical trait prevalence. Variability in prevalence estimates is higher when network degree is drawn from real-world samples than from the commonly assumed Poisson distribution, resulting in lower coverage rates. Newer estimators perform well when the sample is a substantive proportion of the population, but bias is present when the population size is unknown. The choice of preferred RDS estimator needs to be study specific, considering both statistical properties and knowledge of the population under study.
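To show why reported network degree matters for prevalence estimation, here is a minimal sketch of one well-known estimator, the inverse-degree-weighted (Volz-Heckathorn, or RDS-II) form. The abstract does not say which estimators were compared, so this choice is an assumption made for illustration, and the sample data are hypothetical.

```python
def rds_ii_prevalence(traits, degrees):
    """Inverse-degree-weighted share of respondents with the trait.

    traits:  1 if the respondent has the trait, 0 otherwise
    degrees: self-reported network degree of each respondent (must be > 0)
    """
    weights = [1.0 / d for d in degrees]
    weighted_with_trait = sum(w for w, t in zip(weights, traits) if t)
    return weighted_with_trait / sum(weights)


# Hypothetical sample in which respondents with the trait report larger networks.
traits = [1, 0, 1, 0, 0]
degrees = [10, 2, 20, 5, 4]

naive = sum(traits) / len(traits)              # unweighted sample proportion
adjusted = rds_ii_prevalence(traits, degrees)  # down-weights high-degree respondents

print(f"naive={naive:.3f} adjusted={adjusted:.3f}")
```

High-degree respondents are more likely to be recruited, so they receive smaller weights; the estimate is therefore sensitive to how degree is distributed and reported.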
Bayesian Multistate Life Table Methods for Large and Complex State Spaces: Development and Illustration of a New Method
Lynch SM and Zang E
Multistate life table methods are an important tool for producing easily understood measures of population health. Most contemporary uses of these methods involve sample data, thus requiring techniques for capturing uncertainty in estimates. In recent decades, several methods have been developed to do so. Among these methods, the Bayesian approach proposed by Lynch and Brown has several unique advantages. However, the approach is limited to estimating years to be spent in only two living states, such as "healthy" and "unhealthy." In this article, the authors extend this method to allow for large state spaces with "quasi-absorbing" states. The authors illustrate the new method and show its advantages using data from the Health and Retirement Study to investigate U.S. regional differences in years of remaining life to be spent with diabetes, chronic conditions, and disabilities. The method works well and yields rich output for reporting and subsequent analyses. The expanded method also should facilitate the use of multi-state life tables to address a wider array of social science research questions.
Uncovering Sociological Effect Heterogeneity Using Tree-Based Machine Learning
Brand JE, Xu J, Koch B and Geraldo P
Individuals do not respond uniformly to treatments, such as events or interventions. Sociologists routinely partition samples into subgroups to explore how the effects of treatments vary by selected covariates, such as race and gender, on the basis of theoretical priors. Data-driven discoveries are also routine, yet the analyses by which sociologists typically go about them are often problematic and seldom move us beyond our biases to explore new meaningful subgroups. Emerging machine learning methods based on decision trees allow researchers to explore sources of variation that they may not have previously considered or envisaged. In this article, the authors use tree-based machine learning, that is, causal trees, to recursively partition the sample to uncover sources of effect heterogeneity. Assessing a central topic in social inequality, college effects on wages, the authors compare what is learned from covariate and propensity score-based partitioning approaches with recursive partitioning based on causal trees. Decision trees, although superseded by forests for estimation, can be used to uncover subpopulations responsive to treatments. Using observational data, the authors expand on the existing causal tree literature by applying leaf-specific effect estimation strategies to adjust for observed confounding, including inverse propensity weighting, nearest neighbor matching, and doubly robust causal forests. We also assess localized balance metrics and sensitivity analyses to address the possibility of differential imbalance and unobserved confounding. The authors encourage researchers to follow similar data exploration practices in their work on variation in sociological effects and offer a straightforward framework by which to do so.
Validating Sequence Analysis Typologies Using Parametric Bootstrap
Studer M
In this article, the author proposes a methodology for the validation of sequence analysis typologies on the basis of parametric bootstraps following the framework proposed by Hennig and Lin (2015). The method works by comparing the cluster quality of an observed typology with the quality obtained by clustering similar but nonclustered data. The author proposes several models to test the different structuring aspects of the sequences important in life-course research, namely, sequencing, timing, and duration. This strategy allows identifying the key structural aspects captured by the observed typology. The usefulness of the proposed methodology is illustrated through an analysis of professional and coresidence trajectories in Switzerland. The proposed methodology is available in the WeightedCluster R library.
Multigenerational Social Mobility: A Demographic Approach
Song X
Most social mobility studies take a two-generation perspective, in which intergenerational relationships are represented by the association between parents' and offspring's socioeconomic status. This approach, albeit widely adopted in the literature, has serious limitations when more than two generations of families are considered. In particular, it ignores the role of families' demographic behaviors in moderating mobility outcomes and the joint role of mobility and demography in shaping long-run family and population processes. This paper provides a demographic approach to the study of multigenerational social mobility, incorporating demographic mechanisms of births, deaths, and mating into statistical models of social mobility. Compared to previous mobility models for estimating the probability of offspring's mobility conditional on parent's social class, the proposed joint demography-mobility model treats the number of offspring in various social classes as the outcome of interest. This new approach shows the extent to which demographic processes may amplify or dampen the effects of family socioeconomic positions due to the direction and strength of the interaction between mobility and differentials in demographic behaviors. I illustrate various demographic methods for studying multigenerational mobility with empirical examples using the IPUMS linked historical U.S. census representative samples (1850 to 1930), the Panel Study of Income Dynamics (1968 to 2015), and simulation data that show other possible scenarios resulting from demography-mobility interactions.
Estimating Contextual Effects from Ego Network Data
Smith JA and Gauthier GR
Network concepts are often used to characterize the features of a social context. For example, past work has asked if individuals in more socially cohesive neighborhoods have better mental health outcomes. Despite the ubiquity of use, it is relatively rare for contextual studies to employ the methods of network analysis. This is the case, in part, because network data are difficult to collect, requiring information on all ties between all actors. This paper asks whether it is possible to avoid such heavy data collection while still retaining the best features of a contextual-network study. The basic idea is to apply network sampling to the problem of contextual models, where one uses sampled ego network data to infer the network features of each context, and then uses the inferred network features as second-level predictors in a hierarchical linear model. We test the validity of this idea in the case of network cohesion. Using two complete datasets as a test, we find that ego network data are sufficient to capture the relationship between cohesion and important outcomes, like attachment and deviance. The hope, going forward, is that researchers will find it easier to incorporate holistic network measures into traditional regression models.
Comment: Summarizing Income Mobility with Multiple Smooth Quantiles Instead of Parameterized Means
Lundberg I and Stewart BM
Studies of economic mobility summarize the distribution of offspring incomes for each level of parent income. Mitnik and Grusky (2020) highlight that the conventional intergenerational elasticity (IGE) targets the geometric mean and propose a parametric strategy for estimating the arithmetic mean. We decompose the IGE and their proposal into two choices: (1) the summary statistic for the conditional distribution and (2) the functional form. These choices lead us to a different strategy: visualizing several quantiles of the offspring income distribution as smooth functions of parent income. Our proposal solves the problems Mitnik and Grusky highlight with geometric means, avoids the sensitivity of arithmetic means to top incomes, and provides more information than is possible with any single number. Our proposal has broader implications: the default summary (the mean) used in many regressions is sensitive to the tail of the distribution in ways that may be substantively undesirable.
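The contrast among the three summaries can be seen in a small numeric sketch. Averaging log incomes and exponentiating yields the geometric mean, which is why a regression on log income targets it. The income values are hypothetical, and the example ignores parent income entirely; it only compares how each summary reacts to one extreme top income.

```python
from math import exp, log
from statistics import mean, median


def geometric_mean(values):
    """Exponentiated mean of logs: the quantity a mean of log income targets."""
    return exp(mean([log(v) for v in values]))


# Hypothetical offspring incomes at one level of parent income.
incomes = [20_000, 30_000, 40_000, 50_000, 60_000]

# Same group, but the top earner is replaced by an extreme income.
with_top = incomes[:-1] + [6_000_000]

for label, data in (("baseline", incomes), ("extreme top income", with_top)):
    print(
        label,
        "arithmetic:", round(mean(data)),
        "geometric:", round(geometric_mean(data)),
        "median:", round(median(data)),
    )
```

In this toy case the arithmetic mean moves far more than the geometric mean, and the median does not move at all, which is the kind of tail sensitivity the comment describes.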
Heterogeneous Treatment Effects in the Presence of Self-Selection: A Propensity Score Perspective
Zhou X and Xie Y
An essential feature common to all empirical social research is variability across units of analysis. Individuals differ not only in background characteristics, but also in how they respond to a particular treatment, intervention, or stimulation. Moreover, individuals may self-select into treatment on the basis of their anticipated treatment effects. To study heterogeneous treatment effects in the presence of self-selection, Heckman and Vytlacil (1999, 2001, 2005, 2007) have developed a structural approach that builds on the marginal treatment effect (MTE). In this paper, we extend the MTE-based approach through a redefinition of MTE. Specifically, we redefine MTE as the expected treatment effect conditional on the propensity score (rather than all observed covariates) as well as a latent variable representing unobserved resistance to treatment. As with the original MTE, the new MTE can also be used as a building block for evaluating standard causal estimands. However, the weights associated with the new MTE are simpler, more intuitive, and easier to compute. Moreover, the new MTE is a bivariate function, and thus is easier to visualize than the original MTE. Finally, the redefined MTE immediately reveals treatment effect heterogeneity among individuals who are at the margin of treatment. As a result, it can be used to evaluate a wide range of policy changes with little analytical twist, and to design policy interventions that optimize the marginal benefits of treatment. We illustrate the proposed method by estimating heterogeneous economic returns to college with National Longitudinal Study of Youth 1979 (NLSY79) data.
Constraints in Random Effects Age-Period-Cohort Models
Luo L and Hodges JS
Random effects (RE) models have been widely used to study the contextual effects of structures such as neighborhood or school. The RE approach has recently been applied to age-period-cohort (APC) models that are unidentified because the predictors are exactly linearly dependent. However, it has not been fully understood how the RE specification identifies these otherwise unidentified APC models. We address this challenge by first making explicit that RE-APC models have greater, not less, rank deficiency than the traditional fixed-effects model, and then presenting two empirical examples. We then provide intuition and a mathematical proof to explain that for APC models with one RE, treating one effect as an RE is equivalent to constraining the estimates of that effect's linear component and the random intercept to be zero. For APC models with two REs, the effective constraints implied by the model depend on the true (i.e., in the data-generating mechanism) nonlinear components of the effects that are modeled as REs, so that the estimated components of the REs are determined by the true components of those effects. In conclusion, RE-APC models impose arbitrary though highly obscure constraints and thus do not differ qualitatively from other constrained APC estimators.
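The exact linear dependence behind the identification problem can be checked directly: cohort equals period minus age, so the linear trends in the three predictors cannot be separated. The sketch below uses hypothetical ages, periods, and coefficients; it illustrates the general APC problem, not the RE-APC constraints derived in the article.

```python
# Hypothetical age-by-period grid; cohort (birth year) is determined by the other two.
ages = [20, 30, 40]
periods = [1990, 2000, 2010]
rows = [(a, p, p - a) for a in ages for p in periods]  # (age, period, cohort)

# Exact linear dependence: age - period + cohort == 0 in every row.
dependence = [a - p + c for a, p, c in rows]


def fitted(coefs, row):
    """Linear predictor b_age*age + b_period*period + b_cohort*cohort."""
    return sum(b * x for b, x in zip(coefs, row))


# Two different sets of linear effects (hypothetical values) ...
coefs_1 = (0.5, 0.25, 0.125)
k = 2.0
coefs_2 = (coefs_1[0] + k, coefs_1[1] - k, coefs_1[2] + k)

# ... give identical fitted values, so the data cannot choose between them.
same_fit = all(abs(fitted(coefs_1, r) - fitted(coefs_2, r)) < 1e-9 for r in rows)
print(dependence, same_fit)
```

Any value of k produces the same fit, so a unique estimate requires an additional constraint, whether stated explicitly or implied by the model specification.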
Social Space Diffusion: Applications of a Latent Space Model to Diffusion with Uncertain Ties
Fisher JC
Social networks represent two different facets of social life: (1) stable paths for diffusion, or the spread of something through a connected population, and (2) random draws from an underlying social space, which indicate the relative positions of the people in the network to one another. The dual nature of networks creates a challenge: if the observed network ties are a single random draw, is it realistic to expect that diffusion only follows the observed network ties? This study takes a first step towards integrating these two perspectives by introducing a social space diffusion model. In the model, network ties indicate positions in social space, and diffusion occurs proportionally to distance in social space. Practically, the simulation occurs in two parts. First, positions are estimated using a statistical model (in this example, a latent space model). Second, the predicted probabilities of a tie from that model (representing the distances in social space) or a series of networks drawn from those probabilities (representing routine churn in the network) are used as weights in a weighted averaging framework. Using longitudinal data from high school friendship networks, I explore the properties of the model. I show that the model produces smoothed diffusion results, which predict attitudes in future waves 10% better than a diffusion model using the observed network, and up to 5% better than diffusion models using alternative, non-model-based smoothing approaches.
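The second part of the procedure, using predicted tie probabilities as weights in a weighted average, can be sketched as follows. Everything here is an assumption made for illustration: the positions are hypothetical rather than estimated, the tie probability uses a standard latent distance form (logistic in an intercept minus distance), and each person's own attitude is excluded from the average. The article's actual specification may differ.

```python
from math import dist, exp

# Hypothetical positions in a two-dimensional social space (not estimated here).
positions = {"a": (0.0, 0.0), "b": (0.5, 0.0), "c": (3.0, 0.0)}
attitudes = {"a": 1.0, "b": 0.0, "c": 0.0}
ALPHA = 1.0  # hypothetical intercept of the latent distance model


def tie_probability(i, j):
    """Logistic tie probability that decreases with distance in social space."""
    return 1.0 / (1.0 + exp(-(ALPHA - dist(positions[i], positions[j]))))


def diffusion_step(current):
    """Replace each attitude with the tie-probability-weighted mean of others'."""
    updated = {}
    for i in current:
        others = [j for j in current if j != i]
        weights = [tie_probability(i, j) for j in others]
        weighted_sum = sum(w * current[j] for w, j in zip(weights, others))
        updated[i] = weighted_sum / sum(weights)
    return updated


after_one_step = diffusion_step(attitudes)
print({k: round(v, 3) for k, v in after_one_step.items()})
```

In this toy setup, "b" sits closer to "a" than "c" does, so more of "a"'s attitude reaches "b" in a single step, even though no tie is observed or assumed to be fixed.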
Estimating Multinomial Logit Models with Samples of Alternatives
Jarvis BF
This comment reconsiders advice offered by Bruch and Mare regarding sampling choice sets in conditional logistic regression models of residential mobility. Contradicting Bruch and Mare's advice, past econometric research shows that no statistical correction is needed when using simple random sampling of unchosen alternatives to pare down respondents' choice sets. Using data on stated residential preferences contained in the Los Angeles portion of the Multi-City Study of Urban Inequality, it is shown that following Bruch and Mare's advice (to implement a statistical correction for simple random choice set sampling) leads to biased coefficient estimates. This bias is all but eliminated if the sampling correction is omitted.
Deciding on the Starting Number of Classes of a Latent Class Tree
van den Bergh M, van Kollenburg GH and Vermunt JK
In recent studies, latent class tree (LCT) modeling has been proposed as a convenient alternative to standard latent class (LC) analysis. Instead of using an estimation method in which all classes are formed simultaneously given the specified number of classes, in LCT analysis a hierarchical structure of mutually linked classes is obtained by sequentially splitting classes into two subclasses. The resulting tree structure gives a clear insight into how the classes are formed and how solutions with different numbers of classes are substantively linked to one another. A limitation of the current LCT modeling approach is that it allows only for binary splits, which in certain situations may be too restrictive. Especially at the root node of the tree, where an initial set of classes is created based on the most dominant associations present in the data, it may make sense to use a model with more than two classes. In this article, we propose a modification of the LCT approach that allows for a nonbinary split at the root node, and we provide methods to determine the appropriate number of classes in this first split, based either on theoretical grounds or on a relative improvement of fit measure. This novel approach also can be seen as a hybrid of a standard LC model and a binary LCT model, in which an initial, oversimplified but interpretable model is refined using an LCT approach. Furthermore, we show how to apply an LCT model when a nonstandard LC model is required. These new approaches are illustrated using two empirical applications: one on social capital and the other on (post)materialism.
Nonlinear Autoregressive Latent Trajectory Models
Bauldry S and Bollen KA
Autoregressive latent trajectory (ALT) models combine features of latent growth curve models and autoregressive models into a single modeling framework. The development of ALT models has focused primarily on models with linear growth components, but some social processes follow nonlinear trajectories. Although it is straightforward to extend ALT models to allow for some forms of nonlinear trajectories, the identification status of such models, approaches to comparing them with alternative models, and the interpretation of parameters have not been systematically assessed. In this paper we focus on two forms of nonlinear autoregressive latent trajectory (NLALT) models. The first form allows for a quadratic growth trajectory, a popular form of nonlinear latent growth curve models. The second form derives from latent basis models, or freed loading models, that allow for arbitrary growth processes. We discuss details concerning parameterization, model identification, estimation, and testing for the two forms of NLALT models. We include a simulation study that illustrates potential biases that may arise from fitting alternative models to data derived from an autoregressive process and individual-specific nonlinear trajectories. In addition, we include an extended empirical example modeling growth trajectories of weight from birth through age 2.
New Survey Questions and Estimators for Network Clustering with Respondent-Driven Sampling Data
Verdery AM, Fisher JC, Siripong N, Abdesselam K and Bauldry S
Respondent-driven sampling (RDS) is a popular method for sampling hard-to-survey populations that leverages social network connections through peer recruitment. While RDS is most frequently applied to estimate the prevalence of infections and risk behaviors of interest to public health, such as HIV/AIDS or condom use, it is rarely used to draw inferences about the structural properties of social networks among such populations because it does not typically collect the necessary data. Drawing on recent advances in computer science, we introduce a set of data collection instruments and RDS estimators for network clustering, an important topological property that has been linked to a network's potential for diffusion of information, disease, and health behaviors. We use simulations to explore how these estimators, originally developed for random walk samples of computer networks, perform when applied to RDS samples with characteristics encountered in realistic field settings that depart from random walks. In particular, we explore the effects of multiple seeds, without replacement versus with replacement, branching chains, imperfect response rates, preferential recruitment, and misreporting of ties. We find that clustering coefficient estimators retain desirable properties in RDS samples. This paper takes an important step toward calculating network characteristics using nontraditional sampling methods, and it expands the potential of RDS to tell researchers more about hidden populations and the social factors driving disease prevalence.
Retrospective Reporting of First Employment in the Life-courses of U.S. Women
Shattuck RM and Rendall MS
We investigate the accuracy of young women's retrospective reporting on their first substantial employment in three major, nationally representative United States surveys, examining hypotheses that longer recall duration, employment histories with lower salience and higher complexity, and an absence of "anchoring" biographical details will adversely affect reporting accuracy. We compare retrospective reports to benchmark panel survey estimates for the same cohorts. We find that sociodemographic groups (notably non-Hispanic White women and women with college-educated mothers) whose early employment histories at these ages are in aggregate more complex (multiple jobs) and lower in salience (more part-time jobs) are more likely to omit the occurrence of their first substantial job or employment, and to misreport their first job or employment as occurring at an older age. We also find that retrospective reports are skewed towards overreporting longer, therefore more salient, later jobs over shorter, earlier jobs. The relatively small magnitudes of differences, however, indicate that the retrospective questions nevertheless capture these summary indicators of first substantial employment reasonably accurately. Moreover, these differences are especially small for groups of women who are more likely to experience labor-market disadvantage, and for women with early births.
Estimating Moderated Causal Effects with Time-varying Treatments and Time-varying Moderators: Structural Nested Mean Models and Regression with Residuals
Wodtke GT and Almirall D
Individuals differ in how they respond to a particular treatment or exposure, and social scientists are often interested in understanding how treatment effects are moderated by observed characteristics of individuals. Effect moderation occurs when individual covariates dampen or amplify the effect of some exposure. This article focuses on estimating moderated causal effects in longitudinal settings where both the treatment and effect moderator vary over time. Effect moderation is typically examined using covariate by treatment interactions in regression analyses, but in the longitudinal setting, this approach may be problematic because time-varying moderators of future treatment may be affected by prior treatment (for example, moderators may also be mediators), and naively conditioning on an outcome of treatment in a conventional regression model can lead to bias. This article introduces to sociology moderated intermediate causal effects and the structural nested mean model for analyzing effect moderation in the longitudinal setting. It discusses problems with conventional regression and presents a new approach to estimation that avoids these problems (regression-with-residuals). The method is illustrated using longitudinal data from the PSID to examine whether the effects of time-varying exposures to poor neighborhoods on the risk of adolescent childbearing are moderated by time-varying family income.
Generalizing the Network Scale-Up Method: A New Estimator for the Size of Hidden Populations
Feehan DM and Salganik MJ
The network scale-up method enables researchers to estimate the size of hidden populations, such as drug injectors and sex workers, using sampled social network data. The basic scale-up estimator offers advantages over other size estimation techniques, but it depends on problematic modeling assumptions. We propose a new generalized scale-up estimator that can be used in settings with non-random social mixing and imperfect awareness about membership in the hidden population. Further, the new estimator can be used when data are collected via complex sample designs and from incomplete sampling frames. However, the generalized scale-up estimator also requires data from two samples: one from the frame population and one from the hidden population. In some situations these data from the hidden population can be collected by adding a small number of questions to already planned studies. For other situations, we develop interpretable adjustment factors that can be applied to the basic scale-up estimator. We conclude with practical recommendations for the design and analysis of future studies.
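For orientation, the basic scale-up estimator mentioned above is commonly written as the share of respondents' reported ties that go to hidden-population members, scaled up to the size of the whole population. The sketch below uses that standard form with hypothetical survey responses; it does not implement the generalized estimator or the adjustment factors proposed in the article.

```python
def basic_scale_up(hidden_ties, degrees, population_size):
    """Basic scale-up estimate of hidden population size.

    hidden_ties:     number of hidden-population members each respondent reports knowing
    degrees:         each respondent's estimated personal network size
    population_size: size of the total population the networks are drawn from
    """
    return population_size * sum(hidden_ties) / sum(degrees)


# Hypothetical responses from four survey respondents.
hidden_ties = [1, 0, 2, 0]
degrees = [200, 150, 400, 250]
population_size = 1_000_000

estimate = basic_scale_up(hidden_ties, degrees, population_size)
print(round(estimate))
```

The estimate is unbiased only if ties are formed at random with respect to hidden-population membership and respondents know which of their contacts belong to it; these are the assumptions the generalized estimator is designed to relax.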
Assessing the Effectiveness of Anchoring Vignettes in Bias Reduction for Socioeconomic Disparities in Self-Rated Health among Chinese Adults
Xu H and Xie Y
This study investigates how reporting heterogeneity may bias socioeconomic and demographic disparities in self-rated general health, a widely used health indicator, and how such bias can be adjusted by using new anchoring vignettes designed in the 2012 wave of the China Family Panel Studies (CFPS). We find systematic variation by socio-demographic characteristics in thresholds used by respondents in rating their general health status. Such threshold shifts are often non-parallel in that the effect of a certain group characteristic on the shift is stronger at one level than another. We find that the resulting bias of measuring group differentials in self-rated health can be too substantial to be ignored. We demonstrate that the CFPS anchoring vignettes prove to be an effective survey instrument in obtaining bias-adjusted estimates of health disparities not only for the CFPS sample, but also for an independent sample from the China Health and Retirement Longitudinal Study. Effective adjustment for reporting heterogeneity may require vignette administration only to a small subsample (20-30% of the full sample). Using a single vignette can be as effective as using more in terms of anchoring, but the results are sensitive to the choice of vignette design.
Interviewing Practices, Conversational Practices, and Rapport: Responsiveness and Engagement in the Standardized Survey Interview
Garbarski D, Schaeffer NC and Dykema J
"Rapport" has been used to refer to a range of positive psychological features of an interaction, including a situated sense of connection or affiliation between interactional partners, comfort, willingness to disclose or share sensitive information, motivation to please, or empathy. Rapport could potentially benefit survey participation and response quality by increasing respondents' motivation to participate, disclose, or provide accurate information. Rapport could also harm data quality if motivation to ingratiate or affiliate caused respondents to suppress undesirable information. Some previous research suggests that motives elicited when rapport is high conflict with the goals of standardized interviewing. We examine rapport as an interactional phenomenon, attending to both the content and structure of talk. Using questions about end-of-life planning in the 2003-2005 wave of the Wisconsin Longitudinal Study, we observe that rapport consists of behaviors that can be characterized as dimensions of responsiveness by interviewers and engagement by respondents. We identify and describe types of responsiveness and engagement in selected question-answer sequences and then devise a coding scheme to examine their analytic potential with respect to the criterion of future study participation. Our analysis suggests that responsive and engaged behaviors vary with respect to the goals of standardization: some conflict with these goals, while others complement them.