Bayesian profile regression for clustering analysis involving a longitudinal response and explanatory variables
The identification of sets of co-regulated genes that share a common function is a key question of modern genomics. Bayesian profile regression is a semi-supervised mixture modelling approach that makes use of a response to guide inference toward relevant clusterings. Previous applications of profile regression have considered univariate continuous, categorical, and count outcomes. In this work, we extend Bayesian profile regression to cases where the outcome is longitudinal (or multivariate continuous) and provide PReMiuMlongi, an updated version of PReMiuM, the R package for profile regression. We consider multivariate normal and Gaussian process regression response models and provide proof-of-principle applications in four simulation studies. The model is applied to budding yeast data to identify groups of genes co-regulated during the cell cycle. We identify four distinct groups of genes associated with specific patterns of gene expression trajectories, along with the bound transcription factors likely involved in their co-regulation.
Strategies for Increasing the Accuracy of Interviewer Observations of Respondent Features: Evidence from the U.S. National Survey of Family Growth
Because survey response rates are consistently declining worldwide, survey researchers strive to obtain as much auxiliary information on sampled units as possible. Surveys using in-person interviewing often request that interviewers collect observations on key features of all sampled units, given that interviewers are the eyes and ears of the survey organization. Unfortunately, these observations are prone to error, which decreases the effectiveness of nonresponse adjustments based on the observations. No studies have investigated the strategies being used by interviewers tasked with making these observations, or examined whether certain strategies improve observation accuracy. This study is the first to examine the associations of observational strategies used by survey interviewers with the accuracy of observations collected by those interviewers. A qualitative analysis, followed by multilevel models of observation accuracy, shows that focusing on relevant correlates of the feature being observed and considering a diversity of cues are associated with increased observation accuracy.
Comparing the Performance of Improved Classify-Analyze Approaches For Distal Outcomes in Latent Profile Analysis
Several approaches are available for estimating the relationship of latent class membership to distal outcomes in latent profile analysis (LPA). A three-step approach is commonly used, but has problems with estimation bias and confidence interval coverage. Proposed improvements include the correction method of Bolck, Croon, and Hagenaars (BCH; 2004), Vermunt's (2010) maximum likelihood (ML) approach, and the inclusive three-step approach of Bray, Lanza, & Tan (2015). These methods have been studied in the related case of latent class analysis (LCA) with categorical indicators, but not as well studied for LPA with continuous indicators. We investigated the performance of these approaches in LPA with normally distributed indicators, under different conditions of distal outcome distribution, class measurement quality, relative latent class size, and strength of association between latent class and the distal outcome. The modified BCH implemented in Latent GOLD had excellent performance. The maximum likelihood and inclusive approaches were not robust to violations of distributional assumptions. These findings broadly agree with and extend the results presented by Bakk and Vermunt (2016) in the context of LCA with categorical indicators.
Modeling Likert scale outcomes with trend-proportional odds with and without cluster data
Likert scales are commonly used in epidemiological studies employing surveys. In this tutorial we demonstrate how the proportional odds model and the trend odds model can be applied simultaneously to data measured on Likert scales, allowing for random cluster effects. We use two datasets as examples: an epidemiological study on aging and cognition among community-dwelling Black persons, and a large clustered survey dataset of 28,882 students in 81 middle schools. The first example models the Likert outcome from the question: "People act as if they think you are dishonest". The trend-proportional odds model indicates that Black men have higher odds than Black women of reporting being perceived as dishonest. The second example models the Likert outcome from the question: "How often have you been beaten up at school?". The trend-proportional odds model indicates that children with disability have higher odds of severe violence than other children. For both examples, the cumulative odds ratio increases by more than 60% at the higher Likert levels.
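As a rough guide to the two models named above, the following sketch writes them for a single covariate x and an outcome Y with J ordered levels. The symbols (\alpha_j, \beta, \gamma, t_j) and the specific form of the trend term are assumptions made here for illustration; the tutorial's own parameterization and sign conventions may differ, and a random cluster intercept would add a further term to each linear predictor.

```latex
\begin{align}
% Proportional odds: one slope shared by every cut point j = 1, ..., J-1
\operatorname{logit} P(Y > j \mid x) &= \alpha_j + \beta x \\
% Trend odds (assumed form): the slope shifts with a known score t_j
\operatorname{logit} P(Y > j \mid x) &= \alpha_j + (\beta + \gamma t_j)\, x \\
% Cumulative odds ratio per unit increase in x at cut point j
\mathrm{OR}_j &= \exp(\beta + \gamma t_j)
\end{align}
```

Under this sketch, \gamma = 0 recovers the proportional odds model, while \gamma \neq 0 lets the cumulative odds ratio change across Likert levels.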
Model error in covariance structure models: Some implications for power and Type I error
The present study investigated the degree to which violation of the parameter drift assumption affects the Type I error rate for the test of close fit and power analysis procedures proposed by MacCallum, Browne, and Sugawara (1996) for both the test of close fit and the test of exact fit. The parameter drift assumption states that as sample size increases both sampling error and model error (i.e. the degree to which the model is an approximation in the population) decrease. Model error was introduced using a procedure proposed by Cudeck and Browne (1992). The empirical power for both the test of close fit, in which the null hypothesis specifies that the Root Mean Square Error of Approximation (RMSEA) ≤ .05, and the test of exact fit, in which the null hypothesis specifies that RMSEA = 0, is compared with the theoretical power computed using the MacCallum et al. (1996) procedure. The empirical power and theoretical power for both the test of close fit and the test of exact fit are nearly identical under violations of the assumption. The results also indicated that the test of close fit maintains the nominal Type I error rate under violations of the assumption.
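The theoretical power referred to above is usually computed from noncentral chi-square distributions. The sketch below uses notation introduced here for illustration (sample size N, model degrees of freedom d, null and alternative RMSEA values \varepsilon_0 and \varepsilon_a with \varepsilon_a > \varepsilon_0, significance level \alpha); it follows the usual presentation of the MacCallum et al. (1996) procedure and is not quoted from the study.

```latex
\begin{align}
% Noncentrality parameters implied by the null and alternative RMSEA values
\lambda_0 &= (N-1)\, d\, \varepsilon_0^2, \qquad \lambda_a = (N-1)\, d\, \varepsilon_a^2 \\
% Critical value: upper-alpha quantile of the null distribution
c &= F^{-1}_{\chi^2(d,\,\lambda_0)}(1-\alpha) \\
% Power: probability of exceeding c under the alternative
\text{power} &= 1 - F_{\chi^2(d,\,\lambda_a)}(c)
\end{align}
```

For the test of exact fit, \varepsilon_0 = 0, so the null distribution is a central chi-square; for the test of close fit, \varepsilon_0 = .05.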
Factorial Invariance and The Specification of Second-Order Latent Growth Models
Latent growth modeling has been a topic of intense interest during the past two decades. Most theoretical and applied work has employed first-order growth models, in which a single manifest variable serves as indicator of trait level at each time of measurement. In the current paper, we concentrate on issues regarding second-order growth models, which have multiple indicators at each time of measurement. With multiple indicators, factorial invariance of parameters across times of measurement can be tested. We conduct such tests using two sets of data, which differ in the extent to which factorial invariance holds, and evaluate longitudinal confirmatory factor, latent growth curve, and latent difference score models. We demonstrate that, if factorial invariance fails to hold, the choice of indicator used to identify the latent variable can substantially influence the characterization of patterns of growth, strongly enough to alter conclusions about growth. We also discuss matters related to the scaling of growth factors and conclude with recommendations for practice and for future research.
Validity Concerns with Multiplying Ordinal Items Defined by Binned Counts: An Application to a Quantity-Frequency Measure of Alcohol Use
Social and behavioral scientists often measure constructs that are truly discrete counts by collapsing (or binning) the counts into a smaller number of ordinal responses. While prior quantitative research has identified a series of concerns with similar binning procedures, there has been a lack of study on the consequences of multiplying these ordinal items to create a desired index. This measurement strategy is incorporated in many research applications, but it is particularly salient in the study of substance use where the product of ordinal quantity (number of drinks) and frequency (number of days) items is used to create an index of total consumption. In the current study, we demonstrate both analytically and empirically that this multiplicative procedure can introduce serious threats to construct validity. These threats, in turn, directly impact the ability to accurately measure alcohol consumption.
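A small numerical sketch may make the concern concrete. The bin edges, response codes, and the two example respondents below are hypothetical, chosen only for illustration, and are not the items or data analyzed in the study. The sketch shows one way a product of ordinal codes can reverse the ordering of two people's true total consumption.

```python
import bisect

# Hypothetical bins (assumed for illustration): inclusive upper edges
# for ordinal codes 1..3; anything above the last edge gets code 4.
QUANTITY_EDGES = [2, 4, 8]     # drinks per occasion: 1-2, 3-4, 5-8, 9+
FREQUENCY_EDGES = [2, 5, 15]   # drinking days per month: 1-2, 3-5, 6-15, 16+

def ordinal_code(count, edges):
    """Map a positive count to an ordinal code 1..len(edges)+1."""
    return bisect.bisect_left(edges, count) + 1

def ordinal_index(drinks, days):
    """Product of the two ordinal codes (the multiplicative index)."""
    return ordinal_code(drinks, QUANTITY_EDGES) * ordinal_code(days, FREQUENCY_EDGES)

# Person A: 2 drinks on each of 15 days; Person B: 5 drinks on each of 3 days.
a_true, b_true = 2 * 15, 5 * 3
a_index, b_index = ordinal_index(2, 15), ordinal_index(5, 3)

print(f"A: true total = {a_true}, ordinal index = {a_index}")
print(f"B: true total = {b_true}, ordinal index = {b_index}")
```

With these assumed bins, Person A drinks twice as much in total as Person B yet receives the lower index, because the codes discard the spacing of the underlying counts.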
Using Pointwise Mutual Information for Breast Cancer Health Disparities Research With SEER-Medicare Claims
Identification of procedures using International Classification of Diseases or Healthcare Common Procedure Coding System codes is challenging when conducting medical claims research. We demonstrate how Pointwise Mutual Information can be used to find associated codes. We apply the method to an investigation of racial differences in breast cancer outcomes. We used Surveillance, Epidemiology, and End Results (SEER) data linked to Medicare claims. We identified treatment using two methods. First, we used previously published definitions. Second, we augmented definitions using codes empirically identified by the Pointwise Mutual Information statistic. Similar to previous findings, we found that presentation differences between Black and White women closed much of the estimated survival curve gap. However, we found that survival disparities were completely eliminated with the augmented treatment definitions. We were able to control for a wider range of treatment patterns that might affect survival differences between Black and White women with breast cancer.
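For readers unfamiliar with the statistic, pointwise mutual information for a pair of codes x and y is log[p(x, y) / (p(x) p(y))], which is positive when the codes co-occur more often than independence would predict. The following is a minimal sketch, assuming each claim is represented as a set of codes and probabilities are estimated as fractions of claims; the function name `pmi_scores` and the placeholder codes "A", "B", "C" are hypothetical and not taken from the study.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(claims):
    """Return PMI (base 2) for every pair of codes that co-occur in a claim.

    claims: list of sets of code strings (a hypothetical representation).
    """
    n = len(claims)
    single = Counter()  # number of claims containing each code
    pair = Counter()    # number of claims containing each pair of codes
    for codes in claims:
        single.update(codes)
        pair.update(combinations(sorted(codes), 2))
    scores = {}
    for (x, y), n_xy in pair.items():
        p_xy = n_xy / n
        p_x, p_y = single[x] / n, single[y] / n
        # PMI = log2( p(x, y) / (p(x) * p(y)) )
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy claims with placeholder codes.
claims = [{"A", "B"}, {"A", "B"}, {"C"}, {"A", "C"}]
scores = pmi_scores(claims)
for codes, value in sorted(scores.items()):
    print(codes, round(value, 3))
```

In this toy input, "A" and "B" co-occur more often than expected under independence and receive a positive score, while "A" and "C" receive a negative one. Pairs that never co-occur are simply absent from the result.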