Simulation studies for methodological research in psychology: A standardized template for planning, preregistration, and reporting
Simulation studies are widely used for evaluating the performance of statistical methods in psychology. However, the quality of simulation studies can vary widely in terms of their design, execution, and reporting. In order to assess the quality of typical simulation studies in psychology, we reviewed 321 articles published in 2021 and 2022, among which 100/321 = 31.2% report a simulation study. We find that many articles do not provide complete and transparent information about key aspects of the study, such as justifications for the number of simulation repetitions, Monte Carlo uncertainty estimates, or code and data to reproduce the simulation studies. To address this problem, we provide a summary of the ADEMP (aims, data-generating mechanism, estimands and other targets, methods, performance measures) design and reporting framework from Morris et al. (2019) adapted to simulation studies in psychology. Based on this framework, we provide ADEMP-PreReg, a step-by-step template for researchers to use when designing, potentially preregistering, and reporting their simulation studies. We give formulae for estimating common performance measures, their Monte Carlo standard errors, and for calculating the number of simulation repetitions to achieve a desired Monte Carlo standard error. Finally, we give a detailed tutorial on how to apply the ADEMP framework in practice using an example simulation study on the evaluation of methods for the analysis of pre-post measurement experiments. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
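The Monte Carlo standard error (MCSE) and repetition-number calculations mentioned above follow standard formulas in the spirit of Morris et al. (2019). The sketch below is illustrative only (the function names and numbers are mine, not taken from ADEMP-PreReg) and covers the two most common cases in base R.

```r
# Minimal sketch (not the authors' ADEMP-PreReg code): Monte Carlo standard
# errors for two common performance measures and the number of repetitions
# needed to reach a target MCSE.

# MCSE of estimated coverage (a proportion): sqrt(p * (1 - p) / n_sim)
mcse_coverage <- function(p_hat, n_sim) sqrt(p_hat * (1 - p_hat) / n_sim)

# MCSE of estimated bias: SD of the estimates divided by sqrt(n_sim)
mcse_bias <- function(estimates) sd(estimates) / sqrt(length(estimates))

# Repetitions needed to estimate coverage with a given MCSE,
# assuming an anticipated coverage (e.g., the nominal 95%)
n_sim_for_coverage <- function(p_expected = 0.95, mcse_target = 0.005) {
  ceiling(p_expected * (1 - p_expected) / mcse_target^2)
}

mcse_coverage(p_hat = 0.93, n_sim = 1000)  # approx. 0.008
n_sim_for_coverage(0.95, 0.005)            # 1900 repetitions
```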
Why multiple hypothesis test corrections provide poor control of false positives in the real world
Most scientific disciplines use significance testing to draw conclusions about experimental or observational data. This classical approach provides a theoretical guarantee for controlling the number of false positives across a set of hypothesis tests, making it an appealing framework for scientists seeking to limit the number of false effects or associations that they claim to observe. Unfortunately, this theoretical guarantee applies to few experiments, and the true false positive rate (FPR) is much higher. Scientists have plenty of freedom to choose the error rate to control, the tests to include in the adjustment, and the method of correction, making strong error control difficult to attain. In addition, hypotheses are often tested after finding unexpected relationships or patterns, the data are analyzed in several ways, and analyses may be run repeatedly as data accumulate. As a result, adjusted p values are too small, incorrect conclusions are often reached, and results are harder to reproduce. In the following, I argue why the FPR is rarely controlled meaningfully and why shrinking parameter estimates is preferable to p value adjustments. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
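As a minimal illustration of the flexibility described above (the p values are made up; this is not the author's analysis), base R's p.adjust() shows how both the correction method and the set of tests included in the family change the adjusted p values and hence which results cross a .05 threshold.

```r
# Illustration only: the chosen correction method and the tests included in
# the family both change the adjusted p values.
p <- c(0.004, 0.012, 0.030, 0.041, 0.200)

p.adjust(p, method = "bonferroni")  # strictest; only the first survives .05
p.adjust(p, method = "holm")
p.adjust(p, method = "BH")          # controls FDR, not the familywise rate

# Dropping "uninteresting" tests from the family makes the rest look better:
p.adjust(p[1:3], method = "bonferroni")
```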
How to conduct an integrative mixed methods meta-analysis: A tutorial for the systematic review of quantitative and qualitative evidence
This article is a guide on how to conduct mixed methods meta-analyses (sometimes called mixed methods systematic reviews, integrative meta-analyses, or integrative meta-syntheses), using an integrative approach. These aggregative methods allow researchers to synthesize qualitative and quantitative findings from a research literature in order to benefit from the strengths of both forms of analysis. The article articulates distinctions in how qualitative and quantitative methodologies work with variation to develop a coherent theoretical basis for their integration. In advancing this methodological approach to integrative mixed methods meta-analysis (IMMMA), I provide rationales for procedural decisions that support methodological integrity and address prior misconceptions that may explain why these methods have not been as commonly used as might be expected. Features of questions and subject matters that lead them to be amenable to this research approach are considered. The steps to conducting an IMMMA are then described, with illustrative examples, and in a manner open to the use of a range of qualitative and quantitative meta-analytic approaches. These steps include the development of research aims, the selection of primary research articles, the generation of units for analysis, and the development of themes and findings. The tutorial provides guidance on how to develop IMMMA findings that have methodological integrity and are based upon the appreciation of the distinctive approaches to modeling variation in quantitative and qualitative methodologies. The article concludes with guidance for report writing and developing principles for practice. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
Harvesting heterogeneity: Selective expertise versus machine learning
The heterogeneity of outcomes in behavioral research has long been perceived as a challenge for the validity of various theoretical models. More recently, however, researchers have started perceiving heterogeneity as something that needs to be not only acknowledged but also actively addressed, particularly in applied research. A serious challenge, however, is that classical psychological methods are not well suited for making practical recommendations when heterogeneous outcomes are expected. In this article, we argue that heterogeneity requires a separation between basic and applied behavioral methods, and between different types of behavioral expertise. We propose a novel framework for evaluating behavioral expertise and suggest that selective expertise can easily be automated via various machine learning methods. We illustrate the value of our framework via an empirical study of the preferences towards battery electric vehicles. Our results suggest that a basic multiarm bandit algorithm vastly outperforms human expertise in selecting the best interventions. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
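The kind of automated selective expertise described above can be sketched with a toy epsilon-greedy bandit in base R. The intervention response rates below are hypothetical, and the authors' actual algorithm and study design may differ; this is only a conceptual illustration.

```r
# Toy sketch (not the authors' study): an epsilon-greedy bandit choosing among
# four interventions with hypothetical, unknown response rates.
set.seed(1)
true_rates <- c(0.10, 0.15, 0.30, 0.12)   # hypothetical effectiveness
eps <- 0.10; n_trials <- 5000
successes <- pulls <- numeric(length(true_rates))

for (t in seq_len(n_trials)) {
  arm <- if (runif(1) < eps || all(pulls == 0)) {
    sample(seq_along(true_rates), 1)          # explore
  } else {
    which.max(successes / pmax(pulls, 1))     # exploit the current best
  }
  reward <- rbinom(1, 1, true_rates[arm])
  pulls[arm] <- pulls[arm] + 1
  successes[arm] <- successes[arm] + reward
}

round(successes / pmax(pulls, 1), 2)  # estimated rates; most trials go to arm 3
```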
Lagged multidimensional recurrence quantification analysis for determining leader-follower relationships within multidimensional time series
The current article introduces lagged multidimensional recurrence quantification analysis. The method is an extension of multidimensional recurrence quantification analysis and allows researchers to quantify the joint dynamics of multivariate time series and to investigate leader-follower relationships in behavioral and physiological data. Moreover, the method enables the quantification of the joint dynamics of a group when such leader-follower relationships are taken into account. We first provide a formal presentation of the method and then apply it to synthetic data, as well as data sets from joint action research, investigating the shared dynamics of facial expression and beats-per-minute recordings within different groups. A wrapper function is included for applying the method together with the "crqa" package in R. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
The potential of preregistration in psychology: Assessing preregistration producibility and preregistration-study consistency
Study preregistration has become increasingly popular in psychology, but its potential to restrict researcher degrees of freedom has not yet been empirically verified. We used an extensive protocol to assess the producibility (i.e., the degree to which a study can be properly conducted based on the available information) of preregistrations and the consistency between preregistrations and their corresponding papers for 300 psychology studies. We found that preregistrations often lack methodological details and that undisclosed deviations from preregistered plans are frequent. These results highlight that biases due to researcher degrees of freedom remain possible in many preregistered studies. More comprehensive registration templates typically yielded more producible preregistrations. We did not find that the producibility and consistency of preregistrations differed over time or between original and replication studies. Furthermore, we found that operationalizations of variables were generally preregistered more producibly and consistently than other study parts. Inconsistencies between preregistrations and published studies were mainly encountered for data collection procedures, statistical models, and exclusion criteria. Our results indicate that, to unlock the full potential of preregistration, researchers in psychology should aim to write more producible preregistrations, adhere to these preregistrations more faithfully, and more transparently report any deviations from their preregistrations. This could be facilitated by training and education to improve preregistration skills, as well as the development of more comprehensive templates. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
Comments on the measurement of effect sizes for indirect effects in Bayesian analysis of variance
Bayesian analysis of variance (BANOVA), implemented through R packages, offers a Bayesian approach to analyzing experimental data. An earlier tutorial extensively documents BANOVA. This note critically examines a method for evaluating mediation using partial eta-squared as an effect size measure within the BANOVA framework. We first identify an error in the formula for partial eta-squared and propose a corrected version. Subsequently, we discuss limitations in the interpretability of this effect size measure, drawing on previous research, and argue for its potential unsuitability in assessing indirect effects in mediation analysis. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
Item response theory-based continuous test norming
In norm-referenced psychological testing, an individual's performance is expressed in relation to a reference population using a standardized score, like an intelligence quotient score. The reference population can depend on a continuous variable, like age. Current continuous norming methods transform the raw score into an age-dependent standardized score. Such methods have the shortcoming of relying solely on the raw test scores, ignoring valuable information from individual item responses. Instead of modeling the raw test scores, we propose modeling the item scores with a Bayesian two-parameter logistic (2PL) item response theory model with age-dependent mean and variance of the latent trait distribution, 2PL-norm for short. Norms are then derived using the estimated latent trait score and the age-dependent distribution parameters. Simulations show that 2PL-norms are overall more accurate than those from the most popular raw score-based norming methods cNORM and generalized additive models for location, scale, and shape (GAMLSS). Furthermore, the credible intervals of 2PL-norm exhibit clearly superior coverage over the confidence intervals of the raw score-based methods. The only issue with 2PL-norm is its slightly lower performance at the tails of the norms. Among the raw score-based norming methods, GAMLSS outperforms cNORM. For empirical practice this suggests the use of 2PL-norm, if the model assumptions hold. If not, or if interest lies solely in point estimates at the extreme trait positions, GAMLSS-based norming is a better alternative. The use of the 2PL-norm is illustrated and compared with GAMLSS and cNORM using empirical data, and code is provided, so that users can readily apply 2PL-norm to their normative data. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
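In standard notation (the article's exact parameterization may differ), the 2PL-norm model described above can be written as

```latex
P(X_{pj} = 1 \mid \theta_p) = \frac{1}{1 + \exp\{-a_j(\theta_p - b_j)\}},
\qquad \theta_p \sim N\!\big(\mu(\mathrm{age}_p),\, \sigma^2(\mathrm{age}_p)\big),
```

where a_j and b_j are the discrimination and difficulty of item j, and the latent mean μ(·) and variance σ²(·) are functions of age. A person's norm is then obtained by locating the estimated latent trait within the age-specific latent trait distribution.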
Bayesian estimation and comparison of idiographic network models
Idiographic network models are estimated on time series data of a single individual and allow researchers to investigate person-specific associations between multiple variables over time. The most common approach for fitting graphical vector autoregressive (GVAR) models uses least absolute shrinkage and selection operator (LASSO) regularization to estimate a contemporaneous and a temporal network. However, estimation of idiographic networks can be unstable in relatively small data sets typical for psychological research. This bears the risk of misinterpreting differences in estimated networks as spurious heterogeneity between individuals. As a remedy, we evaluate the performance of a Bayesian alternative for fitting GVAR models that allows for regularization of parameters while accounting for estimation uncertainty. We also develop a novel test, implemented in the tsnet package in R, which assesses whether differences between estimated networks are reliable based on matrix norms. We first compare Bayesian and LASSO approaches across a range of conditions in a simulation study. Overall, LASSO estimation performs well, while a Bayesian GVAR without edge selection may perform better when the true network is dense. In an additional simulation study, the novel test is conservative and shows good false-positive rates. Finally, we apply Bayesian estimation and testing in an empirical example using daily data on clinical symptoms for 40 individuals. We additionally provide functionality to estimate Bayesian GVAR models in Stan within tsnet. Overall, Bayesian GVAR modeling facilitates the assessment of estimation uncertainty which is important for studying interindividual differences of intraindividual dynamics. In doing so, the novel test serves as a safeguard against premature conclusions of heterogeneity. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
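In its standard form (notation mine, not necessarily the article's), the GVAR model underlying both the LASSO and the Bayesian approaches is

```latex
\mathbf{y}_t = \boldsymbol{\beta}_0 + \mathbf{B}\,\mathbf{y}_{t-1} + \boldsymbol{\varepsilon}_t,
\qquad \boldsymbol{\varepsilon}_t \sim N(\mathbf{0}, \boldsymbol{\Sigma}),
\qquad \rho_{ij} = -\frac{\kappa_{ij}}{\sqrt{\kappa_{ii}\,\kappa_{jj}}} \;\text{ with }\; \mathbf{K} = \boldsymbol{\Sigma}^{-1},
```

where the off-diagonal elements of B form the temporal (lag-1) network and the partial correlations ρ_ij derived from the innovation precision matrix K form the contemporaneous network.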
Multiple imputation of missing data in large studies with many variables: A fully conditional specification approach using partial least squares
Multiple imputation (MI) is one of the most popular methods for handling missing data in psychological research. However, many imputation approaches are poorly equipped to handle a large number of variables, which are a common sight in studies that employ questionnaires to assess psychological constructs. In such a case, conventional imputation approaches often become unstable and require that the imputation model be simplified, for example, by removing variables or combining them into composite scores. In this article, we propose an alternative method that extends the fully conditional specification approach to MI with dimension reduction techniques such as partial least squares. To evaluate this approach, we conducted a series of simulation studies, in which we compared it with other approaches that were based on variable selection, composite scores, or dimension reduction through principal components analysis. Our findings indicate that this novel approach can provide accurate results even in challenging scenarios, where other approaches fail to do so. Finally, we also illustrate the use of this method in real data and discuss the implications of our findings for practice. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
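The core idea can be sketched for a single incomplete variable using the pls package; this is only a conceptual illustration and not the authors' algorithm, which embeds dimension reduction in a full fully conditional specification (FCS) cycle over all incomplete variables with proper (Bayesian) imputation draws. The function name and number of components below are hypothetical.

```r
# Conceptual sketch only: impute one incomplete variable from many predictors
# by regressing it on a few partial least squares components.
library(pls)

impute_one_pls <- function(y, X, ncomp = 5) {
  obs <- !is.na(y)
  dat <- data.frame(y = y, X)
  fit <- plsr(y ~ ., ncomp = ncomp, data = dat[obs, ])
  # residual SD on observed cases, used to add noise to the predictions
  sigma <- sd(dat$y[obs] - as.vector(predict(fit, ncomp = ncomp)))
  newdat <- dat[!obs, ]
  newdat$y <- 0                          # placeholder; the response is not used for prediction
  pred <- as.vector(predict(fit, newdata = newdat, ncomp = ncomp))
  y[!obs] <- pred + rnorm(sum(!obs), 0, sigma)  # stochastic, not deterministic, imputations
  y
}
```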
Scaling and estimation of latent growth models with categorical indicator variables
Although the interest in latent growth models (LGMs) with categorical indicator variables has recently increased, there are still difficulties regarding the selection of estimation methods and the interpretation of model estimates. However, difficulties in estimating and interpreting categorical LGMs can be avoided by understanding the scaling process. Depending on which parameter constraint methods are selected at each step of the scaling process, the scale applied to the model changes, which can produce significant differences in the estimation results and interpretation. In other words, if a different method is chosen for any of the steps in the scaling process, the estimation results will not be comparable. This study organizes the scaling process and its relationship with estimation methods for categorical LGMs. Specifically, this study organizes the parameter constraint methods included in the scaling process of categorical LGMs and extensively considers the effect of parameter constraints at each step on the meaning of estimates. This study also provides evidence for the scale suitability and interpretability of model estimates through a simple illustration. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
Percentage of variance accounted for in structural equation models: The rediscovery of the goodness of fit index
This article delves into the often-overlooked metric of percentage of variance accounted for in structural equation models (SEM). The goodness of fit index (GFI) provides the percentage of variance of the sum of squared covariances explained by the model. Despite being introduced over four decades ago, the GFI has been overshadowed in favor of fit indices that prioritize distinctions between close and nonclose fitting models. Similar to R² in regression, the GFI should not be used for that purpose but rather to quantify the model's utility. The central aim of this study is to reintroduce the GFI, introducing a novel approach to computing the GFI using mean and mean-and-variance corrected test statistics, specifically designed for nonnormal data. We use an extensive simulation study to evaluate the precision of inferences on the GFI, including point estimates and confidence intervals. The findings demonstrate that the GFI can be very accurately estimated, even with nonnormal data, and that confidence intervals exhibit reasonable accuracy across diverse conditions, including large models and nonnormal data scenarios. The article provides methods and code for estimating the GFI in any SEM, urging researchers to reconsider the reporting of the percentage of variance accounted for as an essential tool for model assessment and selection. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
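In its classical unweighted least squares form, which matches the "percentage of variance of the sum of squared covariances" reading above, the GFI can be written as

```latex
\mathrm{GFI} = 1 - \frac{\operatorname{tr}\!\big[(\mathbf{S} - \hat{\boldsymbol{\Sigma}})^2\big]}{\operatorname{tr}\!\big[\mathbf{S}^2\big]},
```

where S is the sample covariance matrix and Σ̂ the model-implied covariance matrix; it is the proportion of the sum of squared (co)variances in S reproduced by the model, analogous to R² in regression. The article's corrected estimators for nonnormal data build on this quantity.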
How should we model the effect of "change"-Or should we?
There have been long and bitter debates between those who advocate for the use of residualized change as the foundation of longitudinal models and those who utilize difference scores. However, these debates have focused primarily on modeling change in the outcome variable. Here, we extend these same ideas to the covariate side of the change equation, finding that similar issues arise when using lagged versus difference scores as covariates of interest in models of change. We derive a system of relationships that emerge across models differing in how time-varying covariates are represented, and then demonstrate how the set of logical transformations emerges in applied longitudinal settings. We conclude by considering the practical implications of a synthesized understanding of the effects of difference scores as both outcomes and predictors, with specific consequences for mediation analysis within multivariate longitudinal models. Our results suggest that there is reason for caution when using difference scores as time-varying covariates, given their propensity for inducing apparent inferential inversions within different analyses. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
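A minimal two-wave example (notation mine) shows the kind of algebraic relationship at stake when a difference score rather than a lagged score is used as the covariate:

```latex
\Delta y_i = \beta_0 + \beta_1\,\Delta x_i + \beta_2\,x_{i1} + \varepsilon_i
\;\;\Longleftrightarrow\;\;
\Delta y_i = \beta_0 + \beta_1\,x_{i2} + (\beta_2 - \beta_1)\,x_{i1} + \varepsilon_i ,
```

since Δx_i = x_{i2} - x_{i1}. The two specifications are reparameterizations of one another, but the coefficient attached to the baseline covariate changes meaning and can change sign, which is one way apparent inferential inversions can arise.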
A computational method to reveal psychological constructs from text data
When starting to formalize psychological constructs, researchers traditionally rely on two distinct approaches: the quantitative approach, which defines constructs as part of a testable theory based on prior research and domain knowledge, often deploying self-report questionnaires, or the qualitative approach, which gathers data mostly in the form of text and bases construct definitions on exploratory analyses. Quantitative research might lead to an incomplete understanding of the construct, while qualitative research is limited by challenges in systematic data processing, especially at large scale. We present a new computational method that combines the comprehensiveness of qualitative research and the scalability of quantitative analyses to define psychological constructs from semistructured text data. Based on structured questions, participants are prompted to generate sentences reflecting instances of the construct of interest. We apply computational methods to calculate embeddings as numerical representations of the sentences, which we then run through a clustering algorithm to arrive at groupings of sentences as psychologically relevant classes. The method includes steps for the measurement and correction of bias introduced by the data generation, and the assessment of cluster validity according to human judgment. We demonstrate the applicability of our method on an example from emotion regulation. Based on short descriptions of emotion regulation attempts collected through an open-ended situational judgment test, we use our method to derive classes of emotion regulation strategies. Our approach shows how machine learning and psychology can be combined to provide new perspectives on the conceptualization of psychological processes. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
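The clustering step can be sketched in base R; the placeholder embedding matrix below stands in for embeddings obtained from some external sentence-embedding model, and kmeans() is used here only as a generic stand-in for the article's clustering algorithm, bias correction, and validity checks.

```r
# Rough sketch of the clustering step only. 'emb' is assumed to be an
# n_sentences x d matrix of sentence embeddings from an external model;
# here it is random placeholder data.
set.seed(4)
emb <- matrix(rnorm(200 * 16), nrow = 200)

k <- 5
cl <- kmeans(scale(emb), centers = k, nstart = 25)

table(cl$cluster)   # sizes of the candidate construct classes
# Sentences nearest each centroid can then be inspected and labeled by humans.
```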
Uses of uncertain statistical power: Designing future studies, not evaluating completed studies
Statistical power is a topic of intense interest as part of proposed methodological reforms to improve the defensibility of psychological findings. Power has been used in disparate ways, some that follow and some that do not follow from definitional features of statistical power. We introduce a taxonomy of three uses of power (comparing the performance of different procedures, designing or planning studies, and evaluating completed studies) in the context of new developments that consider uncertainty due to sampling variability. This review first describes fundamental concepts underlying power, new quantitative developments in power analysis, and the application of power analysis in designing studies. To facilitate the pedagogy of using power for design, we provide web applications to illustrate these concepts and examples of power analysis using newly developed methods. We also describe why using power for evaluating completed studies can be counterproductive. We conclude with a discussion of future directions in quantitative research on power analysis and provide recommendations for applying power in substantive research. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
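The design use of power can be illustrated with base R (the effect size below is assumed, not known, which is precisely where the uncertainty discussed above enters):

```r
# Power used for design: per-group sample size for a two-sample t test,
# assuming a standardized effect of 0.5 and targeting 80% power.
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
# n is approximately 64 per group; uncertainty in the assumed effect size
# propagates directly into uncertainty about this n.
```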
Solving variables with Monte Carlo simulation experiments: A stochastic root-solving approach
Despite the popularity and flexibility of Monte Carlo simulation experiments, questions remain regarding how to optimally solve for particular unknown variables of interest. This article reviews two common approaches based on either performing deterministic iterative searches with noisy objective functions or constructing interpolation estimates from fitted surrogate functions, highlighting the inefficiencies and inferential concerns of both methods. To address these limitations, and to fill a gap in existing Monte Carlo experimental methodology, a novel algorithm termed the probabilistic bisection algorithm with bolstering and interpolations (ProBABLI) is presented with the goal of providing efficient, consistent, and unbiased estimates (with associated confidence intervals) for the stochastic root equations found in Monte Carlo simulation research. Properties of the ProBABLI approach are demonstrated using practical sample size planning applications for independent samples tests and structural equation models given target power rates, precision criteria, and expected power functions that incorporate prior beliefs. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
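The generic problem can be illustrated with a naive noisy bisection in base R: find the per-group n at which simulated power of a two-sample t test reaches .80. This is emphatically not ProBABLI; it is the kind of deterministic search over a noisy objective whose limitations motivate the probabilistic approach.

```r
# Naive illustration of the stochastic root-solving problem (NOT ProBABLI):
# the simulated power estimate is noisy, which is exactly what makes
# deterministic bisection on it unreliable.
set.seed(2)
sim_power <- function(n, d = 0.5, reps = 500) {
  mean(replicate(reps, t.test(rnorm(n, d), rnorm(n))$p.value < 0.05))
}

lo <- 10; hi <- 200
for (i in 1:12) {                 # bisection on a noisy function
  mid <- round((lo + hi) / 2)
  if (sim_power(mid) < 0.80) lo <- mid else hi <- mid
}
c(lo = lo, hi = hi)  # usually brackets the analytic answer (~64),
                     # but a single noisy comparison can derail the search
```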
A computationally efficient and robust method to estimate exploratory factor analysis models with correlated residuals
A critical assumption in exploratory factor analysis (EFA) is that manifest variables are no longer correlated after the influences of the common factors are controlled. The assumption may not be valid in some EFA applications; for example, questionnaire items may share other characteristics in addition to their relations to the common factors. We present a computationally efficient and robust method to estimate EFA with correlated residuals. We provide details on the implementation of the method with both ordinary least squares estimation and maximum likelihood estimation. We demonstrate the method using empirical data and conduct a simulation study to explore its statistical properties. The results are (a) that the new method encountered far fewer convergence problems than the existing method; (b) that the EFA model with correlated residuals produced a more satisfactory model fit than the conventional EFA model; and (c) that the EFA model with correlated residuals and the conventional EFA model produced very similar estimates for factor loadings. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
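In standard factor-analytic notation (not taken from the article), the implied covariance structure is

```latex
\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Theta},
```

where Λ contains the loadings, Φ the factor covariances, and Θ the residual covariance matrix. Conventional EFA restricts Θ to be diagonal, whereas the approach described here frees selected off-diagonal elements of Θ for items that share features beyond the common factors.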
Causal definitions versus casual estimation: Reply to Valente et al. (2022)
In this response to Valente et al. (2022), I discuss the plausibility and applicability of the proposed mediation model and its causal effects estimation for single case experimental designs (SCEDs). I focus on the underlying assumptions that the authors use to identify the causal effects, including the particularly problematic assumption of sequential ignorability, that is, no unmeasured confounders. First, I discuss the plausibility of the assumption in general and then particularly for SCEDs by providing an analytic argument and a reanalysis of the empirical example in Valente et al. (2022). Second, I provide a simulation that reproduces the design of Valente et al. (2022) with the exception that, for a more realistic depiction of empirical data, an unmeasured confounder affects the mediator and outcome variables. The results of this simulation study indicate that even minor violations will lead to Type I error rates up to 100% and coverage rates as low as 0% for the defined causal direct and indirect effects. Third, using historical data on the effect of birth control on stork populations and birth rates, I show that mediation models like the proposed method can lead to surprising artifacts. These artifacts can hardly be identified by statistical means, including methods such as sensitivity analyses. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
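The single-mediator setup the critique concerns can be written, in standard notation with an unmeasured confounder U of the mediator-outcome relation, as

```latex
M = a\,X + u_1\,U + \varepsilon_M, \qquad
Y = c'\,X + b\,M + u_2\,U + \varepsilon_Y .
```

When U is omitted, the estimated indirect effect âb̂ absorbs the confounded part of the M-Y association, so even modest nonzero u_1 u_2 biases the indirect effect and its test; sequential ignorability is precisely the assumption that no such U exists.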
Practical implications of equating equivalence tests: Reply to Campbell and Gustafson (2022)
Linde et al. (2021) compared the "two one-sided tests", the "highest density interval-region of practical equivalence", and the "interval Bayes factor" approaches to establishing equivalence in terms of power and Type I error rate using typical decision thresholds. They found that the interval Bayes factor approach exhibited a higher power but also a higher Type I error rate than the other approaches. In response, Campbell and Gustafson (2022) showed that the performances of the three approaches can approximate one another when they are calibrated to have the same Type I error rate. In this article, we argue that these results have little bearing on how these approaches are used in practice; a concrete example is used to highlight this important point. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
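For reference, the "two one-sided tests" (TOST) procedure discussed above can be sketched in base R; the data and the equivalence margin below are illustrative, and the calibration issues debated in the exchange are not shown.

```r
# Minimal base-R sketch of TOST with a raw-scale equivalence margin of +/- 0.3
# (margin and data are illustrative).
set.seed(3)
x <- rnorm(50, 0.05); y <- rnorm(50, 0.00); margin <- 0.3

p_lower <- t.test(x, y, mu = -margin, alternative = "greater")$p.value
p_upper <- t.test(x, y, mu =  margin, alternative = "less")$p.value

max(p_lower, p_upper) < 0.05  # equivalence is declared only if both one-sided tests reject
```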
An overview of alternative formats to the Likert format: A comment on Wilson et al. (2022)
Wilson et al. (2022) compared the Likert response format to an alternative format, which they called the Guttman response format. Using a Rasch modeling approach, they found that the Guttman response format had better properties relative to the Likert response format. We agree with their analyses and conclusions. However, they have failed to mention many existing articles that have sought to overcome the disadvantages of the Likert format through the use of an alternative format. For example, the so-called "Guttman response format" is essentially the same as the Expanded format, which was proposed by Zhang and Savalei (2016) as a way to overcome the disadvantages of the Likert format. Similar alternative formats have been investigated since the 1960s. In this short response article, we provide a review of several alternative formats, explaining in detail the key characteristics of all the alternative formats that are designed to overcome the problems with the Likert format. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
Correction to "Comparing theories with the Ising model of explanatory coherence" by Maier et al. (2023)
Reports an error in "Comparing theories with the Ising model of explanatory coherence" by Maximilian Maier, Noah van Dongen and Denny Borsboom (Advanced Online Publication, Mar 02, 2023, np). In the article, the copyright attribution was incorrectly listed, and the Creative Commons CC BY 4.0 license disclaimer was incorrectly omitted from the author note. The correct copyright is "© 2023 The Author(s)," and the omitted disclaimer is below: Open Access funding provided by University College London: This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0; https://creativecommons.org/licenses/by/4.0). This license permits copying and redistributing the work in any medium or format, as well as adapting the material for any purpose, even commercially. (The following abstract of the original article appeared in record 2023-50323-001.) Theories are among the most important tools of science. Lewin (1943) already noted "There is nothing as practical as a good theory." Although psychologists discussed problems of theory in their discipline for a long time, weak theories are still widespread in most subfields. One possible reason for this is that psychologists lack the tools to systematically assess the quality of their theories. Thagard (1989) developed a computational model for formal theory evaluation based on the concept of explanatory coherence. However, there are possible improvements to Thagard's (1989) model and it is not available in software that psychologists typically use. Therefore, we developed a new implementation of explanatory coherence based on the Ising model. We demonstrate the capabilities of this new Ising model of Explanatory Coherence (IMEC) on several examples from psychology and other sciences. In addition, we implemented it in the R-package IMEC to assist scientists in evaluating the quality of their theories in practice. (PsycInfo Database Record (c) 2024 APA, all rights reserved).