Beyond self-report: Measuring visual, auditory, and tactile mental imagery using a mental comparison task
Finding a reliable and objective measure of individual differences in mental imagery across sensory modalities is difficult, with measures relying on self-report scales or focusing on one modality alone. Based on the idea that mental imagery involves multimodal sensorimotor simulations, a mental comparison task (MCT) was developed across three studies and tested on adults (n = 96, 345, and 448). Analyses examined: (a) the internal consistency of the MCT, (b) whether lexical features of the MCT stimuli (word length and frequency) predicted performance, (c) whether the MCT related to two widely used self-report scales, (d) response latencies and accuracies across the visual, auditory, and tactile modalities, and (e) whether MCT performance was independent of processing speed. The MCT showed evidence of reliability and validity. Responses were fastest and most accurate for the visual modality, followed by the auditory and tactile. However, consistent with the idea that self-report questionnaires index a different aspect of mental imagery, the MCT showed minimal correlations with self-report imagery. Finally, relations between MCT scales remained strong after controlling for processing speed. Findings are discussed in relation to current understanding and measurement of mental imagery.
Parafoveal letter identification in Russian: Confusion matrices based on error rates
In the present study, we introduce parafoveal letter confusion matrices for the Russian language, which uses the Cyrillic script. To ensure that our confusion rates reflect parafoveal processing rather than other effects, we employed an adapted boundary paradigm (Rayner, 1975) that prevented the participants from directly fixating the letter stimuli. Additionally, we assessed confusability under isolated and word-like (crowded) conditions using two modern fonts, since previous research showed that letter recognition depends on crowding and font (Coates, 2015; Pelli et al., 2006). Our additional goal was to gain insight into which letter features or configurational patterns might be essential for letter recognition in Russian; thus, we conducted an exploratory clustering analysis of visual confusion scores to identify groups of similar letters. To support this analysis, we conducted a comprehensive review of over 20 studies that proposed crucial properties of Latin letters relevant to character perception. The summary of this review is valuable not only for our current study but also for future research in the field.
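To make the clustering step concrete, here is a schematic Python sketch of the general approach with toy data; it is not the authors' analysis code, and the letters and confusion rates are invented for illustration.

```python
# Schematic sketch (toy data): turn a letter confusion matrix into a
# dissimilarity matrix and run hierarchical clustering to find groups of
# visually similar letters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

letters = list("абвгде")                         # toy subset of Cyrillic letters
rng = np.random.default_rng(0)
confusion = rng.random((6, 6))
confusion = (confusion + confusion.T) / 2        # symmetric toy confusion rates
np.fill_diagonal(confusion, 1.0)

dissimilarity = 1.0 - confusion                  # more confusable = more similar
np.fill_diagonal(dissimilarity, 0.0)
Z = linkage(squareform(dissimilarity, checks=False), method="average")
groups = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(letters, groups)))
```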
Optimal design of cluster randomized crossover trials with a continuous outcome: Optimal number of time periods and treatment switches under a fixed number of clusters or fixed budget
In the cluster randomized crossover trial, a sequence of treatment conditions, rather than just one treatment condition, is assigned to each cluster. This contribution studies the optimal number of time periods in studies with a treatment switch at the end of each time period, and the optimal number of treatment switches in a trial with a fixed number of time periods. This is done for trials with a fixed number of clusters, and for trials in which the costs per cluster, subject, and treatment switch are taken into account using a budgetary constraint. The focus is on trials with a cross-sectional design where a continuous outcome variable is measured at the end of each time period. An exponential decay correlation structure is used to model dependencies among subjects within the same cluster. A linear multilevel mixed model is used to estimate the treatment effect and its associated variance. The optimal design minimizes this variance. Matrix algebra is used to identify the optimal design and other highly efficient designs. For a fixed number of clusters, a design with the maximum number of time periods is optimal, and treatment switches should occur at each time period. However, when a budgetary constraint is taken into account, the optimal design may have fewer time periods and fewer treatment switches. A Shiny app was developed to facilitate the use of the methodology in this contribution.
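As a rough illustration of the design comparison described above, the following Python sketch computes the variance of the generalized least squares treatment-effect estimator for candidate crossover designs, assuming cluster-period mean summaries, an exponential decay correlation between periods, and an equal number of subjects per period; period effects and the paper's exact model are omitted, and all parameter values are arbitrary.

```python
# Minimal sketch: compare candidate cluster randomized crossover designs by the
# variance of the GLS treatment-effect estimator under exponential decay
# correlation between periods (simplified model, illustrative parameters).
import numpy as np

def treatment_variance(sequences, sigma_cluster=1.0, sigma_subject=4.0,
                       m_per_period=10, decay=0.8):
    """sequences: list of 0/1 arrays, one treatment sequence per cluster."""
    n_periods = len(sequences[0])
    # Covariance of cluster-period means: decaying cluster component + sampling error
    lags = np.abs(np.subtract.outer(np.arange(n_periods), np.arange(n_periods)))
    V = sigma_cluster**2 * decay**lags + np.eye(n_periods) * sigma_subject**2 / m_per_period
    V_inv = np.linalg.inv(V)
    info = np.zeros((2, 2))
    for seq in sequences:
        X = np.column_stack([np.ones(n_periods), np.asarray(seq, float)])
        info += X.T @ V_inv @ X                 # accumulate information over clusters
    return np.linalg.inv(info)[1, 1]            # variance of the treatment effect

# Compare a design that switches every period with one that switches once.
switch_every = [np.array([0, 1, 0, 1])] * 5 + [np.array([1, 0, 1, 0])] * 5
switch_once  = [np.array([0, 0, 1, 1])] * 5 + [np.array([1, 1, 0, 0])] * 5
print(treatment_variance(switch_every), treatment_variance(switch_once))
```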
An adaptive testing item selection strategy via a deep reinforcement learning approach
Computerized adaptive testing (CAT) aims to present items that statistically optimize the assessment process by considering the examinee's responses and estimated trait levels. Recent developments in reinforcement learning and deep neural networks give CAT the potential to select items that use information across all items remaining in the test, rather than focusing only on the next several items to be selected. In this study, we reformulate CAT under the reinforcement learning framework and propose a new item selection strategy based on the deep Q-network (DQN) method. Through simulated and empirical studies, we demonstrate how to monitor the training process to obtain the optimal Q-networks, and we compare the accuracy of the DQN-based item selection strategy with that of five traditional strategies (maximum Fisher information, Fisher information weighted by likelihood, Kullback-Leibler information weighted by likelihood, maximum posterior weighted information, and maximum expected information) on both simulated and real item banks and responses. We further investigate how the sample size and the distribution of the trait levels of the examinees used in training affect DQN performance. The results show that DQN achieves lower RMSE and MAE values than the traditional strategies under simulated and real banks and responses in most conditions. Suggestions for the use of DQN-based strategies are provided, along with their code.
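The reinforcement learning framing can be sketched as follows; the Q-function here is an untrained linear placeholder with invented features, so this only illustrates how a state (current trait estimate and test progress) and candidate items map to a greedy item choice, not the paper's DQN implementation.

```python
# Schematic sketch of RL-style item selection in CAT: state = (trait estimate,
# test progress), actions = remaining items, next item = argmax of a Q-function.
# The Q-function below is a toy placeholder, not a trained deep Q-network.
import numpy as np

rng = np.random.default_rng(0)
item_bank = rng.normal(size=(100, 2))            # toy item features, e.g., (a, b)
W = rng.normal(scale=0.1, size=(4,))             # placeholder Q-function weights

def q_value(state, item_features, w=W):
    """Toy linear Q-function over concatenated state and item features."""
    x = np.concatenate([state, item_features])
    return float(w @ x)

def select_item(theta_hat, n_administered, available):
    state = np.array([theta_hat, n_administered / 30.0])
    scores = [q_value(state, item_bank[i]) for i in available]
    return available[int(np.argmax(scores))]     # greedy action = next item

available = list(range(len(item_bank)))
print(select_item(theta_hat=0.3, n_administered=5, available=available))
```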
Introducing ART: A new method for testing auditory memory with circular reproduction tasks
Theories of visual working memory have seen significant progress through the use of continuous reproduction tasks. However, these tasks have mainly focused on studying visual features, with limited examples existing in the auditory domain. Therefore, it is unknown to what extent newly developed memory models reflect domain-general limitations or are specific to the visual domain. To address this gap, we developed a novel methodology: the Auditory Reproduction Task (ART). This task utilizes Shepard tones, which create an illusion of infinitely rising or falling pitch by dissociating pitch chroma from pitch height, to create a 1-360° auditory circular space. In Experiment 1, we validated the perceptual circularity and uniformity of this auditory stimulus space. In Experiment 2, we demonstrated that auditory working memory shows set size effects similar to those in visual working memory: report error increased at a set size of 2 relative to 1, caused by swap errors. In Experiment 3, we tested the validity of ART by correlating reproduction errors with commonly used auditory and visual working memory tasks. Analyses revealed that ART errors were significantly correlated with performance in both auditory and visual working memory tasks, albeit with a stronger correlation observed with auditory working memory. While these experiments have only scratched the surface of the theoretical and computational constraints on auditory working memory, they provide a valuable proof of concept for ART. Further research with ART has the potential to deepen our understanding of auditory working memory, as well as to explore the extent to which existing models tap into domain-general constraints.
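For readers unfamiliar with circular reproduction measures, here is a minimal sketch of how response error is commonly scored in a 1-360° space; this reflects standard practice for circular tasks, not code taken from the paper.

```python
# Sketch: signed reproduction error in a circular (degree) space, wrapped so
# that the magnitude never exceeds 180 degrees.
def circular_error(response_deg: float, target_deg: float) -> float:
    return (response_deg - target_deg + 180.0) % 360.0 - 180.0

print(circular_error(350.0, 10.0))   # -> -20.0, not 340.0
```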
On aggregation invariance of multinomial processing tree models
Multinomial processing tree (MPT) models are prominent and frequently used tools to model and measure cognitive processes underlying responses in many experimental paradigms. Although MPT models typically refer to cognitive processes within single individuals, they have often been applied to group data aggregated across individuals. We investigate the conditions under which MPT analyses of aggregate data make sense. After introducing the notions of structural and empirical aggregation invariance of MPT models, we show that any MPT model that holds at the level of single individuals must also hold at the aggregate level when it is both structurally and empirically aggregation invariant. Moreover, group-level parameters of aggregation-invariant MPT models are equivalent to the expected values (i.e., means) of the corresponding individual parameters. To investigate the robustness of MPT results for aggregate data when one or both invariance conditions are violated, we additionally performed a series of simulation studies, systematically manipulating (1) the sample sizes in different trees of the model, (2) model parameterization, (3) means and variances of crucial model parameters, and (4) their correlations with other parameters of the respective MPT model. Overall, our results show that MPT parameter estimates based on aggregate data are trustworthy under rather general conditions, provided that a few preconditions are met.
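A toy simulation, written for this summary rather than taken from the paper, illustrates the aggregation question for a simple one-high-threshold model in which detection and guessing probabilities vary across persons.

```python
# Toy illustration of aggregation in a one-high-threshold MPT model:
# P(hit) = d + (1 - d) * g, P(false alarm) = g. Compare parameters estimated
# from pooled group frequencies with the means of the individual parameters.
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_old, n_new = 200, 50, 50
d = rng.beta(6, 4, n_persons)          # individual detection probabilities
g = rng.beta(3, 7, n_persons)          # individual guessing probabilities

hits = rng.binomial(n_old, d + (1 - d) * g)      # old items
fas  = rng.binomial(n_new, g)                    # new items

# Aggregate-level closed-form estimates from pooled frequencies
H = hits.sum() / (n_persons * n_old)
F = fas.sum() / (n_persons * n_new)
g_agg = F
d_agg = (H - F) / (1 - F)

print(f"mean individual d = {d.mean():.3f}, aggregate d = {d_agg:.3f}")
print(f"mean individual g = {g.mean():.3f}, aggregate g = {g_agg:.3f}")
```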
Using the multidimensional nominal response model to model faking in questionnaire data: The importance of item desirability characteristics
Faking in self-report personality questionnaires describes a deliberate response distortion aimed at presenting oneself in an overly favorable manner. Unless the influence of faking on item responses is taken into account, faking can harm multiple psychometric properties of a test. In the present article, we account for faking using an extension of the multidimensional nominal response model (MNRM), which is an item response theory (IRT) model that offers a flexible framework for modeling different kinds of response biases. Particularly, we investigated under which circumstances the MNRM can adequately adjust substantive trait scores and latent correlations for the influence of faking and examined the role of variation in the way item content is related to social desirability (i.e., item desirability characteristics) in facilitating the modeling of faking and counteracting its detrimental effects. Using a simulation, we found that the inclusion of a faking dimension in the model can overall improve the recovery of substantive trait person parameters and latent correlations between substantive traits, especially when the impact of faking in the data is high. Item desirability characteristics moderated the effect of modeling faking and were themselves associated with different levels of parameter recovery. In an empirical demonstration with N = 1070 test-takers, we also showed that the faking modeling approach in combination with different item desirability characteristics can prove successful in empirical questionnaire data. We end the article with a discussion of implications for psychological assessment.
Random forest analysis and lasso regression outperform traditional methods in identifying missing data auxiliary variables when the MAR mechanism is nonlinear (p.s. Stop using Little's MCAR test)
The selection of auxiliary variables is an important first step in appropriately implementing missing data methods such as full information maximum likelihood (FIML) estimation or multiple imputation. However, practical guidelines and statistical tests for selecting useful auxiliary variables are somewhat lacking, leading to potentially biased estimates. We propose the use of random forest analysis and lasso regression as alternative methods to select auxiliary variables, particularly in situations in which the missing data pattern is nonlinear or otherwise complex (i.e., interactive relationships between variables and missingness). Monte Carlo simulations demonstrate the effectiveness of random forest analysis and lasso regression compared to traditional methods (t-tests, Little's MCAR test, logistic regressions), in terms of both selecting auxiliary variables and the performance of said auxiliary variables when incorporated in an analysis with missing data. Both techniques outperformed traditional methods, providing a promising direction for improvement of practical methods for handling missing data in statistical analyses.
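A hedged scikit-learn sketch of the general screening idea (not the authors' exact procedure; the variable names and the missingness mechanism are invented): predict a missingness indicator from candidate auxiliary variables, then inspect random forest importances and lasso coefficients.

```python
# Sketch: screen candidate auxiliary variables by how well they predict a
# missingness indicator, using random forest importances and L1 (lasso)
# logistic regression coefficients. Toy data with a nonlinear MAR mechanism.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
candidates = pd.DataFrame(rng.normal(size=(n, 5)),
                          columns=[f"aux{i}" for i in range(1, 6)])
# Missingness on the analysis variable depends on aux1**2 and aux2 * aux3
p_miss = 1 / (1 + np.exp(-(candidates.aux1**2 + candidates.aux2 * candidates.aux3 - 1)))
miss_indicator = rng.binomial(1, p_miss)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(candidates, miss_indicator)
importance = pd.Series(rf.feature_importances_, index=candidates.columns)

X = StandardScaler().fit_transform(candidates)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
lasso.fit(X, miss_indicator)
coefs = pd.Series(lasso.coef_[0], index=candidates.columns)

print(importance.sort_values(ascending=False))
print(coefs)
```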
Estimating power in complex nonlinear structural equation modeling including moderation effects: The powerNLSEM R-package
The model-implied simulation-based power estimation (MSPE) approach is a new general method for power estimation (Irmer et al., 2024). MSPE was developed especially for power estimation of non-linear structural equation models (SEM), but it also can be applied to linear SEM and manifest models using the R package powerNLSEM. After first providing some information about MSPE and the new adaptive algorithm that automatically selects sample sizes for the best prediction of power using simulation, a tutorial on how to conduct the MSPE for quadratic and interaction SEM (QISEM) using the powerNLSEM package is provided. Power estimation is demonstrated for four methods, latent moderated structural equations (LMS), the unconstrained product indicator (UPI), a simple factor score regression (FSR), and a scale regression (SR) approach to QISEM. In two simulation studies, we highlight the performance of the MSPE for all four methods applied to two QISEM with varying complexity and reliability. Further, we justify the settings of the newly developed adaptive search algorithm via performance evaluations using simulation. Overall, the MSPE using the adaptive approach performs well in terms of bias and Type I error rates.
Establishing the reliability of metrics extracted from long-form recordings using LENA and the ACLEW pipeline
Long-form audio recordings are increasingly used to study individual variation, group differences, and many other topics in theoretical and applied fields of developmental science, particularly for the description of children's language input (typically speech from adults) and children's language output (ranging from babble to sentences). The proprietary LENA software has been available for over a decade, and with it, users have come to rely on derived metrics like adult word count (AWC) and child vocalization counts (CVC), which have also more recently been derived using an open-source alternative, the ACLEW pipeline. Yet, there is relatively little work assessing the reliability of long-form metrics in terms of the stability of individual differences across time. Filling this gap, we analyzed eight spoken-language datasets: four from North American English-learning infants, and one each from British English-, French-, American English-/Spanish-, and Quechua-/Spanish-learning infants. The audio data were analyzed using two types of processing software: LENA and the ACLEW open-source pipeline. When all corpora were included, we found relatively low to moderate reliability (across multiple recordings, the intraclass correlation coefficient attributed to child identity [Child ICC] was below 50% for most metrics). There were few differences between the two pipelines. Exploratory analyses suggested some differences as a function of child age and corpus. These findings suggest that, while reliability is likely sufficient for various group-level analyses, caution is needed when using either LENA or ACLEW tools to study individual variation. We also encourage improvement of extant tools, specifically targeting accurate measurement of individual variation.
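The reliability index at the center of this analysis can be sketched as follows; the file and variable names are hypothetical, and this shows only one standard way of obtaining a Child ICC from a random-intercept model, not the authors' pipeline.

```python
# Sketch: share of variance in a long-form metric (e.g., adult word count)
# attributable to child identity across repeated recordings, estimated from a
# random-intercept (child) model. File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("recordings.csv")          # columns assumed: child_id, awc
model = smf.mixedlm("awc ~ 1", df, groups=df["child_id"]).fit()

between_child_var = float(model.cov_re.iloc[0, 0])   # random-intercept variance
residual_var = model.scale                           # within-child (recording) variance
child_icc = between_child_var / (between_child_var + residual_var)
print(f"Child ICC = {child_icc:.2f}")
```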
Assessing computational reproducibility in Behavior Research Methods
Psychological science has thrived thanks to new methods and innovative practices. Journals, including Behavior Research Methods (BRM), continue to support the dissemination and evaluation of research assets including data, software/hardware, statistical code, and databases of stimuli. However, such research assets rarely allow for computational reproducibility, meaning they are difficult to reuse. Therefore, in this preregistered report, we explore how BRM's authors and BRM structures shape the landscape of functional research assets. Our broad research questions concern: (1) how quickly methods and analytical techniques reported in BRM can be used and developed further by other scientists; (2) whether functionality has improved following changes to BRM journal policy in support of computational reproducibility; (3) whether we can disentangle such policy changes from changes in reproducibility over time. We randomly sampled equal numbers of papers (N = 204) published in BRM before and after the implementation of the policy changes. Pairs of researchers recorded how long it took to ensure assets (data, software/hardware, statistical code, and materials) were fully operational. They also coded the completeness and reusability of the assets. While improvements were observed on all measures, only completeness improved significantly following the policy changes (d = .37). The effects varied between different types of research assets, with datasets from surveys/experiments showing the largest improvements in completeness and reusability. Perhaps more importantly, the policy changes do appear to have extended the life span of research products by reducing natural decline. We conclude with a discussion of how, in the future, research and policy might better support computational reproducibility within and beyond psychological science.
People make mistakes: Obtaining accurate ground truth from continuous annotations of subjective constructs
Accurately representing changes in mental states over time is crucial for understanding their complex dynamics. However, there is little methodological research on the validity and reliability of human-produced continuous-time annotation of these states. We present a psychometric perspective on valid and reliable construct assessment, examine the robustness of interval-scale (e.g., values between zero and one) continuous-time annotation, and identify three major threats to validity and reliability in current approaches. We then propose a novel ground truth generation pipeline that combines emerging techniques for improving validity and robustness. We demonstrate its effectiveness in a case study involving crowd-sourced annotation of perceived violence in movies, where our pipeline achieves a Spearman correlation of .95 for summarized ratings, compared to a baseline of .15. These results suggest that highly accurate ground truth signals can be produced from continuous annotations by using additional comparative annotation (e.g., a versus b) to correct structured errors, highlighting the need for a paradigm shift in robust construct measurement over time.
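A tiny sketch of the evaluation step only, using toy data; the proposed annotation-correction pipeline itself is not reproduced here.

```python
# Sketch: summarize per-item continuous annotations across annotators and
# compare the summary to a reference signal with a Spearman rank correlation.
import numpy as np
from scipy.stats import spearmanr

annotations = np.random.default_rng(0).random((20, 8))    # items x annotators (toy)
summary = annotations.mean(axis=1)                         # simple per-item summary
reference = summary + np.random.default_rng(1).normal(scale=0.1, size=20)

rho, p = spearmanr(summary, reference)
print(f"Spearman rho = {rho:.2f}")
```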
Model-implied simulation-based power estimation for correctly specified and distributionally misspecified models: Applications to nonlinear and linear structural equation models
Closed-form (asymptotic) analytical power estimation is available only for limited classes of models, and it requires correct model specification for most applications. Simulation-based power estimation can be applied in almost all scenarios in which data following the model can be generated and the model estimated. However, a general framework for calculating the required sample sizes for given power rates is still lacking. We propose a new model-implied simulation-based power estimation (MSPE) method for the z-test that makes use of the asymptotic normality of a wide class of estimators, the M-estimators, and we give theoretical justification for the approach. M-estimators include maximum likelihood, least squares, and limited-information estimators, but also estimators used for misspecified models; hence, the new simulation-based power modeling method is widely applicable. The MSPE employs a parametric model to describe the relationship between power and sample size, which can then be used to determine the required sample size for a specified power rate. We highlight its performance in linear and nonlinear structural equation models (SEM) for correctly specified models and models under distributional misspecification. Simulation results suggest that the new power modeling method is unbiased and shows good performance with regard to root mean squared error and Type I error rates for the predicted required sample sizes and predicted power rates, outperforming alternative approaches such as the naïve approach of evaluating a discrete set of sample sizes with linear interpolation of power, or simple logistic regression approaches. The MSPE appears to be a valuable tool for estimating power for models without an (asymptotic) analytical power estimation.
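A simplified stand-in example may help convey the power-modeling idea: a one-sample z-test replaces the SEM applications, rejections are simulated over a grid of sample sizes, a probit model of power as a function of the square root of n is fitted, and the fitted curve is inverted to find the sample size for a target power. This is an assumption-laden sketch, not the authors' implementation.

```python
# Sketch of simulation-based power modeling: simulate rejections across sample
# sizes, fit a parametric power curve power(n) = Phi(b0 + b1 * sqrt(n)), and
# invert it for the sample size that reaches a target power.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(0)
effect, reps = 0.3, 500
ns = np.repeat(np.arange(20, 201, 20), reps)

# Binary rejection indicator for each simulated data set (two-sided z-test)
reject = np.array([
    abs(rng.normal(effect, 1, n).mean() * np.sqrt(n)) > 1.96 for n in ns
]).astype(int)

X = sm.add_constant(np.sqrt(ns))
fit = sm.GLM(reject, X, family=sm.families.Binomial(sm.families.links.Probit())).fit()
b0, b1 = fit.params

target = 0.80
n_required = ((norm.ppf(target) - b0) / b1) ** 2
print(f"estimated n for {target:.0%} power: {np.ceil(n_required):.0f}")
```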
The brief mind wandering three-factor scale (BMW-3)
In recent years, researchers from different fields have become increasingly interested in measuring individual differences in mind wandering as a psychological trait. Although there are several questionnaires that allow for an assessment of people's perceptions of their mind wandering experiences, they either define mind wandering in a very broad sense or do not sufficiently separate different aspects of mind wandering. Here, we introduce the Brief Mind Wandering Three-Factor Scale (BMW-3), a 12-item questionnaire available in German and English. The BMW-3 conceptualizes mind wandering as task-unrelated thought and measures three dimensions of mind wandering: unintentional mind wandering, intentional mind wandering, and meta-awareness of mind wandering. Based on results from 1038 participants (823 German speakers, 215 English speakers), we found support for the proposed three-factor structure of mind wandering and for scalar measurement invariance of the German and English versions. All subscales showed good internal consistencies and moderate to high test-retest correlations and thus provide an effective assessment of individual differences in mind wandering. Moreover, the BMW-3 showed good convergent validity when compared to existing retrospective measures of mind wandering and mindfulness and was related to conscientiousness, emotional stability, and openness as well as self-reported attentional control. Lastly, it predicted the propensity for mind wandering inside and outside the lab (as assessed by in-the-moment experience sampling), the frequency of experiencing depressive symptoms, and the use of functional and dysfunctional emotion regulation strategies. All in all, the BMW-3 provides a brief, reliable, and valid assessment of mind wandering for basic and clinical research.
Diverse Face Images (DFI): Validated for racial representation and eye gaze
Face processing is a central component of human communication and social engagement. The present investigation introduces a set of racially and ethnically inclusive faces created for researchers interested in perceptual and socio-cognitive processes linked to human faces. The Diverse Face Images (DFI) stimulus set includes high-quality still images of female faces that are racially and ethnically representative, provides multiple images of direct and indirect gaze for each model, and controls for low-level perceptual variance between images. The DFI stimuli will support researchers interested in studying face processing throughout the lifespan as well as other questions that require a diversity of faces or gazes. This report includes a detailed description of stimuli development and norming data for each model. Adults completed a questionnaire rating each image in the DFI stimulus set on three major qualities relevant to face processing: (1) strength of race/ethnicity group associations, (2) strength of eye gaze orientation, and (3) strength of emotion expression. These validation data highlight the presence of rater variability within and between individual model images as well as within and between race and ethnicity groups.
Improvement and application of back random response detection: Based on cumulative sum and change point analysis
In educational and psychological assessments, back random response (BRR) is a major type of rapid guessing that produces misfitting item score patterns. Person-fit statistics (PFS) based on the cumulative sum (CUSUM) and change point analysis (CPA) methods from statistical process control (SPC) outperform other PFS in detecting aberrant responses. In this study, we developed new person-fit statistics, based on three algorithms from the CPA procedure and the CUSUM method, for detecting person misfit with dichotomous or polytomous items. Using simulated data, we investigated the effectiveness of the new statistics in detecting test-takers exhibiting BRR.
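One common CUSUM-type formulation from the statistical process control literature can be sketched as follows; this is a generic illustration and not necessarily the exact statistics proposed in the study.

```python
# Sketch of a CUSUM-type person-fit statistic: accumulate standardized item
# score residuals under a 2PL model and track the upper and lower cumulative
# sums across the test (generic formulation, toy data).
import numpy as np

def cusum_person_fit(responses, a, b, theta_hat):
    p = 1 / (1 + np.exp(-a * (theta_hat - b)))        # 2PL success probabilities
    resid = (responses - p) / np.sqrt(p * (1 - p))    # standardized residuals
    c_plus, c_minus, upper, lower = 0.0, 0.0, 0.0, 0.0
    for r in resid / len(resid):
        c_plus = max(0.0, c_plus + r)
        c_minus = min(0.0, c_minus + r)
        upper, lower = max(upper, c_plus), min(lower, c_minus)
    return upper, lower      # compare against simulation-based critical values

a = np.ones(30)
b = np.linspace(-2, 2, 30)
responses = np.random.default_rng(0).binomial(1, 0.5, 30)   # toy response string
print(cusum_person_fit(responses, a, b, theta_hat=0.0))
```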
A comparison of conventional and resampled personal reliability in detecting careless responding
Detecting careless responding in survey data is important to ensure the credibility of study findings. Of several available detection methods, personal reliability (PR) is one of the best-performing indices. Curran (Journal of Experimental Social Psychology, 66, 4-19, 2016) proposed a resampled version of personal reliability (RPR). Compared to the conventional PR, or even-odd consistency, in which just one set of scale halves is used, RPR is based on repeated calculation of PR across several randomly rearranged sets of scale halves. RPR should therefore be less affected than PR by random errors that may occur when a specific set of scale-half pairings is used for the PR calculation. In theory, RPR should outperform PR, but it remains unclear whether it in fact does, and under what conditions the potential gain in detection accuracy is most pronounced. We conducted two studies: a simulation study examined the performance of the conventional PR and RPR in detecting simulated careless responding, and a real-data example study analyzed their performance in detecting human-generated careless responding. In both studies, RPR turned out to be a significantly better careless response indicator than PR. The results also revealed that using 25 resamples for the RPR computation is sufficient to obtain the expected gain in detection accuracy over the conventional PR. We therefore recommend using RPR instead of the conventional PR when screening questionnaire data for careless responding.
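The even-odd and resampled logic can be sketched in Python as follows; the data layout and the per-person correlation of half-scores across scales reflect a common reading of PR and RPR rather than the authors' code, and the 25-resample default mirrors the recommendation above.

```python
# Sketch: personal reliability (PR) as the per-person correlation of scale
# half-scores across several unidimensional scales, and its resampled variant
# (RPR) averaged over many random splits into halves. Toy data layout.
import numpy as np

def personal_reliability(responses, scales, rng=None):
    """responses: persons x items array; scales: list of item-index lists."""
    halves_a, halves_b = [], []
    for items in scales:
        items = np.array(items)
        if rng is not None:
            items = rng.permutation(items)        # random split for RPR
        halves_a.append(responses[:, items[0::2]].mean(axis=1))
        halves_b.append(responses[:, items[1::2]].mean(axis=1))
    A, B = np.column_stack(halves_a), np.column_stack(halves_b)
    # correlation of the two half-score vectors, computed per person
    return np.array([np.corrcoef(A[i], B[i])[0, 1] for i in range(len(A))])

def resampled_pr(responses, scales, n_resamples=25, seed=0):
    rng = np.random.default_rng(seed)
    draws = [personal_reliability(responses, scales, rng) for _ in range(n_resamples)]
    return np.mean(draws, axis=0)

data = np.random.default_rng(1).integers(1, 6, size=(100, 20)).astype(float)
scales = [list(range(i, i + 4)) for i in range(0, 20, 4)]   # five toy scales
print(resampled_pr(data, scales)[:5])
```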
HeLP: The Hebrew Lexicon Project
Lexicon projects (LPs) are large-scale data resources in different languages that present behavioral results from visual word recognition tasks. Analyses using LP data in multiple languages provide evidence regarding cross-linguistic differences as well as similarities in visual word recognition. Here we present the first LP in a Semitic language: the Hebrew Lexicon Project (HeLP). HeLP assembled lexical decision (LD) responses to 10,000 Hebrew words and nonwords, and naming responses to a subset of 5000 Hebrew words. We used the large-scale HeLP data to estimate the impact of general predictors (lexicality, frequency, word length, orthographic neighborhood density) and Hebrew-specific predictors (Semitic structure, presence of clitics, phonological entropy) on visual word recognition performance. Our results revealed the typical effects of lexicality and frequency obtained in many languages, but a more complex impact of word length and neighborhood density. Considering Hebrew-specific characteristics, HeLP data revealed better recognition of words with a Semitic structure than of words that do not conform to it, and a drop in performance for words containing clitics. These effects varied, however, across the LD and naming tasks. Lastly, a significant inhibitory effect of phonological ambiguity was found in both naming and LD. The implications of these findings for understanding reading in a Semitic language are discussed.
A standardized framework to test event-based experiments
The replication crisis in experimental psychology and neuroscience has received much attention recently. This has led to wide acceptance of measures to improve scientific practices, such as preregistration and registered reports. Less effort has been devoted to performing and reporting the results of systematic tests of the functioning of the experimental setup itself. Yet, inaccuracies in the performance of the experimental setup may affect the results of a study, lead to replication failures, and, importantly, impede the ability to integrate results across studies. Prompted by challenges we experienced when deploying studies across six laboratories collecting electroencephalography (EEG)/magnetoencephalography (MEG), functional magnetic resonance imaging (fMRI), and intracranial EEG (iEEG) data, here we describe a framework for both testing and reporting the performance of the experimental setup. In addition, we surveyed 100 researchers to provide a snapshot of current common practices and community standards concerning the testing of experimental setups in published experiments. Most researchers reported testing their experimental setups. Almost none, however, published the tests performed or their results. Tests were diverse, targeting different aspects of the setup. Through simulations, we demonstrate how even slight inaccuracies can impact the final results. We end with a standardized, open-source, step-by-step protocol for testing (visual) event-related experiments, shared via protocols.io. The protocol aims to provide researchers with a benchmark for future replications and insights into research quality, helping to improve the reproducibility of results, accelerate multicenter studies, increase robustness, and enable integration across studies.
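As one example of the kind of test such a protocol covers, the sketch below quantifies the constant delay and trial-to-trial jitter between logged and independently measured event onsets; the onset times are simulated stand-ins for real logs and photodiode measurements, and this is illustrative only, not part of the published protocol.

```python
# Sketch of a basic timing check: compare event onsets logged by the stimulus
# software with onsets measured independently (e.g., via photodiode) to
# estimate the constant offset and the trial-to-trial jitter of the setup.
import numpy as np

rng = np.random.default_rng(0)
logged = np.arange(0.0, 60.0, 0.8)                              # toy logged onsets, in s
measured = logged + 0.012 + rng.normal(0, 0.002, logged.size)   # toy measured onsets

delays = measured - logged
print(f"mean delay  : {delays.mean() * 1000:.1f} ms")
print(f"jitter (SD) : {delays.std(ddof=1) * 1000:.1f} ms")
print(f"worst case  : {np.abs(delays).max() * 1000:.1f} ms")
```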
A Good check on the Bayes factor
Bayes factor hypothesis testing provides a powerful framework for assessing the evidence in favor of competing hypotheses. To obtain Bayes factors, statisticians often require advanced, non-standard tools, making it important to confirm that the methodology is computationally sound. This paper seeks to validate Bayes factor calculations by applying two theorems attributed to Alan Turing and Jack Good. The procedure entails simulating data sets under two hypotheses, calculating Bayes factors, and assessing whether their expected values align with theoretical expectations. We illustrate this method with an ANOVA example and a network psychometrics application, demonstrating its efficacy in detecting calculation errors and confirming the computational correctness of the Bayes factor results. This structured validation approach aims to provide researchers with a tool to enhance the credibility of Bayes factor hypothesis testing, fostering more robust and trustworthy scientific inferences.
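The logic of the check can be illustrated with a self-contained conjugate binomial example (not the paper's ANOVA or network-psychometrics applications): by the Turing/Good result referenced above, the expected value of BF01 (the evidence for H0 over H1) under data simulated from H1's prior predictive distribution equals 1, so the Monte Carlo average of BF01 should be close to 1 if the calculation is correct.

```python
# Sketch: validate a Bayes factor computation by simulating data from H1's
# prior predictive and checking that the average BF01 is close to 1.
# H0: theta = 0.5 (point null); H1: theta ~ Uniform(0, 1); k successes in n trials.
import numpy as np
from scipy.special import comb

rng = np.random.default_rng(0)
n, reps = 20, 100_000

theta = rng.uniform(size=reps)                 # draw parameters from the H1 prior
k = rng.binomial(n, theta)                     # prior predictive data under H1

# Analytic marginal likelihoods: p(k|H0) = C(n,k) * 0.5**n,  p(k|H1) = 1 / (n + 1)
bf01 = comb(n, k) * 0.5**n * (n + 1)
print(f"mean BF01 under H1 = {bf01.mean():.3f}  (theory: 1)")
```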
An improved diagrammatic procedure for interpreting and scoring the Wisconsin Card Sorting Test: An update to Steve Berry's 1996 edition
The Wisconsin Card Sorting Test (WCST) is a popular neuropsychological test that is complicated to score and interpret. In an attempt to make scoring of the WCST simpler, Berry (The Clinical Neuropsychologist, 10, 117-121, 1996) developed a diagrammatic scoring procedure, particularly to aid scoring of perseverative responses. We identified key limitations of Berry's diagram, including its unnecessary ambiguity and complexity, its use of terminology different from that used in the standardized WCST manual, and its lack of distinction between perseverative errors and perseverative responses. Our new diagrammatic scoring procedure scores each response one by one; we strongly suggest that the diagram be used in conjunction with the 1993 WCST manual. Our new diagrammatic scoring procedure aims to assist novice users in learning how to accurately score the task, prevent scoring errors when using the manual version of the task, and help scorers verify whether other existing computerized versions of the task (apart from the PAR version) conform to the Heaton et al. (1993) scoring method. Our diagrammatic scoring procedure holds promise to be incorporated into any future versions of the WCST manual.