Behavior Research Methods

FLexSign: A lexical database in French Sign Language (LSF)
Périn P, Herrera S and Bogliotti C
In psycholinguistics, studies are conducted to understand language processing mechanisms, whether in comprehension or in production, and independently of language modality. To do so, researchers need accurate psycholinguistic information about the linguistic material they use. One main obstacle is the scarcity of such information for sign languages. While lexical databases exist for several sign languages, to the best of our knowledge, psycholinguistic data for French Sign Language (LSF) signs are not yet available. This study presents FLexSign, the first interactive lexical database for LSF, inspired by ASL-Lex (Caselli et al., 2017). The database includes familiarity, concreteness, and iconicity data for 546 LSF signs. These three factors are known to influence the speed or accuracy of lexical processing. Familiarity and concreteness are known to generate a robust facilitative effect on sign processing, while iconicity plays a complex but crucial role in the creation and organization of sign language lexicons. Accurate information on the iconicity of LSF signs would therefore help to better understand the role of this notion in lexical processing. To develop the database, 33 participants were recruited and asked to complete an online questionnaire. The FLexSign database will be of great use to sign language researchers, providing linguistic information that was previously unavailable and offering many opportunities at both the experimental and clinical levels. The database is also open to future contributions.
Interactions between latent variables in count regression models
Kiefer C, Wilker S and Mayer A
In psychology and the social sciences, researchers often model count outcome variables accounting for latent predictors and their interactions. Even though neglecting measurement error in such count regression models (e.g., Poisson or negative binomial regression) can have unfavorable consequences like attenuation bias, such analyses are often carried out in the generalized linear model (GLM) framework using fallible covariates such as sum scores. An alternative is count regression models based on structural equation modeling, which allow researchers to specify latent covariates and thereby account for measurement error. However, the issue of how and when to include interactions between latent covariates or between latent and manifest covariates is rarely discussed for count regression models. In this paper, we present a latent variable count regression model (LV-CRM) allowing for latent covariates as well as interactions among both latent and manifest covariates. We conducted three simulation studies, investigating the estimation accuracy of the LV-CRM and comparing it to GLM-based count regression models. Interestingly, we found that even in scenarios with high reliabilities, the regression coefficients from a GLM-based model can be severely biased. In contrast, even for moderate sample sizes, the LV-CRM provided virtually unbiased regression coefficients. Additionally, statistical inferences yielded mixed results for the GLM-based models (i.e., low coverage rates, but acceptable empirical detection rates), but were generally acceptable using the LV-CRM. We provide an applied example from clinical psychology illustrating how the LV-CRM framework can be used to model count regressions with latent interactions.
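A minimal R sketch of the attenuation bias described above, assuming a single standardized latent predictor measured by a fallible score with reliability .80 (all values illustrative, not the paper's simulation design):

set.seed(1)
n <- 5000
eta <- rnorm(n)                              # latent covariate
x_obs <- eta + rnorm(n, sd = sqrt(0.25))     # fallible score; reliability = 1 / 1.25 = .80
y <- rpois(n, lambda = exp(0.3 + 0.5 * eta)) # true Poisson regression on the latent covariate
coef(glm(y ~ x_obs, family = poisson))       # estimated slope falls well below the true 0.5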
Affordance norms for 2825 concrete nouns
Maxwell NP, Huff MJ, Hajnal A, Namias JM, Blau JJC, Day B, Marsh KL, Meagher BR, Shelley-Tremblay JF, Thomas GF and Wagman JB
Objects are commonly described based on their relations to other objects (e.g., associations, semantic similarity, etc.) or their physical features (e.g., birds have wings, feathers, etc.). However, objects can also be described in terms of their actionable properties (i.e., affordances), which reflect interactive relations between actors and objects. While several normed datasets have been developed to categorize various aspects of meaning (e.g., semantic features, cue-target associations, etc.), to date, norms for affordances have not been generated. We address this limitation by developing a set of affordance norms for 2825 concrete nouns. Using an open-response format, we computed affordance strength (AFS; i.e., the probability of an item eliciting a particular action response), affordance proportion (AFP; i.e., the proportion of participants who provided a specific action response), and affordance set size (AFSS; i.e., the total number of unique action responses) for each item. Because our stimuli overlapped with Pexman et al.'s (Behavior Research Methods, 51, 453-466, 2019) body-object interaction (BOI) norms, we tested whether AFS, AFP, and AFSS were related to BOI, as objects with more perceived action properties may be viewed as being more interactive. Additionally, we tested the relationship between AFS and AFP and two separate measures of relatedness: cosine similarity (Buchanan et al., Behavior Research Methods, 51, 1849-1863, 2019a; Behavior Research Methods, 51, 1878-1888, 2019b) and forward associative strength (Nelson et al., Behavior Research Methods, Instruments, & Computers, 36(3), 402-407, 2004). All analyses, however, revealed weak relationships between affordance measures and existing semantic norms, suggesting that affordance properties reflect a separate construct.
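A toy R sketch of the three measures as they are defined in the abstract; the response data and any aggregation details beyond those definitions are assumptions:

# One row per participant x item x open action response (hypothetical data)
resp <- data.frame(
  participant = c(1, 1, 2, 2, 3, 3, 3),
  item        = c("cup", "cup", "cup", "ball", "ball", "ball", "cup"),
  action      = c("drink", "hold", "drink", "throw", "throw", "kick", "drink")
)
# AFS: probability of an item eliciting a given action among all its responses
afs <- prop.table(table(resp$item, resp$action), margin = 1)
# AFP: proportion of participants giving a specific action for an item
n_part <- length(unique(resp$participant))
afp <- aggregate(participant ~ item + action, resp,
                 function(p) length(unique(p)) / n_part)
# AFSS: number of unique action responses per item
afss <- tapply(resp$action, resp$item, function(a) length(unique(a)))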
A Good check on the Bayes factor
Sekulovski N, Marsman M and Wagenmakers EJ
Bayes factor hypothesis testing provides a powerful framework for assessing the evidence in favor of competing hypotheses. To obtain Bayes factors, statisticians often require advanced, non-standard tools, making it important to confirm that the methodology is computationally sound. This paper seeks to validate Bayes factor calculations by applying two theorems attributed to Alan Turing and Jack Good. The procedure entails simulating data sets under two hypotheses, calculating Bayes factors, and assessing whether their expected values align with theoretical expectations. We illustrate this method with an ANOVA example and a network psychometrics application, demonstrating its efficacy in detecting calculation errors and confirming the computational correctness of the Bayes factor results. This structured validation approach aims to provide researchers with a tool to enhance the credibility of Bayes factor hypothesis testing, fostering more robust and trustworthy scientific inferences.
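A minimal R sketch of the check in a conjugate case where the Bayes factor has a closed form: under H0 (theta = .5), the expected value of BF10 (with H1: theta ~ Beta(1,1)) equals 1, so a simulated mean far from 1 would flag a computational error. The ANOVA and network examples in the paper are more involved:

set.seed(1)
n_trials <- 20
y <- rbinom(1e5, n_trials, 0.5)                             # data simulated under H0
m0 <- dbinom(y, n_trials, 0.5)                              # marginal likelihood under H0
m1 <- choose(n_trials, y) * beta(y + 1, n_trials - y + 1)   # marginal likelihood under H1
bf10 <- m1 / m0
mean(bf10)   # should be close to 1 if the Bayes factor is computed correctly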
Metrics for quantifying co-development at the individual level
Edwards AA and Petscher Y
Previous research on co-development has focused on modeling relations at the group level; however, how individuals differ in co-development may provide important information as well. Recent work has used vector plots to visually explore individual differences in co-development; however, these judgments were based on visual inspection of a vector plot rather than on calculated metrics. Here we propose two metrics that can be used to quantify co-development at the individual level: the co-development change ratio (CCR) and the angle of co-development metric (ACM). CCR provides information about the symmetry of development, examining whether an individual grew at the same pace in one skill relative to peers as in the other skill relative to peers. ACM represents the relative amount and direction of change on each skill. This paper provides a tutorial on how to calculate and interpret these two metrics for quantifying co-development at the individual level.
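The abstract does not give the formulas, so the following R sketch is only a hypothetical reading of the two metrics: CCR as a ratio of peer-standardized change scores, and ACM as the angle of each individual's change vector:

set.seed(1)
t1_skill1 <- rnorm(100); t2_skill1 <- t1_skill1 + rnorm(100, mean = 1)
t1_skill2 <- rnorm(100); t2_skill2 <- t1_skill2 + rnorm(100, mean = 1)
d1 <- scale(t2_skill1 - t1_skill1)   # change in skill 1, standardized relative to peers
d2 <- scale(t2_skill2 - t1_skill2)   # change in skill 2, standardized relative to peers
ccr <- d1 / d2                       # symmetry of development (hypothetical formula)
acm <- atan2(d2, d1) * 180 / pi      # direction and relative amount of change, in degrees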
Improvement and application of back random response detection: Based on cumulative sum and change point analysis
Li Y, Chen Q, Gao Y and Liu T
In educational and psychological assessments, back random response (BRR) is a major type of rapid guessing that produces misfitting item score patterns. Person-fit statistics (PFS) based on the cumulative sum (CUSUM) method and change point analysis (CPA) from statistical process control (SPC) outperform other PFS in detecting aberrant responses. In this study, we developed new person-fit statistics that combine three CPA algorithms with the CUSUM method to detect person misfit with dichotomous or polytomous items. Using simulated data, we investigated the effectiveness of the new statistics in detecting test-takers exhibiting BRR.
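A minimal R sketch of a CUSUM-type person-fit statistic for dichotomous items under a 2PL model; the paper's statistics combine CUSUM with CPA algorithms and differ in detail, so this is illustration only:

p2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
set.seed(1)
n_items <- 60
a <- runif(n_items, 0.8, 2); b <- rnorm(n_items)
u <- rbinom(n_items, 1, p2pl(0, a, b))   # model-consistent responses, theta = 0
u[41:60] <- rbinom(20, 1, 0.5)           # back random responding on the final items
res <- (u - p2pl(0, a, b)) / n_items     # length-scaled residuals
c_lo <- numeric(n_items); prev <- 0
for (t in seq_len(n_items)) { prev <- min(0, prev + res[t]); c_lo[t] <- prev }
min(c_lo)   # a large negative excursion flags the BRR segment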
HeLP: The Hebrew Lexicon project
Stein R, Frost R and Siegelman N
Lexicon projects (LPs) are large-scale data resources in different languages that present behavioral results from visual word recognition tasks. Analyses using LP data in multiple languages provide evidence regarding cross-linguistic differences as well as similarities in visual word recognition. Here we present the first LP in a Semitic language: the Hebrew Lexicon Project (HeLP). HeLP assembled lexical decision (LD) responses to 10,000 Hebrew words and nonwords, and naming responses to a subset of 5000 Hebrew words. We used the large-scale HeLP data to estimate the impact of general predictors (lexicality, frequency, word length, orthographic neighborhood density) and Hebrew-specific predictors (Semitic structure, presence of clitics, phonological entropy) of visual word recognition performance. Our results revealed the typical effects of lexicality and frequency obtained in many languages, but a more complex impact of word length and neighborhood density. Considering Hebrew-specific characteristics, HeLP data revealed better recognition of words with a Semitic structure than of words that do not conform to it, and a drop in performance for words comprising clitics. These effects varied, however, across LD and naming tasks. Lastly, a significant inhibitory effect of phonological ambiguity was found in both naming and LD. The implications of these findings for understanding reading in a Semitic language are discussed.
Estimating power in complex nonlinear structural equation modeling including moderation effects: The powerNLSEM R-package
Irmer JP, Klein AG and Schermelleh-Engel K
The model-implied simulation-based power estimation (MSPE) approach is a new general method for power estimation (Irmer et al., 2024). MSPE was developed especially for power estimation in nonlinear structural equation models (SEM), but it can also be applied to linear SEM and manifest models using the R package powerNLSEM. After providing background on MSPE and on the new adaptive algorithm that automatically selects sample sizes for the best prediction of power via simulation, we present a tutorial on conducting MSPE for quadratic and interaction SEM (QISEM) using the powerNLSEM package. Power estimation is demonstrated for four methods: latent moderated structural equations (LMS), the unconstrained product indicator (UPI) approach, simple factor score regression (FSR), and a scale regression (SR) approach to QISEM. In two simulation studies, we highlight the performance of MSPE for all four methods applied to two QISEM of varying complexity and reliability. Further, we justify the settings of the newly developed adaptive search algorithm via simulation-based performance evaluations. Overall, MSPE using the adaptive approach performs well in terms of bias and Type I error rates.
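For orientation, a brute-force R sketch of the simplest of the four methods, scale regression (SR), evaluated at one candidate sample size; the powerNLSEM package automates this across sample sizes with the adaptive search (population values below are illustrative, not the tutorial's):

set.seed(1)
pow_sr <- function(n, n_rep = 500) {
  mean(replicate(n_rep, {
    xi1 <- rnorm(n); xi2 <- rnorm(n)
    eta <- 0.3 * xi1 + 0.3 * xi2 + 0.2 * xi1 * xi2 + rnorm(n)
    # fallible scale scores, reliability = .75
    x1 <- xi1 + rnorm(n, sd = sqrt(1/3)); x2 <- xi2 + rnorm(n, sd = sqrt(1/3))
    y  <- eta + rnorm(n, sd = sqrt(1/3))
    fit <- lm(y ~ x1 * x2)                                   # SR: regression on scale scores
    summary(fit)$coefficients["x1:x2", "Pr(>|t|)"] < .05     # interaction detected?
  }))
}
pow_sr(400)   # estimated power for the latent interaction at n = 400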
Establishing the reliability of metrics extracted from long-form recordings using LENA and the ACLEW pipeline
Cristia A, Gautheron L, Zhang Z, Schuller B, Scaff C, Rowland C, Räsänen O, Peurey L, Lavechin M, Havard W, Fausey CM, Cychosz M, Bergelson E, Anderson H, Al Futaisi N and Soderstrom M
Long-form audio recordings are increasingly used to study individual variation, group differences, and many other topics in theoretical and applied fields of developmental science, particularly for the description of children's language input (typically speech from adults) and children's language output (ranging from babble to sentences). The proprietary LENA software has been available for over a decade, and with it, users have come to rely on derived metrics like adult word count (AWC) and child vocalization counts (CVC), which have also more recently been derived using an open-source alternative, the ACLEW pipeline. Yet, there is relatively little work assessing the reliability of long-form metrics in terms of the stability of individual differences across time. Filling this gap, we analyzed eight spoken-language datasets: four from North American English-learning infants, and one each from British English-, French-, American English-/Spanish-, and Quechua-/Spanish-learning infants. The audio data were analyzed using two types of processing software: LENA and the ACLEW open-source pipeline. When all corpora were included, we found relatively low to moderate reliability (across multiple recordings, the intraclass correlation coefficient attributed to child identity [Child ICC] was < 50% for most metrics). There were few differences between the two pipelines. Exploratory analyses suggested some differences as a function of child age and corpora. These findings suggest that, while reliability is likely sufficient for various group-level analyses, caution is needed when using either LENA or ACLEW tools to study individual variation. We also encourage improvement of extant tools, specifically targeting accurate measurement of individual variation.
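A minimal R sketch of the reliability metric at stake, using lme4 and simulated per-recording adult word counts (variable names and values are hypothetical): Child ICC is the share of variance in a metric attributable to child identity across repeated recordings.

library(lme4)
set.seed(1)
d <- data.frame(child = factor(rep(1:50, each = 4)),            # 50 children, 4 recordings each
                awc   = 1000 + rep(rnorm(50, 0, 120), each = 4) # stable child-level differences
                        + rnorm(200, 0, 200))                   # recording-to-recording noise
m <- lmer(awc ~ 1 + (1 | child), data = d)
vc <- as.data.frame(VarCorr(m))
vc$vcov[1] / sum(vc$vcov)   # Child ICC: child variance / total variance (~.27 here)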
People make mistakes: Obtaining accurate ground truth from continuous annotations of subjective constructs
Booth BM and Narayanan SS
Accurately representing changes in mental states over time is crucial for understanding their complex dynamics. However, there is little methodological research on the validity and reliability of human-produced continuous-time annotation of these states. We present a psychometric perspective on valid and reliable construct assessment, examine the robustness of interval-scale (e.g., values between zero and one) continuous-time annotation, and identify three major threats to validity and reliability in current approaches. We then propose a novel ground truth generation pipeline that combines emerging techniques for improving validity and robustness. We demonstrate its effectiveness in a case study involving crowd-sourced annotation of perceived violence in movies, where our pipeline achieves a .95 Spearman correlation in summarized ratings compared to a .15 baseline. These results suggest that highly accurate ground truth signals can be produced from continuous annotations using additional comparative annotation (e.g., a versus b) to correct structured errors, highlighting the need for a paradigm shift in robust construct measurement over time.
Assessing computational reproducibility in Behavior Research Methods
Ellis DA, Towse J, Brown O, Cork A, Davidson BI, Devereux S, Hinds J, Ivory M, Nightingale S, Parry DA, Piwek L, Shaw H and Towse AS
Psychological science has thrived thanks to new methods and innovative practices. Journals, including Behavior Research Methods (BRM), continue to support the dissemination and evaluation of research assets including data, software/hardware, statistical code, and databases of stimuli. However, such research assets rarely allow for computational reproducibility, meaning they are difficult to reuse. Therefore, in this preregistered report, we explore how BRM's authors and BRM structures shape the landscape of functional research assets. Our broad research questions concern: (1) how quickly methods and analytical techniques reported in BRM can be used and developed further by other scientists; (2) whether functionality has improved following changes to BRM journal policy in support of computational reproducibility; (3) whether we can disentangle such policy changes from changes in reproducibility over time. We randomly sampled equal numbers of papers (N = 204) published in BRM before and after the implementation of policy changes. Pairs of researchers recorded how long it took to ensure assets (data, software/hardware, statistical code, and materials) were fully operational. They also coded the completeness and reusability of the assets. While improvements were observed in all measures, only completeness improved significantly following the policy changes (d = .37). The effects varied between different types of research assets, with data sets from surveys/experiments showing the largest improvements in completeness and reusability. Perhaps more importantly, changes to policy do appear to have improved the life span of research products by reducing natural decline. We conclude with a discussion of how, in the future, research and policy might better support computational reproducibility within and beyond psychological science.
Model-implied simulation-based power estimation for correctly specified and distributionally misspecified models: Applications to nonlinear and linear structural equation models
Irmer JP, Klein AG and Schermelleh-Engel K
Closed-form (asymptotic) analytical power estimation is available only for limited classes of models and requires correct model specification for most applications. Simulation-based power estimation can be applied in almost all scenarios in which data can be generated under the model. However, a general framework for calculating the required sample sizes for given power rates is still lacking. We propose a new model-implied simulation-based power estimation (MSPE) method for the z-test that makes use of the asymptotic normality of a wide class of estimators, the M-estimators, and give theoretical justification for the approach. M-estimators include maximum likelihood, least squares, and limited-information estimators, but also estimators used for misspecified models; hence, the new simulation-based power modeling method is widely applicable. The MSPE employs a parametric model to describe the relationship between power and sample size, which can then be used to determine the required sample size for a specified power rate. We highlight its performance in linear and nonlinear structural equation models (SEM) for correctly specified models and models under distributional misspecification. Simulation results suggest that the new power modeling method is unbiased and shows good performance with regard to root mean squared error and Type I error rates for the predicted required sample sizes and predicted power rates, outperforming alternative approaches such as the naïve approach of simulating power at a discrete selection of sample sizes with linear interpolation, or simple logistic regression approaches. The MSPE appears to be a valuable tool to estimate power for models without an (asymptotic) analytical power estimation.
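A bare-bones R sketch of the MSPE logic for a z-test of a regression slope: simulate rejection decisions at a few sample sizes, fit a parametric (here probit) model of power as a function of sqrt(n), motivated by asymptotic normality, and invert it for the required n (all settings illustrative; the paper's power model is more refined):

set.seed(1)
sim_sig <- function(n) {                       # one simulated z-test of a slope
  x <- rnorm(n); y <- 0.2 * x + rnorm(n)
  abs(summary(lm(y ~ x))$coefficients["x", "t value"]) > 1.96
}
ns  <- rep(c(50, 100, 200, 400), each = 200)
sig <- vapply(ns, sim_sig, logical(1))
m <- glm(sig ~ sqrt(ns), family = binomial(link = "probit"))   # power vs. sample size model
b <- coef(m)
n_req <- ((qnorm(0.80) - b[1]) / b[2])^2       # invert the model for 80% power
n_req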
On aggregation invariance of multinomial processing tree models
Erdfelder E, Quevedo Pütter J and Schnuerch M
Multinomial processing tree (MPT) models are prominent and frequently used tools to model and measure cognitive processes underlying responses in many experimental paradigms. Although MPT models typically refer to cognitive processes within single individuals, they have often been applied to group data aggregated across individuals. We investigate the conditions under which MPT analyses of aggregate data make sense. After introducing the notions of structural and empirical aggregation invariance of MPT models, we show that any MPT model that holds at the level of single individuals must also hold at the aggregate level when it is both structurally and empirically aggregation invariant. Moreover, group-level parameters of aggregation-invariant MPT models are equivalent to the expected values (i.e., means) of the corresponding individual parameters. To investigate the robustness of MPT results for aggregate data when one or both invariance conditions are violated, we additionally performed a series of simulation studies, systematically manipulating (1) the sample sizes in different trees of the model, (2) model parameterization, (3) means and variances of crucial model parameters, and (4) their correlations with other parameters of the respective MPT model. Overall, our results show that MPT parameter estimates based on aggregate data are trustworthy under rather general conditions, provided that a few preconditions are met.
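An R sketch of the aggregation logic in one concrete case, the one-high-threshold model (hit = D + (1 - D)g, false alarm = g; parameter values illustrative): if guessing g is constant across persons while detection D varies, an aggregation-invariant scenario under this parameterization, the aggregate estimate recovers the mean of the individual D parameters:

set.seed(1)
n_pers <- 1000; n_old <- 50; n_new <- 50
D <- rbeta(n_pers, 4, 2)                       # detection varies across persons
g <- 0.3                                       # guessing constant across persons
hits <- rbinom(n_pers, n_old, D + (1 - D) * g)
fas  <- rbinom(n_pers, n_new, g)
H <- sum(hits) / (n_pers * n_old); FA <- sum(fas) / (n_pers * n_new)
c(aggregate_D = (H - FA) / (1 - FA), mean_individual_D = mean(D))   # nearly identical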
An improved diagrammatic procedure for interpreting and scoring the Wisconsin Card Sorting Test: An update to Steve Berry's 1996 edition
Howlett CA and Moseley GL
The Wisconsin Card Sorting Test (WCST) is a popular neuropsychological test that is complicated to score and interpret. In an attempt to make scoring of the WCST simpler, Berry (The Clinical Neuropsychologist, 10, 117-121, 1996) developed a diagrammatic scoring procedure, particularly to aid scoring of perseverative responses. We identified key limitations of Berry's diagram, including its unnecessary ambiguity and complexity, use of terminology different from that used in the standardized WCST manual, and lack of distinction between perseverative errors and perseverative responses. Our new diagrammatic scoring procedure scores each response one by one; we strongly suggest that the diagram be used in conjunction with the 1993 WCST manual. Our new diagrammatic scoring procedure aims to assist novice users in learning how to accurately score the task, prevent scoring errors when using the manual version of the task, and help scorers verify whether other existing computerized versions of the task (apart from the PAR version) conform to the Heaton et al. (1993) scoring method. Our diagrammatic scoring procedure holds promise for incorporation into any future versions of the WCST manual.
Parafoveal letter identification in Russian: Confusion matrices based on error rates
Alexeeva S
In the present study, we introduce parafoveal letter confusion matrices for the Russian language, which uses the Cyrillic script. To ensure that our confusion rates reflect parafoveal processing and no other effects, we employed an adapted boundary paradigm (Rayner, 1975) that prevented the participants from directly fixating the letter stimuli. Additionally, we assessed confusability under isolated and word-like (crowded) conditions using two modern fonts, since previous research showed that letter recognition depended on crowding and font (Coates, 2015; Pelli et al., 2006). Our additional goal was to gain insight into what letter features or configurational patterns might be essential for letter recognition in Russian; thus, we conducted exploratory clustering analysis on visual confusion scores to identify groups of similar letters. To support this analysis, we conducted a comprehensive review of over 20 studies that proposed crucial properties of Latin letters relevant to character perception. The summary of this review is valuable not only for our current study but also for future research in the field.
Using the multidimensional nominal response model to model faking in questionnaire data: The importance of item desirability characteristics
Seitz T, Wetzel E, Hilbig BE and Meiser T
Faking in self-report personality questionnaires describes a deliberate response distortion aimed at presenting oneself in an overly favorable manner. Unless the influence of faking on item responses is taken into account, faking can harm multiple psychometric properties of a test. In the present article, we account for faking using an extension of the multidimensional nominal response model (MNRM), an item response theory (IRT) model that offers a flexible framework for modeling different kinds of response biases. In particular, we investigated under which circumstances the MNRM can adequately adjust substantive trait scores and latent correlations for the influence of faking, and examined the role of variation in the way item content is related to social desirability (i.e., item desirability characteristics) in facilitating the modeling of faking and counteracting its detrimental effects. Using a simulation, we found that the inclusion of a faking dimension in the model can overall improve the recovery of substantive trait person parameters and latent correlations between substantive traits, especially when the impact of faking in the data is high. Item desirability characteristics moderated the effect of modeling faking and were themselves associated with different levels of parameter recovery. In an empirical demonstration with N = 1070 test-takers, we also showed that the faking modeling approach in combination with different item desirability characteristics can prove successful in empirical questionnaire data. We end the article with a discussion of implications for psychological assessment.
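A minimal R sketch of nominal-response category probabilities extended with a faking dimension; the loadings below are hypothetical stand-ins for item desirability characteristics, and the paper's MNRM extension is more elaborate:

mnrm_probs <- function(theta, faking, a_trait, a_fake, c) {
  z <- a_trait * theta + a_fake * faking + c   # one linear predictor per category
  exp(z) / sum(exp(z))                         # softmax over response categories
}
# five categories; faking loadings rise toward the desirable end of the scale
mnrm_probs(theta = 0.5, faking = 1,
           a_trait = c(-2, -1, 0, 1, 2),
           a_fake  = c(-1, -0.5, 0, 0.5, 1),
           c       = rep(0, 5))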
The brief mind wandering three-factor scale (BMW-3)
Schubert AL, Frischkorn GT, Sadus K, Welhaf MS, Kane MJ and Rummel J
In recent years, researchers from different fields have become increasingly interested in measuring individual differences in mind wandering as a psychological trait. Although there are several questionnaires that allow for an assessment of people's perceptions of their mind wandering experiences, they either define mind wandering in a very broad sense or do not sufficiently separate different aspects of mind wandering. Here, we introduce the Brief Mind Wandering Three-Factor Scale (BMW-3), a 12-item questionnaire available in German and English. The BMW-3 conceptualizes mind wandering as task-unrelated thought and measures three dimensions of mind wandering: unintentional mind wandering, intentional mind wandering, and meta-awareness of mind wandering. Based on results from 1038 participants (823 German speakers, 215 English speakers), we found support for the proposed three-factor structure of mind wandering and for scalar measurement invariance of the German and English versions. All subscales showed good internal consistencies and moderate to high test-retest correlations and thus provide an effective assessment of individual differences in mind wandering. Moreover, the BMW-3 showed good convergent validity when compared to existing retrospective measures of mind wandering and mindfulness and was related to conscientiousness, emotional stability, and openness as well as self-reported attentional control. Lastly, it predicted the propensity for mind wandering inside and outside the lab (as assessed by in-the-moment experience sampling), the frequency of experiencing depressive symptoms, and the use of functional and dysfunctional emotion regulation strategies. All in all, the BMW-3 provides a brief, reliable, and valid assessment of mind wandering for basic and clinical research.
Diverse Face Images (DFI): Validated for racial representation and eye gaze
Pickron CB, Brown AJ, Hudac CM and Scott LS
Face processing is a central component of human communication and social engagement. The present investigation introduces a set of racially and ethnically inclusive faces created for researchers interested in perceptual and socio-cognitive processes linked to human faces. The Diverse Face Images (DFI) stimulus set includes high-quality still images of female faces that are racially and ethnically representative, provides multiple images of direct and indirect gaze for each model, and controls for low-level perceptual variance between images. The DFI stimuli will support researchers interested in studying face processing throughout the lifespan as well as other questions that require a diversity of faces or gazes. This report includes a detailed description of stimuli development and norming data for each model. Adults completed a questionnaire rating each image in the DFI stimuli set on three major qualities relevant to face processing: (1) strength of race/ethnicity group associations, (2) strength of eye gaze orientation, and (3) strength of emotion expression. These validation data highlight the presence of rater variability within and between individual model images as well as within and between race and ethnicity groups.
Random forest analysis and lasso regression outperform traditional methods in identifying missing data auxiliary variables when the MAR mechanism is nonlinear (p.s. Stop using Little's MCAR test)
Hayes T, Baraldi AN and Coxe S
The selection of auxiliary variables is an important first step in appropriately implementing missing data methods such as full information maximum likelihood (FIML) estimation or multiple imputation. However, practical guidelines and statistical tests for selecting useful auxiliary variables are somewhat lacking, leading to potentially biased estimates. We propose the use of random forest analysis and lasso regression as alternative methods to select auxiliary variables, particularly in situations in which the missing data pattern is nonlinear or otherwise complex (i.e., interactive relationships between variables and missingness). Monte Carlo simulations demonstrate the effectiveness of random forest analysis and lasso regression compared to traditional methods (t-tests, Little's MCAR test, logistic regressions), in terms of both selecting auxiliary variables and the performance of said auxiliary variables when incorporated in an analysis with missing data. Both techniques outperformed traditional methods, providing a promising direction for improvement of practical methods for handling missing data in statistical analyses.
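A compact R sketch of the two proposed selection tools, assuming the randomForest and glmnet packages and a simulated nonlinear, interactive MAR mechanism; the lasso is given pairwise products here, and the purely quadratic signal (a1^2) illustrates the kind of term a linear-terms-only lasso would miss:

library(randomForest); library(glmnet)
set.seed(1)
n <- 1000
d <- as.data.frame(matrix(rnorm(n * 10), n, 10,
                          dimnames = list(NULL, paste0("a", 1:10))))
p_miss <- plogis(-1 + d$a1^2 + d$a2 * d$a3)   # nonlinear MAR mechanism
miss <- factor(rbinom(n, 1, p_miss))          # missingness indicator
rf <- randomForest(x = d, y = miss)           # forest captures nonlinearity directly
importance(rf)                                # a1-a3 should rank highest
X <- model.matrix(~ .^2 - 1, d)               # lasso needs candidate terms spelled out
cv <- cv.glmnet(X, miss, family = "binomial")
coef(cv, s = "lambda.min")                    # nonzero rows mark selected auxiliary terms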
A comparison of conventional and resampled personal reliability in detecting careless responding
Goldammer P, Stöckli PL, Annen H and Schmitz-Wilhelmy A
Detecting careless responding in survey data is important to ensure the credibility of study findings. Of several available detection methods, personal reliability (PR) is one of the best-performing indices. Curran (Journal of Experimental Social Psychology, 66, 4-19, 2016) proposed a resampled version of personal reliability (RPR). Compared to the conventional PR or even-odd consistency, in which just one set of scale halves is used, RPR is based on repeated calculation of PR across several randomly rearranged sets of scale halves. RPR should therefore be less affected than PR by random errors that may occur when a specific set of scale half pairings is used for the PR calculation. In theory, RPR should outperform PR, but it remains unclear whether it in fact does, and under what conditions the potential gain in detection accuracy is the most pronounced. We conducted two studies: a simulation study examined the performance of the conventional PR and RPR in detecting simulated careless responding, and a real data example study analyzed their performance when detecting human-generated careless responding. In both studies, RPR turned out to be a significantly better careless response indicator than PR. The results also revealed that using 25 resamples for the RPR computation is sufficient to obtain the expected gain in detection accuracy over the conventional PR. We therefore recommend using RPR instead of the conventional PR when screening questionnaire data for careless responding.
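A minimal R sketch of RPR, assuming every scale has the same (even) number of items and, for simplicity, applying the same random split indices to every scale within a resample:

set.seed(1)
n_pers <- 200; n_scales <- 10; n_items <- 8
true <- matrix(rnorm(n_pers * n_scales), n_pers, n_scales)      # person x scale true scores
items <- array(rep(true, n_items) + rnorm(n_pers * n_scales * n_items),
               dim = c(n_pers, n_scales, n_items))              # attentive responding
rpr <- function(items, n_resamples = 25) {
  n_it <- dim(items)[3]
  out <- replicate(n_resamples, {
    half <- sample(n_it) <= n_it / 2                            # random split into halves
    h1 <- apply(items[, , half,  drop = FALSE], 1:2, mean)
    h2 <- apply(items[, , !half, drop = FALSE], 1:2, mean)
    sapply(seq_len(nrow(h1)), function(p) cor(h1[p, ], h2[p, ]))  # PR per person
  })
  rowMeans(out)                                                 # average PR across resamples
}
summary(rpr(items))   # low values would flag careless responding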
Introducing ART: A new method for testing auditory memory with circular reproduction tasks
Karabay A, Nijenkamp R, Sarampalis A and Fougnie D
Theories of visual working memory have seen significant progress through the use of continuous reproduction tasks. However, these tasks have mainly focused on studying visual features, with limited examples existing in the auditory domain. Therefore, it is unknown to what extent newly developed memory models reflect domain-general limitations or are specific to the visual domain. To address this gap, we developed a novel methodology: the Auditory Reproduction Task (ART). This task utilizes Shepard tones, which create an infinite rising or falling tone illusion by dissociating pitch chroma from pitch height, to create a 1-360° auditory circular space. In Experiment 1, we validated the perceptual circularity and uniformity of this auditory stimulus space. In Experiment 2, we demonstrated that auditory working memory shows similar set size effects to visual working memory: report error increased at a set size of 2 relative to 1, caused by swap errors. In Experiment 3, we tested the validity of ART by correlating reproduction errors with commonly used auditory and visual working memory tasks. Analyses revealed that ART errors were significantly correlated with performance in both auditory and visual working memory tasks, albeit with a stronger correlation observed with auditory working memory. While these experiments have only scratched the surface of the theoretical and computational constraints on auditory working memory, they provide a valuable proof of concept for ART. Further research with ART has the potential to deepen our understanding of auditory working memory, as well as to explore the extent to which existing models are tapping into domain-general constraints.
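An R sketch of the standard Shepard-tone construction underlying such a circular space: octave-spaced sinusoids under a fixed Gaussian amplitude envelope over log-frequency, so pitch chroma varies while pitch height stays ambiguous (parameter values illustrative, not those of ART):

shepard <- function(chroma_deg, sr = 44100, dur = 0.5,
                    f_base = 27.5, n_oct = 9, center = 5, width = 2) {
  t <- seq(0, dur, length.out = sr * dur)
  tone <- 0
  for (k in 0:(n_oct - 1)) {
    pos <- k + chroma_deg / 360                      # position on log-frequency axis
    amp <- exp(-(pos - center)^2 / (2 * width^2))    # fixed Gaussian envelope
    tone <- tone + amp * sin(2 * pi * f_base * 2^pos * t)
  }
  tone / max(abs(tone))                              # normalized waveform
}
wave <- shepard(90)   # one point on the 1-360 degree chroma circle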
A standardized framework to test event-based experiments
Lepauvre A, Hirschhorn R, Bendtz K, Mudrik L and Melloni L
The replication crisis in experimental psychology and neuroscience has received much attention recently. This has led to wide acceptance of measures to improve scientific practices, such as preregistration and registered reports. Less effort has been devoted to performing and reporting the results of systematic tests of the functioning of the experimental setup itself. Yet, inaccuracies in the performance of the experimental setup may affect the results of a study, lead to replication failures, and importantly, impede the ability to integrate results across studies. Prompted by challenges we experienced when deploying studies across six laboratories collecting electroencephalography (EEG)/magnetoencephalography (MEG), functional magnetic resonance imaging (fMRI), and intracranial EEG (iEEG), here we describe a framework for both testing and reporting the performance of the experimental setup. In addition, 100 researchers were surveyed to provide a snapshot of current common practices and community standards concerning the testing of experimental setups in published studies. Most researchers reported testing their experimental setups. Almost none, however, published the tests performed or their results. Tests were diverse, targeting different aspects of the setup. Through simulations, we demonstrate how even slight inaccuracies can impact the final results. We end with a standardized, open-source, step-by-step protocol for testing (visual) event-related experiments, shared via protocols.io. The protocol aims to provide researchers with a benchmark for future replications and insights into the research quality to help improve the reproducibility of results, accelerate multicenter studies, increase robustness, and enable integration across studies.
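A tiny R illustration of the kind of setup inaccuracy at stake (not the paper's simulations): trigger jitter with a 30-ms standard deviation visibly attenuates the peak of an averaged event-related response:

set.seed(1)
sr <- 1000; t <- seq(-0.1, 0.5, by = 1 / sr)
erp <- function(shift) exp(-((t - 0.17 - shift)^2) / (2 * 0.02^2))  # idealized component
no_jitter   <- rowMeans(sapply(rep(0, 100), erp))                   # perfectly timed triggers
with_jitter <- rowMeans(sapply(rnorm(100, 0, 0.03), erp))           # 30-ms SD trigger jitter
c(peak_no_jitter = max(no_jitter), peak_with_jitter = max(with_jitter))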
The role of individual differences in emotional word recognition: Insights from a large-scale lexical decision study
Haro J, Hinojosa JA and Ferré P
This work presents a large lexical decision mega-study in Spanish, with 918 participants and 7500 words, focusing on emotional content and individual differences. The main objective was to investigate how emotional valence and arousal influence word recognition, controlling for a large number of confounding variables. In addition, as a unique contribution, the study examined the modulation of these effects by individual differences. Results indicated a significant effect of valence and arousal on lexical decision times, with an interaction between these variables. A linear effect of valence was observed, with slower recognition times for negative words and faster recognition times for positive words. In addition, arousal showed opposite effects in positive and negative words. Importantly, the effect of emotional variables was affected by personality traits (extroversion, conscientiousness and openness to experience), age and gender, challenging the 'one-size-fits-all' interpretation of emotional word processing. All data collected in the study are available to the research community: https://osf.io/cbtqy. This includes data from each participant (RTs, errors and individual differences scores), as well as values of concreteness (n = 1690), familiarity (n = 1693) and age of acquisition (n = 2171) of the words collected exclusively for this study. This is a useful resource for researchers interested not only in emotional word processing, but also in lexical processing in general and the influence of individual differences.
Optimal design of cluster randomized crossover trials with a continuous outcome: Optimal number of time periods and treatment switches under a fixed number of clusters or fixed budget
Moerbeek M
In the cluster randomized crossover trial, a sequence of treatment conditions, rather than just one treatment condition, is assigned to each cluster. This contribution studies the optimal number of time periods in studies with a treatment switch at the end of each time period, and the optimal number of treatment switches in a trial with a fixed number of time periods. This is done for trials with a fixed number of clusters, and for trials in which the costs per cluster, subject, and treatment switch are taken into account using a budgetary constraint. The focus is on trials with a cross-sectional design where a continuous outcome variable is measured at the end of each time period. An exponential decay correlation structure is used to model dependencies among subjects within the same cluster. A linear multilevel mixed model is used to estimate the treatment effect and its associated variance. The optimal design minimizes this variance. Matrix algebra is used to identify the optimal design and other highly efficient designs. For a fixed number of clusters, a design with the maximum number of time periods is optimal and treatment switches should occur at each time period. However, when a budgetary constraint is taken into account, the optimal design may have fewer time periods and fewer treatment switches. A Shiny app was developed to facilitate the use of the methodology in this contribution.
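A matrix-algebra R sketch in the spirit of the abstract, comparing two four-period designs by the GLS variance of the treatment effect under exponential-decay correlation between cluster-period means (all values illustrative; the paper's model also includes subject-level variance and cost terms):

var_trt <- function(seqs, rho = 0.8, var_cp = 1) {
  Tn <- length(seqs[[1]])
  V <- var_cp * rho^abs(outer(1:Tn, 1:Tn, "-"))   # exponential decay correlation
  Xs <- lapply(seqs, function(x) cbind(1, x))     # intercept + treatment indicator
  info <- Reduce(`+`, lapply(Xs, function(X) t(X) %*% solve(V) %*% X))
  solve(info)[2, 2]                               # variance of the treatment effect
}
# two clusters each: switching every period vs. a single switch at midpoint
every  <- list(c(0, 1, 0, 1), c(1, 0, 1, 0))
single <- list(c(0, 0, 1, 1), c(1, 1, 0, 0))
c(switch_every_period = var_trt(every), single_switch = var_trt(single))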
Validating a brief measure of four facets of social evaluation
Koch A, Smith A, Fiske ST, Abele AE, Ellemers N and Yzerbyt V
Five studies (N = 7972) validated a brief measure and model of four facets of social evaluation (friendliness and morality as horizontal facets; ability and assertiveness as vertical facets). Perceivers expressed their personal impressions or estimated society's impression of different types of targets (i.e., envisioned or encountered groups or individuals) and numbers of targets (i.e., between six and 100) in the separate, items-within-target mode or the joint, targets-within-item mode. Factor analyses confirmed that a two-items-per-facet measure fit the data well and better than a four-items-per-dimension measure that captured the Big Two model (i.e., no facets, just the horizontal and vertical dimensions). As predicted, the correlation between the two horizontal facets and between the two vertical facets was higher than the correlations between any horizontal facet and any vertical facet. Perceivers' evaluations of targets on each facet were predictors of unique and relevant behavior intentions. Perceiving a target as more friendly, moral, able, and assertive increased the likelihood of relying on the target's loyalty, fairness, intellect, and hubris in an economic game, respectively. These results establish the external, internal, convergent, discriminant, and predictive validity of the brief measure and model of four facets of social evaluation.
An adaptive testing item selection strategy via a deep reinforcement learning approach
Wang P, Liu H and Xu M
Computerized adaptive testing (CAT) aims to present items that statistically optimize the assessment process by considering the examinee's responses and estimated trait levels. Recent developments in reinforcement learning and deep neural networks give CAT the potential to select items using information across all items remaining in the test, rather than just focusing on the next several items to be selected. In this study, we reformulate CAT under the reinforcement learning framework and propose a new item selection strategy based on the deep Q-network (DQN) method. Through simulated and empirical studies, we demonstrate how to monitor the training process to obtain the optimal Q-networks, and we compare the accuracy of the DQN-based item selection strategy with that of five traditional strategies (maximum Fisher information, Fisher information weighted by likelihood, Kullback-Leibler information weighted by likelihood, maximum posterior weighted information, and maximum expected information) on both simulated and real item banks and responses. We further investigate how the sample size and the distribution of examinees' trait levels used in training affect DQN performance. The results show that DQN achieves lower RMSE and MAE values than the traditional strategies under simulated and real banks and responses in most conditions. Suggestions for the use of DQN-based strategies are provided, along with their code.
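For context, an R sketch of the first traditional baseline, maximum Fisher information selection for a 2PL bank; the paper's DQN strategy replaces this myopic scoring rule with a learned Q-network (bank values are simulated):

fisher_info <- function(theta, a, b) {
  p <- 1 / (1 + exp(-a * (theta - b)))
  a^2 * p * (1 - p)                     # 2PL item information at theta
}
set.seed(1)
bank <- data.frame(a = runif(300, 0.5, 2.5), b = rnorm(300))
administered <- c(12, 45)               # items already given
theta_hat <- 0.4                        # current trait estimate
info <- fisher_info(theta_hat, bank$a, bank$b)
info[administered] <- -Inf              # exclude used items
which.max(info)                         # next item under the MFI rule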
Person explanatory multidimensional item response theory with the instrument package in R
Kleinsasser MJ, Mistry R, Hsieh HF, McCarthy WJ and Raghunathan T
We present the new R package instrument for Bayesian estimation of person explanatory multidimensional item response theory models. The package implements an exploratory multidimensional item response theory model and a higher-order multidimensional item response theory model, a type of confirmatory multidimensional item response theory. Explanation of person parameters is accomplished by fixed and random effect linear regression models. Estimation is carried out using Hamiltonian Monte Carlo in Stan. In this article, we provide a detailed description of the models; we use the instrument package to demonstrate fitting explanatory item response models with fixed and random effects (i.e., mixed modeling) of person parameters in R; and we perform a simulation study to evaluate the performance of our implementation of the models.
Beyond self-report: Measuring visual, auditory, and tactile mental imagery using a mental comparison task
Suggate SP
Finding a reliable and objective measure of individual differences in mental imagery across sensory modalities is difficult, with measures relying on self-report scales or focusing on one modality alone. Based on the idea that mental imagery involves multimodal sensorimotor simulations, a mental comparison task (MCT) was developed across three studies and tested on adults (n = 96, 345, and 448). Analyses examined: (a) the internal consistency of the MCT, (b) whether lexical features of the MCT stimuli (word length and frequency) predicted performance, (c) whether the MCT related to two widely used self-report scales, (d) response latencies and accuracies across the visual, auditory, and tactile modalities, and (e) whether MCT performance was independent of processing speed. The MCT showed evidence of reliability and validity. Responses were fastest and most accurate for the visual modality, followed by the auditory and tactile. However, consistent with the idea that self-report questionnaires index a different aspect of mental imagery, the MCT showed minimal correlations with self-report imagery. Finally, relations between MCT scales remained strong after controlling for processing speed. Findings are discussed in relation to current understanding and measurement of mental imagery.
Linking essay-writing tests using many-facet models and neural automated essay scoring
Uto M and Aramaki K
For essay-writing tests, challenges arise when scores assigned to essays are influenced by the characteristics of raters, such as rater severity and consistency. Item response theory (IRT) models incorporating rater parameters have been developed to tackle this issue, exemplified by the many-facet Rasch models. These IRT models enable the estimation of examinees' abilities while accounting for the impact of rater characteristics, thereby enhancing the accuracy of ability measurement. However, difficulties can arise when different groups of examinees are evaluated by different sets of raters. In such cases, test linking is essential for unifying the scale of model parameters estimated for individual examinee-rater groups. Traditional test-linking methods typically require administrators to design groups in which either examinees or raters are partially shared. However, this is often impractical in real-world testing scenarios. To address this, we introduce a novel method for linking the parameters of IRT models with rater parameters that uses neural automated essay scoring technology. Our experimental results indicate that our method successfully accomplishes test linking with accuracy comparable to that of linear linking using few common examinees.
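For reference, the dichotomous many-facet Rasch model that motivates the rater parameters, written as a one-line R function: the log-odds of a positive score is examinee ability minus item difficulty minus rater severity (essay scoring itself is polytomous; this is the simplest case):

mfrm_p <- function(theta, b_item, severity) {
  plogis(theta - b_item - severity)   # P(score = 1 | examinee, item, rater)
}
mfrm_p(theta = 1.0, b_item = 0.2, severity = c(lenient = -0.5, severe = 0.5))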
Establishing construct validity for dynamic measures of behavior using naturalistic study designs
French RC, Kennedy DP and Krendl AC
There has been a recent surge of naturalistic methodology to assess complex topics in psychology and neuroscience. Such methods are lauded for their increased ecological validity, aiming to bridge a gap between highly controlled experimental design and purely observational studies. However, these measures present challenges in establishing construct validity. One domain in which this has emerged is research on theory of mind: the ability to infer others' thoughts and emotions. Traditional measures utilize rigid methodology which suffer from ceiling effects and may fail to fully capture how individuals engage theory of mind in everyday interactions. In the present study, we validate and test a novel approach utilizing a naturalistic task to assess theory of mind. Participants watched a mockumentary-style show while using a joystick to provide continuous, real-time theory of mind judgments. A baseline sample's ratings were used to establish a "ground truth" for the judgments. Ratings from separate young and older adult samples were compared against the ground truth to create similarity scores. This similarity score was compared against two independent tasks to assess construct validity: an explicit judgment performance-based paradigm, and a neuroimaging paradigm assessing response to a static measure of theory of mind. The similarity metric did not have ceiling effects and was significantly positively related to both the performance-based and neural measures. It also replicated age effects that other theory of mind measures demonstrate. Together, our multimodal approach provided convergent evidence that dynamic measures of behavior can yield robust and rigorous assessments of complex psychological processes.