Investigating heterogeneity in IRTree models for multiple response processes with score-based partitioning
Item response tree (IRTree) models form a family of psychometric models that allow researchers to control for multiple response processes, such as different sorts of response styles, in the measurement of latent traits. While IRTree models can capture quantitative individual differences in both the latent traits of interest and the use of response categories, they maintain the basic assumption that the nature and weighting of latent response processes are homogeneous across the entire population of respondents. In the present research, we therefore propose a novel approach for detecting heterogeneity in the parameters of IRTree models across subgroups that engage in different response behavior. The approach uses score-based tests to reveal violations of parameter homogeneity along extraneous person covariates, and it can be employed as a model-based partitioning algorithm to identify sources of differences in the strength of trait-based responding or other response processes. Simulation studies demonstrate generally accurate Type I error rates and sufficient power for metric, ordinal, and categorical person covariates and for different types of test statistics, with the potential to differentiate between different types of parameter heterogeneity. An empirical application illustrates the use of score-based partitioning in the analysis of latent response processes with real data.
Constructing tests for skill assessment with competence-based test development
Competence-based test development is a recent and innovative method for the construction of tests that are as informative as possible about the competence state (the set of skills an individual has available) underlying the observed item responses. It finds application in different contexts, including the development of tests from scratch and the improvement or shortening of existing tests. Given a fixed collection of competence states existing in a population of individuals and a fixed collection of competencies (each being a subset of skills that allows for solving an item), the competency deletion procedure results in tests that differ from each other in the competencies but are all equally informative about individuals' competence states. This work introduces a streamlined version of the competency deletion procedure that considers information necessary for test construction only, illustrates a straightforward way to incorporate test developer preferences about competencies into the test construction process, and evaluates the performance of the resulting tests in uncovering the competence states from the observed item responses.
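The idea of equally informative tests can be made concrete in a few lines. The toy skills, competencies, and competence states below are invented for illustration, and the removability check (a competency may be dropped only if the remaining competencies still separate every pair of competence states) is a simplified sketch of the idea behind the competency deletion procedure, not the authors' algorithm.

```python
# Skills are letters; each competency is the subset of skills solving one item.
competencies = [{"a"}, {"b"}, {"a", "b"}, {"b", "c"}]
states = [set(), {"a"}, {"b"}, {"a", "b"}, {"a", "b", "c"}]

def response_pattern(state, comps):
    # A competence state solves an item iff it contains every skill
    # in that item's competency.
    return tuple(comp <= state for comp in comps)

def partition(all_states, comps):
    # Group competence states (by index) according to the response
    # patterns they generate; equal partitions mean equal informativeness.
    groups = {}
    for i, s in enumerate(all_states):
        groups.setdefault(response_pattern(s, comps), []).append(i)
    return {frozenset(g) for g in groups.values()}

full = partition(states, competencies)
# A competency is deletable when the remaining ones still induce the
# same partition of competence states as the full collection.
deletable = [c for c in competencies
             if partition(states, [d for d in competencies if d != c]) == full]
print(deletable)
```

In this toy example only the competency {"a", "b"} is redundant: the items for {"a"} and {"b"} together already determine whether both skills are present.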
Discriminability around polytomous knowledge structures and polytomous functions
Discriminability in polytomous knowledge structure theory (KST) was introduced by Stefanutti et al. (Journal of Mathematical Psychology, 2020, 94, 102306). Taking up this topic, this paper discusses discriminability for granular polytomous knowledge spaces, polytomous knowledge structures, polytomous surmising functions, and polytomous skill functions. More precisely, it establishes equivalences between the discriminability of polytomous surmising functions (resp. polytomous skill functions) and the discriminability of granular polytomous knowledge spaces (resp. polytomous knowledge structures). These findings open the field to a systematic generalization of discriminability in KST to the polytomous case.
Regularized Bayesian algorithms for Q-matrix inference based on saturated cognitive diagnosis modelling
Q-matrices are crucial components of cognitive diagnosis models (CDMs), which are used to provide diagnostic information and classify examinees according to their attribute profiles. The absence of an appropriate Q-matrix that correctly reflects item-attribute relationships often limits the widespread use of CDMs. Rather than relying on expert judgment for specification and post-hoc methods for validation, there has been a notable shift towards Q-matrix estimation by adopting Bayesian methods. Nevertheless, their dependence on Markov chain Monte Carlo (MCMC) estimation imposes a substantial computational burden, and their exploratory tendency does not scale to large-scale settings. As a scalable and efficient alternative, this study introduces the partially confirmatory framework within a saturated CDM, where the Q-matrix can be partially defined by experts and partially inferred from data. To address the dual needs of accuracy and efficiency, the proposed framework accommodates two estimation algorithms: an MCMC algorithm and a variational Bayesian expectation-maximization (VBEM) algorithm. This dual-channel approach extends the model's applicability across a variety of settings. Based on simulated and real data, the proposed framework demonstrated its robustness in Q-matrix inference.
Identifiability analysis of the fixed-effects one-parameter logistic positive exponent model
In addition to the usual slope and location parameters included in a regular two-parameter logistic model (2PL), the logistic positive exponent (LPE) model incorporates an item parameter that leads to asymmetric item characteristic curves, which have recently been shown to be useful in some contexts. Although this model has been used in some empirical studies, an identifiability analysis (i.e., checking the (un)identified status of a model and searching for identifiability restrictions to make an unidentified model identified) has not yet been established. In this paper, we formalize the unidentified status of a large class of fixed-effects item response theory models that includes the LPE model and related versions of it. In addition, we conduct an identifiability analysis of a particular version of the LPE model that is based on the fixed-effects one-parameter logistic model (1PL), which we call the 1PL-LPE model. The main result indicates that the 1PL-LPE model is not identifiable. Ways to make the 1PL-LPE useful in practice and how different strategies for identifiability analyses may affect other versions of the model are also discussed.
Understanding linear interaction analysis with causal graphs
Interaction analysis using linear regression is widely employed in psychology and related fields, yet it often induces confusion among applied researchers and students. This paper aims to address this confusion by developing intuitive visual explanations based on causal graphs. By leveraging causal graphs with distinct interaction nodes, we provide clear insights into interpreting main effects in the presence of interaction, the rationale behind centering to reduce multicollinearity, and other pertinent topics. The proposed graphical approach could serve as a useful complement to existing algebraic explanations, fostering a more comprehensive understanding of the mechanics of linear interaction analysis.
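The centering point discussed above can be made concrete with a small simulation (variable names and numbers invented): a predictor with a non-zero mean correlates strongly with its own product term, and mean-centring removes most of that collinearity without changing the interaction itself.

```python
import random

random.seed(1)
n = 5000
# Invented simulation: predictor X and moderator Z with non-zero means.
x = [random.gauss(5.0, 1.0) for _ in range(n)]
z = [random.gauss(3.0, 1.0) for _ in range(n)]

def corr(u, v):
    # Pearson correlation, computed from scratch.
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

# Raw data: X is strongly correlated with its own product term X*Z.
raw = corr(x, [a * b for a, b in zip(x, z)])

# After mean-centring both variables, that collinearity largely vanishes.
mx, mz = sum(x) / n, sum(z) / n
xc, zc = [a - mx for a in x], [b - mz for b in z]
centred = corr(xc, [a * b for a, b in zip(xc, zc)])
print(round(raw, 2), round(centred, 2))
```

With these means the raw correlation is around .5, while the centred one is near zero, which is the multicollinearity-reduction effect the graphical account explains.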
A new Q-matrix validation method based on signal detection theory
The Q-matrix is a crucial component of cognitive diagnostic theory and an important basis for the research and practical application of cognitive diagnosis. In practice, the Q-matrix is typically developed by domain experts and may contain some misspecifications, so it needs to be refined using Q-matrix validation methods. Based on signal detection theory, this paper puts forward a new Q-matrix validation method and then conducts a simulation study to compare the new method with existing methods. The results show that under the DINA (deterministic inputs, noisy 'and' gate) model, the new method outperforms the existing methods under all conditions; under the generalized DINA (G-DINA) model, the new method still has the highest validation rate when the sample size is small and either the item quality is high or the rate of Q-matrix misspecification is ≥ .4. Finally, a sub-dataset of the PISA 2000 reading assessment is analysed to evaluate the reliability of the new method.
Pairwise likelihood estimation and limited-information goodness-of-fit test statistics for binary factor analysis models under complex survey sampling
This paper discusses estimation and limited-information goodness-of-fit test statistics in factor models for binary data using pairwise likelihood estimation and sampling weights. The paper extends the applicability of pairwise likelihood estimation for factor models with binary data to accommodate complex sampling designs. Additionally, it introduces two key limited-information test statistics: the Pearson chi-squared test and the Wald test. To enhance computational efficiency, the paper introduces modifications to both test statistics. The performance of the estimation and the proposed test statistics under simple random sampling and unequal probability sampling is evaluated using simulated data.
MCMC stopping rules in latent variable modelling
Bayesian analysis relies heavily on the Markov chain Monte Carlo (MCMC) algorithm to obtain random samples from posterior distributions. In this study, we compare the performance of MCMC stopping rules and provide a guideline for determining the termination point of the MCMC algorithm in latent variable models. In simulation studies, we examine the performance of four different MCMC stopping rules: potential scale reduction factor (PSRF), fixed-width stopping rule, Geweke's diagnostic, and effective sample size. Specifically, we evaluate these stopping rules in the context of the DINA model and the bifactor item response theory model, two commonly used latent variable models in educational and psychological measurement. Our simulation study findings suggest that single-chain approaches outperform multiple-chain approaches in terms of item parameter accuracy. However, when it comes to person parameter estimates, the effect of stopping rules diminishes. We caution against relying solely on the univariate PSRF, which is the most popular method, as it may terminate the algorithm prematurely and produce biased item parameter estimates if the cut-off value is not chosen carefully. Our research offers guidance to practitioners on choosing suitable stopping rules to improve the precision of the MCMC algorithm in models involving latent variables.
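The univariate PSRF cautioned about above has a short closed form. The sketch below implements the classic (non-split) Gelman-Rubin formula on invented chains, assuming equal-length chains; modern variants (split or rank-normalized R-hat) differ in detail.

```python
import random
import statistics

def psrf(chains):
    # Classic Gelman-Rubin potential scale reduction factor: compares
    # between-chain variance B with mean within-chain variance W.
    m, n = len(chains), len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    grand = statistics.fmean(means)
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    w = statistics.fmean(statistics.variance(c) for c in chains)
    var_hat = (n - 1) / n * w + b / n   # pooled posterior-variance estimate
    return (var_hat / w) ** 0.5

random.seed(0)
# Chains that all sample the same target: PSRF should be near 1.
mixed = [[random.gauss(0, 1) for _ in range(2000)] for _ in range(3)]
# Chains stuck at separated modes: PSRF far above common cut-offs.
stuck = [[random.gauss(mu, 1) for _ in range(2000)] for mu in (0, 3, 6)]
print(round(psrf(mixed), 3), round(psrf(stuck), 3))
```

The contrast illustrates why the cut-off matters: a lenient threshold could pass a chain whose PSRF is only modestly above 1 even though mixing is poor.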
A unified EM framework for estimation and inference of normal ogive item response models
Normal ogive (NO) models have contributed substantially to the advancement of item response theory (IRT) and have become popular educational and psychological measurement models. However, estimating NO models remains computationally challenging. The purpose of this paper is to propose an efficient and reliable computational method for fitting NO models. Specifically, we introduce a novel and unified expectation-maximization (EM) algorithm for estimating NO models, including two-parameter, three-parameter, and four-parameter NO models. A key improvement in our EM algorithm lies in augmenting the NO model to be a complete data model within the exponential family, thereby substantially streamlining the implementation of the EM iteration and avoiding the numerical optimization computation in the M-step. Additionally, we propose a two-step expectation procedure for implementing the E-step, which reduces the dimensionality of the integration and effectively enables numerical integration. Moreover, we develop a computing procedure for estimating the standard errors (SEs) of the estimated parameters. Simulation results demonstrate the superior performance of our algorithm in terms of its recovery accuracy, robustness, and computational efficiency. To further validate our methods, we apply them to real data from the Programme for International Student Assessment (PISA). The results affirm the reliability of the parameter estimates obtained using our method.
Perturbation graphs, invariant causal prediction and causal relations in psychology
Networks (graphs) in psychology are often restricted to settings without interventions. Here we consider a framework borrowed from biology that involves multiple interventions from different contexts (observations and experiments) in a single analysis. The method is called perturbation graphs. In gene regulatory networks, the induced change in one gene is measured on all other genes in the analysis, thereby assessing possible causal relations. This is repeated for each gene in the analysis. A perturbation graph leads to the correct set of causes (not necessarily direct causes). Subsequent pruning of paths in the graph (called transitive reduction) should reveal direct causes. We show that transitive reduction will not in general lead to the correct underlying graph. We also show that invariant causal prediction is a generalisation of the perturbation graph method and does reveal direct causes, thereby replacing transitive reduction. We conclude that perturbation graphs provide a promising new tool for experimental designs in psychology, and combined with invariant causal prediction make it possible to reveal direct causes instead of causal paths. As an illustration we apply these ideas to a data set about attitudes on meat consumption and to a time series of a patient diagnosed with major depressive disorder.
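The transitive reduction operation itself is easy to state in code. The sketch below (graph and labels invented) removes an edge whenever its endpoint remains reachable along another path, which is the unique transitive reduction for a DAG; as the abstract notes, this pruning need not recover the true causal graph.

```python
def reachable(graph, src, dst, skip_edge):
    # Depth-first search from src to dst, ignoring one specific edge.
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if (node, nxt) == skip_edge:
                continue
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def transitive_reduction(graph):
    # For a DAG, edge u -> v is redundant iff v is reachable from u
    # via some longer path in the original graph.
    reduced = {u: set(vs) for u, vs in graph.items()}
    for u, vs in graph.items():
        for v in vs:
            if reachable(graph, u, v, skip_edge=(u, v)):
                reduced[u].discard(v)
    return reduced

# Invented perturbation graph: A -> C is implied by A -> B -> C,
# so transitive reduction drops it.
g = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(transitive_reduction(g))
```

Here the pruned graph keeps only A -> B and B -> C, illustrating how a chain of causes masquerades as a direct edge in the unpruned perturbation graph.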
Maximal point-polyserial correlation for non-normal random distributions
We consider the problem of determining the maximum value of the point-polyserial correlation between a random variable with an assigned continuous distribution and an ordinal random variable with k categories, which are assigned the first k natural values 1, 2, ..., k, and arbitrary probabilities p1, ..., pk. For different parametric distributions, we derive a closed-form formula for the maximal point-polyserial correlation as a function of the pi and of the distribution's parameters; we devise an algorithm for obtaining its maximum value numerically for any given k. These maximum values and the features of the corresponding k-point discrete random variables are discussed with respect to the underlying continuous distribution. Furthermore, we prove that if we do not assign the values of the ordinal random variable a priori but instead include them in the optimization problem, this latter approach is equivalent to the optimal quantization problem. In some circumstances, it leads to a significant increase in the maximum value of the point-polyserial correlation. An application to real data exemplifies the main findings. A comparison between the discretization leading to the maximum point-polyserial correlation and those obtained from optimal quantization and moment matching is sketched.
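For the simplest special case, a standard normal variable dichotomized at a threshold t (values 1 and 2), the maximal point-polyserial correlation can be found by direct grid search, since Cov(X, 1{X > t}) = phi(t) and Var(1{X > t}) = p(1 - p). This is a sketch of the general idea, not the paper's closed-form derivation.

```python
import math

def corr_binary_split(t):
    # Point-polyserial correlation between a standard normal X and the
    # two-point ordinal variable scored 1 if X <= t and 2 if X > t
    # (correlation is invariant to the affine rescoring).
    p = 0.5 * (1 - math.erf(t / math.sqrt(2)))           # P(X > t)
    phi = math.exp(-t * t / 2) / math.sqrt(2 * math.pi)  # normal density
    return phi / math.sqrt(p * (1 - p))

# Grid search over split points: the optimum is the median split t = 0,
# attaining sqrt(2/pi), roughly .798.
best_t = max((t / 100 for t in range(-300, 301)), key=corr_binary_split)
print(round(best_t, 2), round(corr_binary_split(best_t), 4))
```

The .798 ceiling for k = 2 under normality is a classical result; the paper's contribution covers general k, arbitrary probabilities, and non-normal distributions.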
Handling missing data in variational autoencoder based item response theory
Recently, variational autoencoders (VAEs) have been proposed as a method to estimate high-dimensional item response theory (IRT) models on large datasets. Although these improve the efficiency of estimation drastically compared to traditional methods, they have no natural way to deal with missing values. In this paper, we adapt three existing methods from the VAE literature to the IRT setting and propose one new method. We compare the performance of the different VAE-based methods to each other and to marginal maximum likelihood estimation for increasing levels of missing data in a simulation study for both three- and ten-dimensional IRT models. Additionally, we demonstrate the use of the VAE-based models on an existing algebra test dataset. Results confirm that VAE-based methods are a time-efficient alternative to marginal maximum likelihood, but that a larger number of importance-weighted samples is needed when the proportion of missing values is large.
A convexity-constrained parameterization of the random effects generalized partial credit model
An alternative closed-form expression for the marginal joint probability distribution of item scores under the random effects generalized partial credit model is presented. The closed-form expression involves a cumulant generating function and is therefore subject to convexity constraints. As a consequence, complicated moment inequalities are taken into account in maximum likelihood estimation of the parameters of the model, so that the estimation solution is always proper. Another important favorable consequence is that the likelihood function has a single local extreme point, the global maximum. Furthermore, attention is paid to expected a posteriori person parameter estimation, generalizations of the model, and testing the goodness-of-fit of the model. The proposed procedures are demonstrated in an illustrative example.
Applying support vector machines to a diagnostic classification model for polytomous attributes in small-sample contexts
For several years, the evaluation of polytomous attributes in small-sample settings has posed a challenge to the application of cognitive diagnosis models. To enhance classification precision, the support vector machine (SVM) was introduced for estimating polytomous attributes, given its proven feasibility for dichotomous cases. Two simulation studies and an empirical study assessed the impact of various factors on SVM classification performance, including training sample size, attribute structures, guessing/slipping levels, number of attributes, number of attribute levels, and number of items. The results indicated that SVM outperformed the pG-DINA model in classification accuracy under dependent attribute structures and small sample sizes. SVM performance improved with an increased number of items but declined with higher guessing/slipping levels, more attributes, and more attribute levels. Empirical data further validated the application and advantages of SVMs.
On a general theoretical framework of reliability
Reliability is an essential measure of how closely observed scores represent latent scores (reflecting constructs), assuming some latent variable measurement model. We present a general theoretical framework of reliability, placing emphasis on measuring the association between latent and observed scores. This framework was inspired by McDonald's (Psychometrika, 76, 511) regression framework, which highlighted the coefficient of determination as a measure of reliability. We extend McDonald's (Psychometrika, 76, 511) framework beyond coefficients of determination and introduce four desiderata for reliability measures (estimability, normalization, symmetry, and invariance). We also present theoretical examples to illustrate distinct measures of reliability and report on a numerical study that demonstrates the behaviour of different reliability measures. We conclude with a discussion on the use of reliability coefficients and outline future avenues of research.
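The coefficient-of-determination view of reliability attributed to McDonald above can be illustrated with a toy classical-test-theory simulation (all numbers invented): when observed = latent + independent noise, the squared latent-observed correlation coincides with the familiar variance ratio var(latent)/var(observed).

```python
import random
import statistics

random.seed(3)
# Toy measurement model: observed score = latent score + independent noise.
latent = [random.gauss(0, 1) for _ in range(20000)]
observed = [t + random.gauss(0, 0.5) for t in latent]

def corr(u, v):
    # Pearson correlation via population moments.
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    cov = statistics.fmean((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / (statistics.pstdev(u) * statistics.pstdev(v))

# Coefficient of determination for predicting observed from latent scores,
# versus the classical reliability ratio var(latent) / var(observed).
r2 = corr(latent, observed) ** 2
ratio = statistics.variance(latent) / statistics.variance(observed)
print(round(r2, 3), round(ratio, 3))
```

With noise variance .25, both quantities land near 1/1.25 = .8, showing one concrete measure of latent-observed association that the general framework places alongside others.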
Average treatment effects on binary outcomes with stochastic covariates
When evaluating the effect of psychological treatments on a dichotomous outcome variable in a randomized controlled trial (RCT), covariate adjustment using logistic regression models is often applied. In the presence of covariates, average marginal effects (AMEs) are often preferred over odds ratios, as AMEs yield a clearer substantive and causal interpretation. However, standard error computation of AMEs neglects sampling-based uncertainty (i.e., covariate values are assumed to be fixed over repeated sampling), which leads to underestimation of AME standard errors in other generalized linear models (e.g., Poisson regression). In this paper, we present and compare approaches allowing for stochastic (i.e., randomly sampled) covariates in models for binary outcomes. In a simulation study, we investigated the quality of statistical inference for AMEs under the fixed- and stochastic-covariate approaches in finite samples. Our results indicate that the fixed-covariate approach provides reliable results only if there is no heterogeneity in interindividual treatment effects (i.e., no treatment-covariate interactions), while the stochastic-covariate approaches are preferable in all other simulated conditions. We provide an illustrative example from clinical psychology, using an RCT that investigated the effect of a cognitive bias modification training on post-traumatic stress disorder while accounting for patients' anxiety.
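The contrast between an odds ratio and an AME can be sketched numerically. All coefficients and the covariate distribution below are invented, and the paper's standard-error corrections are not shown; the point is only that the AME averages probability differences over the covariate values, whereas exp(b1) is a single constant odds ratio.

```python
import math
import random

def inv_logit(z):
    # Logistic CDF: maps the linear predictor to a probability.
    return 1 / (1 + math.exp(-z))

random.seed(7)
# Hypothetical data-generating model (numbers invented):
# logit P(Y = 1) = b0 + b1 * treatment + b2 * covariate.
b0, b1, b2 = -0.5, 1.0, 0.8
x = [random.gauss(0, 1) for _ in range(10000)]

# AME of treatment: difference in predicted probabilities under T = 1
# versus T = 0, averaged over the observed covariate values.
ame = sum(inv_logit(b0 + b1 + b2 * xi) - inv_logit(b0 + b2 * xi)
          for xi in x) / len(x)
print(round(ame, 3))
```

Here the odds ratio exp(b1) is about 2.72 regardless of the covariates, while the AME is a probability-scale quantity that depends on where the covariate values sit on the logistic curve, which is exactly why its sampling variability inherits uncertainty from the covariates.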
Determining the number of attributes in the GDINA model
Exploratory cognitive diagnosis models have been widely used in psychology, education and other fields. This paper focuses on determining the number of attributes in a widely used cognitive diagnosis model, the GDINA model. Under certain conditions on cognitive diagnosis models, we prove that the covariance matrix of the observed data has a special structure. Exploiting this structure, an estimator of the number of attributes in the GDINA model based on eigen-decomposition is proposed. The performance of the proposed estimator is verified by simulation studies. Finally, the proposed estimator is applied to two real data sets: the Examination for the Certificate of Proficiency in English (ECPE) and the Big Five Personality (BFP).
Are alternative variables in a set differently associated with a target variable? Statistical tests and practical advice for dealing with dependent correlations
The analysis of multiple bivariate correlations is often carried out by conducting simple tests to check whether each of them is significantly different from zero. In addition, pairwise differences are often judged by eye or by comparing the p-values of the individual tests of significance despite the existence of statistical tests for differences between correlations. This paper uses simulation methods to assess the accuracy (empirical Type I error rate), power, and robustness of 10 tests designed to check the significance of the difference between two dependent correlations with overlapping variables (i.e., the correlation between X and Y and the correlation between X and Z). Five of the tests turned out to be inadvisable because their empirical Type I error rates under normality differ greatly from the nominal alpha level of .05 either across the board or within certain sub-ranges of the parameter space. The remaining five tests were acceptable and their merits were similar in terms of all comparison criteria, although none of them was robust across all forms of non-normality explored in the study. Practical recommendations are given for the choice of a statistical test to compare dependent correlations with overlapping variables.
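One classical candidate for this comparison, Williams' modification of Hotelling's t as presented by Steiger (1980), has a closed form. The sketch below is my rendering of that formula with invented correlation values, not the paper's recommended test; verify against the original before use.

```python
import math

def williams_t(r12, r13, r23, n):
    # Williams' t for H0: rho12 = rho13, i.e., two dependent correlations
    # sharing variable 1; refer the statistic to a t distribution
    # with n - 3 degrees of freedom.
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar = (r12 + r13) / 2
    num = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    den = math.sqrt(2 * det * (n - 1) / (n - 3)
                    + rbar**2 * (1 - r23) ** 3)
    return num / den

# Invented sample values: r(X,Y) = .50, r(X,Z) = .30, r(Y,Z) = .40, n = 100.
print(round(williams_t(0.5, 0.3, 0.4, 100), 3))
```

With these values the statistic is about 2.07 on 97 degrees of freedom, so the two correlations would be judged significantly different at the .05 level, something eyeballing the individual p-values cannot establish.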
Incorporating calibration errors in oral reading fluency scoring
Oral reading fluency (ORF) assessments are commonly used to screen at-risk readers and evaluate interventions' effectiveness as curriculum-based measurements. Similar to the standard practice in item response theory (IRT), calibrated passage parameter estimates are currently used as if they were population values in model-based ORF scoring. However, calibration errors that are unaccounted for may bias ORF score estimates and, in particular, lead to underestimated standard errors (SEs) of ORF scores. Therefore, we consider an approach that incorporates the calibration errors in latent variable scores. We further derive the SEs of ORF scores based on the delta method to incorporate the calibration uncertainty. We conduct a simulation study to evaluate the recovery of point estimates and SEs of latent variable scores and ORF scores in various simulated conditions. Results suggest that ignoring calibration errors leads to underestimated latent variable score SEs and ORF score SEs, especially when the calibration sample is small.
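The delta-method step can be shown in miniature (all numbers invented): the SE of a smooth transform g of an estimate is approximated by |g'(theta_hat)| * SE(theta_hat). Here g = exp is used purely for illustration and is not claimed to be the ORF scoring transform.

```python
import math

# Invented numbers: a latent score estimate and its standard error.
theta_hat, se_theta = 1.2, 0.1

# Delta method: SE(g(theta_hat)) is approximated by |g'(theta_hat)| * SE,
# with g = exp (its own derivative) as the illustrative transform.
se_transformed = abs(math.exp(theta_hat)) * se_theta
print(round(se_transformed, 3))
```

The paper's point is that theta_hat itself carries calibration error from the passage parameters; propagating that extra uncertainty through g is what keeps the reported ORF score SEs from being too small.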
Nonparametric CD-CAT for multiple-choice items: Item selection method and Q-optimality
Computerized adaptive testing for cognitive diagnosis (CD-CAT) achieves remarkable estimation efficiency and accuracy by adaptively selecting and then administering items tailored to each examinee. The process of item selection stands as a pivotal component of a CD-CAT algorithm, with various methods having been developed for binary responses. However, multiple-choice (MC) items, an important item type that allows for the extraction of richer diagnostic information from incorrect answers, have been underemphasized. Currently, the Jensen-Shannon divergence (JSD) index introduced by Yigit et al. (Applied Psychological Measurement, 2019, 43, 388) is the only item selection method exclusively designed for MC items. However, the JSD index requires a large sample to calibrate item parameters, which may be infeasible when there is only a small or no calibration sample. To bridge this gap, the study first proposes a nonparametric item selection method for MC items (MC-NPS) by introducing a novel discrimination power measure of an item's ability to distinguish effectively among different attribute profiles. A Q-optimal procedure for MC items is also developed to improve the classification during the initial phase of a CD-CAT algorithm. The effectiveness and efficiency of the two proposed algorithms were confirmed by simulation studies.
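The nonparametric idea can be sketched for the binary-response special case (Q-matrix and responses invented; the MC-NPS method described above generalizes this to multiple-choice distractors): classify an examinee by the attribute profile whose ideal response pattern lies nearest to the observed responses in Hamming distance, with no item parameters to calibrate.

```python
from itertools import product

# Invented 3-item, 2-attribute Q-matrix (one row of required attributes
# per item).
q_matrix = [(1, 0), (0, 1), (1, 1)]

def ideal_response(profile, q_row):
    # Conjunctive (DINA-like) ideal response: the item is answered
    # correctly only if the profile masters every required attribute.
    return int(all(a >= q for a, q in zip(profile, q_row)))

def classify(responses):
    # Choose the attribute profile whose ideal response pattern is
    # closest to the observed responses in Hamming distance.
    profiles = list(product((0, 1), repeat=len(q_matrix[0])))
    return min(profiles, key=lambda p: sum(
        r != ideal_response(p, q) for r, q in zip(responses, q_matrix)))

print(classify((1, 0, 0)))  # pattern consistent with mastering attribute 1 only
```

Because classification needs only the Q-matrix and ideal responses, the approach works with a small calibration sample or none at all, which is the setting the proposed MC-NPS method targets.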