Matrix completion under complex survey sampling
Multivariate nonresponse is often encountered in complex survey sampling, and simply ignoring it leads to erroneous inference. In this paper, we propose a new matrix completion method for complex survey sampling. Unlike existing works that conduct either row-wise or column-wise imputation, we treat the data matrix as a whole, which allows row and column patterns to be exploited simultaneously. A column-space-decomposition model is adopted, incorporating a low-rank structured matrix for the finite population with easy-to-obtain demographic information as covariates. In addition, we propose a computationally efficient projection strategy to identify the model parameters under complex survey sampling. An augmented inverse probability weighting estimator is then used to estimate the parameter of interest, and the corresponding asymptotic upper bound of the estimation error is derived. Simulation studies show that the proposed estimator has a smaller mean squared error than its competitors, and the corresponding variance estimator performs well. The proposed method is applied to assess the health status of the U.S. population.
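As a rough illustration of the augmented inverse probability weighting idea (not the paper's survey-specific estimator), the sketch below estimates a population mean from outcomes missing at random by combining an outcome-model prediction with inverse-probability-weighted residuals; the simulated design and all names are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)                      # fully observed covariate
y = 2.0 + 1.5 * x + rng.normal(size=n)      # outcome; true mean is 2.0
pi = 1.0 / (1.0 + np.exp(-(0.5 + x)))       # known response probabilities
r = rng.random(n) < pi                      # response indicator (MAR given x)

# Outcome model fitted on respondents only (here: OLS of y on x)
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X[r], y[r], rcond=None)[0]
m = X @ beta                                # predictions for everyone

# AIPW: model prediction plus IPW-corrected residual on respondents
correction = np.zeros(n)
correction[r] = (y[r] - m[r]) / pi[r]
mu_aipw = np.mean(m + correction)
```

The estimator is doubly robust in the usual sense: it remains consistent if either the outcome model or the response probabilities are correctly specified.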
Nonparametric tests for multistate processes with clustered data
In this work, we propose nonparametric two-sample tests for population-averaged transition and state occupation probabilities of continuous-time, finite-state processes with clustered, right-censored, and/or left-truncated data. We consider settings where the two groups under comparison are independent or dependent, with or without complete cluster structure. The proposed tests impose no assumptions on the within-cluster dependence structure and are applicable to settings with informative cluster size and/or non-Markov processes. The asymptotic properties of the tests are rigorously established using empirical process theory. Simulation studies show that the proposed tests work well even with a small number of clusters, and that they can be substantially more powerful than the only previously proposed test for this problem of which we are aware. The tests are illustrated using data from a multicenter randomized controlled trial on metastatic squamous-cell carcinoma of the head and neck.
Weighted Estimating Equations for Additive Hazards Models with Missing Covariates
This paper presents simple weighted and fully augmented weighted estimators for the additive hazards model with covariates that are missing at random. The additive hazards model estimates differences in hazards and has an intuitive biological interpretation. The proposed estimators make nonparametric use of the incomplete data and have closed-form expressions. We show that they are consistent and asymptotically normal, and that the fully augmented weighted estimator is more efficient than the simple weighted estimator, which uses only the complete cases. We illustrate their finite-sample performance through simulation studies, an application to the progression from mild cognitive impairment to dementia using data from the Alzheimer's Disease Neuroimaging Initiative, and an application to the mouse leukemia study.
Sparse and Efficient Estimation for Partial Spline Models with Increasing Dimension
We consider model selection and estimation for partial spline models and propose a new regularization method in the context of smoothing splines. The regularization method has a simple yet elegant form, consisting of a roughness penalty on the nonparametric component and a shrinkage penalty on the parametric components, which achieves function smoothing and sparse estimation simultaneously. We establish the convergence rate and oracle properties of the estimator under weak regularity conditions. Remarkably, the estimated parametric components are sparse and efficient, and the nonparametric component can be estimated at the optimal rate. The procedure also has attractive computational properties. Using the representer theorem for smoothing splines, we reformulate the objective function as a LASSO-type problem, enabling us to compute the solution path with the LARS algorithm. We then extend the procedure to settings where the number of predictors increases with the sample size and investigate its asymptotic properties in that context. Finite-sample performance is illustrated by simulations.
Bayesian nonparametric regression with varying residual density
We consider the problem of robust Bayesian inference on the mean regression function allowing the residual density to change flexibly with predictors. The proposed class of models is based on a Gaussian process prior for the mean regression function and mixtures of Gaussians for the collection of residual densities indexed by predictors. Initially considering the homoscedastic case, we propose priors for the residual density based on probit stick-breaking (PSB) scale mixtures and symmetrized PSB (sPSB) location-scale mixtures. Both priors restrict the residual density to be symmetric about zero, with the sPSB prior more flexible in allowing multimodal densities. We provide sufficient conditions to ensure strong posterior consistency in estimating the regression function under the sPSB prior, generalizing existing theory focused on parametric residual distributions. The PSB and sPSB priors are generalized to allow residual densities to change nonparametrically with predictors through incorporating Gaussian processes in the stick-breaking components. This leads to a robust Bayesian regression procedure that automatically down-weights outliers and influential observations in a locally-adaptive manner. Posterior computation relies on an efficient data augmentation exact block Gibbs sampler. The methods are illustrated using simulated and real data applications.
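A minimal sketch of the stick-breaking construction underlying a probit stick-breaking prior: each break proportion is a normal CDF transform of a latent Gaussian variable, and the mixture weights follow the usual stick-breaking recursion. The truncation level `H` and the i.i.d. latent draws are illustrative simplifications (in the predictor-dependent extension these latents would be Gaussian processes):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
H = 25                                   # truncation level (illustrative)
alpha = rng.normal(size=H)               # latent Gaussian stick variables
v = norm.cdf(alpha)                      # probit-transformed break proportions

# w[h] = Phi(alpha_h) * prod_{l<h} (1 - Phi(alpha_l))
w = v * np.cumprod(np.concatenate([[1.0], 1.0 - v[:-1]]))
```

By the telescoping identity, the mass not yet allocated after H breaks equals the product of the remaining stick lengths, so the weights sum to less than one at any finite truncation.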
On constrained and regularized high-dimensional regression
High-dimensional feature selection has become increasingly crucial for seeking parsimonious models in estimation. For selection consistency, we derive a necessary and sufficient condition formulated in terms of a degree-of-separation measure. The minimal degree of separation is necessary for any method to be selection consistent. At a level slightly higher than the minimal degree of separation, selection consistency is achieved by a constrained L0-method and its computational surrogate, the constrained truncated L1-method. This permits the number of candidate features to grow up to exponentially in the sample size; in this sense, these methods are optimal in feature selection against any selection method. In contrast, their regularization counterparts, the L0-regularization and truncated L1-regularization methods, achieve the same under slightly stronger assumptions. More importantly, such selection yields sharper parameter estimation and prediction, leading to minimax parameter estimation, which is otherwise impossible in the absence of a good selection method in high-dimensional analysis.
Strong consistency of nonparametric Bayes density estimation on compact metric spaces with applications to specific manifolds
This article considers a broad class of kernel mixture density models on compact metric spaces and manifolds. Following a Bayesian approach with a nonparametric prior on the location mixing distribution, sufficient conditions are obtained on the kernel, prior and the underlying space for strong posterior consistency at any continuous density. The prior is also allowed to depend on the sample size n and sufficient conditions are obtained for weak and strong consistency. These conditions are verified on compact Euclidean spaces using multivariate Gaussian kernels, on the hypersphere using a von Mises-Fisher kernel and on the planar shape space using complex Watson kernels.
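To give a concrete flavor of kernel mixtures on a manifold, the sketch below forms a von Mises kernel density estimate on the circle, the one-dimensional analogue of the von Mises-Fisher construction on the hypersphere; the kernel concentration plays the role of an inverse bandwidth. The data-generating choices and the fixed concentration are illustrative assumptions:

```python
import numpy as np
from scipy.stats import vonmises

rng = np.random.default_rng(2)
# Circular sample from a von Mises distribution (kappa=4, center 1 rad)
data = vonmises.rvs(4.0, loc=1.0, size=300, random_state=rng)

kappa = 20.0                                  # kernel concentration ~ 1/bandwidth
grid = np.linspace(-np.pi, np.pi, 512, endpoint=False)

# Mixture of von Mises kernels centered at the observations
dens = vonmises.pdf(grid[:, None], kappa, loc=data[None, :]).mean(axis=1)
```

Because each kernel is a periodic density on the circle, the estimate is itself a valid circular density, integrating to one over any full period.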
The efficiency of the second-order nonlinear least squares estimator and its extension
We revisit the second-order nonlinear least squares estimator proposed in Wang and Leblanc (Ann Inst Stat Math 60:883-900, 2008) and show that the estimator attains asymptotic optimality in terms of estimation variability. Using a fully semiparametric approach, we further modify and extend the method to heteroscedastic error models and propose a semiparametrically efficient estimator in this more general setting. Numerical results are provided to support the theory and to illustrate the finite-sample performance of the proposed estimator.
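A toy version of the second-order least squares idea for a linear model with homoscedastic errors: the regression parameter and the error variance are estimated jointly by matching the first two conditional moments, here with identity weighting (a simplification of the general weighted form). The simulated design is our own:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(8)
n = 1000
x = rng.uniform(0.0, 2.0, n)
theta_true, sig2_true = 1.5, 0.25
y = theta_true * x + rng.normal(0.0, np.sqrt(sig2_true), n)

def moment_residuals(par):
    theta, sig2 = par
    g = theta * x
    # First- and second-moment residuals: E[y|x]=g, E[y^2|x]=g^2+sig2
    return np.concatenate([y - g, y**2 - (g**2 + sig2)])

fit = least_squares(moment_residuals, x0=[1.0, 1.0])
theta_hat, sig2_hat = fit.x
```

Both moment residuals have mean zero at the true parameter values, so the estimator is consistent; using both moments lets the error variance be estimated alongside the regression parameter.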
Hazard Function Estimation with Cause-of-Death Data Missing at Random
Hazard function estimation is an important part of survival analysis. Interest often centers on estimating the hazard function associated with a particular cause of death. We propose three nonparametric kernel estimators for the hazard function, all of which are appropriate when death times are subject to random censorship and censoring indicators can be missing at random. Specifically, we present a regression surrogate estimator, an imputation estimator, and an inverse probability weighted estimator. All three estimators are uniformly strongly consistent and asymptotically normal. We derive asymptotic representations of the mean squared error and the mean integrated squared error for these estimators and we discuss a data-driven bandwidth selection method. A simulation study, conducted to assess finite sample behavior, demonstrates that the proposed hazard estimators perform relatively well. We illustrate our methods with an analysis of some vascular disease data.
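The inverse probability weighting idea can be sketched as a kernel-smoothed Nelson-Aalen estimator in which each observed jump is weighted by the inverse of the probability that its censoring indicator is observed. For simplicity the sketch uses a known constant observation probability (MCAR rather than MAR) and an Epanechnikov kernel; the simulated design is our own:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
t_event = rng.exponential(1.0, n)            # true hazard = 1
t_cens = rng.exponential(2.0, n)             # independent censoring
T = np.minimum(t_event, t_cens)
delta = (t_event <= t_cens).astype(float)    # death indicator
pi = 0.8                                     # P(indicator observed), known here
xi = rng.random(n) < pi                      # indicator-observed flag

h = 0.3                                      # bandwidth

def epanechnikov(u):
    return 0.75 * np.maximum(1.0 - u**2, 0.0)

def hazard_ipw(t):
    # Y(T_i): number at risk just before each observed time
    at_risk = (T[None, :] >= T[:, None]).sum(axis=1)
    # IPW-weighted Nelson-Aalen jumps: xi * delta / (pi * Y(T_i))
    jumps = np.where(xi, delta, 0.0) / (pi * at_risk)
    return np.sum(epanechnikov((t - T) / h) / h * jumps)

lam_hat = hazard_ipw(0.5)
```

Smoothing the weighted jump sizes with a kernel turns the step-function cumulative hazard estimate into a pointwise hazard estimate.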
Representations of efficient score for coarse data problems based on Neumann series expansion
We derive new representations of the efficient score for coarse data problems based on Neumann series expansion. The representations can be applied to both ignorable and nonignorable coarse data. An approximation to the new representation may be used for computing locally efficient scores in such problems. We show that many of the successive approximation approaches to the computation of the locally efficient score proposed in the literature for coarse data problems can be derived as special cases of the representations. In addition, the representations lead to new algorithms for computing the locally efficient scores for the coarse data problems.
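The successive approximation idea behind a Neumann series has a simple finite-dimensional analogue: when an operator K has norm below one, iterating x ← b + K x converges to (I − K)⁻¹ b = Σₖ Kᵏ b. The matrix example below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
K = rng.normal(size=(6, 6))
K *= 0.5 / np.linalg.norm(K, 2)    # rescale so the spectral norm is 0.5 < 1

b = rng.normal(size=6)

# Successive approximation: x_{k+1} = b + K x_k
x = np.zeros(6)
for _ in range(200):
    x = b + K @ x

exact = np.linalg.solve(np.eye(6) - K, b)
```

Each iteration adds one more term of the Neumann series, and the error contracts geometrically at the rate of the operator norm.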
The local Dirichlet process
As a generalization of the Dirichlet process (DP) to allow predictor dependence, we propose a local Dirichlet process (lDP). The lDP provides a prior distribution for a collection of random probability measures indexed by predictors. This is accomplished by assigning stick-breaking weights and atoms to random locations in a predictor space. The probability measure at a given predictor value is then formulated using the weights and atoms located in a neighborhood about that predictor value. This construction results in a marginal DP prior for the random measure at any specific predictor value. Dependence is induced through local sharing of random components. Theoretical properties are considered and a blocked Gibbs sampler is proposed for posterior computation in lDP mixture models. The methods are illustrated using simulated examples and an epidemiologic application.
Density Estimation with Replicate Heteroscedastic Measurements
We present a deconvolution estimator for the density function of a random variable based on a set of independent replicate measurements. We assume the measurements are made with normally distributed errors having unknown and possibly heterogeneous variances. The estimator generalizes the deconvoluting kernel density estimator of Stefanski and Carroll (1990), with the error variances estimated from the replicate observations. We derive expressions for the integrated mean squared error and examine its rate of convergence as n → ∞ with the number of replicates held fixed. We investigate the finite-sample performance of the estimator through a simulation study and an application to real data.
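A sketch of the variance-estimation step only: with replicate measurements W_ij = X_i + U_ij, the within-subject sample variance is an unbiased estimate of each subject's error variance, and averaging the replicates divides the error variance of the proxy by the number of replicates. The simulated heteroscedastic design is our own:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 400, 3                                # n subjects, m replicates each
x = rng.normal(size=n)                       # latent true values X_i
sigma = rng.uniform(0.5, 1.5, size=n)        # heteroscedastic error SDs
W = x[:, None] + sigma[:, None] * rng.normal(size=(n, m))

wbar = W.mean(axis=1)                        # replicate mean: proxy for X_i
s2 = W.var(axis=1, ddof=1)                   # unbiased estimate of sigma_i^2
var_wbar = s2 / m                            # error variance of the proxy
```

These subject-specific variance estimates are what a deconvolution step would plug in where the classical estimator assumes a single known error variance.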
Simultaneous estimation and variable selection in median regression using Lasso-type penalty
We consider median regression with a LASSO-type penalty term for variable selection. With a fixed number of variables in the regression model, a two-stage method is proposed for simultaneous estimation and variable selection in which the degree of penalization is chosen adaptively. A Bayesian information criterion type approach is used to obtain a data-driven procedure that provably selects asymptotically optimal tuning parameters, and the resulting estimator is shown to achieve the so-called oracle property. The combination of median regression and the LASSO penalty is easy to implement via standard linear programming, and a random perturbation scheme can be used to obtain a simple estimator of the standard errors. Simulation studies are conducted to assess the finite-sample performance of the proposed method, and we illustrate the methodology with a real example.
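The linear programming formulation can be sketched as follows: absolute residuals and absolute coefficients are replaced by nonnegative auxiliary variables, turning L1-penalized median regression into a standard LP. The data, the fixed penalty level, and the variable names are our own illustrative choices (this is a one-shot fit, not the paper's adaptive two-stage procedure):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0])
y = X @ beta_true + rng.standard_t(df=3, size=n)   # heavy-tailed noise

lam = 5.0   # illustrative penalty level (not data-driven)
# Variables: beta (p, free), u (n, residual magnitudes), v (p, |beta|)
# minimize sum(u) + lam*sum(v)  s.t.  -u <= y - X beta <= u,  -v <= beta <= v
c = np.concatenate([np.zeros(p), np.ones(n), lam * np.ones(p)])
A_ub = np.block([
    [ X,         -np.eye(n),       np.zeros((n, p))],  #  Xb - u <= y
    [-X,         -np.eye(n),       np.zeros((n, p))],  # -Xb - u <= -y
    [ np.eye(p),  np.zeros((p, n)), -np.eye(p)],       #  b - v <= 0
    [-np.eye(p),  np.zeros((p, n)), -np.eye(p)],       # -b - v <= 0
])
b_ub = np.concatenate([y, -y, np.zeros(2 * p)])
bounds = [(None, None)] * p + [(0, None)] * (n + p)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
beta_hat = res.x[:p]
```

At the optimum each u_i equals the absolute residual and each v_j equals |beta_j|, so the LP objective coincides with the penalized median regression criterion.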
Latent Class Analysis Variable Selection
We propose a method for selecting variables in latent class analysis, which is the most common model-based clustering method for discrete data. The method assesses a variable's usefulness for clustering by comparing two models, given the clustering variables already selected. In one model the variable contributes information about cluster allocation beyond that contained in the already selected variables, and in the other model it does not. A headlong search algorithm is used to explore the model space and select clustering variables. In simulated datasets we found that the method selected the correct clustering variables, and also led to improvements in classification performance and in accuracy of the choice of the number of classes. In two real datasets, our method discovered the same group structure with fewer variables. In a dataset from the International HapMap Project consisting of 639 single nucleotide polymorphisms (SNPs) from 210 members of different groups, our method discovered the same group structure with a much smaller number of SNPs.
GENERALIZED PARTIALLY LINEAR MIXED-EFFECTS MODELS INCORPORATING MISMEASURED COVARIATES
In this article we consider a semiparametric generalized mixed-effects model and propose combining local linear regression with penalized quasilikelihood and local quasilikelihood techniques to estimate both population and individual parameters as well as nonparametric curves. The proposed estimators take into account the local correlation structure of the longitudinal data. We establish asymptotic normality for the parameter estimators and asymptotic expansions for the estimators of the nonparametric part. For practical implementation, we propose an appropriate algorithm. We also consider the problem of measurement error in the covariates of our model and suggest a strategy for adjusting for its effects. We apply the proposed models and methods to study the relation between virologic and immunologic responses in AIDS clinical trials, in which virologic response is classified into binary variables, and analyze a dataset from an AIDS clinical study.
SEMIPARAMETRIC MARGINAL AND ASSOCIATION REGRESSION METHODS FOR CLUSTERED BINARY DATA
Clustered data arise commonly in practice, and it is often of interest to estimate the association parameters as well as the mean response parameters. However, most research has been directed at inference about the mean response parameters, with the association parameters relegated to a nuisance role; there is little work addressing both the marginal and association structures, especially in the semiparametric framework. In this paper, our interest centers on inference about the association parameters in addition to the mean parameters. We develop semiparametric methods for both complete and incomplete clustered binary data, establish the corresponding theoretical results, and illustrate the proposed methodology through numerical studies.
Generation of all randomizations using circuits
After a rich history in medicine, randomized controlled trials (RCTs), both simple and complex, are increasingly used in other areas, such as web-based A/B testing and the planning and design of decisions. A main objective of RCTs is to measure parameters, and contrasts in particular, while guarding against bias from hidden confounders. After careful definitions of classical entities such as contrasts, an algebraic method based on circuits is introduced that gives a wide choice of randomization schemes.
Semiparametric modelling of two-component mixtures with stochastic dominance
In this work, we study a two-component mixture model with a stochastic dominance constraint, a model arising naturally in many genetic studies. To model the stochastic dominance, we propose a semiparametric model for the log density ratio. More specifically, when the log of the ratio of the two component densities has a linear regression form, stochastic dominance is immediately satisfied. For the resulting semiparametric mixture model, we propose two estimators, the maximum empirical likelihood estimator (MELE) and the minimum Hellinger distance estimator (MHDE), and investigate their asymptotic properties, such as consistency and normality. In addition, to test the validity of the proposed semiparametric model, we develop Kolmogorov-Smirnov type tests based on the two estimators. The finite-sample performance of the two estimators and the tests, in terms of both efficiency and robustness, is examined and compared via thorough Monte Carlo simulation studies and real data analysis.
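When the log density ratio is linear, the two-component structure is closely related to logistic regression on the pooled sample, a well-known connection for exponential tilt models with fully labeled data (the MELE for unlabeled mixtures is more involved). A sketch under that simplified, fully labeled setting, with our own simulated design:

```python
import numpy as np

rng = np.random.default_rng(7)
n0, n1 = 1000, 1000
x0 = rng.normal(0.0, 1.0, n0)    # component f: N(0,1)
x1 = rng.normal(1.0, 1.0, n1)    # component g: N(1,1); log(g/f) = -0.5 + 1.0*x
x = np.concatenate([x0, x1])
z = np.concatenate([np.zeros(n0), np.ones(n1)])   # sample labels

# Newton-Raphson for logistic regression of label on x
X = np.column_stack([np.ones(n0 + n1), x])
theta = np.zeros(2)
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    W = p * (1 - p)
    theta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (z - p))

beta_hat = theta[1]   # slope of the fitted log density ratio
```

A positive fitted slope corresponds to a monotone likelihood ratio, which implies the stochastic dominance of one component over the other.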