A New Paradigm for High-dimensional Data: Distance-Based Semiparametric Feature Aggregation Framework via Between-Subject Attributes
This article proposes a distance-based framework motivated by the paradigm shift toward feature aggregation for high-dimensional data, which relies on neither the sparse-feature assumption nor permutation-based inference. Focusing on distance-based outcomes that preserve information without truncating any features, we develop a class of semiparametric regression models that encapsulates multiple sources of high-dimensional variables through pairwise outcomes of between-subject attributes. Further, we propose a strategy to address the interlocking correlations among pairs via U-statistics-based generalized estimating equations (UGEE), which correspond to their unique efficient influence function (EIF). Hence, the resulting semiparametric estimators are robust to distributional misspecification while enjoying root-n consistency and asymptotic optimality to facilitate inference. In essence, the proposed approach not only circumvents information loss due to feature selection but also improves the model's interpretability and computational feasibility. Simulation studies and applications to human microbiome and wearable-device data, where the feature dimensions are in the tens of thousands, are provided.
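As a rough illustration of the pairwise-outcome construction described above, the sketch below forms distance-based outcomes from all subject pairs and regresses them on a between-subject attribute. It assumes Euclidean distances, a single scalar covariate, an identity link, and a working-independence variance, under which the U-statistics-based estimating equation reduces to least squares over pairs; these choices and the function names are illustrative and not taken from the paper.

```python
import numpy as np

def ugee_linear(X_highdim, u):
    """Sketch: pairwise distance outcomes regressed on between-subject
    attributes. With an identity link and working independence (an
    assumption made here for illustration), the UGEE solution coincides
    with least squares over all subject pairs."""
    n = X_highdim.shape[0]
    rows, y = [], []
    for i in range(n):
        for j in range(i + 1, n):
            # distance-based outcome: no feature is truncated or selected
            y.append(np.linalg.norm(X_highdim[i] - X_highdim[j]))
            # between-subject attribute, e.g. |u_i - u_j|, plus an intercept
            rows.append([1.0, abs(u[i] - u[j])])
    Z, y = np.asarray(rows), np.asarray(y)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

# toy usage: 50 subjects, 20000 features, one scalar covariate
rng = np.random.default_rng(0)
n, p = 50, 20000
u = rng.normal(size=n)
X = rng.normal(size=(n, p)) + 0.05 * u[:, None]   # features shift with u
print(ugee_linear(X, u))
```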
Testing the missing at random assumption in generalized linear models in the presence of instrumental variables
Practical problems with missing data are common, and many methods have been developed concerning the validity and/or efficiency of statistical procedures. A central focus has been the mechanism governing data missingness, and correctly identifying the appropriate mechanism is crucial for conducting proper practical investigations. In this paper, we present a new hypothesis testing approach for deciding between the conventional notions of missing at random and missing not at random in generalized linear models in the presence of instrumental variables. The foundational idea is to develop appropriate discrepancy measures between estimators whose properties differ significantly only when missing at random does not hold. We show that our testing approach achieves an objective, data-driven choice between missing at random and missing not at random. We demonstrate the feasibility, validity, and efficacy of the new test through theoretical analysis, simulation studies, and a real data analysis.
Generalizing the information content for stepped wedge designs: A marginal modeling approach
Stepped wedge trials are increasingly adopted because practical constraints necessitate staggered roll-out. While a complete design requires clusters to collect data in all periods, resource and patient-centered considerations may call for an incomplete stepped wedge design to minimize data collection burden. To study incomplete designs, we expand the metric of information content to discrete outcomes. We operate under a marginal model with general link and variance functions, and derive information content expressions when data elements (cells, sequences, periods) are omitted. We show that the centrosymmetric patterns of information content can hold for discrete outcomes with the variance-stabilizing link function. We perform numerical studies under the canonical link function, and find that while the patterns of information content for cells are approximately centrosymmetric for all examined underlying secular trends, the patterns of information content for sequences or periods are more sensitive to the secular trend, and may be far from centrosymmetric.
Statistical Inference for Cox Proportional Hazards Models with a Diverging Number of Covariates
For statistical inference on regression models with a diverging number of covariates, the existing literature typically makes sparsity assumptions on the inverse of the Fisher information matrix. Such assumptions, however, are often violated under Cox proportional hazards models, leading to biased estimates and confidence intervals with under-coverage. We propose a modified debiased lasso method, which solves a series of quadratic programming problems to approximate the inverse information matrix without imposing sparse-matrix assumptions. We establish asymptotic results for the estimated regression coefficients when the dimension of covariates diverges with the sample size. As demonstrated by extensive simulations, our proposed method provides consistent estimates and confidence intervals with nominal coverage probabilities. The utility of the method is further demonstrated by assessing the effects of genetic markers on patients' overall survival in the Boston Lung Cancer Survival Cohort, a large-scale epidemiology study investigating mechanisms underlying lung cancer.
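The paper's "series of quadratic programming problems" is not reproduced here, but the following sketch shows one plausible form: approximating a single row of the inverse information matrix by a CLIME-style constrained quadratic program, with no sparsity assumption on the matrix itself. The function name, the specific constraint, and the tuning constant gamma are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

def approx_inverse_row(Sigma, j, gamma):
    """Sketch: approximate the j-th row of Sigma^{-1}, where Sigma plays the
    role of the (positive semidefinite) observed information of a Cox model.
    Minimize the quadratic form m' Sigma m subject to the sup-norm constraint
    |Sigma m - e_j|_inf <= gamma; one such program per coordinate j."""
    p = Sigma.shape[0]
    e_j = np.zeros(p)
    e_j[j] = 1.0
    fun = lambda m: 0.5 * m @ Sigma @ m          # quadratic objective
    jac = lambda m: Sigma @ m                    # its gradient
    con = LinearConstraint(Sigma, e_j - gamma, e_j + gamma)
    m0 = np.linalg.solve(Sigma + 1e-3 * np.eye(p), e_j)   # warm start
    res = minimize(fun, m0, jac=jac, constraints=[con], method="trust-constr")
    return res.x

# toy usage with a well-conditioned sample information matrix
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 30))
Sigma = A.T @ A / 200
print(approx_inverse_row(Sigma, j=0, gamma=0.05)[:5])
```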
Multiply robust matching estimators of average and quantile treatment effects
Propensity score matching has a long-standing tradition for handling confounding in causal inference, but it requires stringent model assumptions. In this article, we propose novel double score matching (DSM) utilizing both the propensity score and the prognostic score. To protect against possible model misspecification, we posit multiple candidate models for each score. We show that the de-biased DSM estimator achieves the multiple robustness property in that it is consistent if any one of the score models is correctly specified. We characterize the asymptotic distribution of the DSM estimator, requiring only one correct model specification, based on the martingale representations of the matching estimators and the theory of local normal experiments. We also provide a two-stage replication method for variance estimation and extend DSM to quantile estimation. Simulations demonstrate that DSM outperforms single score matching and prevailing multiply robust weighting estimators in the presence of extreme propensity scores.
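To make the double-score idea concrete, the sketch below matches treated units to controls on the two-dimensional score (propensity score, prognostic score) and averages the matched differences. It is a deliberately simplified stand-in: one working model per score rather than multiple candidates, 1:1 nearest-neighbor matching with replacement, an ATT target, and no de-biasing or replication variance step; the function name is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import NearestNeighbors

def double_score_matching_att(X, treat, y, n_neighbors=1):
    """Sketch: match on (propensity score, prognostic score) jointly,
    then estimate the effect on the treated by matched differences."""
    ps = LogisticRegression(max_iter=1000).fit(X, treat).predict_proba(X)[:, 1]
    prog = LinearRegression().fit(X[treat == 0], y[treat == 0]).predict(X)
    scores = np.column_stack([ps, prog])
    scores = (scores - scores.mean(axis=0)) / scores.std(axis=0)  # standardize
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(scores[treat == 0])
    _, idx = nn.kneighbors(scores[treat == 1])
    y_control = y[treat == 0]
    matched_means = y_control[idx].mean(axis=1)
    return float(np.mean(y[treat == 1] - matched_means))

# toy usage with a true treatment effect of 2
rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 5))
treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
y = X @ np.ones(5) + 2 * treat + rng.normal(size=n)
print(double_score_matching_att(X, treat, y))
```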
Minimax Powerful Functional Analysis of Covariance Tests: with Application to Longitudinal Genome-Wide Association Studies
We model the Alzheimer's Disease-related phenotype response variables observed at irregular time points in longitudinal Genome-Wide Association Studies as sparse functional data and propose nonparametric test procedures to detect functional genotype effects while controlling for the confounding effects of environmental covariates. Our new functional analysis of covariance tests are based on a seemingly unrelated kernel smoother, which takes into account the within-subject temporal correlations, and thus enjoy improved power over existing functional tests. We show that the proposed test, combined with a uniformly consistent nonparametric covariance function estimator, enjoys the Wilks phenomenon and is minimax most powerful. Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, where an application of the proposed test led to the discovery of new genes that may be related to Alzheimer's Disease.
Nonparametric bounds for the survivor function under general dependent truncation
Truncation occurs in cohort studies with complex sampling schemes. When truncation is ignored or incorrectly assumed to be independent of the event time in the observable region, bias can result. We derive completely nonparametric bounds for the survivor function under truncation and censoring; these extend prior nonparametric bounds derived in the absence of truncation. We also define a hazard ratio function that links the unobservable region in which event time is less than truncation time, to the observable region in which event time is greater than truncation time, under dependent truncation. When this function can be bounded, and the probability of truncation is known approximately, it yields narrower bounds than the purely nonparametric bounds. Importantly, our approach targets the true marginal survivor function over its entire support, and is not restricted to the observable region, unlike alternative estimators. We evaluate the methods in simulations and in clinical applications.
Multiply robust estimators of causal effects for survival outcomes
Multiply robust estimators of the longitudinal g-formula have recently been proposed to protect against model misspecification better than the standard augmented inverse probability weighted estimator (Rotnitzky et al., 2017; Luedtke et al., 2018). These multiply robust estimators ensure consistency if, at each time point, either the treatment-process model or the outcome-process model is correctly specified. We study the multiply robust estimators of Rotnitzky et al. (2017) in the context of a survival outcome. Specifically, we compare various estimators of the g-formula for survival outcomes in order to 1) understand how the estimators may be related to one another, 2) understand each estimator's robustness to model misspecification, and 3) construct estimators that can be more efficient than others under certain model misspecification scenarios. We propose a modification of the multiply robust estimators that gains efficiency under misspecification of the outcome model by using calibrated rather than non-calibrated propensity scores at each time point. Theoretical results are confirmed via simulation studies, and a practical comparison of these estimators is conducted through an application to the US Veterans Aging Cohort Study.
Efficiency of Naive Estimators for Accelerated Failure Time Models under Length-Biased Sampling
In prevalent cohort studies where subjects are recruited at a cross-section, the time to an event may be subject to length-biased sampling, with the observed data being the forward recurrence time, the backward recurrence time, or their sum. In the regression setting, assuming a semiparametric accelerated failure time model for the underlying event time, with the intercept parameter absorbed into the nuisance parameter, it has been shown that the model remains invariant under these observed-data set-ups and can be fitted using standard methodology for accelerated failure time model estimation, ignoring the length bias. However, the efficiency of these estimators is unclear, owing to the fact that the observed covariate distribution, which is also length-biased, may contain information about the regression parameter in the accelerated life model. We demonstrate that if the true covariate distribution is completely unspecified, then the naive estimator based on the conditional likelihood given the covariates is fully efficient for the slope.
Estimation of change-point for a class of count time series models
We apply a three-step sequential procedure to estimate the change-point of count time series. Under certain regularity conditions, the change-point estimator converges in distribution to the location of the maximum of a two-sided random walk. We derive a closed-form approximating distribution for the maximum of the two-sided random walk based on the invariance principle for strong mixing processes, so that statistical inference for the true change-point can be carried out. To the best of our knowledge, this is the first time such properties have been provided for integer-valued time series models. Moreover, we show that the proposed procedure is applicable to integer-valued autoregressive conditional heteroskedastic (INARCH) models with Poisson or negative binomial conditional distributions. In simulation studies, the proposed procedure is shown to perform well in locating the change-point of INARCH models. The procedure is further illustrated with empirical data on weekly robbery counts in two neighborhoods of Baltimore City.
On the identification of individual level pleiotropic, pure direct, and principal stratum direct effects without cross world assumptions
The analysis of natural direct and principal stratum direct effects has a controversial history in statistics and causal inference, as these effects are commonly identified with either untestable cross-world independence assumptions or graphical assumptions. This article demonstrates that the presence of individual-level natural direct and principal stratum direct effects can be identified without cross-world independence assumptions. We also define a new type of causal effect, called pleiotropy, that is of interest in genomics, and provide empirical conditions to detect such an effect as well. Our results are applicable to all types of distributions of the mediator and outcome.
Feature screening for case-cohort studies with failure time outcome
The case-cohort design has been demonstrated to be an economical and efficient approach in large cohort studies when the measurement of some covariates on all individuals is expensive. Various methods have been proposed for case-cohort data when the dimension of covariates is smaller than the sample size. However, limited work has been done for high-dimensional case-cohort data, which are frequently collected in large epidemiological studies. In this paper, we propose a variable screening method for ultrahigh-dimensional case-cohort data under the framework of the proportional hazards model, which allows the covariate dimension to increase with the sample size at an exponential rate. Our procedure enjoys the sure screening property and ranking consistency under mild regularity conditions. We further extend this method to an iterative version to handle scenarios where some covariates are jointly important but are marginally unrelated or only weakly correlated with the response. The finite-sample performance of the proposed procedure is evaluated via both simulation studies and an application to real data from a breast cancer study.
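The sketch below illustrates the generic marginal-screening idea for censored outcomes: rank covariates by the absolute standardized Cox partial-likelihood score at beta = 0 and keep the top d. The case-cohort selection weights and the iterative extension are omitted for brevity, so this is only a simplified analogue of the paper's procedure; function names are illustrative.

```python
import numpy as np

def marginal_cox_scores(X, time, event):
    """Sketch: marginal screening utility for right-censored data, computed
    as the absolute standardized Cox score at beta = 0 for each covariate."""
    n, p = X.shape
    order = np.argsort(time)
    X, time, event = X[order], time[order], event[order]
    scores, info = np.zeros(p), np.zeros(p)
    for i in np.flatnonzero(event):
        risk = slice(i, n)                 # subjects still at risk at time[i]
        scores += X[i] - X[risk].mean(axis=0)
        info += X[risk].var(axis=0)
    return np.abs(scores) / np.sqrt(np.maximum(info, 1e-12))

def screen(X, time, event, d):
    """Keep the indices of the d covariates with the largest utility."""
    util = marginal_cox_scores(X, time, event)
    return np.argsort(util)[::-1][:d]

# toy usage: only the first two covariates affect the hazard
rng = np.random.default_rng(3)
n, p = 300, 2000
X = rng.normal(size=(n, p))
hazard = np.exp(0.8 * X[:, 0] - 0.8 * X[:, 1])
time = rng.exponential(1 / hazard)
cens = rng.exponential(2.0, size=n)
event = (time <= cens).astype(int)
time = np.minimum(time, cens)
print(screen(X, time, event, d=10))
```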
Fast tensorial JADE
We propose a novel method for tensorial independent component analysis. Our approach is based on TJADE and k-JADE, two recently proposed generalizations of the classical JADE algorithm. Our novel method achieves the consistency and the limiting distribution of TJADE under mild assumptions and at the same time offers a notable improvement in computational speed. Detailed mathematical proofs of the statistical properties of our method are given and, as a special case, a conjecture on the properties of k-JADE is resolved. Simulations and timing comparisons demonstrate a remarkable gain in speed. Moreover, the desired efficiency is obtained approximately for finite samples. The method is applied successfully to large-scale video data, for which neither TJADE nor k-JADE is feasible. Finally, an experimental procedure is proposed for selecting the values of a set of tuning parameters. Supplementary material, including the R code for running the examples and the proofs of the theoretical results, is available online.
Combined multiple testing of multivariate survival times by censored empirical likelihood
In any study testing the survival experience of one or more populations, one must choose not only an appropriate class of tests but also an appropriate weight function. Since the optimal choice depends on the true shape of the hazard ratio, one often cannot obtain the best results for a specific dataset. For the univariate case, several methods have been proposed to overcome this problem. Nowadays, however, many datasets of interest contain multivariate observations. In this work, we propose a multivariate version of a method based on multiple constrained censored empirical likelihood, where the constraints are formulated as linear functionals of the cumulative hazard functions. By considering the conditional hazards, we take the correlation between the components into account, with the goal of obtaining a test that exhibits high power irrespective of the shape of the hazard ratio under the alternative hypothesis.
Asymptotic theory and inference of predictive mean matching imputation using a superpopulation model framework
Predictive mean matching imputation is popular for handling item nonresponse in survey sampling. In this article, we study the asymptotic properties of the predictive mean matching estimator for finite-population inference using a superpopulation model framework. We also clarify conditions for its robustness. For variance estimation, the conventional bootstrap is invalid for matching estimators with a fixed number of matches, owing to the nonsmooth nature of the matching estimator. We propose a new replication variance estimator, which is asymptotically valid. The key strategy is to construct replicates directly from the linear terms of the martingale representation of the matching estimator, rather than from individual records of variables. Simulation studies confirm that the proposed method provides valid inference.
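For readers unfamiliar with the imputation step itself, the sketch below shows textbook predictive mean matching with a fixed number of matches: fit a regression on respondents, predict a mean for everyone, and impute each nonrespondent with the observed value of a donor whose predicted mean is closest. It is a generic illustration only and does not implement the paper's replication variance estimator; names and defaults are assumptions.

```python
import numpy as np

def pmm_impute(X, y, observed, k=1, rng=None):
    """Sketch of predictive mean matching: regress y on X among respondents,
    then impute each nonrespondent with the observed y of a donor drawn from
    the k respondents with the closest predicted means (k = 1 gives the
    fixed single-match estimator discussed in the abstract)."""
    rng = np.random.default_rng() if rng is None else rng
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd[observed], y[observed], rcond=None)
    yhat = Xd @ beta
    donors = np.flatnonzero(observed)
    y_imp = y.copy()
    for i in np.flatnonzero(~observed):
        dist = np.abs(yhat[donors] - yhat[i])
        nearest = donors[np.argsort(dist)[:k]]
        y_imp[i] = y[rng.choice(nearest)]          # donate an observed value
    return y_imp

# toy usage with roughly 30% item nonresponse
rng = np.random.default_rng(4)
n = 1000
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)
observed = rng.random(n) < 0.7
y_completed = pmm_impute(X, np.where(observed, y, np.nan), observed, k=5, rng=rng)
print(np.mean(np.abs(y_completed - y)[~observed]))   # average imputation error
```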
Stochastic Functional Estimates in Longitudinal Models with Interval-Censored Anchoring Events
Timelines of longitudinal studies are often anchored by specific events. When the anchoring event times are not fully observed, the study timeline becomes undefined, and the traditional longitudinal analysis loses its temporal reference. In this paper, we considered an analytical situation where the anchoring events are interval-censored. We demonstrated that by expressing the regression parameter estimators as stochastic functionals of a plug-in estimate of the unknown anchoring event time distribution, the standard longitudinal models could be extended to accommodate less well-defined timelines. We showed that for a broad class of longitudinal models, the functional parameter estimates are consistent and asymptotically normally distributed with a convergence rate under mild regularity conditions. Applying the developed theory to linear mixed-effects models, we further proposed a hybrid computational procedure that combines the strengths of the Fisher scoring method and the expectation-maximization (EM) algorithm for model parameter estimation. We conducted a simulation study to validate the asymptotic properties and to assess the finite-sample performance of the proposed method. A real data analysis was used to illustrate the proposed method. The method fills a gap in the existing longitudinal analysis methodology for data with less well-defined timelines.
Regression analysis of longitudinal data with outcome-dependent sampling and informative censoring
We consider regression analysis of longitudinal data in the presence of outcome-dependent observation times and informative censoring. Existing approaches commonly require correct specification of the joint distribution of the longitudinal measurements, observation time process and informative censoring time under the joint modeling framework, and can be computationally cumbersome due to the complex form of the likelihood function. In view of these issues, we propose a semi-parametric joint regression model and construct a composite likelihood function based on a conditional order statistics argument. As a major feature of our proposed methods, the aforementioned joint distribution is not required to be specified and the random effect in the proposed joint model is treated as a nuisance parameter. Consequently, the derived composite likelihood bypasses the need to integrate over the random effect and offers the advantage of easy computation. We show that the resulting estimators are consistent and asymptotically normal. We use simulation studies to evaluate the finite-sample performance of the proposed method, and apply it to a study of weight loss data that motivated our investigation.
GMM nonparametric correction methods for logistic regression with error contaminated covariates and partially observed instrumental variables
We consider logistic regression with covariate measurement error. Most existing approaches require certain replicates of the error-contaminated covariates, which may not be available in the data. We propose generalized method of moments (GMM) nonparametric correction approaches that use instrumental variables observed in a calibration subsample. The instrumental variable is related to the underlying true covariates through a general nonparametric model, and the probability of being in the calibration subsample may depend on the observed variables. We first take a simple approach that adopts the inverse selection probability weighting technique using the calibration subsample. We then improve upon this approach with a GMM estimator that uses the whole sample. The asymptotic properties are derived, and the finite-sample performance is evaluated through simulation studies and an application to a real data set.
An Additive-Multiplicative Mean Model for Panel Count Data with Dependent Observation and Dropout Processes
This paper discusses regression analysis of panel count data with dependent observation and dropout processes. For this problem, a general mean model is presented that allows both additive and multiplicative effects of covariates on the underlying point process. In addition, the proportional rates model and the accelerated failure time model are employed to describe possible covariate effects on the observation process and on the dropout or follow-up process, respectively. For estimation of the regression parameters, estimating equation-based procedures are developed, and the asymptotic properties of the proposed estimators are established. In addition, a resampling approach is proposed for estimating the covariance matrix of the proposed estimator, and a model-checking procedure is also provided. Results from an extensive simulation study indicate that the proposed methodology works well in practical situations, and it is applied to a motivating set of real data.
Hard thresholding regression
In this paper, we propose hard thresholding regression (HTR) for estimating high-dimensional sparse linear regression models. HTR uses a two-stage convex algorithm to approximate the ℓ0-penalized regression: the first stage calculates a coarse initial estimator, and the second stage identifies the oracle estimator by borrowing information from the first one. Theoretically, the HTR estimator achieves the strong oracle property over a wide range of regularization parameters. Numerical examples and a real data example lend further support to our proposed methodology.
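The two-stage structure can be illustrated with the simplified sketch below: stage one computes a coarse convex (lasso) initial estimator, and stage two hard-thresholds it and refits least squares on the retained support to mimic the oracle estimator. The paper's actual second stage is a different convex program; this thresholding-plus-refit variant, and the tuning values, are stand-ins for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def htr_sketch(X, y, alpha=0.1, tau=0.2):
    """Two-stage sketch in the spirit of hard thresholding regression:
    stage 1: coarse convex initial estimator (lasso);
    stage 2: hard-threshold at tau, then refit OLS on the kept support."""
    beta_init = Lasso(alpha=alpha).fit(X, y).coef_
    support = np.flatnonzero(np.abs(beta_init) > tau)   # hard threshold
    beta = np.zeros(X.shape[1])
    if support.size:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta[support] = coef
    return beta

# toy usage: 3 true signals among 1000 covariates
rng = np.random.default_rng(5)
n, p = 200, 1000
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)
print(np.flatnonzero(htr_sketch(X, y)))   # ideally the first three indices
```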
Learning from a lot: Empirical Bayes for high-dimensional model-based prediction
Empirical Bayes is a versatile approach to "learn from a lot" in two ways: first, from a large number of variables and, second, from a potentially large amount of prior information, for example, stored in public repositories. We review applications of a variety of empirical Bayes methods to several well-known model-based prediction methods, including penalized regression, linear discriminant analysis, and Bayesian models with sparse or dense priors. We discuss "formal" empirical Bayes methods that maximize the marginal likelihood but also more informal approaches based on other data summaries. We contrast empirical Bayes to cross-validation and full Bayes and discuss hybrid approaches. To study the relation between the quality of an empirical Bayes estimator and p, the number of variables, we consider a simple empirical Bayes estimator in a linear model setting. We argue that empirical Bayes is particularly useful when the prior contains multiple parameters, which model a priori information on variables termed "co-data". In particular, we present two novel examples that allow for co-data: first, a Bayesian spike-and-slab setting that facilitates inclusion of multiple co-data sources and types and, second, a hybrid empirical Bayes-full Bayes ridge regression approach for estimation of the posterior predictive interval.
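As a minimal illustration of "formal" empirical Bayes in the linear model setting mentioned above, the sketch below chooses the prior and noise variances of a ridge-type Gaussian prior by maximizing the marginal likelihood, which then fixes the ridge penalty. It is a generic single-parameter-prior example, not the co-data or spike-and-slab estimators reviewed in the paper, and the function name is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def eb_ridge(X, y):
    """Sketch of empirical Bayes ridge regression: with beta ~ N(0, tau^2 I)
    and noise N(0, sigma^2 I), maximize the marginal likelihood of
    y ~ N(0, tau^2 X X' + sigma^2 I) over (tau^2, sigma^2); the implied
    ridge penalty is lambda = sigma^2 / tau^2."""
    n = len(y)
    G = X @ X.T                                    # n x n Gram matrix

    def neg_log_marginal(log_params):
        tau2, sigma2 = np.exp(log_params)
        V = tau2 * G + sigma2 * np.eye(n)
        _, logdet = np.linalg.slogdet(V)
        return 0.5 * (logdet + y @ np.linalg.solve(V, y))

    res = minimize(neg_log_marginal, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
    tau2, sigma2 = np.exp(res.x)
    lam = sigma2 / tau2
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return beta, lam

# toy usage with p > n, where empirical Bayes tuning is most useful
rng = np.random.default_rng(6)
n, p = 100, 200
beta_true = rng.normal(scale=0.3, size=p)
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)
beta_hat, lam = eb_ridge(X, y)
print(round(lam, 2))
```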