Model-Free Conditional Independence Feature Screening For Ultrahigh Dimensional Data
Feature screening plays an important role in ultrahigh dimensional data analysis. This paper is concerned with conditional feature screening when one is interested in detecting the association between the response and ultrahigh dimensional predictors (e.g., genetic markers) given a low-dimensional exposure variable (such as clinical variables or environmental variables). To this end, we first propose a new index to measure conditional independence, and further develop a conditional screening procedure based on the newly proposed index. We systematically study the theoretical properties of the proposed procedure and establish the sure screening and ranking consistency properties under some very mild conditions. The newly proposed screening procedure enjoys some appealing properties: (a) it is model-free in that its implementation does not require specification of the model structure; (b) it is robust to heavy-tailed distributions or outliers in both the response and the predictors; and (c) it can deal with both feature screening and conditional screening in a unified way. We study the finite sample performance of the proposed procedure by Monte Carlo simulations and further illustrate the proposed method through two real data examples.
A selective overview of feature screening for ultrahigh-dimensional data
High-dimensional data have frequently been collected in many scientific areas including genome-wide association studies, biomedical imaging, tomography, tumor classification, and finance. Analysis of high-dimensional data poses many challenges for statisticians. Feature selection and variable selection are fundamental for high-dimensional data analysis. The sparsity principle, which assumes that only a small number of predictors contribute to the response, is frequently adopted and deemed useful in the analysis of high-dimensional data. Following this general principle, a large number of variable selection approaches via penalized least squares or likelihood have been developed in the recent literature to estimate a sparse model and select significant variables simultaneously. While the penalized variable selection methods have been successfully applied in many high-dimensional analyses, modern applications in areas such as genomics and proteomics push the dimensionality of data to an even larger scale, where the dimension of data may grow exponentially with the sample size. Such data have been called ultrahigh-dimensional data in the literature. This work aims to present a selective overview of feature screening procedures for ultrahigh-dimensional data. We focus on insights into how to construct marginal utilities for feature screening on specific models and on the motivation for model-free feature screening procedures.
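The idea of ranking predictors by a marginal utility can be illustrated with a minimal sketch. It assumes the simplest utility, the absolute Pearson correlation between each predictor and the response, which is tied to a linear model and is not one of the model-free utilities the overview motivates. The function names (`pearson`, `marginal_screen`, `simulate`) and the simulated data are hypothetical and chosen only for illustration.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no marginal signal
    return sxy / math.sqrt(sxx * syy)


def marginal_screen(columns, y, keep):
    """Rank predictors by |correlation with y| and keep the top `keep`.

    `columns` is a list of p predictor columns, each of length n.
    Returns the indices of the retained predictors, strongest first.
    """
    utilities = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: utilities[j], reverse=True)
    return ranked[:keep]


def simulate(n=200, p=500, seed=0):
    """Toy sparse linear model (an assumption): only predictors 0 and 5 matter."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    y = [3 * columns[0][i] - 2 * columns[5][i] + 0.5 * rng.gauss(0, 1)
         for i in range(n)]
    return columns, y


if __name__ == "__main__":
    columns, y = simulate()
    print(marginal_screen(columns, y, keep=10))
```

Each predictor is scored on its own, so the cost grows linearly in the number of predictors. A correlation-based utility can miss predictors that act nonlinearly or only jointly with others, which is one reason model-free utilities are of interest.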
Robust estimation for partially linear models with large-dimensional covariates
We are concerned with robust estimation procedures to estimate the parameters in partially linear models with large-dimensional covariates. To enhance the interpretability, we suggest implementing a nonconcave regularization method in the robust estimation procedure to select important covariates from the linear component. We establish consistency for both the linear and the nonlinear components when the covariate dimension diverges at the rate of [Formula: see text], where n is the sample size. We show that the robust estimate of the linear component performs asymptotically as well as its oracle counterpart, which assumes that the baseline function and the unimportant covariates are known a priori. With a consistent estimator of the linear component, we estimate the nonparametric component by a robust local linear regression. It is proved that the robust estimate of the nonlinear component performs asymptotically as well as if the linear component were known in advance. Comprehensive simulation studies are carried out and an application is presented to examine the finite-sample performance of the proposed procedures.
A regression approach to ROC surface, with applications to Alzheimer's disease
We consider the estimation of three-dimensional ROC surfaces for continuous tests given covariates. Three-way ROC analysis is important in our motivating example, where patients with Alzheimer's disease are usually classified into three categories and should receive different category-specific medical treatment. There has been no discussion of how covariates affect three-way ROC analysis. We propose a regression framework induced from the relationship between test results and covariates. We consider several practical cases and the corresponding inference procedures. Simulations are conducted to validate our methodology. The application to the motivating example clearly illustrates the effects of age and sex on the accuracy of the Mini-Mental State Examination for Alzheimer's disease.
Dynamic Optimal Strategy for Monitoring Disease Recurrence
Surveillance to detect cancer recurrence is an important part of care for cancer survivors. In this paper we discuss the design of optimal strategies for early detection of disease recurrence based on each patient's distinct biomarker trajectory and periodically updated risk estimated in the setting of a prospective cohort study. We adopt a latent class joint model which considers a longitudinal biomarker process and an event process jointly, to address heterogeneity of patients and disease, to discover distinct biomarker trajectory patterns, to classify patients into different risk groups, and to predict the risk of disease recurrence. The model is used to develop a monitoring strategy that dynamically modifies the monitoring intervals according to patients' current risk derived from periodically updated biomarker measurements and other indicators of disease spread. The optimal biomarker assessment time is derived using a utility function. We develop an algorithm to apply the proposed strategy to monitoring of new patients after initial treatment. We illustrate the models and the derivation of the optimal strategy using simulated data from monitoring prostate cancer recurrence over a 5-year period.
A nonparametric estimation of the infection curve
Predicting the future course of an epidemic depends on being able to estimate the current numbers of infected individuals. However, while back-projection techniques allow reliable estimation of the numbers of infected individuals in the more distant past, they are less reliable in the recent past. We propose two new nonparametric methods to estimate the unobserved numbers of infected individuals in the recent past in an epidemic. The proposed methods are noniterative, easily computed and asymptotically normal with simple variance formulas. Simulations show that the proposed methods are much more robust and accurate than the existing back-projection method, especially for the recent past, which is our primary interest. We apply the proposed methods to the 2003 Severe Acute Respiratory Syndrome (SARS) epidemic in Hong Kong.
The spreading frontiers of avian-human influenza described by the free boundary
In this paper, a reaction-diffusion system is proposed to investigate avian-human influenza. Two free boundaries are introduced to describe the spreading frontiers of the avian influenza. Two time-dependent basic reproduction numbers are defined for the free boundary problem: one for birds with the avian influenza and one for humans with the mutant avian influenza. Properties of these two basic reproduction numbers are obtained. Sufficient conditions both for spreading and for vanishing of the avian influenza are given. It is shown that if the bird reproduction number at time 0 is less than 1 and the initial number of infected birds is small, the avian influenza vanishes in the bird world. Furthermore, if both reproduction numbers at time 0 are less than 1, the avian influenza vanishes in the bird and human worlds. In the case that the bird reproduction number at time 0 is less than 1 and the human reproduction number at time 0 is greater than 1, spreading of the mutant avian influenza in the human world is possible. It is also shown that if the bird reproduction number is at least 1 at any nonnegative time, the avian influenza spreads in the bird world.
Multi-arm covariate-adaptive randomization
Simultaneously investigating multiple treatments in a single study achieves considerable efficiency compared with traditional two-arm trials. Balancing treatment allocation for influential covariates has become increasingly important in today's clinical trials. The multi-arm covariate-adaptive randomized clinical trial is one of the most powerful tools to incorporate covariate information and multiple treatments in a single study. Pocock and Simon's procedure has been extended to the multi-arm case. However, the theoretical properties of multi-arm covariate-adaptive randomization have remained largely elusive for decades. In this paper, we propose a general framework for multi-arm covariate-adaptive designs which also includes the two-arm case, and establish the corresponding theory under widely satisfied conditions. The theoretical results provide new insights into the balance properties of covariate-adaptive randomization procedures and lay the foundations for most existing statistical inferences under two-arm covariate-adaptive randomization. Furthermore, they open the door to studying the theoretical properties of statistical inferences for clinical trials based on multi-arm covariate-adaptive randomization procedures.
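For readers unfamiliar with covariate-adaptive allocation, the following is a minimal sketch of a Pocock and Simon style minimization rule with more than two arms. It is not the general framework proposed in the paper. It assumes equal weights for all covariates, uses the range of arm counts within each covariate level as the imbalance measure, and assigns the best-balancing arm with probability `p_best`. The function names and the `p_best` parameter are hypothetical.

```python
import random
from collections import defaultdict


def new_counts(n_arms):
    """Per-arm patient counts, keyed by (covariate name, level)."""
    return defaultdict(lambda: [0] * n_arms)


def minimization_assign(patient, counts, n_arms, p_best=0.8, rng=random):
    """Assign one patient to an arm and update `counts` in place.

    `patient` maps covariate name -> observed level, e.g. {"sex": "F"}.
    The imbalance of a candidate arm is the sum, over the patient's
    covariate levels, of the range (max - min) of arm counts that would
    result from assigning the patient to that arm (assumed measure).
    """
    imbalance = []
    for arm in range(n_arms):
        total = 0
        for cov, level in patient.items():
            hypothetical = list(counts[(cov, level)])
            hypothetical[arm] += 1
            total += max(hypothetical) - min(hypothetical)
        imbalance.append(total)

    best = min(imbalance)
    best_arms = [a for a in range(n_arms) if imbalance[a] == best]
    other_arms = [a for a in range(n_arms) if imbalance[a] != best]

    # Biased coin: favour a best-balancing arm, otherwise pick another arm.
    if other_arms and rng.random() >= p_best:
        arm = rng.choice(other_arms)
    else:
        arm = rng.choice(best_arms)

    for cov, level in patient.items():
        counts[(cov, level)][arm] += 1
    return arm


if __name__ == "__main__":
    rng = random.Random(2024)
    counts = new_counts(3)
    for _ in range(90):
        patient = {"sex": rng.choice(["F", "M"]),
                   "age": rng.choice(["<50", ">=50"])}
        minimization_assign(patient, counts, n_arms=3, rng=rng)
    for key, per_arm in sorted(counts.items()):
        print(key, per_arm)
```

Setting `p_best=1.0` makes the rule deterministic apart from tie-breaking. Values below 1 keep some randomness in every assignment, which is the usual motivation for a biased coin.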