Estimation of multiple networks with common structures in heterogeneous subgroups
Network estimation has been a critical component of high-dimensional data analysis and can provide an understanding of underlying complex dependence structures. Among existing approaches, Gaussian graphical models (GGMs) have been highly popular. However, they have limitations caused by the homogeneous distribution assumption and by their applicability only to small-scale data. For example, cancers have various levels of unknown heterogeneity, and biological networks, which include thousands of molecular components, often differ across subgroups while also sharing some commonalities. In this article, we propose a new joint estimation approach for multiple networks with unknown sample heterogeneity, by decomposing the GGM into a collection of sparse regression problems. A reparameterization technique and a composite minimax concave penalty are introduced to effectively accommodate the information that is specific to, and common across, the networks of multiple subgroups. This makes the proposed estimator a significant advance over existing heterogeneous network analyses that directly regularize the GGM likelihood, with the added benefits of scale invariance, tuning insensitivity, and convexity of the optimization. The proposed analysis can be effectively realized using parallel computing. Estimation and selection consistency are rigorously established. The proposed approach allows the theoretical analysis to focus on independent network estimation only and has the significant advantage of being both theoretically and computationally applicable to large-scale data. Extensive numerical experiments with simulated data and the TCGA breast cancer data demonstrate the strong performance of the proposed approach in both subgroup and network identification.
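To make the regression decomposition concrete, the sketch below illustrates the general idea of estimating a GGM edge set through nodewise sparse regressions (in the style of Meinshausen and Buhlmann), with a plain lasso from glmnet standing in for the paper's reparameterized composite MCP penalty; it illustrates the decomposition only, not the proposed estimator.

    # Nodewise-regression sketch: regress each variable on the rest and read
    # edges off the nonzero coefficients. A plain cross-validated lasso
    # (glmnet) stands in for the paper's composite MCP penalty.
    library(glmnet)

    set.seed(1)
    n <- 200; p <- 30
    X <- matrix(rnorm(n * p), n, p)

    adj <- matrix(0, p, p)
    for (j in 1:p) {
      fit <- cv.glmnet(X[, -j], X[, j])                  # sparse regression of node j on the rest
      beta <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]
      adj[j, -j] <- as.numeric(beta != 0)                # nonzero coefficients = candidate edges
    }
    edges <- (adj == 1) & (t(adj) == 1)                  # "AND" rule symmetrizes the edge set
    sum(edges) / 2                                       # number of estimated edges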
Nonlinear sufficient dimension reduction for distribution-on-distribution regression
We introduce a new approach to nonlinear sufficient dimension reduction in cases where both the predictor and the response are distributional data, modeled as members of a metric space. Our key step is to build universal kernels (cc-universal) on the metric spaces, which results in reproducing kernel Hilbert spaces for the predictor and response that are rich enough to characterize the conditional independence that determines sufficient dimension reduction. For univariate distributions, we construct the universal kernel using the Wasserstein distance, while for multivariate distributions we resort to the sliced Wasserstein distance. The sliced Wasserstein distance ensures that the metric space possesses topological properties similar to those of the Wasserstein space, while also offering significant computational benefits. Numerical results based on synthetic data show that our method outperforms competing methods. The method is also applied to several datasets, including fertility and mortality data and Calgary temperature data.
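As an illustration of the computational appeal of slicing, the following minimal sketch computes a Monte Carlo sliced Wasserstein-2 distance between two multivariate samples; the projection count and the equal-sample-size simplification are choices made here for brevity, not details from the paper.

    # Monte Carlo sliced Wasserstein-2 distance between two d-dimensional
    # samples X and Y (equal sizes assumed for simplicity). Each random
    # direction reduces the problem to a 1-D Wasserstein distance, computed
    # from sorted projections (empirical quantiles).
    sliced_wasserstein <- function(X, Y, n_proj = 200) {
      d <- ncol(X)
      w2 <- numeric(n_proj)
      for (k in 1:n_proj) {
        theta <- rnorm(d); theta <- theta / sqrt(sum(theta^2))  # uniform direction on the sphere
        px <- sort(X %*% theta); py <- sort(Y %*% theta)        # 1-D projections
        w2[k] <- mean((px - py)^2)                              # squared 1-D Wasserstein-2
      }
      sqrt(mean(w2))                                            # average over random slices
    }

    set.seed(1)
    X <- matrix(rnorm(500 * 3), 500, 3)
    Y <- matrix(rnorm(500 * 3, mean = 0.5), 500, 3)
    sliced_wasserstein(X, Y)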
On singular values of large-dimensional lag-τ sample auto-correlation matrices
We study the limiting behavior of the singular values of the lag-τ sample auto-correlation matrix of a large-dimensional vector white noise process, the error term in the high-dimensional factor model. We establish the limiting spectral distribution (LSD) that characterizes its global spectrum, and derive the limit of its largest singular value. All asymptotic results are derived under the high-dimensional regime where the data dimension and the sample size go to infinity proportionally. Under mild assumptions, we show that the LSD of the lag-τ sample auto-correlation matrix is the same as that of the lag-τ sample auto-covariance matrix. Based on this asymptotic equivalence, we additionally show that the largest singular value of the lag-τ sample auto-correlation matrix converges almost surely to the right end point of the support of its LSD. Building on these results, we further propose two estimators of the total number of factors in a factor model that use lag-τ sample auto-correlation matrices. Our theoretical results are fully supported by numerical experiments.
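The quantity under study is easy to visualize: the sketch below forms a lag-τ sample auto-correlation matrix under a toy factor model and inspects its singular values, where the gap between a few large values and the bulk is the signal that estimators of the number of factors exploit; the paper's actual thresholding rules are not reproduced here.

    # Singular values of a lag-tau sample auto-correlation matrix under a toy
    # factor model with serially correlated factors.
    lag_autocorr_svals <- function(X, tau = 1) {
      n <- nrow(X)
      Xs <- scale(X)                            # column-standardize: correlations, not covariances
      S <- t(Xs[1:(n - tau), ]) %*% Xs[(tau + 1):n, ] / (n - tau)
      svd(S)$d
    }

    set.seed(1)
    n <- 500; p <- 100; r <- 2
    Fac <- matrix(0, n, r)                      # two AR(1) common factors
    for (i in 2:n) Fac[i, ] <- 0.8 * Fac[i - 1, ] + rnorm(r)
    Load <- matrix(rnorm(p * r), p, r)
    X <- Fac %*% t(Load) + matrix(rnorm(n * p, sd = 2), n, p)
    head(lag_autocorr_svals(X, tau = 1), 5)     # a few large values separate from the bulk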
Finite sample t-tests for high-dimensional means
When sample sizes are small, it becomes challenging for an asymptotic test requiring diverging sample sizes to maintain an accurate Type I error rate. In this paper, we consider one-sample, two-sample and ANOVA tests for mean vectors when data are high-dimensional but sample sizes are very small. We establish asymptotic t-distributions of the proposed t-statistics, which require only the data dimensionality to diverge while the sample sizes stay fixed and no smaller than 3. The proposed tests maintain accurate Type I error rates for a wide range of sample sizes and data dimensionality. Moreover, the tests are nonparametric and can be applied to data that are normally distributed or heavy-tailed. Simulation studies confirm the theoretical results for the tests. We also apply the proposed tests to an fMRI dataset to demonstrate the practical implementation of the methods.
Generating MCMC proposals by randomly rotating the regular simplex
We present the simplicial sampler, a class of parallel MCMC methods that generate and choose from multiple proposals at each iteration. The algorithm's multiproposal randomly rotates a simplex connected to the current Markov chain state in a way that inherently preserves symmetry between proposals. As a result, the simplicial sampler enjoys a simplified acceptance step: it chooses among the simplex nodes with probability proportional to their target density values. We also investigate a multivariate Gaussian-based symmetric multiproposal mechanism and prove that it enjoys the same simplified acceptance step. This insight leads to significant theoretical and practical speedups. While both algorithms enjoy natural parallelizability, we show that conventional implementations are sufficient to confer efficiency gains across an array of dimensions and a number of target distributions.
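The following sketch codes one plausible reading of the mechanism described above: treat the current state as a vertex of a randomly rotated regular simplex and select among all vertices with probability proportional to the target density. The exact placement of the current state within the simplex is an assumption made here for illustration, not a detail taken from the paper.

    # Plausible simplicial-sampler sketch (illustrative reading, see lead-in).
    regular_simplex <- function(d) {
      # d+1 vertices of a regular simplex centered at the origin of R^d,
      # obtained by centering the standard basis of R^(d+1) and dropping a dimension
      E <- diag(d + 1)
      C <- sweep(E, 2, colMeans(E))
      B <- svd(C)$v[, 1:d]
      V <- C %*% B
      V / sqrt(sum(V[1, ]^2))                    # normalize vertex radius to 1
    }

    simplicial_step <- function(x, log_target, radius = 1) {
      d <- length(x)
      V <- regular_simplex(d) * radius
      Q <- qr.Q(qr(matrix(rnorm(d * d), d, d)))  # random orthogonal matrix (rotation/reflection)
      W <- sweep(V, 2, V[1, ]) %*% t(Q)          # put vertex 1 at the origin, then rotate
      cand <- sweep(W, 2, x, "+")                # translate so vertex 1 coincides with x
      lp <- apply(cand, 1, log_target)
      w <- exp(lp - max(lp))                     # weights proportional to target density
      cand[sample(d + 1, 1, prob = w), ]         # pick a vertex (current state included)
    }

    set.seed(1)
    log_target <- function(x) -0.5 * sum(x^2)    # standard Gaussian target
    chain <- matrix(0, 5000, 2)
    for (i in 2:5000) chain[i, ] <- simplicial_step(chain[i - 1, ], log_target, radius = 1.6)
    colMeans(chain)                              # should be near (0, 0)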
Functional delta residuals and applications to simultaneous confidence bands of moment based statistics
Given a functional central limit theorem (fCLT) for an estimator and a parameter transformation, we construct random processes, called functional delta residuals, which asymptotically have the same covariance structure as the limit process of the functional delta method. We give an explicit construction of these residuals for transformations of moment-based estimators and prove a multiplier bootstrap fCLT for them. The latter is used to consistently estimate the quantiles of the maximum of the limit process of the functional delta method, in order to construct asymptotically valid simultaneous confidence bands for the transformed functional parameters. The coverage performance of the construction, applied to functional versions of Cohen's d, skewness, and kurtosis, is illustrated in simulations, and its application to testing Gaussianity is discussed.
Inference in Functional Linear Quantile Regression
In this paper, we study statistical inference in functional quantile regression for a scalar response and a functional covariate. Specifically, we consider a functional linear quantile regression model in which the effect of the covariate on a quantile of the response is modeled through the inner product between the functional covariate and an unknown smooth regression parameter function that varies with the quantile level. The objective is to test whether the regression parameter function is constant across several quantile levels of interest. The parameter function is estimated by combining ideas from functional principal component analysis and quantile regression. An adjusted Wald testing procedure is proposed for this hypothesis, and its chi-square asymptotic null distribution is derived. The testing procedure is investigated numerically in simulations involving sparse and noisy functional covariates and in a capital bike share data application. The proposed approach is easy to implement, and the R code is published online at https://github.com/xylimeng/fQR-testing.
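A minimal sketch of the estimation idea follows: project the discretized curves onto leading principal components and fit quantile regressions of the response on the scores at several levels, using quantreg::rq. The paper's adjusted Wald test is not reproduced; the data and the number of retained components below are illustrative choices.

    # FPC scores of the functional covariate entering a quantile regression.
    library(quantreg)

    set.seed(1)
    n <- 150; m <- 50
    tgrid <- seq(0, 1, length.out = m)
    X <- matrix(rnorm(n * m), n, m)
    X <- t(apply(X, 1, cumsum)) / sqrt(m)              # smooth-ish random trajectories
    beta_fun <- sin(2 * pi * tgrid)                    # a coefficient function
    y <- drop(X %*% beta_fun) / m + rnorm(n, sd = 0.5) # scalar response

    pc <- prcomp(X)                                    # FPCA via PCA of the discretized curves
    scores <- pc$x[, 1:4]                              # retain four components
    fit <- rq(y ~ scores, tau = c(0.25, 0.5, 0.75))    # quantile regression on FPC scores
    coef(fit)                                          # one coefficient vector per quantile level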
Tangent functional canonical correlation analysis for densities and shapes, with applications to multimodal imaging data
It is quite common for functional data arising from imaging data to assume values in infinite-dimensional manifolds. Uncovering associations between two or more such nonlinear functional data objects extracted from the same subject across medical imaging modalities can assist the development of personalized treatment strategies. We propose a method for canonical correlation analysis between paired probability densities or shapes of closed planar curves, both routinely used in biomedical studies, which combines a convenient linearization and dimension reduction of the data using tangent space coordinates. Leveraging the fact that the corresponding manifolds are submanifolds of unit Hilbert spheres, we describe how finite-dimensional representations of the functional data objects can be easily computed, which then facilitates the use of standard multivariate canonical correlation analysis methods. We further construct and visualize canonical variate directions directly on the space of densities or shapes. Utility of the method is demonstrated through numerical simulations and performance on a magnetic resonance imaging dataset of glioblastoma multiforme brain tumors.
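The linearization step can be sketched directly: square-root densities lie on the unit Hilbert sphere, so log-map coordinates at a base point give finite-dimensional representations to which standard cancor applies. In the sketch below the base point is simply the normalized mean square-root density, a stand-in for a Fréchet mean, and the toy "modalities" share a latent signal.

    # Tangent-space (log-map) coordinates for densities, then standard CCA.
    sphere_log <- function(p, q) {
      # log map on the unit sphere: tangent vector at p pointing toward q
      ip <- sum(p * q)
      u <- q - ip * p
      nu <- sqrt(sum(u^2))
      if (nu < 1e-12) return(0 * p)
      acos(pmin(pmax(ip, -1), 1)) * u / nu
    }

    tangent_coords <- function(D) {
      # rows of D: densities on a common grid -> tangent-space coordinates
      S <- sqrt(D / rowSums(D))                 # square-root densities
      S <- S / sqrt(rowSums(S^2))               # unit vectors on the sphere
      base <- colMeans(S); base <- base / sqrt(sum(base^2))
      t(apply(S, 1, function(q) sphere_log(base, q)))
    }

    set.seed(1)
    grid <- seq(-4, 4, length.out = 64)
    mu <- rnorm(100)                            # shared latent signal across modalities
    D1 <- t(sapply(mu, function(m) dnorm(grid, m, 1)))
    D2 <- t(sapply(mu, function(m) dnorm(grid, 0.8 * m, 1.5)))
    T1 <- prcomp(tangent_coords(D1))$x[, 1:5]   # reduce dimension before CCA
    T2 <- prcomp(tangent_coords(D2))$x[, 1:5]
    cancor(T1, T2)$cor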
Biclustering analysis of functionals via penalized fusion
In biomedical data analysis, clustering is commonly conducted. Biclustering analysis conducts clustering in both the sample and covariate dimensions and can describe data heterogeneity more comprehensively. Most existing biclustering analyses consider scalar measurements. In this study, motivated by time-course gene expression data and other examples, we take the "natural next step" and consider the biclustering analysis of functionals, under which, for each covariate of each sample, a function (to be exact, its values at discrete measurement points) is present. We develop a doubly penalized fusion approach, which includes a smoothness penalty for estimating functionals and, more importantly, a fusion penalty for clustering. Statistical properties are rigorously established, providing the proposed approach with a solid theoretical grounding. We also develop an effective ADMM algorithm and accompanying R code. Numerical analysis, including simulations, comparisons, and the analysis of two time-course gene expression datasets, demonstrates the practical effectiveness of the proposed approach.
From multivariate to functional data analysis: fundamentals, recent developments, and emerging areas
Functional data analysis (FDA), a branch of statistics that models infinite-dimensional random vectors residing in functional spaces, has become a major research area. We review some fundamental concepts of FDA, their origins in and connections to multivariate analysis, and some recent developments, including multi-level functional data analysis, high-dimensional functional regression, and dependent functional data analysis. We also discuss the impact of these methodological developments on genetics, plant science, wearable device data analysis, image data analysis, and business analytics. Two real data examples are provided to motivate our discussions.
High Dimensional Change Point Inference: Recent Developments and Extensions
Change point analysis aims to detect structural changes in a data sequence and has been an active research area since its introduction in the 1950s. In modern statistical applications, however, high-throughput data with increasing dimensions are ubiquitous in fields ranging from economics and finance to genetics and engineering, and earlier methods are typically no longer applicable. As a result, testing for a change point in high dimensional data sequences has become an important yet challenging task. In this paper, we first focus on models with at most one change point, reviewing recent state-of-the-art techniques for change point testing of high dimensional mean vectors and comparing their theoretical properties. Building on that, we provide a survey of extensions to general high dimensional parameters beyond mean vectors, as well as strategies for testing multiple change points in high dimensions. Finally, we discuss some open problems for possible future research directions.
Nonparametric spectral methods for multivariate spatial and spatial-temporal data
We propose computationally efficient methods for estimating stationary multivariate spatial and spatial-temporal spectra from incomplete gridded data. The methods are iterative and rely on successive imputation of data and updating of model estimates. Imputations are done according to a periodic model on an expanded domain. The periodicity of the imputations is a key feature that reduces edge effects in the periodogram and is facilitated by efficient circulant embedding techniques. In addition, we describe efficient methods for decomposing the estimated cross-spectral density function into a linear model of coregionalization plus a residual process. The methods are applied to two storm datasets, one from Hurricane Florence, which struck the southeastern United States in September 2018. The application demonstrates how fitted models from different datasets can be compared, and how the methods remain computationally feasible on datasets with more than 200,000 total observations.
Canonical correlation analysis for elliptical copulas
Canonical correlation analysis (CCA) is a common method used to estimate the associations between two different sets of variables by maximizing the Pearson correlation between linear combinations of the two sets of variables. We propose a version of CCA for transelliptical distributions with an elliptical copula, using pairwise Kendall's tau to estimate a latent scatter matrix. Because Kendall's tau relies only on the ranks of the data, this method makes no assumptions about the marginal distributions of the variables and is valid even when moments do not exist. We establish consistency and asymptotic normality for canonical directions and correlations estimated using Kendall's tau. Simulations indicate that this estimator outperforms standard CCA for data generated from heavy-tailed elliptical distributions. Our method also identifies more meaningful relationships when the marginal distributions are skewed. We also propose a method for testing for non-zero canonical correlations using bootstrap methods. This testing procedure does not require any assumptions on the joint distribution of the variables and works for all elliptical copulas, in contrast to permutation tests, which are only valid when data are generated from a distribution with a Gaussian copula. The method's practical utility is shown in an analysis of the association between radial diffusivity in white matter tracts and cognitive test scores for six-year-old children from the Early Brain Development Study at UNC-Chapel Hill. An R package implementing this method is available at github.com/blangworthy/transCCA.
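The core of the estimator is compact enough to sketch: pairwise Kendall's tau, the sine transform sin(pi*tau/2) to a latent scatter matrix, and canonical correlations from its blocks. The authors' transCCA package is the full implementation; the toy data below, with skewed and monotonically distorted margins, are illustrative only.

    # Rank-based CCA via Kendall's tau and the sine transform.
    kendall_cca <- function(X, Y) {
      Z <- cbind(X, Y)
      p <- ncol(X); q <- ncol(Y)
      Tau <- cor(Z, method = "kendall")      # pairwise Kendall's tau
      S <- sin(pi * Tau / 2)                 # latent scatter under an elliptical copula
      Sxx <- S[1:p, 1:p]; Syy <- S[p + 1:q, p + 1:q]; Sxy <- S[1:p, p + 1:q]
      M <- solve(Sxx, Sxy) %*% solve(Syy, t(Sxy))
      sqrt(pmax(Re(eigen(M)$values), 0))     # canonical correlations
    }

    set.seed(1)
    n <- 300
    z <- matrix(rnorm(n * 2), n, 2)
    X <- exp(cbind(z[, 1], rnorm(n)))                  # heavily skewed margins
    Y <- cbind(z[, 1] + 0.5 * rnorm(n), rnorm(n))
    Y[, 1] <- Y[, 1]^3                                 # monotone distortion leaves tau unchanged
    kendall_cca(X, Y)                                  # margin-free estimate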
Variable selection for partially linear models via Bayesian subset modeling with diffusing prior
Most existing methods of variable selection in partially linear models (PLM) with ultrahigh dimensional covariates are based on partial residuals, which involve a two-step estimation procedure. The estimation error produced in the first step may affect the second step, and multicollinearity among predictors adds further challenges to the model selection procedure. In this paper, we propose a new Bayesian variable selection approach for PLM that addresses these two issues simultaneously: (1) it is a one-step method that selects variables in PLM even when the dimension of covariates increases at an exponential rate with the sample size, and (2) it retains model selection consistency and outperforms existing methods in the setting of highly correlated predictors. Distinguishing it from existing approaches, our procedure employs the difference-based method to reduce the impact of estimating the nonparametric component, and incorporates Bayesian subset modeling with a diffusing prior (BSM-DP) to shrink the corresponding estimator in the linear component. The estimation is implemented by Gibbs sampling, and we prove that the posterior probability of the true model being selected converges to one asymptotically. Simulation studies support the theory and the efficiency of our method relative to existing ones, followed by an application to supermarket data.
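The difference-based device can be sketched in a few lines: sorting by the nonparametric covariate and first-differencing approximately cancels the smooth component, leaving a linear model in the differenced covariates. An ordinary least squares fit stands in below for the paper's Bayesian subset modeling; the data are a toy example.

    # Difference-based idea for a partially linear model y = X b + f(t) + e.
    set.seed(1)
    n <- 300; p <- 5
    t <- runif(n)
    X <- matrix(rnorm(n * p), n, p)
    b <- c(2, -1, 0, 0, 0)
    y <- drop(X %*% b) + sin(2 * pi * t) + rnorm(n, sd = 0.3)

    o <- order(t)                          # sort by the nonparametric covariate
    dy <- diff(y[o]); dX <- diff(X[o, ])   # differencing nearly cancels the smooth f(t)
    coef(lm(dy ~ dX - 1))                  # close to the true linear coefficients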
Sampling Properties of color Independent Component Analysis
Independent Component Analysis (ICA) offers an effective data-driven approach for blind source extraction encountered in many signal and image processing problems. Although many ICA methods have been developed, they have received relatively little attention in the statistics literature, especially in terms of rigorous theoretical investigation for statistical inference. The current paper aims at narrowing this gap and investigates the statistical sampling properties of the colorICA (cICA) method. The cICA incorporates the correlation structure within sources through parametric time series models in the frequency domain and outperforms several existing ICA alternatives numerically. We establish the consistency and asymptotic normality of the cICA estimates, which then enables statistical inference based on the estimates. These asymptotic properties are further validated using simulation studies.
Surface Functional Models
The aim of this paper is to develop a new framework of surface functional models (SFM) for surface functional data, which contain repeated observations in two domains (typically, time and location). The primary problem of interest is to investigate the relationship between a response and the two domains, where the numbers of observations in both domains within a subject may be diverging. SFMs go far beyond multivariate functional models with two-dimensional predictor variables: features of surface functional data, such as possibly distinct sampling designs and dependence between the two domains, make our models more complex than existing ones. We provide a comprehensive investigation of the asymptotic properties of the local linear estimator of the mean function based on a general weighting scheme, which includes equal weight (EW), direction-to-denseness weight (DDW) and subject-to-denseness weight (SDW) as special cases. Moreover, we mathematically categorize surface data into nine cases according to the sampling designs (sparse, dense, and ultra-dense) of the two domains, essentially based on the relative order of the number of observations in each domain to the sample size. We derive the specific asymptotic theory and optimal bandwidth orders in each of the nine sampling design cases under all three weighting schemes, and the three weighting schemes are compared theoretically and numerically. We also examine the finite-sample performance of the estimators through simulation studies and an autism study involving white-matter fiber skeletons.
Distributed Simultaneous Inference in Generalized Linear Models via Confidence Distribution
We propose a distributed method for simultaneous inference for datasets with sample size much larger than the number of covariates, i.e., n ≫ p, in the generalized linear models framework. When such datasets are too big to be analyzed entirely by a single centralized computer, or when datasets are already stored in distributed database systems, the strategy of divide-and-combine has been the method of choice for scalability. Due to partitioning, the sub-dataset sample sizes may be uneven, with some possibly close to p, which calls for regularization techniques to improve numerical stability. However, there is a lack of clear theoretical justification and practical guidelines for combining results obtained from separate regularized estimators, especially when the final objective is simultaneous inference for a group of regression parameters. In this paper, we develop a strategy to combine bias-corrected lasso-type estimates by using confidence distributions. We show that the resulting combined estimator achieves the same estimation efficiency as the maximum likelihood estimator using the centralized data. As demonstrated by simulated and real data examples, our divide-and-combine method yields nearly identical inference to the centralized benchmark.
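Under Gaussian limits, combining by confidence distributions reduces to inverse-variance weighting of the per-block estimates, which the sketch below illustrates with plain per-block GLM fits standing in for the paper's bias-corrected lasso estimators; the block count and data are illustrative.

    # Divide-and-combine sketch: per-block logistic fits combined by
    # inverse-variance (precision) weighting, then compared to the
    # centralized fit.
    set.seed(1)
    n <- 20000; p <- 5; K <- 10
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(1, -0.5, 0.25, 0, 0)
    y <- rbinom(n, 1, plogis(drop(X %*% beta)))
    blocks <- split(1:n, rep(1:K, length.out = n))

    est <- matrix(0, K, p); prec <- array(0, c(K, p, p))
    for (k in 1:K) {
      idx <- blocks[[k]]
      fit <- glm(y[idx] ~ X[idx, ] - 1, family = binomial)
      est[k, ] <- coef(fit)
      prec[k, , ] <- solve(vcov(fit))        # per-block precision matrix
    }
    P <- apply(prec, c(2, 3), sum)           # combined precision
    b <- solve(P, rowSums(sapply(1:K, function(k) prec[k, , ] %*% est[k, ])))
    rbind(combined = b, centralized = coef(glm(y ~ X - 1, family = binomial)))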
Generalized Linear Mixed Models with Gaussian Mixture Random Effects: Inference and Application
We propose a new class of generalized linear mixed models with Gaussian mixture random effects for clustered data. To overcome weak identifiability issues, we fit the model using a penalized Expectation-Maximization (EM) algorithm and develop sequential locally restricted likelihood ratio tests to determine the number of components in the Gaussian mixture. Our work is motivated by an application to nationwide kidney transplant center evaluation in the United States, where the patient-level post-surgery outcomes are repeated measures of the care quality of the transplant centers. By taking patient-level risk factors into account and modeling the center effects with a finite Gaussian mixture, the proposed model provides a convenient framework for studying the heterogeneity among transplant centers and controls the false discovery rate when screening for transplant centers with non-standard performance.
Model-based clustering of time-evolving networks through temporal exponential-family random graph models
Dynamic networks are a general language for describing time-evolving complex systems, and discrete-time network models provide an emerging statistical technique for various applications. Detecting a set of nodes that share similar connectivity patterns in time-evolving networks is a fundamental research question. Our work is primarily motivated by detecting groups based on interesting features of the time-evolving networks (e.g., stability). We propose a model-based clustering framework for time-evolving networks based on discrete-time exponential-family random graph models, which simultaneously allows both modeling and detecting group structure. To choose the number of groups, we use the conditional likelihood to construct an effective model selection criterion. Furthermore, we propose an efficient variational expectation-maximization (EM) algorithm to find approximate maximum likelihood estimates of network parameters and mixing proportions. The power of our method is demonstrated in simulation studies and in empirical applications to international trade networks and the collaboration networks of a large research university.
Asymptotic properties of principal component analysis and shrinkage-bias adjustment under the generalized spiked population model
With the development of high-throughput technologies, principal component analysis (PCA) in the high-dimensional regime is of great interest. Most of the existing theoretical and methodological results for high-dimensional PCA are based on the spiked population model in which all the population eigenvalues are equal except for a few large ones. Due to the presence of local correlation among features, however, this assumption may not be satisfied in many real-world datasets. To address this issue, we investigate the asymptotic behavior of PCA under the generalized spiked population model. Based on our theoretical results, we propose a series of methods for the consistent estimation of population eigenvalues, angles between the sample and population eigenvectors, correlation coefficients between the sample and population principal component (PC) scores, and the shrinkage bias adjustment for the predicted PC scores. Using numerical experiments and real data examples from the genetics literature, we show that our methods can greatly reduce bias and improve prediction accuracy.
Graph-based sparse linear discriminant analysis for high-dimensional classification
Linear discriminant analysis (LDA) is a well-known classification technique that has enjoyed great success in practical applications. Despite its effectiveness for traditional low-dimensional problems, extensions of LDA are necessary to classify high-dimensional data. Many variants of LDA have been proposed in the literature. However, most of these methods do not fully incorporate structural information among the predictors when such information is available. In this paper, we introduce a new high-dimensional LDA technique, namely graph-based sparse LDA (GSLDA), that utilizes the graph structure among the features. In particular, we use the regularized regression formulation of penalized LDA and propose to impose a structure-based sparse penalty on the discriminant vector. The graph structure can be either given or estimated from the training data. Moreover, we explore the relationship between the within-class feature structure and the overall feature structure, and based on this relationship we further propose a variant of GSLDA that effectively utilizes unlabeled data, which can be abundant in the semi-supervised learning setting. With the new regularization, we obtain a sparse estimate of the discriminant vector and classifiers that are more accurate and interpretable than many existing methods. Both the selection consistency of the estimate and the convergence rate of the classifier are established, and the resulting classifier is shown to asymptotically attain the Bayes error rate. Finally, we demonstrate the competitive performance of the proposed GSLDA in both simulated and real data studies.
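The regression formulation with a graph penalty can be sketched via a standard augmentation trick: a graph-Laplacian smoothness term is folded into an augmented design so that an ordinary lasso (glmnet) yields a sparse, graph-smoothed discriminant direction. This illustrates the general graph-penalty idea rather than the exact GSLDA penalty; the chain graph and class means below are toy choices.

    # Binary LDA as a regression of class labels on features, with a
    # Laplacian penalty absorbed into an augmented design + lasso.
    library(glmnet)

    set.seed(1)
    n <- 120; p <- 40
    A <- matrix(0, p, p)
    for (j in 1:(p - 1)) A[j, j + 1] <- A[j + 1, j] <- 1  # chain graph among features
    L <- diag(rowSums(A)) - A                             # graph Laplacian
    y <- rep(c(-1, 1), each = n / 2)
    mu <- c(rep(0.8, 5), rep(0, p - 5))                   # first 5 (adjacent) features informative
    X <- matrix(rnorm(n * p), n, p) + outer(y, mu)

    gamma <- 1
    R <- chol(L + 1e-6 * diag(p))                         # t(R) %*% R ~ L, so |R b|^2 ~ b' L b
    Xa <- rbind(X, sqrt(gamma) * R)                       # augmented rows encode the graph penalty
    ya <- c(y, rep(0, p))
    fit <- cv.glmnet(Xa, ya, standardize = FALSE)
    b <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]    # sparse, graph-smoothed direction
    which(b != 0)                                         # ideally the adjacent informative features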