Asymptotic Performance of PCA for High-Dimensional Heteroscedastic Data


Principal Component Analysis (PCA) is a classical method for reducing the dimensionality of data by projecting them onto a subspace that captures most of their variation. Effective use of PCA in modern applications requires understanding its performance for data that are both high-dimensional (i.e., with dimension comparable to or larger than the number of samples) and heteroscedastic (i.e., with noise whose variance varies across samples, as occurs, for example, when some samples are outliers). This paper analyzes the statistical performance of PCA in this setting, that is, for high-dimensional data drawn from a low-dimensional subspace and degraded by heteroscedastic noise. We provide simple expressions for the asymptotic PCA recovery of the underlying subspace, subspace amplitudes, and subspace coefficients; the expressions enable both easy and efficient calculation of, and reasoning about, the performance of PCA. We exploit the structure of these expressions to show that asymptotic recovery for a fixed average noise variance is maximized when the noise variances are equal (i.e., when the noise is in fact homoscedastic). Hence, while average noise variance is often a practically convenient measure of the overall quality of data, it gives an overly optimistic estimate of the performance of PCA for heteroscedastic data.
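The setting described above is easy to simulate. The sketch below (not taken from the paper; the dimensions, amplitudes, noise levels, and the recovery metric ||Û^T U||_F^2 / k are all illustrative choices) plants a rank-3 subspace, adds Gaussian noise whose per-sample variance is either constant (homoscedastic) or split between low and high values with the same average (heteroscedastic), and compares how well PCA recovers the planted subspace in the two cases.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 200, 500, 3          # ambient dimension, number of samples, subspace dimension

# Planted subspace, amplitudes, and coefficients (illustrative values)
U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal basis of the true subspace
theta = np.array([4.0, 3.0, 2.0])                  # subspace amplitudes
Z = rng.standard_normal((k, n))                    # subspace coefficients

def pca_recovery(noise_var_per_sample):
    """Subspace recovery ||Uhat^T U||_F^2 / k; equals 1 for perfect recovery."""
    # Each column is one sample; noise variance may differ column to column.
    noise = rng.standard_normal((d, n)) * np.sqrt(noise_var_per_sample)
    Y = U @ (theta[:, None] * Z) + noise
    # PCA via SVD of the (scaled) data matrix; leading k left singular vectors
    Uhat, _, _ = np.linalg.svd(Y / np.sqrt(n), full_matrices=False)
    return np.linalg.norm(Uhat[:, :k].T @ U, "fro") ** 2 / k

avg_var = 1.0
homo = pca_recovery(np.full(n, avg_var))
# Heteroscedastic: half the samples are clean, half are noisy, same average variance
hetero = pca_recovery(np.r_[np.full(n // 2, 0.1), np.full(n - n // 2, 1.9)])
print(f"homoscedastic recovery:   {homo:.3f}")
print(f"heteroscedastic recovery: {hetero:.3f}")
```

Per the paper's result, the asymptotic recovery at a fixed average noise variance is maximized in the homoscedastic case, so the first number should typically come out at least as large as the second; at finite `n` the gap is subject to sampling fluctuations.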

Under Review