Motivation: A Problem in Psychometrics

Ever wondered how we can sort people into different personality groups? Starting with all the survey answers—a large table of numbers—researchers need to figure out which personality type best fits each person. It is like having a basket of mixed fruits and trying to sort them by kind, but instead of color and shape, we are working with numbers that represent how people answered questions about themselves.

Survey methodology diagram
Survey data encodes personality signals across many respondents and dimensions.

The mathematical challenge is clear: given all the survey answers from many people (data points collected in a matrix $X \in \mathbb{R}^{d \times n}$), we must determine which personality type corresponds to each individual (recover the labels $y \in \{1, \ldots, k\}$). This is naturally formulated through a linear model:

$$X = FW^\top$$

where $F$ represents the latent factors (personality types) and $W$ denotes the factor loadings, capturing how each dimension contributes.

The Radial Structure Observation

Radial structure visualization

Remarkably, in psychometrics the data often exhibits radial structure. Intuitively, the factorization looks like

$$X_i = Fw_i = [F_1, F_2, \ldots, F_k] \begin{pmatrix} 1 \\ \varepsilon_1 \\ \vdots \\ \varepsilon_{k-1} \end{pmatrix}$$

and most information can be captured by one factor at a time.
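To make this concrete, here is a small NumPy simulation of the factor model with radial loadings—the dimensions, the dominant-loading range, and the $\varepsilon = 0.05$ noise level are illustrative choices, not values from any real survey:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 50, 1000, 4          # survey dimensions, respondents, personality types

# Latent factors F: one orthonormal column per personality type.
F, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Radial loadings w_i: one dominant coordinate (the respondent's type),
# with tiny epsilon cross-loadings on the remaining factors.
labels = rng.integers(0, k, size=n)
W = 0.05 * rng.standard_normal((n, k))
W[np.arange(n), labels] = rng.uniform(1.0, 3.0, size=n)

X = F @ W.T                    # data matrix in R^{d x n}, close to a union of k lines
print(X.shape)
```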

Why PCA Falls Short

Your first instinct might be to reach for principal component analysis (PCA). But there is a fundamental catch: when the data has radial structure, the covariance matrix ends up looking nearly like a scaled identity:

$$\mathbb{E}\left[XX^\top\right] \approx \lambda\, \mathrm{Id}$$

When the covariance is nearly proportional to the identity, the principal components point in essentially random directions—they provide no information about the true latent structure.
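A quick numerical sanity check of this claim—an illustrative sketch using the simplest radial example, two orthogonal lines in the plane:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5000, 2
# Radial data in the plane: each point lies on one of two orthogonal lines.
labels = rng.integers(0, k, size=n)
radii = rng.standard_normal(n)
X = np.eye(2)[labels] * radii[:, None]        # shape (n, 2)

cov = X.T @ X / n
eigvals = np.linalg.eigvalsh(cov)
# Both eigenvalues are close to 0.5: the covariance is nearly a scaled
# identity, so the eigenvectors (principal components) are arbitrary.
print(np.round(eigvals, 2))
```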

PCA limitations on radial data

Classical Remedy: Varimax

The classical approach to handling radial structure is the Varimax rotation method. Starting from the PCA basis, it applies a rotation chosen to maximize sparsity of the factor loadings. Formally, it solves:

$$R^\star \in \operatorname{argmax}_{R\in O(k)}\sum_{j=1}^k \frac{1}{d}\sum_{i=1}^d\left([\Lambda R]_{ij}^4 - \left(\frac{1}{d}\sum_{l=1}^d[\Lambda R]_{lj}^2\right)^2\right)$$

where $\Lambda$ contains the principal components. By fitting to fourth-order moments, Varimax can capture richer structure than second-moment PCA alone.
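For reference, this criterion is typically optimized with Kaiser's classical SVD-based iteration; the sketch below is a generic implementation of that standard algorithm, not code from the paper:

```python
import numpy as np

def varimax(Lam, max_iter=100, tol=1e-8):
    """Rotate a loading matrix Lam (d, k) to maximize the varimax criterion.

    Kaiser's classical SVD-based iteration; returns the rotated loadings
    and the orthogonal rotation R.
    """
    d, k = Lam.shape
    R = np.eye(k)
    obj = 0.0
    for _ in range(max_iter):
        L = Lam @ R
        # SVD step: ascend the fourth-moment criterion via a Procrustes-like update.
        u, s, vt = np.linalg.svd(
            Lam.T @ (L**3 - (1.0 / d) * L * (L**2).sum(axis=0))
        )
        R = u @ vt
        if s.sum() - obj < tol * obj:
            break
        obj = s.sum()
    return Lam @ R, R
```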

Varimax performance limitations

Recent theoretical work by Rohe (2022) shows that Varimax's estimation error is bounded by $O_p(\Delta_n^{-0.24}\log^{3.75} n)$. While theoretically sound under several assumptions, practical applications often reveal its limitations—as shown in the figure.

Subspace Clustering and its Drawbacks

A more expressive alternative is subspace clustering, which recovers a union of subspaces capturing most of the data's variation. Our problem is a special case: find $k$ one-dimensional subspaces representing the $k$ factors.

Subspace clustering illustration
Subspace clustering reveals structure in data lying near a union of linear subspaces.

However, even the leading approach—Robust Subspace Clustering (Soltanolkotabi, 2014)—has significant drawbacks. It requires constructing an $n \times n$ similarity matrix and then extracting the top-$k$ eigenvectors in a second stage, making it computationally prohibitive at scale. Moreover, the recovered subspaces are not required to be orthogonal, so factors may correlate in undesirable ways.

Can we develop a method that is both scalable and statistically robust? Yes—this is the motivation for Geometric Factor Analysis (GFA).

PCA vs Varimax vs GFA comparison
Comparison of PCA, Varimax, and GFA on synthetic data with radial structure.

Method

Our key insight is that the data appears to be supported on a cross—a union of $k$ one-dimensional subspaces through the origin. This motivates finding the distribution in that family closest to the observed data under the Wasserstein-2 metric.

Let $\mathscr{P}_\times$ denote the set of probability distributions supported on a cross with $k$ factors. We seek:

$$\min_{\nu \in \mathscr{P}_\times} W_2^2(\mu, \nu)$$

While this is initially infinite-dimensional, a key reduction makes it tractable. Using the Stiefel manifold $\mathrm{St}(d,k) := \{U \in \mathbb{R}^{d \times k} : U^\top U = I_{k \times k}\}$, we can reformulate the objective equivalently as:

$$\min_{U \in \mathrm{St}(d,k)} W_2^2\!\left(\mu,\, P_U{}_\# \mu\right)$$

In practice, with finite samples $x_1, \ldots, x_n$ drawn i.i.d. from $\mu$, this becomes a concrete finite-dimensional Riemannian optimization problem:

$$\min_{U \in \mathrm{St}(d,k)} \sum_{i=1}^n \|x_i - \Pi_{\times_U}(x_i)\|^2$$

where $\Pi_{\times_U}$ denotes the projection onto the cross $\bigcup_{j=1}^k \mathrm{span}\{U_j\}$, i.e., each sample is projected onto the nearest of the $k$ lines.

GFA methodology illustration
Projecting data onto the optimal cross structure via Stiefel manifold optimization.
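Concretely, the projection onto the cross sends each sample to the line whose direction has the largest absolute inner product with it. A minimal NumPy sketch of the empirical objective (samples stored as rows here; the function names are ours):

```python
import numpy as np

def project_onto_cross(X, U):
    """Project each row of X (n, d) onto the nearest of the k lines
    span{U[:, j]}, where U has orthonormal columns."""
    coords = X @ U                          # (n, k) inner products <x_i, U_j>
    best = np.abs(coords).argmax(axis=1)    # nearest line = largest |<x, U_j>|
    n = X.shape[0]
    return coords[np.arange(n), best][:, None] * U[:, best].T

def cross_loss(X, U):
    """Total squared residual of the data after projection onto the cross."""
    return float(((X - project_onto_cross(X, U)) ** 2).sum())
```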

Optimization Guarantees

We analyze the loss $\mathscr{E}(U) = W_2^2(\mu, P_U{}_\#\mu)$ under a structural assumption on the data distribution.

Assumption

Let $\mu \in \mathscr{P}_{2,ac}(\mathbb{R}^d)$ be a convolved measure $\mu = \nu \star \tau\Phi$—a signal $\nu$ corrupted by isotropic noise $\Phi$ at scale $\tau$—where $\nu$ is supported on the cross $\bigcup_{i=1}^{k}\mathrm{span}\{U_i^*\}$ for some $U^* \in \mathrm{St}(d,k)$.

Theorem

Under this assumption:

(i) The loss $\mathscr{E}(U)$ is smooth;

(ii) $U^*$ is a local minimizer of $\mathscr{E}(U)$; and

(iii) in a neighborhood $N \subseteq \mathrm{St}(d,k)$ around $U^*$, $\mathscr{E}(U)$ is geodesically strongly convex.

More precisely, the Riemannian Hessian is bounded above and below in the positive semidefinite order:

$$L\,\mathrm{Id} \succeq \mathrm{Hess}(\mathscr{E}(U)) \succeq m\,\mathrm{Id} \quad \text{in } N$$

The lower bound gives geodesic strong convexity; the upper bound gives $L$-Lipschitz Riemannian gradients. Together they guarantee linear convergence of Riemannian gradient descent in the neighborhood $N$.
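A minimal sketch of such a Riemannian gradient descent—our own illustrative implementation, using a subgradient of the loss, a tangent-space projection, and a QR retraction; the step size is a tuning choice, not prescribed by the theory:

```python
import numpy as np

def rgd_cross(X, U0, step=0.05, iters=300):
    """Riemannian gradient descent for the (mean) cross-fitting loss.

    A sketch under simplifying assumptions: the loss is nondifferentiable
    where a point is equidistant from two lines, so we use a subgradient.
    """
    U = U0
    n = X.shape[0]
    for _ in range(iters):
        coords = X @ U                            # (n, k) inner products
        assign = np.abs(coords).argmax(axis=1)    # nearest line per point
        # Euclidean gradient of (1/n) sum_i ||x_i - proj(x_i)||^2 in U.
        G = np.zeros_like(U)
        for j in range(U.shape[1]):
            mask = assign == j
            G[:, j] = -(2.0 / n) * X[mask].T @ coords[mask, j]
        # Riemannian gradient: project G onto the tangent space at U.
        grad = G - U @ ((U.T @ G + G.T @ U) / 2.0)
        # QR retraction returns the iterate to the Stiefel manifold.
        U, _ = np.linalg.qr(U - step * grad)
    return U
```

Sign flips introduced by the QR factorization are harmless here, since a line is invariant under negating its direction.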

Practical Performance

Convergence behavior

Empirically we observe a rapid initial decay in the loss, consistent with the theoretical linear rate. Crucially, this behavior remains stable as the sample size $n$ grows—the number of iterations required does not increase with data volume.

Computational Efficiency

GFA significantly outperforms Robust Subspace Clustering in computational cost. As the ambient dimension grows, our method incurs only a linear increase in computation time, while the classical approach suffers exponential growth—making GFA the only practical option for high-dimensional problems.

Computation time comparison 1
Computation time comparison 2

Statistical Guarantees

Wasserstein estimation

A fundamental question: how can we estimate the Wasserstein distance between $\mu$ and its projection from finite samples? Given $x_1, \ldots, x_n \sim \mu$ i.i.d., we form the empirical measure $\mu_n = \frac{1}{n}\sum_i \delta_{x_i}$ and similarly $\nu_n$ for the projected distribution.

In general, $W_2(\mu_n, \nu_n)$ is not a reliable estimator of $W_2(\mu, \nu)$—for measures absolutely continuous with respect to Lebesgue measure, we have the discouraging bound $W_2(\mu, \mu_n) \gtrsim n^{-1/d}$ almost surely, a manifestation of the curse of dimensionality.
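To see the curse concretely, the exact $W_2$ between two $n$-point empirical measures can be computed by solving an assignment problem (here with SciPy's `linear_sum_assignment`; the little experiment is our illustration, not from the source):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_w2(X, Y):
    """Exact W2 between two n-point empirical measures with equal weights,
    computed as an optimal assignment; O(n^3), fine for small n."""
    cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)
    return float(np.sqrt(cost[rows, cols].mean()))

# Two independent samples from the *same* Gaussian: the empirical W2
# should be near zero, but it degrades sharply as d grows.
rng = np.random.default_rng(0)
n = 200
for d in (2, 20):
    X, Y = rng.standard_normal((n, d)), rng.standard_normal((n, d))
    print(d, round(empirical_w2(X, Y), 2))
```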

The key question is whether adding structure to our problem lets us escape this curse. The answer is yes.

Proposition

Suppose $\mu \in \mathscr{P}_2(\mathbb{R}^d)$ and there exists $U^* \in O(d)$ such that $W_2^2(\mu, P_{U^*}{}_\#\mu) \leq \epsilon$ for some fixed $\epsilon > 0$. Then:

$$\mathbb{E}\left(W_2^2(\mu, \mu_n)\right) \leq 6\epsilon + 3d^2 n^{-1}$$

Our goal is to recover $U^*$ by solving $U_n \in \operatorname{argmin}_{U \in O(d)} W_2^2(\mu_n, P_U{}_\#\mu_n)$. Connecting the estimation error $\mathrm{dist}(U_n, U^*)$ to the Wasserstein approximation rate yields the main statistical result:

Theorem

Under the assumptions of the previous proposition, for any minimizer $U_n$:

$$\mathbb{E}\left(\mathrm{dist}(U_n, U^*)^2\right) \leq \frac{1}{C_\mu}\left(16\epsilon + 6d^2 n^{-1}\right)$$

where $C_\mu > 0$ is a constant depending on $\mu$.

Dimensionality analysis

Looking at the normalized error $\delta = \mathrm{dist}^2(U_n, U^*) / \|U^*\|^2$, the statistical rate is $O(d^2 / n)$ up to the fixed approximation error $\epsilon$—polynomial in $d$ rather than exponential. We have escaped the curse of dimensionality.

Real-World Applications

EOF Analysis in Climate Science

When studying Earth's climate, researchers track how spatial patterns evolve through time—a problem involving complex multidimensional time series. We focus on sea surface temperature (SST) anomalies, organized as vector fields and studied through their covariance:

$$X = [X_1,\ldots,X_d]^\top, \qquad \mathrm{Cov}(X) = \mathbb{E}\left[XX^\top\right]$$

World map SST
El Niño pattern La Niña pattern

Applied to SST data, GFA cleanly identifies the El Niño warming pattern in the central and eastern Pacific as one factor, while orthogonality forces the second factor to capture the La Niña cooling signature—recovering the two dominant modes of ENSO variability from data alone.

Understanding Human Faces

On the Extended Yale-B face dataset, GFA produces more interpretable facial basis elements than either PCA or Varimax.

PCA
PCA original PCA processed
Varimax
Varimax original Varimax processed
GFA (Ours)
GFA original GFA processed

GFA's basis elements correspond more closely to recognizable facial parts, reflecting the radial structure assumption that aligns with how faces vary across illumination and pose.

References