\documentclass[12pt]{report}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{8.5in}
\setlength{\topmargin}{0.0in}
\setlength{\oddsidemargin}{0.0in}
\usepackage{graphics}
\usepackage{curves}
\usepackage{tikz}
\begin{document}
Version: \today
\Huge
\begin{center}
\textbf{Notes on Factor Analysis}
\end{center}
\Large
\bigskip
The first question we need to address is why go to the trouble of developing a specific factor analysis model when principal components and ``Little Jiffy'' seem to get at this same problem of defining factors:
\bigskip
(1) In a principal component approach, the emphasis is completely on linear combinations of the observable random variables. There is no underlying (latent) structure of the variables that I try to estimate. Statisticians generally love models and find principal components to be somewhat inelegant and nonstatistical.
\bigskip
(2) The issue of how many components should be extracted is always an open question. With explicit models having differing numbers of ``factors,'' we might be able to see which of the models fits ``best'' through some formal statistical mechanism.
\bigskip
(3) Depending upon the scale of the variables used (i.e., the variances), principal components may vary and there is no direct way of relating the components obtained on the correlation matrix and the original variance-covariance matrix. With some forms of factor analysis, such as maximum likelihood (ML), it is possible to go between the results obtained from the covariance matrix and the correlations by dividing or multiplying by the standard deviations of the variables. In other words, we can have a certain type of ``scale invariance'' if we choose, for example, the maximum likelihood approach.
\bigskip
(4) If one wishes to work with a correlation matrix and have a means of testing whether a particular model is adequate or to develop confidence intervals and the like, it is probably preferable to use the ML approach. In PCA on a correlation matrix, the results that are usable for statistical inference are limited and very strained generally (and somewhat suspect).
\bigskip
To develop the factor analysis model, assume the $p$ \emph{observable} random variables, $\mathbf{X}' = [X_{1},\ldots,X_{p}]$, are MVN(\mbox{\boldmath $\mu$}, \mbox{\boldmath $\Sigma$}). Without loss of generality, we can assume that \mbox{\boldmath $\mu$} is the zero vector. Also, suppose that each $X_{i}$ can be represented by a linear combination of some $m$ \emph{unobservable} or latent random variables, $\mathbf{Y}' = [Y_{1},\ldots,Y_{m}]$, plus an error term, $e_{i}$:
\[ X_{i} = \lambda_{i1} Y_{1} + \cdots + \lambda_{im}Y_{m} + e_{i}, \ \mathrm{for} \ 1 \le i \le p \ .\] Here, $Y_{1}, \ldots, Y_{m}$ are the common factor variables; $e_{1}, \ldots, e_{p}$ are the specific factor variables; $\lambda_{ij}$ is the \emph{loading} (i.e., the covariance) of the $i^{th}$ response variable, $X_{i}$, on the $j^{th}$ common factor variable.
\bigskip
If $\mathbf{e}' = [e_{1},\ldots,e_{p}]$, then $\mathbf{X} = \mbox{\boldmath $\Lambda$} \mathbf{Y} + \mathbf{e}$,
where \[ \mbox{\boldmath $\Lambda$} = \left( \begin{array}{ccc}
\lambda_{11} & \cdots & \lambda_{1m} \\
\vdots & & \vdots \\
\lambda_{p1} & \cdots & \lambda_{pm} \\
\end{array} \right) \ . \]
\bigskip
For notation, we let the variance of $e_{i}$ be $\psi_{i}$, $1 \le i \le p$, and refer to $\psi_{i}$ as the \emph{specific variance} of the $i^{th}$ response variable; $e_{i} \sim $ N(0,$\psi_{i}$) and all the $e_{i}$s are independent of each other; $Y_{i} \sim $ N(0,1) and all the $Y_{i}$s are independent of each other and of the $e_{i}$s. Also, we define the diagonal matrix containing the specific variances to be \[ \mbox{\boldmath $\Psi$} = \left( \begin{array}{ccc}
\psi_{1} & \cdots & 0 \\
\vdots & & \vdots \\
0 & \cdots & \psi_{p} \\
\end{array} \right) \ . \]
\bigskip
\[ \mathrm{Var}(X_{i}) = \mathrm{Var}(\lambda_{i1}Y_{1} + \cdots + \lambda_{im}Y_{m} + e_{i}) = \]
\[\mathrm{Var}(\lambda_{i1}Y_{1}) + \cdots + \mathrm{Var}(\lambda_{im}Y_{m}) + \mathrm{Var}(e_{i}) = \]
\[ \lambda_{i1}^{2} + \cdots + \lambda_{im}^{2} + \psi_{i} \ . \] The expression, $\sum_{j=1}^{m} \lambda_{ij}^{2}$, is called the communality of the $i^{th}$ variable, $X_{i}$.
\bigskip
Because the cross terms involving different common factor variables and the specific factor variables vanish by independence, we have
\[ \mathrm{Cov}(X_{i},X_{j}) = \mathrm{Cov}(\lambda_{i1}Y_{1} + \cdots + \lambda_{im}Y_{m} + e_{i}, \lambda_{j1}Y_{1} + \cdots + \lambda_{jm}Y_{m} + e_{j}) = \]
\[ \lambda_{i1}\lambda_{j1} + \cdots + \lambda_{im}\lambda_{jm} \ .\]
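These variance and covariance formulas are easy to check by simulation. The following NumPy sketch (the loadings and specific variances are arbitrary illustrative values, not from any real data) draws a large sample from the model and compares the empirical covariance matrix of the $X_{i}$s against the model-implied $\mbox{\boldmath $\Lambda$}\mbox{\boldmath $\Lambda$}' + \mbox{\boldmath $\Psi$}$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, n = 4, 2, 200_000

# Arbitrary loadings and specific variances (illustrative values only).
Lam = rng.uniform(-1, 1, size=(p, m))
psi = rng.uniform(0.2, 0.8, size=p)

# Simulate Y ~ N(0, I_m), e ~ N(0, Psi), and X = Lam Y + e.
Y = rng.standard_normal((n, m))
e = rng.standard_normal((n, p)) * np.sqrt(psi)
X = Y @ Lam.T + e

S = np.cov(X, rowvar=False)
Sigma = Lam @ Lam.T + np.diag(psi)   # model-implied covariance matrix

print(np.max(np.abs(S - Sigma)))     # small for large n
```

The diagonal of `Sigma` holds the communalities plus specific variances, and the off-diagonal entries are the $\sum_{j} \lambda_{ij}\lambda_{kj}$ terms derived above.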
\bigskip
As a way of summarizing the results just given for the variances and covariances of the observable variables in terms of the loadings and specific variances, the factor analytic model is typically written as
\[ \mbox{\boldmath $\Sigma$}_{p \times p} = \mbox{\boldmath $\Lambda$}_{p \times m} \mbox{\boldmath $\Lambda$}_{m \times p}' + \mbox{\boldmath $\Psi$}_{p \times p} \ .\] There is a degree of indeterminacy in how this model is phrased, because for any $m \times m$ orthogonal matrix $\mathbf{T}$, we have the same type of decomposition of \mbox{\boldmath $\Sigma$} as \[ \mbox{\boldmath $\Sigma$}_{p \times p} = (\mbox{\boldmath $\Lambda$}\mathbf{T})_{p \times m} (\mbox{\boldmath $\Lambda$}\mathbf{T})_{m \times p}' + \mbox{\boldmath $\Psi$}_{p \times p} \ ,\] because $(\mbox{\boldmath $\Lambda$}\mathbf{T})(\mbox{\boldmath $\Lambda$}\mathbf{T})' = \mbox{\boldmath $\Lambda$}\mathbf{T}\mathbf{T}'\mbox{\boldmath $\Lambda$}' = \mbox{\boldmath $\Lambda$}\mbox{\boldmath $\Lambda$}'$. Thus, we have a rotation done by $\mathbf{T}$ to generate a new loading matrix, $\mbox{\boldmath $\Lambda$}\mathbf{T}$.
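The rotational indeterminacy can also be verified numerically. In this sketch (with an arbitrary loading matrix), a random orthogonal $\mathbf{T}$ is generated via a QR decomposition, and the two decompositions of \mbox{\boldmath $\Sigma$} are compared:

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 5, 2

Lam = rng.standard_normal((p, m))          # arbitrary loading matrix
Psi = np.diag(rng.uniform(0.1, 0.5, p))    # specific variances

# A random m x m orthogonal matrix T from a QR decomposition.
T, _ = np.linalg.qr(rng.standard_normal((m, m)))

Sigma1 = Lam @ Lam.T + Psi
Sigma2 = (Lam @ T) @ (Lam @ T).T + Psi     # rotated loadings, same Sigma

print(np.allclose(Sigma1, Sigma2))         # True
```

Any such $\mathbf{T}$ produces loadings that fit exactly as well, which is why rotation criteria (varimax, promax, and so on) are needed to pick an interpretable solution.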
\section{Iterated Principal (Axis) Factor Analysis}
Suppose I assume the factor analytic model to hold for the population correlation matrix, $\mathbf{P} = \mbox{\boldmath $\Lambda$} \mbox{\boldmath $\Lambda$}' + \mbox{\boldmath $\Psi$}$, and am given the sample correlation matrix, $\mathbf{R}$. The Guttman lower bound to the communality of a variable is the squared multiple correlation of that variable with the others, and can be used to give an initial estimate, $\hat{ \mbox{\boldmath $\Psi$}}$, of the matrix of specific variances by subtracting these lower bounds from 1.0 (the main diagonal entries in $\mathbf{R}$).
A component analysis (with $m$ components) is carried out on $\mathbf{R} - \hat{ \mbox{\boldmath $\Psi$}}$ and then normalized to produce a factoring, say, $\mathbf{B}\mathbf{B}'$. We estimate \mbox{\boldmath $\Psi$} by using the diagonal of $\mathbf{R} - \mathbf{B}\mathbf{B}'$, and iterate the process until convergence. (Little Jiffy (the principal component solution to the factor analysis model) could be viewed as a ``one shot'' process, with specific variances set at 0.0.)
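The iteration just described can be sketched compactly. This is a didactic implementation, not production code; the function name `principal_axis_fa` is mine, and the convergence settings are arbitrary:

```python
import numpy as np

def principal_axis_fa(R, m, n_iter=1000, tol=1e-8):
    """Iterated principal (axis) factoring of a correlation matrix R.

    Start the specific variances from the Guttman lower bounds
    (squared multiple correlations), factor R - Psi_hat with its
    top-m eigenpairs, re-estimate Psi from the diagonal, and iterate.
    """
    # SMC of variable i with the rest: 1 - 1/(R^{-1})_{ii}.
    smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    psi = 1.0 - smc                        # initial specific variances
    for _ in range(n_iter):
        vals, vecs = np.linalg.eigh(R - np.diag(psi))
        idx = np.argsort(vals)[::-1][:m]   # m largest eigenpairs
        B = vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))
        psi_new = 1.0 - np.sum(B**2, axis=1)
        if np.max(np.abs(psi_new - psi)) < tol:
            psi = psi_new
            break
        psi = psi_new
    return B, psi
```

If the $m$-factor model holds exactly for $\mathbf{R}$, the iteration converges to a factoring with $\mathbf{B}\mathbf{B}' + \hat{\mbox{\boldmath $\Psi$}}$ reproducing $\mathbf{R}$; setting `n_iter=1` with `psi` fixed at zero would give the one-shot Little Jiffy solution.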
\section{Maximum Likelihood Factor Analysis (MLFA)}
The method of MLFA holds out the hope of being a scale-invariant method, implying that the results from the correlation matrix or the covariance matrix can be transformed into each other through simple multiplications by the variable standard deviations. So if $\lambda_{ij}$ is a loading from a (population) correlation matrix, then $\lambda_{ij} \sigma_{i}$ is the corresponding loading from the (population) covariance matrix.
\bigskip
MLFA begins with the assumption that $\mathbf{X}_{p \times 1} \sim \mathrm{MVN}(\mbox{\boldmath $0$}, \mbox{\boldmath $\Sigma$}_{p \times p} = \mbox{\boldmath $\Lambda$}_{p \times m} \mbox{\boldmath $\Lambda$}_{m \times p}' + \mbox{\boldmath $\Psi$}_{p \times p})$. If there is a unique diagonal matrix, \mbox{\boldmath $\Psi$}, with positive elements such that the $m$ largest roots (eigenvalues) of $\mbox{\boldmath $\Sigma$}^{*} = \mbox{\boldmath $\Psi$}^{-1/2} \mbox{\boldmath $\Sigma$}\mbox{\boldmath $\Psi$}^{-1/2}$ are distinct and greater than unity, and the $p-m$ remaining roots are each unity (this is true if the model holds), then $\mbox{\boldmath $\Lambda$} = \mbox{\boldmath $\Psi$}^{1/2} \mbox{\boldmath $\Omega$}\mbox{\boldmath $\Delta$}^{1/2}$, where $\mbox{\boldmath $\Sigma$}^{*} - \mathbf{I} = \mbox{\boldmath $\Omega$}_{p \times m} \mbox{\boldmath $\Delta$}_{m \times m} \mbox{\boldmath $\Omega$}_{m \times p}'$. In other words, once you get \mbox{\boldmath $\Psi$}, you are ``home free'' because \mbox{\boldmath $\Lambda$} comes along by a formula.
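The ``home free'' formula can be checked numerically: with $\mbox{\boldmath $\Sigma$} = \mbox{\boldmath $\Lambda$}\mbox{\boldmath $\Lambda$}' + \mbox{\boldmath $\Psi$}$, the matrix $\mbox{\boldmath $\Sigma$}^{*} - \mathbf{I} = (\mbox{\boldmath $\Psi$}^{-1/2}\mbox{\boldmath $\Lambda$})(\mbox{\boldmath $\Psi$}^{-1/2}\mbox{\boldmath $\Lambda$})'$ has rank $m$, and the eigen-reconstruction recovers \mbox{\boldmath $\Lambda$} up to an orthogonal rotation of its columns. A sketch with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
p, m = 6, 2

Lam = rng.standard_normal((p, m))
psi = rng.uniform(0.3, 1.0, p)
Sigma = Lam @ Lam.T + np.diag(psi)

# Sigma* = Psi^{-1/2} Sigma Psi^{-1/2}; then Sigma* - I has rank m.
D = 1.0 / np.sqrt(psi)
Sigma_star = Sigma * np.outer(D, D)
vals, vecs = np.linalg.eigh(Sigma_star - np.eye(p))
idx = np.argsort(vals)[::-1][:m]
Omega, Delta = vecs[:, idx], np.clip(vals[idx], 0, None)

# Lambda = Psi^{1/2} Omega Delta^{1/2}, up to a column rotation.
Lam_hat = np.sqrt(psi)[:, None] * Omega * np.sqrt(Delta)
print(np.allclose(Lam_hat @ Lam_hat.T, Lam @ Lam.T))   # True
```

Note that only $\mbox{\boldmath $\Lambda$}\mbox{\boldmath $\Lambda$}'$ is identified here; the individual columns of the recovered loading matrix can differ from the originals by an orthogonal rotation, consistent with the indeterminacy discussed earlier.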
\bigskip
So, we start with some \mbox{\boldmath $\Psi$} (which generates \mbox{\boldmath $\Lambda$} automatically), and improve upon this initial value by maximizing the log-likelihood \[ \ell(\mbox{\boldmath $\Lambda$},\mbox{\boldmath $\Psi$}) = -\frac{n}{2}(\ln|\mbox{\boldmath $\Sigma$}| + \mathrm{Tr}(\mathbf{S}\mbox{\boldmath $\Sigma$}^{-1})) + \mathrm{constant} \ . \] Equivalently, we can minimize \[ \mathrm{F}(\mbox{\boldmath $\Lambda$},\mbox{\boldmath $\Psi$}) = \ln|\mbox{\boldmath $\Sigma$}| + \mathrm{Tr}(\mathbf{S}\mbox{\boldmath $\Sigma$}^{-1}) - \ln|\mathbf{S}| - p \ . \] The particular iterative optimization procedure used to obtain better and better values for \mbox{\boldmath $\Psi$} is typically the Davidon-Fletcher-Powell method.
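Because \mbox{\boldmath $\Lambda$} follows from \mbox{\boldmath $\Psi$} by the eigen formula, the discrepancy F can be treated as a function of \mbox{\boldmath $\Psi$} alone and handed to a general-purpose optimizer. The sketch below (my own function name `mlfa_discrepancy`; SciPy's default quasi-Newton routine stands in for Davidon-Fletcher-Powell) is illustrative only:

```python
import numpy as np
from scipy.optimize import minimize

def mlfa_discrepancy(log_psi, S, m):
    """F(Lambda(Psi), Psi) with Lambda profiled out via the eigen formula."""
    p = S.shape[0]
    psi = np.exp(log_psi)                  # keep specific variances positive
    D = 1.0 / np.sqrt(psi)
    vals, vecs = np.linalg.eigh(S * np.outer(D, D) - np.eye(p))
    idx = np.argsort(vals)[::-1][:m]
    Lam = np.sqrt(psi)[:, None] * vecs[:, idx] \
        * np.sqrt(np.clip(vals[idx], 0, None))
    Sigma = Lam @ Lam.T + np.diag(psi)
    _, logdet = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    return logdet + np.trace(S @ np.linalg.inv(Sigma)) - logdet_S - p

# Hypothetical usage, with S a sample covariance matrix and m factors:
# res = minimize(mlfa_discrepancy, x0=np.log(np.diag(S) / 2), args=(S, m))
```

Optimizing over $\ln \psi_{i}$ rather than $\psi_{i}$ keeps the specific variances positive during the search, though it does not prevent the Heywood behavior discussed below, where the optimizer drives some $\psi_{i}$ toward zero.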
\bigskip
In practice, a large-sample likelihood ratio test of \[ H_{0}: \mbox{\boldmath $\Sigma$} = \mbox{\boldmath $\Lambda$}\mbox{\boldmath $\Lambda$}' + \mbox{\boldmath $\Psi$} \] is available, using a test statistic of $(n - (2p+5)/6 - 2m/3)\mathrm{F}(\hat{\mbox{\boldmath $\Lambda$}},\hat{\mbox{\boldmath $\Psi$}})$, compared to a chi-squared random variable with $\frac{1}{2}[(p-m)^{2} - (p+m)]$ degrees of freedom. Generally, the residuals one gets from an MLFA tend to be smaller than those from a PCA, even though the cumulative variance explained in a PCA is usually larger; these are somewhat different criteria of fit.
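The test statistic and its reference distribution are straightforward to compute. A small helper (the function name `mlfa_lr_test` is mine; the correction factor follows the form quoted above):

```python
from scipy.stats import chi2

def mlfa_lr_test(F_hat, n, p, m):
    """Large-sample likelihood ratio test of the m-factor model.

    F_hat is the minimized discrepancy F(Lambda_hat, Psi_hat);
    returns (test statistic, degrees of freedom, p-value).
    """
    stat = (n - (2 * p + 5) / 6 - 2 * m / 3) * F_hat
    df = ((p - m) ** 2 - (p + m)) / 2
    return stat, df, chi2.sf(stat, df)
```

For example, with $p = 9$ variables and $m = 2$ factors, the degrees of freedom are $\frac{1}{2}[(9-2)^{2} - 11] = 19$; a large statistic (small $p$-value) argues that two factors do not adequately reproduce the covariances.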
\bigskip
In MLFA, one typically needs a rotation (oblique or orthogonal) to make the originally generated factors intelligible. Also, we now have various forms of confirmatory factor analysis (CFA) where some of the loadings might be fixed and others free to vary. CFA seems to be all the rage in scale development, but I would still like to see what a PCA tells you in an exploratory and optimized context. Finally, and although we talked about using and plotting component scores on our subjects in PCA, the comparable factor scores here should \emph{not} be used. There has been an enormous controversy about their indeterminacy; among people who are thinking straight (e.g., SYSTAT and Leland Wilkinson), factor scores are just not given.
\bigskip
When one allows correlated factors (e.g., using an oblique rotation), the factor analytic model is generalized to \[ \mbox{\boldmath $\Sigma$} = \mbox{\boldmath $\Lambda$}\mbox{\boldmath $\Phi$} \mbox{\boldmath $\Lambda$}' + \mbox{\boldmath $\Psi$}\] where \mbox{\boldmath $\Phi$} is the $m \times m$ covariance matrix among the $m$ factors. In terms of terminology, the matrix, \mbox{\boldmath $\Lambda$}, is called the factor \emph{pattern} matrix; \mbox{\boldmath $\Lambda$}\mbox{\boldmath $\Phi$} is called the factor \emph{structure} matrix and contains the covariances between the observed variables and the $m$ common factors.
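The distinction between the pattern and structure matrices can be illustrated by simulation: when the factors have covariance matrix \mbox{\boldmath $\Phi$}, the covariances between the observed variables and the factors come out as $\mbox{\boldmath $\Lambda$}\mbox{\boldmath $\Phi$}$, not \mbox{\boldmath $\Lambda$} itself. A sketch with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(4)
p, m, n = 4, 2, 200_000

Lam = rng.uniform(-1, 1, (p, m))
psi = rng.uniform(0.3, 0.8, p)
Phi = np.array([[1.0, 0.5],
                [0.5, 1.0]])               # correlated factors

# Draw Y ~ N(0, Phi) via a Cholesky factor, then X = Lam Y + e.
C = np.linalg.cholesky(Phi)
Y = rng.standard_normal((n, m)) @ C.T
e = rng.standard_normal((n, p)) * np.sqrt(psi)
X = Y @ Lam.T + e

# Empirical Cov(X, Y) approximates the structure matrix Lam @ Phi.
cross = (X.T @ Y) / n
print(np.max(np.abs(cross - Lam @ Phi)))   # small
```

With \mbox{\boldmath $\Phi$} $= \mathbf{I}$ (orthogonal factors), the pattern and structure matrices coincide, which is why the distinction only arises with oblique rotations.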
\bigskip
There is one property of MLFA that sometimes (in fact, often) rears its ugly head, involving what are called Heywood cases (or improper solutions), in which the optimization procedure wants to make some of the $\psi_{i}$s go negative. When this appears to be happening, the standard strategy is to set the offending $\psi_{i}$s exactly to zero and remove the corresponding variables; the removed set is then subjected to a principal component analysis, and a ``kluge'' is made of the principal components and the results from an MLFA on a covariance matrix residualized from the removed set. Obviously, the nice scale invariance of a true MLFA approach disappears when these improper solutions are encountered. You can tell immediately that you have this kind of hybrid solution when some of the specific variances are exactly zero.
\end{document}