\documentclass[12pt]{report}

\setlength{\textwidth}{6.5in}
\setlength{\textheight}{8.5in}
\setlength{\topmargin}{0.0in}
\setlength{\oddsidemargin}{0.0in}


\usepackage{graphics}
\usepackage{curves}
\usepackage{tikz}

\begin{document}

Version: \today

\Huge

\begin{center}

\textbf{Notes on Discrimination and Classification}

\end{center}


\Large



\bigskip

The term ``discrimination'' (in a nonpejorative statistical sense) refers to the task of discrimination among groups through linear combinations of variables that maximize some criterion, usually $F$-ratios.  The term ``classification'' refers to the task of allocating observations to existing groups, typically to minimize the cost and/or probability of misclassification.  These two topics are intertwined, but it is most convenient to start with the topic of classification.

\bigskip

In the picture to follow, we have two populations, called $\pi_{1}$ and $\pi_{2}$; $\pi_{1}$ is characterized by a normal distribution with mean $\mu_{1}$, and variance $\sigma_{X}^{2}$ (the density is denoted by $f_{1}(x)$);  $\pi_{2}$ is characterized by a normal distribution with mean $\mu_{2}$, and (common) variance $\sigma_{X}^{2}$ (the density is denoted by $f_{2}(x)$).  I have an observation, say $x_{0}$, and wish to decide where it should go, either to $\pi_{1}$ or $\pi_{2}$.  Assuming implicitly that $\mu_{1} \le \mu_{2}$, we choose a criterion point, $c$, and allocate to $\pi_{1}$ if $x_{0} \le c$, and to $\pi_{2}$ if $> c$. The probabilities of misclassification can be given in the following chart (and in the figure):

\bigskip

\begin{tabular}{cc|cc|}

 & & True State& \\
  & & $\pi_{1}$ & $\pi_{2}$ \\ \hline
   & $\pi_{1}$ & $1 - \alpha$ & $\beta$ \\
 Decision   & & & \\
     & $\pi_{2}$ & $\alpha$ & $1 - \beta$ \\ \hline


      \end{tabular}


\begin{figure}

\centerline{\includegraphics {classification_figure.pdf}}

\end{figure}

If I want to choose $c$ so that $\alpha + \beta$ is smallest, I would select the point at which the densities are equal.  A more complicated way of saying this decision rule is to allocate to $\pi_{1}$ if $f_{1}(x_{0})/f_{2}(x_{0}) \ge 1$; if $< 1$, then allocate to $\pi_{2}$.  Suppose now that the prior probabilities of being drawn from $\pi_{1}$ and $\pi_{2}$ are $p_{1}$ and $p_{2}$, where $p_{1} + p_{2} = 1$.  I  wish to choose $c$ so the Total Probability of Misclassification (TPM) is minimized, i.e., $p_{1}\alpha + p_{2}\beta$.  The rule would be to allocate to $\pi_{1}$ if $f_{1}(x_{0})/f_{2}(x_{0}) \ge p_{2}/p_{1}$; if $< p_{2}/p_{1}$, then allocate to $\pi_{2}$.  Finally, if we include costs of misclassification, $c(1|2)$ (for assigning to $\pi_{1}$ when actually coming from $\pi_{2}$), and $c(2|1)$ (for assigning to $\pi_{2}$ when actually coming from $\pi_{1}$), we can choose $c$ to minimize the Expected Cost of Misclassification (ECM), $c(2|1)p_{1}\alpha + c(1|2)p_{1}\beta$, with the associated rule of  allocating to $\pi_{1}$ if $f_{1}(x_{0})/f_{2}(x_{0}) \ge (c(1|2)/c(2|1))(p_{2}/p_{1})$; if $<  (c(1|2)/c(2|1))(p_{2}/p_{1})$, then allocate to $\pi_{2}$.


\bigskip

Using logs, the last rule can be restated:  allocate to $\pi_{1}$ if $\log(f_{1}(x_{0})/f_{2}(x_{0})) \ge \log((c(1|2)/c(2|1))(p_{2}/p_{1}))$.  The left-hand-side is equal to $(\mu_{1} - \mu_{2})(\sigma_{X}^{2})^{-1}x_{0} - (1/2)(\mu_{1} - \mu_{2})(\sigma_{X}^{2})^{-1}(\mu_{1} + \mu_{2})$, so the rule can be  restated further: allocate to $\pi_{1}$ if \[ x_{0} \le \{(1/2)(\mu_{1} - \mu_{2})(\sigma_{X}^{2})^{-1}(\mu_{1} + \mu_{2}) \] \[ - \log((c(1|2)/c(2|1))(p_{2}/p_{1}))\}\{\frac{\sigma_{X}^{2}}{-(\mu_{1} - \mu_{2})} \} \] or \[ x_{0} \le \{ (1/2)(\mu_{1} + \mu_{2}) \ - \ \log((c(1|2)/c(2|1))(p_{2}/p_{1}))\}\{\frac{\sigma_{X}^{2}}{(\mu_{2} - \mu_{1})} \} = c \ . \]

\bigskip

If the costs of misclassification are equal (i.e., $c(1|2) = c(2|1)$), then the allocation rule is based on classification functions: allocate to $\pi_{1}$ if \[ [\frac{\mu_{1}}{\sigma_{X}^{2}}x_{0} - (1/2)\frac{\mu_{1}^{2}}{\sigma_{X}^{2}} + \log(p_{1})] - [\frac{\mu_{2}}{\sigma_{X}^{2}}x_{0} - (1/2)\frac{\mu_{2}^{2}}{\sigma_{X}^{2}} + \log(p_{2})] \ge 0 \ . \]

\bigskip

Moving toward the multivariate framework, suppose population $\pi_{1}$ is characterized by a $p \times 1$ vector of random variables, $\mathbf{X} \sim \mathrm{MVN}(\mbox{\boldmath $\mu_{1}$},\mbox{\boldmath $\Sigma$})$; population $\pi_{2}$ is characterized by a $p \times 1$ vector of random variables, $\mathbf{X} \sim \mathrm{MVN}(\mbox{\boldmath $\mu_{2}$}, \mbox{\boldmath $\Sigma$})$.  We have a similar allocation rule as in the univariate case: allocate to $\pi_{1}$ if $\mathbf{a}\mathbf{x}_{0} - \mathbf{a}[(\mbox{\boldmath $\mu_{1}$} + \mbox{\boldmath $\mu_{2}$})/2] \ge  (c(1|2)/c(2|1))(p_{2}/p_{1})$, where  \[ \mathbf{a} = (\mbox{\boldmath $\mu_{1}$} - \mbox{\boldmath $\mu_{2}$})' \mbox{\boldmath $\Sigma$}^{-1} \ . \]  Or, if the misclassification costs are equal, allocate to $\pi_{1}$ if $\mathbf{a}\mathbf{x}_{0} - \mathbf{a}[(\mbox{\boldmath $\mu_{1}$} + \mbox{\boldmath $\mu_{2}$})/2] \ge  [\log(p_{2}) - \log(p_{1})]$.   In effect, we define regions of classification, say $R_{1}$ and $R_{2}$; if an observation falls into region $R_{i}$, it is allocated to group $i$, for $i = 1,2$  There are a number of ways of restating this last rule (assuming equal misclassification costs, this is choosing to minimize the Total Probability of Misclassification (TPM)):

\bigskip

A) Evaluate the classification functions for both groups and assign according to which is higher: allocate to $\pi_{1}$ if \[  [\mbox{\boldmath $\mu$}_{1}'  \mbox{\boldmath $\Sigma$}^{-1}\mathbf{x}_{0}  -(1/2)\mbox{\boldmath $\mu$}_{1}  \mbox{\boldmath $\Sigma$}^{-1}\mbox{\boldmath $\mu$}_{1}) + \log(p_{1})] -  \] \[[\mbox{\boldmath $\mu$}_{2}'  \mbox{\boldmath $\Sigma$}^{-1}\mathbf{x}_{0}  -(1/2)\mbox{\boldmath $\mu$}_{2}  \mbox{\boldmath $\Sigma$}^{-1}\mbox{\boldmath $\mu$}_{2}) + \log(p_{2})]  \ge 0 \ . \]

\bigskip

B) Define the posterior probability of being in group $i$, for $i = 1,2$, $P(\pi_{i}| \mathbf{x}_{0})$ as $(f_{i} p_{i})/(f_{1} p_{1} + f_{2} p_{2})$.  We allocate to the group with the largest posterior probability.

\bigskip

C) We can restate our allocation rule according to Mahalanobis distances: define the squared Mahalanobis distance of $\mathbf{x}_{0}$ to $\mu_{i}, i = 1,2$, as \[ (\mathbf{x}_{0} - \mbox{\boldmath $\mu$}_{i})' \mbox{\boldmath $\Sigma$}^{-1}  (\mathbf{x}_{0} - \mbox{\boldmath $\mu$}_{i}) \ . \]  Allocate to $\pi_{i}$ for the largest quantity of the form: \[ -(1/2)[(\mathbf{x}_{0} - \mbox{\boldmath $\mu$}_{i})' \mbox{\boldmath $\Sigma$}^{-1}  (\mathbf{x}_{0} - \mbox{\boldmath $\mu$}_{i})] + \log(p_{i}) \ . \]

\bigskip

When the covariance matrices are not equal in the two populations (i.e., $\mbox{\boldmath $\Sigma$}_{1} \ne \mbox{\boldmath $\Sigma$}_{2}$), the allocation rules get a little more complicated.  The classification rules are now called ``quadratic'', and may produce regions of allocation that may not be contiguous.  This is a little strange, but it can be done, and we can still split the allocation rule into two classification functions (assuming, as usual, equal costs of misclassification):

\bigskip

Assign to $\pi_{1}$ if \[ -(1/2)\mathbf{x}_{0}'(\mbox{\boldmath $\Sigma$}_{1}^{-1} - \mbox{\boldmath $\Sigma$}_{2}^{-1})\mathbf{x}_{0} +
(\mbox{\boldmath $\mu_{1}$}' \mbox{\boldmath $\Sigma$}_{1}^{-1} - \mbox{\boldmath $\mu_{2}$}' \mbox{\boldmath $\Sigma$}_{1}^{-1})\mathbf{x}_{0} \ - \ k \ge \] \[ \log((c(1|2)/c(2|1))(p_{2}/p_{1})) \ , \] where \[ k = (1/2) \log(\frac{|\mbox{\boldmath $\Sigma$}_{1}|}{|\mbox{\boldmath $\Sigma$}_{2}|}) +
(1/2)( \mbox{\boldmath $\mu_{1}$}' \mbox{\boldmath $\Sigma$}_{1}^{-1} \mbox{\boldmath $\mu_{1}$} - \mbox{\boldmath $\mu_{2}$}' \mbox{\boldmath $\Sigma$}_{2}^{-1} \mbox{\boldmath $\mu_{2}$}) \ . \]

\bigskip

Moving to the sample, we could just use estimated quantities and hope our rule does well --- we use $\mathbf{S}_{pooled}$, assuming equal covariance matrices in the two populations, and sample means, $\hat{\mbox{\boldmath $\mu_{1}$}}$ and $\hat{\mbox{\boldmath $\mu_{2}$}}$.  In fact, we can come up with the misclassification table based on the given sample and how they allocate the given $n$ observations to the two groups:

\bigskip

\begin{tabular}{cc|cc|}

  &  & Group  &  \\

    &  & $\pi_{1}$ & $\pi_{2}$ \\ \hline

      & $\pi_{1}$ & $a$   & $b$ \\

 Decision       &  &  &  \\

          & $\pi_{2}$ & $c$  & $d$ \\ \hline

            &  & $n_{1}$  & $n_{2}$    \\

            \end{tabular}

 \bigskip

 The apparent error rate (APR) is $(b + c)/n$, which is overly optimistic because it is optimized with respect to \emph{this} sample.  To cross-validate, we could use a ``hold out one-at-a-time'' strategy (i.e., a sample reuse procedure commonly referred to as the ``jackknife''):

 \bigskip

\begin{tabular}{cc|cc|}

  &  & Group  &  \\

    &  & $\pi_{1}$ & $\pi_{2}$ \\ \hline

      & $\pi_{1}$ & $a^{*}$   & $b^{*}$  \\

 Decision       &  &  &  \\

          & $\pi_{2}$ & $c^{*}$   & $d^{*}$  \\ \hline

            &  & $n_{1}$  & $n_{2}$   \\

            \end{tabular}

 \bigskip

 \noindent To estimate the actual error rate (AER), we would use $(b^{*} + c^{*})/n$.


\bigskip

Suppose we have $g$ groups; $p_{i}$ is the a priori probability of group $i$, $1 \le i \le g$; $c(k|i)$ is the cost of classifying an $i$ as a $k$. The decision rule that minimizes the expected cost of misclassification (ECM) is: allocate $\mathbf{x}_{0}$ to population $\pi_{k}$, $1 \le k \le g$, if \[ \sum_{i=1; i \ne k}^{g} p_{i} f_{i}(\mathbf{x}_{0})c(k|i) \] is smallest.

\bigskip

There are, again, alternative ways of stating this allocation rule; we will assume for convenience that the costs of misclassification are equal:

\bigskip

Allocate to group $k$ if the posterior probability, \[ P(\pi_{k}|\mathbf{x}_{0}) = \frac{p_{k}f_{k}(\mathbf{x}_{0})}{\sum_{i=1}^{g} p_{i}f_{i}(\mathbf{x}_{0})} \ , \] is largest.

\bigskip

If in population $k$, $\mathbf{X} \sim \mathrm{MVN}(\mbox{\boldmath $\mu$}_{k}, \mbox{\boldmath $\Sigma$}_{k})$, we allocate to group $k$ if $\log(p_{k}f_{k}(\mathbf{x}_{0})) = $ \[ -(1/2)\log(|\mbox{\boldmath $\Sigma$}_{k}|) - (1/2) (\mathbf{x}_{0} - \mbox{\boldmath $\mu$}_{k})'  \mbox{\boldmath $\Sigma$}_{k}^{-1} (\mathbf{x}_{0} - \mbox{\boldmath $\mu$}_{k}) + \log(p_{i}) + \mathrm{constant} \ , \] is largest.

 \bigskip

 If all the $ \mbox{\boldmath $\Sigma$}_{k} =  \mbox{\boldmath $\Sigma$}$ for all $k$, then we allocate to $\pi_{k}$ if \[ \mbox{\boldmath $\mu$}_{k}' \mbox{\boldmath $\Sigma$}_{k}^{-1}\mathbf{x}_{0} - (1/2)\mbox{\boldmath $\mu$}_{k}' \mbox{\boldmath $\Sigma$}_{k}^{-1}\mbox{\boldmath $\mu$}_{k} + \log(p_{k}) \ , \] is largest.


 \bigskip

 It is interesting that we can do this in a pairwise way as well: allocate to $\pi_{k}$ if \[(\mbox{\boldmath $\mu$}_{k} - \mbox{\boldmath $\mu$}_{i})' \mbox{\boldmath $\Sigma$}_{k}^{-1}\mathbf{x}_{0} - (1/2)(\mbox{\boldmath $\mu$}_{k} - \mbox{\boldmath $\mu$}_{i})' \mbox{\boldmath $\Sigma$}_{k}^{-1}(\mbox{\boldmath $\mu$}_{k} + \mbox{\boldmath $\mu$}_{i}) \ge \log(p_{i}/p_{k}) \ , \] for all $i = 1, \ldots, g$.

 \subsection{Discriminant Analysis}

Suppose we have a one-way analysis-of-variance (ANOVA) layout with
$J$ groups ($n_{j}$ subjects in group $j$, $1 \le j \le J$), and $p$
measurements on each subject.  If $x_{ijk}$ denotes person $i$, in
group $j$, and the observation of variable $k$ ($1 \le i \le n_{j}$;
$1 \le j \le J$; $1 \le k \le p$), then define the
Between-Sum-of-Squares matrix

\[\mathbf{B}_{p \times p} =
\{\sum_{j=1}^{J} n_{j}(\bar{x}_{\cdot jk} - \bar{x}_{\cdot \cdot
k})(\bar{x}_{\cdot jk'} - \bar{x}_{\cdot \cdot k'})\}_{p \times p}
\]

\noindent and the Within-Sum-of-Squares matrix

 \[\mathbf{W}_{p
\times p} = \{ \sum_{j=1}^{J} \sum_{i=1}^{n_{j}} (x_{ijk} -
 \bar{x}_{\cdot jk})(x_{ijk'} - \bar{x}_{\cdot jk'}) \}_{p \times p} \]

 For the matrix product $\mathbf{W}^{-1}\mathbf{B}$, let
 $\lambda_{1}, \ldots, \lambda_{T} \ge 0$ be the eigenvectors ($T =
 \min(p, J-1)$, and $\mathbf{p}_{1}, \ldots, \mathbf{p}_{T}$ the
 corresponding normalized eigenvectors.  Then, the linear
 combination

 \[ \mathbf{p}_{k}'\left(
                                 \begin{array}{c}
                                   X_{1} \\
                                   \vdots \\
                                   X_{p} \\
                                 \end{array}
                               \right) \]

                               \noindent is called the
                               $k^{th}$ \emph{discriminant
                               function}.
It has the valuable property of maximizing the univariate $F$-ratio
subject to being uncorrelated with the earlier linear combinations.

\bigskip

There are a number of points to make about (Fisher's) Linear Discriminant Functions:

\bigskip

A) Typically, we define a sample pooled variance-covariance matrix, $\mathbf{S}_{pooled}$, as $(\frac{1}{n-J})\mathbf{W}$. And generally,  the eigenvalues are scaled so that $\mathbf{p}_{k}'\mathbf{S}_{pooled}\mathbf{p}_{k} = 1$.

\bigskip

B) When $J = 2$, the eigenvector, $\mathbf{p}_{1}'$, is equal to $(\hat{\mbox{\boldmath $\mu_{1}$}} - \hat{\mbox{\boldmath $\mu_{2}$}})'\mathbf{S}_{pooled}$. This set of weights maximized the square of the $t$ ratio in a two-group separation problem (i.e., discriminating between the two groups).  We also maximize the square of the effect size for this linear combination; the maximum for such an effect size is \[ (\bar{\mathbf{x}}_{1} - \bar{\mathbf{x}}_{2})' \mathbf{S}_{pooled}^{-1} (\bar{\mathbf{x}}_{1} - \bar{\mathbf{x}}_{2})' \ , \] where $\bar{\mathbf{x}}_{1}$ and $\bar{\mathbf{x}}_{2}$ are the sample centroids in groups 1 and 2 for the $p$ variables.
 Finally, if we define $Y = 1$ if an observation falls into group 1, and $ = 0$ if in group 2, the set of weights in $\mathbf{p}_{1}'$ is proportional to the regression coefficients in predicting $Y$ from $X_{1}, \ldots, X_{p}$.

 \bigskip

 C) The classification rule based on Mahalanobis distance (and assuming equal prior probabilities and equal misclassification values), could be restated equivalently using plain Euclidean distances in discriminant function space.



\end{document} 