som_norm_variable
Purpose
Initialize, apply and undo normalizations on a given vector of
scalar values.
Syntax
xnew = som_norm_variable(x,method,operation)
xnew = som_norm_variable(x,sNorm,operation)
[xnew,sNorm] = som_norm_variable(...)
Description
This function is used to initialize, apply and undo normalizations
on scalar variables. It is the low-level function that upper-level
functions SOM_NORMALIZE and SOM_DENORMALIZE utilize to actually (un)do
the normalizations.
Normalizations are typically performed to control the variance of
vector components. If some vector components have variance which is
significantly higher than the variance of other components, those
components will dominate the map organization. Normalization of
the variance of vector components (method 'var') is used to prevent
that. In addition to variance normalization, other methods have
been implemented as well (see list below).
Usually normalizations convert the variable values so that they no
longer make any sense: the values are still ordered, but their range
may have changed so radically that interpreting the numbers in the
original context is very hard. For this reason all implemented methods
are (more or less) revertible. The normalizations are monotonic
and information is saved so that they can be undone. Also, the saved
information makes it possible to apply the EXACTLY SAME normalization
to another set of values. The normalization information is determined
with 'init' operation, while 'do' and 'undo' operations are used to
apply or revert the normalization.
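As a minimal sketch of this workflow (x_train and x_new below are
hypothetical vectors of values), a normalization can be initialized on one
set of values and the exactly same transformation then applied to, and
undone on, another:
[dummy,sN] = som_norm_variable(x_train,'var','init'); % determine the parameters
x_n = som_norm_variable(x_new,sN,'do');               % apply them to other values
x_u = som_norm_variable(x_n,sN,'undo');               % revert the transformation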
The normalization information is saved in a normalization struct,
which is returned as the second argument of this function. Note that
normalization operations may be stacked: in that case, the normalization
structs are stored in a struct array. When the normalization is applied,
the array is traversed from start to end; when it is undone, the structs
are processed in reverse order.
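For example, a 'log' and a 'var' normalization could be stacked as follows
(a sketch; x_old is a hypothetical vector of original values):
[x_tmp,sN1] = som_norm_variable(x_old,'log','do');  % first normalization
[x_new,sN2] = som_norm_variable(x_tmp,'var','do');  % second, applied to the result
sN = [sN1 sN2];                                     % the stacked normalizations
x_orig = som_norm_variable(x_new,sN,'undo');        % undone in reverse order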
method description
'var' Variance normalization. A linear transformation which
scales the values so that their variance is 1. This is
a convenient way to use the Mahalanobis distance measure without
actually changing the distance calculation procedure.
'range' Normalization of the range of values. A linear transformation
which scales the values to the range [0,1] (see the code
sketch after this list of methods).
'log' Logarithmic normalization. In many cases the values of
a vector component are exponentially distributed. This
normalization is a good way to get more resolution to
(the low end of) that vector component. What this
actually does is a non-linear transformation:
x_new = log(x_old - m + 1)
where m=min(x_old) and log is the natural logarithm.
Applying the transformation to a value which is lower
than m-1 will give problems, as the result is then complex.
If the minimum of the values is known a priori,
it might be a good idea to initialize the normalization with
[dummy,sN] = som_norm_variable(minimum,'log','init');
and normalize only after this:
x_new = som_norm_variable(x,sN,'do');
'logistic' Logistic or softmax normalization. This normalization ensures
that all values, including future ones, remain within the range
[0,1]. The transformation is more or less linear in the
middle range (around the mean value), and has a smooth
nonlinearity at both ends which ensures that all values
are within the range. The data is first scaled as in
variance normalization:
x_scaled = (x_old - mean(x_old))/std(x_old)
and then transformed with the logistic function
x_new = 1/(1+exp(-x_scaled))
'histD' Discrete histogram equalization. Non-linear. Orders the
values and replaces each value by its ordinal number.
Finally, scales the values such that they are between [0,1].
Useful for both discrete and continuous variables, but as
the saved normalization information consists of all
unique values of the initialization data set, it may use
considerable amounts of memory. If the variable can take
more than a few (say, 20) distinct values, it might be better to
use the 'histC' method below. Another important note is that
this method is not exactly revertible if it is applied
to values which are not part of the original value set.
'histC' Continuous histogram equalization. In practice, a piecewise
linear transformation which approximates histogram
equalization. The value range is divided into
a number of bins such that the number of values in each
bin is (almost) the same. The values are transformed
linearly within each bin. For example, values in bin number 3
are scaled between [3,4[. Finally, all values are scaled
between [0,1]. The number of bins is the square root
of the number of unique values in the initialization set,
rounded up. The resulting histogram equalization is not
as good as the one produced by 'histD', but the benefit
is that it is exactly revertible, even outside the
original value range (although the results may then be odd).
'eval' With this method, freeform normalization operations can be
specified. The parameter field contains strings to be
evaluated with the 'eval' function, with the variable name 'x'
representing the variable itself. The first string is
the normalization operation, and the second is a
denormalization operation. If the denormalization operation
is empty, it is ignored.
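For reference, the 'var', 'range' and 'logistic' transformations can be
written out directly in MATLAB (a sketch; x_old is a hypothetical vector,
and the usual (x-min)/(max-min) form is assumed for 'range'; in practice
the toolbox calls above should be preferred, since they also store the
information needed for undoing):
x_var      = (x_old - mean(x_old)) / std(x_old);               % 'var'
x_range    = (x_old - min(x_old)) / (max(x_old) - min(x_old)); % 'range', to [0,1]
x_logistic = 1 ./ (1 + exp(-x_var));                           % 'logistic'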
Input arguments
x (vector) The scalar values to which the normalization
operation is applied.
method The normalization specification.
(string) Identifier for a normalization method: 'var',
'range', 'log', 'logistic', 'histD' or 'histC'.
Corresponding default normalization struct is created.
(struct) normalization struct
(struct array) of normalization structs, applied to
x one after the other
(cellstr) of length one or two: the first string gives the
normalization operation, and the second gives the
denormalization operation, with x representing the
variable, for example:
{'x+2','x-2'}, {'exp(-x)','-log(x)'} or {'round(x)'}.
Note that in the last case, no denormalization operation is
defined (see the sketch after this argument list).
Note: if the method is given as struct(s), it is
applied (done or undone, as specified by operation)
regardless of the value of the '.status' field
in the struct(s). Only if the status is
'uninit' is the undo operation halted.
In any case, the '.status' field in the returned
normalization struct(s) is set to the appropriate value.
operation (string) The operation to perform: 'init' to initialize
the normalization struct, 'do' to perform the
normalization, 'undo' to undo the normalization,
if possible. If operation 'do' is given, but the
normalization struct has not yet been initialized,
it is initialized using the given data (x).
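As a sketch of the cellstr form of the method argument (the normalization
string 'log(x+1)' and its inverse 'exp(x)-1' are only illustrative):
[x_new,sN] = som_norm_variable(x_old,{'log(x+1)','exp(x)-1'},'do'); % 'do' initializes first
x_back = som_norm_variable(x_new,sN,'undo');                        % uses the second string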
Output arguments
xnew (vector) Appropriately processed values.
sNorm (struct) Updated normalization struct/struct array. Where
applicable, the '.status' and '.params' fields are updated
(see the sketch below).
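For example, the fields of the returned struct can be inspected directly
(a sketch; the exact contents of '.params' depend on the method):
[x_new,sN] = som_norm_variable(x_old,'var','do');
sN.status  % the state of the normalization ('uninit' before initialization)
sN.params  % the parameters determined during initialization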
Examples
To initialize and apply a normalization on a set of scalar values:
[x_new,sN] = som_norm_variable(x_old,'var','do');
To just initialize, use:
[dummy,sN] = som_norm_variable(x_old,'var','init');
To undo the normalization(s):
x_orig = som_norm_variable(x_new,sN,'undo');
Typically, normalizations of data structs/sets are handled using
functions SOM_NORMALIZE and SOM_DENORMALIZE. However, when only the
values of a single variable are of interest, SOM_NORM_VARIABLE may
be useful. For example, assume one wants to apply the normalization
done on a component (i) of a data struct (sD) to a new set of values
(x) of that component. With SOM_NORM_VARIABLE this can be done with:
x_new = som_norm_variable(x,sD.comp_norm{i},'do');
Now, as the normalizations in sD.comp_norm{i} have already been
initialized with the original data set (presumably sD.data),
the EXACTLY SAME normalization(s) can be applied to the new values.
The same thing can be done with SOM_NORMALIZE function, too:
x_new = som_normalize(x,sD.comp_norm{i});
Or, if the new data set were in variable D - a matrix of the same
dimension as the original data set:
D_new = som_normalize(D,sD,i);
See also
som_normalize, som_denormalize