SOM Toolbox Online documentation http://www.cis.hut.fi/projects/somtoolbox/

som_norm_variable

Purpose

 Initialize, apply and undo normalizations on a given vector of
 scalar values.

Syntax

Description

 This function is used to initialize, apply and undo normalizations
 on scalar variables. It is the low-level function that upper-level
 functions SOM_NORMALIZE and SOM_DENORMALIZE utilize to actually (un)do
 the normalizations.

 Normalizations are typically performed to control the variance of 
 vector components. If some vector components have variance which is
 significantly higher than the variance of other components, those
 components will dominate the map organization. Normalization of 
 the variance of vector components (method 'var') is used to prevent 
 that. In addition to variance normalization, other methods have
 been implemented as well (see list below). 

 Usually normalizations convert the variable values so that they no 
 longer make any sense: the values are still ordered, but their range 
 may have changed so radically that interpreting the numbers in the 
 original context is very hard. For this reason all implemented methods
 are (more or less) revertible. The normalizations are monotonic
 and information is saved so that they can be undone. Also, the saved
 information makes it possible to apply the EXACTLY SAME normalization
 to another set of values. The normalization information is determined
 with 'init' operation, while 'do' and 'undo' operations are used to
 apply or revert the normalization. 

 The normalization information is saved in a normalization struct, 
 which is returned as the second argument of this function. Note that 
 normalization operations may be stacked. In this case, normalization 
 structs are positioned in a struct array. When applied, the array is 
 gone through from start to end, and when undone, in reverse order.

    method  description

    'var'   Variance normalization. A linear transformation which 
            scales the values such that their variance=1. This is
            convenient way to use Mahalanobis distance measure without
            actually changing the distance calculation procedure.

    'range' Normalization of range of values. A linear transformation
            which scales the values between [0,1]. 

    'log'   Logarithmic normalization. In many cases the values of
            a vector component are exponentially distributed. This 
            normalization is a good way to get more resolution to
            (the low end of) that vector component. What this 
            actually does is a non-linear transformation: 
               x_new = log(x_old - m + 1) 
            where m=min(x_old) and log is the natural logarithm. 
            Applying the transformation to a value which is lower 
            than m-1 will give problems, as the result is then complex.
            If the minimum for values is known a priori, 
            it might be a good idea to initialize the normalization with
              [dummy,sN] = som_norm_variable(minimum,'log','init');
            and normalize only after this: 
              x_new = som_norm_variable(x,sN,'do');

    'logistic' or softmax normalization. This normalization ensures
            that all values in the future, too, are within the range
            [0,1]. The transformation is more-or-less linear in the 
            middle range (around mean value), and has a smooth 
            nonlinearity at both ends which ensures that all values
            are within the range. The data is first scaled as in 
            variance normalization: 
               x_scaled = (x_old - mean(x_old))/std(x_old)
            and then transformed with the logistic function
               x_new = 1/(1+exp(-x_scaled))

    'histD' Discrete histogram equalization. Non-linear. Orders the 
            values and replaces each value by its ordinal number. 
            Finally, scales the values such that they are between [0,1].
            Useful for both discrete and continuous variables, but as 
            the saved normalization information consists of all 
            unique values of the initialization data set, it may use
            considerable amounts of memory. If the variable can get
            more than a few values (say, 20), it might be better to
            use 'histC' method below. Another important note is that
            this method is not exactly revertible if it is applied
            to values which are not part of the original value set.

    'histC' Continuous histogram equalization. Actually, a partially
            linear transformation which tries to do something like 
            histogram equalization. The value range is divided to 
            a number of bins such that the number of values in each
            bin is (almost) the same. The values are transformed 
            linearly in each bin. For example, values in bin number 3
            are scaled between [3,4[. Finally, all values are scaled
            between [0,1]. The number of bins is the square root
            of the number of unique values in the initialization set,
            rounded up. The resulting histogram equalization is not
            as good as the one that 'histD' makes, but the benefit
            is that it is exactly revertible - even outside the 
            original value range (although the results may be funny).

    'eval'  With this method, freeform normalization operations can be 
            specified. The parameter field contains strings to be 
            evaluated with 'eval' function, with variable name 'x'
            representing the variable itself. The first string is 
            the normalization operation, and the second is a 
            denormalization operation. If the denormalization operation
            is empty, it is ignored.

Input arguments

   x          (vector) The scalar values to which the normalization      
                       operation is applied.

   method              The normalization specification.
              (string) Identifier for a normalization method: 'var', 
                       'range', 'log', 'logistic', 'histD' or 'histC'. 
                       Corresponding default normalization struct is created.
              (struct) normalization struct 
              (struct array) of normalization structs, applied to 
                       x one after the other
              (cellstr) of length 
              (cellstr array) first string gives normalization operation, and 
                       the second gives denormalization operation, with x 
                       representing the variable, for example: 
                       {'x+2','x-2}, or {'exp(-x)','-log(x)'} or {'round(x)'}.
                       Note that in the last case, no denorm operation is 
                       defined. 

               note: if the method is given as struct(s), it is
                     applied (done or undone, as specified by operation)
                     regardless of what the value of '.status' field
                     is in the struct(s). Only if the status is
                     'uninit', the undoing operation is halted.
                     Anyhow, the '.status' fields in the returned 
                     normalization struct(s) is set to approriate value.

   operation  (string) The operation to perform: 'init' to initialize
                       the normalization struct, 'do' to perform the 
                       normalization, 'undo' to undo the normalization, 
                       if possible. If operation 'do' is given, but the
                       normalization struct has not yet been initialized,
                       it is initialized using the given data (x).

Output arguments

   x        (vector) Appropriately processed values. 

   sNorm    (struct) Updated normalization struct/struct array. If any,
                     the '.status' and '.params' fields are updated.

Examples

  To initialize and apply a normalization on a set of scalar values: 

    [x_new,sN] = som_norm_variable(x_old,'var','do'); 

  To just initialize, use: 

    [dummy,sN] = som_norm_variable(x_old,'var','init'); 

  To undo the normalization(s): 

    x_orig = som_norm_variable(x_new,sN,'undo');

  Typically, normalizations of data structs/sets are handled using
  functions SOM_NORMALIZE and SOM_DENORMALIZE. However, when only the
  values of a single variable are of interest, SOM_NORM_VARIABLE may 
  be useful. For example, assume one wants to apply the normalization
  done on a component (i) of a data struct (sD) to a new set of values 
  (x) of that component. With SOM_NORM_VARIABLE this can be done with: 

    x_new = som_norm_variable(x,sD.comp_norm{i},'do'); 

  Now, as the normalizations in sD.comp_norm{i} have already been 
  initialized with the original data set (presumably sD.data), 
  the EXACTLY SAME normalization(s) can be applied to the new values.
  The same thing can be done with SOM_NORMALIZE function, too: 

    x_new = som_normalize(x,sD.comp_norm{i}); 

  Or, if the new data set were in variable D - a matrix of same
  dimension as the original data set: 

    D_new = som_normalize(D,sD,i);

See also

som_normalize Add/apply/redo normalizations for a data struct/set.
som_denormalize Undo normalizations of a data struct/set.