Self-Organizing Maps in Web Mining: Technical Report
In the name of Allah, the Most Gracious, the Most Merciful
Prepared by:
Mogtapa Eltayip
Abd-Elrhman Mohammed
Index (12)
Supervisor: Abdelhamid Salih, Ph.D.
Abstract
Neural networks are a computational approach based on a large collection of neural units, loosely modeling the way the brain solves problems with large clusters of biological neurons connected by axons. Each neural unit is connected to many others, and the links can be excitatory or inhibitory in their effect on the activation state of the connected neural units.
INTRODUCTION
The Self-Organizing Map (SOM) is one of the most popular neural network models. It belongs to the category of competitive learning networks. The Self-Organizing Map is based on unsupervised learning, which means that no human intervention is needed during learning and that little needs to be known about the characteristics of the input data. We could, for example, use the SOM for clustering data without knowing the class memberships of the input data. The SOM can be used to detect features inherent to the problem and has thus also been called the SOFM, the Self-Organizing Feature Map. The Self-Organizing Map was developed by Professor Teuvo Kohonen and has proven useful in many applications.
The SOM algorithm is based on unsupervised, competitive learning. It provides a topology-preserving mapping from the high-dimensional space to map units. Map units, or neurons, usually form a two-dimensional lattice, so the mapping is a mapping from a high-dimensional space onto a plane. The property of topology preservation means that the mapping preserves the relative distances between points: points that are near each other in the input space are mapped to nearby map units in the SOM. The SOM can thus serve as a cluster-analysis tool for high-dimensional data. The SOM also has the capability to generalize, which means that the network can recognize or characterize inputs it has never encountered before.
Methodology:
The SOM Algorithm
The aim is to learn a feature map from the spatially continuous input space, in which our input vectors live, to the low-dimensional, spatially discrete output space, which is formed by arranging the computational neurons into a grid. The stages of the SOM algorithm that achieve this can be summarized as follows (a minimal implementation sketch is given after the list):
1. Initialization – Choose random values for the initial weight vectors wj.
2. Sampling – Draw a sample training input vector x from the input space.
3. Matching – Find the winning neuron I(x), the neuron whose weight vector is closest to x.
4. Updating – Update the weight vectors of all neurons j in the neighborhood of the winning neuron I(x), moving them toward x.
5. Continuation – Return to step 2 and repeat; if the convergence criterion is met, stop.
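As a concrete illustration of these stages, here is a minimal sketch of the algorithm in Python with NumPy. The grid size, learning-rate schedule, and Gaussian neighborhood function are illustrative choices, not values prescribed in this report.

import numpy as np

def train_som(data, rows=10, cols=10, n_iters=5000,
              eta0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    # 1. Initialization: random initial weight vectors w_j
    weights = rng.random((rows * cols, dim))
    # Grid coordinates of each neuron, used by the neighborhood function
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)],
                    dtype=float)
    for t in range(n_iters):
        # Decay the learning rate and neighborhood width over time
        frac = t / n_iters
        eta = eta0 * (1.0 - frac)
        sigma = sigma0 * np.exp(-3.0 * frac)
        # 2. Sampling: draw a training vector x
        x = data[rng.integers(len(data))]
        # 3. Matching: the winner I(x) minimizes ||x - w_j||
        bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        # 4. Updating: move the winner and its grid neighbors toward x
        d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian neighborhood
        weights += eta * h[:, None] * (x - weights)
        # 5. Continuation: loop until the iteration budget is spent
    return weights.reshape(rows, cols, dim)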
The Architecture of a Self-Organizing Map
We shall concentrate on the SOM system known as a
Kohonen Network. This has a feed-forward structure with a single computational
layer of neurons arranged in rows and columns. Each neuron is fully connected
to all the source units in the input layer.
A one-dimensional map will just have a single row
or column in the computational layer.
Properties of the Feature Map
Once the SOM algorithm has converged, the feature
map displays important statistical characteristics of the input space. Given an
input vector x, the feature map Φ provides a winning neuron
I(x) in the output space, and the weight vector wI(x) provides the coordinates
of the image of that neuron in the input space.
The feature map has four important properties that
we shall look at in turn.
Property 1: Approximation of the Input Space
The feature map Φ, represented by the set of weight vectors {wi} in the output space, provides a good approximation to the input space. We can state the aim of the SOM as storing a large set of input vectors {x} by finding a smaller set of prototypes {wi} so as to provide a good approximation to the original input space. The theoretical basis of this idea is rooted in vector quantization theory, the motivation of which is dimensionality reduction or data compression. In effect, the goodness of the approximation is given by the total squared distance

D = Σ_x ||x − w_I(x)||²

which we wish to minimize. If we work through gradient-descent-style mathematics we end up with the SOM weight update algorithm, which confirms that it is generating a good approximation to the input space.
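The quantity D can be computed directly. The short sketch below is one way to do so, assuming the weight vectors have been flattened to a matrix with one row per map unit (as in the training sketch above).

import numpy as np

def total_squared_distance(data, weights):
    # For each x, find its winning neuron and accumulate ||x - w_I(x)||^2
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return float(np.sum(np.min(dists, axis=1) ** 2))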
Property 2: Topological Ordering
The feature map Φ computed by the SOM
algorithm is topologically ordered in the sense that the spatial location of a
neuron in the output lattice/grid corresponds to a particular domain or feature
of the input patterns.
The topological ordering property is a direct
consequence of the weight update equation that forces the weight vector wI(x)
of the winning neuron I(x) to move toward the input vector x. The crucial
factor is that the weight updates also move the weight vectors wj of the
closest neighboring neurons j along with the winning neuron I(x). Together these weight changes cause the whole output space to become appropriately ordered. We can visualize the feature map Φ as an elastic or virtual net with a grid-like topology. Each output node can be represented in the input space at coordinates given by its weights. Then, if neighboring nodes in the output space have their corresponding points in the input space connected together, the resulting image of the output grid reveals directly the topological ordering at each stage of the network training (one way to produce such a picture is sketched below).
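One way to draw this elastic net, assuming two-dimensional input data and a weight array shaped like the output of the train_som sketch above, is the following; matplotlib is an illustrative choice, not something this report specifies.

import matplotlib.pyplot as plt

def plot_elastic_net(weights):
    # weights has shape (rows, cols, 2): each neuron is a point in input
    # space; lines connect neurons that are neighbors on the output grid
    rows, cols, _ = weights.shape
    for r in range(rows):                       # horizontal grid links
        plt.plot(weights[r, :, 0], weights[r, :, 1], 'k-', lw=0.5)
    for c in range(cols):                       # vertical grid links
        plt.plot(weights[:, c, 0], weights[:, c, 1], 'k-', lw=0.5)
    plt.scatter(weights[..., 0], weights[..., 1], s=10, c='red')
    plt.title('SOM weight vectors connected by grid topology')
    plt.show()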
Property 3: Density Matching
The feature map Φ reflects variations in the statistics of the input distribution:
regions in the input space from which the sample training vectors x are drawn
with high probability of occurrence are mapped onto larger domains of the
output space, and therefore with better resolution than regions of input space
from which training vectors are drawn with low probability. We need to relate
the input vector probability distribution p(x) to the magnification factor m(x)
of the feature map. Generally, for two-dimensional feature maps the relation cannot be expressed as a simple function, but in one dimension we can show that

m(x) ∝ p(x)^(2/3)

so the SOM algorithm does not match the input density exactly, because of the power of 2/3 rather than 1. Computer simulations indicate similar approximate density matching in general, always with the low-input-density regions slightly over-represented.
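The one-dimensional case can be checked empirically. The small experiment below, with illustrative settings throughout, trains a one-dimensional chain of prototypes on samples from a skewed density; neighboring prototypes should sit closer together where p(x) is large, though not as close as exact density matching would require.

import numpy as np

rng = np.random.default_rng(1)
data = rng.beta(2, 5, size=(5000, 1))   # skewed density on [0, 1]
w = np.sort(rng.random(20))             # 1-D chain of 20 prototypes
for t in range(20000):
    x = data[rng.integers(len(data)), 0]
    eta = 0.3 * (1 - t / 20000)
    sigma = max(2.0 * (1 - t / 20000), 0.5)
    bmu = int(np.argmin(np.abs(w - x)))
    idx = np.arange(len(w))
    h = np.exp(-((idx - bmu) ** 2) / (2 * sigma ** 2))
    w += eta * h * (x - w)
# Prototype spacing is the reciprocal of the magnification factor:
# gaps between neighbors shrink where p(x) is large
print(np.diff(np.sort(w)))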
Property 4: Feature Selection
Given data from an input
space with a non-linear distribution, the self-organizing map is able to select
a set of best features for approximating the underlying distribution. This
property is a natural culmination of properties 1 through 3. Remember how
Principal Component Analysis (PCA) is able to compute the input dimensions
which carry the most variance in the training data. It does this by computing
the eigenvector associated with the largest eigenvalue of the correlation
matrix. PCA is fine if the data really does form a line or plane in input space, but if the data forms a curved line or surface (e.g. a semi-circle), linear PCA is no good; a SOM, however, will overcome the approximation problem by virtue of its topological ordering property. The SOM provides a discrete approximation to so-called principal curves or principal surfaces, and may therefore be viewed as a non-linear generalization of PCA.
SELF-ORGANIZING MAPS IN WEB MINING
· Real problem:
Applying the SOM to natural language data means doing data mining on text data, for instance Web documents. The role of the SOM is to cluster the numerical vectors given at its input and to produce a topologically ordered result. The main problem of the SOM as applied to natural language is the need to handle essentially symbolic input such as words. If we want the SOM to have words as input, the SOM will arrange the words into word categories. But what about the input (training) vector associated with each input word? What should be the vector components, i.e. the attributes of a word? Similarity in word appearance is not related to word meaning, e.g. “window”, “glass”, “widow”.
We have chosen to classify words by SOM, thus creating word category maps. The attributes of the words in our experiments were the counts of the word's occurrences in each document in a collection of documents. Consequently, we have chosen to represent the meaning of each word as related to the meanings of the text passages (documents) containing the word and, symmetrically, the semantic content of a document as a bag-of-words style function of the meanings of the words in the document (a small sketch of this representation is given below).
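As a sketch of this representation, the snippet below builds a term-document count matrix, in which each word is described by its occurrence counts across the collection. The whitespace tokenizer and the tiny corpus are illustrative stand-ins, not data from the experiments described here.

import numpy as np
from collections import Counter

docs = ["the king rode through the forest",
        "the queen spoke to the king",
        "a wolf ran through the forest"]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
word_index = {w: i for i, w in enumerate(vocab)}

# counts[i, j] = number of occurrences of word i in document j
counts = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(tokenized):
    for w, c in Counter(doc).items():
        counts[word_index[w], j] = c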
· Proposed solution
Word category maps (also called self-organizing semantic maps) are SOMs that have been organized according to word similarities, measured by the similarity of the short contexts of the words. The SOM algorithm is based on competitive learning: the artificial neurons of the network gradually become sensitive to different input categories. The neurons form a regular, usually two-dimensional, array, i.e., a map. When an input vector x is processed, the best-matching neuron on the map, i.e., the neuron that is closest to x in the (normally Euclidean) metric, wins the competition and is updated together with its neighborhood according to the adaptation rule (Kohonen 1995). An illustrative end-to-end sketch follows.
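Putting the pieces together, one illustrative way to build such a word category map is to normalize the word vectors from the term-document sketch above and train them with the train_som sketch from the Methodology section, then read off the map node assigned to each word. All names and settings here come from those earlier sketches, not from the original experiments.

import numpy as np

# Normalize each word's count vector to unit length before training
word_vectors = counts / (np.linalg.norm(counts, axis=1, keepdims=True) + 1e-12)
som = train_som(word_vectors, rows=6, cols=6, n_iters=2000)

# Map each word to the grid position of its best-matching neuron;
# similar words should land on the same or neighboring nodes
flat = som.reshape(-1, word_vectors.shape[1])
for w, i in word_index.items():
    bmu = int(np.argmin(np.linalg.norm(flat - word_vectors[i], axis=1)))
    print(w, divmod(bmu, 6))   # word -> (row, col) of its map node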
· Analysis of solution
The approach was tested on the English translation of Grimm fairy tales. In the resulting map the verbs form an area of their own at the top of the map, whereas the nouns can be found in the opposite corner. The modal verbs are in one area. Some semantically oriented ordering can also be found: for instance, the inanimate and animate nouns form separate clusters. Conceptually interrelated words tend to fall into the same or neighboring map nodes, and the nodes may thus be viewed as word categories. The overall organization of a word category map reflects the syntactic categorization of the words. The SOM performs a non-linear mapping from the original multi-dimensional space onto the two-dimensional lattice, which makes it possible to visualize the nonlinearities.
Conclusion
Self-Organizing Maps (unsupervised neural networks) can be used to cluster Web documents into groups according to their semantic context, by applying self-organizing semantic maps organized by word similarities, measured by the similarity of the short contexts of the words.
Reference
Kohonen, T. (1995). Self-Organizing Maps. Springer Series in Information Sciences, Vol. 30. Berlin: Springer.