Self-Organizing Maps in Web Mining: Technical Report
In the name of Allah, the Most Gracious, the Most Merciful
Prepared by:
Mogtapa Eltayip
Abd-Elrhman Mohammed
Index (12)
Supervisor: Abdelhamid Salih, Ph.D.
Abstract
Neural networks are a computational approach based on a large collection of neural units, loosely modeling the way the brain solves problems with large clusters of biological neurons connected by axons. Each neural unit is connected to many others, and the links can be excitatory or inhibitory in their effect on the activation state of the connected neural units.
INTRODUCTION
The Self-Organizing Map (SOM) is one of the most popular neural network models. It belongs to the category of competitive learning networks. The Self-Organizing Map is based on unsupervised learning, which means that no human intervention is needed during learning and that little needs to be known about the characteristics of the input data. We could, for example, use the SOM for clustering data without knowing the class memberships of the input data. The SOM can be used to detect features inherent to the problem and has thus also been called the SOFM, the Self-Organizing Feature Map. The Self-Organizing Map was developed by Professor Teuvo Kohonen and has proven useful in many applications.
The SOM algorithm is based on unsupervised, competitive learning. It provides a topology-preserving mapping from the high-dimensional space to map units. Map units, or neurons, usually form a two-dimensional lattice, so the mapping is a mapping from a high-dimensional space onto a plane. The property of topology preservation means that the mapping preserves the relative distances between points: points that are near each other in the input space are mapped to nearby map units in the SOM. The SOM can thus serve as a cluster-analysis tool for high-dimensional data. The SOM also has the capability to generalize, which means that the network can recognize or characterize inputs it has never encountered before.
Methodology:
The SOM Algorithm
The aim is to learn a feature map from the spatially continuous input space, in which our input vectors live, to the low-dimensional, spatially discrete output space, which is formed by arranging the computational neurons into a grid. The stages of the SOM algorithm that achieve this can be summarized as follows (a minimal implementation sketch is given after the list):
1. Initialization – Choose random values for the initial weight vectors wj.
2. Sampling – Draw a sample training input vector x from the input space.
3. Matching – Find the winning neuron I(x), the neuron whose weight vector is closest to x.
4. Updating – Update the weight vectors of all neurons j in the neighborhood of the winning neuron I(x), moving them toward x.
5. Continuation – Return to step 2 and repeat; if the convergence criterion is met, stop.
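As a concrete illustration of these stages, here is a minimal sketch of the algorithm in Python with NumPy. The grid size, learning-rate schedule, and Gaussian neighborhood function are illustrative choices, not values prescribed in this report.

import numpy as np

def train_som(data, rows=10, cols=10, n_iters=5000,
              eta0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    # 1. Initialization: random initial weight vectors w_j
    weights = rng.random((rows * cols, dim))
    # Grid coordinates of each neuron, used by the neighborhood function
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)],
                    dtype=float)
    for t in range(n_iters):
        # Decay the learning rate and neighborhood width over time
        frac = t / n_iters
        eta = eta0 * (1.0 - frac)
        sigma = sigma0 * np.exp(-3.0 * frac)
        # 2. Sampling: draw a training vector x
        x = data[rng.integers(len(data))]
        # 3. Matching: the winner I(x) minimizes ||x - w_j||
        bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        # 4. Updating: move the winner and its grid neighbors toward x
        d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian neighborhood
        weights += eta * h[:, None] * (x - weights)
        # 5. Continuation: loop until the iteration budget is spent
    return weights.reshape(rows, cols, dim)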
The Architecture of a Self-Organizing Map
We shall concentrate on the SOM system known as a
Kohonen Network. This has a feed-forward structure with a single computational
layer of neurons arranged in rows and columns. Each neuron is fully connected
to all the source units in the input layer.
A one-dimensional map will just have a single row
or column in the computational layer.
Properties of the Feature Map
Once the SOM algorithm has converged, the feature
map displays important statistical characteristics of the input space. Given an
input vector x, the feature map Φ provides a winning neuron
I(x) in the output space, and the weight vector wI(x) provides the coordinates
of the image of that neuron in the input space.
The feature map has four important properties that
we shall look at in turn.
Property 1: Approximation of the Input Space
The feature map Φ, represented by the set of weight vectors {wi} in the output space, provides a good approximation to the input space. We can state the aim of the SOM as storing a large set of input vectors {x} by finding a smaller set of prototypes {wi} so as to provide a good approximation to the original input space. The theoretical basis of this idea is rooted in vector quantization theory, the motivation of which is dimensionality reduction or data compression. In effect, the goodness of the approximation is given by the total squared distance

D = Σ_x ||x − w_I(x)||²

which we wish to minimize. If we work through gradient-descent-style mathematics we end up with the SOM weight update algorithm, which confirms that it is generating a good approximation to the input space.
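The quantity D can be computed directly. The short sketch below is one way to do so, assuming the weight vectors have been flattened to a matrix with one row per map unit (as in the training sketch above).

import numpy as np

def total_squared_distance(data, weights):
    # For each x, find its winning neuron and accumulate ||x - w_I(x)||^2
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return float(np.sum(np.min(dists, axis=1) ** 2))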
Property 2: Topological Ordering
The feature map Φ computed by the SOM
algorithm is topologically ordered in the sense that the spatial location of a
neuron in the output lattice/grid corresponds to a particular domain or feature
of the input patterns.
The topological ordering property is a direct
consequence of the weight update equation that forces the weight vector wI(x)
of the winning neuron I(x) to move toward the input vector x. The crucial
factor is that the weight updates also move the weight vectors wj of the
closest neighboring neurons j along with the winning neuron I(x). Together these weight changes cause the whole output space to become appropriately ordered. We can visualize the feature map Φ as an elastic or virtual net with a grid-like topology. Each output node can be represented in the input space at coordinates given by its weights. Then, if neighboring nodes in the output space have their corresponding points in the input space connected together, the resulting image of the output grid reveals directly the topological ordering at each stage of the network training (one way to produce such a picture is sketched below).
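One way to draw this elastic net, assuming two-dimensional input data and a weight array shaped like the output of the train_som sketch above, is the following; matplotlib is an illustrative choice, not something this report specifies.

import matplotlib.pyplot as plt

def plot_elastic_net(weights):
    # weights has shape (rows, cols, 2): each neuron is a point in input
    # space; lines connect neurons that are neighbors on the output grid
    rows, cols, _ = weights.shape
    for r in range(rows):                       # horizontal grid links
        plt.plot(weights[r, :, 0], weights[r, :, 1], 'k-', lw=0.5)
    for c in range(cols):                       # vertical grid links
        plt.plot(weights[:, c, 0], weights[:, c, 1], 'k-', lw=0.5)
    plt.scatter(weights[..., 0], weights[..., 1], s=10, c='red')
    plt.title('SOM weight vectors connected by grid topology')
    plt.show()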
Property 3: Density Matching
The feature map Φ reflects variations in the statistics of the input distribution:
regions in the input space from which the sample training vectors x are drawn
with high probability of occurrence are mapped onto larger domains of the
output space, and therefore with better resolution than regions of input space
from which training vectors are drawn with low probability. We need to relate
the input vector probability distribution p(x) to the magnification factor m(x)
of the feature map. Generally, for two-dimensional feature maps the relation cannot be expressed as a simple function, but in one dimension we can show that

m(x) ∝ p(x)^(2/3)

so the SOM algorithm does not match the input density exactly, because of the power of 2/3 rather than 1. Computer simulations indicate similar approximate density matching in general, always with the low-input-density regions slightly over-represented.
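The one-dimensional case can be checked empirically. The small experiment below, with illustrative settings throughout, trains a one-dimensional chain of prototypes on samples from a skewed density; neighboring prototypes should sit closer together where p(x) is large, though not as close as exact density matching would require.

import numpy as np

rng = np.random.default_rng(1)
data = rng.beta(2, 5, size=(5000, 1))   # skewed density on [0, 1]
w = np.sort(rng.random(20))             # 1-D chain of 20 prototypes
for t in range(20000):
    x = data[rng.integers(len(data)), 0]
    eta = 0.3 * (1 - t / 20000)
    sigma = max(2.0 * (1 - t / 20000), 0.5)
    bmu = int(np.argmin(np.abs(w - x)))
    idx = np.arange(len(w))
    h = np.exp(-((idx - bmu) ** 2) / (2 * sigma ** 2))
    w += eta * h * (x - w)
# Prototype spacing is the reciprocal of the magnification factor:
# gaps between neighbors shrink where p(x) is large
print(np.diff(np.sort(w)))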
Property 4: Feature Selection
Given data from an input
space with a non-linear distribution, the self-organizing map is able to select
a set of best features for approximating the underlying distribution. This
property is a natural culmination of properties 1 through 3. Remember how
Principal Component Analysis (PCA) is able to compute the input dimensions
which carry the most variance in the training data. It does this by computing
the eigenvector associated with the largest eigenvalue of the correlation
matrix. PCA is fine if the data really does form a line or plane in input space, but if the data forms a curved line or surface (e.g. a semi-circle), linear PCA is no good; a SOM, however, will overcome the approximation problem by virtue of its topological ordering property. The SOM provides a discrete approximation to so-called principal curves or principal surfaces, and may therefore be viewed as a non-linear generalization of PCA.
SELF-ORGANIZING MAPS IN WEB MINING
· Real problem:
Applying the SOM to natural language data means doing data mining on text data, for instance Web documents. The role of the SOM is to cluster the numerical vectors given at its input and to produce a topologically ordered result. The main problem of the SOM as applied to natural language is the need to handle essentially symbolic input such as words. If we want the SOM to have words as input, the SOM will arrange the words into word categories. But what about the input (training) vector associated with each input word? What should be the vector components, i.e. the attributes of a word? Similarity in word appearance is not related to word meaning, e.g. “window”, “glass”, “widow”.
We have chosen to classify words by SOM, thus creating word category maps. The attributes of the words in our experiments were the counts of the word's occurrences in each document in a collection of documents. Consequently, we have chosen to represent the meaning of each word as related to the meanings of the text passages (documents) containing the word and, symmetrically, the semantic content of a document as a bag-of-words style function of the meanings of the words in the document (a small sketch of this representation is given below).
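As a sketch of this representation, the snippet below builds a term-document count matrix, in which each word is described by its occurrence counts across the collection. The whitespace tokenizer and the tiny corpus are illustrative stand-ins, not data from the experiments described here.

import numpy as np
from collections import Counter

docs = ["the king rode through the forest",
        "the queen spoke to the king",
        "a wolf ran through the forest"]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
word_index = {w: i for i, w in enumerate(vocab)}

# counts[i, j] = number of occurrences of word i in document j
counts = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(tokenized):
    for w, c in Counter(doc).items():
        counts[word_index[w], j] = c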
· Proposed solution
Word category maps (also called self-organizing semantic maps) are SOMs that have been organized according to word similarities, measured by the similarity of the short contexts of the words. The SOM algorithm is based on competitive learning: the artificial neurons of the network gradually become sensitive to different input categories. The neurons form a regular, usually two-dimensional, array, i.e., a map. When an input vector x is processed, the best-matching neuron on the map, i.e., the neuron that is closest to x in the (normally Euclidean) metric, wins the competition and is updated together with its neighborhood according to the adaptation rule (Kohonen 1995). An illustrative end-to-end sketch follows.
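Putting the pieces together, one illustrative way to build such a word category map is to normalize the word vectors from the term-document sketch above and train them with the train_som sketch from the Methodology section, then read off the map node assigned to each word. All names and settings here come from those earlier sketches, not from the original experiments.

import numpy as np

# Normalize each word's count vector to unit length before training
word_vectors = counts / (np.linalg.norm(counts, axis=1, keepdims=True) + 1e-12)
som = train_som(word_vectors, rows=6, cols=6, n_iters=2000)

# Map each word to the grid position of its best-matching neuron;
# similar words should land on the same or neighboring nodes
flat = som.reshape(-1, word_vectors.shape[1])
for w, i in word_index.items():
    bmu = int(np.argmin(np.linalg.norm(flat - word_vectors[i], axis=1)))
    print(w, divmod(bmu, 6))   # word -> (row, col) of its map node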
· Analysis of solution
The approach was tested on the English translation of Grimm fairy tales. In the resulting map the verbs form an area of their own at the top of the map, whereas the nouns can be found in the opposite corner. The modal verbs are in one area. Some semantically oriented ordering can also be found: for instance, the inanimate and animate nouns form separate clusters. Conceptually interrelated words tend to fall into the same or neighboring map nodes, and the nodes may thus be viewed as word categories. The overall organization of a word category map reflects the syntactic categorization of the words. The SOM performs a non-linear mapping from the original multi-dimensional space onto the two-dimensional lattice, which makes it possible to visualize the nonlinearities.
Conclusion
Self-Organizing Maps (unsupervised neural networks) can be used to cluster Web documents into groups according to their semantic context, by applying self-organizing semantic maps organized by word similarities, measured by the similarity of the short contexts of the words.
Reference
Kohonen, T. (1995). Self-Organizing Maps. Springer Series in Information Sciences, Vol. 30. Berlin: Springer.