
Relational Topological Clustering

Lazhar Labiod, Nistor Grozavu and Younes Bennani

Abstract— This paper introduces a new topological clustering formalism, dedicated to categorical data arising in the form of a binary matrix or a sum of binary matrices. The proposed approach is based on the principle of Kohonen's model (conservation of topological order) and uses the Relational Analysis formalism by optimizing a cost function defined as a Condorcet criterion. We propose a hybrid algorithm which scales linearly with large datasets, provides a natural identification of the clusters, and allows a visualization of the clustering result on a two-dimensional grid while preserving the a priori topological order of the data. The proposed approach, called RTC, was validated on several datasets and the experimental results showed very promising performances.

I. INTRODUCTION

In the exploratory analysis of high dimensional data, one of the main tasks is the formation of a simplified, usually visual, overview of data sets. This can be achieved through simplified descriptions or summaries, which should provide the possibility to discover the most relevant features or patterns. Clustering and projection are among the examples of useful methods to achieve this task. On one hand, classical clustering algorithms produce a grouping of the data according to a chosen criterion. Projection methods, on the other hand, represent the data points in a lower dimensional space in such a way that the clusters and the metric relations of the data items are preserved as faithfully as possible. In this field, most algorithms use similarity measures based on the Euclidean distance. However, there are several types of data for which the use of this measure is not adequate. This is the case with categorical data since, generally, there is no known ordering between the feature values. In this work, we present a new formalism that can be applied to this type of data and simultaneously achieves both tasks, clustering and visualization.

The goal of the Relational Analysis approach ([18],[17]) as a clustering technique, like Kohonen's self-organizing map (SOM) [9], is to provide a solution which summarizes the dataset. Most clustering techniques seek to gather similar data, but not all of them allow a visualization of the clustering. The concept of conservation of the a priori topological data structure constitutes the contribution of the self-organizing map to the field of clustering, and it responds to this challenge.

In order to visualize the partition obtained by the Relational Analysis approach, Marcotorchino proposed a methodology called "Relational Factorial Analysis" [19], which combines relational analysis for the clustering and factorial analysis for the visualization of the partition on the factorial planes. It is a juxtaposition of the two methods. The methodology presented here combines the relational analysis approach and the SOM principle through a formalism specific to this methodology. The proposed model achieves data clustering and visualization simultaneously: indeed, it automatically provides a natural partition (i.e. without fixing a priori the number of clusters or the size of each cluster) and a self-organization of the clusters on a two-dimensional map, while preserving the a priori topological data structure (i.e. two close clusters on the map consist of close observations in the input space).

Lazhar Labiod, Nistor Grozavu and Younès Bennani are with the LIPN Lab, Université Paris 13, 99, avenue Jean-Baptiste Clément, 93430 Villetaneuse, France, email: [email protected]

The various alternatives of Kohonen's algorithm are based on the Euclidean distance, which is not adapted to binary data. We present in this article a new model adapted to binary data, which takes as a starting point the principle of Kohonen's model (conservation of topological order) and uses the Relational Analysis formalism by optimizing a modified Condorcet criterion. The Condorcet criterion is based on the principle of taking into account the common similarities and dissimilarities of each pair of objects.

Various methods based on the principle of the SOM model have been proposed in the literature for binary data processing: probabilistic methods and other quantization techniques. Most of these methods operate on the data after a preliminary transformation step, in order to find a continuous representation of the data, and then apply the SOM model, like KACM [16] and the approach suggested by Leich et al. [15]. These methods destroy the binary nature of the data; in other words, they violate the structure of the data to meet the requirements of the method. In [12] the authors propose the BTM method (binary topological map), which operates directly on binary data based on the Hamming distance. In [13],[14] a probabilistic version of the SOM model is proposed, based on the Bernoulli distribution adapted to binary data (BeSOM).

This paper is organized in the following way: in Section 2 we present the Relational Analysis approach and the Condorcean clustering technique, and Section 3 presents the classical self-organizing maps model. We show in Section 4 the formalism of the topological clustering problem in a relational framework and the proposed "Batch RTC" algorithm. Section 5 shows the experimental results and some perspectives related to the proposed approach.

II. RELATIONAL ANALYSIS APPROACH

Relational Analysis was developed in 1977 by F. Marcotorchino and P. Michaud, inspired by the work of the Marquis de Condorcet, who in the 18th century was interested in deriving the result of a collective vote from the individual votes. This methodology is based on the relational representation (pairwise comparison) of data objects and the optimization under constraints of the Condorcet criterion.

WCCI 2010 IEEE World Congress on Computational Intelligence, July 18-23, 2010 - CCIB, Barcelona, Spain (IJCNN). 978-1-4244-8126-2/10/$26.00 (c) 2010 IEEE

Generally, the objective function corresponds to the criterion which measures the adequacy of the solution to the data. The choice of this criterion is a fundamental point, since it induces the nature of the resemblance intensity that we want to emphasize. Among a vast range of criteria, the relational approach makes it possible to choose the one best answering the problem arising from the involved data. Some criteria operate on binary data, others are appropriate for frequency data; but most of them are based on a majority rule which determines the level of the threshold value. Beyond this threshold, it is considered that two objects are assigned to the same cluster. Let us recall that one of the major advantages of the relational approach resides in the fact that the number of clusters need not be fixed a priori. This characteristic parameter of the solution results directly from the processing (in an unsupervised way).

A. Condorcean clustering approach

Condorcean clustering is the first problem which was handled by Relational Analysis. From the definition of individual relational tables, the basic matrix $C$ of the pairwise comparison between objects, called the "Condorcet matrix", is equal to the sum of the individual relational tables. In this type of matrix, the only information taken into account for two objects is the presence or the absence of the same modality for a given variable. To simplify the presentation of the Condorcean clustering method, let us consider the case of a complete disjunctive table. Suppose we have a dataset with a set $I$ of $N$ objects described by a set $V$ of $M$ categorical attributes (or variables) $V^1, V^2, \dots, V^k, \dots, V^M$, each one having $m_1, \dots, m_k, \dots, m_M$ modalities respectively. Let us denote by $m$ the total number of modalities of all variables. We put these data in the form of a complete disjunctive table $K$ with general term $k_{ij}$ such that:

$$k_{ij} = \begin{cases} 1 & \text{if object } i \text{ has modality } j \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

1) Condorcet Table: Each variable $V^k$ induces an equivalence relation on the set $I$ of objects. This equivalence relation can be represented in the shape of a binary relational table $C^k$ whose general term is $c^k_{ii'} = \sum_{j=1}^{m_k} k_{ij} k_{i'j}$. The table $C$ is equal to the sum of the individual tables $(C^k, k = 1, \dots, M)$ and its general term, denoted $c_{ii'}$, is defined by:

$$c_{ii'} = \sum_{k=1}^{M} c^k_{ii'} = \sum_{j=1}^{m} k_{ij} k_{i'j}, \qquad C = KK'$$

where $K'$ indicates the transpose of the matrix $K$. This table has the property that $c_{ii'} \le \min(c_{ii}, c_{i'i'}) = M$, where $c_{ii}$ (respectively $c_{i'i'}$) is the self-similarity of object $i$ (respectively $i'$). This expression means that the similarity between two objects $i, i'$ is lower than or equal to the minimum of their own similarities. We will denote by $\bar{C}$ the table complementary to the maximum possible similarity between two objects $i, i'$; its general term is $\bar{c}_{ii'} = M - c_{ii'}$.

Table I shows different coding forms for a qualitative dataset containing 5 objects measured on one qualitative variable with 3 modalities.

    Linear coding      Disjunctive coding        Relational coding
         V1             V^1_1 V^1_2 V^1_3        i1 i2 i3 i4 i5
    i1   1            i1   1    0    0       i1   1  0  1  0  0
    i2   2            i2   0    1    0       i2   0  1  0  1  0
    i3   1            i3   1    0    0       i3   1  0  1  0  0
    i4   2            i4   0    1    0       i4   0  1  0  1  0
    i5   3            i5   0    0    1       i5   0  0  0  0  1

TABLE I
LINEAR CODING - DISJUNCTIVE CODING - RELATIONAL CODING
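The three codings of Table I can be reproduced with a short sketch (our own illustration, not from the paper; all names are ours):

```python
# Linear coding of the 5 objects on one variable with 3 modalities (Table I).
linear = [1, 2, 1, 2, 3]

# Disjunctive coding K: one-hot encoding of the modalities, eq. (1).
K = [[1 if j + 1 == v else 0 for j in range(3)] for v in linear]

# Relational coding: C = KK' (with M = 1 variable, the entries are 0/1).
C = [[sum(a * b for a, b in zip(ki, kp)) for kp in K] for ki in K]

print(K)     # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(C[0])  # [1, 0, 1, 0, 0]
```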

2) Program to solve: Mathematically, the problem of the Condorcean clustering of a set $I$ into $L$ clusters arises in the form of the following integer linear program:

$$\max_{X}\ R_{RA}(C, X)$$

with $X = \{x_{ii'}\}_{i,i':1\dots N}$ an equivalence relation defined on $I \times I$, where

$$R_{RA}(C, X) = \sum_{i=1}^{N}\sum_{i'=1}^{N} (c_{ii'} - \bar{c}_{ii'})\, x_{ii'} + \sum_{i=1}^{N}\sum_{i'=1}^{N} \bar{c}_{ii'} = \sum_{i=1}^{N}\sum_{i'=1}^{N} \Big(2 c_{ii'} - \frac{c_{ii} + c_{i'i'}}{2}\Big) x_{ii'} + \beta \quad (2)$$

where $\beta = (MN^2 - \sum_{i=1}^{N}\sum_{i'=1}^{N} c_{ii'})$ is a constant term, and $X$ is the reached solution, which models a partition in a relational space (an equivalence relation) and must satisfy the following properties:

$$x_{ii} = 1,\ \forall i \quad \text{(reflexivity)}$$
$$x_{ii'} - x_{i'i} = 0,\ \forall (i, i') \quad \text{(symmetry)}$$
$$x_{ii'} + x_{i'i''} - x_{ii''} \le 1,\ \forall (i, i', i'') \quad \text{(transitivity)}$$
$$x_{ii'} \in \{0, 1\},\ \forall (i, i') \quad \text{(binarity)}$$
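For very small $N$, the program above can be solved exactly by exhaustive search, since every partition of $I$ induces an equivalence relation $x_{ii'}$. A brute-force sketch on the Table I data (our own illustration; with $M = 1$, $\bar{c}_{ii'} = 1 - c_{ii'}$):

```python
def partitions(items):
    """Enumerate all set partitions of a list of objects."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part                      # `first` alone in a new cluster
        for k in range(len(part)):                  # or added to an existing cluster
            yield part[:k] + [part[k] + [first]] + part[k + 1:]

# Condorcet matrix of the Table I example (M = 1 variable).
C = [[1,0,1,0,0],[0,1,0,1,0],[1,0,1,0,0],[0,1,0,1,0],[0,0,0,0,1]]
M = 1

def criterion(part):
    """Sum of (c - c_bar) over within-cluster pairs: eq. (2) up to the constant beta."""
    return sum(C[i][j] - (M - C[i][j])
               for cluster in part for i in cluster for j in cluster)

best = max(partitions(list(range(5))), key=criterion)
print(sorted(sorted(c) for c in best))  # [[0, 2], [1, 3], [4]]
```

On this toy example the optimum groups the objects sharing a modality, without the number of clusters being fixed in advance.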

From this last writing of the Condorcet criterion, we propose a parameterized version by introducing a similarity threshold $\alpha \in [0, 1]$ in the following way:

$$R_{RA}(C, X) = \sum_{i=1}^{N}\sum_{i'=1}^{N} \Big[c_{ii'} - \alpha \frac{c_{ii} + c_{i'i'}}{2}\Big] x_{ii'} \quad (3)$$

Let us recall that the Condorcet criterion rests on the concept of majority, i.e. two objects $i, i'$ will a priori be assigned to the same cluster if and only if:

$$c_{ii'} \ge \alpha \frac{c_{ii} + c_{i'i'}}{2} \quad (4)$$

This means that two objects $i, i'$ have a high probability of being in the same cluster if their similarity $c_{ii'}$ is greater than or equal to $\alpha$ multiplied by the mean of their own similarities, $\alpha \frac{c_{ii} + c_{i'i'}}{2}$. This new formulation of the Condorcet criterion is motivated by the fact that, when $\alpha = \frac{1}{2}$, this condition is not always easy to reach. Indeed, when the number of variables is very high compared to the number of objects, i.e. $M \gg N$, the Relational Analysis heuristic has a tendency to provide a high number of clusters for the final partition, which deprives the clustering of its practical purpose.


3) Contributions computation: Let us consider $\mathcal{C} = \{C_1, \dots, C_l, \dots, C_L\}$ a partition of the set $I$ into $L$ clusters. The Condorcet criterion breaks up into contribution terms, where the contribution $cont(i, l)$ of an object $i$ to a cluster $C_l$ of the partition is written:

$$cont(i, l) = \sum_{i' \in C_l} \Big[c_{ii'} - \alpha \frac{c_{ii} + c_{i'i'}}{2}\Big] \quad (5)$$

hence,

$$R_{RA}(C, X) = \sum_{i=1}^{N}\sum_{l=1}^{L} cont(i, l) \quad (6)$$

which we can express in terms of the object profile $K_i$, representing the $i$th row of the complete disjunctive table $K$, and $P_l$, the prototype of cluster $C_l$, in the following way:

$$c_{ii'} = \langle K_i, K_{i'} \rangle \quad \text{and} \quad P_l = \sum_{i' \in C_l} K_{i'} \quad (7)$$

Then, we have

$$cont(K_i, P_l) = \langle K_i, P_l \rangle - \alpha S_{il} \quad (8)$$

where $S_{il} = \frac{|C_l| \langle K_i, K_i \rangle + \sum_{i' \in C_l} \langle K_{i'}, K_{i'} \rangle}{2}$. This new formula of the contribution avoids the computation of the square matrices $C$ and $\bar{C}$ (the Condorcet matrix and its complement), which considerably reduces the computational cost related to the contributions computation.
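A small numeric check of the profile form (8) against the pairwise form (5), on the Table I data (our own sketch; the value of $\alpha$ and the names are ours):

```python
K = [[1,0,0],[0,1,0],[1,0,0],[0,1,0],[0,0,1]]  # disjunctive table of Table I
alpha = 0.5

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cont_profile(i, cluster):
    """cont(K_i, P_l) = <K_i, P_l> - alpha * S_il, eq. (8): no square matrix needed."""
    P = [sum(K[j][d] for j in cluster) for d in range(len(K[0]))]  # prototype, eq. (7)
    S = (len(cluster) * dot(K[i], K[i]) + sum(dot(K[j], K[j]) for j in cluster)) / 2
    return dot(K[i], P) - alpha * S

def cont_pairwise(i, cluster):
    """Direct pairwise form, eq. (5), using c_ii' = <K_i, K_i'>."""
    return sum(dot(K[i], K[j]) - alpha * (dot(K[i], K[i]) + dot(K[j], K[j])) / 2
               for j in cluster)

print(cont_profile(0, [0, 2]), cont_pairwise(0, [0, 2]))  # 1.0 1.0
```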

The Condorcean clustering problem is related to (0-1) linear programming. The number of variables is $N \times N$ and the number of constraints is of the order of $O(N^3)$. Theoretically, the problem can be solved exactly by a linear programming technique, but unfortunately only for small problems where the size $N \le 100$; hence only the heuristic approach can deal with large datasets.

B. Relational Analysis heuristic

The heuristic process consists in starting from an initial cluster (a singleton cluster) and building, in an incremental way, a partition of the set $I$ by increasing the value of the Condorcet criterion $R_{RA}(C, X)$ at each assignment. We give below the description of the relational analysis algorithm used by the Relational Analysis methodology (see Marcotorchino and Michaud for further details). The presented algorithm aims at maximizing the criterion given in (3) based on the contribution computation.

Algorithm 1: RA heuristic
Inputs: $L_{max}$ = maximal number of clusters, $N_{iter}$ = number of iterations, $N$ = number of examples (objects), $\alpha$ = similarity threshold
- Take the first object as the first element of the first cluster.
- $l = 1$, where $l$ is the current number of clusters
for t = 1 to $N_{iter}$ do
  for i = 1 to N do
    for j = 1 to l do
      compute the contribution of object $i$: $cont(i, j)$
    end for
    $l^* = \arg\max_j cont(i, j)$, where $l^*$ is the cluster id which has the highest contribution for the object $i$
    $cont(i, l^*) \leftarrow$ the computed contribution
    if $cont(i, l^*) < 0$ and $l < L_{max}$ then
      create a new cluster where the object $i$ is the first element
      $l \leftarrow l + 1$
    else
      assign object $i$ to cluster $C_{l^*}$
    end if
  end for
end for
Output: at most $L_{max}$ clusters

We have to fix a number of iterations and the similarity threshold in order to obtain an approximate solution in a reasonable processing time. Besides, a maximum number of clusters is also required; but since we do not need to fix this parameter, we set by default $L_{max} = N$. Basically, this algorithm has $O(N_{iter} \times L_{max} \times N)$ computational cost. In general, we can assume that $N_{iter} \ll N$, but not $L_{max} \ll N$. Thus, in the worst case, the algorithm has $O(L_{max} \times N)$ computational cost.
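The RA heuristic can be rendered in a few lines of Python (our own sketch, not the authors' code; as a simplifying choice we remove each object from its current cluster before recomputing its contributions, and drop clusters that become empty):

```python
def ra_heuristic(K, alpha=0.5, n_iter=5, l_max=None):
    """Greedy maximization of the Condorcet criterion (3) via contributions, eq. (8)."""
    N = len(K)
    l_max = N if l_max is None else l_max           # default Lmax = N
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))

    def cont(i, cluster):
        P = [sum(K[j][d] for j in cluster) for d in range(len(K[0]))]
        S = (len(cluster) * dot(K[i], K[i]) + sum(dot(K[j], K[j]) for j in cluster)) / 2
        return dot(K[i], P) - alpha * S

    clusters = [[0]]                                # first object seeds the first cluster
    for _ in range(n_iter):
        for i in range(N):
            for c in clusters:                      # remove i from its current cluster
                if i in c:
                    c.remove(i)
            clusters = [c for c in clusters if c]
            scores = [cont(i, c) for c in clusters]
            if scores and max(scores) >= 0:
                clusters[scores.index(max(scores))].append(i)   # best existing cluster
            elif len(clusters) < l_max:
                clusters.append([i])                # negative contribution: new cluster
            else:
                clusters[scores.index(max(scores))].append(i)
    return clusters

K = [[1,0,0],[0,1,0],[1,0,0],[0,1,0],[0,0,1]]       # Table I data
print(sorted(sorted(c) for c in ra_heuristic(K)))   # [[0, 2], [1, 3], [4]]
```

On the toy data the heuristic recovers the same natural partition as the exhaustive search, without the number of clusters being given in advance.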

III. SELF-ORGANIZING MAP

The model called Kohonen's self-organizing map (SOM) is an artificial neural network which learns to model a data space $(Z, z_i \in \mathbb{R}^d)$, also called the set of observations (objects), by a set of prototypes $(W, w_l \in \mathbb{R}^d)$ (the neurons); observations and neurons are vectors of the input space. If the network consists of $L$ neurons, the SOM technique provides a partition of the input space into $L$ clusters, where the number of observations $N \gg L$. Each neuron $l$ is associated with a weight vector $w_l$ which belongs to the input space. Thus, for a set of observations, the network learns the positions in this space of $L$ centers. For example, in the trivial case where $L = N$, the best possible partition is obviously the discrete partition where each observation is isolated in its own cluster (the center of each cluster corresponds to the observation forming the cluster), which minimizes the distance to all data objects.


The modelling quality depends on the metric used in the vector space. We use the Euclidean distance to measure the distance between an observation and a prototype (two vectors). In addition to modelling the inputs through prototypes, a self-organizing map $C$ allows one to build a graph $G$ structuring this space, and provides a visualization in one or two dimensions of the topological links between clusters. It should be remembered that the Kohonen network is not a simple clustering algorithm; it is a model that seeks to project multidimensional observations on a discrete space (the map $C$) of small dimension (usually 1, 2 or 3). This projection has to respect the property of "conservation" of the topology of the data, i.e. two neurons $l, r$ which are neighbors on the discrete topological map must be associated with two prototypes $w_l, w_r$ that are close with respect to the Euclidean distance in the observation space.

The map $C$ is in the form of an undirected graph $G = (C, A)$, where $C$ refers to the $L$ vertices (neurons) and $A$ is the set of edges that gives the organization of the neurons on the map $C$. Thus, two neurons $l, r$ are directly connected neighbors in the map if $(l, r) \in A$. This graph induces a discrete distance $\delta$ on the map: for any pair of neurons $(l, r)$ of the map, the distance $\delta(l, r)$ is defined as the length of the shortest path between $l$ and $r$. For every neuron $c$, this distance determines the neighborhood of order $d$ of $c$ as follows:

$$V_c(d) = \{l \in C,\ \delta(c, l) \le d\} \quad (9)$$

This notion of neighborhood can be formalized using a kernel function $K$ defined from $\mathbb{R}^+$ to $\mathbb{R}^+$, decreasing and such that $K(0) = 1$ and $\lim_{x \to \infty} K(x) = 0$ (in practice we use $K(x) = e^{-x^2}$). This function generates a family of functions $K_T$, defined by $K_T(x) = K(x/T)$. The parameter $T$ is analogous to a temperature: when $T$ is high, $K_T(x)$ remains close to 1 even for large values of $x$; on the contrary, a low value produces a $K_T$ function which decreases quickly to 0. The role of $K_T$ is to transform the discrete distance $\delta$ induced by the structure of the graph into a regular neighborhood parameterized by $T$. We will use $K_T(\delta(l, r))$ as a measure of the effective closeness between neurons $l$ and $r$. During the SOM algorithm, the value of $T$ decreases to stabilize the solution.

The quality of the partition and the topology conservation are measured using the objective function $R^T_{SOM}(\varphi, W)$, which must be as low as possible:

$$R^T_{SOM}(\varphi, W) = \sum_{i=1}^{N}\sum_{l=1}^{L} K_T(\delta(\varphi(i), l))\, \|z_i - w_l\|^2 \quad (10)$$

where $\varphi$ represents the assignment function such that $\varphi(i) = l$ if $i \in C_l$. To simplify the analysis, we consider here the batch version of the SOM algorithm: the assignment step minimizes the function $R^T_{SOM}(\varphi, W)$ by considering the prototypes as fixed, and the representation step minimizes the same function by considering the clusters as fixed. For a fixed temperature $T$, the minimization is carried out in two steps realized alternately during successive iterations.

The batch algorithm of the self-organizing map is therefore as follows:

Algorithm 2: Batch SOM algorithm with a fixed T
Randomly generate a set of prototypes $W^0$ and fix $N_{iter}$ {Initialization}
for t = 1 to $N_{iter}$ do
  for i = 1 to N do {Assignment}
    assign observation $i$ according to:
      $\varphi^t(i) = \arg\min_l \sum_{r=1}^{L} K_T(\delta(r, l))\, \|z_i - w^{t-1}_r\|^2$
  end for
  for l = 1 to L do {Update}
    update prototypes:
      $w^t_l = \frac{\sum_{i=1}^{N} K_T(\delta(\varphi^t(i), l))\, z_i}{\sum_{i=1}^{N} K_T(\delta(\varphi^t(i), l))}$
  end for
end for
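A compact sketch of the batch SOM above (our own illustration on toy numeric data; we assume a 1-D map of $L$ neurons with $\delta(l, r) = |l - r|$ and $K_T(x) = e^{-(x/T)^2}$):

```python
import math, random

def batch_som(Z, L=3, T=0.5, n_iter=10, seed=0):
    """Batch SOM with a fixed temperature T: alternating assignment and update."""
    rng = random.Random(seed)
    d = len(Z[0])
    W = [[rng.random() for _ in range(d)] for _ in range(L)]
    KT = [[math.exp(-(abs(l - r) / T) ** 2) for r in range(L)] for l in range(L)]
    sq = lambda z, w: sum((a - b) ** 2 for a, b in zip(z, w))
    phi = [0] * len(Z)
    for _ in range(n_iter):
        # Assignment: phi(i) = argmin_l sum_r K_T(delta(r, l)) ||z_i - w_r||^2
        phi = [min(range(L), key=lambda l: sum(KT[l][r] * sq(z, W[r]) for r in range(L)))
               for z in Z]
        # Update: each prototype becomes the kernel-weighted mean of the observations
        for l in range(L):
            w = [KT[phi[i]][l] for i in range(len(Z))]
            W[l] = [sum(wi * Z[i][k] for i, wi in enumerate(w)) / sum(w)
                    for k in range(d)]
    return W, phi

def som_energy(Z, W, phi, T=0.5):
    """Objective (10): sum_i sum_l K_T(delta(phi(i), l)) ||z_i - w_l||^2."""
    return sum(math.exp(-(abs(phi[i] - l) / T) ** 2) *
               sum((a - b) ** 2 for a, b in zip(Z[i], W[l]))
               for i in range(len(Z)) for l in range(len(W)))

Z = [[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]]
W, phi = batch_som(Z)
```

At a fixed $T$, each of the two steps exactly minimizes (10) with the other argument held fixed, so the objective never increases across iterations; this is easy to check by comparing the energy after 1 and after 10 iterations with the same seed.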

IV. RELATIONAL TOPOLOGICAL CLUSTERING (RTC)

Similarly to the traditional model of the self-organizing map (SOM), we use for the proposed RTC model an artificial neural network with an input layer for the observations (data) and a map $C$ having a topological order as the output. The topology of the map is defined via an undirected graph. Like the SOM algorithm, the RTC model includes a vector quantization procedure. During this procedure, each neuron of the map, which is the index of a prototype for the required quantization, is represented by a vector of the same dimension as the observations. Contrary to the SOM approach, the quantization is done by means of an assignment function $\varphi$ adapted to binary data; the choice of the prototypes and of the assignment function is done by maximizing an objective function denoted $R^T_{RTC}(\varphi, P)$. The maximization must allow, on the one hand, to define prototypes making it possible to carry out a conservation of the data topology (defined by a measure of contribution) and, on the other hand, to carry out a partition of the set $I$ into homogeneous subsets.

The basic idea of the RTC approach is to maximize a new objective function defined from the classical RA criterion $R_{RA}$ by adding a regularization term $R_{Topo}$, which introduces a topological constraint. The RTC objective function is as follows:

$$R^T_{RTC}(\varphi, \mathcal{X}) = R_{RA}(C, X) + R_{Topo}(\varphi, \mathcal{X}) \quad (11)$$


where

$$R_{RA}(C, X) = \sum_{i,i'=1}^{N} \Psi_{ii'}\, x_{ii'} \quad (12)$$

and

$$R_{Topo}(\varphi, \mathcal{X}) = \sum_{i,i'=1}^{N} \Psi_{ii'} \sum_{l=1}^{L} K_T(\delta(\varphi(i), l))\, X_{i'l}\, \bar{X}_{il} \quad (13)$$

where $\forall i, i'$: $\Psi_{ii'} = c_{ii'} - \alpha \frac{c_{ii} + c_{i'i'}}{2}$; $X_{il}$ is the general term of the partition matrix $\mathcal{X}$ of the set $I$ into $L$ clusters, such that $X_{il} \in \{0, 1\}$, $\sum_{l=1}^{L} X_{il} = 1$ and $\bar{X}_{il} = 1 - X_{il}$; and $\forall i, i'$: $x_{ii'} = \sum_l X_{il} X_{i'l}$, which is the general term of the equivalence relation $X$.

This function breaks up into two terms. The first one corresponds to the Condorcet criterion $R_{RA}(C, X)$, whose maximization makes it possible to obtain a partition of $I$ that is as compact as possible within the meaning of the Condorcet criterion. The second term makes it possible to take into account the influence of the neighborhood between a neuron and its neighbors on the map $C$. It makes it possible to bring closer the partitions corresponding to two different neurons on the map, in order to preserve the topological order between the various partitions. Indeed, the second term imposes on the prototype of the neuron $l$ to represent objects belonging to nearby neurons: a small value of $[\sum_{i'=1}^{N} \Psi_{ii'} K_T(\delta(\varphi(i), l)) X_{i'l}]$ penalizes the maximization of the objective function more when the neuron $l$ is close to the neuron $\varphi(i)$ on the map $C$.

The temperature $T$ adjusts the relative importance granted to both terms. Indeed, for large values of the temperature the second term is dominant, and in this case the priority is given to the topology. The smaller $T$ is, the more the first term is taken into account and the priority is given to the determination of prototypes representing the compact partition; the RTC approach then acts exactly like the Condorcean method. It is thus possible to say that the relational topological map model makes it possible to obtain a regularized version of the solution obtained by the Condorcean method, where the regularization is obtained by respecting the a priori topological data structure.

The development of both terms (12) and (13) leads to the following expression of the objective function:

$$R^T_{RTC}(\varphi, \mathcal{X}) = \sum_{i,i'=1}^{N} \Psi_{ii'} \sum_{l=1}^{L} K_T(\delta(l, l))\, X_{i'l} X_{il} + \sum_{i,i'=1}^{N} \Psi_{ii'} \sum_{l=1}^{L} K_T(\delta(\varphi(i), l))\, X_{i'l} \bar{X}_{il}$$
$$= \sum_{i,i'=1}^{N} \Psi_{ii'} \sum_{l=1}^{L} K_T(\delta(\varphi(i), l))\, X_{i'l} (X_{il} + \bar{X}_{il})$$
$$= \sum_{i,i'=1}^{N} \Psi_{ii'} \sum_{l=1}^{L} K_T(\delta(\varphi(i), l))\, X_{i'l} \quad (14)$$

A. A new writing of the objective function

The objective function above can be expressed using the profiles $K_i$ of each object and the prototype $P_l$ of each cell of the map $C$ as follows:

$$R^T_{RTC}(\varphi, \mathcal{X}) = \sum_{i=1}^{N}\sum_{l=1}^{L} K_T(\delta(\varphi(i), l)) \underbrace{\sum_{i'=1}^{N} \Psi_{ii'} X_{i'l}}_{cont(i, l)} \quad (15)$$

Replacing the contribution $cont(i, l)$ by $cont(K_i, P_l)$ gives the following writing:

$$R^T_{RTC}(\varphi, P) = \sum_{i=1}^{N}\sum_{l=1}^{L} K_T(\delta(\varphi(i), l))\, [\langle K_i, P_l \rangle - \alpha S_{il}] \quad (16)$$
$$= \sum_{i=1}^{N} cont_T(K_i, P_{\varphi(i)}) \quad (17)$$

where

$$cont_T(K_i, P_{\varphi(i)}) = \sum_{l=1}^{L} K_T(\delta(\varphi(i), l))\, [\langle K_i, P_l \rangle - \alpha S_{il}] \quad (18)$$

is the regularized contribution of the object $i$ to its winner neuron $\varphi(i)$. We observe that the regularized contribution of the object $i$ to $\varphi(i)$ is a weighted sum of the contributions of $i$ to all prototypes $P_l\ (l = 1, \dots, L)$ in the influence neighborhood of $\varphi(i)$. We can rewrite this contribution in the following simplified form:

$$cont_T(K_i, P_{\varphi(i)}) = \langle K_i, P^T_{\varphi(i)} \rangle - \alpha \sum_{l=1}^{L} K_T(\delta(\varphi(i), l))\, S_{il} \quad (19)$$

where

$$P^T_{\varphi(i)} = \sum_{l=1}^{L} K_T(\delta(\varphi(i), l))\, P_l = \sum_{l=1}^{L} K_T(\delta(\varphi(i), l)) \sum_{i' \in C_l} K_{i'} \quad (20)$$

is the regularized prototype of the winner neuron $\varphi(i)$, which can be seen as a weighted sum of the prototypes $P_l\ (l = 1, \dots, L)$ in the influence neighborhood of $\varphi(i)$.
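A quick numeric check (our own sketch on the Table I data, with a 1-D map and $\delta(l, r) = |l - r|$) that the factored form (19)-(20) gives the same value as the direct sum (18):

```python
import math

K = [[1,0,0],[0,1,0],[1,0,0],[0,1,0],[0,0,1]]        # Table I data
clusters = [[0, 2], [1, 3], [4]]                     # partition over a 1-D map, L = 3
alpha, T = 0.5, 1.0
KT = lambda x: math.exp(-(x / T) ** 2)               # neighborhood kernel
dot = lambda u, v: sum(a * b for a, b in zip(u, v))
L = len(clusters)
P = [[sum(K[j][d] for j in c) for d in range(3)] for c in clusters]  # prototypes, eq. (7)

def S(i, l):
    c = clusters[l]
    return (len(c) * dot(K[i], K[i]) + sum(dot(K[j], K[j]) for j in c)) / 2

def cont_T_direct(i, phi_i):
    """Eq. (18): kernel-weighted sum of the plain contributions."""
    return sum(KT(abs(phi_i - l)) * (dot(K[i], P[l]) - alpha * S(i, l))
               for l in range(L))

def cont_T_factored(i, phi_i):
    """Eq. (19) using the regularized prototype of eq. (20)."""
    P_reg = [sum(KT(abs(phi_i - l)) * P[l][d] for l in range(L)) for d in range(3)]
    return dot(K[i], P_reg) - alpha * sum(KT(abs(phi_i - l)) * S(i, l) for l in range(L))

print(abs(cont_T_direct(0, 0) - cont_T_factored(0, 0)) < 1e-12)  # True
```

The two forms agree by linearity of the inner product; the factored one only needs the regularized prototype once per winner neuron.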

B. RTC heuristic

In this section, we give an algorithm suited to the RTC formalism. We consider here the batch SOM scheme: the assignment step maximizes the objective function by considering all prototypes $P$ fixed, and the representation step maximizes the same function by considering the set of clusters fixed (the assignment function $\varphi$ fixed). For a fixed temperature $T$, the maximization occurs in two alternating phases during successive iterations. We summarize this algorithm in the following points:


Step 1. Initialization: Initialize the map $C$ using the relational analysis approach.

Step 2. Assignment: The function $R^T_{RTC}(\varphi, P)$ is expressed as a sum of independent terms (regularized contributions), so we can replace both optimization problems by a set of simple equivalent problems. Indeed, $R^T_{RTC}(\varphi, P)$ can be decomposed in terms of the individual contributions of each $i \in I$ to each cell of the map $C$. It is assumed at this stage that all prototypes are fixed and remain constant while maximizing the function $R^T_{RTC}(\varphi, P)$ with respect to $\varphi$. It is easy to see that this maximum is reached for the assignment function defined by:

$$\forall i;\ \varphi(i) = \arg\max_l\ cont_T(K_i, P_l) \quad (21)$$

Step 3. Maximization: The maximization step consists in maximizing the objective function over $P$ while keeping the assignment $\varphi$ fixed. In other words, the maximization step consists in updating each regularized prototype $P^T_l(t)$ of neuron $C_l$ at each iteration $t$ according to the following rule:

$$\forall l;\ P^T_l(t) = \sum_{r=1}^{L} K_T(\delta(r, l))(t) \sum_{i' \in C_r(t)} K_{i'} \quad (22)$$

The proposed Batch RTC algorithm is presented in Algorithm 3.

C. Computational cost of the Batch RTC algorithm

Analyzing the computational cost of the Batch RTC algorithm, we find that for the initialization step the cost is the same as for the RA heuristic, i.e. $O(L_{max} \times N_{iter} \times N)$. For the remaining iterations, the Batch RTC algorithm keeps the same cost computation principle as the RA while introducing, at each iteration, the computation of the proximity between the neurons of the map $C$ (i.e. computing the $L^2$ values of the neighborhood). Thus the cost of the Batch RTC algorithm remains linear: $O(N_{iter} \times L_{max} \times (N + L_{max}))$.

V. EXPERIMENTATIONS AND VALIDATION

There are many ways to measure the accuracy of a clustering algorithm. One way of measuring the quality of a clustering solution is the cluster purity. Let there be $L$ clusters of the dataset $I$ and let the size of cluster $C_l$ be $|C_l|$. The purity of this cluster is given by $purity(C_l) = \frac{1}{|C_l|} \max_k (|C_l|_{class=k})$, where $|C_l|_{class=k}$ denotes the number of items of class $k$ assigned to cluster $l$. The overall purity of a clustering solution can be expressed as a weighted sum of the individual cluster purities:

$$purity = \sum_{l=1}^{L} \frac{|C_l|}{|I|}\, purity(C_l) \quad (23)$$
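The purity above can be computed with a small helper (our own sketch; `clusters` holds object indices and `labels` the true classes):

```python
from collections import Counter

def purity(clusters, labels):
    """Overall purity, eq. (23): the weighted sum reduces to total majority count / n."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(labels[i] for i in c).most_common(1)[0][1]
               for c in clusters) / n

# Toy example: two clusters over 5 labelled objects.
print(purity([[0, 1, 2], [3, 4]], ["a", "a", "b", "b", "b"]))  # 0.8
```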

Algorithm 3: Batch RTC algorithm with a fixed T

Inputs: C0 = initial map with Lmax neurons; Niter = number of iterations; N = number of observations; α = similarity threshold; K^T = neighborhood matrix.

Initialization: initialize the map C using the RA heuristic
- Run the RA heuristic on the K matrix
- Randomly place the resulting clusters on the map C0
- Compute the initial prototypes:

    ∀l;  P^T_l(0) ← Σ_{r=1}^{Lmax} K^T(δ(r,l))(0) Σ_{i'∈C_r(0)} K_{i'}

for t = 1 to Niter do
    for i = 1 to N do   {Assignment}
        assign observation i to its closest neuron in the sense of the contribution:

            ϕ(i)(t) = arg max_{l=1,…,Lmax} cont(K_i, P_l(t−1))

    end for
    for l = 1 to Lmax do   {Maximization}
        update the prototypes according to

            P^T_l(t) = Σ_{r=1}^{Lmax} K^T(δ(r,l))(t) Σ_{i'∈C_r(t)} K_{i'}

    end for
end for

Outputs: a map of Lmax cells.
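The alternation of the two steps above can be sketched in a few lines of Python. Everything problem-specific here is a placeholder: `contribution` stands for the Condorcet contribution criterion, `KT` for the neighborhood matrix K^T(δ(r,l)), and the random initial assignment replaces the RA heuristic, so this is an illustrative skeleton, not the authors' implementation.

```python
import numpy as np

def batch_rtc(K, KT, n_iter, n_cells, contribution):
    """Skeleton of the Batch RTC loop: alternate the maximization step
    (prototype update, Eq. 22) and the assignment step (Eq. 21).
    K is the N x N relational matrix; KT[r, l] is the neighborhood
    weight between cells r and l."""
    N = K.shape[0]
    phi = np.random.randint(n_cells, size=N)  # stand-in for RA init
    P = np.zeros((n_cells, K.shape[1]))
    for _ in range(n_iter):
        # Maximization: regularized prototype of each cell (Eq. 22),
        # a neighborhood-weighted sum of the rows of each cluster
        P = np.zeros((n_cells, K.shape[1]))
        for l in range(n_cells):
            for r in range(n_cells):
                P[l] += KT[r, l] * K[phi == r].sum(axis=0)
        # Assignment: each observation joins the cell whose prototype
        # maximizes its contribution (Eq. 21)
        phi = np.array([max(range(n_cells),
                            key=lambda l: contribution(K[i], P[l]))
                        for i in range(N)])
    return phi, P
```

With an identity neighborhood matrix (`KT = np.eye(n_cells)`), the loop degenerates to a plain partitioning scheme; a decaying neighborhood kernel over the grid distances recovers the topological behaviour.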

In general, larger purity values indicate a better clustering solution.

A. The datasets for validation

In this section, we evaluate the performance of the RTC heuristic on several datasets available from the UC Irvine Machine Learning Repository [1]. The datasets used are:

1) Zoo dataset. This dataset contains 101 animals described by 16 qualitative variables: 15 of the variables are binary and one is numeric with 6 possible values. Each animal is labelled from 1 to 7 according to its class.

2) Nursery Database. The Nursery Database was derived from a hierarchical decision model originally developed to rank applications for nursery schools. It contains examples with the structural information removed, i.e., it directly relates NURSERY to the eight input attributes: parents, hasnurs, form, children, housing, finance, social, health. This dataset has 12960 observations and 8 variables labelled in 6 classes.

3) Car Evaluation Database. The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990). The model evaluates cars according to their concept structure. The dataset represents a 4-class problem containing 1728 observations and 6 variables.


4) Postoperative Patient Data. The classification task of this database is to determine where patients in a postoperative recovery area should be sent next. Because hypothermia is a significant concern after surgery (Woolery, L. et al. 1991), the attributes correspond roughly to body temperature measurements. The dataset has 90 observations and 8 variables classified in 3 classes.

5) SPECTF heart dataset. The dataset describes the diagnosis of cardiac Single Proton Emission Computed Tomography (SPECT) images. Each patient is classified into one of two categories: normal and abnormal. The database of 267 SPECT image sets (patients) was processed to extract features that summarize the original SPECT images. As a result, a pattern of 44 continuous features was created for each patient. The patterns were further processed to obtain 22 binary feature patterns. SPECT is a good dataset for testing ML algorithms; it has 267 instances described by 23 binary attributes.

B. Results on Zoo dataset:

We use the zoo dataset to illustrate the performance of the RTC algorithm. Using disjunctive coding for the qualitative variable with 6 possible values, the dataset becomes a 101 × 21 binary data matrix. All 101 animals are used for learning, on a map of 5 × 5 cells. The learning algorithm provides a profile prototype for each cell. At the end of the learning phase, each observation, corresponding to an animal, is assigned to the cell with the highest contribution, taking the neighborhood relation into account.
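The disjunctive coding step can be illustrated as follows. The toy rows are made up, and the assumption that the 6-valued qualitative variable is the zoo `legs` attribute (observed values 0, 2, 4, 5, 6, 8) is ours:

```python
import numpy as np

# Disjunctive (one-hot) coding of the single 6-valued qualitative
# variable; the 15 binary variables are kept as-is, so each animal
# becomes a 15 + 6 = 21-dimensional binary vector.
modalities = np.array([0, 2, 4, 5, 6, 8])        # assumed 6 values
legs = np.array([4, 2, 0])                       # 3 toy animals
binary_part = np.array([[1, 0] * 7 + [1]] * 3)   # 15 fake binary columns
one_hot = (legs[:, None] == modalities).astype(int)
X = np.hstack([binary_part, one_hot])
print(X.shape)  # (3, 21)
```

Stacking 101 such rows yields the 101 × 21 binary matrix used for learning.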

Fig. 1. Initialization map using Relational Analysis algorithm

The RTC algorithm starts with the initialization of the grid by distributing the observations using the Relational Analysis approach. Figure 1 shows the classes of animals distributed after the initialization step of the RTC algorithm. We use the animal names from the original dataset. To visualize the coherence of the map with the labelling of the animals, this figure shows the class number corresponding to each cell after applying the majority rule in each cell. We recall that during this learning step, the neighborhood information is not considered (the neighborhood function K is not computed). On the initialization grid (figure 1) the observations are not well distributed: there are two sets of observations labelled 7 which are separated by 2 empty cells; we can also find four sets of animals labelled 1 which are dispersed over the map: two sets in the top left corner, one in the bottom left corner, and the last one in the bottom right part of the map. This map demonstrates that the classical RA does not use topological information during the clustering process, information which could allow a better distribution of the observations.

After the initialization step, the RTC algorithm continues the learning process by taking into account the neighborhood relation between all the cells. Figure 2 shows the animal names collected by each cell. The map shows that animals of the same class are assigned to cells close to each other.

Fig. 2. Relational Topological Clustering : zoo database

We can observe that the animals of class 1 are clustered in the cells situated at the bottom left of the map (figure 2); the birds, which correspond to class 2, are in the bottom right part of the map. We can also see that the fruitbat, from class 1, is situated nearest to the cell containing the birds (class 2); this is explained by the fact that the fruitbat shares characteristics with the birds even though it comes from another family.

In the middle of the map there is a cell containing 2 observations from two different classes: the frog (class 5) and the penguin (class 2). The RTC algorithm puts these two observations in the same cell because the frog and the penguin have very close profiles, even though the penguin belongs to the bird family and the frog to the amphibian family.


Moreover, to the left of this cell there is a cell containing the animals of class 5, and to the right, a cell labelled as class 2. We have the same situation for the cell labelled as 3.5, where the toad and the tortoise have highly correlated features, and the two cells labelled 5 and 1 border this cell on the right. The same type of analysis can be applied to the remaining clusters. To give a global view of the homogeneity of the clustering, we compute the clustering purity for the zoo map and obtain a purity value of 97.84%.

In order to show the good performance of the Relational Topological Clustering (RTC) approach, we use several binary datasets of different sizes. For each dataset we learned a map of a different size (from 4x4 to 10x10), and we report in Table II the purity of the clustering after the first iteration using the classical RA and the map purity at the end of map learning with the RTC technique. The results illustrate that the proposed technique increases the purity index compared to the classical RA and yields a topological map by computing the neighborhood function between the cells.

TABLE II
EXPERIMENTATION RESULTS ON DIFFERENT DATASETS USING THE RTC APPROACH.

DB            | DB size  | Map size | RA purity | RTC purity
Zoo           | 101x17   | 5x5      | 69.08%    | 97.84%
Car           | 1728x6   | 10x10    | 70.31%    | 80.17%
Nursery       | 12960x8  | 6x6      | 50.47%    | 78.69%
SPECTF        | 267x22   | 4x4      | 57.14%    | 81.82%
Pos-Operative | 90x8     | 5x5      | 71.59%    | 78.21%

VI. CONCLUSION

We have proposed in this paper a new Relational Topological model for multidimensional categorical data clustering and visualization, inspired by the SOM principle and the Relational Analysis formalism. It combines the advantages of both methods: it allows a natural identification of clusters without fixing the number of clusters a priori, and simultaneously provides a visualization of the clusters on a low-dimensional lattice while preserving the topological structure of the data as faithfully as possible. However, this model treats all the variables describing the data in the same way; it ignores their internal structure, the number of modalities per variable and the frequency of each modality. To address this problem, we will propose a weighted model that takes this variable information into account.

REFERENCES

[1] A. Asuncion and D. J. Newman, "UCI Machine Learning Repository" [http://www.ics.uci.edu/ mlearn/MLRepository.html], Irvine, CA: University of California, School of Information and Computer Science, 2007.

[2] A. El Golli, B. Conan-Guez, and F. Rossi, "A self organizing map for dissimilarity data," in D. Banks, L. House, F. R. McMorris, P. Arabie, and W. Gaul (Eds.), Classification, Clustering, and Data Mining Applications (Proceedings of IFCS 2004), Chicago, Illinois (USA), pp. 61-68, IFCS: Springer, 2004.

[3] B. Hammer, A. Hasenfuss, F. Rossi, and M. Strickert, "Topographic Processing of Relational Data," in Proceedings of the 6th Workshop on Self-Organizing Maps (WSOM 07), Bielefeld, Germany, September 2007.

[4] H. Benhadda, Similarité régularisée et ses applications en classification automatique, PhD thesis, Paris 6 University, 1998.

[5] M. Cottrell and P. Letremy, "Analyzing surveys using the Kohonen algorithm," in Proc. ESANN 2003, Bruges, M. Verleysen (Ed.), Editions D Facto, Bruxelles, pp. 85-92, 2003.

[6] E. W. Forgy, "Cluster analysis of multivariate data: efficiency versus interpretability of classification," Biometrics, vol. 21, pp. 768-780, 1965.

[7] N. A. Idrissi, Contribution à l'unification de critères d'association, PhD thesis, Paris 6 University, 2000.

[8] T. Kohonen and P. Somervuo, "How to make large self-organizing maps for nonvectorial data," Neural Networks, vol. 15(8), pp. 945-952.

[9] T. Kohonen, Self-Organizing Maps, Springer Series in Information Sciences, vol. 30, Springer.

[10] L. Labiod, Contribution au formalisme relationnel des classifications simultanées de deux ensembles, PhD thesis, Paris 6 University, 2008.

[11] M. Lebbah, Cartes topologiques pour données qualitatives : application à la reconnaissance automatique du trafic routier, PhD thesis, Versailles Saint Quentin en Yvelines University, 2003.

[12] M. Lebbah, F. Badran, and S. Thiria, "Topological map for binary data," in ESANN 2000.

[13] M. Lebbah, N. Rogovschi, and Y. Bennani, "BeSOM: Bernoulli on Self-Organizing Map," in International Joint Conference on Neural Networks, IJCNN, August 2007.

[14] M. Lebbah, Y. Bennani, and N. Rogovschi, "A Probabilistic Self-Organizing Map for Binary Data Topographic Clustering," International Journal of Computational Intelligence and Applications, World Scientific Publishing Company, vol. 7, no. 4, pp. 363-383, 2008.

[15] F. Leisch, A. Weingessel, and E. Dimitriadou, "Competitive Learning for Binary Data," in Proc. of ICANN'98, September 2-4, Springer Verlag, 1998.

[16] P. Letremy, Traitement de données qualitatives par des algorithmes fondés sur l'algorithme de Kohonen, SAMOS-MATISSE UMR 8595, Université de Paris 1, 2005.

[17] J. F. Marcotorchino, "Relational analysis theory as a general approach to data analysis and data fusion," in Cognitive Systems with Interactive Sensors, 2006.

[18] J. F. Marcotorchino and P. Michaud, Optimisation en analyse ordinale des données, Masson, 1978.

[19] J. F. Marcotorchino, L'analyse factorielle relationnelle : parties I et II, Etude du CEMAP, IBM France, vol. MAP-03, December 1991.

[20] J. F. Marcotorchino, Dualité Burt-Condorcet : relation entre analyse factorielle des correspondances et analyse relationnelle, Etude du CEMAP, IBM France, in L'analyse des correspondances et les techniques connexes, Springer, 2000.


