K Means Under The Hood


In this post, I attempt to explain the principles of a very popular clustering algorithm, K-Means. I expect my readers to know the basics of unsupervised learning and a bit of linear algebra. Let's start with K-Means under the hood.

To get things started, let me introduce K-Means. As you may know, it belongs to the class of unsupervised learning algorithms and has a wide range of uses.

I will explain the algorithm with a simple dataset. For algorithms of this type we need a training dataset, with which we will train our K-Means model, and a test dataset, with which we will evaluate the trained model's performance. Let's make our own training dataset:

| Object | X | Y | Z |
|--------|---|---|---|
| OB 1   | 1 | 4 | 1 |
| OB 2   | 1 | 2 | 2 |
| OB 3   | 1 | 4 | 2 |
| OB 4   | 2 | 1 | 2 |
| OB 5   | 1 | 1 | 1 |
| OB 6   | 2 | 4 | 2 |
| OB 7   | 1 | 1 | 2 |
| OB 8   | 2 | 1 | 1 |
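Assuming we work in Python with NumPy (the variable names here are mine, not from the original post), the training dataset above can be written as a small array:

```python
import numpy as np

# Eight objects with their X, Y, Z coordinates (rows are OB 1 .. OB 8).
train = np.array([
    [1, 4, 1],  # OB 1
    [1, 2, 2],  # OB 2
    [1, 4, 2],  # OB 3
    [2, 1, 2],  # OB 4
    [1, 1, 1],  # OB 5
    [2, 4, 2],  # OB 6
    [1, 1, 2],  # OB 7
    [2, 1, 1],  # OB 8
])
print(train.shape)  # (8, 3): 8 objects, 3 coordinates each
```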

The above dataset contains 8 objects with their X, Y, and Z coordinates. Our task is to cluster these objects into two clusters; here we are defining the value of K (of K-Means) to be 2.

The algorithm starts by taking any two centroids into account (since we took K = 2, the number of centroids is also 2). After the centroids (say C1 and C2) are chosen, each data point is assigned to one of the clusters, depending on its distance to the centroids (let's equate centroids with clusters for the time being).

Assume that the algorithm chose OB 2 (1, 2, 2) and OB 6 (2, 4, 2) as the centroids, defining Cluster 1 and Cluster 2 respectively. For measuring the distances, the following distance function (also termed a similarity measure) is chosen:


d = |x2 − x1| + |y2 − y1| + |z2 − z1|

This is also known as the taxicab or Manhattan distance.
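As a quick sketch, the Manhattan distance can be computed like this (function name is mine):

```python
import numpy as np

def manhattan(p, q):
    """Taxicab / Manhattan distance: the sum of absolute coordinate differences."""
    return float(np.sum(np.abs(np.asarray(p) - np.asarray(q))))

# OB 1 = (1, 4, 1) against centroid C1 = OB 2 = (1, 2, 2):
print(manhattan((1, 4, 1), (1, 2, 2)))  # 3.0 = |1-1| + |4-2| + |1-2|
```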


The following table shows the distances between each data point and the chosen centroids:

| Object | Coordinates | Distance from C1 | Distance from C2 |
|--------|-------------|------------------|------------------|
| OB 1   | (1, 4, 1)   | 3                | 2                |
| OB 2   | (1, 2, 2)   | 0                | 3                |
| OB 3   | (1, 4, 2)   | 2                | 1                |
| OB 4   | (2, 1, 2)   | 2                | 3                |
| OB 5   | (1, 1, 1)   | 2                | 5                |
| OB 6   | (2, 4, 2)   | 3                | 0                |
| OB 7   | (1, 1, 2)   | 1                | 4                |
| OB 8   | (2, 1, 1)   | 3                | 4                |

Now the data points are assigned as follows: OB 1 has a shorter distance to C2 than to C1, so it is assigned to C2, and the algorithm proceeds the same way for every point. After all the data point assignments are done, the result looks like this:

Cluster 1 (C1): OB 2, OB 4, OB 5, OB 7, OB 8

Cluster 2 (C2): OB 1, OB 3, OB 6
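This first assignment pass can be reproduced with a short snippet (a sketch assuming NumPy; labels 0 and 1 stand for C1 and C2):

```python
import numpy as np

points = np.array([[1, 4, 1], [1, 2, 2], [1, 4, 2], [2, 1, 2],
                   [1, 1, 1], [2, 4, 2], [1, 1, 2], [2, 1, 1]])
centroids = np.array([[1, 2, 2],   # C1 = OB 2
                      [2, 4, 2]])  # C2 = OB 6

# Manhattan distance from every point to every centroid, then pick the nearest.
dists = np.abs(points[:, None, :] - centroids[None, :, :]).sum(axis=2)
labels = dists.argmin(axis=1)  # 0 -> C1, 1 -> C2
print(labels)  # [1 0 1 0 0 1 0 0]: C2 gets OB 1, 3, 6; C1 gets the rest
```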

Now the algorithm will keep updating the cluster centers until they no longer change. The update works as follows: each new center is the mean of the points assigned to its cluster. For Cluster 1 it is ((1+2+1+1+2)/5, (2+1+1+1+1)/5, (2+2+1+2+1)/5) = (1.4, 1.2, 1.6), and for Cluster 2 it is ((1+1+2)/3, (4+4+4)/3, (1+2+2)/3) ≈ (1.33, 4.0, 1.67).
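The same center update can be sketched in one line of NumPy (the assignment labels come from the first pass above; names are mine):

```python
import numpy as np

points = np.array([[1, 4, 1], [1, 2, 2], [1, 4, 2], [2, 1, 2],
                   [1, 1, 1], [2, 4, 2], [1, 1, 2], [2, 1, 1]])
labels = np.array([1, 0, 1, 0, 0, 1, 0, 0])  # assignments after the first pass

# Each new center is the mean of the points currently assigned to it.
new_centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])
print(new_centers)  # rows: C1 = (1.4, 1.2, 1.6), C2 ≈ (1.33, 4.0, 1.67)
```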

After this, the algorithm again computes the distances between the data points and the newly derived cluster centers (centers and centroids mean the same thing here). The new distances are as follows:

| Object | Coordinates | Distance from new C1 | Distance from new C2 |
|--------|-------------|----------------------|----------------------|
| OB 1   | (1, 4, 1)   | 3.8                  | 1.0                  |
| OB 2   | (1, 2, 2)   | 1.6                  | 2.67                 |
| OB 3   | (1, 4, 2)   | 3.6                  | 0.67                 |
| OB 4   | (2, 1, 2)   | 1.2                  | 4.0                  |
| OB 5   | (1, 1, 1)   | 1.2                  | 4.0                  |
| OB 6   | (2, 4, 2)   | 3.8                  | 1.0                  |
| OB 7   | (1, 1, 2)   | 1.0                  | 3.67                 |
| OB 8   | (2, 1, 1)   | 1.4                  | 4.33                 |

And the new assignments of the data points to the clusters will be:

New Cluster C1: OB 2, OB 4, OB 5, OB 7, OB 8

New Cluster C2: OB 1, OB 3, OB 6

Since the assignments did not change, the updated cluster centers (following the same averaging rule as before) stay at C1 = (1.4, 1.2, 1.6) and C2 ≈ (1.33, 4.0, 1.67).


The same steps are repeated until none of the cluster assignments change, i.e. the data points belonging to Cluster 1 remain in C1 and the data points belonging to Cluster 2 remain in C2. For this dataset the assignments stop changing after the second pass, and the final cluster centers are:

C1 = (1.4, 1.2, 1.6), C2 ≈ (1.33, 4.0, 1.67)

And the final data point assignments are:

C1 -> OB 2, OB 4, OB 5, OB 7,
OB 8

C2 -> OB 1, OB 3, OB 6
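The whole assign-then-update loop above can be sketched as a short function (a minimal sketch using the Manhattan distance; the function and variable names are my own, and the initial centers are fixed to OB 2 and OB 6 to match the walkthrough rather than chosen at random):

```python
import numpy as np

def kmeans_manhattan(points, init_centers, max_iter=100):
    """Plain K-Means with the Manhattan (taxicab) distance.

    Repeats assign-then-update until the assignments stop changing.
    """
    centers = np.asarray(init_centers, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points.
        centers = np.array([points[labels == k].mean(axis=0)
                            for k in range(len(centers))])
    return centers, labels

points = np.array([[1, 4, 1], [1, 2, 2], [1, 4, 2], [2, 1, 2],
                   [1, 1, 1], [2, 4, 2], [1, 1, 2], [2, 1, 1]])
centers, labels = kmeans_manhattan(points, init_centers=[[1, 2, 2], [2, 4, 2]])
print(labels)  # [1 0 1 0 0 1 0 0] -> C1: OB 2, 4, 5, 7, 8; C2: OB 1, 3, 6
```

Note that textbook K-Means uses the squared Euclidean distance; with the Manhattan distance, the coordinate-wise median (K-Medians) is the more natural update, but the mean is used here to stay faithful to the post's walkthrough.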

Now that the training is done (the process up to this point is referred to as training), we need a test set, which we will again generate on our own:

| Object | X | Y | Z |
|--------|---|---|---|
| OB 1   | 2 | 4 | 1 |
| OB 2   | 2 | 2 | 2 |
| OB 3   | 1 | 2 | 1 |
| OB 4   | 2 | 2 | 1 |

For each test object we simply calculate the distances to the final cluster centers and assign the object to the nearer cluster. The final result looks like this:

Cluster C1: OB 2, OB 3, OB 4

Cluster C2: OB 1

Summary:

This brings us to the end of K Means Under The Hood. Note that there is a healthy body of research on how to choose the value of K in K-Means. That is all for this post. Thanks for your time; I hope you have enjoyed it.

 
