# K Means Under The Hood


In this post I explain the principles of a very popular clustering algorithm: K-Means. I expect readers to know the basics of unsupervised learning and a bit of linear algebra.

To get things started, let me introduce K-Means. As you may know, it belongs to the class of unsupervised learning algorithms and has many uses. I will explain the algorithm with a simple dataset. For this type of algorithm we need a training dataset, with which we train the K-Means model, and a test dataset, with which we evaluate the trained model's performance. Let's make our own training dataset:

| Object | X | Y | Z |
|--------|---|---|---|
| OB 1   | 1 | 4 | 1 |
| OB 2   | 1 | 2 | 2 |
| OB 3   | 1 | 4 | 2 |
| OB 4   | 2 | 1 | 2 |
| OB 5   | 1 | 1 | 1 |
| OB 6   | 2 | 4 | 2 |
| OB 7   | 1 | 1 | 2 |
| OB 8   | 2 | 1 | 1 |
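If you want to follow along in code, the training dataset can be written down with NumPy (a sketch; the variable name `X_train` is my own choice, not from the post):

```python
import numpy as np

# The 8 training objects (OB 1 .. OB 8) with their X, Y and Z coordinates.
X_train = np.array([
    [1, 4, 1],  # OB 1
    [1, 2, 2],  # OB 2
    [1, 4, 2],  # OB 3
    [2, 1, 2],  # OB 4
    [1, 1, 1],  # OB 5
    [2, 4, 2],  # OB 6
    [1, 1, 2],  # OB 7
    [2, 1, 1],  # OB 8
])
```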

The above dataset contains 8 objects with their X, Y and Z coordinates. Our task is to cluster these objects into two clusters; here we are defining the value of K (of K-Means) to be 2.

The algorithm works by initially taking any two centroids into account (we set K = 2, hence the number of centroids is also 2). After choosing the centroids (say C1 and C2), each data point is assigned to one of the clusters (let's equate centroids with clusters for the time being) depending on its distance from the centroids.

Assume that the algorithm chose OB 2 (1, 2, 2) and OB 6 (2, 4, 2) as the centroids of Cluster 1 and Cluster 2 respectively. For measuring the distances, the following distance function (also termed a similarity function) is chosen:


d = |x₂ − x₁| + |y₂ − y₁| + |z₂ − z₁|

This is also known as the taxicab distance or Manhattan distance.
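A minimal sketch of this distance function (the name `manhattan` is mine, not from the post):

```python
import numpy as np

def manhattan(a, b):
    """Taxicab (L1) distance: sum of absolute coordinate differences."""
    return np.abs(np.asarray(a) - np.asarray(b)).sum()

# Distance of OB 1 (1, 4, 1) from the two initial centroids:
d1 = manhattan([1, 4, 1], [1, 2, 2])  # from C1 = OB 2 -> 3
d2 = manhattan([1, 4, 1], [2, 4, 2])  # from C2 = OB 6 -> 2
```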

The following table shows the distance of each data point from the chosen centroids:

| Object | X | Y | Z | Distance from C1 | Distance from C2 |
|--------|---|---|---|------------------|------------------|
| OB 1   | 1 | 4 | 1 | 3                | 2                |
| OB 2   | 1 | 2 | 2 | 0                | 3                |
| OB 3   | 1 | 4 | 2 | 2                | 1                |
| OB 4   | 2 | 1 | 2 | 2                | 3                |
| OB 5   | 1 | 1 | 1 | 2                | 5                |
| OB 6   | 2 | 4 | 2 | 3                | 0                |
| OB 7   | 1 | 1 | 2 | 1                | 4                |
| OB 8   | 2 | 1 | 1 | 3                | 4                |

Now the data points are assigned to clusters as follows: OB 1 has a shorter distance to C2 (2) than to C1 (3), so it is assigned to Cluster 2. The algorithm proceeds in the same way for the rest, and after all the data point assignments are done, the clusters look like this:

Cluster 1 (C1): OB 2, OB 4, OB 5, OB 7, OB 8

Cluster 2 (C2): OB 1, OB 3, OB 6

Now the algorithm updates the cluster centers, and it keeps updating them until they stop changing. The update takes each cluster's new center to be the mean of its members' coordinates along each axis. For Cluster 1 the new center is ((1+2+1+1+2)/5, (2+1+1+1+1)/5, (2+2+1+2+1)/5) = (1.4, 1.2, 1.6), and for Cluster 2 it is ((1+1+2)/3, (4+4+4)/3, (1+2+2)/3) = (1.33, 4, 1.66).
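The mean update above can be verified in a couple of lines (the array names are my own):

```python
import numpy as np

# Members of each cluster after the first assignment step.
cluster1 = np.array([[1, 2, 2], [2, 1, 2], [1, 1, 1], [1, 1, 2], [2, 1, 1]])  # OB 2, 4, 5, 7, 8
cluster2 = np.array([[1, 4, 1], [1, 4, 2], [2, 4, 2]])                         # OB 1, 3, 6

c1 = cluster1.mean(axis=0)  # new center of Cluster 1: (1.4, 1.2, 1.6)
c2 = cluster2.mean(axis=0)  # new center of Cluster 2: (1.33, 4.0, 1.66)
```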

After this, the algorithm again computes the distances between the data points and the newly derived cluster centers (centers and centroids are the same thing here). The new distances are:

| Object | X | Y | Z | Distance from new C1 | Distance from new C2 |
|--------|---|---|---|----------------------|----------------------|
| OB 1   | 1 | 4 | 1 | 3.8                  | 1                    |
| OB 2   | 1 | 2 | 2 | 1.6                  | 2.66                 |
| OB 3   | 1 | 4 | 2 | 3.6                  | 0.66                 |
| OB 4   | 2 | 1 | 2 | 1.2                  | 4                    |
| OB 5   | 1 | 1 | 1 | 1.2                  | 4                    |
| OB 6   | 2 | 4 | 2 | 3.8                  | 1                    |
| OB 7   | 1 | 1 | 2 | 1                    | 3.66                 |
| OB 8   | 2 | 1 | 1 | 1.4                  | 4.33                 |

And the new assignment of the data points to the clusters is:

New Cluster 1: OB 2, OB 4, OB 5, OB 7, OB 8

New Cluster 2: OB 1, OB 3, OB 6

And the updated cluster centers, computed by the same mean rule as before, are C1 = (1.4, 1.2, 1.6) and C2 = (1.33, 4, 1.66).

The same steps are repeated until none of the cluster assignments change, i.e. the data points belonging to Cluster 1 stay in C1 and the data points belonging to Cluster 2 stay in C2. For this dataset the assignments stop changing after 3 iterations, and the final cluster centers are:

C1 = (1.4, 1.2, 1.6), C2 = (1.33, 4, 1.66)

And the data point assignments are:

C1 -> OB 2, OB 4, OB 5, OB 7, OB 8

C2 -> OB 1, OB 3, OB 6
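The whole procedure walked through above can be sketched as a short loop. This is a minimal illustration under the post's conventions (mean-based center updates with Manhattan-distance assignments), not a production implementation; the function name `kmeans_l1` is mine:

```python
import numpy as np

def kmeans_l1(X, centroids, max_iter=100):
    """Iterate assignment and mean-update steps until the centers stop moving."""
    for _ in range(max_iter):
        # Manhattan distance of every point to every centroid: shape (n_points, k).
        dists = np.abs(X[:, None, :] - centroids[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)          # assign each point to nearest centroid
        new_centroids = np.array([X[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])
        if np.allclose(new_centroids, centroids):
            break                              # centers (and assignments) converged
        centroids = new_centroids
    return centroids, labels

X_train = np.array([[1, 4, 1], [1, 2, 2], [1, 4, 2], [2, 1, 2],
                    [1, 1, 1], [2, 4, 2], [1, 1, 2], [2, 1, 1]])
init = X_train[[1, 5]].astype(float)  # OB 2 and OB 6 as initial centroids
centers, labels = kmeans_l1(X_train, init)
# centers[0] -> (1.4, 1.2, 1.6); centers[1] -> (1.33, 4.0, 1.66)
# labels -> OB 2, 4, 5, 7, 8 in cluster 0 and OB 1, 3, 6 in cluster 1
```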

Now that the training is done (the process up to this point is referred to as training), we need a test set, which we will again generate on our own:

| Object | X | Y | Z |
|--------|---|---|---|
| OB 1   | 2 | 4 | 1 |
| OB 2   | 2 | 2 | 2 |
| OB 3   | 1 | 2 | 1 |
| OB 4   | 2 | 2 | 1 |

For the test set we simply calculate each point's distance to the final cluster centers and assign it to the nearer cluster. The final result is:

Cluster C1: OB 2, OB 3, OB 4

Cluster C2: OB 1
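Scoring the test set is just the assignment step with the final centers held fixed (a sketch; the variable names are mine):

```python
import numpy as np

centers = np.array([[1.4, 1.2, 1.6], [4/3, 4.0, 5/3]])  # final C1 and C2
X_test = np.array([[2, 4, 1],   # OB 1
                   [2, 2, 2],   # OB 2
                   [1, 2, 1],   # OB 3
                   [2, 2, 1]])  # OB 4

# Manhattan distance of each test point to each center, then pick the nearest.
dists = np.abs(X_test[:, None, :] - centers[None, :, :]).sum(axis=2)
labels = dists.argmin(axis=1)  # 0 = C1, 1 = C2
# labels -> [1, 0, 0, 0]: OB 1 joins C2, the rest join C1
```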

**Summary:**

Here we reach the end of **K Means Under The Hood**. Note that there is a healthy body of research on how to choose the value of K in K-Means. That is all for this post. Thanks for your time; I hope you enjoyed it.