K-means Clustering is a unsupervised machine learning technique that solve the well known clustering problem, i.e. portioning ‘n’ number of objects into ‘k’ clusters in which each object belongs to the cluster with the nearest mean. Let’s discuss this in detail.
What is Unsupervised Machine Learning?
This is a term used in Machine learning which people come across quite often. Instead of diving into the definition, let’s first have the intuition of what it actually is? And then we will talk about K-means Clustering.
Let’s say we have a box of toys. Now, if you start looking into it, you realize that there are 3 toys in the shape of star, 4 toys in shape of square and 2 in the shape of circles. One more scenario in the same box , u might see 3 green toys , 6 red toys. Just to add one more scenario, you also observed there 3 marvel superheroes , 2 DC superheroes and 4 Disney toys. The point here is that you are starting to group the stuff that you have seen based on aspects without anyone telling you what it actually is? ( based on what you know ).
So unsupervised learning has the same concept behind it. Here given some data , it tries to learn a pattern in the data all by itself and tries to group the data points based on the pattern.( We don’t provide any labels for this task ).
Now if we want the machine to learn from this data , just like us it will start grouping them by shape or color or production house or a combination of them. That depends on what type of algorithm you are using.
Let’s try to understand K – means Clustering by an example.
Problem statement : The burger franchise wants to know where to accommodate their new franchise , so that it remains quite distant from previous and in a well connected zone. So, for this , you have the data of x , y coordinates of all the burger franchises.
For this problem , now you have many x, y coordinates for each location. So, now our aim is to put a new franchise at a location which is distant from the already available franchises.
For this we can do a brute force algorithm , where we calculate the distance with respect to each x, y coordinates. Now, this will lead to many calculations and will consume so much of time . If the space is huge , it’s not possible.
So , what is the best thing to do in this case.
Hmm…why don’t we group all this franchises in to finite clusters and then calculate the distance of our x1,y1 coordinate with respect to to these centers of these clusters. It basically implies that we have reduced our huge locations into multiple small locations.
This grouping of all the locations in to clusters is done by the k – means clustering.
For a given x,y points in space , we form clusters with center for these data points based on the distance between these data points.
Working of K-means Clustering
- Choose the number of clusters K.
- Randomly assign n centers to the given sample set.
- From each centre calculate the Euclidean distance with the closest available k points and consider close ones as a separate cluster.
- The next step is to compute the centroids of newly formed clusters.
- Do this multiple times. You will be left with centers which have converged with respect to the cluster it formed. As shown in the figure below.
Going back to our problem, with the help of k – means ( with 3 centers) let’s say we form 3 clusters and then with help of center of these 3 clusters we calculate the best x1, y1 to place the franchise depending on other factors like connectivity of the area and movement in the area.
When K – means Clustering Fails:
K- means clustering fails for categorical data. It does not handle noisy data and outliers. Also, fails to perform on non-linear data set.
- Introduced the term unsupervised learning.
- Picked an algorithm used to do unsupervised learning i.e, k-means.
- Explained k-means clustering with the help of a problem.
Written by :