Task 10: K-Means Clustering and Its Real Use Case in the Security Domain

Abhinav Shreyash
5 min read · Jul 19, 2021

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.

It is one of the most frequently used exploratory data analysis tools, used to build intuition about the data.

It is an unsupervised machine learning technique, since it does not require labels in the data.

Types of Clustering

Clustering is a type of unsupervised learning wherein data points are grouped into different sets based on their degree of similarity.

Two broad types of clustering are covered here:

  • Hierarchical clustering
  • Partitioning clustering

Hierarchical clustering is of two types:

  • Agglomerative clustering
  • Divisive clustering

Two common forms of partitioning clustering are:

  • K-means clustering
  • Fuzzy C-Means clustering

K-Means Clustering Algorithm

K-means clustering is one of many unsupervised learning algorithms, where the input data is unlabeled. This means there is no 'y' value against which to compare the predicted output 'ŷ' ("y hat").

About k-means clustering

  • K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means performs the division of objects into clusters that share similarities and are dissimilar to the objects belonging to another cluster.
  • The term ‘K’ is a number. You need to tell the system how many clusters you need to create.
  • The number of clusters can sometimes be guessed from the type of data. For instance, with the Titanic data set we might make two partitions, one for the passengers who survived and one for those who died.

Working Of K-means Clustering

Let’s say we have a data set that looks like the following figure when plotted out:

To us humans, this data looks like it fits neatly into three groups (i.e. clusters). A machine cannot see that, however: to it, the points are just numeric values with no inherent grouping.

k-means clustering is all about putting the data points we have into clusters. We want to know which data points belong together without having labels for any of them.

We start the algorithm by placing k different averages (i.e. means, also called centroids) on the plane, with values that are either initialized randomly or set to real data points. Let's start with k=3, as the data "seems" to fall into three groups. For the sake of explanation, let's randomly initialize the values (i.e. positions) of the averages.
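The initialization step can be sketched as follows. This is a minimal illustration, assuming 2-D points stored as tuples; the function name `init_centroids` and the sample data are hypothetical and do not come from the original figure. It uses the "set to real data points" option by picking k distinct points at random.

```python
import random

def init_centroids(points, k, seed=None):
    """Pick k distinct data points to serve as the starting centroids."""
    rng = random.Random(seed)  # a seed makes the random choice repeatable
    return rng.sample(points, k)

# Hypothetical 2-D data set (not the data from the original figure).
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 9.0), (2.0, 8.5)]
centroids = init_centroids(points, k=3, seed=0)
```

Different starting positions can lead to different final clusters, which is why implementations often run the algorithm several times with different seeds.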

Now the algorithm goes through the data points one by one, measuring the distance between each point and the three centroids (A, B and C). It then assigns each data point to its closest centroid (i.e. closest in distance).
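A rough sketch of this assignment step, assuming Euclidean distance and 2-D points. The centroid positions and the helper name `nearest_centroid` are made up for illustration.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical positions for centroids A, B and C, plus one data point.
centroids = {"A": (1.0, 1.0), "B": (8.0, 8.0), "C": (1.0, 9.0)}
names = list(centroids)
point = (2.0, 1.5)

# The point is labeled with whichever centroid is nearest to it.
label = names[nearest_centroid(point, list(centroids.values()))]
```

Here `label` comes out as "A", because the point sits far closer to centroid A than to B or C.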

For instance, data point number 21 will belong to group A in the green color, merely because it’s closer in distance to centroid A:

As soon as we're done associating each data point with its closest centroid, we recalculate the means, i.e. the values of the centroids. The new value of a centroid is the sum of all the points belonging to that centroid divided by the number of points in the group.
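The recalculation can be sketched like this, assuming each cluster is a non-empty list of 2-D tuples. The name `mean_point` and the sample cluster are illustrative only.

```python
def mean_point(cluster):
    """Average the points of one cluster, coordinate by coordinate."""
    n = len(cluster)  # assumes the cluster is not empty
    # zip(*cluster) groups all x values together and all y values together.
    return tuple(sum(coords) / n for coords in zip(*cluster))

# Hypothetical points currently assigned to centroid A.
cluster_a = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
new_centroid = mean_point(cluster_a)  # (2.0, 2.0)
```

A cluster that ends up with no points has no mean, so a full implementation needs a rule for that case, such as keeping the old centroid.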

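We keep repeating the two steps above until no centroid changes its value on recalculation. At that point each centroid has centered itself in the middle of its cluster, and every point lies closer to its own centroid than to any other. Strictly speaking, the boundaries between clusters are the lines equidistant from neighbouring centroids rather than true circles.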
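Putting the steps together, the whole loop might look like the sketch below. It is a plain-Python illustration under several assumptions: Euclidean distance, starting centroids drawn from the data, an empty cluster keeping its old centroid, and a hypothetical `max_iter` safety cap. It is not a tuned implementation.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k real data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to the mean of its points.
        # An empty cluster keeps its previous centroid (an assumed rule).
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data with three visually separate groups.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
    (1.0, 9.0), (2.0, 8.5), (1.5, 9.5),
]
centroids, clusters = k_means(points, k=3)
```

The result depends on the starting centroids, so the clusters found are not guaranteed to match the groups a human would draw.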

Its Use Case in the Security Domain

Clustering means grouping data points with similar features together, hence the name, and that idea carries over naturally to security data such as logs.

Here I describe a hypothetical model, which may or may not already have been applied in practice.

Suppose we work in a big company or firm or, most notably, on a military computer system, where logs about the daily or perhaps hourly operation of the systems are collected continuously.

The point is that such monitoring is prone to false positives, and here a false positive could even lead to country-wide conflict. To avoid that, I do the following.

I build a clustering model that groups all the logs into four clusters, where each cluster should mean something on its own as part of the whole system. Let's consider those.

We then inspect each of the four clusters to understand the nature and type of grouping the model produces. Because this is unsupervised machine learning, the clusters have no predefined meaning. Once we understand the nature of the cluster output on a small scale, the same interpretation should apply to large datasets (here, a large number of log files), provided the data keeps the same, consistent format. We will then most likely be able to avoid false positives, or at least minimize them.
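As a loose illustration of reusing small-scale results on more logs, the sketch below assigns new log records to the nearest of four centroids. Everything in it is assumed for the example: the two features, the centroid values, and the cluster names are hypothetical and are not taken from any real system.

```python
import math

# Hypothetical centroids, imagined as learned from a small inspected log sample.
# Assumed features: (failed logins per hour, outbound megabytes per hour).
centroids = {
    "routine": (1.0, 20.0),
    "heavy_transfer": (1.0, 900.0),
    "login_burst": (40.0, 25.0),
    "burst_and_transfer": (45.0, 850.0),
}

def label_log(features):
    """Return the name of the centroid nearest to one log's feature vector."""
    return min(centroids, key=lambda name: math.dist(features, centroids[name]))

# Made-up feature vectors for three new log records.
new_logs = [(0.0, 15.0), (38.0, 30.0), (2.0, 950.0)]
labels = [label_log(f) for f in new_logs]
```

In practice the features would usually be scaled first, since a large-valued feature such as megabytes would otherwise dominate the distance.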

The internal details are complex, so I will omit them. It is called unsupervised learning because we have no labels to steer the result the model produces, so we cannot dictate the kind of answers we get from it. However, if we put some constraints on the K-means model, somewhat in the spirit of an RBM (Restricted Boltzmann Machine), while ensuring that those constraints produce four different parts of the data signifying four types of outcome (false positives, false negatives, true positives and true negatives), then we can achieve our objective.

Thanks for reading.
