Recep Arda Kaya - 435060

Clustering Algorithms Comparison on IBM HR Data

Compared Algorithms:

* K-Means Clustering
* Agglomerative Clustering
* DBSCAN Clustering
* Mean-Shift Clustering
* BIRCH Clustering
* Mini-Batch K-Means Clustering

Objective

Clustering aims to maximize intra-cluster similarity and minimize inter-cluster similarity.

Each clustering problem requires its own solution. In my observation, most tutorials and guidebooks focus on K-means clustering and the data preparation that precedes it. I want to introduce other clustering algorithms and explain when we need them.

Data

For a more practical explanation, I am going to use the IBM HR Analytics Employee Attrition & Performance dataset.

Table of Contents

1) Review of the Data

1) a - Missing Value Check

1) b - Descriptive Statistics of Data

1) c - Univariate Variable Analysis: Numerical Data

1) d - Univariate Variable Analysis: Categorical Data

1) e - Basic Data Analysis: Categorical Data

1) f - Outlier Detection

1) g - Correlation Matrix: Categorical Data

1) h - Correlation Matrix: Numerical Data

2) Clustering Algorithms

2) a - Evaluation Metrics

2) b - K-Means Clustering

2) c - Agglomerative Clustering

2) d - DBSCAN Clustering

2) e - Mean-Shift Clustering

2) f - BIRCH Clustering

2) g - Mini batch K-Means Clustering

3) Results

4) References

Review of the Data

Missing Value Check
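
A minimal sketch of this check with pandas; the CSV file name is an assumption based on the Kaggle version of the dataset.

```python
# Load the dataset and count missing values per column (file name is hypothetical).
import pandas as pd

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
print(df.isnull().sum())   # the IBM HR dataset typically contains no missing values
```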

Descriptive Statistics of Data

Univariate Variable Analysis: Numerical Data

Univariate Variable Analysis: Categorical Data

Attrition

Business Travel

Department

Education Field

Gender

Job Role

Marital Status

Over Time

Basic Data Analysis: Categorical Data

* Over Time - Attrition
* Job Role - Attrition
* Marital Status - Attrition

Outlier Detection

Correlation Matrix With Categorical Data

According to the correlation matrix above, there is a meaningful relationship between attrition and overtime.
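
One possible way to build such a matrix is sketched below: the categorical columns are label-encoded before computing pairwise correlations. This is an assumption about the preparation step, continuing from the loading sketch above.

```python
# Hypothetical sketch: label-encode the object-typed columns and plot their correlations.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

cat_cols = df.select_dtypes(include="object").columns
encoded = df[cat_cols].apply(lambda col: LabelEncoder().fit_transform(col))

sns.heatmap(encoded.corr(), annot=True, cmap="coolwarm")
plt.show()
```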

Correlation Matrix With Numerical Data

According to the correlation matrix above, income, age, and job level are negatively correlated with attrition.

Based on the data analysis above, I decided to move on with the "Over Time, Monthly Income, Total Working Years" features of the dataset.
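
A sketch of this feature selection step, assuming the Kaggle column names (OverTime, MonthlyIncome, TotalWorkingYears) and standard scaling before clustering; these preparation choices are assumptions, not the only option.

```python
# Hypothetical preparation: encode OverTime, scale the three features, and keep
# the encoded Attrition column as the "true" label for the external metrics.
from sklearn.preprocessing import LabelEncoder, StandardScaler

features = df[["OverTime", "MonthlyIncome", "TotalWorkingYears"]].copy()
features["OverTime"] = LabelEncoder().fit_transform(features["OverTime"])  # Yes/No -> 1/0
X = StandardScaler().fit_transform(features)
y = LabelEncoder().fit_transform(df["Attrition"])
```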

Clustering Algorithms

1) Centroid Based

    Each cluster is represented by a central reference vector, which may not be part of the original data, e.g. k-means clustering.

    * K-means Clustering

2) Hierarchical

    Connectivity-based clustering built on the core idea that points are connected to points close by rather than
    further away. A cluster can be defined largely by the maximum distance needed to connect different parts of the
    cluster. These algorithms do not partition the dataset but instead construct a tree of points in which nearby
    points are successively merged together.

    * Agglomerative Clustering
    * BIRCH Clustering

3) Distribution Based

    Built on statistical distribution models - the objects of a cluster are the ones most likely to belong to the same
    distribution. These tend to be complex clustering models which can be prone to overfitting the data points.

    * Gaussian mixture models

4) Density Based

    Create clusters from areas which have a higher density of data points. Objects in sparse areas, which separate
    clusters, are considered noise or border points.

    * DBSCAN Clustering
    * Mean-shift Clustering


Evaluation Metrics

Homogeneity Score
    A clustering satisfies homogeneity if each of its clusters contains only points which are members of a single class.
    The actual label values do not matter, i.e. the fact that actual label 1 corresponds to cluster label 2 does not affect this score.
Completeness Score
    A clustering satisfies completeness if all the points that are members of the same class belong to the same cluster.
V-Measure Score
    Harmonic mean of the homogeneity and completeness scores - the harmonic mean is the usual way to average rates.
Adjusted Rand Score
    Similarity measure between two clusterings, adjusted for chance, i.e. for random labeling of data points.
    Close to 0: data was randomly labeled
    Exactly 1: actual and predicted clusters are identical
Adjusted Mutual Information Score
    Information obtained about one random variable by observing another random variable, adjusted to account for chance.
    Close to 0: data was randomly labeled
    Exactly 1: actual and predicted clusters are identical
Silhouette Score
    Uses a distance metric to measure how similar a point is to its own cluster and how dissimilar it is from points in other clusters.
    Ranges between -1 and 1; positive values closer to 1 indicate that the clustering was good.
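
As a reference, here is a small helper that computes these metrics with scikit-learn. It is a sketch that assumes the encoded Attrition column (y from the feature preparation sketch) is used as the ground-truth label.

```python
# Sketch of an evaluation helper using scikit-learn's metric functions.
from sklearn import metrics

def evaluate(X, labels_true, labels_pred):
    # External metrics compare the predicted clusters against the known labels.
    print("Homogeneity:  ", metrics.homogeneity_score(labels_true, labels_pred))
    print("Completeness: ", metrics.completeness_score(labels_true, labels_pred))
    print("V-measure:    ", metrics.v_measure_score(labels_true, labels_pred))
    print("Adjusted Rand:", metrics.adjusted_rand_score(labels_true, labels_pred))
    print("Adjusted MI:  ", metrics.adjusted_mutual_info_score(labels_true, labels_pred))
    # Silhouette is internal: it only needs the data and the predicted labels.
    print("Silhouette:   ", metrics.silhouette_score(X, labels_pred))
```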

K-Means Clustering

To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids, which
are used as the starting points for every cluster, and then performs iterative calculations to optimize the positions
of the centroids. It stops creating and optimizing clusters when either:
* The centroids have stabilized - there is no change in their values because the clustering has succeeded, or
* The defined number of iterations has been reached.
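
A minimal sketch of running K-means on the prepared features; n_clusters=2 is an assumption (mirroring the two Attrition classes), and X, y, and evaluate() come from the earlier sketches.

```python
# Minimal K-Means sketch; n_clusters=2 and random_state are illustrative choices.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans.fit_predict(X)   # X prepared in the feature selection sketch
evaluate(X, y, kmeans_labels)           # helper defined in the evaluation metrics sketch
```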


Contrasting K-Means and Hierarchical Clustering

K-Means
* Need distance measure as well as way to aggregate points in a cluster
* Must represent data as vectors in N-dimensional hyperspace
* Data representation can be difficult for complex data types
* Variants can efficiently deal with very large datasets on disk
Hierarchical
* Only need distance measure; do not need way to combine points in cluster
* No need to express data as vectors in N-dimensional hyperspace
* Relatively simple to represent even complex documents
* Even with careful construction, too computationally expensive for large datasets on disk
According to these scores, our K-means clustering did not perform well. Let's try other algorithms and hope for an improvement.

Agglomerative Clustering

Agglomerative Clustering is bottom-up hierarchical clustering. We build a tree representation of our data points and repeatedly merge points and clusters with one another. Each step of agglomerative clustering merges the two clusters nearest to each other.

What is the metric for nearness?
    There are a few different options: Euclidean, L1 (Manhattan), cosine, or a precomputed distance matrix.

How is nearness measured?
    The linkage criterion determines the distance to be minimized when merging clusters. There are four linkage criteria:
       * Single: minimum of the distances between all points in two clusters.
       * Complete: maximum of the distances between all points in two clusters.
       * Average: average distance between points in the two clusters.
       * Ward: minimizes the variance of the clusters being merged.
Bottom-up hierarchical clustering approach which recursively merges pairs of clusters, starting with single point 
clusters.
Each merge tries to minimally increase the linkage distance between pairs of clusters.
The default linkage criterion is ward which minimizes the variances of clusters being merged.
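
A minimal sketch using scikit-learn's AgglomerativeClustering with the default ward linkage; n_clusters=2 is an assumption, and X, y, and evaluate() come from the earlier sketches.

```python
# Agglomerative clustering sketch; ward linkage and n_clusters=2 are illustrative choices.
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
agg_labels = agg.fit_predict(X)
evaluate(X, y, agg_labels)
```
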
According to the scores, our agglomerative clustering did not perform well either. Let's try other algorithms and hope for an improvement.

DBSCAN Clustering

As shown before, when we have a large dataset and a moderate number of clusters we might consider k-means or DBSCAN:
K-means for even cluster sizes and flat geometry
DBSCAN for uneven cluster sizes and manifolds

DBSCAN = Density Based Spatial Clustering of Applications with Noise
    Density-based clustering groups together closely packed points
    Points with few near neighbours are marked as outliers
    Not as good as BIRCH at dealing with noise and outliers

There are two main parameters that we need to consider and specify
    eps = neighborhood radius - points closer than this are neighbors -
       * If eps is too small most of the data will not be clustered
       * Unclustered points will be considered to be outliers
       * If eps is too large clustering will be too coarse
       * Most of the points will be in the same cluster

    min_samples = Minimum number of points to form a dense region 
       * Generally this should be greater than number of dimensions in the data
       * Larger values work better for noisy data and will form more significant clusters
Groups points that are close to each other based on a distance measure and a minimum number of points.
All points within the maximum distance (eps) of a point are considered its neighbors.
The eps value determines what we consider a dense region - smaller values are preferred.
A point needs at least min_samples neighbors within eps to be a core point.
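
A minimal DBSCAN sketch; the eps and min_samples values below are illustrative starting points rather than tuned values.

```python
# DBSCAN sketch; points labeled -1 are treated as noise.
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)
print("Noise points:", (dbscan_labels == -1).sum())
evaluate(X, y, dbscan_labels)
```
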
According to the results, DBSCAN does not compare well with K-means or agglomerative clustering on this dataset.

Mean-shift Clustering

Starts with a set of points in space.
A neighborhood is defined around each data point, so each point has its own neighborhood.
For each point, a function is calculated over all the points in its neighborhood; this function is called the kernel.

Mean-shift supports different kinds of kernels:
    * Flat kernel: Sum of all points in the neighborhood. Each point gets the same weight.
    * Gaussian kernel: Probability-weighted sum of points. The weights follow a Gaussian distribution.

Contrasting K-Means and Mean-shift Clustering

K-means:
    * Need to specify number of clusters as hyperparameter
    * Cannot handle some complex non-linear data
    * Less hyperparameter tuning needed
    * Computationally less intensive
    * O(N) in number of data points 
    * Struggles with outliers

Mean-shift:
    * No need to specify number of clusters upfront as hyperparameter
    * Uses density function to handle even complex non-linear data like pixels
    * Hyperparameter tuning very important
    * Computationally very intensive
    * O(N²) in number of data points
    * Copes better with outliers
The algorithm tries to discover blobs in a smooth density of data points.
The original seeds for the clusters are determined using a binning technique.
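
A minimal mean-shift sketch; the bandwidth is estimated from the data, and the quantile value is an illustrative choice rather than a tuned one.

```python
# Mean-shift sketch; bin_seeding uses the binning technique mentioned above for the seeds.
from sklearn.cluster import MeanShift, estimate_bandwidth

bandwidth = estimate_bandwidth(X, quantile=0.2)   # quantile is an illustrative value
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms_labels = ms.fit_predict(X)
print("Clusters found:", len(set(ms_labels)))
evaluate(X, y, ms_labels)
```
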
According to the results, mean-shift does not compare well with K-means or agglomerative clustering on this dataset.

BIRCH Clustering

As shown before, when we have a large dataset and many clusters we might consider agglomerative clustering or BIRCH.
BIRCH detects and removes outliers.
It also incrementally processes incoming data and updates the clusters.

BIRCH = Balanced Iterative Reducing and Clustering using Hierarchies
    * Very effective at handling noise and outliers
    * Very memory and time efficient 
    * Entire dataset need not be loaded into memory
    * Incrementally clusters incoming data points
    * Updates clusters as new data arrives
    * Can deal with online streaming data!
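
A minimal BIRCH sketch; the threshold and n_clusters values are illustrative, and X, y, and evaluate() come from the earlier sketches.

```python
# BIRCH sketch; the threshold controls the radius of the subclusters in the CF tree.
from sklearn.cluster import Birch

birch = Birch(threshold=0.5, n_clusters=2)
birch_labels = birch.fit_predict(X)
evaluate(X, y, birch_labels)
```
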
As a result, BIRCH did well on the silhouette score among the last three algorithms. By far, our best scores came from K-means, agglomerative, and BIRCH.

Mini-batch K-means Clustering

When we have a large dataset and a moderate number of clusters we might also consider Mini-batch K-means.
It performs K-means on randomly sampled subsets of the data.
The algorithm runs iteratively on batches called mini-batches.
It is far faster than full K-means.
Performance is usually only slightly worse.
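
A minimal mini-batch K-means sketch; batch_size is an illustrative value, and n_clusters matches the full K-means run for comparison.

```python
# Mini-batch K-Means sketch; same n_clusters as the full K-means run.
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=2, batch_size=100, random_state=42)
mbk_labels = mbk.fit_predict(X)
evaluate(X, y, mbk_labels)
```
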
According to the results, our mini-batch K-means model performed slightly worse than the original K-means algorithm.

Results

    In this project, my main objective was to introduce different clustering techniques and compare their performance
 on our dataset. Every clustering technique is designed for a different type of data and different cluster structures.
 Because of this, every dataset, every data cleaning technique, and every hyperparameter tuning choice is likely
 to change our results.

    To sum up:
        Small dataset, many clusters = Mean-shift, Affinity Propagation
        Medium dataset, few clusters = Spectral
        Large dataset, moderate number of clusters = K-means, DBSCAN
        Large dataset, many clusters = BIRCH, Agglomerative

    With our dataset, the most effective clustering techniques were K-means, Agglomerative, and BIRCH, because our dataset
is relatively big and our natural number of clusters is a little above average.

So why did the metrics other than silhouette give us such bad scores?
    Probably I needed better data preparation (for example, dimensionality reduction such as PCA), and I likely made some statistical mistakes.

What did I learn?
    Clustering goes much deeper than most tutorials on the internet suggest. For different scenarios we have to pick
    the right clustering algorithm and optimize it.

Which one did I like most?
    I liked BIRCH clustering most, because its ability to handle online streaming data makes it very useful for
big data platforms and online machine learning services. Personally, I like dynamic ML solutions a lot, and I might
build a web project around the BIRCH algorithm.

References

* https://www.pluralsight.com/
* https://cloud.google.com/bigquery-ml/docs/kmeans-tutorial
* https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68