* K-Means Clustering
* Agglomerative Clustering
* DBSCAN Clustering
* Mean-Shift Clustering
* BIRCH Clustering
* Mini-batch k-means
Clustering aims to maximize intra-cluster similarity and minimize inter-cluster similarity.
Each clustering problem requires its own solution. In my observation, most tutorials and guidebooks focus on K-means clustering and the data preparation that precedes it. I want to introduce other clustering algorithms and explain when we need them.
To keep the explanation practical, I am going to use the IBM HR Analytics Employee Attrition & Performance dataset.
1) b - Descriptive Statistics of Data
1) c - Univariate Variable Analysis: Numerical Data
1) d - Univariate Variable Analysis: Categorical Data
1) e - Basic Data Analysis: Categorical Data
1) g - Correlation Matrix: Categorical Data
1) h - Correlation Matrix: Numerical Data
2) c - Agglomerative Clustering
!pip install -U scikit-learn
Collecting scikit-learn
  Using cached scikit_learn-0.24.1-cp38-cp38-win_amd64.whl (6.9 MB)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in c:\programdata\anaconda3\lib\site-packages (from scikit-learn) (1.5.2)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in c:\programdata\anaconda3\lib\site-packages (from scikit-learn) (1.19.2)
Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in c:\programdata\anaconda3\lib\site-packages (from scikit-learn) (2.1.0)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in c:\programdata\anaconda3\lib\site-packages (from scikit-learn) (0.17.0)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.23.2
    Uninstalling scikit-learn-0.23.2:
ERROR: Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: 'c:\\programdata\\anaconda3\\lib\\site-packages\\scikit_learn-0.23.2.dist-info\\COPYING' Consider using the `--user` option or check the permissions.
import sklearn
from sklearn import metrics
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from collections import Counter
import warnings
warnings.filterwarnings("ignore")
plt.style.use("seaborn-whitegrid")
data = pd.read_csv(r'C:\Users\Arda\Downloads\WA_Fn-UseC_-HR-Employee-Attrition.csv')
data.head()
Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
5 rows × 35 columns
data.dtypes
Age int64 Attrition object BusinessTravel object DailyRate int64 Department object DistanceFromHome int64 Education int64 EducationField object EmployeeCount int64 EmployeeNumber int64 EnvironmentSatisfaction int64 Gender object HourlyRate int64 JobInvolvement int64 JobLevel int64 JobRole object JobSatisfaction int64 MaritalStatus object MonthlyIncome int64 MonthlyRate int64 NumCompaniesWorked int64 Over18 object OverTime object PercentSalaryHike int64 PerformanceRating int64 RelationshipSatisfaction int64 StandardHours int64 StockOptionLevel int64 TotalWorkingYears int64 TrainingTimesLastYear int64 WorkLifeBalance int64 YearsAtCompany int64 YearsInCurrentRole int64 YearsSinceLastPromotion int64 YearsWithCurrManager int64 dtype: object
data.isnull().sum()
Age 0 Attrition 0 BusinessTravel 0 DailyRate 0 Department 0 DistanceFromHome 0 Education 0 EducationField 0 EmployeeCount 0 EmployeeNumber 0 EnvironmentSatisfaction 0 Gender 0 HourlyRate 0 JobInvolvement 0 JobLevel 0 JobRole 0 JobSatisfaction 0 MaritalStatus 0 MonthlyIncome 0 MonthlyRate 0 NumCompaniesWorked 0 Over18 0 OverTime 0 PercentSalaryHike 0 PerformanceRating 0 RelationshipSatisfaction 0 StandardHours 0 StockOptionLevel 0 TotalWorkingYears 0 TrainingTimesLastYear 0 WorkLifeBalance 0 YearsAtCompany 0 YearsInCurrentRole 0 YearsSinceLastPromotion 0 YearsWithCurrManager 0 dtype: int64
data.describe()
Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.0 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | ... | 1470.000000 | 1470.0 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 |
mean | 36.923810 | 802.485714 | 9.192517 | 2.912925 | 1.0 | 1024.865306 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | ... | 2.712245 | 80.0 | 0.793878 | 11.279592 | 2.799320 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
std | 9.135373 | 403.509100 | 8.106864 | 1.024165 | 0.0 | 602.024335 | 1.093082 | 20.329428 | 0.711561 | 1.106940 | ... | 1.081209 | 0.0 | 0.852077 | 7.780782 | 1.289271 | 0.706476 | 6.126525 | 3.623137 | 3.222430 | 3.568136 |
min | 18.000000 | 102.000000 | 1.000000 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 30.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 80.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 30.000000 | 465.000000 | 2.000000 | 2.000000 | 1.0 | 491.250000 | 2.000000 | 48.000000 | 2.000000 | 1.000000 | ... | 2.000000 | 80.0 | 0.000000 | 6.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 2.000000 |
50% | 36.000000 | 802.000000 | 7.000000 | 3.000000 | 1.0 | 1020.500000 | 3.000000 | 66.000000 | 3.000000 | 2.000000 | ... | 3.000000 | 80.0 | 1.000000 | 10.000000 | 3.000000 | 3.000000 | 5.000000 | 3.000000 | 1.000000 | 3.000000 |
75% | 43.000000 | 1157.000000 | 14.000000 | 4.000000 | 1.0 | 1555.750000 | 4.000000 | 83.750000 | 3.000000 | 3.000000 | ... | 4.000000 | 80.0 | 1.000000 | 15.000000 | 3.000000 | 3.000000 | 9.000000 | 7.000000 | 3.000000 | 7.000000 |
max | 60.000000 | 1499.000000 | 29.000000 | 5.000000 | 1.0 | 2068.000000 | 4.000000 | 100.000000 | 4.000000 | 5.000000 | ... | 4.000000 | 80.0 | 3.000000 | 40.000000 | 6.000000 | 4.000000 | 40.000000 | 18.000000 | 15.000000 | 17.000000 |
8 rows × 26 columns
sns.set(style='white',font_scale=1.3, rc={'figure.figsize':(20,20)})
ax=data.hist(bins=20,color='blue')
def bar_plot(variable):
    # Plot the frequency of each category of a categorical variable and print the counts
    var = data[variable]
    varValue = var.value_counts()
    plt.figure(figsize=(9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values, rotation='vertical')
    plt.ylabel('Frequency')
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable, varValue))
category_attrition = ['Attrition']
for c in category_attrition:
bar_plot(c)
Attrition: No 1233 Yes 237 Name: Attrition, dtype: int64
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder()
data['BusinessTravel_Encoded'] = ord_enc.fit_transform(data[['BusinessTravel']])
### TRAVEL RARELY = 2
### TRAVEL FREQUENTLY = 1
### NON-TRAVEL = 0
category_businesstravel = ['BusinessTravel']
for c in category_businesstravel:
bar_plot(c)
BusinessTravel: Travel_Rarely 1043 Travel_Frequently 277 Non-Travel 150 Name: BusinessTravel, dtype: int64
data['Department_Encoded'] = ord_enc.fit_transform(data[['Department']])
category_department = ['Department']
### SALES = 2
### RESEARCH = 1
### HR = 0
for c in category_department:
bar_plot(c)
Department: Research & Development 961 Sales 446 Human Resources 63 Name: Department, dtype: int64
data['EducationField_Encoded'] = ord_enc.fit_transform(data[['EducationField']])
### TECHNICAL DEGREE = 5
### OTHER = 4
### MEDICAL = 3
### MARKETING = 2
### LIFE SCIENCES = 1
### HR = 0
category_education = ['EducationField']
for c in category_education:
bar_plot(c)
EducationField: Life Sciences 606 Medical 464 Marketing 159 Technical Degree 132 Other 82 Human Resources 27 Name: EducationField, dtype: int64
data['Gender_Encoded'] = ord_enc.fit_transform(data[['Gender']])
### MALE = 1
### FEMALE = 0
category_gender = ['Gender']
for c in category_gender:
bar_plot(c)
Gender: Male 882 Female 588 Name: Gender, dtype: int64
data['JobRole_Encoded'] = ord_enc.fit_transform(data[['JobRole']])
### SALES REPRESENTATIVE = 8
### SALES EXECUTIVE = 7
### RESEARCH SCIENTIST = 6
### RESEARCH DIRECTOR = 5
### MANUFACTURING DIRECTOR = 4
### MANAGER = 3
### LABORATORY TECHNICIAN = 2
### HR = 1
### HEALTHCARE REPRESENTATIVE = 0
category_jobrole = ['JobRole']
for c in category_jobrole:
bar_plot(c)
JobRole: Sales Executive 326 Research Scientist 292 Laboratory Technician 259 Manufacturing Director 145 Healthcare Representative 131 Manager 102 Sales Representative 83 Research Director 80 Human Resources 52 Name: JobRole, dtype: int64
data['MaritalStatus_Encoded'] = ord_enc.fit_transform(data[['MaritalStatus']])
### SINGLE = 2
### MARRIED = 1
### DIVORCED = 0
category_maritalstatus = ['MaritalStatus']
for c in category_maritalstatus:
bar_plot(c)
MaritalStatus: Married 673 Single 470 Divorced 327 Name: MaritalStatus, dtype: int64
data['OverTime_Encoded'] = ord_enc.fit_transform(data[['OverTime']])
### YES = 1
### NO = 0
category_overtime = ['OverTime']
for c in category_overtime:
bar_plot(c)
OverTime: No 1054 Yes 416 Name: OverTime, dtype: int64
* Over Time - Attrition
* Job Role - Attrition
* Marital Status - Attrition
data['Attrition_Binary'] = ord_enc.fit_transform(data[['Attrition']])
#data[['Attrition_Binary','Attrition']].head(10)
# 1 = YES
# 0 = NO
data[['OverTime','Attrition_Binary']].groupby(['OverTime'], as_index = False).mean().sort_values(by='Attrition_Binary',ascending=False)
OverTime | Attrition_Binary | |
---|---|---|
1 | Yes | 0.305288 |
0 | No | 0.104364 |
data[['JobRole','Attrition_Binary']].groupby(['JobRole'], as_index = False).mean().sort_values(by='Attrition_Binary',ascending=False)
JobRole | Attrition_Binary | |
---|---|---|
8 | Sales Representative | 0.397590 |
2 | Laboratory Technician | 0.239382 |
1 | Human Resources | 0.230769 |
7 | Sales Executive | 0.174847 |
6 | Research Scientist | 0.160959 |
4 | Manufacturing Director | 0.068966 |
0 | Healthcare Representative | 0.068702 |
3 | Manager | 0.049020 |
5 | Research Director | 0.025000 |
data[['MaritalStatus','Attrition_Binary']].groupby(['MaritalStatus'], as_index = False).mean().sort_values(by='Attrition_Binary',ascending=False)
MaritalStatus | Attrition_Binary | |
---|---|---|
2 | Single | 0.255319 |
1 | Married | 0.124814 |
0 | Divorced | 0.100917 |
def detect_outliers(df, features):
    # Flag rows that are outliers (outside 1.5 * IQR) in more than two of the given features
    outlier_indices = []
    for c in features:
        # 1st and 3rd quartiles and the interquartile range for this feature
        Q1 = np.percentile(df[c], 25)
        Q3 = np.percentile(df[c], 75)
        IQR = Q3 - Q1
        outlier_step = IQR * 1.5
        # indices of rows falling outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        outlier_indices.extend(outlier_list_col)
    # keep only rows that are outliers in more than two features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(i for i, v in outlier_indices.items() if v > 2)
    return multiple_outliers
data.loc[detect_outliers(data,['Age','DailyRate','DistanceFromHome','EmployeeNumber','MonthlyIncome','MonthlyRate','NumCompaniesWorked','PercentSalaryHike','PerformanceRating','StockOptionLevel','TotalWorkingYears','TrainingTimesLastYear','YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager'])]
Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | YearsSinceLastPromotion | YearsWithCurrManager | BusinessTravel_Encoded | Department_Encoded | EducationField_Encoded | Gender_Encoded | JobRole_Encoded | MaritalStatus_Encoded | OverTime_Encoded | Attrition_Binary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
45 | 41 | Yes | Travel_Rarely | 1360 | Research & Development | 12 | 3 | Technical Degree | 1 | 58 | ... | 15 | 8 | 2.0 | 1.0 | 5.0 | 0.0 | 5.0 | 1.0 | 0.0 | 1.0 |
62 | 50 | No | Travel_Rarely | 989 | Research & Development | 7 | 2 | Medical | 1 | 80 | ... | 13 | 8 | 2.0 | 1.0 | 3.0 | 0.0 | 5.0 | 0.0 | 1.0 | 0.0 |
105 | 59 | No | Non-Travel | 1420 | Human Resources | 2 | 4 | Human Resources | 1 | 140 | ... | 2 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 |
123 | 51 | No | Travel_Rarely | 684 | Research & Development | 6 | 3 | Life Sciences | 1 | 162 | ... | 15 | 15 | 2.0 | 1.0 | 1.0 | 1.0 | 5.0 | 2.0 | 0.0 | 0.0 |
186 | 40 | No | Travel_Rarely | 989 | Research & Development | 4 | 1 | Medical | 1 | 253 | ... | 9 | 9 | 2.0 | 1.0 | 3.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1086 | 50 | No | Travel_Frequently | 333 | Research & Development | 22 | 5 | Medical | 1 | 1539 | ... | 13 | 9 | 1.0 | 1.0 | 3.0 | 1.0 | 5.0 | 2.0 | 1.0 | 0.0 |
1138 | 50 | No | Travel_Frequently | 1234 | Research & Development | 20 | 5 | Medical | 1 | 1606 | ... | 12 | 13 | 1.0 | 1.0 | 3.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
1327 | 46 | No | Travel_Rarely | 1319 | Sales | 3 | 3 | Technical Degree | 1 | 1863 | ... | 2 | 8 | 2.0 | 2.0 | 5.0 | 0.0 | 7.0 | 0.0 | 0.0 | 0.0 |
926 | 43 | No | Travel_Rarely | 531 | Sales | 4 | 4 | Marketing | 1 | 1293 | ... | 15 | 17 | 2.0 | 2.0 | 2.0 | 0.0 | 7.0 | 2.0 | 0.0 | 0.0 |
1078 | 44 | No | Travel_Rarely | 136 | Research & Development | 28 | 3 | Life Sciences | 1 | 1523 | ... | 14 | 17 | 2.0 | 1.0 | 1.0 | 1.0 | 5.0 | 1.0 | 0.0 | 0.0 |
76 rows × 43 columns
data = data.drop(detect_outliers(data,['Age','DailyRate','DistanceFromHome','EmployeeNumber','MonthlyIncome','MonthlyRate','NumCompaniesWorked','PercentSalaryHike','PerformanceRating','StockOptionLevel','TotalWorkingYears','TrainingTimesLastYear','YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager']),axis = 0).reset_index(drop = True)
data.plot( kind = 'box', subplots = True, layout = (6,6), sharex = False, sharey = False,color='blue')
plt.show()
encoded_data_features = data[['Attrition_Binary','BusinessTravel_Encoded','Department_Encoded','EducationField_Encoded','Gender_Encoded','JobRole_Encoded','MaritalStatus_Encoded','OverTime_Encoded']]
f,ax = plt.subplots(figsize=(10,10))
sns.heatmap(encoded_data_features.corr(),annot=True, linewidths=5, ax=ax)
plt.show()
numerical_data_features = data[['Attrition_Binary','Age','DistanceFromHome','JobLevel','JobSatisfaction','MonthlyIncome','PercentSalaryHike','TotalWorkingYears','YearsAtCompany','YearsSinceLastPromotion']]
f,ax = plt.subplots(figsize=(10,10))
sns.heatmap(numerical_data_features.corr(),annot=True, linewidths=5, ax=ax)
plt.show()
Centroid-based clustering: each cluster is represented by a central reference vector, which may not be part of the original data, e.g. K-means clustering.
* K-means Clustering
Connectivity-based clustering builds on the core idea that points are connected to nearby points rather than to points farther away. A cluster can be defined largely by the maximum distance needed to connect its different parts. These algorithms do not partition the dataset; instead they construct a tree in which points are progressively merged together.
* Agglomerative Clustering
* BIRCH Clustering
Distribution-based clustering is built on statistical distribution models: the objects in a cluster are those most likely to belong to the same distribution. These tend to be complex clustering models which can be prone to overfitting the data points.
* Gaussian mixture models
Density-based clustering creates clusters from areas with a higher density of data points. Objects in the sparse areas that separate clusters are considered noise or border points.
* DBSCAN Clustering
* Mean-shift Clustering
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn.cluster import MeanShift
from sklearn.cluster import Birch
from sklearn.cluster import MiniBatchKMeans
IBM_data = data[['Attrition_Binary','OverTime_Encoded','MonthlyIncome','TotalWorkingYears']]
IBM_data.head()
Attrition_Binary | OverTime_Encoded | MonthlyIncome | TotalWorkingYears | |
---|---|---|---|---|
0 | 1.0 | 1.0 | 5993 | 8 |
1 | 0.0 | 0.0 | 5130 | 10 |
2 | 1.0 | 1.0 | 2090 | 7 |
3 | 0.0 | 1.0 | 2909 | 8 |
4 | 0.0 | 0.0 | 3468 | 6 |
IBM_data.shape
(1394, 4)
IBM_data = IBM_data.sample(frac=1).reset_index(drop=True)
IBM_data.head()
IBM_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1394 entries, 0 to 1393
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Attrition_Binary   1394 non-null   float64
 1   OverTime_Encoded   1394 non-null   float64
 2   MonthlyIncome      1394 non-null   int64
 3   TotalWorkingYears  1394 non-null   int64
dtypes: float64(2), int64(2)
memory usage: 43.7 KB
IBM_data_features = IBM_data.drop('Attrition_Binary', axis=1)
IBM_data_features.head()
OverTime_Encoded | MonthlyIncome | TotalWorkingYears | |
---|---|---|---|
0 | 0.0 | 2370 | 8 |
1 | 0.0 | 6151 | 19 |
2 | 0.0 | 8392 | 10 |
3 | 1.0 | 8189 | 12 |
4 | 1.0 | 13726 | 30 |
IBM_data_attrition = IBM_data['Attrition_Binary']
IBM_data_attrition.sample(10)
102     0.0
82      0.0
635     0.0
1261    0.0
325     0.0
686     0.0
179     0.0
415     0.0
1144    1.0
1056    0.0
Name: Attrition_Binary, dtype: float64
A clustering satisfies homogeneity if each of its clusters contains only points that are members of a single class.
The actual label values do not matter, i.e. the fact that actual label 1 corresponds to cluster label 2 does
not affect this score.
A clustering satisfies completeness if all the points that are members of the same class belong to the same cluster.
The V-measure is the harmonic mean of the homogeneity and completeness scores; the harmonic mean is the average usually used for rates.
The adjusted Rand index (ARI) is a similarity measure between clusterings which is adjusted for chance, i.e. for random labeling of data points.
Close to 0: the data was randomly labeled
Exactly 1: actual and predicted clusters are identical
The adjusted mutual information (AMI) is the information obtained about one random variable by observing another random variable, adjusted to account for chance.
Close to 0: the data was randomly labeled
Exactly 1: actual and predicted clusters are identical
The silhouette score uses a distance metric to measure how similar a point is to its own cluster and how dissimilar it is from points in other clusters. It ranges between -1 and 1, and positive values closer to 1 indicate that the clustering was good.
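As a quick illustration of the point above (a toy sketch with made-up labels, not part of the original analysis), the snippet below shows that homogeneity and completeness care only about the grouping, not about the label values themselves:
# Toy example: the cluster labels are a pure relabeling of the true classes,
# so homogeneity, completeness and V-measure are all 1.0.
true_labels = [0, 0, 1, 1]
cluster_labels = [1, 1, 0, 0]
print(metrics.homogeneity_score(true_labels, cluster_labels))   # 1.0
print(metrics.completeness_score(true_labels, cluster_labels))  # 1.0
print(metrics.v_measure_score(true_labels, cluster_labels))     # 1.0
# Splitting one true class across two clusters keeps homogeneity at 1.0 but lowers completeness.
print(metrics.homogeneity_score([0, 0, 1, 1], [0, 1, 2, 2]))    # 1.0
print(metrics.completeness_score([0, 0, 1, 1], [0, 1, 2, 2]))   # < 1.0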
def BuildModel(clustering_model, data, labels):
    # Fit the given clustering function on the data and report the external metrics
    # (against the known attrition labels) plus the internal silhouette score
    model = clustering_model(data)
    print('homo\tcompl\tv-means\tARI\tAMI\tsilhouette')
    print(50*'_')
    print('%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
          %(metrics.homogeneity_score(labels, model.labels_),
            metrics.completeness_score(labels, model.labels_),
            metrics.v_measure_score(labels, model.labels_),
            metrics.adjusted_rand_score(labels, model.labels_),
            metrics.adjusted_mutual_info_score(labels, model.labels_),
            metrics.silhouette_score(data, model.labels_)))
To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids, which are used as the starting points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids. It stops creating and optimizing clusters when either:
The centroids have stabilized — there is no change in their values because the clustering has been successful.
The defined number of iterations has been achieved.
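To make that loop concrete, here is a minimal NumPy sketch of the procedure described above (illustrative only; it assumes a small 2-D array X, ignores the empty-cluster edge case, and is not the scikit-learn implementation):
def simple_k_means(X, k, max_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # 1. pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):                            # 2. stop once the iteration budget is reached
        # 3. assign every point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # 4. move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):         # 5. stop early once the centroids stabilize
            break
        centroids = new_centroids
    return centroids, assignments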
* Need distance measure as well as way to aggregate points in a cluster
* Must represent data as vectors in N-dimensional hyperspace
* Data representation can be difficult for complex data types
* Variants can efficiently deal with very large datasets on disk
* Only need distance measure; do not need way to combine points in cluster
* No need to express data as vectors in N-dimensional hyperspace
* Relatively simple to represent even complex documents
* Even with careful construction, too computationally expensive for large datasets on disk
def k_means(data,n_clusters=2, max_iter=1000):
model = KMeans(n_clusters=n_clusters, max_iter=max_iter).fit(data)
return model
BuildModel(k_means,IBM_data_features,IBM_data_attrition)
homo compl v-means ARI AMI silhouette __________________________________________________ 0.006 0.006 0.006 -0.038 0.005 0.698
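The number of clusters is a hyperparameter we have to pick ourselves. A quick sanity check (a sketch, not part of the original notebook) is to sweep a few values of k and compare the inertia ("elbow" method) and silhouette score:
# Sweep candidate cluster counts and print inertia and silhouette for each.
for k in range(2, 7):
    km = KMeans(n_clusters=k, max_iter=1000).fit(IBM_data_features)
    sil = metrics.silhouette_score(IBM_data_features, km.labels_)
    print(k, round(km.inertia_, 1), round(sil, 3))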
Agglomerative clustering is bottom-up hierarchical clustering. It builds a tree representation of our data points and repeatedly merges data points and clusters into one another. Each step of agglomerative clustering merges the two clusters nearest to each other.
What is the metric for nearness?
There are a few different options: Euclidean, L1, cosine, or a precomputed distance matrix.
How is nearness measured between clusters?
The linkage criterion determines the distance to be minimized when merging clusters. There are four linkage criteria.
* Single: Minimum of the distances between all points in two clusters.
* Complete: Maximum of the distances between all points in two clusters
* Average: Average distance between points in clusters.
* Ward: Minimizes the increase in within-cluster variance when the two clusters are merged.
def agglomerative(data,n_clusters=2):
model = AgglomerativeClustering(n_clusters = n_clusters).fit(data)
return model
Bottom-up hierarchical clustering approach which recursively merges pairs of clusters, starting with single point
clusters.
Each merge tries to minimally increase the linkage distance between pairs of clusters.
The default linkage criterion is ward which minimizes the variances of clusters being merged.
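To see how the linkage choice affects the result on our data (a sketch assuming the same features as above; linkage is a standard AgglomerativeClustering parameter), we can fit each criterion and compare silhouette scores:
# Compare the four linkage criteria on our feature set.
for linkage in ['ward', 'complete', 'average', 'single']:
    agg = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit(IBM_data_features)
    print(linkage, round(metrics.silhouette_score(IBM_data_features, agg.labels_), 3))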
BuildModel(agglomerative,IBM_data_features,IBM_data_attrition)
homo compl v-means ARI AMI silhouette __________________________________________________ 0.006 0.005 0.005 -0.032 0.005 0.675
As mentioned earlier, when we have a large dataset and a moderate number of clusters we might consider K-means or DBSCAN.
K-means for even cluster sizes and flat surfaces
DBSCAN for uneven cluster sizes and manifolds
DBSCAN = Density Based Spatial Clustering of Applications with Noise
Density-based clustering groups together closely packed points
Points with few near neighbours are marked as outliers
Not as good as BIRCH at dealing with noise and outliers
There are two main parameters that we need to consider and specify
eps = maximum distance between two points for them to still be considered neighbors (a heuristic for choosing it is sketched after this list)
* If eps is too small most of the data will not be clustered
* Unclustered points will be considered to be outliers
* If eps is too large clustering will be too coarse
* Most of the points will be in the same cluster
min_samples = Minimum number of points to form a dense region
* Generally this should be greater than the number of dimensions in the data
* Larger values are better for noisy data and will form more significant clusters
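A common heuristic for choosing eps (a sketch, not from the original notebook; note the features are unscaled, so distances here are dominated by MonthlyIncome) is to plot each point's distance to its k-th nearest neighbor and look for the "knee" of the sorted curve:
from sklearn.neighbors import NearestNeighbors

# Distance of every point to its 2nd nearest neighbor (matching min_samples=2 below);
# the "knee" of this sorted curve is a reasonable candidate for eps.
nn = NearestNeighbors(n_neighbors=2).fit(IBM_data_features)
distances, _ = nn.kneighbors(IBM_data_features)
plt.plot(np.sort(distances[:, 1]))
plt.ylabel('Distance to 2nd nearest neighbor')
plt.show()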
def dbscan(data, eps=0.45, min_samples=2):
model = DBSCAN(eps=eps, min_samples=min_samples).fit(data)
return model
Groups points that are close to each other based on a distance measure and a minimum number of points
All points within that maximum distance (eps) are considered neighbors
The eps value determines what we consider a dense region - smaller values are preferred
min_samples is the minimum number of points in the neighborhood for a point to be a core point
BuildModel(dbscan,IBM_data_features,IBM_data_attrition)
homo compl v-means ARI AMI silhouette __________________________________________________ 0.007 0.031 0.011 0.006 -0.001 -0.818
Mean-shift starts with a set of points in space.
For each point it defines a neighborhood, and it does this for all data points,
so each point has its own neighborhood.
For each point, a function is calculated based on all points in the neighborhood. This function is called the kernel.
Mean-shift has different kind of kernels
* Flat kernel: Sum of all points in neighborhood. Each point gets the same weight
* Gaussian kernel: Probability-weighted sum of points. Distribution is Gaussian
* Need to specify number of clusters as hyperparameter
* Cannot handle some complex non-linear data
* Less hyperparameter tuning needed
* Computationally less intensive
* O(N) in number of data points
* Struggles with outliers
* No need to specify number of clusters upfront as hyperparameter
* Uses density function to handle even complex non-linear data like pixels
* Hyperparameter tuning very important
* Computationally very intensive
* O(N²) in number of data points
* Copes better with outliers
def mean_shift(data, bandwidth=0.85):
model = MeanShift(bandwidth=bandwidth).fit(data)
return model
The algorithm tries to discover blobs in a smooth density of data points.
The original seeds of the clusters are determined using a binning technique.
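Instead of hand-picking bandwidth=0.85, the bandwidth (the size of the neighborhood) can be estimated from the data. A minimal sketch using scikit-learn's estimate_bandwidth helper (the quantile value is an assumption, not a choice from the original notebook):
from sklearn.cluster import estimate_bandwidth

# Estimate a bandwidth from the distribution of pairwise distances instead of guessing it.
bw = estimate_bandwidth(IBM_data_features, quantile=0.2)
print(bw)
ms = MeanShift(bandwidth=bw).fit(IBM_data_features)
print(len(np.unique(ms.labels_)), 'clusters found')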
BuildModel(mean_shift,IBM_data_features,IBM_data_attrition)
homo compl v-means ARI AMI silhouette __________________________________________________ 0.991 0.062 0.116 -0.000 -0.000 0.013
As mentioned earlier, when we have a large dataset and many clusters we might consider agglomerative clustering or BIRCH.
BIRCH detects and removes outliers
Also incrementally processes incoming data and updates clusters
BIRCH = Balanced Iterative Reducing and Clustering using Hierarchies
* Very effective at handling noise and outliers
* Very memory and time efficient
* Entire dataset need not be loaded into memory
* Incrementally clusters incoming data points
* Updates clusters as new data arrives
* Can deal with online streaming data !!
def birch(data,n_clusters=2):
model = Birch(n_clusters=n_clusters).fit(data)
return model
BuildModel(birch,IBM_data_features,IBM_data_attrition)
homo compl v-means ARI AMI silhouette __________________________________________________ 0.006 0.005 0.005 -0.032 0.005 0.675
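Because BIRCH clusters incrementally, new data can be fed in batches with partial_fit. The sketch below is illustrative only: it splits our existing feature set into two batches to mimic a stream rather than using real streaming data:
# Mimic streaming data by feeding the features to Birch in two batches via partial_fit.
stream_model = Birch(n_clusters=2)
stream_model.partial_fit(IBM_data_features.iloc[:700])   # first batch arrives
stream_model.partial_fit(IBM_data_features.iloc[700:])   # a later batch updates the CF tree
print(Counter(stream_model.predict(IBM_data_features)))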
When we have a large dataset and a moderate number of clusters we might also consider Mini-batch K-means
It performs K-means on randomly sampled subsets of the data
Iteratively performed on batches called mini-batches
Far faster than full K-means
Performance usually only slightly worse
def mini_batch_k_means(data, n_clusters=3, max_iter=1000):
model = MiniBatchKMeans(n_clusters=n_clusters, max_iter=max_iter, batch_size=20).fit(data)
return model
BuildModel(mini_batch_k_means,IBM_data_features,IBM_data_attrition)
homo compl v-means ARI AMI silhouette __________________________________________________ 0.029 0.013 0.018 -0.020 0.017 0.561
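The mini-batches can also be supplied explicitly through partial_fit, which makes the batching visible. A small sketch (the batch size of 100 is an arbitrary choice for illustration):
# Feed the data in chunks of 100 rows, updating the centroids after each mini-batch.
mbk = MiniBatchKMeans(n_clusters=2)
for start in range(0, len(IBM_data_features), 100):
    mbk.partial_fit(IBM_data_features.iloc[start:start + 100])
print(Counter(mbk.predict(IBM_data_features)))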
In this project, my main objective was to introduce different clustering techniques and compare their performance
on our dataset. Every clustering technique is designed for a different type of data and different cluster sizes.
Because of this, a different dataset, different data cleaning techniques, or different hyperparameter tuning
is likely to change our results.
To sum up;
Small dataset and many clusters = Mean-shift, Affinity Propagation
Medium dataset and few clusters = Spectral
Large dataset and a moderate number of clusters = K-means, DBSCAN
Large dataset and many clusters = BIRCH, Agglomerative
With our dataset, the most efficient clustering techniques were K-means, Agglomerative, and BIRCH, because our dataset
is relatively big and our natural cluster sizes are a little above average.
So why did the metrics other than silhouette give us such poor scores?
Probably I needed much better data preparation solutions such as PCA, and I made some statistical mistakes.
What did I learn?
Clustering is much deeper than most tutorials on the internet suggest. For different scenarios we have to pick the
right clustering algorithm and optimize it.
Which one did I like most?
I like BIRCH clustering most because its ability to handle online streaming data makes it very useful for
big data platforms and online machine learning services. Personally, I like dynamic ML solutions a lot, and I might
create a web project with the BIRCH algorithm.
* https://www.pluralsight.com/
* https://cloud.google.com/bigquery-ml/docs/kmeans-tutorial
* https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68