* As the number of input variables grows, several problems arise: the data becomes hard to visualize, and training and prediction become slower and more expensive. My main objective is to show how PCA mitigates these problems and how the improvement shows up in the accuracy score.
For a more practical explanation, I am going to use the Cardiovascular Disease dataset.
!pip install -U scikit-learn
## Necessary libraries imported
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import RobustScaler
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv(r'C:\Users\Arda\Downloads\cardio_train.csv', sep=";" )
data.head()
|   | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
data.dtypes
id               int64
age              int64
gender           int64
height           int64
weight         float64
ap_hi            int64
ap_lo            int64
cholesterol      int64
gluc             int64
smoke            int64
alco             int64
active           int64
cardio           int64
dtype: object
data.isnull().sum()
id             0
age            0
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
dtype: int64
data.describe()
|   | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 |
| mean | 49972.419900 | 19468.865814 | 1.349571 | 164.359229 | 74.205690 | 128.817286 | 96.630414 | 1.366871 | 1.226457 | 0.088129 | 0.053771 | 0.803729 | 0.499700 |
| std | 28851.302323 | 2467.251667 | 0.476838 | 8.210126 | 14.395757 | 154.011419 | 188.472530 | 0.680250 | 0.572270 | 0.283484 | 0.225568 | 0.397179 | 0.500003 |
| min | 0.000000 | 10798.000000 | 1.000000 | 55.000000 | 10.000000 | -150.000000 | -70.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 25006.750000 | 17664.000000 | 1.000000 | 159.000000 | 65.000000 | 120.000000 | 80.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 50% | 50001.500000 | 19703.000000 | 1.000000 | 165.000000 | 72.000000 | 120.000000 | 80.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 75% | 74889.250000 | 21327.000000 | 2.000000 | 170.000000 | 82.000000 | 140.000000 | 90.000000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| max | 99999.000000 | 23713.000000 | 2.000000 | 250.000000 | 200.000000 | 16020.000000 | 11000.000000 | 3.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
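The summary statistics already hint at data-quality issues: ap_hi (systolic pressure) ranges from -150 to 16020 and ap_lo (diastolic) from -70 to 11000, which is physiologically impossible. A quick way to count how many rows fall outside a plausible range (the bounds below are my own rough assumption, not part of the original analysis):

# Rough plausibility bounds for blood pressure (assumed for illustration only)
implausible = ((data['ap_hi'] < 60) | (data['ap_hi'] > 250) |
               (data['ap_lo'] < 30) | (data['ap_lo'] > 200))
print(f"Rows with implausible blood pressure: {implausible.sum()}")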
sns.set(style='white',font_scale=1.5, rc={'figure.figsize':(20,20)})
ax=data.hist(bins=20,color='blue')
x=data.drop(['cardio'],axis=1)
y=data['cardio']
scaler=RobustScaler()
x = scaler.fit_transform(x)
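RobustScaler is a good fit here because of the extreme outliers seen above: it centers each feature on its median and divides by the interquartile range (IQR), so a handful of absurd blood-pressure values does not dominate the scale. A minimal NumPy sketch of the same transformation, shown only for intuition (the sklearn call above is what the analysis actually uses):

# What RobustScaler computes, feature by feature:
# (value - median) / (75th percentile - 25th percentile)
def robust_scale(column):
    median = np.median(column)
    q1, q3 = np.percentile(column, [25, 75])
    return (column - median) / (q3 - q1)

# Example: scale the weight column by hand
scaled_weight = robust_scale(data['weight'].to_numpy())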
Dimensionality reduction is a way to reduce the complexity of a model and avoid overfitting. There are two main categories of dimensionality reduction: feature selection and feature extraction. Via feature selection, we select a subset of the original features, whereas in feature extraction, we derive information from the feature set to construct a new feature subspace.
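To make the distinction concrete, here is a small side-by-side sketch; it is illustrative only and not part of the original analysis. Feature selection keeps a subset of the original columns, while feature extraction (like PCA below) builds entirely new features:

from sklearn.feature_selection import SelectKBest, f_classif

# Feature selection: keep the 5 original columns most related to y
selected = SelectKBest(score_func=f_classif, k=5).fit_transform(x, y)

# Feature extraction: build 5 new features as linear combinations of all columns
extracted = PCA(n_components=5).fit_transform(x)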
The Principal Component Analysis (PCA) algorithm is used to compress a dataset onto a lower-dimensional feature subspace with the goal of maintaining most of the relevant information.
PCA helps us to identify patterns in data based on the correlation between features. In a nutshell, PCA aims to find the directions of maximum variance in high-dimensional data and projects it onto a new subspace with equal or fewer dimensions than the original one.
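Under the hood, those "directions of maximum variance" are the eigenvectors of the data's covariance matrix, and the eigenvalues are the variances along them. A minimal NumPy sketch of PCA from scratch, for intuition only (the sklearn class is what is used below):

# PCA by hand: eigendecomposition of the covariance matrix
x_centered = x - x.mean(axis=0)                  # PCA assumes mean-centered data
cov = np.cov(x_centered, rowvar=False)           # covariance matrix of the features
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: for symmetric matrices

# Sort components by descending eigenvalue (variance explained)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the data onto the top-2 principal components
x_projected = x_centered @ eigenvectors[:, :2]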
The explained variance ratio of an eigenvalue λ_j is simply the ratio of λ_j to the total sum of the eigenvalues:

explained variance ratio = λ_j / (λ_1 + λ_2 + ... + λ_d)
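Using the eigenvalues array from the sketch above (a hypothetical check, not part of the original notebook), this ratio can be computed directly and should match sklearn's explained_variance_ratio_:

# Each component's share of the total variance: λ_j / Σ λ_k
manual_ratio = eigenvalues / eigenvalues.sum()
print(manual_ratio)  # should agree with pca.explained_variance_ratio_ below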
pca=PCA()
x_pca = pca.fit_transform(x)
plt.figure(figsize=(10,10))
plt.ylabel('Explained Variance')
plt.xlabel('Principal Components')
plt.plot(np.cumsum(pca.explained_variance_ratio_), 'ro-')
plt.grid()
def show_components(num_component):
    print(f"Percent explained variance: {pca.explained_variance_ratio_[num_component-1]*100:.4f}", "%")
show_components(1)
show_components(2)
Percent explained variance: 85.0085 %
Percent explained variance: 14.1860 %
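The first two components together explain about 99.2% of the variance, which is why two components are kept below. As an aside, sklearn's PCA can also choose the number of components automatically from a variance threshold; this snippet is an illustration, not something the original analysis does:

# Keep however many components are needed to explain 95% of the variance
pca_auto = PCA(n_components=0.95)
x_auto = pca_auto.fit_transform(x)
print(pca_auto.n_components_)  # number of components actually kept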
pca = PCA(n_components=2)
pca_result = pca.fit_transform(x)
plt.figure(figsize=(10,10))
plt.plot(range(2), pca.explained_variance_ratio_)
plt.plot(range(2), np.cumsum(pca.explained_variance_ratio_))
plt.title("Component-wise and Cumulative Explained Variance")
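Tying back to the objective stated at the start, the effect of PCA on accuracy can be measured by training the same classifier on the original features and on the two PCA components. A minimal sketch using the imports above (the split parameters and random seeds here are my own assumptions):

# Compare a DecisionTreeClassifier on original vs. PCA-reduced features
for features, label in [(x, "original features"), (pca_result, "2 PCA components")]:
    x_train, x_test, y_train, y_test = train_test_split(
        features, y, test_size=0.2, random_state=42)
    clf = DecisionTreeClassifier(random_state=42).fit(x_train, y_train)
    print(label, accuracy_score(y_test, clf.predict(x_test)))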