* As the number of input variables grows, several problems arise: the data becomes hard to visualize, and training and prediction become slower and more expensive. My main objective is to show how PCA mitigates these problems and how the improvement shows up in the accuracy score.
For a more practical explanation, I am going to use the Cardiovascular Disease dataset.
!pip install -U scikit-learn
## Necessary libraries imported
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import RobustScaler
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv(r'C:\Users\Arda\Downloads\cardio_train.csv', sep=";" )
data.head()
|   | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
data.dtypes
id               int64
age              int64
gender           int64
height           int64
weight         float64
ap_hi            int64
ap_lo            int64
cholesterol      int64
gluc             int64
smoke            int64
alco             int64
active           int64
cardio           int64
dtype: object
data.isnull().sum()
id             0
age            0
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
dtype: int64
data.describe()
|   | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 | 70000.000000 |
| mean | 49972.419900 | 19468.865814 | 1.349571 | 164.359229 | 74.205690 | 128.817286 | 96.630414 | 1.366871 | 1.226457 | 0.088129 | 0.053771 | 0.803729 | 0.499700 |
| std | 28851.302323 | 2467.251667 | 0.476838 | 8.210126 | 14.395757 | 154.011419 | 188.472530 | 0.680250 | 0.572270 | 0.283484 | 0.225568 | 0.397179 | 0.500003 |
| min | 0.000000 | 10798.000000 | 1.000000 | 55.000000 | 10.000000 | -150.000000 | -70.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 25006.750000 | 17664.000000 | 1.000000 | 159.000000 | 65.000000 | 120.000000 | 80.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 50% | 50001.500000 | 19703.000000 | 1.000000 | 165.000000 | 72.000000 | 120.000000 | 80.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 75% | 74889.250000 | 21327.000000 | 2.000000 | 170.000000 | 82.000000 | 140.000000 | 90.000000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| max | 99999.000000 | 23713.000000 | 2.000000 | 250.000000 | 200.000000 | 16020.000000 | 11000.000000 | 3.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
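The summary statistics already hint at data-quality issues: ap_hi (systolic pressure) ranges from -150 to 16020 and ap_lo (diastolic) from -70 to 11000, which is physiologically impossible. A quick way to count how many rows fall outside a plausible range (the bounds below are my own rough assumption, not part of the original analysis):

# Rough plausibility bounds for blood pressure (assumed for illustration only)
implausible = ((data['ap_hi'] < 60) | (data['ap_hi'] > 250) |
               (data['ap_lo'] < 30) | (data['ap_lo'] > 200))
print(f"Rows with implausible blood pressure: {implausible.sum()}")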
sns.set(style='white',font_scale=1.5, rc={'figure.figsize':(20,20)})
ax=data.hist(bins=20,color='blue')
x=data.drop(['cardio'],axis=1)
y=data['cardio']
scaler=RobustScaler()
x = scaler.fit_transform(x)
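RobustScaler is a good fit here because of the extreme outliers seen above: it centers each feature on its median and divides by the interquartile range (IQR), so a handful of absurd blood-pressure values does not dominate the scale. A minimal NumPy sketch of the same transformation, shown only for intuition (the sklearn call above is what the analysis actually uses):

# What RobustScaler computes, feature by feature:
# (value - median) / (75th percentile - 25th percentile)
def robust_scale(column):
    median = np.median(column)
    q1, q3 = np.percentile(column, [25, 75])
    return (column - median) / (q3 - q1)

# Example: scale the weight column by hand
scaled_weight = robust_scale(data['weight'].to_numpy())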
Dimensionality reduction is a way to reduce the complexity of a model and avoid overfitting. There are two main categories of dimensionality reduction: feature selection and feature extraction. Via feature selection, we select a subset of the original features, whereas in feature extraction, we derive information from the feature set to construct a new feature subspace.
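To make the distinction concrete, here is a small side-by-side sketch; it is illustrative only and not part of the original analysis. Feature selection keeps a subset of the original columns, while feature extraction (like PCA below) builds entirely new features:

from sklearn.feature_selection import SelectKBest, f_classif

# Feature selection: keep the 5 original columns most related to y
selected = SelectKBest(score_func=f_classif, k=5).fit_transform(x, y)

# Feature extraction: build 5 new features as linear combinations of all columns
extracted = PCA(n_components=5).fit_transform(x)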
The Principal Component Analysis (PCA) algorithm is used to compress a dataset onto a lower-dimensional feature subspace with the goal of maintaining most of the relevant information.
PCA helps us to identify patterns in data based on the correlation between features. In a nutshell, PCA aims to find the directions of maximum variance in high-dimensional data and projects it onto a new subspace with equal or fewer dimensions than the original one.
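Under the hood, those "directions of maximum variance" are the eigenvectors of the data's covariance matrix, and the eigenvalues are the variances along them. A minimal NumPy sketch of PCA from scratch, for intuition only (the sklearn class is what is used below):

# PCA by hand: eigendecomposition of the covariance matrix
x_centered = x - x.mean(axis=0)                  # PCA assumes mean-centered data
cov = np.cov(x_centered, rowvar=False)           # covariance matrix of the features
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: for symmetric matrices

# Sort components by descending eigenvalue (variance explained)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the data onto the top-2 principal components
x_projected = x_centered @ eigenvectors[:, :2]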
The explained variance ratio of an eigenvalue λ_j is simply the ratio of λ_j to the total sum of the eigenvalues:

explained variance ratio = λ_j / (λ_1 + λ_2 + ... + λ_d)
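Using the eigenvalues array from the sketch above (a hypothetical check, not part of the original notebook), this ratio can be computed directly and should match sklearn's explained_variance_ratio_:

# Each component's share of the total variance: λ_j / Σ λ_k
manual_ratio = eigenvalues / eigenvalues.sum()
print(manual_ratio)  # should agree with pca.explained_variance_ratio_ below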
pca=PCA()
x_pca = pca.fit_transform(x)
plt.figure(figsize=(10,10))
plt.ylabel('Explained Variance')
plt.xlabel('Principal Components')
plt.plot(np.cumsum(pca.explained_variance_ratio_), 'ro-')
plt.grid()
def show_components(num_component):
    print(f"Percent explained variance: {pca.explained_variance_ratio_[num_component-1]*100:.4f}", "%")
show_components(1)
show_components(2)
Percent explained variance: 85.0085 %
Percent explained variance: 14.1860 %
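The first two components together explain about 99.2% of the variance, which is why two components are kept below. As an aside, sklearn's PCA can also choose the number of components automatically from a variance threshold; this snippet is an illustration, not something the original analysis does:

# Keep however many components are needed to explain 95% of the variance
pca_auto = PCA(n_components=0.95)
x_auto = pca_auto.fit_transform(x)
print(pca_auto.n_components_)  # number of components actually kept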
pca = PCA(n_components=2)
pca_result = pca.fit_transform(x)
plt.figure(figsize=(10,10))
plt.plot(range(2), pca.explained_variance_ratio_)
plt.plot(range(2), np.cumsum(pca.explained_variance_ratio_))
plt.title("Component-wise and Cumulative Explained Variance")
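Tying back to the objective stated at the start, the effect of PCA on accuracy can be measured by training the same classifier on the original features and on the two PCA components. A minimal sketch using the imports above (the split parameters and random seeds here are my own assumptions):

# Compare a DecisionTreeClassifier on original vs. PCA-reduced features
for features, label in [(x, "original features"), (pca_result, "2 PCA components")]:
    x_train, x_test, y_train, y_test = train_test_split(
        features, y, test_size=0.2, random_state=42)
    clf = DecisionTreeClassifier(random_state=42).fit(x_train, y_train)
    print(label, accuracy_score(y_test, clf.predict(x_test)))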